TWI807941B - Resource allocation method in downlink sparse code multiple access system based on artificial intelligence

Info

Publication number: TWI807941B
Application number: TW111128419A
Authority: TW
Inventors: 陳曉華; 周廣誌
Original assignee: 國立成功大學
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2023-07-01
Also published as: TW202406391A

Abstract

In a resource allocation method, a base station allocates N subcarriers to K user terminals (UEs) to obtain N × K current subcarrier allocation results, obtains N × K current allocated powers, and obtains multiple action values from an action reinforcement learning network. The base station determines whether the action values are all less than or equal to 0. When the determination result is no, the base station selects a target allocation action to obtain a plurality of updated allocated powers, generates and stores a piece of training data, and then trains at least one reinforcement learning network, including the action reinforcement learning network, according to a plurality of pieces of target training data. These operations are repeated until the action values are all less than or equal to 0. When the determination result is yes, the base station calculates a candidate spectral efficiency according to the current allocated powers. The above operations are repeated to obtain P candidate spectral efficiencies, from which a target spectral efficiency is selected.

Description

Resource Allocation Method for Downlink Sparse Code Multiple Access System Based on Artificial Intelligence Algorithm

The present invention relates to a resource allocation method, and in particular to a resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm.

In existing orthogonal multiple access (OMA), each user can use only one specific resource block, such as a frequency band, a time slot, or an orthogonal spreading code. With the rapid development of mobile communication, however, the demand for spectral efficiency grows day by day, and OMA can no longer meet the needs of today's users.

To meet the demand for higher spectral efficiency, non-orthogonal multiple access (NOMA) technologies have been proposed, such as pattern division multiple access (PDMA) and sparse code multiple access (SCMA).

MUST is a single-carrier NOMA technology. In MUST, superposition in the power domain, code domain, or constellation domain allows multiple users to share the same resource block, which raises the spectral efficiency and the number of connected users. When a MUST system transmits, superposition coding superimposes the signals of multiple users with different power allocations, and at the receiving end successive interference cancellation (SIC) separates the superimposed signals. The larger the energy difference between the user signals, the easier the signals are to distinguish and the better the error rate. Allocating signal power reasonably is therefore particularly important for a MUST system.

PDMA is a hybrid-domain NOMA technology. When a PDMA system transmits, superposition coding superimposes the signals of multiple users with different power allocations, and in addition a pattern matrix design maps the same coded bits of a user onto different subcarriers, thereby achieving diversity and multiplexing. The receiving end likewise uses SIC to separate the superimposed signals of the users.

Unlike PDMA, SCMA is a code-domain NOMA technology that maps resources to different codebooks, with each user assigned its own codebook. The SCMA receiver uses the message passing algorithm (MPA) for multi-user detection (MUD) and performs well even when the system is heavily overloaded. MPA is an iterative algorithm that computes marginal probability distributions on a factor graph: messages are passed back and forth between resource nodes (RNs) and variable nodes (VNs), and the result is then decoded by the channel decoder of the corresponding variable node.

In detail, the SCMA encoder can be expressed as $x_k = V_k\, g_k(b_k)$, where $x_k$ is the SCMA codebook of the $k$-th user, $b_k$ is the bit stream of the $k$-th user, $g_k$ denotes the high-order modulation process that maps the user's bit stream to a multi-dimensional constellation $c_k$, and $V_k$ is the binary mapping matrix of the $k$-th user. $V_k$ is related to the factor graph matrix $F_{N \times K}$ as follows:

$$F_{N \times K} = [f_1, f_2, \ldots, f_k, \ldots, f_K], \qquad f_k = \mathrm{diag}(V_k V_k^{T}),$$

where K is the number of users, N is the number of resource blocks (or subcarriers), and diag denotes the matrix diagonal. Each distinct $f_k$ in $F_{N \times K}$ corresponds to using the codebook $V_k\, g_k(b_k)$ associated with that $f_k$. Because the resource blocks in the factor graph matrix $F_{N \times K}$ are subcarriers, allocating different subcarriers to a user is equivalent to allocating a different $f_k$ (and $V_k$), that is, a different codebook. Allocating signal power and subcarriers reasonably is therefore particularly important for an SCMA system.
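As a minimal sketch of how the mapping matrices relate to the factor graph matrix, the Python below assumes the standard SCMA relation $f_k = \mathrm{diag}(V_k V_k^{T})$, which this text does not spell out. The function names and the two example mapping matrices are hypothetical and serve only as illustration.

```python
def factor_graph_column(V):
    """Return diag(V V^T) for a binary N x d mapping matrix V (list of rows)."""
    # (V V^T)[n][n] is the squared norm of row n of V.
    return [sum(x * x for x in row) for row in V]


def factor_graph_matrix(mapping_matrices):
    """Place the columns f_k side by side to form the N x K matrix F."""
    columns = [factor_graph_column(V) for V in mapping_matrices]
    n_rows = len(columns[0])
    return [[col[n] for col in columns] for n in range(n_rows)]


# Hypothetical example with N = 4 resources and two users.
V1 = [[1, 0], [0, 1], [0, 0], [0, 0]]  # user 1 occupies subcarriers 1 and 2
V2 = [[0, 0], [1, 0], [0, 0], [0, 1]]  # user 2 occupies subcarriers 2 and 4
F = factor_graph_matrix([V1, V2])
```

Each column of `F` marks the subcarriers a user occupies, which is why choosing subcarriers for a user is the same as choosing its codebook.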

However, existing SCMA systems cannot perform optimal power and subcarrier allocation according to the dynamic scenario of the system.

Therefore, the object of the present invention is to provide a resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm that performs optimal power and subcarrier allocation according to the dynamic scenario of the system.

Accordingly, the resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm of the present invention is implemented by a base station. The base station is communicatively connected to K user terminals via a wireless channel and stores multiple subcarrier allocation actions and multiple power allocation actions, where K > 1 and N > 1. The method includes steps (A), (B), (C), (D), (E), (F), (G), (H), (I), (J), (K), (L), and (M).

In step (A), the base station allocates the subcarriers to the user terminals to obtain N × K current subcarrier allocation results indicating whether each user terminal is allocated each subcarrier. The current subcarrier allocation results $j_{n,k,t}$ satisfy the following conditions:

$$1 \le \sum_{n=1}^{N} j_{n,k,t} \le L, \qquad \sum_{k=1}^{K} j_{n,k,t} \le d_f,$$

where $j_{n,k,t} \in \{0,1\}$ is the current subcarrier allocation result indicating whether the $k$-th user terminal is allocated the $n$-th subcarrier at the current time $t$, $n \in \{1,2,\ldots,N\}$, and $k \in \{1,2,\ldots,K\}$. $j_{n,k,t} = 1$ means that the $k$-th user terminal is allocated the $n$-th subcarrier at the current time $t$, and $j_{n,k,t} = 0$ means that it is not. $L$ is the maximum number of subcarriers allocated to each user terminal, and $d_f$ is the maximum number of user terminals on each subcarrier.
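The conditions of step (A) can be checked mechanically. The sketch below assumes they read as "each user terminal holds between 1 and L subcarriers" and "each subcarrier carries at most d_f user terminals", as the definitions of L and d_f suggest. The function name and the example matrix are hypothetical.

```python
def allocation_is_valid(j, L, d_f):
    """Check the step (A) conditions on an N x K 0/1 allocation matrix j.

    j[n][k] == 1 means user terminal k is allocated subcarrier n.
    """
    n_sub = len(j)
    n_users = len(j[0])
    # Every entry must be 0 or 1.
    if any(x not in (0, 1) for row in j for x in row):
        return False
    # Each user terminal holds at least 1 and at most L subcarriers.
    for k in range(n_users):
        per_user = sum(j[n][k] for n in range(n_sub))
        if not 1 <= per_user <= L:
            return False
    # Each subcarrier carries at most d_f user terminals.
    return all(sum(row) <= d_f for row in j)


# Hypothetical allocation for N = 4 subcarriers and K = 6 user terminals.
example = [
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [0, 0, 1, 0, 1, 1],
]
```

With `L=2` and `d_f=3` the example passes; tightening `d_f` to 2 makes it fail because every subcarrier carries three user terminals.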

In step (B), the base station obtains, according to the current subcarrier allocation results, N × K current allocated powers respectively corresponding to the current subcarrier allocation results.

In step (C), the base station inputs the subcarrier allocation actions, the power allocation actions, the current subcarrier allocation results, and the current allocated powers into an action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of action values respectively corresponding to the power allocation actions and the subcarrier allocation actions.

In step (D), the base station determines whether the action values are all less than or equal to 0.

In step (E), when it is determined that at least one of the action values is greater than 0, the base station selects a target allocation action from the subcarrier allocation actions and the power allocation actions.

In step (F), the base station obtains, according to the current subcarrier allocation results, the current allocated powers, and the target allocation action, a plurality of updated subcarrier allocation results respectively corresponding to the current subcarrier allocation results and a plurality of updated allocated powers respectively corresponding to the current allocated powers.

In step (G), the base station calculates a reward value according to the current allocated powers and the updated allocated powers.

In step (H), the base station generates and stores a piece of training data including the current subcarrier allocation results, the current allocated powers, the target allocation action, the reward value, the updated subcarrier allocation results, and the updated allocated powers.

In step (I), the base station selects multiple pieces of target training data from the stored training data and trains at least one reinforcement learning network according to the target training data. The at least one reinforcement learning network includes the action reinforcement learning network.

In step (J), the base station takes the updated subcarrier allocation results and the updated allocated powers as the current subcarrier allocation results and the current allocated powers, respectively, and repeats steps (C) to (I) until the action values are all less than or equal to 0.

In step (K), when it is determined that the action values are all less than or equal to 0, the base station calculates a candidate spectral efficiency according to the current allocated powers, and stores the current subcarrier allocation results, the current allocated powers, and the candidate spectral efficiency.

In step (L), steps (A) to (K) are repeated P times to obtain P candidate spectral efficiencies, where P > 1.

In step (M), the base station obtains the highest of the candidate spectral efficiencies as a target spectral efficiency.
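The control flow of steps (A) to (M) can be sketched as two nested loops. The Python below is only a skeleton under stated assumptions: every callable passed in (`init_state`, `q_values`, `select_action`, `apply_action`, `spectral_efficiency`, `train`) is a hypothetical placeholder for logic the text describes elsewhere, and the inner loop ends only if the action values eventually all become non-positive.

```python
def run_allocation(P, init_state, q_values, select_action, apply_action,
                   spectral_efficiency, train=None):
    """Run P episodes and return the (spectral efficiency, state) pair with the highest efficiency."""
    candidates = []
    for _ in range(P):                                   # step (L)
        state = init_state()                             # steps (A)-(B)
        while True:
            q = q_values(state)                          # step (C)
            if all(value <= 0 for value in q):           # step (D)
                break
            action = select_action(q)                    # step (E)
            next_state, reward = apply_action(state, action)  # steps (F)-(G)
            if train is not None:
                # steps (H)-(I): store the transition and train on stored data
                train((state, action, reward, next_state))
            state = next_state                           # step (J)
        candidates.append((spectral_efficiency(state), state))  # step (K)
    return max(candidates, key=lambda c: c[0])           # step (M)


# Toy usage: the state is an integer and action values turn non-positive at 3.
best = run_allocation(
    P=2,
    init_state=lambda: 0,
    q_values=lambda s: [1.0] if s < 3 else [0.0],
    select_action=lambda q: 0,
    apply_action=lambda s, a: (s + 1, 1.0),
    spectral_efficiency=lambda s: float(s),
)
```

Passing the pieces in as callables keeps the loop structure visible without committing to any particular network or reward definition.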

The effect of the present invention is as follows: the base station uses the action reinforcement learning network to record and learn in different scenarios so as to obtain the best allocation action with the largest reward value, further obtains the candidate spectral efficiencies, and then obtains the highest of them as the target spectral efficiency. The subcarrier allocation and power allocation corresponding to the target spectral efficiency are the optimal allocation.

11: base station

12: user terminal

100: wireless channel

21~34: steps

281~289: sub-steps

301~303: sub-steps

321~323: sub-steps

Other features and effects of the present invention will be clearly presented in the embodiments described with reference to the drawings, in which: FIG. 1 is a block diagram illustrating a base station used to implement an embodiment of the resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm of the present invention; FIG. 2 is a flowchart illustrating the embodiment of the resource allocation method; FIG. 3 is a flowchart illustrating the sub-steps of step 28 of FIG. 2; FIG. 4 is a flowchart illustrating the sub-steps of step 30 of FIG. 2; and FIG. 5 is a flowchart illustrating the sub-steps of step 32 of FIG. 2.

Before the present invention is described in detail, it should be noted that in the following description similar elements are denoted by the same reference numerals.

Referring to FIG. 1, an embodiment of the resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm of the present invention is performed by a base station 11 that supports downlink power-domain sparse code multiple access. The base station 11 is communicatively connected to K user terminals 12 via a wireless channel 100 and superimposes the signals of the user terminals 12 on N subcarriers by using a different power level for each user terminal 12, where K > 1 and N > 1. Note that in this embodiment the base station 11 is, for example, a single-antenna base station (BS), and the user terminals 12 are, for example, smartphones, but the invention is not limited thereto.

The base station 11 stores multiple subcarrier allocation actions and multiple power allocation actions.

FIGS. 1 and 2 show the embodiment of the resource allocation method. The steps of the embodiment shown in FIG. 2 are detailed below.

In step 21, the base station 11 initializes a plurality of reinforcement learning networks.

Note that in this embodiment the reinforcement learning networks are, for example, Q-learning networks, and there are two of them: an update network and a target network. Each reinforcement learning network includes, for example, a fully connected layer with fifty nodes; the activation function is, for example, the rectified linear unit (ReLU); the learning algorithm is set to, for example, adaptive moment estimation (Adam); and the loss function is set to, for example, the mean-square error (MSE). In other embodiments, the reinforcement learning networks may include, for example, a lookup table (Q-table); the learning algorithm may be stochastic gradient descent (SGD), momentum gradient descent (Momentum), or Adagrad; and the loss function may be a squared loss function or an absolute-value loss function. Furthermore, the type of reinforcement learning network is not limited to a Q-learning network, and the base station 11 may also initialize only one reinforcement learning network, but the invention is not limited thereto.
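To make the network shape concrete, here is a dependency-free Python sketch of a Q-network with one fully connected hidden layer of fifty ReLU nodes and an MSE loss. The linear output layer, the weight initialization range, the input size, and the choice to start the target network as a copy of the update network are assumptions for illustration, not details given in the text. The Adam optimizer step is omitted.

```python
import copy
import random


def init_network(n_inputs, n_actions, hidden=50, seed=0):
    """Build one hidden layer (50 nodes by default) and a linear output layer."""
    rng = random.Random(seed)

    def layer(rows, cols):
        # Small random weights and zero biases (assumed initialization).
        return {"w": [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(rows)],
                "b": [0.0] * rows}

    return {"hidden": layer(hidden, n_inputs), "out": layer(n_actions, hidden)}


def affine(layer, x):
    """Compute W x + b for one layer."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(layer["w"], layer["b"])]


def q_forward(net, state):
    """Return one Q value per action for a flat state vector."""
    h = [max(0.0, z) for z in affine(net["hidden"], state)]  # ReLU
    return affine(net["out"], h)


def mse(predicted, target):
    """Mean-square error between two equally long lists."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


update_net = init_network(n_inputs=8, n_actions=5)
target_net = copy.deepcopy(update_net)  # assumed: target starts as a copy
```

In practice a tensor library would replace these lists; the plain-Python version is used here only so the structure can be read without extra dependencies.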

In step 22, the base station 11 determines whether P cycles have been completed. When the base station 11 determines that P cycles have not been completed, the flow proceeds to step 23; when the base station 11 determines that P cycles have been completed, the flow proceeds to step 34. Note that in this embodiment the base station counts the cycles with a cycle counter (not shown), where P = 20000, but the invention is not limited thereto.

In step 23, the base station 11 allocates the subcarriers to the user terminals 12 to obtain N × K current subcarrier allocation results indicating whether each user terminal 12 is allocated each subcarrier.

Note that the base station 11 allocates the subcarriers to the user terminals 12 according to a factor graph matrix $J_{SCMA} = (j_{n,k,t})_{N \times K}$, where

$$1 \le \sum_{n=1}^{N} j_{n,k,t} \le L, \qquad \sum_{k=1}^{K} j_{n,k,t} \le d_f, \qquad j_{n,k,t} \in \{0,1\},$$

with $n \in \{1,2,\ldots,N\}$ and $k \in \{1,2,\ldots,K\}$. $j_{n,k,t} = 1$ means that the $k$-th user terminal is allocated the $n$-th subcarrier at the current time $t$, and $j_{n,k,t} = 0$ means that it is not. $L$ is the maximum number of subcarriers allocated to each user terminal, and $d_f$ is the maximum number of user terminals on each subcarrier. The maximum number of subcarriers $L$ allocated to each user terminal 12 cannot exceed half of the total number of subcarriers, and the maximum number of user terminals $d_f$ on each subcarrier can be read from the factor graph matrix $J_{SCMA}$, which can be expressed as:

$$J_{SCMA} = \begin{bmatrix} j_{1,1,t} & \cdots & j_{1,K,t} \\ \vdots & \ddots & \vdots \\ j_{N,1,t} & \cdots & j_{N,K,t} \end{bmatrix}$$

Note further that the number of subcarriers allocated to each user terminal 12 and the number of user terminals on each subcarrier are determined by the codebook used by each user terminal 12. Assume K = 6 and N = 4, and that the length of the codebook used by each user terminal 12 equals the number of subcarriers. The number of non-zero elements $N_c$ in each codebook should not exceed 1/2 of the codebook length, that is, $N_c \in \{1, 2\}$. The maximum number of subcarriers $L$ allocated to each user terminal 12 satisfies $1 \le L \le 2$, and the number of user terminals $U$ on each subcarrier satisfies $U \le 3$. When $N_c = 2$, the number of available codebooks is $\binom{4}{2} = 6$, which guarantees that each user terminal is assigned one codebook. In this case the codebook of each user terminal has 2 non-zero elements and 2 zero elements; that is, each user may occupy 2 subcarriers, and 3 non-zero elements collide on each subcarrier.
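The counting in this example can be reproduced directly. The sketch below enumerates every way to place $N_c = 2$ non-zero elements on N = 4 subcarriers and builds one possible 0/1 factor graph matrix from them. The assignment of patterns to user terminals (the column order) is an assumption, since the text does not fix it.

```python
from itertools import combinations
from math import comb

N, K, N_c = 4, 6, 2

# Each way of placing N_c non-zero elements on N subcarriers is one pattern.
patterns = list(combinations(range(N), N_c))
assert len(patterns) == comb(N, N_c)

# Give pattern k to user terminal k (assumed order) to form an N x K matrix.
J = [[1 if n in patterns[k] else 0 for k in range(K)] for n in range(N)]

subcarriers_per_user = [sum(J[n][k] for n in range(N)) for k in range(K)]
users_per_subcarrier = [sum(row) for row in J]
```

Because there are exactly as many patterns as user terminals, every user terminal gets a distinct pattern, each occupying two subcarriers, and every subcarrier ends up shared by three user terminals.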

In step 24, the base station 11 obtains, according to the current subcarrier allocation results, N × K current allocated powers respectively corresponding to the current subcarrier allocation results.

The current allocated powers $v_{n,k,t}$ satisfy conditions that include $0 \le v_{n,k,t} \le 1$, together with two further conditions (one of them bounded by 1) whose expressions are not recoverable from this text. Here $n \in \{1,2,\ldots,N\}$ and $k \in \{1,2,\ldots,K\}$; $j_{n,k,t} \in \{0,1\}$ is the subcarrier allocation result indicating whether the $k$-th user terminal 12 is allocated the $n$-th subcarrier at the current time $t$; and $v_{n,k,t}$ is the coefficient of the current allocated power allocated at the current time $t$ to the $k$-th ordered user terminal 12 on the $n$-th subcarrier.

In step 25, the base station 11 inputs the subcarrier allocation actions, the power allocation actions, the current subcarrier allocation results, and the current allocated powers into an action reinforcement learning network, which is one of the reinforcement learning networks, so that the action reinforcement learning network outputs a plurality of action values respectively corresponding to the power allocation actions and the subcarrier allocation actions.

Note that in this embodiment the action reinforcement learning network is the update network and the action values are Q values. Each subcarrier allocation action adjusts only one subcarrier of one user terminal 12 at a time and can be written as $n_{n,k,t} \in \{0, 1\}$. $n_{n,k,t} = 1$ means that the $n$-th subcarrier is allocated to the $k$-th user terminal 12 at the current time $t$; if the $n$-th subcarrier was already allocated to the $k$-th user terminal 12 at the previous time, the subcarrier allocation is left unchanged. $n_{n,k,t} = 0$ means that the $n$-th subcarrier is not allocated to the $k$-th user terminal 12 at the current time $t$; if the $n$-th subcarrier was already not allocated to the $k$-th user terminal 12 at the previous time, the subcarrier allocation is left unchanged. The number of subcarrier allocation actions is 2 × N × K.

In addition, each power allocation action adjusts only one power coefficient at a time and can be written as $\delta_{n,k,t} \in \{\delta, 0, -\delta\}$ with $0 < \delta < 1$. $\delta_{n,k,t} = \delta$ increases the power coefficient $v_{n,k,t}$ by $\delta$, $\delta_{n,k,t} = 0$ leaves $v_{n,k,t}$ unchanged, and $\delta_{n,k,t} = -\delta$ decreases $v_{n,k,t}$ by $\delta$. The number of power allocation actions is 3 × N × K, but the invention is not limited thereto.
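The action counts follow from enumerating every (subcarrier, user terminal) pair with each possible adjustment. A minimal sketch is shown below; the tuple encoding of an action and the value `delta=0.05` are hypothetical choices made for illustration.

```python
def build_actions(N, K, delta):
    """Enumerate the discrete subcarrier and power allocation actions."""
    # Two subcarrier actions per (n, k) pair: set the allocation to 0 or to 1.
    subcarrier_actions = [("subcarrier", n, k, value)
                          for n in range(N) for k in range(K)
                          for value in (0, 1)]
    # Three power actions per (n, k) pair: raise by delta, keep, lower by delta.
    power_actions = [("power", n, k, step)
                     for n in range(N) for k in range(K)
                     for step in (delta, 0.0, -delta)]
    return subcarrier_actions, power_actions


sub_actions, pow_actions = build_actions(N=4, K=6, delta=0.05)
```

For N = 4 and K = 6 this yields 48 subcarrier allocation actions and 72 power allocation actions, so the network would output 120 action values in total.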

In step 26, the base station 11 determines whether the action values are all less than or equal to 0. When the base station 11 determines that at least one of the action values is greater than 0, the flow proceeds to step 27; when the base station 11 determines that the action values are all less than or equal to 0, the flow proceeds to step 33.

Note in particular that step 26 of this embodiment, in determining whether the action values are all less than or equal to 0, looks only at the action values output by the update network applicable to the current overload rate and does not draw on the output values of the target network under the current overload rate. Therefore, in step 25, the base station 11 inputs the subcarrier allocation actions, the power allocation actions, and the current allocated powers only to the update network.

Note further that if the action values are all less than or equal to 0, any power allocation action taken in the current state is considered to lower the long-term expected reward. Since a higher reward is better, the current power allocation is judged to be the optimal result; no further power allocation action is performed, and the flow proceeds to step 33.

In step 27, the base station 11 selects a target allocation action from the subcarrier allocation actions and the power allocation actions. The target allocation action is selected at random with probability $P_1$, and is the action whose action value is the highest among the action values with probability $P_2$, where $P_1 + P_2 = 1$ and $P_1 < P_2$. Note that in this embodiment $P_1$ is 10% and $P_2$ is 90%, but the invention is not limited thereto. In other embodiments, the target allocation action may instead be selected only according to the current state, or may always be the action corresponding to the highest action value.
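This selection rule is what is commonly called an epsilon-greedy policy. A minimal sketch follows, assuming the action values arrive as a flat list and that the returned index identifies the action; the function name and the `rng` parameter are hypothetical.

```python
import random


def select_target_action(action_values, p_random=0.10, rng=random):
    """Return the index of the chosen action.

    With probability p_random (P1) pick uniformly at random; otherwise,
    with probability 1 - p_random (P2), pick the highest action value.
    """
    if rng.random() < p_random:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=action_values.__getitem__)


greedy_choice = select_target_action([0.1, 0.9, -0.3], p_random=0.0)
```

Keeping a small random share lets the base station occasionally try actions whose current value estimate is low, while most selections exploit what has already been learned.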

In step 28, the base station 11 obtains, according to the current subcarrier allocation results, the current allocated powers, and the target allocation action, a plurality of updated subcarrier allocation results respectively corresponding to the current subcarrier allocation results and a plurality of updated allocated powers respectively corresponding to the current allocated powers.

Referring also to FIG. 3, step 28 includes sub-steps 281 to 289, which are described below.

In sub-step 281, the base station 11 determines whether the target allocation action is a subcarrier allocation action. When the base station 11 determines that the target allocation action is a subcarrier allocation action, the flow proceeds to sub-step 282; when the base station 11 determines that it is not, meaning that the target allocation action is a power allocation action, the flow proceeds to sub-step 286.

In sub-step 282, the base station 11 obtains, according to the target allocation action, N × K replacement subcarrier allocation results respectively corresponding to the current subcarrier allocation results.

In sub-step 283, the base station 11 determines whether the replacement subcarrier allocation results satisfy a plurality of subcarrier allocation conditions. When the base station 11 determines that the replacement subcarrier allocation results fail to satisfy one of the subcarrier allocation conditions, the flow proceeds to sub-step 284; when the base station 11 determines that the replacement subcarrier allocation results satisfy the subcarrier allocation conditions, the flow proceeds to sub-step 285.

Note that the subcarrier allocation conditions include:

$$1 \le \sum_{n=1}^{N} j_{n,k,t+1} \le L, \qquad \sum_{k=1}^{K} j_{n,k,t+1} \le d_f,$$

where $j_{n,k,t+1} \in \{0,1\}$ is the replacement subcarrier allocation result at the next time $t+1$ for the $k$-th ordered user terminal 12 on the $n$-th subcarrier. $j_{n,k,t+1} = 1$ means that the $k$-th user terminal 12 is allocated the $n$-th subcarrier at the next time $t+1$, and $j_{n,k,t+1} = 0$ means that it is not. $L$ is the maximum number of subcarriers allocated to each user terminal, and $d_f$ is the maximum number of user terminals on each subcarrier, but the invention is not limited thereto.

In sub-step 284, the base station 11 takes the current subcarrier allocation results and the current allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively; that is, the subcarrier allocation results and the allocated powers remain unchanged.

In sub-step 285, the base station 11 takes the replacement subcarrier allocation results as the updated subcarrier allocation results, and obtains the updated allocated powers according to the updated subcarrier allocation results.

In sub-step 286, the base station 11 applies the target allocation action to the current allocated powers to obtain a plurality of replacement allocated powers respectively corresponding to the current allocated powers.

Note in particular that if the target allocation action is a power allocation action $\delta_{n,k,t}$, whether it was selected according to the current state or is the action with the highest action value, the corresponding current subcarrier allocation result must be $j_{n,k,t} = 1$. Likewise, for a coefficient $v_{n,k'',t}$ that is to be increased by $\delta$ or a coefficient $v_{n,k',t}$ that is to be decreased by $\delta$, the corresponding current subcarrier allocation results must satisfy $j_{n,k'',t} = 1$ and $j_{n,k',t} = 1$.

In sub-step 287, the base station 11 determines whether the replacement allocated powers satisfy a plurality of power allocation conditions. When the base station 11 determines that the replacement allocated powers fail to satisfy one of the power allocation conditions, the flow proceeds to sub-step 288; when the base station 11 determines that the replacement allocated powers satisfy the power allocation conditions, the flow proceeds to sub-step 289.

It is worth noting that the power allocation conditions include:

[expression not preserved in the source text] 1, $0 \le v_{n,k,t+1} \le 1$, and [formula not preserved in the source text],

where $k \in \{1,2,\ldots,K\}$, and $v_{n,k,t+1}$ is the coefficient of the replacement allocated power allocated at the next time $t+1$ to the $k$-th ordered user terminal 12 on the $n$-th subcarrier.

In sub-step 288, the base station 11 uses the current subcarrier allocation results and the current allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively; that is, the subcarrier allocation results and the allocated powers remain unchanged.

在子步驟289中，該基站11將該等當前子載波分配結果及該等替換分配功率分別作為該等更新子載波分配結果及該等更新分配功率。 In sub-step 289, the base station 11 uses the current subcarrier allocation results and the replacement allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively.
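The power-allocation branch (sub-steps 286 to 289) can be sketched in Python as follows. This is only an illustrative sketch, not the patented implementation: the function names and the encoding of an action as a tuple `(n, k_dec, k_inc, delta)` are assumptions made here, and only the legible bound $0 \le v \le 1$ is checked, because the other power allocation conditions are not recoverable from the text.

```python
from typing import List, Tuple

Matrix = List[List[float]]            # v[n][k]: power coefficient on subcarrier n for user k
Action = Tuple[int, int, int, float]  # (n, k_dec, k_inc, delta), a hypothetical encoding


def apply_power_action(v: Matrix, action: Action) -> Matrix:
    """Sub-step 286: move delta from user k_dec to user k_inc on subcarrier n."""
    n, k_dec, k_inc, delta = action
    new_v = [row[:] for row in v]  # copy so the current powers stay untouched
    new_v[n][k_dec] -= delta
    new_v[n][k_inc] += delta
    return new_v


def powers_valid(v: Matrix) -> bool:
    """Sub-step 287 (partial): check only the legible condition 0 <= v <= 1."""
    return all(0.0 <= x <= 1.0 for row in v for x in row)


def update_powers(v: Matrix, action: Action) -> Matrix:
    """Sub-steps 288/289: keep the replacement powers only if they are valid."""
    candidate = apply_power_action(v, action)
    return candidate if powers_valid(candidate) else v
```

A rejected action returns the current powers unchanged, which mirrors sub-step 288; any further conditions would be added to `powers_valid`.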

在步驟29中，該基站11判定一相關於當前該基站11通訊連接的用戶端12之數量與用戶端12的信號疊加到的子載波之數量的超載率是否為K/N。當該基站11判定出該超載率為K/N時，流程進行步驟30；而當該基站11判定出該超載率不為K/N時，則流程重複步驟21。 In step 29, the base station 11 determines whether an overload rate related to the number of UEs 12 currently connected to the base station 11 and the number of subcarriers on which the signals of the UEs 12 are superimposed is K / N . When the base station 11 determines that the overload rate is K / N , the process proceeds to step 30; and when the base station 11 determines that the overload rate is not K / N , the process repeats step 21.

It should be particularly noted that the overload rate is the number of user terminals 12 currently communicatively connected to the base station 11 divided by the number of subcarriers onto which the signals of the user terminals 12 are superimposed. In this embodiment, neither the number nor the locations of the user terminals communicating with the base station 11 are fixed, and the base station 11 adjusts the resource allocation scheme according to the number of user terminals 12. Therefore, when uplink pilot estimation shows that the overload rate is not K/N, that is, when the overload rate has changed (let the changed overload rate be K'/N, with K'>1 and K'≠K), the base station 11 stores historical reinforcement learning network information that includes the reinforcement learning networks and corresponds to the overload rate K/N, and determines whether target historical reinforcement learning network information corresponding to the overload rate K'/N is stored. If the target historical reinforcement learning network information is stored, the base station 11 loads it and proceeds to step 22; otherwise, the process returns to step 21, in which the base station 11 initializes the reinforcement learning networks to serve as the reinforcement learning networks applicable to the overload rate K'/N.
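The store-then-load-or-initialize behaviour on an overload-rate change can be sketched as below. This is a minimal sketch under stated assumptions: the class `NetworkStore`, its `switch` method, the use of a `(K, N)` tuple as the key, and the representation of network parameters as a plain dictionary are all hypothetical and are not taken from the patent.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[int, int]  # (K, N): users and subcarriers defining the overload rate K/N


class NetworkStore:
    """Keeps one set of network parameters per overload rate (hypothetical helper)."""

    def __init__(self, init_fn: Callable[[int, int], dict]):
        self._init_fn = init_fn            # builds freshly initialized parameters
        self._saved: Dict[Key, dict] = {}  # historical network information

    def switch(self, old_key: Key, current_params: dict, new_key: Key):
        """Store the networks for the old rate, then load or initialize the new ones.

        Returns (params, loaded); loaded is True when stored parameters were reused
        (continue at step 22) and False when new ones were created (back to step 21).
        """
        self._saved[old_key] = current_params
        if new_key in self._saved:
            return self._saved[new_key], True
        return self._init_fn(*new_key), False
```

Keying on the pair `(K, N)` rather than on the floating-point ratio avoids treating two different configurations with the same ratio as identical; whether that distinction matters depends on the deployment.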

在步驟30中，該基站11根據該等當前子載波分配結果、該等當前分配功率、該等更新子載波分配結果，及該等更新分配功率計算出一獎勵值。 In step 30, the base station 11 calculates a reward value according to the current subcarrier allocation results, the current allocated power, the updated subcarrier allocation results, and the updated allocated power.

搭配參閱圖4，步驟30包括子步驟301~303，以下說明步驟30所包括的子步驟。 Referring to FIG. 4 , step 30 includes sub-steps 301 to 303 , and the sub-steps included in step 30 are described below.

In sub-step 301, the base station 11 calculates a first spectral efficiency $f_t$ according to the current subcarrier allocation results and the current allocated powers. The first spectral efficiency $f_t$ is expressed as follows:

[formula for $f_t$ not preserved in the source text], $R_{n,k,t}=B_n\log_2(1+\rho_{n,k,t})$, [formula not preserved in the source text],

where $s_t=\{J_t,V_t\}$ is the set of the current subcarrier allocation results and the current allocated powers, $J_t=\{j_{1,1,t},\ldots,j_{n,k,t},\ldots,j_{N,K,t}\}$ denotes the current subcarrier allocation results, $V_t=\{v_{1,1,t},\ldots,v_{n,k,t},\ldots,v_{N,K,t}\}$ denotes the current allocated powers, $R_{n,k,t}$ is the channel capacity of the $k$-th user terminal 12 on the $n$-th subcarrier at the current time $t$, $B_n$ is the bandwidth of the $n$-th subcarrier, $\rho_{n,k,t}$ is the signal-to-interference-plus-noise ratio (SINR) of the $k$-th user terminal 12 on the $n$-th subcarrier at the current time $t$, $v_{n,j,t}$ is the coefficient of the current allocated power allocated at the current time $t$ to the $j$-th ordered user terminal 12 on the $n$-th subcarrier, $j_{n,j,t}$ is the current subcarrier allocation result indicating whether the $j$-th user terminal 12 is allocated the $n$-th subcarrier at the current time $t$, $P_T$ is the total power allocated by the base station 11, and $\sigma^2$ is the power of the additive white Gaussian noise (AWGN).

It should further be noted that, because no power is allocated to a user terminal 12 that is not allocated a given subcarrier, in step 30 the base station 11 can in practice calculate the reward value from the current allocated powers and the updated allocated powers alone. The signal-to-interference-plus-noise ratio $\rho_{n,k,t}$ of the $k$-th user terminal 12 on the $n$-th subcarrier at the current time $t$ can then also be expressed as: [formula not preserved in the source text].

在子步驟302中，該基站11根據該等更新子載波分配結果及該等更新分配功率計算出一第二頻譜效率f _t+1。該第二頻譜效率f _t+1算式與該第一頻譜效率f _t相同故在此不加以贅述。 In sub-step 302, the base station 11 calculates a second spectral efficiency f _{t +1} according to the updated subcarrier allocation results and the updated allocated power. The formula of the second spectral efficiency f _{t +1} is the same as that of the first spectral efficiency f _t and thus will not be repeated here.

In sub-step 303, the base station 11 calculates a reward value $r(s_t,a_t)$ according to the first spectral efficiency $f_t$ and the second spectral efficiency $f_{t+1}$, where $a_t$ is the target allocation action selected at the current time $t$.

值得注意的是，在本實施例中，該獎勵值為該第二頻譜效率減去該第一頻譜效率，即該獎勵值r(s _t,a _t)=f _t+1-f _t，但不以此為限。 It should be noted that, in this embodiment, the reward value is the second spectral efficiency minus the first spectral efficiency, that is, the reward value r ( s _t , a _t )= f _{t +1} − f _t , but not limited thereto.
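The two legible formulas of step 30, the per-user channel capacity $R_{n,k,t}=B_n\log_2(1+\rho_{n,k,t})$ and the reward $r(s_t,a_t)=f_{t+1}-f_t$, can be sketched as follows. The function names are illustrative assumptions, the SINR is assumed to be given as a linear ratio (not in dB), and the way the capacities are combined into a spectral efficiency is deliberately left out because that formula is not recoverable from the text.

```python
import math


def channel_capacity(bandwidth_hz: float, sinr: float) -> float:
    """R = B * log2(1 + rho), with rho a linear (not dB) SINR."""
    return bandwidth_hz * math.log2(1.0 + sinr)


def reward(f_t: float, f_next: float) -> float:
    """r(s_t, a_t) = f_{t+1} - f_t: positive when the action raised spectral efficiency."""
    return f_next - f_t
```

With this definition, an action that is rejected in sub-step 284 or 288 leaves the allocation unchanged and therefore yields a reward of zero.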

在步驟31中，該基站11產生並儲存一包括該等當前子載波分配結果、該等當前分配功率、該目標分配動作、該獎勵值、該等更新子載波分配結果，及該等更新分配功率的訓練資料。 In step 31, the base station 11 generates and stores a training data including the current subcarrier allocation results, the current allocated power, the target allocation action, the reward value, the updated subcarrier allocation results, and the updated allocated power.

在步驟32中，該基站11從儲存的訓練資料中選取多筆目標訓練資料，並根據該等目標訓練資料訓練該等強化學習網路，並重複進行步驟25。 In step 32, the base station 11 selects a plurality of target training data from stored training data, and trains the reinforcement learning networks according to the target training data, and repeats step 25.

值得一提的是，在重複步驟25前，該基站11會先將在步驟28所獲得該等更新子載波分配結果及該等更新分配功率分別作為該等當前子載波分配結果及該等當前分配功率，再重複進行步驟25。 It is worth mentioning that before repeating step 25, the base station 11 will first use the updated subcarrier allocation results and the updated allocated power obtained in step 28 as the current subcarrier allocation results and the current allocated power respectively, and then repeat step 25.

搭配參閱圖5，步驟32包括子步驟321~323，以下說明步驟32所包括的子步驟。 Referring to FIG. 5 , step 32 includes sub-steps 321 to 323 , and the sub-steps included in step 32 are described below.

在子步驟321中，該基站11從儲存的訓練資料中選取該等目標訓練資料。 In sub-step 321, the base station 11 selects the target training data from stored training data.

值得注意的是，在本實施例中，該基站11例如隨機選取32筆目標訓練資料，而在循環開始初期，因為沒有儲存足夠的訓練資料，故32筆目標訓練資料中會有幾筆目標訓練資料為空，但不以此為限。 It should be noted that, in this embodiment, the base station 11 randomly selects 32 pieces of target training data, and at the beginning of the cycle, because there is not enough training data stored, some of the 32 pieces of target training data will be empty, but not limited thereto.
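The random selection of 32 target training data, including the early phase in which some entries are still empty, can be sketched as below. This is a sketch under assumptions: the function name `sample_batch`, the use of `None` for an empty entry, and the optional `rng` argument are choices made here for illustration.

```python
import random
from typing import Any, List, Optional


def sample_batch(stored: List[Any], batch_size: int = 32,
                 rng: Optional[random.Random] = None) -> List[Optional[Any]]:
    """Randomly pick batch_size records; pad with None while too few are stored."""
    if rng is None:
        rng = random.Random()
    if len(stored) >= batch_size:
        return rng.sample(stored, batch_size)  # sampling without replacement
    # Early in the loop: use every stored record and mark the rest as empty.
    batch = list(stored)
    rng.shuffle(batch)
    return batch + [None] * (batch_size - len(batch))
```

A training routine built on this would skip the `None` entries when computing the loss.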

在子步驟322中，該基站11將該等目標訓練資料的當前子載波分配結果、當前分配功率，及目標分配動作輸入至該動作強化學習網路，以致該動作強化學習網路輸出多個分別對應該等目標訓練資料的訓練動作值。 In sub-step 322, the base station 11 inputs the current subcarrier allocation result, current allocated power, and target allocation action of the target training data into the action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of training action values respectively corresponding to the target training data.

In sub-step 323, the base station 11 adjusts the reinforcement learning networks according to the target training data and the training action values.

It is worth noting that, in this embodiment, the base station 11 uses the loss function to obtain a loss value according to the reward values of the target training data and the training action values, and uses the learning algorithm to update the reinforcement learning networks according to the loss value, thereby adjusting the reinforcement learning networks. That is, for each target training data, the base station 11 inputs the current subcarrier allocation results, the current allocated powers, and the target allocation action of the target training data into the update network, so that the update network outputs $Q(s_t,a_t)$; it then inputs the reward value, the updated subcarrier allocation results, and the updated allocated powers of the target training data into the target network, so that the target network outputs $r(s_t,a_t)+\gamma\max Q(s_{t+1},a_{t+1})$; and it takes the mean squared error between $r(s_t,a_t)+\gamma\max Q(s_{t+1},a_{t+1})$ and $Q(s_t,a_t)$ as the loss value. Here $\gamma\in[0,1]$ is the discount factor that weighs the importance of immediate rewards against subsequent rewards, $Q(s_t,a_t)$ is the training action value corresponding to the target training data, and $\max Q(s_{t+1},a_{t+1})$ is the maximum action value obtainable by combining the set of updated subcarrier allocation results and updated allocated powers of the target training data with all subcarrier allocation actions and power allocation actions. The base station 11 then updates the parameters of the update network according to the loss values of the target training data using the adaptive moment estimation (Adam) method. After a number of updates, for example 32, the parameters of the update network are copied to the target network to update the parameters of the target network, but the invention is not limited thereto; in other embodiments that have only the update network, the parameters of the update network need not be copied to the target network.
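The loss computation described above can be sketched in plain Python as follows. This is an illustrative sketch rather than the embodiment's code: the function names are assumptions, the networks themselves are not modelled (their outputs are passed in as numbers), and the Adam parameter update is omitted.

```python
from typing import Sequence


def td_target(reward: float, next_q_values: Sequence[float], gamma: float) -> float:
    """r(s_t, a_t) + gamma * max Q(s_{t+1}, a_{t+1}), using target-network outputs."""
    return reward + gamma * max(next_q_values)


def mse_loss(targets: Sequence[float], predictions: Sequence[float]) -> float:
    """Mean squared error between the targets and the update network's Q(s_t, a_t)."""
    return sum((y - q) ** 2 for y, q in zip(targets, predictions)) / len(targets)


def should_sync_target(update_count: int, period: int = 32) -> bool:
    """True after every `period` updates, when the parameters are copied to the target network."""
    return update_count > 0 and update_count % period == 0
```

Holding the target network fixed between synchronizations keeps the regression target from shifting on every update, which is the usual motivation for the two-network arrangement.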

It should be particularly noted that, in other embodiments in which the reinforcement learning networks include, for example, the comparison table, the comparison table has a plurality of table action values, each corresponding to a subcarrier allocation result, an allocated power result, and an allocation action. In step 32, the base station 11 updates the comparison table according to the target training data in order to train the reinforcement learning networks. In detail, the base station 11 updates the comparison table according to the following formula:

$$Q_{m+1}(s_i,a_i)=Q_m(s_i,a_i)+\alpha\left[r(s_i,a_i)+\gamma\max_{a'}Q_m(s_i',a')-Q_m(s_i,a_i)\right],$$

where $s_i$ denotes the subcarrier allocation set and allocated power set of the $i$-th target training data, $a_i$ denotes the target allocation action of the $i$-th target training data, $r(s_i,a_i)$ denotes the reward value of the $i$-th target training data, $s_i'$ denotes the set of updated subcarrier allocation results and updated allocated powers of the $i$-th target training data, $m$ denotes the number of times $Q(s_i,a_i)$ has been updated, $Q_m(s_i,a_i)$ is a target table action value in the comparison table corresponding to the subcarrier allocation results, allocated powers, and target allocation action of the $i$-th target training data, $Q_{m+1}(s_i,a_i)$ denotes the updated value of the target table action value, $\alpha$ denotes the learning rate of the update, and $\max_{a'}Q_m(s_i',a')$ denotes the maximum table action value in the comparison table obtainable by combining the set of updated subcarrier allocation results and updated allocated powers of the $i$-th target training data with all subcarrier allocation actions and power allocation actions $a'$.

This maximum is calculated by the target network among the reinforcement learning networks, and $Q_m(s_i,a_i)$ is calculated by the update network among the reinforcement learning networks. Because the SCMA technique involves many subcarrier allocation actions and power allocation actions, a Q table requires considerable storage space. This embodiment therefore uses a Q network with one hidden layer to fit the Q table: the input of the Q network corresponds to the state matrix in the Q table, and the output of the Q network corresponds to the Q values of that state in the Q table. Because the number of parameters in the Q network is far smaller than the number of Q values in the Q table, storage space at the base station is saved.
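For the table-based variant, a single update of one table action value can be sketched as below. The sketch assumes the standard tabular Q-learning update, which matches the symbols defined in the text ($\alpha$, $\gamma$, $r$, $Q_m$, $Q_{m+1}$) but is an assumption here because the formula itself is not legible; the function name and the dictionary keyed by `(state, action)` are likewise illustrative.

```python
from collections import defaultdict
from typing import DefaultDict, Hashable, Iterable, Tuple

QTable = DefaultDict[Tuple[Hashable, Hashable], float]


def q_table_update(q: QTable, state: Hashable, action: Hashable, reward: float,
                   next_state: Hashable, actions: Iterable[Hashable],
                   alpha: float, gamma: float) -> float:
    """Q <- Q + alpha * (r + gamma * max_a' Q(s', a') - Q); returns the new value."""
    best_next = max(q[(next_state, a)] for a in actions)  # maximum table action value
    q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
    return q[(state, action)]
```

In a real system the state would have to be a hashable encoding of the allocation matrices, and the number of such keys is what makes the table expensive compared with a small Q network.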

在步驟33中，該基站11根據該等當前子載波分配結果及該等當前分配功率計算出一候選頻譜效率，並儲存該等當前子載波分配結果、該等當前分配功率，及該候選頻譜效率，並重複進行步驟22。值得注意的是，在本實施例中，每當進行步驟33該循環計數器加1，但不以此為限，在其他實施方式中，該循環計數器亦可在步驟23或步驟24加1。 In step 33, the base station 11 calculates a candidate spectral efficiency according to the current subcarrier allocation results and the current allocated power, and stores the current subcarrier allocation results, the current allocated power, and the candidate spectral efficiency, and repeats step 22. It should be noted that, in this embodiment, the loop counter is incremented by 1 every time step 33 is performed, but not limited thereto. In other implementation manners, the loop counter can also be incremented by 1 in step 23 or step 24 .

在步驟34中，該基站11從該等候選頻譜效率中獲得一最高的目標頻譜效率，該循環計數器清零並重複步驟22，其中，該目標頻譜效率對應的子載波分配結果及分配功率即為最佳的子載波分配結果及最佳的分配功率。 In step 34, the base station 11 obtains a highest target spectral efficiency from the candidate spectral efficiencies, clears the loop counter and repeats step 22, wherein the subcarrier allocation result and allocated power corresponding to the target spectral efficiency are the best subcarrier allocation result and the best allocated power.
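Step 34 amounts to picking, among the stored candidates, the one with the highest spectral efficiency. A minimal sketch, assuming each stored candidate is kept as a `Candidate` record (a hypothetical structure introduced here) holding the allocation results, the allocated powers, and the candidate spectral efficiency:

```python
from typing import Any, List, NamedTuple


class Candidate(NamedTuple):
    subcarriers: Any            # stored current subcarrier allocation results
    powers: Any                 # stored current allocated powers
    spectral_efficiency: float  # candidate spectral efficiency from step 33


def select_target(candidates: List[Candidate]) -> Candidate:
    """Return the candidate whose spectral efficiency is highest."""
    return max(candidates, key=lambda c: c.spectral_efficiency)
```

The allocation stored in the returned record is then taken as the best subcarrier and power allocation.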

綜上所述，本發明為基於人工智慧算法之下行稀疏碼多址接入系統資源分配方法，藉由該基站11利用該等強化學習網路在不同場景記錄學習，以獲取具有最大的獎勵值之最佳分配動作，並進一步獲得該等候選頻譜效率，再從該等候選頻譜效率中獲得最高的該目標頻譜效率，其中，該目標頻譜效率對應的子載波分配及功率分配即為最優，故確實能達成本發明的目的。 To sum up, the present invention is a resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm. The base station 11 utilizes the reinforcement learning networks to record learning in different scenarios to obtain the optimal allocation action with the largest reward value, and further obtains the candidate spectral efficiencies, and then obtains the highest target spectral efficiency from the candidate spectral efficiencies. Among them, the subcarrier allocation and power allocation corresponding to the target spectral efficiency are optimal, so the purpose of the present invention can indeed be achieved.

惟以上所述者，僅為本發明的實施例而已，當不能以此限定本發明實施的範圍，凡是依本發明申請專利範圍及專利說明書內容所作的簡單的等效變化與修飾，皆仍屬本發明專利涵蓋的範圍內。 But the above are only embodiments of the present invention, and should not limit the scope of the present invention. All simple equivalent changes and modifications made according to the patent scope of the present invention and the content of the patent specification are still within the scope of the patent of the present invention.

21~34:步驟 21~34: Steps

Claims

A resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm, implemented by a base station that is communicatively connected to K user terminals via a wireless channel and that stores a plurality of subcarrier allocation actions and a plurality of power allocation actions, where K>1 and N>1, the method comprising the following steps: (A) allocating the subcarriers to the user terminals to obtain N×K current subcarrier allocation results indicating whether the user terminals are allocated the subcarriers, the current subcarrier allocation results $j_{n,k,t}$ satisfying the following conditions: $1\le\sum_{n=1}^{N}j_{n,k,t}\le L$ and $\sum_{k=1}^{K}j_{n,k,t}\le d_f$, with $j_{n,k,t}\in\{0,1\}$, $n\in\{1,2,\ldots,N\}$, and $k\in\{1,2,\ldots,K\}$, where $j_{n,k,t}=1$ indicates that the $k$-th user terminal is allocated the $n$-th subcarrier at the current time $t$, $j_{n,k,t}=0$ indicates that the $k$-th user terminal is not allocated the $n$-th subcarrier at the current time $t$, $L$ is the maximum number of subcarriers allocated to each user terminal, and $d_f$ is the maximum number of user terminals on each subcarrier; (B) obtaining, according to the current subcarrier allocation results, N×K current allocated powers respectively corresponding to the current subcarrier allocation results; (C) inputting the subcarrier allocation actions, the power allocation actions, the current subcarrier allocation results, and the current allocated powers into an action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of action values respectively corresponding to the power allocation actions and the subcarrier allocation actions; (D) determining whether the action values are all less than or equal to 0; (E) when it is determined that one of the action values is greater than 0, selecting a target allocation action from the subcarrier allocation actions and the power allocation actions; (F) obtaining, according to the current subcarrier allocation results, the current allocated powers, and the target allocation action, a plurality of updated subcarrier allocation results respectively corresponding to the current subcarrier allocation results and a plurality of updated allocated powers respectively corresponding to the current allocated powers; (G) calculating a reward value according to the current allocated powers and the updated allocated powers; (H) generating and storing training data including the current subcarrier allocation results, the current allocated powers, the target allocation action, the reward value, the updated subcarrier allocation results, and the updated allocated powers; (I) selecting a plurality of target training data from the stored training data, and training at least one reinforcement learning network according to the target training data, the at least one reinforcement learning network including the action reinforcement learning network; (J) taking the updated subcarrier allocation results and the updated allocated powers as the current subcarrier allocation results and the current allocated powers, respectively, and repeating steps (C) to (I) until the action values are all less than or equal to 0; (K) when it is determined that the action values are all less than or equal to 0, calculating a candidate spectral efficiency according to the current subcarrier allocation results and the current allocated powers, and storing the current subcarrier allocation results, the current allocated powers, and the candidate spectral efficiency; (L) repeating steps (A) to (K) P times to obtain P candidate spectral efficiencies, where P>1; and (M) obtaining a highest target spectral efficiency from the candidate spectral efficiencies.

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 1, wherein, in step (B), the current allocated powers $v_{n,k,t}$ satisfy the following conditions: [expression not preserved in the source text] 1, $0\le v_{n,k,t}\le 1$, and [formula not preserved in the source text], where $n\in\{1,2,\ldots,N\}$, $k\in\{1,2,\ldots,K\}$, $j_{n,k,t}$ is the current subcarrier allocation result indicating whether the $k$-th user terminal is allocated the $n$-th subcarrier at the current time $t$, $j_{n,k,t}\in\{0,1\}$, $j_{n,k,t}=1$ indicates that the $k$-th user terminal is allocated the $n$-th subcarrier at the current time $t$, $j_{n,k,t}=0$ indicates that the $k$-th user terminal is not allocated the $n$-th subcarrier at the current time $t$, and $v_{n,k,t}$ is the coefficient of the current allocated power allocated at the current time $t$ to the $k$-th ordered user terminal on the $n$-th subcarrier.

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 1, wherein step (F) includes the following sub-steps: (F-1) determining whether the target allocation action is a subcarrier allocation action; (F-2) when it is determined that the target allocation action is a subcarrier allocation action, obtaining, according to the target allocation action, N×K replacement subcarrier allocation results respectively corresponding to the current subcarrier allocation results; (F-3) determining whether the replacement subcarrier allocation results satisfy a plurality of subcarrier allocation conditions; (F-4) when it is determined that any one of the subcarrier allocation conditions is not satisfied, taking the current subcarrier allocation results and the current allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively; and (F-5) when it is determined that the subcarrier allocation conditions are satisfied, taking the replacement subcarrier allocation results as the updated subcarrier allocation results, and obtaining the updated allocated powers according to the updated subcarrier allocation results.

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 3, wherein, in step (F-3), the subcarrier allocation conditions include: $1\le\sum_{n=1}^{N}j_{n,k,t+1}\le L$ and $\sum_{k=1}^{K}j_{n,k,t+1}\le d_f$, where $j_{n,k,t+1}$ is the replacement subcarrier allocation result indicating whether the $k$-th user terminal is allocated the $n$-th subcarrier at the next time $t+1$, $j_{n,k,t+1}\in\{0,1\}$, $n\in\{1,2,\ldots,N\}$, $k\in\{1,2,\ldots,K\}$, $j_{n,k,t+1}=1$ indicates that the $k$-th user terminal is allocated the $n$-th subcarrier at the next time $t+1$, and $j_{n,k,t+1}=0$ indicates that the $k$-th user terminal is not allocated the $n$-th subcarrier at the next time $t+1$.

如請求項3所述的基於人工智慧算法之下行稀疏碼多址接入系統資源分配方法，其中，在子步驟(F-1)之後包括以下子步驟：(F-6)當判定出該目標分配動作不為子載波分配動作時，對該等當前分配功率進行該目標分配動作，以獲得多個分別對應該等當前分配功率的替換分配功率；(F-7)判定該等替換分配功率是否滿足多個功率分配條件； (F-8)當判定出不滿足該等功率分配條件之其中一者時，將該等當前子載波分配結果及該等當前分配功率分別作為該等更新子載波分配結果及該等更新分配功率；及(F-9)當判定出滿足該等功率分配條件時，將該等當前子載波分配結果及該等替換分配功率分別作為該等更新子載波分配結果及該等更新分配功率。 The resource allocation method for downlink sparse code multiple access system based on artificial intelligence algorithm as described in claim item 3, wherein, after the sub-step (F-1), the following sub-steps are included: (F-6) when it is determined that the target allocation action is not a subcarrier allocation action, perform the target allocation action on the current allocation powers, so as to obtain multiple alternative allocation powers respectively corresponding to the current allocation powers; (F-7) determine whether the alternative allocation powers meet multiple power allocation conditions; (F-8) When it is determined that one of the power allocation conditions is not met, use the current subcarrier allocation results and the current allocated power as the updated subcarrier allocation results and the updated allocated powers, respectively; and (F-9) When it is determined that the power allocation conditions are met, use the current subcarrier allocation results and the replacement allocated powers as the updated subcarrier allocation results and the updated allocated powers, respectively.

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 5, wherein, in sub-step (F-7), the power allocation conditions include: [expression not preserved in the source text] 1, $0\le v_{n,k,t+1}\le 1$, and [formula not preserved in the source text], where $n\in\{1,2,\ldots,N\}$, $k\in\{1,2,\ldots,K\}$, $j_{n,k,t+1}$ is the current subcarrier allocation result indicating whether the $k$-th user terminal is allocated the $n$-th subcarrier at the next time $t+1$, $j_{n,k,t+1}\in\{0,1\}$, $j_{n,k,t+1}=1$ indicates that the $k$-th user terminal is allocated the $n$-th subcarrier at the next time $t+1$, $j_{n,k,t+1}=0$ indicates that the $k$-th user terminal is not allocated the $n$-th subcarrier at the next time $t+1$, and $v_{n,k,t+1}$ is the coefficient of the replacement allocated power allocated at the next time $t+1$ to the $k$-th ordered user terminal on the $n$-th subcarrier.

如請求項1所述的基於人工智慧算法之下行稀疏碼多址接入系統資源分配方法，在步驟(F)及步驟(G)之間還包含以下步驟：(M)判定一相關於當前該基站通訊連接的用戶端之數量與用戶端的信號疊加到的子載波之數量的一超載率是否為K/N；當判定出該超載率為K/N時，進行步驟(G)。 The resource allocation method of downlink sparse code multiple access system based on artificial intelligence algorithm as described in claim item 1, also includes the following steps between step (F) and step (G): (M) determine whether an overload rate related to the number of user terminals connected to the current base station communication and the number of subcarriers to which the signal of the user terminal is superimposed is K / N ; when it is determined that the overload rate is K / N , step (G) is performed.

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 7, further comprising, before step (A), the following step: (N) initializing a plurality of reinforcement learning networks; and further comprising, after step (M), the following steps: (O) when it is determined that the overload rate is not K/N, storing historical reinforcement learning network information that includes the reinforcement learning networks and corresponds to the overload rate K/N, and determining whether target historical reinforcement learning network information corresponding to an overload rate K'/N is stored, where K'>1 and K'≠K; and (P) when it is determined that the target historical reinforcement learning network information is stored, loading the target historical reinforcement learning network information and repeating steps (A) to (F) and (M); and when it is determined that the target historical reinforcement learning network information is not stored, repeating steps (N), (A) to (F), and (M).

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 1, wherein, in step (E), the probability that the target allocation action is randomly selected is P₁, and the probability that the action value corresponding to the target allocation action is the highest of the action values is P₂, where P₁+P₂=1 and P₁<P₂.

如請求項1所述的基於人工智慧算法之下行稀疏碼多址接入系統資源分配方法，其中，步驟(G)包括以下子步驟：(G-1)根據該等當前分配功率計算出一第一頻譜效率； (G-2)根據該等更新分配功率計算出一第二頻譜效率；及(G-3)根據該第一頻譜效率及該第二頻譜效率計算出該獎勵值。 The resource allocation method for downlink sparse code multiple access system based on artificial intelligence algorithm as described in Claim 1, wherein step (G) includes the following sub-steps: (G-1) calculating a first spectrum efficiency according to the current allocated power; (G-2) calculating a second spectral efficiency according to the updated allocated power; and (G-3) calculating the reward value according to the first spectral efficiency and the second spectral efficiency.

The resource allocation method for a downlink sparse code multiple access system based on an artificial intelligence algorithm as described in claim 1, wherein step (I) includes the following sub-steps: (I-1) selecting the target training data from the stored training data; (I-2) inputting the subcarrier allocation results, the allocated powers, and the target allocation actions of the target training data into the action reinforcement learning network, so that the action reinforcement learning network outputs a plurality of training action values respectively corresponding to the target training data; and (I-3) adjusting the at least one reinforcement learning network according to the target training data and the training action values.

如請求項11所述的基於人工智慧算法之下行稀疏碼多址接入系統資源分配方法，其中，在步驟(I-3)中，根據該等目標訓練資料的獎勵值及該等訓練動作值利用一損失函數獲得一損失值，並根據該損失值利用一學習演算法將該至少一強化學習網路進行更新，以調整該至少一強化學習網路。The resource allocation method of downlink sparse code multiple access system based on artificial intelligence algorithm as described in claim item 11, wherein, in step (I-3), a loss value is obtained by using a loss function according to the reward value of the target training data and the training action values, and the at least one reinforcement learning network is updated using a learning algorithm according to the loss value to adjust the at least one reinforcement learning network.