WO2022137574A1 - Control device, virtual network allocation method, and program - Google Patents

Control device, virtual network allocation method, and program

Info

Publication number
WO2022137574A1
Authority
WO
WIPO (PCT)
Prior art keywords
agent
allocation
learning
value function
action value
Prior art date
Application number
PCT/JP2020/049028
Other languages
French (fr)
Japanese (ja)
Inventor
晃人 鈴木
薫明 原田
Original Assignee
Nippon Telegraph and Telephone Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nippon Telegraph and Telephone Corporation
Priority to PCT/JP2020/049028 priority Critical patent/WO2022137574A1/en
Priority to JP2022571008A priority patent/JPWO2022137574A1/ja
Publication of WO2022137574A1 publication Critical patent/WO2022137574A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Definitions

  • the present invention relates to a technique for allocating a virtual network to a physical network.
  • VNF (Virtual Network Function)
  • Examples of physical resources include network resources such as link bandwidth and server resources such as CPU and HDD capacity.
  • VN (Virtual Network)
  • VN allocation refers to allocating a VN consisting of a virtual link and a virtual node to a physical resource.
  • the virtual link represents the demand for network resources such as the required bandwidth and required delay between VNFs, and the connection relationship between VNFs and users.
  • the virtual node represents the demand for server resources such as the number of CPUs required to execute VNF and the amount of memory required.
  • Optimal allocation refers to allocation that maximizes the value of the objective function such as resource utilization efficiency while satisfying constraints such as service requirements and resource capacity.
  • the dynamic VN allocation method is a method for obtaining the optimum VN allocation for the time-varying VN demand.
  • it is necessary to satisfy the optimality and immediacy of allocation at the same time.
  • however, the two are in a trade-off relationship, and it is difficult to satisfy the optimality and immediacy of allocation at the same time.
  • in order to increase the accuracy of the allocation result, it is necessary to increase the calculation time.
  • an increase in calculation time is directly linked to an increase in the allocation cycle, and as a result, the immediacy of allocation is reduced.
  • it is necessary to reduce the allocation cycle in order to respond immediately to fluctuations in demand.
  • the reduction of the allocation cycle directly leads to the reduction of the calculation time, and as a result, the optimality of the allocation is reduced. From the above, it is difficult to satisfy the optimality and immediacy of allocation at the same time.
  • as a means of solving this problem, a dynamic VN allocation method based on reinforcement learning has been proposed (Non-Patent Documents 1 and 2).
  • reinforcement learning (RL) is a method of learning the behavior that maximizes the sum of rewards obtained in the future (the cumulative reward).
  • in the existing methods, the relationship between the network state and the optimum allocation is learned in advance by RL, and the optimization calculation at each time is no longer required, so that the optimality and immediacy of allocation can be realized at the same time.
  • the problem of finding the optimum VN allocation is a combinatorial optimization problem, and the number of VN allocation solution candidates (that is, the number of actions) increases exponentially with the number of servers and the number of VNs.
  • in the technique of Non-Patent Document 1, the cost of pre-learning increases exponentially with the number of servers and VNs, so a mechanism for improving scalability is required; therefore, the technique of Non-Patent Document 2 introduces into the technique of Non-Patent Document 1 a mechanism that improves scalability with respect to the number of VNs through multi-agentization.
  • VN demands are grouped, an RL agent is prepared for each group, and the learning cost of each agent is reduced, thereby improving scalability with respect to the number of VNs.
  • in the technique of Non-Patent Document 2, the number of agents that can act at each time is limited to one in order to prevent a decrease in allocation accuracy caused by a plurality of agents acting simultaneously and creating allocation conflicts among the agents.
  • an example of an allocation conflict between agents is that multiple agents simultaneously try to move virtual machines (VMs) to a server with low resource utilization, and the server becomes overloaded.
  • VM (Virtual Machine)
  • Such restriction of actions at each time can prevent allocation conflicts, but leads to a decrease in the accuracy of VN allocation.
  • the present invention has been made in view of the above points, and its purpose is to provide a technique that, when allocating a virtual network to a physical network using multi-agent reinforcement learning, avoids deterioration of allocation accuracy caused by allocation conflicts between agents and thereby improves allocation accuracy.
  • a control device is provided for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning.
  • the control device includes a pre-learning unit that divides the demand of the virtual network into a plurality of groups, assigns an agent to each group, and learns the individual action value function of each agent so that the overall action value function of all agents is maximized, and an allocation unit that allocates the virtual network to the physical network by using the individual action value function of each agent learned by the pre-learning unit.
  • a dynamic VN allocation method based on cooperative multi-agent reinforcement learning (Coop-MARL) is used.
  • Coop-MARL: cooperative multi-agent reinforcement learning
  • a mechanism in which a plurality of agents cooperate to learn the optimum VN allocation is introduced into the dynamic VN allocation method based on multi-agent reinforcement learning (MARL) of reference [2].
  • MARL: multi-agent reinforcement learning
  • the VN demand and the physical network usage at each time are defined as the state, changes of network routes and VN allocations are defined as the action, and the optimum VN allocation policy is learned by designing the reward according to the objective function and constraints.
  • the VN allocation problem of the entire network is divided into a plurality of sub-problems by multi-agentization; that is, by dividing the VN demand into a plurality of groups and preparing an agent for each group, scalability with respect to the number of VNs is improved.
  • each agent learns in advance the optimum VN allocation for the VNs included in its group, and during actual control the agent immediately determines the optimum VN allocation based on the learning result, so that optimality and immediacy are realized at the same time.
  • in reference [2], each agent independently learns the behavior that is optimal for itself, whereas in the technique according to the present embodiment, each agent learns the cooperative behavior that is optimal for all agents.
  • FIG. 1 shows a configuration example of the system according to the present embodiment.
  • the system has a control device 100 and a physical network 200.
  • the control device 100 is a device that executes dynamic VN allocation by cooperative multi-agent reinforcement learning.
  • the physical network 200 is a network having physical resources to which the VN is allocated.
  • the control device 100 is connected to the physical network 200 by a control network or the like, and can acquire state information from the devices constituting the physical network 200 and transmit setting commands to those devices. The control device 100 can also receive a user's VN demand from the user's management terminal or the like.
  • the physical network 200 has a plurality of physical nodes 300 and a plurality of physical links 400 connecting the physical nodes 300.
  • a physical server is connected to the physical node 300.
  • a user (user terminal, user network, etc.) is connected to the physical node 300.
  • the physical server exists in the physical node 300 and the user exists in the physical node 300.
  • when such a VN is allocated to physical resources, the physical server to which the VM is assigned and the route (a set of physical links) between the user (physical node) and that allocation-destination physical server are determined, and the physical network 200 is configured based on the determined allocation.
  • the physical server may be simply called a "server”
  • the physical link may be simply called a "link”.
  • FIG. 2 shows an example of the functional configuration of the control device 100.
  • the control device 100 includes a pre-learning unit 110, a reward calculation unit 120, an allocation unit 130, and a data storage unit 140.
  • the reward calculation unit 120 may be included in the pre-learning unit 110.
  • the "pre-learning unit 110, the reward calculation unit 120" and the “allocation unit 130" may be provided in separate devices (computers operating by the program, etc.). The outline of the functions of each part is as follows.
  • the pre-learning unit 110 performs pre-learning of the action value function (Q k of each agent and Q tot of all agents as described later) using the reward calculated by the reward calculation unit 120.
  • the reward calculation unit 120 calculates the reward.
  • the allocation unit 130 executes the allocation of the VN to the physical resource by using the action value function Qk learned by the pre-learning unit 110.
  • the data storage unit 140 has a Replay Memory function and stores parameters and the like necessary for the calculation.
  • the pre-learning unit 110 includes an agent in the learning model of reinforcement learning. "Learning the agent" corresponds to the pre-learning unit 110 learning the action value function. The detailed operation of each part will be described later.
  • the control device 100 can be realized, for example, by causing a computer to execute a program.
  • This computer may be a physical computer or a virtual machine.
  • control device 100 can be realized by executing a program corresponding to the processing executed by the control device 100 using hardware resources such as a CPU and a memory built in the computer.
  • the above program can be recorded on a computer-readable recording medium (portable memory, etc.), stored, and distributed. It is also possible to provide the above program through a network such as the Internet or e-mail.
  • FIG. 3 is a diagram showing an example of the hardware configuration of the above computer.
  • the computer of FIG. 3 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, and the like, which are connected to each other by a bus B, respectively.
  • the program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or a memory card.
  • a recording medium 1001 such as a CD-ROM or a memory card.
  • the program is installed in the auxiliary storage device 1002 from the recording medium 1001 via the drive device 1000.
  • the program does not necessarily have to be installed from the recording medium 1001, and may be downloaded from another computer via the network.
  • the auxiliary storage device 1002 stores the installed program and also stores necessary files, data, and the like.
  • the memory device 1003 reads and stores the program from the auxiliary storage device 1002 when the program is instructed to start.
  • the CPU 1004 realizes the function related to the control device 100 according to the program stored in the memory device 1003.
  • the interface device 1005 is used as an interface for connecting to a network, and functions as an input means and an output means via the network.
  • the display device 1006 displays a GUI (Graphical User Interface) or the like by a program.
  • the input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used for inputting various operation instructions.
  • FIG. 4 is a variable definition related to VN allocation. As shown in FIG. 4, the variables are defined as follows.
  • FIG. 5 shows the definition of variables for Coop-MADRL. As shown in FIG. 5, the variables are defined as follows.
  • the action value function Q_k of each agent may be referred to as the individual action value function, and the action value function Q_tot of all agents may be referred to as the overall action value function.
  • each VN demand is composed of a traffic demand as a virtual link and a VM demand (virtual machine demand) as a virtual node.
  • for each server and each link, the server capacity c^Z_i and the link capacity c^L_ij, which mean the upper limits of the allocatable resource amount, are set.
  • VN demand is composed of a start point (user), an end point (VM), a traffic demand D t , and a VM size V t .
  • the VM size indicates the processing capacity of the VM requested by the user, and it is assumed that the server capacity is consumed by the amount of the VM size at the time of server allocation.
  • when a VN demand is allocated to the physical network, it is assumed that network resources and server resources are consumed by D_t and V_t, respectively.
  • the control device 100 of the present embodiment allocates the VN to the physical network by solving the dynamic VN allocation problem for the above physical network and VN demand.
  • the dynamic VN allocation problem imposes, as the objective function, minimization of the sum of the maximum link utilization rate U^L_t and the maximum server utilization rate U^Z_t over all times.
  • as constraints, the maximum link utilization rate and the maximum server utilization rate must each be 1 or less, that is, U^L_t ≤ 1 and U^Z_t ≤ 1.
  • FIG. 6 is a flowchart showing the overall operation of the control device 100.
  • the pre-learning unit 110 first performs pre-learning for each agent and for all agents (S100).
  • the allocation unit 130 performs actual control based on the action value function of the learned agent, that is, a setting for virtual network allocation to the physical network (S200).
  • VN demand changes at each time step.
  • VN demand is first observed at each time step t.
  • the trained agent calculates the optimum VN allocation in the next time step t + 1 based on the observed value.
  • the route and VM arrangement are changed based on the calculation result.
  • let N be the number of VNs and M be the number of agents; the VN demand is divided into the same number of groups as agents.
  • agent g_k performs the allocation of the k-th VN group.
  • when the number of VNs is N and the number of servers is |Z|, the number of actions when the candidate route is limited to one is |Z|^N.
  • the VN demand is divided into M (1 ≤ M ≤ N) groups, and the number of actions of each agent is reduced to |Z|^(N/M), which reduces the learning cost of each agent.
  • for example, when the number of VNs is 4, the number of servers is 2, and the number of agents is 1, the number of VN allocation candidates (that is, the number of actions) is 2^4 = 16; with 4 agents, the number of VN allocation candidates per agent is 2^(4/4) = 2.
  • hereinafter, for simplicity, N = M, that is, the same number of agents as VNs, is assumed.
  • the pre-learning unit 110 learns the action value function Q(s_t, a_t) in the pre-learning.
  • the action value function Q(s_t, a_t) represents an estimate of the cumulative reward obtained when the action a_t is selected in the state s_t.
  • each agent can observe only a part of the state s_t, and the observation of each agent is denoted o^k.
  • the action value function of the k-th agent is then Q_k(o^k, a^k).
  • the action value function of the k-th agent is denoted Q_k, and the overall action value function is denoted Q_tot.
  • Q k represents the action value for each agent, and Q tot represents the action value for all agents.
  • the pre-learning unit 110 calculates Q tot based on Q k .
  • the method for calculating Q_tot from Q_k is not limited to a specific method; for example, it can be calculated by equation (2) (reference [4]), in which the sum of Q_k(o^k, a^k) over the agents is taken as Q_tot.
  • each learned agent selects the action that maximizes Q k as the action at time t.
  • Each trained agent corresponds to the allocation unit 130 that executes the allocation process using the trained action value function.
  • each agent observes ok t , and each agent performs an action that maximizes Q k based on its own Q k .
  • in step t, after all the agents have determined their actions a^k_t, the state s_t is updated based on the actions a_t of all the agents, and then r_t is calculated; that is, in the phase of collecting training sample data, sample data is collected based on Q_k without using Q_tot.
  • Q k is updated based on the collected sample data (o j , a j , r j , o j + 1 ).
  • Q tot is calculated based on Q k .
  • Q k is learned so that Q tot is maximized. That is, Q k is updated.
  • in the existing method, each Q_k is updated so that Q_k itself is maximized, whereas in the present embodiment each Q_k is updated so that Q_tot is maximized, which makes it possible to select the action that is optimal for all agents.
  • the learning algorithm for reinforcement learning is not limited to a specific one; any learning algorithm can be applied.
  • the technique disclosed in reference [3] or the like can be used as a learning algorithm for reinforcement learning.
  • An example of a learning algorithm when the technique of reference [3] is used will be described.
  • θ_k represents the parameters of the k-th neural network, and θ represents the collection of all θ_k.
  • the Loss function used for learning the neural network is defined by the following formula.
  • is a hyperparameter.
  • Q_k is learned by updating θ with stochastic gradient descent using the collected sample data (o_j, a_j, r_j, o_j+1).
  • the action that maximizes each Qk is selected as in the phase of collecting the learning sample data of the pre-learning.
  • since the Q_k here has been learned so that Q_tot is maximized, it is possible to select a coordinated action that takes the other agents into consideration.
  • D t and V t are traffic demand and VM size
  • RL t and R Z t are the residual bandwidth of each link and the residual capacity of each server.
  • the residual server capacity r^Z_{i,t} is calculated in the same manner.
  • the set of actions of each agent is defined as A_k, and one action a^k_t is selected from A_k at each time.
  • the route is uniquely determined for the allocation destination server, and the VN allocation is determined by the combination of the VM and the allocation destination server.
  • the reward calculation procedure for g_k is shown in FIG. 7 (an illustrative code sketch of this calculation is also given after this list). This procedure is executed by the reward calculation unit 120 of the control device 100.
  • Eff(x) on the first line represents an efficiency function, defined so that Eff(x) decreases as x increases.
  • a penalty is given according to the reassignment of the VN in order to suppress unnecessary relocation of the VN.
  • when the reward calculation unit 120 determines that reallocation has been performed (that is, when Y_t and Y_t+1 are different),
  • the reward calculation unit 120 proceeds to the third line and applies the penalty function P(Y_t, Y_t+1) to the reward r_t.
  • P(Y_t, Y_t+1) is a penalty function for suppressing VN rearrangement, and is set so that the P value is large when rearrangement is to be suppressed and small when rearrangement is allowed.
  • Y t represents the VM allocation state.
  • FIG. 8 shows the pre-learning procedure (pre-learning algorithm) of the cooperative multi-agent reinforcement learning (Coop-MARL) executed by the pre-learning unit 110.
  • the series of actions in time step E is called an episode, and the episode is repeated until learning is completed.
  • at the beginning of each episode (lines 2 to 18), the pre-learning unit 110 generates candidate learning traffic demands and VM demands for E steps for all VNs.
  • the pre-learning unit 110 repeatedly executes a series of procedures (lines 5 to 16).
  • a learning sample of each agent (a tuple of the state o^k_t, the action a^k_t, the reward r_t, and the next state o^k_t+1) is generated and stored in the Replay Memory M.
  • the action selection based on the current observation o^k_t and Q_k, the update of the state based on the action a^k_t (relocation of the VN), and the calculation of the reward r_t under the updated observation are performed for each agent.
  • the 12th to 14th lines indicate the end condition of the episode.
  • the learning samples of episodes 0 to E are stored in the Replay Memory M.
  • training samples are randomly taken from the Replay Memory M to train the agents.
  • each Q k is learned so that the Q tot is maximized based on the algorithm of reinforcement learning.
  • the learning algorithm for reinforcement learning is not limited to a specific one; any learning algorithm can be applied.
  • for example, the technique disclosed in reference [3] or the like can be used as the learning algorithm for reinforcement learning.
  • the state observation and the actions (allocation of VNs to physical resources) in pre-learning may be performed on the actual physical network 200 or on a model equivalent to the actual physical network 200.
  • the pre-learning unit 110 generates candidates for learning traffic demand and VM demand having the number of steps E, and stores them in the data storage unit 140.
  • S101 to S106 are executed while t is smaller than T. Further, S103 to S105 are repeated in each episode of episodes 0 to E.
  • the pre-learning unit 110 (each agent) observes ok t and selects an action a kt that maximizes Q k based on the Q k of each agent itself. That is, each agent selects the VN allocation destination server assigned to itself so that the value (Q k ) of the action value function is maximized.
  • the pre-learning unit 110 sets the selected action (VN allocation) in the physical network 200, updates the observation ot , and obtains ot + 1 . Further, in S104, the reward calculation unit 120 calculates the reward rt for each agent by the above-mentioned calculation method.
  • the pre-learning unit 110 stores the collected learning sample data (o_j, a_j, r_j) in the Replay Memory M (data storage unit 140).
  • the pre-learning unit 110 randomly selects learning samples from the Replay Memory M (data storage unit 140) and trains each agent so that Q_tot is maximized; that is, the action value function Q_k of each agent is learned.
  • FIG. 10 shows a dynamic VN allocation procedure by cooperative multi-agent reinforcement learning (Coop-MADRL) executed by the allocation unit 130 of the control device 100.
  • Coop-MADRL (cooperative multi-agent deep reinforcement learning)
  • the first line of FIG. 10 shows the pre-learning of g k , which is as described above. After that, the 2nd to 8th lines are repeated at each time. At a certain time, in the 4th line, the state ok t is observed. In the fifth line, the action a kt that maximizes Q k ( ok t , a' ) is selected based on the state ok t . The fourth and fifth lines are executed for each agent. In the 7th line, the VN allocation is updated.
  • the allocation unit 130 observes (acquires) the state ok t (VM demand, traffic demand, residual link capacity, residual server capacity) at time t for each agent.
  • the VM demand and the traffic demand are received from each user (user terminal, etc.), and the residual link capacity and the residual server capacity are acquired from the physical network 200 (or an operation system that monitors the physical network 200).
  • the VM demand and the traffic demand may be values obtained by demand forecasting.
  • the allocation unit 130 selects the action akt that maximizes Qk ( ok t , a' ) for each agent. That is, the allocation unit 130 selects the VM allocation destination server in the VN of each agent so that Q k ( ok t , a') is maximized.
  • the allocation unit 130 updates the state. Specifically, for the VN of each agent, the allocation unit 130 configures the physical network 200 so that the VM is allocated to the selected allocation-destination server, and sets the route (set of links) in the physical network 200 so that the traffic according to the demand flows through the correct route.
  • the present specification discloses at least the control device, the virtual network allocation method, and the program of each of the following items.
  • (Section 1) A control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the control device comprising: a pre-learning unit that divides the demand of the virtual network into a plurality of groups, assigns an agent to each group, and learns the individual action value function, which is the action value function for each agent, so that the overall action value function, which is the action value function for all agents, is maximized; and an allocation unit that allocates the virtual network to the physical network by using the individual action value function of each agent learned by the pre-learning unit.
  • (Section 2) The control device according to item 1, wherein the pre-learning unit uses the sum of the individual action value functions over the agents as the overall action value function.
  • (Section 3) The control device according to item 1 or 2, wherein the allocation unit selects an action for allocating a virtual network to the physical network so that the learned individual action value function for each agent is maximized.
  • (Section 4) A virtual network allocation method executed by a control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the method comprising: a pre-learning step of dividing the demand of the virtual network into a plurality of groups, assigning an agent to each group, and learning the individual action value function, which is the action value function for each agent, so that the overall action value function, which is the action value function for all agents, is maximized; and an allocation step of allocating the virtual network to the physical network by using the individual action value function of each agent learned in the pre-learning step.
  • (Section 5) A program for making a computer function as each part in the control device according to any one of the items 1 to 3.
  • 100 Control device, 110 Pre-learning unit, 120 Reward calculation unit, 130 Allocation unit, 140 Data storage unit, 200 Physical network, 300 Physical node, 400 Physical link, 1000 Drive device, 1001 Recording medium, 1002 Auxiliary storage device, 1003 Memory device, 1004 CPU, 1005 Interface device, 1006 Display device, 1007 Input device
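As referenced in the reward-calculation items above (FIG. 7), the reward favors low maximum link/server utilization through the efficiency function Eff(x) and penalizes VN reallocation through P(Y_t, Y_t+1). The following minimal Python sketch illustrates this flow; the concrete forms of Eff and P, and the way P is applied to r_t, are not given in this text, so the choices below are assumptions for illustration only.

```python
# Illustrative sketch only: the concrete forms of Eff(x) and P(Y_t, Y_t1),
# and the way P enters r_t, are assumptions, not the patented definitions.

def eff(x: float) -> float:
    """Efficiency function: defined so that Eff(x) decreases as x increases (assumed linear form)."""
    return 1.0 - x

def penalty(y_t: dict, y_t1: dict) -> float:
    """Penalty function P(Y_t, Y_{t+1}) used to suppress unnecessary VN reallocation (assumed constant)."""
    return 0.5

def reward(max_link_util: float, max_server_util: float, y_t: dict, y_t1: dict) -> float:
    # First line of the described procedure: reward from the efficiency of the current allocation.
    r = eff(max(max_link_util, max_server_util))
    # If the VM allocation changed between t and t+1, apply the reallocation penalty.
    if y_t != y_t1:
        r = r * penalty(y_t, y_t1)   # assumed multiplicative application of P
    return r
```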

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

Provided is a control device for allocating a virtual network to a physical network having links and servers through cooperative multi-agent reinforcement learning, wherein the control device comprises: a pre-learning unit which divides demands of the virtual network into a plurality of groups, allocates an agent to each group, and learns an individual action value function, which is the action value function for each agent, so that the overall action value function, which is the action value function of all agents, is maximized; and an allocation unit which allocates the virtual network to the physical network by using the individual action value function of each agent learned by the pre-learning unit.

Description

Control Device, Virtual Network Allocation Method, and Program
 The present invention relates to a technique for allocating a virtual network to a physical network.
 With the development of NFV (Network Function Virtualization), it has become possible to execute Virtual Network Functions (VNFs) on general-purpose physical resources. By sharing physical resources among a plurality of VNFs, NFV is expected to improve resource utilization efficiency.
 Examples of physical resources include network resources such as link bandwidth and server resources such as CPU and HDD capacity. In order to provide high-quality network services at low cost, optimal allocation of virtual networks (Virtual Networks; VNs) to physical resources is necessary.
 VN allocation refers to allocating a VN consisting of virtual links and virtual nodes to physical resources. A virtual link represents network resource demand such as the required bandwidth and required delay between VNFs and the connection relationship between VNFs and users. A virtual node represents server resource demand such as the number of CPUs and the amount of memory required to execute a VNF. Optimal allocation refers to an allocation that maximizes the value of an objective function such as resource utilization efficiency while satisfying constraints such as service requirements and resource capacities.
 In recent years, fluctuations in traffic and server resource demand have intensified due to high-quality video distribution, OS updates, and the like. Static VN allocation, which estimates the demand by its maximum value within a certain period and does not change the allocation over time, lowers resource utilization efficiency; therefore, a dynamic VN allocation method that follows fluctuations in resource demand is required.
 The dynamic VN allocation method is a method for obtaining the optimum VN allocation for time-varying VN demand. In the dynamic VN allocation method, the optimality and immediacy of allocation must be satisfied at the same time. However, the two are in a trade-off relationship, and it is difficult to satisfy both at the same time. To increase the accuracy of the allocation result, the calculation time must be increased; however, an increase in calculation time directly increases the allocation cycle and thus reduces the immediacy of allocation. Similarly, the allocation cycle must be shortened to respond immediately to demand fluctuations; however, shortening the allocation cycle directly reduces the calculation time and thus the optimality of allocation. For these reasons, it is difficult to satisfy the optimality and immediacy of allocation at the same time.
 As a means of solving this problem, a dynamic VN allocation method based on reinforcement learning has been proposed (Non-Patent Documents 1 and 2). Reinforcement learning (RL) is a method of learning the behavior that maximizes the sum of rewards obtained in the future (the cumulative reward). In the existing methods disclosed in Non-Patent Documents 1 and 2, the relationship between the network state and the optimum allocation is learned in advance by RL, and the optimization calculation at each time is no longer required, so that the optimality and immediacy of allocation can be realized at the same time.
 The problem of finding the optimum VN allocation is a combinatorial optimization problem, and the number of VN allocation solution candidates (that is, the number of actions) increases exponentially with the number of servers and the number of VNs. To obtain the optimum VN allocation by RL, pre-learning corresponding to the number of actions is required. In the technique of Non-Patent Document 1, the cost of pre-learning increases exponentially with the number of servers and VNs, so a mechanism for improving scalability is needed. Therefore, the technique of Non-Patent Document 2 introduces into the technique of Non-Patent Document 1 a mechanism that improves scalability with respect to the number of VNs through multi-agentization: VN demands are grouped, an RL agent is prepared for each group, and the learning cost of each agent is reduced.
 In the technique disclosed in Non-Patent Document 2, the number of agents that can act at each time is limited to one in order to prevent a decrease in allocation accuracy caused by a plurality of agents acting simultaneously and creating allocation conflicts among the agents. An example of an allocation conflict between agents is that multiple agents simultaneously try to move virtual machines (VMs) to a server with low resource utilization, and the server becomes overloaded. Such a restriction of actions at each time can prevent allocation conflicts, but it lowers the accuracy of VN allocation.
 The present invention has been made in view of the above points, and its purpose is to provide a technique that, when allocating a virtual network to a physical network using multi-agent reinforcement learning, avoids deterioration of allocation accuracy caused by allocation conflicts between agents and thereby improves allocation accuracy.
 According to the disclosed technique, there is provided a control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the control device including: a pre-learning unit that divides the demand of the virtual network into a plurality of groups, assigns an agent to each group, and learns the individual action value function, which is the action value function of each agent, so that the overall action value function, which is the action value function of all agents, is maximized; and an allocation unit that allocates the virtual network to the physical network by using the individual action value function of each agent learned by the pre-learning unit.
 According to the disclosed technique, in allocating a virtual network to a physical network using multi-agent reinforcement learning, a technique is provided that avoids deterioration of allocation accuracy caused by allocation conflicts between agents and improves allocation accuracy.
 FIG. 1 is a system configuration diagram in an embodiment of the present invention. FIG. 2 is a functional configuration diagram of the control device. FIG. 3 is a hardware configuration diagram of the control device. FIG. 4 is a diagram showing definitions of variables. FIG. 5 is a diagram showing definitions of variables. FIG. 6 is a flowchart showing the overall operation of the control device. FIG. 7 is a diagram showing the reward calculation procedure. FIG. 8 is a diagram showing the pre-learning procedure. FIG. 9 is a flowchart showing the pre-learning operation of the control device. FIG. 10 is a diagram showing the allocation procedure. FIG. 11 is a flowchart showing the allocation operation of the control device.
 Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiment described below is merely an example, and embodiments to which the present invention is applied are not limited to it. References cited in the description of the embodiment are indicated as "[1]" and the like; the list of references is given at the end of the specification.
 (Outline of the embodiment)
 In the present embodiment, in order to solve the above-described problem, a dynamic VN allocation method based on cooperative multi-agent reinforcement learning (Cooperative multi-agent RL; Coop-MARL) is used. In this method, a mechanism in which a plurality of agents cooperate to learn the optimum VN allocation is introduced into the dynamic VN allocation method based on multi-agent reinforcement learning (MARL) of reference [2]. This solves the problem of reduced allocation accuracy caused by allocation conflicts between agents and improves allocation accuracy.
 More specifically, in the technique according to the present embodiment, as in references [1, 2], the VN demand and the physical network usage at each time are defined as the state, changes of network routes and VN allocations are defined as the action, and the optimum VN allocation policy is learned by designing the reward according to the objective function and constraints.
 Further, in the technique according to the present embodiment, as in reference [2], the VN allocation problem of the entire network is divided into a plurality of sub-problems by multi-agentization. That is, by dividing the VN demand into a plurality of groups and preparing an agent for each group, scalability with respect to the number of VNs is improved. Each agent learns in advance the optimum VN allocation for the VNs in its group, and during actual control the agent immediately determines the optimum VN allocation based on the learning result, so that optimality and immediacy are realized at the same time. In reference [2], each agent independently learns the behavior that is optimal for itself, whereas in the technique according to the present embodiment each agent learns the cooperative behavior that is optimal for all agents.
 (System configuration)
 FIG. 1 shows a configuration example of the system according to the present embodiment. As shown in FIG. 1, the system has a control device 100 and a physical network 200. The control device 100 is a device that executes dynamic VN allocation by cooperative multi-agent reinforcement learning. The physical network 200 is a network having the physical resources to which VNs are allocated. The control device 100 is connected to the physical network 200 by a control network or the like, and can acquire state information from the devices constituting the physical network 200 and transmit setting commands to those devices. The control device 100 can also receive a user's VN demand from the user's management terminal or the like.
 The physical network 200 has a plurality of physical nodes 300 and a plurality of physical links 400 connecting the physical nodes 300. A physical server is connected to each physical node 300, and a user (user terminal, user network, etc.) is connected to a physical node 300. In other words, it may be said that a physical server exists at a physical node 300 and a user exists at a physical node 300.
 For example, when a VN in which a user at a certain physical node 300 communicates with a VM is allocated to physical resources, the physical server to which the VM is assigned and the route (a set of physical links) between the user (physical node) and that allocation-destination physical server are determined, and the physical network 200 is configured based on the determined allocation. The physical server may be simply called a "server", and the physical link may be simply called a "link".
 FIG. 2 shows an example of the functional configuration of the control device 100. As shown in FIG. 2, the control device 100 includes a pre-learning unit 110, a reward calculation unit 120, an allocation unit 130, and a data storage unit 140. The reward calculation unit 120 may be included in the pre-learning unit 110. The "pre-learning unit 110 and reward calculation unit 120" and the "allocation unit 130" may also be provided in separate devices (e.g., computers operating according to programs). The outline of the functions of each part is as follows.
 The pre-learning unit 110 performs pre-learning of the action value functions (Q_k of each agent and Q_tot of all agents, as described later) using the reward calculated by the reward calculation unit 120. The reward calculation unit 120 calculates the reward. The allocation unit 130 allocates VNs to physical resources using the action value functions Q_k learned by the pre-learning unit 110. The data storage unit 140 has a Replay Memory function and stores parameters and the like necessary for the calculation. The pre-learning unit 110 includes the agents of the reinforcement learning model; "training an agent" corresponds to the pre-learning unit 110 learning its action value function. The detailed operation of each part will be described later.
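As a rough, non-authoritative sketch of how the functional blocks of FIG. 2 could be organized in code, the following Python skeleton mirrors the pre-learning unit 110, the reward calculation unit 120, the allocation unit 130, and the data storage unit 140 with its Replay Memory; all class and method names are hypothetical.

```python
import random
from collections import deque

class DataStorage:
    """Corresponds to the data storage unit 140: Replay Memory plus parameters needed for calculation."""
    def __init__(self, capacity: int = 100_000):
        self.replay_memory = deque(maxlen=capacity)
        self.parameters = {}

    def store(self, sample) -> None:              # sample = (o_t, a_t, r_t, o_{t+1})
        self.replay_memory.append(sample)

    def sample(self, batch_size: int):
        return random.sample(list(self.replay_memory), batch_size)

class RewardCalculator:
    """Corresponds to the reward calculation unit 120."""
    def calculate(self, state) -> float:
        raise NotImplementedError

class PreLearner:
    """Corresponds to the pre-learning unit 110: learns each agent's Q_k so that Q_tot is maximized."""
    def __init__(self, agents, reward_calculator: RewardCalculator, storage: DataStorage):
        self.agents = agents
        self.reward_calculator = reward_calculator
        self.storage = storage

    def train(self, num_episodes: int) -> None:
        raise NotImplementedError

class Allocator:
    """Corresponds to the allocation unit 130: allocates VNs using the learned Q_k of each agent."""
    def __init__(self, agents):
        self.agents = agents

    def allocate(self, observations):
        # Each trained agent greedily selects the action that maximizes its own Q_k.
        return [agent.best_action(o) for agent, o in zip(self.agents, observations)]
```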
 <Hardware configuration example>
 The control device 100 can be realized, for example, by causing a computer to execute a program. This computer may be a physical computer or a virtual machine.
 That is, the control device 100 can be realized by executing a program corresponding to the processing performed by the control device 100, using hardware resources such as a CPU and memory built into the computer. The above program can be recorded on a computer-readable recording medium (portable memory, etc.) to be stored and distributed. The program can also be provided through a network such as the Internet or by e-mail.
 FIG. 3 is a diagram showing an example of the hardware configuration of the above computer. The computer of FIG. 3 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, and the like, which are connected to one another by a bus B.
 The program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 into the auxiliary storage device 1002 via the drive device 1000. However, the program does not necessarily have to be installed from the recording medium 1001 and may be downloaded from another computer via a network. The auxiliary storage device 1002 stores the installed program and also stores necessary files, data, and the like.
 The memory device 1003 reads and stores the program from the auxiliary storage device 1002 when the program is instructed to start. The CPU 1004 realizes the functions of the control device 100 according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network, and functions as input and output means via the network. The display device 1006 displays a GUI (Graphical User Interface) or the like according to the program. The input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used for inputting various operation instructions.
 (Variable definitions)
 The definitions of the variables used in the following description are shown in FIGS. 4 and 5. First, FIG. 4 defines the variables related to VN allocation. As shown in FIG. 4, the variables are defined as follows.
 t ∈ T: time step (T: total number of time steps)
 N: number of VNs
 G(N, L): network graph (N: node set, L: link set)
 n ∈ N, z ∈ Z: node, server
 c^Z_i: server capacity
 c^L_ij: link capacity
 P := {p_i}: user placement
 D_t := {d^i_t}: traffic demand
 V_t := {v^i_t}: VM demand
 u^L_ij,t: link utilization rate
 u^Z_i,t: server utilization rate
 U^L_t = max_ij(u^L_ij,t): maximum link utilization rate
 U^Z_t = max_i(u^Z_i,t): maximum server utilization rate
 R^L_t := {r^L_ij,t}: residual link capacity
 R^Z_t := {r^Z_i,t}: residual server capacity
 Y_t := {y_ij,t}: set of VM allocations at time t (e.g., VM i is allocated to physical server j)
 P(Y_t, Y_t+1): penalty function
 In the above definitions, the link utilization rate u^L_ij,t is (1 - residual link capacity / total capacity) of link ij, and the server utilization rate u^Z_i,t is (1 - residual server capacity / total capacity) of server i.
 FIG. 5 shows the definitions of the variables related to Coop-MADRL. As shown in FIG. 5, the variables are defined as follows. The action value function Q_k of each agent may be called the individual action value function, and the action value function Q_tot of all agents may be called the overall action value function.
 e ∈ E: episode (E: total number of episodes)
 G := {g_k}: set of agents (1 ≤ k ≤ N)
 s_t ∈ S: state (S: state set)
 O := {O^k}: set of the observation spaces of all agents
 o^k_t ∈ O^k: observation of each agent
 o_t := {o^k_t}: observations of all agents
 A := {A_k}: set of the action sets of all agents (A_k: action set)
 a^k_t ∈ A_k: action of each agent
 a_t := {a^k_t}: actions of all agents
 r_t: reward
 Q_k(o^k_t, a^k_t): action value function of each agent
 Q_tot(o_t, a_t): action value function of all agents
 M: Replay memory
 (Dynamic VN allocation problem)
 The dynamic VN allocation problem of the present embodiment, which is assumed in the pre-learning and the actual control, will now be described.
 In the present embodiment, each VN demand is assumed to consist of a traffic demand as a virtual link and a VM demand (virtual machine demand) as a virtual node. As shown in FIG. 1, the physical network G(N, L) is composed of physical links L and physical nodes N, and it is assumed that a physical server Z is connected to each physical node N, that is, G(N, L) = G(Z, L). For each server and each link, the server capacity c^Z_i and the link capacity c^L_ij, which mean the upper limits of the allocatable resource amount, are set.
 Here, N VN demands are considered, and each user is assumed to request one VN demand. A VN demand consists of a start point (user), an end point (VM), a traffic demand D_t, and a VM size V_t. The VM size indicates the processing capacity of the VM requested by the user, and it is assumed that server capacity equal to the VM size is consumed when the VM is allocated to a server. When a VN demand is allocated to the physical network, network resources and server resources are assumed to be consumed by D_t and V_t, respectively.
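To make the data model concrete, the following is a small, hypothetical Python representation of a VN demand (start point, end point, traffic demand D_t, VM size V_t) and of the physical resources it consumes when allocated; all field and function names are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class VNDemand:
    user_node: int              # start point: physical node where the user is located
    vm_id: int                  # end point: the VM requested by the user
    traffic_demand: float       # D_t: consumes link bandwidth along the route
    vm_size: float              # V_t: consumes capacity of the allocation-destination server

@dataclass
class PhysicalResources:
    residual_server: dict       # r^Z_{i,t} per server i
    residual_link: dict         # r^L_{ij,t} per link (i, j)

def apply_allocation(demand: VNDemand, server: int, route: list, res: PhysicalResources) -> None:
    """Consume server capacity by V_t and link bandwidth by D_t along the chosen route."""
    res.residual_server[server] -= demand.vm_size
    for link in route:
        res.residual_link[link] -= demand.traffic_demand
```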
 The control device 100 of the present embodiment allocates VNs to the physical network by solving the dynamic VN allocation problem for the above physical network and VN demands. The dynamic VN allocation problem imposes, as the objective function, minimization of the sum of the maximum link utilization rate U^L_t and the maximum server utilization rate U^Z_t over all times, that is:
 minimize Σ_{t∈T} (U^L_t + U^Z_t)   (1)
 As constraints, the maximum link utilization rate and the maximum server utilization rate must each be 1 or less, that is, U^L_t ≤ 1 and U^Z_t ≤ 1.
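The objective function of equation (1) and the constraints can be evaluated directly from the utilization rates, as in the following sketch (a straightforward transcription of the formulas, assuming the utilizations are held in dictionaries).

```python
def max_link_utilization(link_util: dict) -> float:
    """U^L_t = max_ij u^L_{ij,t}"""
    return max(link_util.values())

def max_server_utilization(server_util: dict) -> float:
    """U^Z_t = max_i u^Z_{i,t}"""
    return max(server_util.values())

def objective(link_util_per_step: list, server_util_per_step: list) -> float:
    """Equation (1): sum over all time steps of U^L_t + U^Z_t (to be minimized)."""
    return sum(max_link_utilization(l) + max_server_utilization(z)
               for l, z in zip(link_util_per_step, server_util_per_step))

def constraints_satisfied(link_util: dict, server_util: dict) -> bool:
    """Constraints: U^L_t <= 1 and U^Z_t <= 1."""
    return max_link_utilization(link_util) <= 1.0 and max_server_utilization(server_util) <= 1.0
```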
 FIG. 6 is a flowchart showing the overall operation of the control device 100. As shown in the flowchart of FIG. 6, in the present embodiment, the pre-learning unit 110 first performs pre-learning for each agent and for all agents (S100). After that, the allocation unit 130 performs actual control based on the action value functions of the learned agents, that is, configuration of the physical network for virtual network allocation (S200).
 The present embodiment assumes discrete time steps and assumes that the VN demand changes at each time step. In actual control, at each time step t, the VN demand is first observed. Next, based on the observed values, the trained agents compute the optimum VN allocation for the next time step t + 1. Finally, the routes and the VM placement are changed based on the computation result.
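At run time, the per-time-step flow (observe the VN demand and network state, let the trained agents compute the allocation for t + 1, then apply the route and VM placement changes) could look like the following hedged sketch; the environment and agent interfaces (observe, apply_allocation, best_action) are hypothetical names.

```python
def actual_control_loop(agents, network, total_steps: int) -> None:
    """One allocation decision per discrete time step, using the trained agents."""
    for t in range(total_steps):
        # Observe VM demand, traffic demand, and residual link/server capacity for each agent.
        observations = network.observe(t)
        # Each trained agent picks the action that maximizes its learned Q_k.
        actions = [agent.best_action(observations[k]) for k, agent in enumerate(agents)]
        # Apply the result: change routes and VM placement for time step t + 1.
        network.apply_allocation(actions)
```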
 (Coop-MARL algorithm)
 Next, the Coop-MARL algorithm will be described. The definitions of the variables related to the Coop-MARL algorithm are as shown in FIG. 5.
 Let N be the number of VNs and M the number of agents; the VN demand is divided into the same number of groups as agents. Agent g_k performs the allocation of the k-th VN group. When the number of VNs is N and the number of servers is |Z|, the number of actions when the candidate route is limited to one is |Z|^N.
 In the present embodiment, as in the technique of reference [2], the VN demand is divided into M (1 ≤ M ≤ N) groups and the number of actions of each agent is reduced to |Z|^(N/M), which reduces the learning cost of each agent. For example, when the number of VNs is 4, the number of servers is 2, and the number of agents is 1, the number of VN allocation candidates (that is, the number of actions) is 2^4 = 16; with 4 agents, the number of VN allocation candidates per agent is 2^(4/4) = 2. Hereinafter, for simplicity, N = M, that is, the same number of agents as VNs, is assumed.
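The reduction of the action space can be checked numerically; the following snippet reproduces the worked example above (4 VNs, 2 servers, candidate route fixed to one): a single agent faces 2^4 = 16 allocation candidates, while with 4 agents each agent faces 2^(4/4) = 2.

```python
def actions_per_agent(num_servers: int, num_vns: int, num_agents: int) -> int:
    """|Z|^(N/M): number of allocation candidates each agent must consider."""
    return num_servers ** (num_vns // num_agents)

assert actions_per_agent(num_servers=2, num_vns=4, num_agents=1) == 16   # single agent: 2^4
assert actions_per_agent(num_servers=2, num_vns=4, num_agents=4) == 2    # four agents: 2^1 each
```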
 In pre-learning, the pre-learning unit 110 learns the action value function Q(s_t, a_t). The action value function Q(s_t, a_t) represents an estimate of the cumulative reward obtained when action a_t is selected in state s_t. In the multi-agent setting, each agent is assumed to observe only a part of the state s_t, and the observation of the k-th agent is denoted o_k. The action value function of the k-th agent is then Q_k(o_k, a_k).
 In the present embodiment, the action value function of the k-th agent is denoted Q_k and the overall action value function is denoted Q_tot. Q_k represents the action value for an individual agent, and Q_tot represents the action value for all agents. In the technique of reference [2], the k-th agent learns the action that maximizes Q_k, whereas in the present embodiment all agents learn the actions that maximize Q_tot. The pre-learning unit 110 computes Q_tot from the Q_k. The way Q_tot is computed from the Q_k is not limited to a specific method; for example, it can be computed by the following equation (2) (reference [4]). That is, in equation (2), Q_tot is the sum of Q_k(o_k, a_k) over the agents.
    Q_tot(o_t, a_t) = Σ_{k=1..M} Q_k(o^k_t, a^k_t)    ...(2)
 During actual control, each trained agent selects, as its action at time t, the action that maximizes its Q_k. Each trained agent corresponds to the allocation unit 130, which executes the allocation process using the trained action value function.
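 A minimal Python sketch of this value decomposition and of the per-agent greedy selection is shown below; the callables and the data layout are assumptions made only for illustration, not the actual implementation of the embodiment.

# Minimal sketch of equation (2) and of per-agent greedy action selection.
# per_agent_q[k](o_k, a_k) is assumed to return the scalar value Q_k(o_k, a_k).
from typing import Callable, List, Sequence

def q_total(per_agent_q: List[Callable], observations: Sequence, actions: Sequence) -> float:
    # Equation (2): Q_tot is the sum of the per-agent action values.
    return sum(q_k(o_k, a_k)
               for q_k, o_k, a_k in zip(per_agent_q, observations, actions))

def greedy_actions(per_agent_q: List[Callable], observations: Sequence,
                   candidate_actions: Sequence) -> list:
    # During actual control each agent independently picks argmax over its candidates A_k.
    return [max(candidate_actions[k], key=lambda a: per_agent_q[k](observations[k], a))
            for k in range(len(per_agent_q))]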
 This is now described in more detail. Pre-learning consists of a phase that collects learning sample data composed of observations o_t := {o^k_t}, actions a_t := {a^k_t}, and rewards r_t, and a phase that learns the Q_k from the collected learning sample data. In the sample-collection phase, at step t each agent observes o^k_t and, based on its own Q_k, selects the action that maximizes Q_k:
    a^k_t = argmax_{a' ∈ A_k} Q_k(o^k_t, a')
 After all agents have determined their actions a^k_t at step t, the state s_t is updated based on the joint action a_t of all agents, and then r_t is computed. That is, in the sample-collection phase, the sample data are collected based on the Q_k without using Q_tot.
 In the learning phase, the Q_k are updated based on the collected sample data (o_j, a_j, r_j, o_{j+1}). First, Q_tot is computed from the Q_k. Then, based on a reinforcement learning algorithm, the Q_k are trained so that Q_tot is maximized; that is, the Q_k are updated. In the existing method, each Q_k is updated so that that Q_k itself is maximized, whereas in the present embodiment each Q_k is updated so that Q_tot is maximized, which allows the agents to select actions that are optimal for all agents.
 The present embodiment does not restrict the reinforcement learning algorithm; any learning algorithm can be applied. For example, the technique disclosed in reference [3] can be used as the reinforcement learning algorithm. An example of the learning algorithm when the technique of reference [3] is used is described below.
 Consider approximating Q_k with a neural network as follows.
    Q_k(o^k_t, a^k_t) ≈ Q_k(o^k_t, a^k_t; θ_k)    ...(3)
 In equation (3), θ_k denotes the hyperparameters of the k-th neural network, and Θ denotes the set of all θ_k. The loss function used for training the neural networks is defined by the following equation.
    L(Θ) = Σ_j ( r_j + γ max_{a'} Q_tot(o_{j+1}, a'; Θ) - Q_tot(o_j, a_j; Θ) )^2    ...(4)
 Here, γ is a hyperparameter. Q_k is learned by updating Θ with respect to this loss using the collected sample data (o_j, a_j, r_j, o_{j+1}) and stochastic gradient descent.
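 As one hedged example (assuming DQN-style learning in the spirit of reference [3] combined with the sum decomposition of equation (2)), the loss could be computed as in the following PyTorch sketch; the network interfaces, the batch layout, and the use of target networks are assumptions made only for illustration.

# Hedged sketch of a temporal-difference loss on Q_tot, combining the sum
# decomposition of equation (2) with DQN-style learning. Interfaces and the
# target networks are illustrative assumptions.
import torch
import torch.nn as nn

def q_tot_td_loss(q_nets, target_nets, batch, gamma: float = 0.99) -> torch.Tensor:
    # batch: per-agent observation tensors obs[k] / next_obs[k] of shape [B, obs_dim],
    # per-agent action indices acts[k] of shape [B], and a shared reward tensor of shape [B].
    obs, acts, rewards, next_obs = batch
    q_tot = torch.zeros_like(rewards)
    next_q_tot = torch.zeros_like(rewards)
    for k, (net, tgt) in enumerate(zip(q_nets, target_nets)):
        q_k = net(obs[k]).gather(1, acts[k].unsqueeze(1)).squeeze(1)
        q_tot = q_tot + q_k                                  # equation (2): sum over agents
        with torch.no_grad():
            next_q_tot = next_q_tot + tgt(next_obs[k]).max(dim=1).values
    td_target = rewards + gamma * next_q_tot                 # one-step TD target
    return nn.functional.mse_loss(q_tot, td_target)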
 During actual control, as in the sample-collection phase of pre-learning, each agent selects the action that maximizes its Q_k. However, because each Q_k has been trained so that Q_tot is maximized, the agents select coordinated actions that take the other agents into account.
 (Modeling)
 Next, the learning model of the reinforcement learning in the present embodiment is described. The learning model uses the state s_t, the action a_t, and the reward r_t. As mentioned above, each agent is assumed to observe only a part of the state. The state at time t is defined as s_t = [D_t, V_t, R^L_t, R^Z_t]. The observation of agent g_k at time t is defined as o^k_t = [d^k_t, v^k_t, R^L_t, R^Z_t]. Here, D_t and V_t are the traffic demand and the VM size, and R^L_t and R^Z_t are the residual bandwidth of each link and the residual capacity of each server. The residual bandwidth of each link is computed as r^L_{ij,t} = 1 - c^L_{ij} u^L_{ij,t}, and r^Z_{i,t} is computed in the same way. The set of |Z| VN allocation candidates is defined as A_k, and at each time one action a^k_t is selected from A_k. In this model, the route is assumed to be uniquely determined by the allocation-destination server, so a VN allocation is determined by the combination of a VM and its allocation-destination server.
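 For illustration only, the observation o^k_t could be assembled as in the following sketch; the array shapes, the normalization, and the helper name are assumptions, not the form used in the embodiment.

# Illustrative sketch of assembling an observation o^k_t from the quantities
# defined above. Shapes and normalization are assumptions.
import numpy as np

def build_observation(traffic_demand_k, vm_size_k, link_capacity, link_utilization, server_residual):
    # Residual link bandwidth as defined in the text: r^L_{ij,t} = 1 - c^L_{ij} u^L_{ij,t}.
    residual_bandwidth = 1.0 - np.asarray(link_capacity) * np.asarray(link_utilization)
    return np.concatenate([
        np.atleast_1d(traffic_demand_k),        # d^k_t
        np.atleast_1d(vm_size_k),               # v^k_t
        residual_bandwidth.ravel(),             # R^L_t
        np.asarray(server_residual).ravel(),    # R^Z_t
    ])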
 FIG. 7 shows the reward calculation procedure for g_k. This procedure is executed by the reward calculation unit 120 of the control device 100. Eff(x) on the first line is an efficiency function, defined as follows so that Eff(x) decreases as x increases.
    (Equation (5): piecewise definition of the efficiency function Eff(x))
 As shown in equation (5), in order to strongly avoid states close to a constraint violation (U^L_{t+1} or U^Z_{t+1} reaching 90% or more), Eff(x) decreases twice as fast when x is 0.9 or more. To avoid unnecessary VN reallocation (reallocation when U^L_{t+1} or U^Z_{t+1} is 20% or less), Eff(x) is constant when x is 0.2 or less. A hedged sketch of one such function is given below.
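 Since the exact expression of equation (5) is not reproduced above, the following piecewise-linear Python sketch is one assumed form with the stated properties (constant up to 0.2, decreasing in between, decreasing twice as fast from 0.9); the particular slopes and offsets are assumptions.

# Hedged sketch of an efficiency function Eff(x) with the properties described
# for equation (5). The concrete slopes and offsets are assumptions.
def eff(x: float) -> float:
    if x <= 0.2:
        return 0.8                       # constant region: no incentive to reallocate
    if x < 0.9:
        return 1.0 - x                   # ordinary region: linear decrease
    return 0.1 - 2.0 * (x - 0.9)         # near a violation: decrease twice as fast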
 In lines 2 to 4 of the reward calculation procedure shown in FIG. 7, a penalty is applied according to VN reallocation in order to suppress unnecessary VN relocation. In line 2, when the reward calculation unit 120 determines that a reallocation has occurred (that is, when Y_t and Y_{t+1} differ), it proceeds to line 3 and sets r_t to r_t - P(Y_t, Y_{t+1}). P(Y_t, Y_{t+1}) is a penalty function for suppressing VN relocation; it is set to a large value when relocation should be suppressed and to a small value when relocation is acceptable. Here, Y_t denotes the VM allocation state. A sketch of this procedure follows.
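 The following is a hedged sketch of the reward computation for g_k; how the link and server utilizations are combined and the constant penalty value are assumptions made only for illustration.

# Hedged sketch of the reward procedure of FIG. 7 for agent g_k. eff() is the
# efficiency function sketched above; the combination of the two utilizations
# and the constant penalty are illustrative assumptions.
def reward_for_agent(max_link_util: float, max_server_util: float,
                     placement_before, placement_after, penalty: float = 0.5) -> float:
    r = eff(max_link_util) + eff(max_server_util)    # line 1: efficiency of the new state
    if placement_after != placement_before:          # lines 2-4: a reallocation occurred
        r -= penalty                                 # subtract P(Y_t, Y_{t+1})
    return r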
 The operations of pre-learning and of actual control are each described in detail below.
 (Pre-learning operation)
 FIG. 8 shows the pre-learning procedure (pre-learning algorithm) of the cooperative multi-agent reinforcement learning (Coop-MARL) executed by the pre-learning unit 110.
 A sequence of actions over E time steps is called an episode, and episodes are executed repeatedly until learning is complete. At the beginning of each episode (lines 2 to 18), the pre-learning unit 110 generates candidates of learning traffic demand and VM demand for E steps for all VNs.
 The pre-learning unit 110 then repeatedly executes the sequence of steps in lines 5 to 16.
 In lines 6 to 11, a learning sample (a tuple of state o^k_t, action a^k_t, reward r_t, and next state o^k_{t+1}) is generated for each agent and stored in the Replay Memory M. Generating a learning sample involves, for each agent, selecting an action according to the current observation o^k_t and Q_k, updating the state (relocating VNs) based on the action a^k_t, and computing the reward r_t for the updated observation.
 Lines 12 to 14 specify the episode termination condition; in this model, r_t = -1 terminates the episode. In line 17, the learning samples of episodes 0 to E are stored in the Replay Memory M. In line 18, learning samples are drawn at random from the Replay Memory M and the agents are trained. In training the agents, as described above, each Q_k is learned so that Q_tot is maximized, based on the reinforcement learning algorithm; a schematic loop is sketched below.
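 The overall flow of FIG. 8 can be summarized by the following schematic Python loop; the environment interface (reset/step), the agent interface (act), and the train_step callable are placeholders assumed for illustration, not the actual implementation of the embodiment.

# Schematic sketch of the pre-learning procedure of FIG. 8. env, agents and
# train_step are assumed placeholder interfaces, not the embodiment itself.
import random
from collections import deque

def pretrain(env, agents, train_step, num_episodes: int, steps_per_episode: int,
             batch_size: int = 32):
    replay = deque(maxlen=100_000)                            # Replay Memory M
    for _ in range(num_episodes):
        obs = env.reset()                                     # generate demand candidates / initial state
        for _ in range(steps_per_episode):
            acts = [agent.act(o) for agent, o in zip(agents, obs)]  # greedy w.r.t. each Q_k
            next_obs, reward, done = env.step(acts)           # reallocate VNs, compute reward
            replay.append((obs, acts, reward, next_obs))
            obs = next_obs
            if done or reward == -1:                          # episode termination condition
                break
        if len(replay) >= batch_size:
            batch = random.sample(replay, batch_size)         # random samples from Replay Memory M
            train_step(agents, batch)                         # update each Q_k so that Q_tot increases
    return agents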
 As described above, the present embodiment does not restrict the reinforcement learning algorithm to a specific one; any learning algorithm can be applied. For example, the technique of reference [3] can be used as the reinforcement learning algorithm.
 An operation example of the pre-learning unit 110 based on the reward calculation procedure described above is explained with reference to the flowchart of FIG. 9.
 Note that the observation of states and the actions (allocation of VNs to physical resources) during pre-learning may be performed on the actual physical network 200 or on a model equivalent to the actual physical network 200.
 In S101, the pre-learning unit 110 generates candidates of learning traffic demand and VM demand for E steps and stores them in the data storage unit 140.
 S101 to S106 are executed while t is smaller than T. S103 to S105 are repeated for each of episodes 0 to E.
 In S102, the pre-learning unit 110 (each agent) observes o^k_t and, based on its own Q_k, selects the action a^k_t that maximizes Q_k. That is, each agent selects the allocation-destination server of the VN assigned to it so that the value of its action value function (Q_k) is maximized.
 In S103, the pre-learning unit 110 (each agent) applies the selected action (VN allocation) to the physical network 200 and updates the observation o_t to obtain o_{t+1}. In S104, the reward calculation unit 120 computes the reward r_t for each agent by the calculation method described above.
 In S105, the pre-learning unit 110 stores the collected learning sample data (o_j, a_j, r_j) in the Replay Memory M (data storage unit 140). In S106, the pre-learning unit 110 randomly selects learning samples from the Replay Memory M (data storage unit 140) and trains each agent so that Q_tot is maximized, that is, it learns the action value function Q_k of each agent.
 (Actual control operation)
 FIG. 10 shows the dynamic VN allocation procedure by cooperative multi-agent reinforcement learning (Coop-MADRL) executed by the allocation unit 130 of the control device 100.
 Line 1 of FIG. 10 represents the pre-learning of g_k, which is as described above. After that, lines 2 to 8 are repeated at each time. At a given time, line 4 observes the state o^k_t. Line 5 selects the action a^k_t that maximizes Q_k(o^k_t, a') based on the state o^k_t. Lines 4 and 5 are executed for each agent. Line 7 updates the VN allocation.
 An operation example of the allocation unit 130 based on the actual control procedure described above is explained with reference to the flowchart of FIG. 11. S201 to S203 are executed at each time step.
 In S201, the allocation unit 130 observes (acquires) the state o^k_t (VM demand, traffic demand, residual link capacity, residual server capacity) at time t for each agent. Specifically, for example, the VM demand and the traffic demand are received from each user (user terminal or the like), and the residual link capacity and the residual server capacity are acquired from the physical network 200 (or from an operation system that monitors the physical network 200). The VM demand and the traffic demand may also be values obtained by demand forecasting.
 In S202, the allocation unit 130 selects, for each agent, the action a^k_t that maximizes Q_k(o^k_t, a'). That is, the allocation unit 130 selects the allocation-destination server of the VM in each agent's VN so that Q_k(o^k_t, a') is maximized.
 In S203, the allocation unit 130 updates the state. Specifically, for the VN of each agent, the allocation unit 130 configures the physical network 200 so that the VM is allocated to its allocation-destination server, and sets the routes in the physical network 200 so that the traffic corresponding to the demand flows over the correct route (set of links).
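 For reference, S201 to S203 can be summarized by the following schematic loop; observe_state, apply_allocation, and the agent interface are placeholders standing in for the allocation unit 130 and the physical network 200, assumed here only for illustration.

# Schematic sketch of the actual control loop (FIG. 11, S201-S203). The
# observe_state and apply_allocation callables are assumed placeholders.
def control_loop(agents, observe_state, apply_allocation, num_steps: int):
    for t in range(num_steps):
        obs = observe_state(t)                                   # S201: demands and residual capacities
        acts = [agent.act(o) for agent, o in zip(agents, obs)]   # S202: argmax of each learned Q_k
        apply_allocation(acts)                                   # S203: set VM placement and routes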
 (Effects of the embodiment)
 As described above, the present embodiment provides a mechanism for learning, for each agent, the actions that maximize the action value Q_tot for all agents. By learning cooperative control among the agents, the problem of reduced allocation accuracy caused by allocation conflicts between agents is resolved and the allocation accuracy can be improved.
 (Summary of embodiments)
 This specification discloses at least the control device, the virtual network allocation method, and the program described in the following items.
 (Section 1)
 A control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the control device comprising:
 a pre-learning unit that divides the demand of the virtual network into a plurality of groups, assigns an agent to each group, and learns an individual action value function, which is the action value function for an individual agent, so that an overall action value function, which is the action value function for all agents, is maximized; and
 an allocation unit that allocates the virtual network to the physical network using the individual action value function learned for each agent by the pre-learning unit.
 (Section 2)
 The control device according to Section 1, wherein the pre-learning unit uses the sum of the individual action value functions over the agents as the overall action value function.
 (Section 3)
 The control device according to Section 1 or 2, wherein the allocation unit selects the action that allocates the virtual network to the physical network so that the learned individual action value function of each agent is maximized.
 (Section 4)
 A virtual network allocation method executed by a control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the method comprising:
 a pre-learning step of dividing the demand of the virtual network into a plurality of groups, assigning an agent to each group, and learning an individual action value function, which is the action value function for an individual agent, so that an overall action value function, which is the action value function for all agents, is maximized; and
 an allocation step of allocating the virtual network to the physical network using the individual action value function learned for each agent in the pre-learning step.
 (Section 5)
 A program for causing a computer to function as each unit of the control device according to any one of Sections 1 to 3.
 Although the present embodiment has been described above, the present invention is not limited to this specific embodiment, and various modifications and changes are possible within the scope of the gist of the present invention described in the claims.
 [References]
[1] 鈴木晃人, 安孫子悠, 原田薫明, "深層強化学習による動的仮想ネットワーク割当手法の検討 (Study of a dynamic virtual network allocation method based on deep reinforcement learning)," IEICE General Conference, B-7-48, 2019. (corresponds to Non-Patent Document 1)
[2] 鈴木晃人, 原田薫明, "マルチエージェント深層強化学習による動的仮想リソース割当手法 (Dynamic virtual resource allocation method using multi-agent deep reinforcement learning)," IEICE Technical Report, vol. 119, no. 195, IN2019-29, pp. 35-40, Sep. 2019. (corresponds to Non-Patent Document 2)
[3] V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[4] P. Sunehag et al., "Value-decomposition networks for cooperative multi-agent learning based on team reward," in Proc. AAMAS, 2018, pp. 2085-2087.
100 Control device
110 Pre-learning unit
120 Reward calculation unit
130 Allocation unit
140 Data storage unit
200 Physical network
300 Physical node
400 Physical link
1000 Drive device
1001 Recording medium
1002 Auxiliary storage device
1003 Memory device
1004 CPU
1005 Interface device
1006 Display device
1007 Input device

Claims (5)

  1.  A control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the control device comprising:
     a pre-learning unit that divides the demand of the virtual network into a plurality of groups, assigns an agent to each group, and learns an individual action value function, which is the action value function for an individual agent, so that an overall action value function, which is the action value function for all agents, is maximized; and
     an allocation unit that allocates the virtual network to the physical network using the individual action value function learned for each agent by the pre-learning unit.
  2.  The control device according to claim 1, wherein the pre-learning unit uses the sum of the individual action value functions over the agents as the overall action value function.
  3.  The control device according to claim 1 or 2, wherein the allocation unit selects the action that allocates the virtual network to the physical network so that the learned individual action value function of each agent is maximized.
  4.  A virtual network allocation method executed by a control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning, the method comprising:
     a pre-learning step of dividing the demand of the virtual network into a plurality of groups, assigning an agent to each group, and learning an individual action value function, which is the action value function for an individual agent, so that an overall action value function, which is the action value function for all agents, is maximized; and
     an allocation step of allocating the virtual network to the physical network using the individual action value function learned for each agent in the pre-learning step.
  5.  A program for causing a computer to function as each unit of the control device according to any one of claims 1 to 3.
PCT/JP2020/049028 2020-12-25 2020-12-25 Control device, virtual network allocation method, and program WO2022137574A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/JP2020/049028 WO2022137574A1 (en) 2020-12-25 2020-12-25 Control device, virtual network allocation method, and program
JP2022571008A JPWO2022137574A1 (en) 2020-12-25 2020-12-25

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2020/049028 WO2022137574A1 (en) 2020-12-25 2020-12-25 Control device, virtual network allocation method, and program

Publications (1)

Publication Number Publication Date
WO2022137574A1 true WO2022137574A1 (en) 2022-06-30

Family

ID=82157999

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2020/049028 WO2022137574A1 (en) 2020-12-25 2020-12-25 Control device, virtual network allocation method, and program

Country Status (2)

Country Link
JP (1) JPWO2022137574A1 (en)
WO (1) WO2022137574A1 (en)

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GUESTRIN, CARLOS ET AL.: "Coordinated Reinforcement Learning", AAAI, 2002, pages 98 - 105, XP055952821, Retrieved from the Internet <URL:https://www.aaai.org/Papers/Symposia/Spring/2002/SS-02-02/SS02-02-014.pdf> [retrieved on 20210406] *
SUZUKI, AKITO ET AL.: "Dynamic Virtual Resource Allocation Method Using Multi-agent Deep Reinforcement Learning", IEICE TECHNICAL REPORT, vol. 119, no. 195, 29 August 2019 (2019-08-29), pages 35 - 40, XP009534137, ISSN: 2432-6380 *

Also Published As

Publication number Publication date
JPWO2022137574A1 (en) 2022-06-30

Similar Documents

Publication Publication Date Title
Tanha et al. A hybrid meta-heuristic task scheduling algorithm based on genetic and thermodynamic simulated annealing algorithms in cloud computing environments
Chen et al. Task scheduling of cloud computing using integrated particle swarm algorithm and ant colony algorithm
JP4119239B2 (en) Computer resource allocation method, resource management server and computer system for executing the method
WO2022018798A1 (en) Control device, virtual network allocation method, and program
WO2020162211A1 (en) Control device, control method and program
CN110351348B (en) Cloud computing resource scheduling optimization method based on DQN
JPWO2007023726A1 (en) Information processing system
CN106371924B (en) A kind of method for scheduling task minimizing MapReduce cluster energy consumption
CN113822456A (en) Service combination optimization deployment method based on deep reinforcement learning in cloud and mist mixed environment
CN113742089B (en) Method, device and equipment for distributing neural network computing tasks in heterogeneous resources
Bousselmi et al. QoS-aware scheduling of workflows in cloud computing environments
CN113641445B (en) Cloud resource self-adaptive configuration method and system based on depth deterministic strategy
JP2009528649A (en) Improvements on distributed computing
CN111309472A (en) Online virtual resource allocation method based on virtual machine pre-deployment
Sinclair et al. Hindsight learning for mdps with exogenous inputs
CN115580882A (en) Dynamic network slice resource allocation method and device, storage medium and electronic equipment
CN114675953A (en) Resource dynamic scheduling method, device, equipment and computer readable storage medium
CN114090239A (en) Model-based reinforcement learning edge resource scheduling method and device
JP6721921B2 (en) Equipment design device, equipment design method, and program
CN114205317A (en) Service function chain SFC resource allocation method based on SDN and NFV and electronic equipment
WO2022137574A1 (en) Control device, virtual network allocation method, and program
Baheri Mars: Multi-scalable actor-critic reinforcement learning scheduler
Lin et al. Joint deadline-constrained and influence-aware design for allocating MapReduce jobs in cloud computing systems
Cui et al. Cloud workflow scheduling algorithm based on reinforcement learning
JP6732693B2 (en) Resource allocation control system, resource allocation control method, and program

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2022571008

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20967059

Country of ref document: EP

Kind code of ref document: A1