WO2022137574A1

WO2022137574A1 - Control device, virtual network allocation method, and program

Info

Publication number: WO2022137574A1
Application number: PCT/JP2020/049028
Authority: WO
Inventors: 晃人鈴木; 薫明原田
Original assignee: 日本電信電話株式会社
Priority date: 2020-12-25
Filing date: 2020-12-25
Publication date: 2022-06-30
Also published as: JPWO2022137574A1

Abstract

Provided is a control device for allocating a virtual network to a physical network having a link and a server through cooperative multi-Agent reinforcement learning, wherein the control device comprises: a prior learning unit which divides demands of the virtual network into a plurality of groups, allocates an agent to each group, and learns an individual action value function that is an action value function for each agent so that the total action value function, which is an action value function of all agents, becomes maximized; and an allocation unit which allocates the virtual network to the physical network by using the individual action value function for each of the agents learnt by the prior learning unit.

Description

制御装置、仮想ネットワーク割当方法、及びプログラムControllers, virtual network allocation methods, and programs

　本発明は、仮想ネットワークを物理ネットワークに割り当てる技術に関連するものである。 The present invention relates to a technique for allocating a virtual network to a physical network.

　ＮＦＶ（Ｎｅｔｗｏｒｋ　Ｆｕｎｃｔｉｏｎ　Ｖｉｒｔｕａｌｉｚａｔｉｏｎ）の発展に伴い、仮想ネットワーク機能（Ｖｉｒｔｕａｌ　Ｎｅｔｗｏｒｋ　Ｆｕｎｃｔｉｏｎ；ＶＮＦ）を汎用的な物理リソース上で実行することが可能になった。ＮＦＶにより、複数のＶＮＦ間で物理リソースを共有することで、リソース利用効率の向上が期待できる。 With the development of NFV (Network Enhancement Virtualization), it has become possible to execute virtual network functions (Virtual Network Virtualization; VNF) on general-purpose physical resources. By sharing physical resources among a plurality of VNFs by NFV, improvement in resource utilization efficiency can be expected.

　物理リソースの例として、リンク帯域などのネットワークリソース、ＣＰＵやＨＤＤ容量などのサーバリソースが挙げられる。高品質なネットワークサービスを低コストに提供するためには、物理リソースへの最適な仮想ネットワーク（Ｖｉｒｔｕａｌ　Ｎｅｔｗｏｒｋ；ＶＮ）割当が必要となる。 Examples of physical resources include network resources such as link bandwidth and server resources such as CPU and HDD capacity. In order to provide high-quality network services at low cost, it is necessary to allocate optimal virtual networks (VN) to physical resources.

　ＶＮ割当とは、仮想リンクと仮想ノードから構成されるＶＮを物理リソースに割り当てることを指す。仮想リンクは、ＶＮＦ間の要求帯域や要求遅延、ＶＮＦやユーザ間の接続関係などのネットワークリソース需要を表す。仮想ノードは、ＶＮＦを実行するために必要なＣＰＵ数や必要メモリ量などのサーバリソース需要を表す。また、最適な割当とは、サービス要求やリソース容量などの制約条件を満たしつつ、リソース利用効率などの目的関数の値を最大化する割当を指す。 VN allocation refers to allocating a VN consisting of a virtual link and a virtual node to a physical resource. The virtual link represents the demand for network resources such as the required bandwidth and required delay between VNFs, and the connection relationship between VNFs and users. The virtual node represents the demand for server resources such as the number of CPUs required to execute VNF and the amount of memory required. Optimal allocation refers to allocation that maximizes the value of the objective function such as resource utilization efficiency while satisfying constraints such as service requirements and resource capacity.

　近年、高画質の動画配信やＯＳアップデート等により、トラヒックやサーバのリソース需要変動が激化している。一定期間内の最大値で需要量を見積もり、割当を時間変化させない静的ＶＮ割当では、リソースの利用効率が低下してしまうことから、リソースの需要変動に追従した動的ＶＮ割当手法が求められている。 In recent years, due to high-quality video distribution and OS updates, traffic and server resource demand fluctuations have intensified. In static VN allocation, which estimates the amount of demand based on the maximum value within a certain period and does not change the allocation over time, the efficiency of resource utilization decreases. Therefore, a dynamic VN allocation method that follows fluctuations in resource demand is required. ing.

　動的ＶＮ割当手法とは、時間変化するＶＮ需要に対して最適ＶＮ割当を求める手法である。動的ＶＮ割当手法では、割当の最適性と即時性を同時に満たす必要がある。しかし、両者はトレードオフの関係にあり、割当の最適性と即時性を同時に満たすことが困難であるという課題がある。割当結果の精度を増加させるためには、計算時間を増加させる必要がある。しかし、計算時間の増加は割当周期の増加に直結し、結果として割当の即時性を減少させてしまう。同様に、需要変動に対して即時に対応するためには、割当周期を減らす必要がある。しかし、割当周期の削減は計算時間の減少に直結し、結果として割当の最適性を減少させてしまう。上記より、割当の最適性と即時性を同時に満たすことは困難である。 The dynamic VN allocation method is a method for obtaining the optimum VN allocation for the time-varying VN demand. In the dynamic VN allocation method, it is necessary to satisfy the optimality and immediacy of allocation at the same time. However, there is a trade-off relationship between the two, and there is a problem that it is difficult to satisfy the optimality and immediacy of allocation at the same time. In order to increase the accuracy of the allocation result, it is necessary to increase the calculation time. However, an increase in calculation time is directly linked to an increase in the allocation cycle, and as a result, the immediacy of allocation is reduced. Similarly, it is necessary to reduce the allocation cycle in order to respond immediately to fluctuations in demand. However, the reduction of the allocation cycle directly leads to the reduction of the calculation time, and as a result, the optimality of the allocation is reduced. From the above, it is difficult to satisfy the optimality and immediacy of allocation at the same time.

　動的ＶＮ割当手法の課題を解決する手段として、強化学習による動的ＶＮ割当手法が提案されている（非特許文献１，２）。強化学習（Ｒｅｉｎｆｏｒｃｅｍｅｎｔ　Ｌｅａｒｎｉｎｇ；ＲＬ）は、将来に渡って得られる報酬の和（累積報酬）が最も多く得られる行動を学習する手法である。非特許文献１，２に開示された既存手法では、ＲＬによりネットワーク状態と最適な割当の関係を事前に学習し、各時刻での最適化計算を不要とすることで、割当の最適性と即時性を同時に実現することができる。 As a means for solving the problem of the dynamic VN allocation method, a dynamic VN allocation method by reinforcement learning has been proposed (Non-Patent Documents 1 and 2). Reinforcement learning (RL) is a method of learning the behavior in which the sum of rewards (cumulative rewards) obtained in the future is the largest. In the existing method disclosed in

Non-Patent Documents

1 and 2, the relationship between the network state and the optimum allocation is learned in advance by RL, and the optimization calculation at each time is not required. Sex can be realized at the same time.

　最適なＶＮ割当を求める問題は組み合わせ最適化問題であり、ＶＮ割当の解の候補数（すなわち行動数）はサーバ数やＶＮ数に対して指数的に増加する。ＲＬにより最適なＶＮ割当を求めるためには、行動数に応じた事前学習が必要である。非特許文献１の技術では、サーバ数やＶＮ数の増加に対して事前学習のコストが指数的に増加するため、スケール性を増加する仕組みが必要である。そのため、非特許文献２の技術では、非特許文献１の技術にマルチエージェント化によるＶＮ数に対するスケール性を向上させる仕組みを導入している。非特許文献２の技術では、ＶＮ需要をグループ化し、グループ毎にＲＬのエージェントを用意し、各エージェントの学習コストを削減することで、ＶＮ数に対するスケール性を向上させている。 The problem of finding the optimum VN allocation is a combination optimization problem, and the number of VN allocation solution candidates (that is, the number of actions) increases exponentially with respect to the number of servers and the number of VNs. In order to obtain the optimum VN allocation by RL, prior learning according to the number of actions is required. In the technique of Non-Patent Document 1, the cost of pre-learning increases exponentially with the increase in the number of servers and the number of VNs, so a mechanism for increasing the scaleability is required. Therefore, in the technique of Non-Patent Document 2, a mechanism for improving the scaleability with respect to the number of VNs by multi-agent is introduced into the technique of Non-Patent Document 1. In the technique of Non-Patent Document 2, VN demand is grouped, RL agents are prepared for each group, and the learning cost of each agent is reduced to improve the scaleability with respect to the number of VNs.

　非特許文献２に開示された技術では、複数のエージェントが同時に行動しエージェント間の割当の競合が発生することによる割当精度の低下を防止するため、各時刻で行動可能なエージェントを１つに限定している。エージェント間の割当の競合の例として、複数のエージェントが同時にリソース利用率の低いサーバに仮想マシン（Ｖｉｒｔｕａｌ　Ｍａｃｈｉｎｅ；ＶＭ）を移動しようとして、サーバが過負荷になることを指す。このような各時刻での行動の制限は、割当の競合を防ぐことができる一方で、ＶＮ割当の精度の低下につながる。 In the technique disclosed in Non-Patent Document 2, the number of agents that can act at each time is limited to one in order to prevent a decrease in allocation accuracy due to the simultaneous action of a plurality of agents and the occurrence of allocation conflict among the agents. is doing. As an example of allocation conflict between agents, it means that multiple agents try to move a virtual machine (VM) to a server with low resource utilization at the same time, and the server becomes overloaded. Such restriction of actions at each time can prevent allocation conflicts, but leads to a decrease in the accuracy of VN allocation.

　本発明は上記の点に鑑みてなされたものであり、マルチエージェント型強化学習を用いて仮想ネットワークを物理ネットワークに割り当てる技術において、エージェント間の割当の競合による割当精度の低下を回避して、割当精度を向上させるための技術を提供することを目的とする。 The present invention has been made in view of the above points, and in a technique for allocating a virtual network to a physical network using multi-agent reinforcement learning, it is possible to avoid deterioration of allocation accuracy due to allocation conflict between agents and allocate. The purpose is to provide a technique for improving accuracy.

　開示の技術によれば、協調型マルチエージェント強化学習により、仮想ネットワークをリンクとサーバを有する物理ネットワークに割り当てるための制御装置であって、
　仮想ネットワークの需要を複数グループに分割し、グループ毎にエージェントを割り当て、個々のエージェントに対する行動価値関数である個別行動価値関数を、全エージェントに対する行動価値関数である全体行動価値関数が最大になるように学習する事前学習部と、
　前記事前学習部により学習されたエージェント毎の個別行動価値関数を用いて、仮想ネットワークを前記物理ネットワークに割り当てる割当部と
　を備える制御装置が提供される。 According to the disclosed technique, it is a control device for assigning a virtual network to a physical network having a link and a server by cooperative multi-agent reinforcement learning.
Divide the demand of the virtual network into multiple groups, assign agents to each group, and maximize the individual action value function, which is the action value function for each agent, and the overall action value function, which is the action value function for all agents. Pre-learning department to learn from
A control device including an allocation unit that allocates a virtual network to the physical network is provided by using an individual action value function for each agent learned by the pre-learning unit.

　開示の技術によれば、マルチエージェント型強化学習を用いて仮想ネットワークを物理ネットワークに割り当てる技術において、エージェント間の割当の競合による割当精度の低下を回避して、割当精度を向上させるための技術が提供される。 According to the disclosed technique, in the technique of allocating a virtual network to a physical network using multi-agent reinforcement learning, there is a technique for improving the allocation accuracy by avoiding a decrease in allocation accuracy due to allocation conflict between agents. Provided.

本発明の実施の形態におけるシステム構成図である。It is a system block diagram in embodiment of this invention. 制御装置の機能構成図である。It is a functional block diagram of a control device. 制御装置のハードウェアの構成図である。It is a block diagram of the hardware of a control device. 変数の定義を示す図である。It is a figure which shows the definition of a variable. 変数の定義を示す図である。It is a figure which shows the definition of a variable. 制御装置の全体動作を示すフローチャートである。It is a flowchart which shows the whole operation of a control device. 報酬計算手順を示す図である。It is a figure which shows the reward calculation procedure. 事前学習手順を示す図である。It is a figure which shows the pre-learning procedure. 制御装置の事前学習動作を示すフローチャートである。It is a flowchart which shows the pre-learning operation of a control device. 割当手順を示す図である。It is a figure which shows the allocation procedure. 制御装置の割当動作を示すフローチャートである。It is a flowchart which shows the allocation operation of a control device.

　以下、図面を参照して本発明の実施の形態（本実施の形態）を説明する。以下で説明する実施の形態は一例に過ぎず、本発明が適用される実施の形態は、以下の実施の形態に限られるわけではない。以下の実施の形態の説明で参照する参考文献については"［１］"などとして示している。参考文献名については、明細書の最後にまとめて示した。 Hereinafter, an embodiment of the present invention (the present embodiment) will be described with reference to the drawings. The embodiments described below are merely examples, and the embodiments to which the present invention is applied are not limited to the following embodiments. References referred to in the description of the following embodiments are shown as "[1]" and the like. Reference names are summarized at the end of the specification.

　（実施の形態の概要）
　本実施の形態では、前述した課題を解決するために、協調型マルチエージェント強化学習（Ｃｏｏｐｅｒａｔｉｖｅ　ｍｕｌｔｉ－ａｇｅｎｔ　ＲＬ；Ｃｏｏｐ－ＭＡＲＬ）による動的ＶＮ割当手法を用いている。本手法では、参考文献［２］のマルチエージェント強化学習（ＭＡＲＬ）による動的ＶＮ割当手法に、複数のエージェントが協調（協力）して最適なＶＮ割当を学習する仕組みを導入している。これにより、エージェント間の割当の競合による割当精度の低下の問題を解決し、割当精度を向上させることができる。 (Outline of embodiment)
In this embodiment, in order to solve the above-mentioned problems, a dynamic VN allocation method by cooperative multi-agent reinforcement learning (Coop-MALL) is used. In this method, a mechanism is introduced in which a plurality of agents cooperate (cooperate) to learn the optimum VN allocation to the dynamic VN allocation method by the multi-agent reinforcement learning (MALL) of the reference [2]. As a result, it is possible to solve the problem of deterioration of allocation accuracy due to allocation conflict between agents and improve allocation accuracy.

　より具体的には、本実施の形態に係る技術では、参考文献［１，２］と同様に、各時刻のＶＮ需要及び物理ネットワークの利用量を状態と定義し、ネットワーク経路やＶＮ割当の変更を行動と定義し、目的関数や制約条件に応じた報酬設計を行うことで、最適なＶＮ割当方法を学習する。 More specifically, in the technique according to the present embodiment, as in References [1, 2], the VN demand and the usage amount of the physical network at each time are defined as the states, and the network route and the VN allocation are changed. Is defined as an action, and the optimal VN allocation method is learned by designing the reward according to the objective function and constraints.

　また、本実施の形態に係る技術では、参考文献［２］と同様に、マルチエージェント化によりネットワーク全体のＶＮ割当問題を複数のサブ問題に分割する。すなわち、ＶＮ需要を複数のグループに分割し、グループ毎にエージェントを用意することで、ＶＮ数に対するスケール性を向上させる。各エージェントが、各グループに含まれるＶＮに関する最適なＶＮ割当を事前に学習し、実制御時には学習結果に基づいてエージェントが即時に最適なＶＮ割当を判断することで、最適性と即時性を同時に実現する。参考文献［２］では、各エージェントが各エージェントにとって最適な行動を独立に学習していたのに対して、本実施の形態に係る技術では、各エージェントが全エージェントにとって最適な協調行動を学習する。 Further, in the technique according to the present embodiment, as in Reference [2], the VN allocation problem of the entire network is divided into a plurality of sub-problems by multi-agentization. That is, by dividing the VN demand into a plurality of groups and preparing an agent for each group, the scaleability with respect to the number of VNs is improved. Each agent learns the optimum VN allocation for the VN included in each group in advance, and at the time of actual control, the agent immediately determines the optimum VN allocation based on the learning result, so that the optimum and immediacy can be achieved at the same time. Realize. In reference [2], each agent independently learns the optimum behavior for each agent, whereas in the technique according to the present embodiment, each agent learns the optimum cooperative behavior for all agents. ..

　（システム構成）
　図１に、本実施の形態におけるシステムの構成例を示す。図１に示すように、本システムは、制御装置１００と物理ネットワーク２００を有する。制御装置１００は、協調型マルチエージェント強化学習による動的ＶＮ割当を実行する装置である。物理ネットワーク２００は、ＶＮの割当対象である物理リソースを有するネットワークである。制御装置１００は、制御ネットワーク等により物理ネットワーク２００と接続されており、物理ネットワーク２００を構成する装置から状態情報を取得したり、物理ネットワーク２００を構成する装置に対して設定命令を送信したりすることができる。また、制御装置１００は、ユーザの管理端末などからユーザのＶＮ需要を受信することもできる。 (System configuration)
FIG. 1 shows a configuration example of the system according to the present embodiment. As shown in FIG. 1, the system has a control device 100 and a physical network 200. The control device 100 is a device that executes dynamic VN allocation by cooperative multi-agent reinforcement learning. The physical network 200 is a network having physical resources to which the VN is allocated. The control device 100 is connected to the physical network 200 by a control network or the like, acquires state information from the devices constituting the physical network 200, and transmits a setting command to the devices constituting the physical network 200. be able to. Further, the control device 100 can also receive the user's VN demand from the user's management terminal or the like.

　物理ネットワーク２００は、複数の物理ノード３００と、物理ノード３００間を接続する複数の物理リンク４００を有する。物理ノード３００には物理サーバが接続されている。また、物理ノード３００にはユーザ（ユーザ端末あるいはユーザネットワーク等）が接続されている。なお、物理ノード３００に物理サーバが存在し、物理ノード３００にユーザが存在すると言い換えてもよい。 The physical network 200 has a plurality of physical nodes 300 and a plurality of physical links 400 connecting the physical nodes 300. A physical server is connected to the physical node 300. Further, a user (user terminal, user network, etc.) is connected to the physical node 300. In addition, it may be paraphrased that the physical server exists in the physical node 300 and the user exists in the physical node 300.

　例えば、ある物理ノード３００に存在するユーザとＶＭとの間で通信を行うＶＮを物理リソースに割り当てる際には、当該ＶＭの割当先の物理サーバ、及び、当該ユーザ（物理ノード）と当該割当先の物理サーバとの間の経路（物理リンクの集合）が決定され、決定された構成に基づく物理ネットワーク２００への設定がなされる。なお、物理サーバを単に「サーバ」と呼び、物理リンクを単に「リンク」と呼んでもよい。 For example, when allocating a VN that communicates between a user existing in a certain physical node 300 and a VM to a physical resource, the physical server to which the VM is assigned, and the user (physical node) and the allocation destination A route (a set of physical links) to and from a physical server is determined, and settings are made to the physical network 200 based on the determined configuration. The physical server may be simply called a "server", and the physical link may be simply called a "link".

　図２に、制御装置１００の機能構成例を示す。図２に示すとおり、制御装置１００は、事前学習部１１０、報酬計算部１２０、割当部１３０、データ格納部１４０を有する。なお、報酬計算部１２０は事前学習部１１０の中に含まれることとしてもよい。また、「事前学習部１１０、報酬計算部１２０」と「割当部１３０」が別々の装置（プログラムで動作するコンピュータ等）に備えられていてもよい。各部の機能概要は下記のとおりである。 FIG. 2 shows an example of the functional configuration of the control device 100. As shown in FIG. 2, the control device 100 includes a pre-learning unit 110, a reward calculation unit 120, an allocation unit 130, and a data storage unit 140. The reward calculation unit 120 may be included in the pre-learning unit 110. Further, the "pre-learning unit 110, the reward calculation unit 120" and the "allocation unit 130" may be provided in separate devices (computers operating by the program, etc.). The outline of the functions of each part is as follows.

　事前学習部１１０は、報酬計算部１２０で計算された報酬を用いて行動価値関数（後述するように、各エージェントのＱ_ｋと全体エージェントのＱ_ｔｏｔ）の事前学習を行う。報酬計算部１２０は、報酬を計算する。割当部１３０は、事前学習部１１０で学習された行動価値関数Ｑ_ｋを用いて、ＶＮの物理リソースへの割当を実行する。データ格納部１４０は、Ｒｅｐｌａｙ　Ｍｅｍｏｒｙの機能を持つとともに、計算に必要なパラメータ等を格納している。なお、事前学習部１１０は、強化学習の学習モデルにおけるエージェントを含む。「エージェントを学習する」ことは、事前学習部１１０が行動価値関数を学習することに相当する。各部の詳細な動作については後述する。 The pre-learning unit 110 performs pre-learning of the action value function (Q _k of each agent and Q _tot of all agents as described later) using the reward calculated by the reward calculation unit 120. The reward calculation unit 120 calculates the reward. The allocation unit 130 executes the allocation of the VN to the physical resource by using the action value function _Qk learned by the pre-learning unit 110. The data storage unit 140 has a Play Memory function and stores parameters and the like necessary for calculation. The pre-learning unit 110 includes an agent in the learning model of reinforcement learning. "Learning the agent" corresponds to the pre-learning unit 110 learning the action value function. The detailed operation of each part will be described later.

　＜ハードウェア構成例＞
　制御装置１００は、例えば、コンピュータにプログラムを実行させることにより実現できる。このコンピュータは、物理的なコンピュータであってもよいし、仮想マシンであってもよい。 <Hardware configuration example>
The control device 100 can be realized, for example, by causing a computer to execute a program. This computer may be a physical computer or a virtual machine.

　すなわち、制御装置１００は、コンピュータに内蔵されるＣＰＵやメモリ等のハードウェア資源を用いて、当該制御装置１００で実施される処理に対応するプログラムを実行することによって実現することが可能である。上記プログラムは、コンピュータが読み取り可能な記録媒体（可搬メモリ等）に記録して、保存したり、配布したりすることが可能である。また、上記プログラムをインターネットや電子メール等、ネットワークを通して提供することも可能である。 That is, the control device 100 can be realized by executing a program corresponding to the processing executed by the control device 100 using hardware resources such as a CPU and a memory built in the computer. The above program can be recorded on a computer-readable recording medium (portable memory, etc.), stored, and distributed. It is also possible to provide the above program through a network such as the Internet or e-mail.

　図３は、上記コンピュータのハードウェア構成例を示す図である。図３のコンピュータは、それぞれバスＢで相互に接続されているドライブ装置１０００、補助記憶装置１００２、メモリ装置１００３、ＣＰＵ１００４、インタフェース装置１００５、表示装置１００６、及び入力装置１００７等を有する。 FIG. 3 is a diagram showing an example of the hardware configuration of the above computer. The computer of FIG. 3 has a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, and the like, which are connected to each other by a bus B, respectively.

　当該コンピュータでの処理を実現するプログラムは、例えば、ＣＤ－ＲＯＭ又はメモリカード等の記録媒体１００１によって提供される。プログラムを記憶した記録媒体１００１がドライブ装置１０００にセットされると、プログラムが記録媒体１００１からドライブ装置１０００を介して補助記憶装置１００２にインストールされる。但し、プログラムのインストールは必ずしも記録媒体１００１より行う必要はなく、ネットワークを介して他のコンピュータよりダウンロードするようにしてもよい。補助記憶装置１００２は、インストールされたプログラムを格納すると共に、必要なファイルやデータ等を格納する。 The program that realizes the processing on the computer is provided by, for example, a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed in the auxiliary storage device 1002 from the recording medium 1001 via the drive device 1000. However, the program does not necessarily have to be installed from the recording medium 1001, and may be downloaded from another computer via the network. The auxiliary storage device 1002 stores the installed program and also stores necessary files, data, and the like.

　メモリ装置１００３は、プログラムの起動指示があった場合に、補助記憶装置１００２からプログラムを読み出して格納する。ＣＰＵ１００４は、メモリ装置１００３に格納されたプログラムに従って、制御装置１００に係る機能を実現する。インタフェース装置１００５は、ネットワークに接続するためのインタフェースとして用いられ、ネットワークを介した入力手段及び出力手段として機能する。表示装置１００６はプログラムによるＧＵＩ（Ｇｒａｐｈｉｃａｌ　Ｕｓｅｒ　Ｉｎｔｅｒｆａｃｅ）等を表示する。入力装置１５７はキーボード及びマウス、ボタン、又はタッチパネル等で構成され、様々な操作指示を入力させるために用いられる。 The memory device 1003 reads and stores the program from the auxiliary storage device 1002 when the program is instructed to start. The CPU 1004 realizes the function related to the control device 100 according to the program stored in the memory device 1003. The interface device 1005 is used as an interface for connecting to a network, and functions as an input means and an output means via the network. The display device 1006 displays a GUI (Graphical User Interface) or the like by a program. The input device 157 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used for inputting various operation instructions.

　（変数定義）
　以降の説明において使用される変数の定義を図４、図５に示す。まず、図４は、ＶＮ割当に関する変数定義である。図４に示すように、以下のように変数が定義される。 (Variable definition)
The definitions of the variables used in the following description are shown in FIGS. 4 and 5. First, FIG. 4 is a variable definition related to VN allocation. As shown in FIG. 4, the variables are defined as follows.

　ｔ∈Ｔ：タイムステップ（Ｔ：総タイムステップ数）
　Ｎ：ＶＮ数
　Ｇ（Ｎ，Ｌ）：ネットワークグラフ（Ｎ：ノード集合，Ｌ：リンク集合）
　ｎ∈Ｎ，ｚ∈Ｚ：ノード，サーバ
　ｃ^Ｚ _ｉ：サーバ容量
　ｃ^Ｌ _ｉｊ：リンク容量
　Ｐ：＝｛ｐ^ｉ｝：ユーザ配置
　Ｄ_ｔ：＝｛ｄ^ｉ _ｔ｝：トラヒック需要
　Ｖ_ｔ：＝｛ｖ^ｉ _ｔ｝：ＶＭ需要
　ｕ^Ｌ _ｉｊ，ｔ：リンク利用率
　ｕ^Ｚ _ｉ，ｔ：サーバ利用率
　Ｕ^Ｌ _ｔ＝ｍａｘ_ｉｊ（ｕ^Ｌ _ｉｊ，ｔ）：最大リンク利用率
　Ｕ^Ｚ _ｔ＝ｍａｘ_ｉ（ｕ^Ｚ _ｉ，ｔ）：最大サーバ利用率
　Ｒ^Ｌ _ｔ：＝｛ｒ^Ｌ _ｉｊ，ｔ｝：残余リンク容量
　Ｒ^Ｚ _ｔ：＝｛ｒ^Ｚ _ｉ，ｔ｝：残余サーバ容量
　Ｙ_ｔ：＝｛ｙ_ｉｊ，ｔ｝：ｔにおけるＶＭ割当（例：ＶＭｉを物理サーバｊに割当）の集合
　Ｐ（Ｙ_ｔ，Ｙ_ｔ＋１）：ペナルティ関数
　なお、上記の定義において、リンク利用率ｕ^Ｌ _ｉｊ，ｔとは、リンクｉｊにおける「１－残余リンク容量÷全容量」である。また、サーバ利用率ｕ^Ｚ _ｉ，ｔとは、サーバｉにおける「１－残余サーバ容量÷全容量」である。 t ∈ T: time step (T: total number of time steps)
N: Number of VNs G (N, L): Network graph (N: node set, L: link set)
n ∈ N, z ∈ Z: Node, server c ^Z _i : Server capacity c ^L _ij : Link capacity P: = {pi}: User placement D _t : ^{= {d i} ^t _} : Traffic demand V _t : = { vit}: VM demand u ^L _{ij, t} _: Link utilization rate u ^Z ⁱ _{, t} : Server utilization rate UL _t = max _ij ^{(u L} ^ij _{, t} ): Maximum link utilization rate U ^Z _t = max _i ( u ^Z _{i, t} ): Maximum server utilization rate ^RL _t : = {r ^L _{ij, t} }: Residual link capacity R ^Z _t : = {r ^Z _{i, t} }: Residual server capacity Y _t : = {y _{ij , T} }: Set of VM allocations at _t (eg, VMi is assigned to the physical server j) P (Yt, Yt _{+ 1} ): Penalty function In the above definition, the link utilization rate ^uLij _{, t} is It is "1-remaining link capacity ÷ total capacity" in the link ij. Further, the server utilization rates u ^Z _{i and t} are "1-residual server capacity ÷ total capacity" in the server i.

　図５は、Ｃｏｏｐ－ＭＡＤＲＬに関する変数の定義を示している。図５に示すように、以下のように変数が定義される。なお、各エージェントの行動価値関数Ｑ_ｋを、個別行動価値関数と呼び、全エージェントの行動価値関Ｑ_ｔｏｔ数を、全体行動価値関数と呼んでよい。 FIG. 5 shows the definition of variables for Coop-MADRL. As shown in FIG. 5, the variables are defined as follows. The action value function Q _k of each agent may be referred to as an individual action value function, and the action value relation Q _tot number of all agents may be referred to as an overall action value function.

　ｅ∈Ｅ：エピソード（Ｅ：総エピソード数）
　Ｇ：＝｛ｇ_ｋ｝：エージェント集合（１≦ｋ≦Ｎ）
　ｓ_ｔ∈Ｓ：状態（Ｓ：状態集合）
　Ｏ：＝｛Ｏ^ｋ｝：全エージェントの観測の集合
　ｏ^ｋ _ｔ∈Ｏ^ｋ：各エージェントの観測
　ｏ_ｔ：＝｛ｏ^ｋ _ｔ｝：全エージェントの観測
　Ａ：＝｛Ａ^ｋ｝：全エージェントの行動の集合（Ａ^ｋ：行動集合）
　ａ^ｋ _ｔ∈Ａ^ｋ：各エージェントの行動
　ａ_ｔ：＝｛ａ^ｋ _ｔ｝：全エージェントの行動
　ｒ_ｔ：報酬
　Ｑ_ｋ（ｏ^ｋ _ｔ，ａ^ｋ _ｔ）：各エージェントの行動価値関数
　Ｑ_ｔｏｔ（ｏ_ｔ，ａ_ｔ）：全エージェントの行動価値関数
　Ｍ：Ｒｅｐｌａｙ　ｍｅｍｏｒｙ
　（動的ＶＮ割当問題）
　事前学習及び実制御において前提としている本実施の形態における動的ＶＮ割当問題について説明する。 e ∈ E: Episode (E: Total number of episodes)
G: = {g _k }: Agent set (1 ≦ k ≦ N)
s _t ∈ S: state (S: state set)
O: = ^{ Ok ^} : A set of observations of all agents o kt _∈ Ok: Observations of each agent ot: = ^{ ok _t }: Observations of all agents A: ^{= {A k} ^} _: Of all agents Set of actions ( ^Ak : set of actions)
a _kt ∈ A ^k : Behavior of each agent at: = {a _kt ^} : Behavior of all agents _rt : Reward Q ^k ⁽ _ok _t , a ^kt ₎ _: Behavior value function of each agent Q _tot ( _ot , at): _Behavioral value function of all agents M: Play memory
(Dynamic VN allocation problem)
The dynamic VN allocation problem in the present embodiment, which is premised on the pre-learning and the actual control, will be described.

　本実施の形態では、各ＶＮ需要は、仮想リンクとしてのトラヒック需要と、仮想ノードとしてのＶＭ需要（仮想マシン需要）から構成されているとする。図１に示したとおり、物理ネットワークＧ（Ｎ，Ｌ）は、物理リンクＬと物理ノードＮで構成されており、各物理ノードＮには各物理サーバＺが接続されていると仮定する。すなわち、Ｇ（Ｎ，Ｌ）＝Ｇ（Ｚ，Ｌ）と仮定する。また、各サーバと各リンクには、リソース量の割当上限を意味するサーバ容量ｃ^Ｚ _ｉとリンク容量ｃ^Ｌ _ｉｊを設定する。 In the present embodiment, it is assumed that each VN demand is composed of a traffic demand as a virtual link and a VM demand (virtual machine demand) as a virtual node. As shown in FIG. 1, it is assumed that the physical network G (N, L) is composed of a physical link L and a physical node N, and each physical server Z is connected to each physical node N. That is, it is assumed that G (N, L) = G (Z, L). Further, for each server and each link, the server capacity ^cZ _i , which means the upper limit of allocation of the resource amount, and the link capacity c ^Lij _are set.

　ここで、Ｎ個のＶＮ需要を考える。各ユーザが１つのＶＮ需要を要求すると仮定する。ＶＮ需要は、始点（ユーザ）、終点（ＶＭ）、トラヒック需要Ｄ_ｔ、ＶＭサイズＶ_ｔで構成する。ここで、ＶＭサイズは、ユーザが要求するＶＭの処理容量を示し、サーバ割当時にはＶＭサイズの分だけサーバ容量が消費されるものとする。ＶＮ需要が物理ネットワークに割り当てられた場合、Ｄ_ｔとＶ_ｔの分だけ、ネットワークリソースとサーバリソースが消費されるとする。 Here, consider the demand for N VN pieces. Suppose each user requests one VN demand. The VN demand is composed of a start point (user), an end point (VM), a traffic demand D _t , and a VM size V _t . Here, the VM size indicates the processing capacity of the VM requested by the user, and it is assumed that the server capacity is consumed by the amount of the VM size at the time of server allocation. When VN demand is allocated to the physical network, it is assumed that network resources and server resources are consumed by _Dt and _Vt .

　本実施の形態の制御装置１００は、上記の物理ネットワークとＶＮ需要に対して、動的ＶＮ割当問題を解くことでＶＮを物理ネットワークに割り当てる。当該動的ＶＮ割当問題は、目的関数として、全ての時刻に渡る最大リンク利用率Ｕ^Ｌ _ｔと最大サーバ利用率Ｕ^Ｚ _ｔの和の最小化を課す。すなわち、以下の式で表せる。 The control device 100 of the present embodiment allocates the VN to the physical network by solving the dynamic VN allocation problem for the above physical network and VN demand. The dynamic VN allocation problem imposes, as an objective function, the minimization of the sum of the maximum link utilization rate ^UL _t and the maximum server utilization rate U ^Z _t over all times. That is, it can be expressed by the following equation.

　制約条件として、最大リンク利用率と最大サーバ利用率がそれぞれ１以下であることを課す。すなわち、Ｕ^Ｌ _ｔ＜１とＵ^Ｚ _ｔ＜１で表すことができる。

As a constraint condition, the maximum link utilization rate and the maximum server utilization rate are each 1 or less. That is, it can be expressed by ^UL _t <1 and U ^Z _t <1.

　図６は、制御装置１００の全体動作を示すフローチャートである。図６のフローチャートに示すように、本実施の形態では、まず、事前学習部１１０が、エージェント毎及び全体エージェントについての事前学習を行う（Ｓ１００）。その後、割当部１３０が、学習済みエージェントの行動価値関数に基づいて実制御、すなわち、物理ネットワークに対する仮想ネットワーク割当のための設定等を行う（Ｓ２００）。 FIG. 6 is a flowchart showing the overall operation of the control device 100. As shown in the flowchart of FIG. 6, in the present embodiment, the pre-learning unit 110 first performs pre-learning for each agent and for all agents (S100). After that, the allocation unit 130 performs actual control based on the action value function of the learned agent, that is, a setting for virtual network allocation to the physical network (S200).

　本実施の形態では、離散的なタイムステップを仮定し、各タイムステップでＶＮ需要が変化すると仮定する。実制御では、各タイムステップｔでは、まずＶＮ需要を観測する。次に、観測値に基づいて、次のタイムステップｔ＋１における最適なＶＮ割当を学習済みエージェントが計算する。最後に、計算結果に基づいて、経路とＶＭ配置の変更を行う。 In this embodiment, it is assumed that discrete time steps are assumed, and VN demand changes at each time step. In actual control, VN demand is first observed at each time step t. Next, the trained agent calculates the optimum VN allocation in the next time step t + 1 based on the observed value. Finally, the route and VM arrangement are changed based on the calculation result.

　（Ｃｏｏｐ－ＭＡＲＬアルゴリズムについて）
　次に、Ｃｏｏｐ－ＭＡＲＬアルゴリズムについて説明する。Ｃｏｏｐ－ＭＡＲＬアルゴリズムに関する変数の定義は図５に示したとおりである。 (About the Coop-MALL algorithm)
Next, the Coop-MALL algorithm will be described. The definitions of the variables for the Coop-MALL algorithm are as shown in FIG.

　ＶＮ数をＮ、エージェント数をＭとし、エージェント数と同数のグループにＶＮ需要を分割する。ｋ番目のＶＮ群の割当をエージェントｇ_ｋが行う。ＶＮ数をＮ、サーバ数を｜Ｚ｜、候補経路を１つに限定した際の行動数は｜Ｚ｜^Ｎになる。 Let N be the number of VNs and M be the number of agents, and the VN demand is divided into groups having the same number of agents. Agent g _k assigns the kth VN group. The number of VNs is N, the number of servers is | Z |, and the number of actions when the candidate route is limited to one is | Z | ^N.

　本実施の形態では、参考文献［２］の技術と同様に、ＶＮ需要をＭ（１≦Ｍ≦Ｎ）個に分割し、各エージェントの行動数を｜Ｚ｜^Ｎ／Ｍまで減少させることで、各エージェントの学習コストを低減させる。例えば、ＶＮ数が４、サーバ数が２、エージェント数１の場合、ＶＮ割当の割当の候補数（すなわち行動数）は、２^４＝１６となる。エージェント数４の場合、各エージェントのＶＮ割当の候補数は、２^４／４＝２^１＝２となる。以後、簡単化のためＮ＝Ｍ、すなわち、ＶＮ数と同数のエージェントを用意することとする。 In the present embodiment, as in the technique of reference [2], the VN demand is divided into M (1 ≦ M ≦ N) pieces, and the number of actions of each agent is reduced to | Z | ^{N / M.} , Reduce the learning cost of each agent. For example, when the number of VNs is ⁴ , the number of servers is 2, and the number of agents is 1, the number of candidates for VN allocation (that is, the number of actions) is 24 = 16. When the number of agents is 4, the number of candidates for VN allocation for each agent is ^24/4 = 2 ¹ = 2. Hereinafter, for simplification, N = M, that is, the same number of agents as the number of VNs will be prepared.

　事前学習部１１０は、事前学習において、行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）の学習を行う。行動価値関数Ｑ（ｓ_ｔ，ａ_ｔ）は、状態ｓ_ｔで行動ａ_ｔを選択した場合に得られる累積報酬の推定値を表す。マルチエージェント環境では、各エージェントは状態ｓ_ｔの一部分のみが観測できると仮定し、各エージェントの観測をｏ^ｋとする。その場合、ｋ番目のエージェントの行動価値関数はＱ_ｋ（ｏ^ｋ，ａ^ｋ）となる。 The pre-learning unit 110 _learns the behavioral value function Q ( _st , at) in the pre-learning. The action value function Q ( _st , at) represents an estimated value of the _cumulative _reward obtained when the action at is selected in the state _st . In a multi-agent environment, it is assumed that each agent can observe only a part of the state _st , and the observation of each agent is ^ok . In that case, the action value function of the ^kth agent is Q _k (ok, a ^k ).

　本実施の形態では、ｋ番目のエージェントの行動価値関数をＱ_ｋとし、全体の行動価値関数をＱ_ｔｏｔとする。Ｑ_ｋは各エージェントにとっての行動価値を表し、Ｑ_ｔｏｔは全エージェントにとっての行動価値を表す。参考文献［２］の技術では、ｋ番目のエージェントはＱ_ｋが最大となる行動を学習するのに対して、本実施の形態では、全てのエージェントはＱ_ｔｏｔが最大となる行動を学習する。事前学習部１１０は、Ｑ_ｔｏｔをＱ_ｋに基づいて算出する。Ｑ_ｔｏｔをＱ_ｋに基づいて算出する方法については、特定の方法に限定されないが、例えば、下記の式（２）で算出することができる（参考文献［４］）。つまり、式（２）では、Ｑ_ｋ（ｏ^ｋ，ａ^ｋ）のエージェントについての総和をＱ_ｔｏｔとしている。 In this embodiment, the behavioral value function of the _kth agent is Qk, and the behavioral value function of the whole is _Qtot . Q _k represents the action value for each agent, and Q _tot represents the action value for all agents. In the technique of reference [2], the k-th agent learns the behavior that maximizes Q _k , whereas in the present embodiment, all agents learn the behavior that maximizes Q _tot . The pre-learning unit 110 calculates Q _tot based on Q _k . The method for calculating Q _tot based on Q _k is not limited to a specific method, but can be calculated by, for example, the following equation (2) (reference [4]). That is, in the equation (2), the sum of the agents of Q _k ( ^{ok, a k} ⁾ is set as Q _tot .

　実制御時には、各学習済みのエージェントは、Ｑ_ｋが最大となる行動を時刻ｔにおける行動として選択する。なお、各学習済みのエージェントは、学習済みの行動価値関数を用いて割当処理を実行する割当部１３０に相当する。

At the time of actual control, each learned agent selects the action that maximizes Q _k as the action at time t. Each trained agent corresponds to the allocation unit 130 that executes the allocation process using the trained action value function.

　より詳細に説明する。事前学習は、観測ｏ_ｔ：＝｛ｏ^ｋ _ｔ｝、行動ａ_ｔ：＝｛ａ^ｋ _ｔ｝、報酬ｒ_ｔから構成される学習サンプルデータを集めるフェーズと、学習サンプルデータに基づいてＱ_ｋを学習するフェーズから成る。学習サンプルデータを集めるフェーズでは、ステップｔにおいて、各エージェントはｏ^ｋ _ｔを観測し、各エージェントは自身のＱ_ｋに基づいて、Ｑ_ｋが最大となる行動 It will be explained in more detail. Pre-learning consists of a phase of collecting training sample data consisting of observation ot: = ^{ _{ok t} _} , action at: = _{{a kt}, and reward rt, and Q k} _based _on _the ^training sample data. It consists of a learning phase. In the phase of collecting training sample data, in step t, each agent observes _ok _t , and each agent performs an action that maximizes Q ^k based on its own Q _k .

を決定する。ステップｔにおいて、全てのエージェントが行動ａ^ｋ _ｔを決定した後、全エージェントの行動ａ_ｔに基づいて、状態ｓ_ｔを更新する。その後、ｒ_ｔを計算する。すなわち，学習サンプルデータを集めるフェーズでは、Ｑ_ｔｏｔを使用せずに、Ｑ_ｋに基づいてサンプルデータを集める。

To decide. In step _t , after all the agents have determined the action ^akt , the state _st _is updated based on the action at of all the agents. After that, _rt is calculated. That is, in the phase of collecting training sample data, sample data is collected based on Q _k without using Q _tot .

　学習フェーズでは、集めたサンプルデータ（ｏ_ｊ，ａ_ｊ，ｒ_ｊ，ｏ_ｊ＋１）に基づいてＱ_ｋを更新する。まず、Ｑ_ｋに基づいて、Ｑ_ｔｏｔを計算する。その後、強化学習のアルゴリズムに基づいて、Ｑ_ｔｏｔが最大となるようにＱ_ｋを学習する。すなわち、Ｑ_ｋを更新する。既存手法では、各Ｑ_ｋが最大となるように各Ｑ_ｋを更新するが、本実施の形態では、Ｑ_ｔｏｔが最大となるように各Ｑ_ｋを更新することで、全エージェントにとっての最適な行動を選択することができる。 In the learning phase, Q _k is updated based on the collected sample data (o _j , a _j , r _j , o _{j + 1} ). First, Q _tot is calculated based on Q _k . After that, based on the algorithm of reinforcement learning, Q _k is learned so that Q _tot is maximized. That is, Q _k is updated. In the existing method, each Q _k is updated so that each Q _k is maximized, but in the present embodiment, each Q _k is updated so that the Q _tot is maximized, which is optimal for all agents. You can choose an action.

　なお、本実施の形態において、強化学習の学習アルゴリズムについては限定せず、任意の学習アルゴリズムに対して適用することができる。例えば、強化学習の学習アルゴリズムとして参考文献［３］などに開示された技術を用いることができる。参考文献［３］の技術を用いた場合の学習アルゴリズムの実施例について説明する。 In the present embodiment, the learning algorithm for reinforcement learning is not limited, and can be applied to any learning algorithm. For example, the technique disclosed in reference [3] or the like can be used as a learning algorithm for reinforcement learning. An example of a learning algorithm when the technique of reference [3] is used will be described.

　以下のように、Ｑ_ｋをニューラルネットワークで近似することを考える。 Consider approximating Q _k with a neural network as follows.

　式（３）におけるθ^ｋは、ｋ番目のニューラルネットワークのハイパーパラメータを表し、Θは全てのθ^ｋを表す。ニューラルネットワークの学習に用いるＬｏｓｓ関数を以下の式で定義する。

In equation (3), θ ^k represents the hyperparameters of the kth neural network, and Θ represents all θ ^k . The Loss function used for learning the neural network is defined by the following formula.

ここで、γはハイパーパラメータである。上記の式に対して、集めたサンプルデータ（ｏ_ｊ，ａ_ｊ，ｒ_ｊ，ｏ_ｊ＋１）と確率的勾配降下法を用いてΘを更新することで、Ｑ_ｋを学習する。

Here, γ is a hyperparameter. For the above equation, Q _k is learned by updating Θ using the collected sample data (o _j , a _j , r _j , o _{j + 1} ) and the stochastic gradient descent method.

　実制御時には、事前学習の学習サンプルデータを集めるフェーズと同様に、各Ｑ_ｋが最大となる行動を選択する。ただし、ここでのＱ_ｋは、Ｑ_ｔｏｔが最大となるように学習したＱ_ｋであるため、他のエージェントを考慮した協調した行動を選択することができる。 At the time of actual control, the action that maximizes each _Qk is selected as in the phase of collecting the learning sample data of the pre-learning. However, since the Q _k here is the Q _k learned so that the Q _tot is maximized, it is possible to select a coordinated action in consideration of other agents.

　（モデリング）
　次に、本実施の形態における強化学習の学習モデルについて説明する。強化学習の学習モデルでは、状態ｓ_ｔ、行動ａ_ｔ、報酬ｒ_ｔが使用される。前述したように、各エージェントは状態の一部のみを観測できると仮定する。時刻ｔにおける状態ｓ_ｔをｓ_ｔ＝［Ｄ_ｔ，Ｖ_ｔ，Ｒ^Ｌ _ｔ，Ｒ^Ｚ _ｔ］と定義する。また、時刻ｔにおけるエージェントｇ_ｋの観測ｏ^ｋ _ｔをｏ^ｋ _ｔ＝［ｄ^ｋ _ｔ，ｖ^ｋ _ｔ，Ｒ^Ｌ _ｔ，Ｒ^Ｚ _ｔ］と定義する。ここで、Ｄ_ｔとＶ_ｔはトラヒック需要とＶＭサイズであり、Ｒ^Ｌ _ｔとＲ^Ｚ _ｔは各リンクの残余帯域および各サーバの残余容量である。各リンクの残余帯域はｒ^Ｌ _ｉｊ、ｔ＝１－ｃ^Ｌ _ｉｊｕ^Ｌ _ｉｊ、ｔで計算する。また、ｒ^Ｚ _ｉ，ｔも同様に計算する。｜Ｚ｜通りのＶＮ割当の候補集合をＡ^ｋと定義して、各時刻ではＡ^ｋから行動ａ^ｋ _ｔを一つ選択する。本モデルでは割当先サーバに対して経路が一意に定まるものとし、ＶＮ割当はＶＭと割当先サーバの組合せで決めるものとする。 (modeling)
Next, the learning model of reinforcement learning in this embodiment will be described. In the learning model of reinforcement learning, the state st, the action _att , and the _reward _rt are used. As mentioned above, it is assumed that each agent can observe only part of the state. The state st at time _t is defined as _{st = [D t} _, V _t , ^RL _t , R ^Z _t ]. Further, the observation ok _t of the agent g _k at time _t is ^defined as ^ok _t = [d ^kt , v ^kt , ^RL _t , R ^Z _t _] . Here, D _t and V _t are traffic demand and VM size, and ^RL _t and R ^Z _t are the residual bandwidth of each link and the residual capacity of each server. The residual band of each link is calculated by r ^L _{ij, t} = 1-c ^L _ij u ^L _{ij, t} . Further, r ^Z _{i and t} are calculated in the same manner. A candidate set for VN allocation according to | Z | is defined as ^Ak , and one action a ^k _t is selected from A ^k at each time. In this model, the route is uniquely determined for the allocation destination server, and the VN allocation is determined by the combination of the VM and the allocation destination server.

　ｇ_ｋの報酬計算手順を図７に示す。この手順は、制御装置１００の報酬計算部１２０により実行されるものである。１行目のＥｆｆ（ｘ）は効率関数を表し、ｘが増加する程Ｅｆｆ（ｘ）が減少するように以下のように定義される関数である。 The reward calculation procedure for g _k is shown in FIG. This procedure is executed by the reward calculation unit 120 of the control device 100. Eff (x) on the first line represents an efficiency function, and is a function defined as follows so that Eff (x) decreases as x increases.

　式（５）に示すように、制約条件の違反に近い状態（Ｕ^Ｌ _ｔ＋１やＵ^Ｚ _ｔ＋１が９０％以上になること）を強く避けるために、ｘが０．９以上の場合はＥｆｆ（ｘ）を２倍減少させる。不必要なＶＮ再割当（Ｕ^Ｌ _ｔ＋１やＵ^Ｚ _ｔ＋１が２０％以下のときのＶＮ再割当）を避けるために、ｘが０．２以下の場合はＥｆｆ（ｘ）を一定とする。

As shown in the equation (5), in order to strongly avoid a state close to a violation of the constraint condition ( ^UL _{t + 1} or U ^Z _{t + 1} becomes 90% or more), when x is 0.9 or more, Eff (x). ) Is reduced by a factor of two. In order to avoid unnecessary VN reassignment (VN reassignment when ^UL _{t + 1} or U ^Z _{t + 1} is 20% or less), Eff (x) is set to be constant when x is 0.2 or less.

　図７に示す報酬計算手順の２～４行目では、不必要なＶＮの再配置を抑制するため、ＶＮの再割当に応じたペナルティを与える。２行目において、報酬計算部１２０は、再割当が行われたと判断した場合（Ｙ_ｔとＹ_ｔ＋１が異なる場合）に、３行目に進み、ｒ_ｔ－Ｐ（Ｙ_ｔ，Ｙ_ｔ＋１）をｒ_ｔとする。Ｐ（Ｙ_ｔ，Ｙ_ｔ＋１）は、ＶＮの再配置を抑制するためのペナルティ関数であり、再配置を抑制する場合はＰ値が大きく、許容する場合はＰ値が小さくするなるように設定する。ここで、Ｙ_ｔはＶＭの割当状態を表す。 In the second to fourth lines of the reward calculation procedure shown in FIG. 7, a penalty is given according to the reassignment of the VN in order to suppress unnecessary relocation of the VN. In the second line, when the reward calculation unit 120 determines that the reallocation has been performed (when Y _t and Y _{t + 1} are different), the reward calculation unit 120 proceeds to the third line and performs rt −P (Y _t , Y _t _{+ 1} ). Let it be _rt . P (Y _t , Y _{t + 1} ) is a penalty function for suppressing VN rearrangement, and is set so that the P value is large when the rearrangement is suppressed and the P value is small when the rearrangement is allowed. .. Here, Y _t represents the VM allocation state.

　以下、事前学習と実制御のそれぞれについての動作を詳細に説明する。 Below, the operation of each of pre-learning and actual control will be explained in detail.

　（事前学習動作）
　事前学習部１１０が実行する、協調型マルチエージェント強化学習（Ｃｏｏｐ－ＭＡＲＬ）の事前学習手順（事前学習アルゴリズム）を図８に示す。 (Pre-learning operation)
FIG. 8 shows a pre-learning procedure (pre-learning algorithm) of the cooperative multi-agent reinforcement learning (Coop-MALL) executed by the pre-learning unit 110.

　タイムステップＥの一連の行動をエピソードと呼び、学習が完了するまでエピソードを繰り返し実行する。事前学習部１１０は、各エピソード（２～１８行目）の最初に、全ＶＮに対する、ステップ数Ｅの学習用トラヒック需要及びＶＭ需要の候補を生成する。 The series of actions in time step E is called an episode, and the episode is repeated until learning is completed. At the beginning of each episode (lines 2 to 18), the pre-learning unit 110 generates candidates for learning traffic demand and VM demand having the number of steps E for all VNs.

　その後、事前学習部１１０は、一連の手順（５～１６行目）を繰り返し実行する。 After that, the pre-learning unit 110 repeatedly executes a series of procedures (lines 5 to 16).

　６～１１行目で各エージェントの学習サンプル（状態ｏ^ｋ _ｔ、行動ａ^ｋ _ｔ，報酬ｒ_ｔ，次の状態ｏ^ｋ _ｔ＋１のペア）を生成し、Ｒｅｐｌａｙ　Ｍｅｍｏｒｙ　Ｍに格納する。学習サンプルの生成では、現在の観測ｏ^ｋ _ｔとＱ_ｋに応じた行動選択、行動ａ^ｋ _ｔに基づいた状態の更新（ＶＮの再配置）、更新した観測ｏ^ｋ _ｔにおける報酬ｒ_ｔの計算を各エージェントに対して行う。 In the 6th to 11th lines, a learning sample (a pair of a state ok _t , an action a ^kt , a reward _rt , and a next state ^ok _t ⁺ ₁ ) of each agent is generated and stored in the Play Memory M. In the generation of the training sample, the action selection according to the current observation _ok _t and Q ^k , the update of the state based on the action a ^kt ₍ relocation of the VN), and the calculation of the reward _rt in the updated observation ^ok _t . For each agent.

　１２～１４行目は、エピソードの終了条件を指す。本モデルでは、ｒ_ｔ＝－１を終了条件とする。１７行目でエピソード０～Ｅの学習サンプルをＲｅｐｌａｙ　Ｍｅｍｏｒｙ　Ｍに格納する。１８行目でＲｅｐｌａｙ　Ｍｅｍｏｒｙ　Ｍからランダムに学習サンプルを取り出し、エージェントの学習を行う。エージェントの学習では、前述したとおり、強化学習のアルゴリズムに基づいて、Ｑ_ｔｏｔが最大となるように各Ｑ_ｋの学習を行う。 The 12th to 14th lines indicate the end condition of the episode. In this model, rt ₌ -1 is the end condition. In the 17th line, the learning samples of episodes 0 to E are stored in the Play Memory M. On the 18th line, a training sample is randomly taken out from the Play Memory M to train the agent. In the learning of the agent, as described above, each Q _k is learned so that the Q _tot is maximized based on the algorithm of reinforcement learning.

　また、前述したとおり、本実施の形態において、強化学習の学習アルゴリズムについては特定のものに限定せず、任意の学習アルゴリズムに対して適用することができる。例えば、強化学習の学習アルゴリズムとして［３］などを用いることができる。 Further, as described above, in the present embodiment, the learning algorithm for reinforcement learning is not limited to a specific one, and can be applied to any learning algorithm. For example, [3] or the like can be used as a learning algorithm for reinforcement learning.

　上述した報酬計算手順に基づく事前学習部１１０の動作例を図９のフローチャートを参照して説明する。 An operation example of the pre-learning unit 110 based on the above-mentioned reward calculation procedure will be described with reference to the flowchart of FIG.

　なお、事前学習における状態の観測や、行動（ＶＮの物理リソースへの割当）については、実際の物理ネットワーク２００に対して行うこととしてもよいし、実際の物理ネットワーク２００と同等のモデルに対して行うこととしてもよい。 It should be noted that state observation and behavior (allocation of VN to physical resources) in pre-learning may be performed for the actual physical network 200, or for a model equivalent to the actual physical network 200. You may do it.

　Ｓ１０１において、事前学習部１１０は、ステップ数Ｅの学習用トラヒック需要及びＶＭ需要の候補を生成し、データ格納部１４０に格納する。 In S101, the pre-learning unit 110 generates candidates for learning traffic demand and VM demand having the number of steps E, and stores them in the data storage unit 140.

　Ｓ１０１～Ｓ１０６は、ｔがＴより小さい間実行される。また、Ｓ１０３～Ｓ１０５はエピソード０～Ｅの各エピソードで繰り返しで行われる。 S101 to S106 are executed while t is smaller than T. Further, S103 to S105 are repeated in each episode of episodes 0 to E.

　Ｓ１０２において、事前学習部１１０（各エージェント）は、ｏ^ｋ _ｔを観測し、各エージェント自身のＱ_ｋに基づいて、Ｑ_ｋが最大となる行動ａ^ｋ _ｔを選択する。つまり、各エージェントは、行動価値関数の値（Ｑ_ｋ）が最大になるように自身に割り当てられたＶＮの割当先サーバを選択する。 In S102, the pre-learning unit 110 (each agent) _{observes ok t} ^and _selects an action a _kt that maximizes Q ^k based on the Q _k of each agent itself. That is, each agent selects the VN allocation destination server assigned to itself so that the value (Q _k ) of the action value function is maximized.

　Ｓ１０３において、事前学習部１１０（各エージェント）は、選択された行動（ＶＮ割当）を物理ネットワーク２００に設定し、観測ｏ_ｔを更新してｏ_ｔ＋１を得る。また、Ｓ１０４において、報酬計算部１２０は、前述した計算方法で、各エージェントについての報酬ｒ_ｔの計算を行う。 In S103, the pre-learning unit 110 (each agent) sets the selected action (VN allocation) in the physical network 200, updates the observation _ot , and obtains _{ot + 1} . Further, in S104, the reward calculation unit 120 calculates the reward _rt for each agent by the above-mentioned calculation method.

　Ｓ１０５において、事前学習部１１０は、集めた学習サンプルデータ（ｏ_ｊ，ａ_ｊ，ｒ_ｊ）をＲｅｐｌａｙ　Ｍｅｍｏｒｙ　Ｍ（データ格納部１４０）に格納する。Ｓ１０６において、事前学習部１１０は、Ｒｅｐｌａｙ　Ｍｅｍｏｒｙ　Ｍ（データ格納部１４０）から、学習サンプルをランダムに選択し、Ｑ_ｔｏｔが最大になるように、各エージェントの学習を行う。つまり、各エージェントの行動価値関数Ｑ_ｋの学習を行う。 In S105, the pre-learning unit 110 stores the collected learning sample data ( _{oj, aj} _, _rj ) in the Play Memory M (data storage unit 140). In S106, the pre-learning unit 110 randomly selects a learning sample from the Play Memory M (data storage unit 140), and learns each agent so that the Q _tot is maximized. That is, the action value function Q _k of each agent is learned.

　（実制御動作）
　制御装置１００の割当部１３０が実行する、協調型マルチエージェント強化学習（Ｃｏｏｐ－ＭＡＤＲＬ）による動的ＶＮ割当手順を図１０に示す。 (Actual control operation)
FIG. 10 shows a dynamic VN allocation procedure by cooperative multi-agent reinforcement learning (Coop-MADRL) executed by the allocation unit 130 of the control device 100.

　図１０の１行目は、ｇ_ｋの事前学習を表しており、これは上記で説明したとおりである。その後は、２～８行目を各時刻で繰り返す。ある時刻において、４行目では、状態ｏ^ｋ _ｔの観測を行う。５行目では、状態ｏ^ｋ _ｔに基づいてＱ_ｋ（ｏ^ｋ _ｔ，ａ´）が最大となる行動ａ^ｋ _ｔを選択する。４行目、５行目は各エージェントに対して実行される。７行目では、ＶＮ割当を更新する。 The first line of FIG. 10 shows the pre-learning of g _k , which is as described above. After that, the 2nd to 8th lines are repeated at each time. At a certain time, in the 4th line, the state ^ok _t is observed. In the fifth line, the action a ^kt that maximizes Q ^k ( ^ok _t , a' ₎ is selected based on the state _ok _t . The fourth and fifth lines are executed for each agent. In the 7th line, the VN allocation is updated.

　上述した実制御手順に基づく割当部１３０の動作例を図１１のフローチャートを参照して説明する。Ｓ２０１～Ｓ２０３は各タイムステップで実行される。 An operation example of the allocation unit 130 based on the above-mentioned actual control procedure will be described with reference to the flowchart of FIG. S201 to S203 are executed at each time step.

　Ｓ２０１において、割当部１３０は、各エージェントに対する、時刻ｔにおける状態ｏ^ｋ _ｔ（ＶＭ需要、トラヒック需要、残余リンク容量、残余サーバ容量）を観測（取得）する。具体的には、例えば、ＶＭ需要、トラヒック需要を、各ユーザ（ユーザ端末等）から受信し、残余リンク容量と残余サーバ容量を、物理ネットワーク２００から（あるいは物理ネットワーク２００を監視するオペレーションシステム）から取得する。なお、ＶＭ需要、トラヒック需要に関しては、需要予測により得た値であってもよい。 In S201, the allocation unit 130 observes (acquires) the state ^ok _t (VM demand, traffic demand, residual link capacity, residual server capacity) at time t for each agent. Specifically, for example, VM demand and traffic demand are received from each user (user terminal, etc.), and the residual link capacity and the residual server capacity are obtained from the physical network 200 (or an operation system that monitors the physical network 200). get. The VM demand and the traffic demand may be values obtained by demand forecasting.

　Ｓ２０２において、割当部１３０は、各エージェントについて、Ｑ_ｋ（ｏ^ｋ _ｔ，ａ´）が最大となる行動ａ^ｋ _ｔを選択する。すなわち、割当部１３０は、Ｑ_ｋ（ｏ^ｋ _ｔ，ａ´）が最大となるように、各エージェントのＶＮにおけるＶＭの割当先サーバを選択する。 In S202, the allocation unit 130 selects the action ^akt that maximizes _Qk ( ^ok _t , a' ₎ for each agent. That is, the allocation unit 130 selects the VM allocation destination server in the VN of each agent so that Q _k ( ^ok _t , a') is maximized.

　Ｓ２０３において、割当部１３０は、状態を更新する。具体的には、割当部１３０は、各エージェントのＶＮについて、物理ネットワーク２００における各割当先サーバに対してＶＭを割り当てる設定を行うとともに、需要に応じたトラヒックが正しい経路（リンクの集合）を流れるように、物理ネットワーク２００における経路設定を行う。 In S203, the allocation unit 130 updates the state. Specifically, the allocation unit 130 sets the VN of each agent to allocate the VM to each allocation destination server in the physical network 200, and the traffic according to the demand flows through the correct route (set of links). As described above, the route is set in the physical network 200.

　（実施の形態の効果）
　以上説明したように、本実施の形態では、全てのエージェントにとっての行動価値Ｑ_ｔｏｔが最大となる各エージェントの行動を学習する仕組みを備え、エージェント間の協調制御を学習することで、エージェント間の割当の競合による割当精度の低下の問題を解決し、割当精度を向上させることができる。 (Effect of embodiment)
As described above, in the present embodiment, a mechanism for learning the behavior of each agent that maximizes the behavior value _Qtot for all agents is provided, and by learning the cooperative control between agents, the agents are involved. It is possible to solve the problem of deterioration of allocation accuracy due to allocation conflict and improve allocation accuracy.

　（実施の形態のまとめ）
　本明細書には、少なくとも下記各項の制御装置、仮想ネットワーク割当方法、及びプログラムが開示されている。
（第１項）
　協調型マルチエージェント強化学習により、仮想ネットワークをリンクとサーバを有する物理ネットワークに割り当てるための制御装置であって、
　仮想ネットワークの需要を複数グループに分割し、グループ毎にエージェントを割り当て、個々のエージェントに対する行動価値関数である個別行動価値関数を、全エージェントに対する行動価値関数である全体行動価値関数が最大になるように学習する事前学習部と、
　前記事前学習部により学習されたエージェント毎の個別行動価値関数を用いて、仮想ネットワークを前記物理ネットワークに割り当てる割当部と
　を備える制御装置。
（第２項）
　前記事前学習部は、個別行動価値関数のエージェントについての総和を前記全体行動価値関数として使用する
　第１項に記載の制御装置。
（第３項）
　前記割当部は、各エージェントについての学習済みの個別行動価値関数が最大になるように、仮想ネットワークを前記物理ネットワークに割り当てる行動を選択する
　第１項又は第２項に記載の制御装置。
（第４項）
　協調型マルチエージェント強化学習により、仮想ネットワークをリンクとサーバを有する物理ネットワークに割り当てるための制御装置が実行する仮想ネットワーク割当方法であって、
　仮想ネットワークの需要を複数グループに分割し、グループ毎にエージェントを割り当て、個々のエージェントに対する行動価値関数である個別行動価値関数を、全エージェントに対する行動価値関数である全体行動価値関数が最大になるように学習する事前学習ステップと、
　前記事前学習ステップにより学習されたエージェント毎の個別行動価値関数を用いて、仮想ネットワークを前記物理ネットワークに割り当てる割当ステップと
　を備える仮想ネットワーク割当方法。
（第５項）
　コンピュータを、第１項ないし第３項のうちいずれか１項に記載の制御装置における各部として機能させるためのプログラム。 (Summary of embodiments)
The present specification discloses at least the control device, the virtual network allocation method, and the program of each of the following items.
(Section 1)
A control device for allocating virtual networks to physical networks with links and servers through cooperative multi-agent reinforcement learning.
Divide the demand of the virtual network into multiple groups, assign agents to each group, and maximize the individual action value function, which is the action value function for each agent, and the overall action value function, which is the action value function for all agents. Pre-learning department to learn from
A control device including an allocation unit that allocates a virtual network to the physical network by using an individual action value function for each agent learned by the pre-learning unit.
(Section 2)
The control device according to item 1, wherein the pre-learning unit uses the sum of the agents of the individual action value function as the total action value function.
(Section 3)
The control device according to

item

1 or 2, wherein the allocation unit selects an action for allocating a virtual network to the physical network so that the learned individual action value function for each agent is maximized.
(Section 4)
A virtual network allocation method executed by a control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning.
Divide the demand of the virtual network into multiple groups, assign agents to each group, and maximize the individual action value function, which is the action value function for each agent, and the overall action value function, which is the action value function for all agents. Pre-learning steps to learn and
A virtual network allocation method including an allocation step for allocating a virtual network to the physical network by using an individual action value function for each agent learned by the pre-learning step.
(Section 5)
A program for making a computer function as each part in the control device according to any one of the items 1 to 3.

　以上、本実施の形態について説明したが、本発明はかかる特定の実施形態に限定されるものではなく、特許請求の範囲に記載された本発明の要旨の範囲内において、種々の変形・変更が可能である。 Although the present embodiment has been described above, the present invention is not limited to such a specific embodiment, and various modifications and changes can be made within the scope of the gist of the present invention described in the claims. It is possible.

　［参考文献］
[1] 鈴木晃人, 安孫子悠, 原田薫明, "深層強化学習による動的仮想ネットワーク割当手法の検討," 信学会総合大会, B-7-48, 2019.（非特許文献１に対応）
[2] 鈴木晃人, 原田薫明, "マルチエージェント深層強化学習による動的仮想リソース割当手法," 信学技報, vol.119, no. 195, IN2019-29, pp. 35-40, 2019 年9 月.（非特許文献２に対応）
[3] V. Mnihet al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540,p. 529, 2015.
[4] P. Sunehag al.,"Value-decomposition networks for cooperative multi-agent learning based on team rewar," in Proc. AAMAS, 2018, pp. 2085-2087. [References]
[1] Akihito Suzuki, Yu Yasuko, Kaoruaki Harada, "Examination of Dynamic Virtual Network Allocation Method by Deep Reinforcement Learning," General Conference of the Society of Communication, B-7-48, 2019. (Corresponding to Non-Patent Document 1)
[2] Akihito Suzuki, Kaoruaki Harada, "Dynamic Virtual Resource Allocation Method by Multi-Agent Deep Reinforcement Learning," Bibliography, vol.119, no. 195, IN2019-29, pp. 35-40, 2019 September, 2014 (corresponding to Non-Patent Document 2)
[3] V. Mnihet al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, p. 529, 2015.
[4] P. Sunehag al., "Value-decomposition networks for cooperative multi-agent learning based on team rewar," in Proc. AAMAS, 2018, pp. 2085-2087.

１００　制御装置
１１０　事前学習部
１２０　報酬計算部
１３０　割当部
１４０　データ格納部
２００　物理ネットワーク
３００　物理ノード
４００　物理リンク
１０００　ドライブ装置
１００１　記録媒体
１００２　補助記憶装置
１００３　メモリ装置
１００４　ＣＰＵ
１００５　インタフェース装置
１００６　表示装置
１００７　入力装置 100 Control device 110 Pre-learning unit 120 Reward calculation unit 130 Allocation unit 140 Data storage unit 200 Physical network 300 Physical node 400 Physical link 1000 Drive device 1001 Recording medium 1002 Auxiliary storage device 1003 Memory device 1004 CPU
1005 Interface device 1006 Display device 1007 Input device

Claims

　協調型マルチエージェント強化学習により、仮想ネットワークをリンクとサーバを有する物理ネットワークに割り当てるための制御装置であって、
　仮想ネットワークの需要を複数グループに分割し、グループ毎にエージェントを割り当て、個々のエージェントに対する行動価値関数である個別行動価値関数を、全エージェントに対する行動価値関数である全体行動価値関数が最大になるように学習する事前学習部と、
　前記事前学習部により学習されたエージェント毎の個別行動価値関数を用いて、仮想ネットワークを前記物理ネットワークに割り当てる割当部と
　を備える制御装置。 A control device for allocating virtual networks to physical networks with links and servers through cooperative multi-agent reinforcement learning.
Divide the demand of the virtual network into multiple groups, assign agents to each group, and maximize the individual action value function, which is the action value function for each agent, and the overall action value function, which is the action value function for all agents. Pre-learning department to learn from
A control device including an allocation unit that allocates a virtual network to the physical network by using an individual action value function for each agent learned by the pre-learning unit.
　前記事前学習部は、個別行動価値関数のエージェントについての総和を前記全体行動価値関数として使用する
　請求項１に記載の制御装置。 The control device according to claim 1, wherein the pre-learning unit uses the sum of the agents of the individual action value function as the overall action value function.
　前記割当部は、各エージェントについての学習済みの個別行動価値関数が最大になるように、仮想ネットワークを前記物理ネットワークに割り当てる行動を選択する
　請求項１又は２に記載の制御装置。 The control device according to claim 1 or 2, wherein the allocation unit selects an action for allocating a virtual network to the physical network so that the learned individual action value function for each agent is maximized.
　協調型マルチエージェント強化学習により、仮想ネットワークをリンクとサーバを有する物理ネットワークに割り当てるための制御装置が実行する仮想ネットワーク割当方法であって、
　仮想ネットワークの需要を複数グループに分割し、グループ毎にエージェントを割り当て、個々のエージェントに対する行動価値関数である個別行動価値関数を、全エージェントに対する行動価値関数である全体行動価値関数が最大になるように学習する事前学習ステップと、
　前記事前学習ステップにより学習されたエージェント毎の個別行動価値関数を用いて、仮想ネットワークを前記物理ネットワークに割り当てる割当ステップと
　を備える仮想ネットワーク割当方法。 A virtual network allocation method executed by a control device for allocating a virtual network to a physical network having links and servers by cooperative multi-agent reinforcement learning.
Divide the demand of the virtual network into multiple groups, assign agents to each group, and maximize the individual action value function, which is the action value function for each agent, and the overall action value function, which is the action value function for all agents. Pre-learning steps to learn and
A virtual network allocation method including an allocation step for allocating a virtual network to the physical network by using an individual action value function for each agent learned by the pre-learning step.
　コンピュータを、請求項１ないし３のうちいずれか１項に記載の制御装置における各部として機能させるためのプログラム。 A program for making a computer function as each part in the control device according to any one of claims 1 to 3.