1. An SFC deployment method based on multi-agent reinforcement learning, characterized in that the method comprises the following steps:
S1: in a network function virtualization scenario, designing an overload penalty mechanism based on the node capacity proportion, under which nodes are subjected to reservation monitoring and an excessive-use penalty is applied; establishing a mathematical model that minimizes the network overload penalty, the end-to-end average delay and the deployment cost; and converting the service function chain deployment optimization problem into a Markov decision process to be solved;
S2: establishing a multi-agent service orchestration scheme based on user partitioning, wherein the multi-agent framework follows a strategy of centralized training and distributed execution;
S3: designing a central attention mechanism with multiple attention heads, combining information attended to in different subspaces;
S4: each decision maker adopts the soft actor-critic (SAC) algorithm, improving the exploration and robustness of the agent on the basis of a maximum-entropy reinforcement learning framework, selectively extracting information through the attention mechanism, and realizing credit assignment in combination with an advantage function;
In the step S1, the network function virtualization scenario includes four components: a physical facility layer, a virtualization management layer, a network application layer and a service operation support system. The physical layer provides instantiated physical resources for the network functions, the virtualization layer realizes load analysis of the physical network and execution of the resource allocation strategy, the application layer is responsible for creating and chaining SFCs according to service requirements, and the service operation support system performs real-time monitoring of the network state;
the overload penalty mechanism based on the node capacity ratio is as follows: the overload penalty punishes nodes whose load becomes excessive due to improper resource allocation, so as to improve the uniformity of resource allocation;
The overload penalty at time slot t is expressed as:

$$W(t)=\epsilon_c\sum_{v\in V}\max\left(0,\ \alpha_c-\eta_c^{v}(t)\right)$$

wherein ε_c denotes the unit penalty incurred for an insufficient resource reservation rate, α_c denotes the resource overload warning value of the underlying servers, and η_c^v(t) denotes the CPU resource reservation rate of node v at time slot t, calculated as:

$$\eta_c^{v}(t)=\frac{C_v-a_v(t)}{C_v}$$

wherein C_v is the total CPU capacity of node v and a_v(t) denotes the CPU resource allocated on the v-th server at time slot t, expressed as:

$$a_v(t)=\sum_{i}\sum_{j}\delta_{i,j}^{v}(t)\,c_{i,j}(t)$$

wherein δ_{i,j}^v(t) ∈ {0, 1} indicates whether the j-th VNF of the i-th SFC is mapped onto node v and c_{i,j}(t) is the CPU resource allocated to it;
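These definitions translate directly into a short computation; the following is a minimal Python sketch (the array names `alloc` and `capacity` and the default values of ε_c and α_c are illustrative assumptions):

```python
import numpy as np

def overload_penalty(alloc, capacity, eps_c=1.0, alpha_c=0.2):
    """Overload penalty W(t) for one time slot.

    alloc[v]    : CPU resource a_v(t) allocated on server v
    capacity[v] : total CPU capacity C_v of server v
    alpha_c     : resource overload warning value (minimum reservation rate)
    eps_c       : unit penalty for an insufficient reservation rate
    """
    eta_c = (capacity - alloc) / capacity          # reservation rate of each node
    shortfall = np.maximum(0.0, alpha_c - eta_c)   # only under-reserved nodes are penalized
    return eps_c * shortfall.sum()

# Example: the second server keeps only 10% of its CPU in reserve
print(overload_penalty(np.array([40.0, 90.0]), np.array([100.0, 100.0])))  # 0.1
```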
The end-to-end delay of the i-th SFC is divided into the VNF processing delay P_i and the link communication delay T_i;
At time slot t, P_i is expressed as:

$$P_i(t)=\sum_{j}\frac{m_i}{\beta\,c_{i,j}(t)}$$

wherein m_i denotes the size of a data packet on the i-th SFC, and the service rate coefficient β denotes the data volume that a single CPU can process per second;
T_i denotes the total link communication delay of the i-th SFC, which depends on the link mapping and is expressed as:

$$T_i(t)=\sum_{l\in L_i}\left(\frac{m_i}{b_i^{l}(t)}+\psi\right)$$

wherein L_i is the set of physical links traversed by the i-th SFC, b_i^l(t) is the bandwidth allocated to the chain on link l, and ψ denotes the delay caused by queuing and scheduling of the data packets;
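A minimal sketch of this delay model in Python (the argument names are illustrative; β and ψ take assumed values):

```python
def end_to_end_delay(m_i, cpu_alloc, bw_alloc, beta=1e6, psi=1e-4):
    """End-to-end delay P_i(t) + T_i(t) of one SFC at a time slot.

    m_i       : data packet size on the chain
    cpu_alloc : CPU allocated to each VNF of the chain, c_{i,j}(t)
    bw_alloc  : bandwidth allocated on each traversed link, b_i^l(t)
    beta      : service-rate coefficient (data processed per CPU per second)
    psi       : per-link queuing/scheduling delay
    """
    P_i = sum(m_i / (beta * c) for c in cpu_alloc)  # VNF processing delay
    T_i = sum(m_i / b + psi for b in bw_alloc)      # link communication delay
    return P_i + T_i

print(end_to_end_delay(m_i=1500.0, cpu_alloc=[2.0, 4.0], bw_alloc=[1e7, 1e7]))
```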
The cost of SFC deployment includes two parts: one from the processing of VNFs by the underlying network servers and the other from the use of link bandwidth. The processing cost Z_i^p consists of a dynamic cost and a static cost: the dynamic cost is the variable cost generated by CPU operation, with positive coefficient ε_p; the static cost is the cost of activating a VNF on any server, denoted by the positive constant ε_s. Z_i^p is therefore expressed as:

$$Z_i^{p}(t)=\sum_{j}\left(\epsilon_p\,c_{i,j}(t)+\epsilon_s\right)$$

The physical link bandwidth usage cost Z_i^b is proportional to the occupied physical link bandwidth, with the unit bandwidth overhead coefficient denoted by the positive number ε_b, so Z_i^b is expressed as:

$$Z_i^{b}(t)=\epsilon_b\sum_{l\in L_i}b_i^{l}(t)$$

The total deployment cost Z_i of the i-th SFC at time slot t is then:

$$Z_i(t)=Z_i^{p}(t)+Z_i^{b}(t)$$
A joint optimization model of SFC deployment is established; to place delay, cost and penalty on a unified scale, each part is normalized, and the utility function is designed as:

$$U(t)=\omega_1\hat{D}(t)+\omega_2\hat{Z}(t)+\omega_3\hat{W}(t)$$

wherein \hat{D}(t), \hat{Z}(t) and \hat{W}(t) denote the normalized average end-to-end delay, total deployment cost and overload penalty, respectively, and ω_1, ω_2, ω_3 are non-negative weights;
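A sketch of the cost and utility computations (Python; the normalizers `norms` and the equal weights are assumptions, since the claim only states that the three terms are normalized to a common unit):

```python
def deployment_cost(cpu_alloc, bw_alloc, eps_p=0.5, eps_s=1.0, eps_b=0.1):
    """Total deployment cost Z_i(t) = Z_i^p(t) + Z_i^b(t) of one SFC."""
    z_proc = sum(eps_p * c + eps_s for c in cpu_alloc)  # dynamic CPU cost + static activation cost
    z_band = eps_b * sum(bw_alloc)                      # bandwidth usage cost
    return z_proc + z_band

def utility(delay, cost, penalty, norms=(1.0, 1.0, 1.0), w=(1/3, 1/3, 1/3)):
    """Weighted sum U(t) of the normalized delay, cost and overload penalty."""
    d_hat, z_hat, w_hat = delay / norms[0], cost / norms[1], penalty / norms[2]
    return w[0] * d_hat + w[1] * z_hat + w[2] * w_hat
```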
The SFC deployment problem is converted into an MDP model represented by the 4-tuple M = ⟨S, A, P, R⟩, wherein S is the state space, A is the action space, P is the state transition probability and R is the reward function;
The state space comprises the SFC mapping states K_i(t), the residual computing resource rate η_c^v(t) of each node and the residual bandwidth resource ratio η_b^l(t) of each physical link; s(t) ∈ S is expressed as s(t) = {K(t), η_c(t), η_b(t)}, wherein K(t) = [K_i(t)], η_c(t) = [η_c^v(t)] and η_b(t) = [η_b^l(t)];
The action space comprises the mapping of all VNFs of each chain, the node CPU resource allocation and the link bandwidth resource allocation; a(t) ∈ A is expressed as a(t) = {δ(t), C(t), B(t)}, wherein δ(t) collects the VNF-to-node mapping variables δ_{i,j}^v(t), C(t) the CPU allocations c_{i,j}(t), and B(t) the bandwidth allocations b_i^l(t);
After taking action a(t) in state s(t) at time slot t, the network transfers to state s(t+1) at the next time slot; the state transition probability of this process is defined as p(s(t+1) | s(t), a(t));
The goal is to minimize the SFC end-to-end average delay, the network deployment cost and the overload penalty; the reward function is therefore defined as R(t) = [U(t)]^{-1}, so that minimizing the utility maximizes the reward;
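The MDP interface can be summarized as follows (a Python sketch; only the state/action packaging and the reward from the claim are shown, and all class and field names are illustrative):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    K: np.ndarray      # SFC mapping states K_i(t)
    eta_c: np.ndarray  # residual CPU resource rate of each node
    eta_b: np.ndarray  # residual bandwidth ratio of each physical link

@dataclass
class Action:
    delta: np.ndarray  # VNF-to-node mapping variables delta_{i,j}^v(t)
    C: np.ndarray      # node CPU allocations c_{i,j}(t)
    B: np.ndarray      # link bandwidth allocations b_i^l(t)

def reward(utility_t: float) -> float:
    # Minimizing the utility U(t) maximizes the reward R(t) = [U(t)]^(-1)
    return 1.0 / utility_t
```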
the multi-agent service orchestration scheme based on user partitioning is as follows:
users with different service requirements are treated as different agents, numbered i ∈ {1, 2, ..., N}; based on its local observation o_i(t) = {K_i(t), η_c^i(t), η_b^i(t)} and independently of the other agents, each agent takes an action a_i(t) = {δ_i(t), C_i(t), B_i(t)} and obtains a private reward r_i(t), thereby interacting continually with the environment; each agent learns a policy π_i: O_i → P(A_i);
The resource capacity of each server in the network differs and is represented by the total area of a rectangle, and each service request is represented by a different SFC; lines of various shapes are assigned to the different agents to display the network resource load and each agent's resource occupation, wherein the blank part of a rectangle represents the reserved, unused portion of a network resource, the amount of CPU resource allocated to an agent by each server is represented by the area it occupies, and the amount of bandwidth allocated on a physical link is represented by the length of the corresponding line;
the agents cooperate with each other to serve arriving requests; each agent can access all resources in the environment and select appropriate network resources to meet its own service needs, and the common goal of all agents is to maximize the cumulative shared reward;
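Per-agent bookkeeping under this user partitioning might look as follows (a sketch; the class and method names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Agent:
    idx: int                 # agent number i
    policy: Callable         # learned policy pi_i : O_i -> P(A_i)
    buffer: List[Tuple] = field(default_factory=list)

    def act(self, obs_i):
        # obs_i = {K_i(t), eta_c^i(t), eta_b^i(t)} -> a_i(t) = {delta_i(t), C_i(t), B_i(t)}
        return self.policy(obs_i)

    def store(self, obs_i, a_i, r_i, next_obs_i):
        # private reward r_i(t) from continual interaction with the environment
        self.buffer.append((obs_i, a_i, r_i, next_obs_i))
```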
The central attention mechanism with multiple attention heads proceeds through the following steps:
Step 1): the model aggregation node transmits initial parameters;
Step 2): selecting actions of each agent in the local reinforcement learning area, and carrying out VNF placement and resource allocation of node CPU and link bandwidth;
Step 3): each agent performs local SAC algorithm training, and the mapping function of each agent is collected and uploaded to the central attention mechanism;
Step 4): the training contribution value of each agent is obtained through the query/key system of the multi-head attention;
Step 5): the local model aggregates the contribution values with its own mapping value to obtain attention-weighted observation-action values;
The central attention mechanism follows the paradigm of centrally trained critics and distributed execution of policies, so that a specific agent can selectively attend to information from the other agents: the distributed actor networks take actions and obtain the corresponding observations, and shared training is performed after local information mapping. To combine information attended to in different subspaces, the contributions of all heads are concatenated, denoted Mhead = Concat(head_1, head_2, ..., head_h) W^o; each head uses an independent set of parameters (W^q, W^k, V) and yields the aggregate contribution x_i of the other agents to a particular agent i, where V is a shared value matrix. The observation-action value function of agent i depends on the contributions of the other agents in addition to its own observation and action, and is expressed as:

$$Q_i^{\psi}(o,a)=f_i\left(g_i(o_i,a_i),\,x_i\right)$$
wherein g_i and f_i are both multi-layer perceptron mapping functions, (o_i, a_i) denote the VNF mapping, CPU and bandwidth resource allocation actions taken by agent i together with the observation it obtains from the SFC deployment environment, and x_i is the aggregate contribution of the agents other than i, expressed as a weighted sum of their contributions, i.e.:

$$x_i=\sum_{j\neq i}\alpha_j v_j=\sum_{j\neq i}\alpha_j\,h\left(Vg_j(o_j,a_j)\right)$$
wherein \i denotes the set of agents other than i, and v_j denotes the value contributed by agent j: its observation and action are encoded by the mapping function g_j and then linearly transformed by the shared matrix V, with h denoting an element-wise nonlinearity; α_j denotes the attention weight, obtained through a bilinear mapping (i.e., the query-key system) that passes the correlation between the embeddings e_j and e_i through a normalized exponential (softmax) function, i.e.:

$$\alpha_j\propto\exp\left(e_j^{\top}W_k^{\top}W_q e_i\right)$$
wherein e_j = g_j(o_j, a_j), j = 1, ..., N, and W_q and W_k are parameters within each attention head that transform the embeddings e_i and e_j into the query and the keys, respectively; the query and keys are then input into a scaled dot-product module, which rescales the product by the dimension of the two matrices, and the weight of each value is finally obtained through the softmax module;
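The per-head computation described above can be sketched in a few lines of Python/NumPy (a minimal single-head illustration with assumed dimensions; h is taken to be a leaky ReLU, one common choice for the element-wise nonlinearity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def h(x, slope=0.01):
    # element-wise nonlinearity applied to the value V @ e_j (leaky ReLU assumed)
    return np.where(x > 0, x, slope * x)

def attention_contribution(e, i, Wq, Wk, V):
    """Aggregate contribution x_i of the other agents to agent i (one head).

    e  : (N, d) stacked embeddings e_j = g_j(o_j, a_j)
    Wq : maps e_i to the query; Wk : maps each e_j to a key; V : shared value matrix
    """
    query = Wq @ e[i]
    others = [j for j in range(len(e)) if j != i]
    keys = np.stack([Wk @ e[j] for j in others])
    scores = keys @ query / np.sqrt(Wk.shape[0])   # scaled dot product
    alpha = softmax(scores)                        # attention weights alpha_j
    values = np.stack([h(V @ e[j]) for j in others])
    return alpha @ values                          # weighted sum = x_i for this head

# Heads use independent (Wq, Wk, V) and are concatenated: Concat(head_1..head_h) @ Wo
rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))
x_i = attention_contribution(e, 0, rng.normal(size=(8, 8)),
                             rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(x_i.shape)  # (8,)
```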
All critic networks are updated jointly, with the goal of minimizing the joint regression loss function, i.e.:

$$L_Q(\psi)=\sum_{i=1}^{N}\mathbb{E}_{(o,a,r,o')\sim D}\left[\left(Q_i^{\psi}(o,a)-y_i\right)^{2}\right]$$
wherein D is the replay buffer storing past experience, Q_i^ψ(o, a) is the estimated action value of agent i, which is obtained through the attention mechanism, and y_i is the target value, expressed as:

$$y_i=r_i+\gamma\,\mathbb{E}_{a'\sim\pi_{\bar{\theta}}(o')}\left[Q_i^{\bar{\psi}}(o',a')-\alpha\log\pi_{\bar{\theta}_i}(a_i'\mid o_i')\right]$$
α serves as the soft temperature parameter, effectively balancing the importance of the SFC deployment reward against the maximum-entropy term; ψ̄ and θ̄ are the parameters of the target critic network and the target actor network, respectively. The target networks are updated in a soft manner, i.e.:

$$\bar{\psi}\leftarrow\tau\psi+(1-\tau)\bar{\psi},\qquad\bar{\theta}\leftarrow\tau\theta+(1-\tau)\bar{\theta}$$

wherein τ ∈ (0, 1) is the soft-update rate;
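A simplified PyTorch-style sketch of this joint critic update and soft target update (the module interfaces are assumptions; in the full scheme each attention critic consumes all agents' next observation-action pairs, which is abbreviated here):

```python
import torch
import torch.nn.functional as F

def critic_update(critics, target_critics, target_actors, batch,
                  optimizer, gamma=0.99, alpha=0.2, tau=0.005):
    """Minimize the joint regression loss over all critics, then soft-update targets."""
    obs, acts, rews, next_obs = batch       # per-agent lists of tensors
    loss = 0.0
    for i, critic in enumerate(critics):
        with torch.no_grad():
            next_a, next_logp = target_actors[i].sample(next_obs[i])
            # SAC-style target: reward + discounted soft value of the next state
            y_i = rews[i] + gamma * (target_critics[i](next_obs, next_a)
                                     - alpha * next_logp)
        loss = loss + F.mse_loss(critic(obs, acts[i]), y_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: psi_bar <- tau*psi + (1 - tau)*psi_bar
    for net, target in zip(critics, target_critics):
        for p, p_bar in zip(net.parameters(), target.parameters()):
            p_bar.data.mul_(1.0 - tau).add_(tau * p.data)
```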
To solve the credit assignment problem among agents, an advantage function is introduced into agent learning: starting from the observation-action value function Q_i^ψ(o, a), the action of a given agent is marginalized out and the result is compared with the original value, so as to determine whether the increase in reward is attributable to the other agents. The advantage function is expressed as:

$$A_i(o,a)=Q_i^{\psi}(o,a)-b\left(o,a_{\backslash i}\right)$$
wherein b(o, a_{\i}) is the multi-agent baseline: holding the actions of the other agents fixed, the specific action of the given agent is replaced by the expectation over its possible actions, i.e.:

$$b\left(o,a_{\backslash i}\right)=\mathbb{E}_{a_i\sim\pi_i(o_i)}\left[Q_i^{\psi}\left(o,(a_i,a_{\backslash i})\right)\right]=\sum_{a_i'\in A_i}\pi_i(a_i'\mid o_i)\,Q_i^{\psi}\left(o,(a_i',a_{\backslash i})\right)$$
The actor network policy of each agent is updated by gradient ascent, with the gradient computed as:

$$\nabla_{\theta_i}J(\pi_{\theta})=\mathbb{E}_{o\sim D,\,a\sim\pi}\left[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i\mid o_i)\left(-\alpha\log\pi_{\theta_i}(a_i\mid o_i)+A_i(o,a)\right)\right]$$
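A sketch of the counterfactual baseline and the corresponding actor surrogate loss (Python; `q_fn`, `pi_i` and `action_set` are assumed callables/containers, and the tensors are assumed to be PyTorch tensors over a discrete action set):

```python
def advantage(q_fn, obs, acts, i, pi_i, action_set):
    """Counterfactual advantage A_i(o, a) = Q_i(o, a) - b(o, a_\\i).

    The baseline replaces agent i's action with the expectation over its
    possible actions while keeping the other agents' actions fixed.
    """
    q = q_fn(obs, acts)                 # Q_i with the joint action actually taken
    probs = pi_i(obs[i])                # pi_i(. | o_i) over the discrete action_set
    baseline = 0.0
    for a_alt, p in zip(action_set, probs):
        acts_alt = list(acts)
        acts_alt[i] = a_alt             # marginalize out agent i's action only
        baseline = baseline + p * q_fn(obs, acts_alt)
    return q - baseline

def actor_loss(logp_i, adv_i, alpha=0.2):
    # Surrogate whose gradient matches E[grad log pi * (-alpha*log pi + A_i)];
    # minimizing it performs gradient ascent on the policy objective
    return -(logp_i * (adv_i - alpha * logp_i).detach()).mean()
```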
the step S4 specifically comprises the following steps:
S41: the network services are grouped and chained, and then divided into several partitions according to user classification;
S42: the SFC deployment environment is reset and the private observation of each decision maker is initialized;
S43: each agent selects actions in its local reinforcement learning area, carrying out VNF placement and resource allocation of node CPU and link bandwidth, obtains the local observation and decision reward of the SFC deployment, and stores each environment transition in the buffer pool;
S44: each agent performs local SAC algorithm training, and the mapping function of each agent is collected and uploaded to the central attention mechanism to obtain the contribution values;
S45: the local model aggregates the contribution values with its own mapping value to obtain attention-weighted observation-action values;
S46: the joint loss function is calculated and the critic networks are updated with Adam; the advantage function is calculated and the actor networks are updated with Adam;
S47: steps S42-S46 are repeated until the models of all decision makers converge or the episode limit is reached, as sketched in the loop below.