1. An SFC deployment method based on multi-agent reinforcement learning, characterized in that the method comprises the following steps:
S1: in a network function virtualization scenario, designing an overload penalty mechanism based on the node capacity proportion, under which nodes are subjected to reservation monitoring and an excessive-use penalty is applied; establishing a mathematical model that minimizes the network overload penalty, the end-to-end average delay and the deployment cost; and converting the service function chain deployment optimization problem into a Markov decision process to be solved;
S2: establishing a multi-agent service orchestration scheme based on user partitioning, wherein the multi-agent framework follows a strategy of centralized training and distributed execution;
S3: designing a central attention mechanism with multiple attention heads, combining information attended to in different subspaces;
S4: each decision maker adopts the soft actor-critic (SAC) algorithm, improving the exploration and robustness of the agent on the basis of a maximum-entropy reinforcement learning framework, selectively extracting information through the attention mechanism, and realizing credit assignment in combination with an advantage function;
In the step S1, the network function virtualization scenario includes four components: a physical facility layer, a virtualization management layer, a network application layer and a service operation support system. The physical layer provides instantiated physical resources for the network functions, the virtualization layer realizes load analysis of the physical network and execution of the resource allocation strategy, the application layer is responsible for creating and chaining SFCs according to service requirements, and the service operation support system performs real-time monitoring of the network state;
the overload penalty mechanism based on the node capacity ratio is as follows: the overload penalty punishes nodes whose load becomes excessive due to improper resource allocation, so as to improve the uniformity of resource allocation;
The overload penalty at time slot t is expressed as:

$$W(t)=\epsilon_c\sum_{v\in V}\max\left(0,\ \alpha_c-\eta_c^{v}(t)\right)$$

wherein ε_c denotes the unit penalty incurred for an insufficient resource reservation rate, α_c denotes the resource overload warning value of the underlying servers, and η_c^v(t) denotes the CPU resource reservation rate of node v at time slot t, calculated as:

$$\eta_c^{v}(t)=\frac{C_v-a_v(t)}{C_v}$$

wherein C_v is the total CPU capacity of node v and a_v(t) denotes the CPU resource allocated on the v-th server at time slot t, expressed as:

$$a_v(t)=\sum_{i}\sum_{j}\delta_{i,j}^{v}(t)\,c_{i,j}(t)$$

wherein δ_{i,j}^v(t) ∈ {0, 1} indicates whether the j-th VNF of the i-th SFC is mapped onto node v and c_{i,j}(t) is the CPU resource allocated to it;
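These definitions translate directly into a short computation; the following is a minimal Python sketch (the array names `alloc` and `capacity` and the default values of ε_c and α_c are illustrative assumptions):

```python
import numpy as np

def overload_penalty(alloc, capacity, eps_c=1.0, alpha_c=0.2):
    """Overload penalty W(t) for one time slot.

    alloc[v]    : CPU resource a_v(t) allocated on server v
    capacity[v] : total CPU capacity C_v of server v
    alpha_c     : resource overload warning value (minimum reservation rate)
    eps_c       : unit penalty for an insufficient reservation rate
    """
    eta_c = (capacity - alloc) / capacity          # reservation rate of each node
    shortfall = np.maximum(0.0, alpha_c - eta_c)   # only under-reserved nodes are penalized
    return eps_c * shortfall.sum()

# Example: the second server keeps only 10% of its CPU in reserve
print(overload_penalty(np.array([40.0, 90.0]), np.array([100.0, 100.0])))  # 0.1
```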
The end-to-end delay of the i-th SFC is divided into the VNF processing delay P_i and the link communication delay T_i;
At time slot t, P_i is expressed as:

$$P_i(t)=\sum_{j}\frac{m_i}{\beta\,c_{i,j}(t)}$$

wherein m_i denotes the size of a data packet on the i-th SFC, and the service rate coefficient β denotes the data volume that a single CPU can process per second;
T_i denotes the total link communication delay of the i-th SFC, which depends on the link mapping and is expressed as:

$$T_i(t)=\sum_{l\in L_i}\left(\frac{m_i}{b_i^{l}(t)}+\psi\right)$$

wherein L_i is the set of physical links traversed by the i-th SFC, b_i^l(t) is the bandwidth allocated to the chain on link l, and ψ denotes the delay caused by queuing and scheduling of the data packets;
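A minimal sketch of this delay model in Python (the argument names are illustrative; β and ψ take assumed values):

```python
def end_to_end_delay(m_i, cpu_alloc, bw_alloc, beta=1e6, psi=1e-4):
    """End-to-end delay P_i(t) + T_i(t) of one SFC at a time slot.

    m_i       : data packet size on the chain
    cpu_alloc : CPU allocated to each VNF of the chain, c_{i,j}(t)
    bw_alloc  : bandwidth allocated on each traversed link, b_i^l(t)
    beta      : service-rate coefficient (data processed per CPU per second)
    psi       : per-link queuing/scheduling delay
    """
    P_i = sum(m_i / (beta * c) for c in cpu_alloc)  # VNF processing delay
    T_i = sum(m_i / b + psi for b in bw_alloc)      # link communication delay
    return P_i + T_i

print(end_to_end_delay(m_i=1500.0, cpu_alloc=[2.0, 4.0], bw_alloc=[1e7, 1e7]))
```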
The cost of SFC deployment includes two parts: one from the processing of VNFs by the underlying network servers and the other from the use of link bandwidth. The processing cost Z_i^p consists of a dynamic cost and a static cost: the dynamic cost is the variable cost generated by CPU operation, with positive coefficient ε_p; the static cost is the cost of activating a VNF on any server, denoted by the positive constant ε_s. Z_i^p is therefore expressed as:

$$Z_i^{p}(t)=\sum_{j}\left(\epsilon_p\,c_{i,j}(t)+\epsilon_s\right)$$

The physical link bandwidth usage cost Z_i^b is proportional to the occupied physical link bandwidth, with the unit bandwidth overhead coefficient denoted by the positive number ε_b, so Z_i^b is expressed as:

$$Z_i^{b}(t)=\epsilon_b\sum_{l\in L_i}b_i^{l}(t)$$

The total deployment cost Z_i of the i-th SFC at time slot t is then:

$$Z_i(t)=Z_i^{p}(t)+Z_i^{b}(t)$$
A joint optimization model of SFC deployment is established; to place delay, cost and penalty on a unified scale, each part is normalized, and the utility function is designed as:

$$U(t)=\omega_1\hat{D}(t)+\omega_2\hat{Z}(t)+\omega_3\hat{W}(t)$$

wherein \hat{D}(t), \hat{Z}(t) and \hat{W}(t) denote the normalized average end-to-end delay, total deployment cost and overload penalty, respectively, and ω_1, ω_2, ω_3 are non-negative weights;
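A sketch of the cost and utility computations (Python; the normalizers `norms` and the equal weights are assumptions, since the claim only states that the three terms are normalized to a common unit):

```python
def deployment_cost(cpu_alloc, bw_alloc, eps_p=0.5, eps_s=1.0, eps_b=0.1):
    """Total deployment cost Z_i(t) = Z_i^p(t) + Z_i^b(t) of one SFC."""
    z_proc = sum(eps_p * c + eps_s for c in cpu_alloc)  # dynamic CPU cost + static activation cost
    z_band = eps_b * sum(bw_alloc)                      # bandwidth usage cost
    return z_proc + z_band

def utility(delay, cost, penalty, norms=(1.0, 1.0, 1.0), w=(1/3, 1/3, 1/3)):
    """Weighted sum U(t) of the normalized delay, cost and overload penalty."""
    d_hat, z_hat, w_hat = delay / norms[0], cost / norms[1], penalty / norms[2]
    return w[0] * d_hat + w[1] * z_hat + w[2] * w_hat
```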
The SFC deployment problem is converted into an MDP model represented by the 4-tuple M = ⟨S, A, P, R⟩, wherein S is the state space, A is the action space, P is the state transition probability and R is the reward function;
The state space comprises the SFC mapping states K_i(t), the residual computing resource rate η_c^v(t) of each node and the residual bandwidth resource ratio η_b^l(t) of each physical link; s(t) ∈ S is expressed as s(t) = {K(t), η_c(t), η_b(t)}, wherein K(t) = [K_i(t)], η_c(t) = [η_c^v(t)] and η_b(t) = [η_b^l(t)];
The action space comprises the mapping of all VNFs of each chain, the node CPU resource allocation and the link bandwidth resource allocation; a(t) ∈ A is expressed as a(t) = {δ(t), C(t), B(t)}, wherein δ(t) collects the VNF-to-node mapping variables δ_{i,j}^v(t), C(t) the CPU allocations c_{i,j}(t), and B(t) the bandwidth allocations b_i^l(t);
After taking action a(t) in state s(t) at time slot t, the network transfers to state s(t+1) at the next time slot; the state transition probability of this process is defined as p(s(t+1) | s(t), a(t));
The goal is to minimize the SFC end-to-end average delay, the network deployment cost and the overload penalty; the reward function is therefore defined as R(t) = [U(t)]^{-1}, so that minimizing the utility maximizes the reward;
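The MDP interface can be summarized as follows (a Python sketch; only the state/action packaging and the reward from the claim are shown, and all class and field names are illustrative):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class State:
    K: np.ndarray      # SFC mapping states K_i(t)
    eta_c: np.ndarray  # residual CPU resource rate of each node
    eta_b: np.ndarray  # residual bandwidth ratio of each physical link

@dataclass
class Action:
    delta: np.ndarray  # VNF-to-node mapping variables delta_{i,j}^v(t)
    C: np.ndarray      # node CPU allocations c_{i,j}(t)
    B: np.ndarray      # link bandwidth allocations b_i^l(t)

def reward(utility_t: float) -> float:
    # Minimizing the utility U(t) maximizes the reward R(t) = [U(t)]^(-1)
    return 1.0 / utility_t
```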
the multi-agent service orchestration scheme based on user partitioning is as follows:
users with different service requirements are treated as different agents, numbered i ∈ {1, 2, ..., N}; based on its local observation o_i(t) = {K_i(t), η_c^i(t), η_b^i(t)} and independently of the other agents, each agent takes an action a_i(t) = {δ_i(t), C_i(t), B_i(t)} and obtains a private reward r_i(t), thereby interacting continually with the environment; each agent learns a policy π_i: O_i → P(A_i);
The resource capacity of each server in the network differs and is represented by the total area of a rectangle, and each service request is represented by a different SFC; lines of various shapes are assigned to the different agents to display the network resource load and each agent's resource occupation, wherein the blank part of a rectangle represents the reserved, unused portion of a network resource, the amount of CPU resource allocated to an agent by each server is represented by the area it occupies, and the amount of bandwidth allocated on a physical link is represented by the length of the corresponding line;
the agents cooperate with each other to serve arriving requests; each agent can access all resources in the environment and select appropriate network resources to meet its own service needs, and the common goal of all agents is to maximize the cumulative shared reward;
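Per-agent bookkeeping under this user partitioning might look as follows (a sketch; the class and method names are hypothetical):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

@dataclass
class Agent:
    idx: int                 # agent number i
    policy: Callable         # learned policy pi_i : O_i -> P(A_i)
    buffer: List[Tuple] = field(default_factory=list)

    def act(self, obs_i):
        # obs_i = {K_i(t), eta_c^i(t), eta_b^i(t)} -> a_i(t) = {delta_i(t), C_i(t), B_i(t)}
        return self.policy(obs_i)

    def store(self, obs_i, a_i, r_i, next_obs_i):
        # private reward r_i(t) from continual interaction with the environment
        self.buffer.append((obs_i, a_i, r_i, next_obs_i))
```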
The central attention mechanism with multiple attention heads proceeds through the following steps:
Step 1): the model aggregation node transmits initial parameters;
Step 2): selecting actions of each agent in the local reinforcement learning area, and carrying out VNF placement and resource allocation of node CPU and link bandwidth;
Step 3): each agent performs local SAC algorithm training, and the mapping function of each agent is collected and uploaded to the central attention mechanism;
Step 4): the training contribution value of each agent is obtained through the query/key system of the multi-head attention;
Step 5): the local model aggregates the contribution values with its own mapping value to obtain attention-weighted observation-action values;
The central attention mechanism follows the paradigm of centrally trained critics and distributed execution of policies, so that a specific agent can selectively attend to information from the other agents: the distributed actor networks take actions and obtain the corresponding observations, and shared training is performed after local information mapping. To combine information attended to in different subspaces, the contributions of all heads are concatenated, denoted Mhead = Concat(head_1, head_2, ..., head_h) W^o; each head uses an independent set of parameters (W^q, W^k, V) and yields the aggregate contribution x_i of the other agents to a particular agent i, where V is a shared value matrix. The observation-action value function of agent i depends on the contributions of the other agents in addition to its own observation and action, and is expressed as:

$$Q_i^{\psi}(o,a)=f_i\left(g_i(o_i,a_i),\,x_i\right)$$
wherein g_i and f_i are both multi-layer perceptron mapping functions, (o_i, a_i) denote the VNF mapping, CPU and bandwidth resource allocation actions taken by agent i together with the observation it obtains from the SFC deployment environment, and x_i is the aggregate contribution of the agents other than i, expressed as a weighted sum of their contributions, i.e.:

$$x_i=\sum_{j\neq i}\alpha_j v_j=\sum_{j\neq i}\alpha_j\,h\left(Vg_j(o_j,a_j)\right)$$
wherein \i denotes the set of agents other than i, and v_j denotes the value contributed by agent j: its observation and action are encoded by the mapping function g_j and then linearly transformed by the shared matrix V, with h denoting an element-wise nonlinearity; α_j denotes the attention weight, obtained through a bilinear mapping (i.e., the query-key system) that passes the correlation between the embeddings e_j and e_i through a normalized exponential (softmax) function, i.e.:

$$\alpha_j\propto\exp\left(e_j^{\top}W_k^{\top}W_q e_i\right)$$
wherein e_j = g_j(o_j, a_j), j = 1, ..., N, and W_q and W_k are parameters within each attention head that transform the embeddings e_i and e_j into the query and the keys, respectively; the query and keys are then input into a scaled dot-product module, which rescales the product by the dimension of the two matrices, and the weight of each value is finally obtained through the softmax module;
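The per-head computation described above can be sketched in a few lines of Python/NumPy (a minimal single-head illustration with assumed dimensions; h is taken to be a leaky ReLU, one common choice for the element-wise nonlinearity):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def h(x, slope=0.01):
    # element-wise nonlinearity applied to the value V @ e_j (leaky ReLU assumed)
    return np.where(x > 0, x, slope * x)

def attention_contribution(e, i, Wq, Wk, V):
    """Aggregate contribution x_i of the other agents to agent i (one head).

    e  : (N, d) stacked embeddings e_j = g_j(o_j, a_j)
    Wq : maps e_i to the query; Wk : maps each e_j to a key; V : shared value matrix
    """
    query = Wq @ e[i]
    others = [j for j in range(len(e)) if j != i]
    keys = np.stack([Wk @ e[j] for j in others])
    scores = keys @ query / np.sqrt(Wk.shape[0])   # scaled dot product
    alpha = softmax(scores)                        # attention weights alpha_j
    values = np.stack([h(V @ e[j]) for j in others])
    return alpha @ values                          # weighted sum = x_i for this head

# Heads use independent (Wq, Wk, V) and are concatenated: Concat(head_1..head_h) @ Wo
rng = np.random.default_rng(0)
e = rng.normal(size=(4, 8))
x_i = attention_contribution(e, 0, rng.normal(size=(8, 8)),
                             rng.normal(size=(8, 8)), rng.normal(size=(8, 8)))
print(x_i.shape)  # (8,)
```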
All critic networks are updated jointly, with the goal of minimizing the joint regression loss function, i.e.:

$$L_Q(\psi)=\sum_{i=1}^{N}\mathbb{E}_{(o,a,r,o')\sim D}\left[\left(Q_i^{\psi}(o,a)-y_i\right)^{2}\right]$$
wherein D is the replay buffer storing past experience, Q_i^ψ(o, a) is the estimated action value of agent i, which is obtained through the attention mechanism, and y_i is the target value, expressed as:

$$y_i=r_i+\gamma\,\mathbb{E}_{a'\sim\pi_{\bar{\theta}}(o')}\left[Q_i^{\bar{\psi}}(o',a')-\alpha\log\pi_{\bar{\theta}_i}(a_i'\mid o_i')\right]$$
α serves as the soft temperature parameter, effectively balancing the importance of the SFC deployment reward against the maximum-entropy term; ψ̄ and θ̄ are the parameters of the target critic network and the target actor network, respectively. The target networks are updated in a soft manner, i.e.:

$$\bar{\psi}\leftarrow\tau\psi+(1-\tau)\bar{\psi},\qquad\bar{\theta}\leftarrow\tau\theta+(1-\tau)\bar{\theta}$$

wherein τ ∈ (0, 1) is the soft-update rate;
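A simplified PyTorch-style sketch of this joint critic update and soft target update (the module interfaces are assumptions; in the full scheme each attention critic consumes all agents' next observation-action pairs, which is abbreviated here):

```python
import torch
import torch.nn.functional as F

def critic_update(critics, target_critics, target_actors, batch,
                  optimizer, gamma=0.99, alpha=0.2, tau=0.005):
    """Minimize the joint regression loss over all critics, then soft-update targets."""
    obs, acts, rews, next_obs = batch       # per-agent lists of tensors
    loss = 0.0
    for i, critic in enumerate(critics):
        with torch.no_grad():
            next_a, next_logp = target_actors[i].sample(next_obs[i])
            # SAC-style target: reward + discounted soft value of the next state
            y_i = rews[i] + gamma * (target_critics[i](next_obs, next_a)
                                     - alpha * next_logp)
        loss = loss + F.mse_loss(critic(obs, acts[i]), y_i)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Soft update: psi_bar <- tau*psi + (1 - tau)*psi_bar
    for net, target in zip(critics, target_critics):
        for p, p_bar in zip(net.parameters(), target.parameters()):
            p_bar.data.mul_(1.0 - tau).add_(tau * p.data)
```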
To solve the credit assignment problem among agents, an advantage function is introduced into agent learning: starting from the observation-action value function Q_i^ψ(o, a), the action of a given agent is marginalized out and the result is compared with the original value, so as to determine whether the increase in reward is attributable to the other agents. The advantage function is expressed as:

$$A_i(o,a)=Q_i^{\psi}(o,a)-b\left(o,a_{\backslash i}\right)$$
wherein b(o, a_{\i}) is the multi-agent baseline: holding the actions of the other agents fixed, the specific action of the given agent is replaced by the expectation over its possible actions, i.e.:

$$b\left(o,a_{\backslash i}\right)=\mathbb{E}_{a_i\sim\pi_i(o_i)}\left[Q_i^{\psi}\left(o,(a_i,a_{\backslash i})\right)\right]=\sum_{a_i'\in A_i}\pi_i(a_i'\mid o_i)\,Q_i^{\psi}\left(o,(a_i',a_{\backslash i})\right)$$
The actor network policy of each agent is updated by gradient ascent, with the gradient computed as:

$$\nabla_{\theta_i}J(\pi_{\theta})=\mathbb{E}_{o\sim D,\,a\sim\pi}\left[\nabla_{\theta_i}\log\pi_{\theta_i}(a_i\mid o_i)\left(-\alpha\log\pi_{\theta_i}(a_i\mid o_i)+A_i(o,a)\right)\right]$$
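A sketch of the counterfactual baseline and the corresponding actor surrogate loss (Python; `q_fn`, `pi_i` and `action_set` are assumed callables/containers, and the tensors are assumed to be PyTorch tensors over a discrete action set):

```python
def advantage(q_fn, obs, acts, i, pi_i, action_set):
    """Counterfactual advantage A_i(o, a) = Q_i(o, a) - b(o, a_\\i).

    The baseline replaces agent i's action with the expectation over its
    possible actions while keeping the other agents' actions fixed.
    """
    q = q_fn(obs, acts)                 # Q_i with the joint action actually taken
    probs = pi_i(obs[i])                # pi_i(. | o_i) over the discrete action_set
    baseline = 0.0
    for a_alt, p in zip(action_set, probs):
        acts_alt = list(acts)
        acts_alt[i] = a_alt             # marginalize out agent i's action only
        baseline = baseline + p * q_fn(obs, acts_alt)
    return q - baseline

def actor_loss(logp_i, adv_i, alpha=0.2):
    # Surrogate whose gradient matches E[grad log pi * (-alpha*log pi + A_i)];
    # minimizing it performs gradient ascent on the policy objective
    return -(logp_i * (adv_i - alpha * logp_i).detach()).mean()
```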
the step S4 specifically comprises the following steps:
S41: the network services are grouped and chained, and then divided into several partitions according to user classification;
S42: the SFC deployment environment is reset and the private observation of each decision maker is initialized;
S43: each agent selects actions in its local reinforcement learning area, carrying out VNF placement and resource allocation of node CPU and link bandwidth, obtains the local observation and decision reward of the SFC deployment, and stores each environment transition in the buffer pool;
S44: each agent performs local SAC algorithm training, and the mapping function of each agent is collected and uploaded to the central attention mechanism to obtain the contribution values;
S45: the local model aggregates the contribution values with its own mapping value to obtain attention-weighted observation-action values;
S46: the joint loss function is calculated and the critic networks are updated with Adam; the advantage function is calculated and the actor networks are updated with Adam;
S47: steps S42-S46 are repeated until the models of all decision makers converge or the episode limit is reached, as sketched in the loop below.