CN116112938B - SFC deployment method based on multi-agent reinforcement learning - Google Patents

SFC deployment method based on multi-agent reinforcement learning

Info

Publication number
CN116112938B
CN116112938B (application CN202211467664.XA)
Authority
CN
China
Prior art keywords: agent, network, sfc, function, deployment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211467664.XA
Other languages
Chinese (zh)
Other versions
CN116112938A (en)
Inventor
唐伦
李师锐
杜雨聪
陈前斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Sailei Culture Media Co ltd
Original Assignee
Shenzhen Sailei Culture Media Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Sailei Culture Media Co ltd filed Critical Shenzhen Sailei Culture Media Co ltd
Priority to CN202211467664.XA priority Critical patent/CN116112938B/en
Publication of CN116112938A publication Critical patent/CN116112938A/en
Application granted granted Critical
Publication of CN116112938B publication Critical patent/CN116112938B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/18 Network planning tools
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W16/00 Network planning, e.g. coverage or traffic planning tools; Network deployment, e.g. resource partitioning or cells structures
    • H04W16/22 Traffic simulation tools or models
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/50 Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate

Abstract

The invention relates to an SFC deployment method based on multi-agent reinforcement learning, and belongs to the technical field of mobile communication. The method comprises the following steps: S1: in a network function virtualization scenario, an overload penalty mechanism based on node capacity proportion is designed, node resource reservations are monitored and a penalty is applied for excessive use, a mathematical model that minimizes the network overload penalty, end-to-end average delay and deployment cost is established, and the service function chain deployment optimization problem is converted into a Markov decision process to be solved; S2: a multi-agent service orchestration scheme based on user division is established, in which the multi-agent framework follows a strategy of centralized training and distributed execution; S3: a central attention mechanism with multiple attention heads is designed, combining information focused on different subspaces; S4: each decision maker adopts a soft actor-critic algorithm, and the exploration capability and robustness of the agents are improved by a maximum-entropy reinforcement learning framework.

Description

SFC deployment method based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of mobile communication, and relates to an SFC deployment method based on multi-agent reinforcement learning.
Background
5G will usher in an era in which everything is interconnected, and building a dedicated physical network for the traffic of every scenario would incur high costs. Emerging services are highly varied, with conflicting and extremely diverse requirements, so the traditional one-size-fits-all network approach is no longer viable. Network slicing has therefore attracted great interest in industry: it partitions the same physical network infrastructure into multiple logical networks, each mainly serving one service scenario, and has been proposed as a key enabling technology.
In the Network Function Virtualization (NFV) architecture, virtualized network functions (VNFs) are software instantiations of network functions decoupled from the underlying hardware resources. A service function chain (SFC) is a service request formed by connecting several VNFs in order; the SFC deployment phase requires placing and instantiating the VNFs on the underlying network, together with resource allocation and routing, so as to satisfy specific network service requirements. How to design an efficient deployment scheme is a key challenge in SFC orchestration.
Traditional heuristic methods rely on hand-crafted embedding rules and cannot adapt well to dynamic network structures and environments, so reinforcement learning has attracted attention for the SFC deployment problem. In current research on SFC deployment, network load, SFC end-to-end delay and deployment cost are rarely considered jointly. Moreover, although a few works address SFC deployment through reinforcement learning, solutions based on multi-agent reinforcement learning are scarce, and how to further improve the scalability of the algorithm as the number of service requests grows has not been investigated.
Disclosure of Invention
In view of the above, the present invention aims to provide an SFC deployment method based on multi-agent reinforcement learning that can dynamically and selectively attend to the information yielding larger deployment returns, jointly optimizes the load penalty, delay and deployment cost, effectively improves the balance of resource allocation, and scales well as the number of agents increases.
In order to achieve the above purpose, the present invention provides the following technical solutions:
A service function chain deployment method based on multi-agent reinforcement learning comprises the following steps:
S1: in a network function virtualization scenario, an overload penalty mechanism based on node capacity proportion is designed, node resource reservations are monitored and a penalty is applied for excessive use, a mathematical model that minimizes the network overload penalty, end-to-end average delay and deployment cost is established, and the service function chain deployment optimization problem is converted into a Markov decision process to be solved;
S2: a multi-agent service orchestration scheme based on user division is established, in which the multi-agent framework follows a strategy of centralized training and distributed execution;
S3: a central attention mechanism with multiple attention heads is designed, combining information focused on different subspaces;
S4: each decision maker adopts a soft actor-critic algorithm, improves the exploration capability and robustness of the agent with a maximum-entropy reinforcement learning framework, selectively extracts information through the attention mechanism, and realizes credit assignment by combining an advantage function.
Further, in step S1, the network function virtualization scenario includes four components: a physical facility layer, a virtualization management layer, a network application layer and a service operation support system. The physical layer is the underlying bearer network composed of general-purpose servers and provides the VNFs with physical resources for instantiation; the virtualization layer performs load analysis of the physical network and executes the resource allocation strategy; the application layer is mainly responsible for creating and composing SFCs according to service requirements; and the service operation support system performs real-time monitoring of the network state.
Further, in step S1, the overload penalty refers to a penalty applied to nodes that become overloaded through improper resource allocation, in order to improve the uniformity of resource allocation; because the resource capacity of each node in the network differs, the relative proportion of remaining resources to node capacity is analyzed.
The overload penalty in time slot t is expressed as a function of the per-node CPU reservation rates, where ε_c denotes the unit penalty incurred when the resource reservation rate is insufficient, α_c denotes the resource overload warning value of the underlying servers, and η_c^v(t) denotes the CPU reservation rate of node v in time slot t, namely the proportion of remaining capacity, η_c^v(t) = (C_v - a_v(t))/C_v, where a_v(t) denotes the CPU resources allocated on the v-th server in time slot t, i.e. the total CPU allocated to the VNFs instantiated on it.
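As an illustration of the mechanism described above, the following minimal sketch computes a per-slot overload penalty from the per-node reservation rates. Since the exact functional form of the penalty is not reproduced here, the sketch assumes a linear penalty on the shortfall of each node's reservation rate below the warning value α_c; the function name and default parameter values are illustrative.

```python
import numpy as np

def overload_penalty(allocated, capacity, eps_c=1.0, alpha_c=0.2):
    """Per-slot overload penalty based on each node's CPU reservation rate.

    allocated : CPU resources a_v(t) currently allocated on each server
    capacity  : CPU capacity C_v of each server
    eps_c     : unit penalty for an insufficient reservation rate (epsilon_c)
    alpha_c   : resource overload warning value (alpha_c)
    """
    allocated = np.asarray(allocated, dtype=float)
    capacity = np.asarray(capacity, dtype=float)
    # reservation rate eta_c^v(t): fraction of the node's capacity still free
    eta_c = (capacity - allocated) / capacity
    # assumed form: linear penalty on the shortfall below the warning value
    return float(np.sum(eps_c * np.maximum(0.0, alpha_c - eta_c)))
```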
Further, in step S1, the end-to-end delay of the i-th SFC is divided into the VNF processing delay P_i and the link communication delay T_i.
In time slot t, the processing delay P_i is determined by m_i, the size of the data packets on the i-th SFC, and the service rate coefficient β, which denotes the amount of data a single CPU can process per second.
T_i denotes the total link communication delay of the i-th SFC; it depends on the link mapping and includes the delay ψ caused by packet queuing and scheduling.
Further, in step S1, the cost incurred by SFC deployment includes two parts: one from the processing of VNFs by the underlying network servers and the other from the use of link bandwidth. The processing cost consists of a dynamic cost and a static cost: the dynamic cost is the variable cost generated by CPU operation and is related to the amount of allocated resources, with a positive coefficient ε, while the static cost is the fixed cost of activating a VNF on any server, denoted by a positive constant. The physical link bandwidth usage cost is proportional to the occupied physical link bandwidth, with a positive unit-bandwidth overhead coefficient. The total deployment cost Z_i of the i-th SFC in time slot t is the sum of the processing cost and the bandwidth usage cost.
Further, in step S1, a joint optimization model of SFC deployment is established; to unify the units of delay, cost and penalty, each part is normalized, and a utility function U(t) is designed that combines the normalized end-to-end delay, deployment cost and overload penalty.
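A minimal sketch of the per-SFC delay, cost and utility computation is given below; the per-VNF processing-delay form m_i / (β·c), the summation of mapped link delays, and the equal weights and normalizers of the utility are assumptions made for the example, not formulas reproduced from the patent.

```python
import numpy as np

def sfc_delay_cost_utility(cpu_alloc, bw_alloc, link_delays, m_i, beta,
                           overload_penalty, psi=0.0, eps=0.01,
                           static_cost=1.0, bw_coef=0.005,
                           weights=(1.0, 1.0, 1.0), norms=(1.0, 1.0, 1.0)):
    """Sketch of the i-th SFC's end-to-end delay, deployment cost and utility U(t).

    cpu_alloc        : CPU allocated to each VNF of the chain
    bw_alloc         : bandwidth allocated to each virtual link of the chain
    link_delays      : delay of each mapped physical link
    m_i, beta        : packet size and per-CPU service rate coefficient
    overload_penalty : value produced by the overload-penalty mechanism
    psi              : extra delay from packet queuing and scheduling
    eps, static_cost : dynamic CPU-cost coefficient and per-VNF activation cost
    bw_coef          : unit bandwidth overhead coefficient
    weights, norms   : illustrative weights / normalizers for the utility terms
    """
    cpu_alloc = np.asarray(cpu_alloc, dtype=float)
    # assumed processing-delay form: packet size over per-VNF service rate
    P_i = float(np.sum(m_i / (beta * cpu_alloc)))
    T_i = float(np.sum(link_delays) + psi)             # link delay + queuing delay
    Z_i = float(eps * np.sum(cpu_alloc) + static_cost * len(cpu_alloc)
                + bw_coef * np.sum(bw_alloc))           # processing + bandwidth cost
    terms = np.array([P_i + T_i, Z_i, overload_penalty]) / np.asarray(norms, float)
    U_t = float(np.dot(np.asarray(weights, float), terms))  # normalized weighted sum
    return P_i + T_i, Z_i, U_t
```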
Further, the SFC deployment problem is converted into an MDP model, represented by the 4-tuple M = <S, A, P, R>, where S is the state space, A the action space, P the state transition probability and R the reward function.
The state space comprises the SFC mapping states K_i(t), the residual computing resource ratio of each node and the residual bandwidth ratio of each physical link; a state s(t) ∈ S is expressed as s(t) = {K(t), η_c(t), η_b(t)}, where K(t) = [K_i(t)] and η_c(t), η_b(t) collect the per-node and per-link residual resource ratios.
The action space comprises the mapping of all VNFs of each chain, the node CPU resource allocation and the link bandwidth resource allocation, so an action a(t) ∈ A is denoted a(t) = {δ(t), C(t), B(t)}, where δ(t) collects the VNF placement decisions, C(t) the CPU allocations and B(t) the bandwidth allocations.
The network takes action a(t) in the time-slot-t state s(t) and then transitions to the state s(t+1) of the next time slot; the state transition probability of this process is defined as p(s(t+1) | s(t), a(t)).
The objective here is to minimize the SFC end-to-end average delay, the network deployment cost and the overload penalty, so the reward function is defined as R(t) = -[U(t)]^(-1).
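The state, action and reward of this MDP can be sketched as simple containers; the field layouts below only mirror the quantities named in the text, and the reward helper simply restates the definition above.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SFCState:
    K: np.ndarray       # SFC mapping states K_i(t)
    eta_c: np.ndarray   # residual computing-resource ratio of each node
    eta_b: np.ndarray   # residual bandwidth ratio of each physical link

@dataclass
class SFCAction:
    delta: np.ndarray   # VNF placement decisions delta(t)
    C: np.ndarray       # node CPU allocations C(t)
    B: np.ndarray       # link bandwidth allocations B(t)

def reward_from_utility(utility: float) -> float:
    # reward as defined above: R(t) = -[U(t)]^(-1)
    return -1.0 / utility
```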
Further, in step S3, the central attention mechanism follows the paradigm of centrally trained critics and distributed execution of the policies, so that a specific agent can selectively attend to information from other agents; the distributed actor networks take actions and obtain the corresponding observations, and shared training is performed after the local information is mapped.
To combine information focused on different subspaces, the contributions of all heads are concatenated, which can be expressed as Mhead = Concat(head_1, head_2, …, head_h) W_o. Each head uses an independent set of parameters (W_q, W_k, V), where V is a shared value matrix and W_q and W_k are the parameters of the respective attention head that convert the mapping outputs e_i and e_j into query and key values, respectively, producing the aggregate contribution x_i of the other agents to a particular agent i.
Further, in step S4, the method specifically includes the following steps (a sketch of this training loop is given after the list):
S41: the network services are composed into chains, and the services are then divided into several segments according to user classification;
S42: the SFC deployment environment is reset and the private observation of each decision maker is initialized;
S43: each agent selects its actions in the local reinforcement learning area, performs VNF placement and the allocation of node CPU and link bandwidth resources, obtains the local observation and decision reward of the SFC deployment, and stores each environment transition in a buffer pool;
S44: each agent performs local SAC algorithm training, and the mapping function of each agent is collected and uploaded to the central attention mechanism to obtain the contribution values;
S45: the local model aggregates the contribution values with its own mapping value to obtain the attention-weighted observation-action value;
S46: the joint loss function is calculated and the critic networks are updated with Adam; the advantage function is calculated and the actor networks are updated with Adam;
S47: steps S42-S46 are repeated until the models of all decision makers converge or the episode limit is reached.
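The order of operations in steps S41-S47 can be sketched as below; Env, the agent objects and the central attention critic are hypothetical stand-ins for the components described in this disclosure, and all of their method names are assumptions made only to illustrate the control flow.

```python
def train(env, agents, central_critic, episodes=1000, steps=200):
    """Sketch of the S41-S47 loop (class and method names are hypothetical)."""
    for episode in range(episodes):
        observations = env.reset()                      # S42: reset SFC environment
        for _ in range(steps):
            # S43: each agent picks VNF placement / CPU / bandwidth actions locally
            actions = [ag.act(o) for ag, o in zip(agents, observations)]
            next_obs, rewards, done = env.step(actions)
            for ag, o, a, r, o2 in zip(agents, observations, actions, rewards, next_obs):
                ag.buffer.add(o, a, r, o2, done)        # store the transition
            observations = next_obs

            # S44-S45: local SAC training; the mapping functions g_i(o_i, a_i) are
            # gathered by the central attention critic, which returns each agent's
            # attention-weighted observation-action value
            batch = [ag.buffer.sample() for ag in agents]
            q_with_attention = central_critic.evaluate(batch)

            # S46: joint critic loss and advantage-based actor update, both via Adam
            central_critic.update(batch)
            for ag, q in zip(agents, q_with_attention):
                ag.update_actor(q)
            if done:                                    # S47: repeat until convergence
                break
```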
The invention has the beneficial effects that: scheduling is performed reasonably under the computing and bandwidth constraints of the physical facilities, so that the deployment cost and the SFC end-to-end delay are minimized and the load on server resources is balanced; the local maximum-entropy strategy strengthens the exploration of the VNF deployment and resource allocation policies, and the attention mechanism enables each agent to attend effectively to external contributions.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in the following preferred detail with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a system architecture based on NFV orchestration and control according to the present invention;
FIG. 3 is a schematic diagram of multi-agent business orchestration and resource allocation according to the present invention;
FIG. 4 is a diagram illustrating a central attention mechanism of multi-agent reinforcement learning according to the present invention.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. The invention may also be practiced or carried out in other, different embodiments, and the details of the present description may be modified or varied in various ways without departing from the spirit and scope of the present invention. It should be noted that the illustrations provided in the following embodiments merely illustrate the basic idea of the present invention schematically, and the following embodiments and the features in the embodiments may be combined with each other provided there is no conflict.
Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to limit the invention; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.
The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if there are terms such as "upper", "lower", "left", "right", "front", "rear", etc., that indicate an azimuth or a positional relationship based on the azimuth or the positional relationship shown in the drawings, it is only for convenience of describing the present invention and simplifying the description, but not for indicating or suggesting that the referred device or element must have a specific azimuth, be constructed and operated in a specific azimuth, so that the terms describing the positional relationship in the drawings are merely for exemplary illustration and should not be construed as limiting the present invention, and that the specific meaning of the above terms may be understood by those of ordinary skill in the art according to the specific circumstances.
Referring to fig. 1 and 2, the physical layer is an underlying bearer network including a general server, and provides its instantiated physical resources for the VNF. The virtualization layer mainly completes real-time monitoring of network states, load analysis of physical networks and execution of resource allocation strategies. The application layer is mainly responsible for creating SFC according to the service requirement, and various services are provided for users by taking the SFC as a carrier.
The physical network includes a large number of nodes and links and is modeled as an undirected graph G_p = (N, L), where N represents the set of physical nodes, i.e. servers, and L represents the set of links connecting the nodes. The servers provide their instantiated CPU resources for the VNFs, and each underlying server may instantiate multiple VNFs. Each server contains a plurality of CPUs, and C_v denotes the CPU resource capacity owned by the v-th server; uv denotes the physical link connecting nodes u and v, whose limited bandwidth capacity is denoted B_uv.
The virtual network is modeled as a directed graph G_v = (V, P). The set of SFCs in the network is denoted F, and the i-th SFC is denoted as a directed graph G_v^i = (V_i, P_i), where V_i denotes the set of VNFs on the i-th SFC and P_i the set of virtual links on the i-th SFC. For the j-th VNF on the i-th SFC, c_{i,j}^v denotes the amount of CPU resources allocated to it by server v; jk denotes the virtual link connecting the adjacent j-th and k-th VNFs on the i-th SFC, and b_{i,jk}^{uv} denotes the amount of bandwidth resources allocated to it by the physical link uv.
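For concreteness, the physical and virtual network models described above can be represented with simple containers; the field names below are illustrative rather than taken from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class PhysicalNetwork:
    """Undirected graph G_p = (N, L): servers with CPU capacity, links with bandwidth."""
    cpu_capacity: Dict[int, float]                 # C_v for each server v
    bandwidth: Dict[Tuple[int, int], float]        # bandwidth capacity of each link uv

@dataclass
class ServiceFunctionChain:
    """The i-th SFC: an ordered set of VNFs V_i and the virtual links P_i between them."""
    vnf_cpu_alloc: List[float]                     # c_{i,j}: CPU allocated to the j-th VNF
    vlink_bw_alloc: List[float]                    # b_{i,jk}: bandwidth per virtual link
    placement: Dict[int, int] = field(default_factory=dict)   # VNF index j -> server v
    routing: Dict[int, List[Tuple[int, int]]] = field(default_factory=dict)  # virtual link -> physical path
```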
Referring to fig. 3, fig. 3 is the multi-agent service orchestration and resource allocation scheme according to the present invention. Users with various business requirements are treated as different agents and numbered i ∈ {1, 2, …, N}; each agent learns a policy π_i: O_i → P(A_i) based on its local observation o_i(t) = {K_i(t), η_ci(t), η_bi(t)}, independently of the other agents, takes an action a_i(t) = {δ_i(t), C_i(t), B_i(t)}, and obtains a private reward r_i(t) while interacting continuously with the environment.
The resource capacity of each server in the network differs and is represented by the total area of a rectangle, and each service request is represented by a different SFC. To display the network resource load and the resources occupied by each agent, lines of various shapes are drawn for the different agents: the rectangle represents the reserved, unused part of the network resources, the amount of CPU resources each server allocates to an agent is represented by the area that agent occupies, and the amount of bandwidth allocated on a physical link is represented by the length of the corresponding line.
The agents cooperate to serve the arriving requests; each agent can access all resources in the environment and selects appropriate network resources to meet its respective business needs, with the common goal of achieving the maximum cumulative shared reward.
Referring to fig. 4, fig. 4 is a central attention mechanism of multi-agent reinforcement learning according to the present invention, and the steps are as follows:
Step 1): the model aggregation node transmits initial parameters;
Step 2): selecting actions of each agent in the local reinforcement learning area, and carrying out VNF placement and resource allocation of node CPU and link bandwidth;
Step 3): each intelligent agent carries out local SAC algorithm training, and the mapping function of each intelligent agent is collected and uploaded to a central attention mechanism;
Step 4): obtaining training contribution values of all the agents through a query value/key value system in the multi-head attention;
step 5): the local model aggregates the contribution values and the self-mapping values to obtain observation-action values with attention;
The central attention mechanism follows the paradigm of centrally trained critics and distributed execution of the policies, so that a specific agent can selectively attend to information from other agents; the distributed actor networks take actions and obtain the corresponding observations, and shared training is performed after the local information is mapped. To combine information focused on different subspaces, the contributions of all heads are concatenated, expressed as Mhead = Concat(head_1, head_2, …, head_h) W_o, where each head uses an independent set of parameters (W_q, W_k, V), V being a shared matrix, and yields the aggregate contribution x_i of the other agents to a particular agent i. The observation-action value function of agent i is related to the contributions of the other agents in addition to agent i's own observation and action, and is expressed as Q_i(o, a) = f_i(g_i(o_i, a_i), x_i),
where g_i and f_i are both multi-layer perceptron mapping functions, (o_i, a_i) denotes the VNF mapping, CPU and bandwidth resource allocation actions taken by agent i together with the observation it obtains from the SFC deployment environment, and x_i is the contribution value of the agents other than i, a weighted sum of their contributions: x_i = Σ_{j≠i} α_j v_j,
where \i denotes the set of agents other than i, v_j = h(V g_j(o_j, a_j)) is the value provided by agent j, whose observation and action are encoded by the mapping function and then transformed linearly by the shared matrix V, and h denotes the element-wise (Hadamard-style) nonlinearity. α_j denotes the attention weight, obtained through a bilinear mapping (i.e. a query-key system) by passing the correlation between the mapped values e_j and e_i through a normalized exponential (softmax) function: α_j ∝ exp(e_j^T W_k^T W_q e_i),
where e_j = g_j(o_j, a_j), j = 1…N, and W_q and W_k are the parameters in each attention head that convert the mapping outputs e_i and e_j into query and key values, respectively. The query and key values are fed into a scaled dot-product module, which rescales according to the dimension of the two matrices, and the weight of each value is finally obtained through the softmax module.
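A compact PyTorch sketch of this central attention critic follows; the layer sizes, the leaky-ReLU choice for the nonlinearity h, and the per-agent output heads are assumptions made to obtain runnable code, not details taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralAttentionCritic(nn.Module):
    """Each agent i encodes (o_i, a_i) with g_i, attends over the other agents'
    encodings via query/key scaled dot-products, and feeds the aggregated
    contribution x_i together with its own encoding into f_i to get Q_i(o, a)."""

    def __init__(self, n_agents, obs_dim, act_dim, emb_dim=64, n_heads=4):
        super().__init__()
        self.n_agents, self.n_heads = n_agents, n_heads
        self.head_dim = emb_dim // n_heads
        # per-agent mapping functions g_i and output heads f_i
        self.g = nn.ModuleList([nn.Linear(obs_dim + act_dim, emb_dim) for _ in range(n_agents)])
        self.f = nn.ModuleList([nn.Linear(2 * emb_dim, 1) for _ in range(n_agents)])
        # per-head query/key parameters W_q, W_k and shared value matrix V
        self.W_q = nn.ModuleList([nn.Linear(emb_dim, self.head_dim, bias=False) for _ in range(n_heads)])
        self.W_k = nn.ModuleList([nn.Linear(emb_dim, self.head_dim, bias=False) for _ in range(n_heads)])
        self.V = nn.ModuleList([nn.Linear(emb_dim, self.head_dim, bias=False) for _ in range(n_heads)])
        self.W_o = nn.Linear(n_heads * self.head_dim, emb_dim)   # Mhead = Concat(...) W_o

    def forward(self, obs, acts):
        # obs[i]: (batch, obs_dim), acts[i]: (batch, act_dim) for each agent i
        e = [g(torch.cat([o, a], dim=-1)) for g, o, a in zip(self.g, obs, acts)]
        q_values = []
        for i in range(self.n_agents):
            heads = []
            for h in range(self.n_heads):
                query = self.W_q[h](e[i])
                keys = torch.stack([self.W_k[h](e[j]) for j in range(self.n_agents) if j != i], dim=1)
                vals = torch.stack([F.leaky_relu(self.V[h](e[j])) for j in range(self.n_agents) if j != i], dim=1)
                # attention weights alpha_j over the other agents (scaled dot product + softmax)
                alpha = F.softmax((keys @ query.unsqueeze(-1)).squeeze(-1) / self.head_dim ** 0.5, dim=-1)
                heads.append((alpha.unsqueeze(-1) * vals).sum(dim=1))      # per-head contribution
            x_i = self.W_o(torch.cat(heads, dim=-1))                       # aggregate contribution x_i
            q_values.append(self.f[i](torch.cat([e[i], x_i], dim=-1)))     # Q_i = f_i(g_i(o_i, a_i), x_i)
        return q_values
```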
All critic networks are updated together, with the aim of minimizing the joint regression loss function L_Q = Σ_i E_{(o,a,r,o')~D}[(Q_i(o, a) - y_i)^2],
where D is the replay buffer storing past experience, Q_i(o, a) is the estimated value of agent i's action, which is obtained through the attention mechanism, and y_i is the target value, expressed as y_i = r_i + γ E[Q'_i(o', a') - α log π'_i(a'_i | o'_i)],
where α serves as the soft temperature parameter and effectively weighs the importance of the SFC deployment reward against the maximum-entropy term, and the target critic network and target actor network keep their own parameter copies. The target networks are updated in a soft manner, each target parameter vector θ' being moved slightly toward its current counterpart θ: θ' ← τθ + (1 - τ)θ'.
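A sketch of the joint critic loss, the soft target value and the soft (Polyak) target-network update is given below; the tensor shapes and the omission of terminal-state masking are simplifications of the formulas above.

```python
import torch
import torch.nn.functional as F

def joint_critic_loss(q_vals, target_q_vals, next_log_probs, rewards,
                      gamma=0.99, alpha=0.2):
    """Sum over agents of E[(Q_i(o, a) - y_i)^2] with the soft target
    y_i = r_i + gamma * (Q'_i(o', a') - alpha * log pi'_i(a'_i | o'_i))."""
    loss = 0.0
    for q, q_target, log_p, r in zip(q_vals, target_q_vals, next_log_probs, rewards):
        y = r + gamma * (q_target - alpha * log_p)
        loss = loss + F.mse_loss(q, y.detach())
    return loss

def soft_update(target_net, net, tau=0.005):
    # soft update of the target critic / actor parameters toward the current ones
    with torch.no_grad():
        for p_target, p in zip(target_net.parameters(), net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```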
To solve the credit assignment problem between agents, an advantage function is introduced into agent learning. The idea is to marginalize the given agent's action out of the observation-action value function and compare the result with the original value, so as to see whether an increase in reward should be attributed to other agents. The advantage function is expressed as A_i(o, a) = Q_i(o, a) - b(o, a_\i),
where b(o, a_\i) is the multi-agent baseline: keeping the other agents' actions unchanged, the specific action of the given agent is replaced by the expectation over its other possible actions, i.e. b(o, a_\i) = E_{a'_i ~ π_i(·|o_i)}[Q_i(o, (a'_i, a_\i))].
The actor network policy of each agent is updated by gradient ascent, with the gradient computed as ∇_{θ_i} J = E[∇_{θ_i} log π_i(a_i | o_i) (A_i(o, a) - α log π_i(a_i | o_i))].
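A sketch of the counterfactual advantage with the multi-agent baseline and of the corresponding actor loss (the negative of the gradient-ascent objective) is shown below; it assumes a discrete action space so that the baseline can be computed by enumerating agent i's candidate actions.

```python
import torch

def counterfactual_advantage(q_all_actions, probs, taken_action):
    """A_i(o, a) = Q_i(o, a) - b(o, a_\\i), where the baseline marginalizes agent i's
    action under its own policy while the other agents' actions stay fixed.

    q_all_actions : (batch, n_actions) values Q_i(o, (a_i', a_\\i)) for every candidate a_i'
    probs         : (batch, n_actions) current policy pi_i(. | o_i)
    taken_action  : (batch,) index of the action actually taken by agent i
    """
    baseline = (probs * q_all_actions).sum(dim=-1, keepdim=True)       # b(o, a_\i)
    q_taken = q_all_actions.gather(-1, taken_action.unsqueeze(-1))     # Q_i(o, a)
    return q_taken - baseline

def actor_loss(advantage, log_prob_taken, alpha=0.2):
    # score-function estimator: minimizing this loss performs gradient ascent on
    # E[ grad log pi_i(a_i|o_i) * (A_i(o, a) - alpha * log pi_i(a_i|o_i)) ]
    return (log_prob_taken * (alpha * log_prob_taken - advantage).detach()).mean()
```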
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the present invention, which is intended to be covered by the claims of the present invention.

Claims (1)

1. An SFC deployment method based on multi-agent reinforcement learning is characterized in that: the method comprises the following steps:
S1: in a network function virtualization scenario, an overload penalty mechanism based on node capacity proportion is designed, node resource reservations are monitored and a penalty is applied for excessive use, a mathematical model that minimizes the network overload penalty, end-to-end average delay and deployment cost is established, and the service function chain deployment optimization problem is converted into a Markov decision process to be solved;
S2: a multi-agent service orchestration scheme based on user division is established, in which the multi-agent framework follows a strategy of centralized training and distributed execution;
S3: a central attention mechanism with multiple attention heads is designed, combining information focused on different subspaces;
S4: each decision maker adopts a soft actor-critic algorithm, improves the exploration capability and robustness of the agent with a maximum-entropy reinforcement learning framework, selectively extracts information through the attention mechanism, and realizes credit assignment by combining an advantage function;
In the step S1, the network function virtualization scenario includes four components: a physical facility layer, a virtualization management layer, a network application layer and a service operation support system; the physical layer provides instantiated physical resources for the network functions, the virtualization layer performs load analysis of the physical network and executes the resource allocation strategy, the application layer is responsible for creating and composing SFCs according to service requirements, and the service operation support system performs real-time monitoring of the network state;
the overload penalty mechanism based on the node capacity ratio is as follows: overload punishment means punishment is carried out on nodes with overlarge loads caused by improper resource allocation so as to improve the uniformity degree of resource allocation;
The overload penalty in time slot t is expressed as a function of the per-node CPU reservation rates, where ε_c denotes the unit penalty incurred when the resource reservation rate is insufficient, α_c denotes the resource overload warning value of the underlying servers, and η_c^v(t) denotes the CPU reservation rate of node v in time slot t, namely η_c^v(t) = (C_v - a_v(t))/C_v, where a_v(t) denotes the CPU resources allocated on the v-th server in time slot t;
The end-to-end delay of the i-th SFC is divided into the VNF processing delay P_i and the link communication delay T_i;
in time slot t, the processing delay P_i is determined by m_i, the size of the data packets on the i-th SFC, and the service rate coefficient β, which denotes the amount of data a single CPU can process per second;
T_i denotes the total link communication delay of the i-th SFC; it depends on the link mapping and includes the delay ψ caused by packet queuing and scheduling;
the cost of SFC deployment includes two parts: one from the processing of VNFs by the underlying network servers and the other from the use of link bandwidth; the processing cost consists of a dynamic cost, the variable cost generated by CPU operation with positive coefficient ε, and a static cost, the cost of activating a VNF on any server, denoted by a positive constant; the physical link bandwidth usage cost is proportional to the occupied physical link bandwidth, with a positive unit-bandwidth overhead coefficient; the total deployment cost Z_i of the i-th SFC in time slot t is the sum of the processing cost and the bandwidth usage cost;
A joint optimization model of SFC deployment is established; to unify the units of delay, cost and penalty, each part is normalized, and a utility function U(t) is designed that combines the normalized end-to-end delay, deployment cost and overload penalty;
The SFC deployment problem is converted into an MDP model, represented by the 4-tuple M = <S, A, P, R>, where S is the state space, A the action space, P the state transition probability and R the reward function;
the state space comprises the SFC mapping states K_i(t), the residual computing resource ratio of each node and the residual bandwidth ratio of each physical link, and a state s(t) ∈ S is expressed as s(t) = {K(t), η_c(t), η_b(t)}, where K(t) = [K_i(t)] and η_c(t), η_b(t) collect the per-node and per-link residual resource ratios;
the action space comprises the mapping of all VNFs of each chain, the node CPU resource allocation and the link bandwidth resource allocation, and an action a(t) ∈ A is denoted a(t) = {δ(t), C(t), B(t)}, where δ(t) collects the VNF placement decisions, C(t) the CPU allocations and B(t) the bandwidth allocations;
the network takes action a(t) in the time-slot-t state s(t) and then transitions to the state s(t+1) of the next time slot, and the state transition probability of this process is defined as p(s(t+1) | s(t), a(t));
the goal is to minimize the SFC end-to-end average delay, the network deployment cost and the overload penalty, and the reward function is defined as R(t) = -[U(t)]^(-1);
the multi-agent service orchestration scheme based on user division is as follows:
users with various business requirements are treated as different agents and numbered i ∈ {1, 2, …, N}; each agent learns a policy π_i: O_i → P(A_i) based on its local observation o_i(t) = {K_i(t), η_ci(t), η_bi(t)}, independently of the other agents, takes an action a_i(t) = {δ_i(t), C_i(t), B_i(t)}, and obtains a private reward r_i(t) while interacting continuously with the environment;
the resource capacity of each server in the network differs and is represented by the total area of a rectangle, and each service request is represented by a different SFC; to display the network resource load and the resources occupied by each agent, lines of various shapes are drawn for the different agents, wherein the rectangle represents the reserved, unused part of the network resources, the amount of CPU resources each server allocates to an agent is represented by the area that agent occupies, and the amount of bandwidth allocated on a physical link is represented by the length of the corresponding line;
the agents cooperate to serve the arriving requests; each agent can access all resources in the environment and selects appropriate network resources to meet its respective business needs, the common goal of the agents being to obtain the maximum cumulative shared reward;
The steps of the central attention mechanism with multiple attention heads are as follows:
Step 1): the model aggregation node transmits initial parameters;
Step 2): selecting actions of each agent in the local reinforcement learning area, and carrying out VNF placement and resource allocation of node CPU and link bandwidth;
Step 3): each intelligent agent carries out local SAC algorithm training, and the mapping function of each intelligent agent is collected and uploaded to a central attention mechanism;
Step 4): obtaining training contribution values of all the agents through a query value/key value system in the multi-head attention;
step 5): the local model aggregates the contribution values and the self-mapping values to obtain observation-action values with attention;
The central attention mechanism follows the paradigm of centrally trained critics and distributed execution of the policies, so that a specific agent can selectively attend to information from other agents; the distributed actor networks take actions and obtain the corresponding observations, and shared training is performed after the local information is mapped; to combine information focused on different subspaces, the contributions of all heads are concatenated, denoted Mhead = Concat(head_1, head_2, ···, head_h) W_o, where each head uses an independent set of parameters (W_q, W_k, V), V being a shared matrix, and yields the aggregate contribution x_i of the other agents to a particular agent i; the observation-action value function of agent i is related to the contributions of the other agents in addition to agent i's own observation and action, and is expressed as Q_i(o, a) = f_i(g_i(o_i, a_i), x_i),
where g_i and f_i are both multi-layer perceptron mapping functions, (o_i, a_i) denotes the VNF mapping, CPU and bandwidth resource allocation actions taken by agent i together with the observation it obtains from the SFC deployment environment, and x_i is the contribution value of the agents other than i, a weighted sum of their contributions: x_i = Σ_{j≠i} α_j v_j,
wherein \i denotes the set of agents other than i, v_j = h(V g_j(o_j, a_j)) is the value provided by agent j, whose observation and action are encoded by the mapping function and then transformed linearly by the shared matrix V, and h denotes the element-wise (Hadamard-style) nonlinearity; α_j denotes the attention weight, obtained through a bilinear mapping (i.e. a query-key system) by passing the correlation between the mapped values e_j and e_i through a normalized exponential (softmax) function: α_j ∝ exp(e_j^T W_k^T W_q e_i),
where e_j = g_j(o_j, a_j), j = 1…N, and W_q and W_k are the parameters in each attention head that convert the mapping outputs e_i and e_j into query and key values, respectively; the query and key values are then fed into a scaled dot-product module, which rescales according to the dimension of the two matrices, and the weight of each value is finally obtained through the softmax module;
All critic networks are updated together, with the aim of minimizing the joint regression loss function L_Q = Σ_i E_{(o,a,r,o')~D}[(Q_i(o, a) - y_i)^2],
where D is the replay buffer storing past experience, Q_i(o, a) is the estimated value of agent i's action, which is obtained through the attention mechanism, and y_i is the target value, expressed as y_i = r_i + γ E[Q'_i(o', a') - α log π'_i(a'_i | o'_i)],
where α serves as the soft temperature parameter and effectively weighs the importance of the SFC deployment reward against the maximum-entropy term, and the target critic network and target actor network keep their own parameter copies; the target networks are updated in a soft manner, each target parameter vector θ' being moved slightly toward its current counterpart θ: θ' ← τθ + (1 - τ)θ';
to solve the credit assignment problem between agents, an advantage function is introduced into agent learning; the idea is to marginalize the given agent's action out of the observation-action value function and compare the result with the original value, so as to know whether an increase in reward should be attributed to other agents; the advantage function is expressed as A_i(o, a) = Q_i(o, a) - b(o, a_\i),
where b(o, a_\i) is the multi-agent baseline: keeping the other agents' actions unchanged, the specific action of the given agent is replaced by the expectation over its other possible actions, i.e. b(o, a_\i) = E_{a'_i ~ π_i(·|o_i)}[Q_i(o, (a'_i, a_\i))];
the actor network policy of each agent is updated by gradient ascent, with the gradient computed as ∇_{θ_i} J = E[∇_{θ_i} log π_i(a_i | o_i) (A_i(o, a) - α log π_i(a_i | o_i))];
the step S4 specifically comprises the following steps:
S41: the network services are composed into chains, and the services are then divided into several segments according to user classification;
S42: the SFC deployment environment is reset and the private observation of each decision maker is initialized;
S43: each agent selects its actions in the local reinforcement learning area, performs VNF placement and the allocation of node CPU and link bandwidth resources, obtains the local observation and decision reward of the SFC deployment, and stores each environment transition in a buffer pool;
S44: each agent performs local SAC algorithm training, and the mapping function of each agent is collected and uploaded to the central attention mechanism to obtain the contribution values;
S45: the local model aggregates the contribution values with its own mapping value to obtain the attention-weighted observation-action value;
S46: the joint loss function is calculated and the critic networks are updated with Adam; the advantage function is calculated and the actor networks are updated with Adam;
S47: steps S42-S46 are repeated until the models of all decision makers converge or the episode limit is reached.
CN202211467664.XA 2022-11-22 2022-11-22 SFC deployment method based on multi-agent reinforcement learning Active CN116112938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211467664.XA CN116112938B (en) 2022-11-22 2022-11-22 SFC deployment method based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211467664.XA CN116112938B (en) 2022-11-22 2022-11-22 SFC deployment method based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN116112938A CN116112938A (en) 2023-05-12
CN116112938B true CN116112938B (en) 2024-04-19

Family

ID=86258653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211467664.XA Active CN116112938B (en) 2022-11-22 2022-11-22 SFC deployment method based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN116112938B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110505099A (en) * 2019-08-28 2019-11-26 重庆邮电大学 A kind of service function chain dispositions method based on migration A-C study
CN112769594A (en) * 2020-12-14 2021-05-07 北京邮电大学 Intra-network service function deployment method based on multi-agent reinforcement learning
CN114172937A (en) * 2022-01-19 2022-03-11 重庆邮电大学 Dynamic service function chain arrangement method and system based on deep reinforcement learning
KR20220077106A (en) * 2020-12-01 2022-06-08 포항공과대학교 산학협력단 Method for VNF Deployment based on Machine Learning Algorithm and apparatus thereof

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110505099A (en) * 2019-08-28 2019-11-26 重庆邮电大学 A kind of service function chain dispositions method based on migration A-C study
KR20220077106A (en) * 2020-12-01 2022-06-08 포항공과대학교 산학협력단 Method for VNF Deployment based on Machine Learning Algorithm and apparatus thereof
CN112769594A (en) * 2020-12-14 2021-05-07 北京邮电大学 Intra-network service function deployment method based on multi-agent reinforcement learning
CN114172937A (en) * 2022-01-19 2022-03-11 重庆邮电大学 Dynamic service function chain arrangement method and system based on deep reinforcement learning

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Multiagent Reinforcement-Learning-Aided Service Function Chain Deployment for Internet of Things; Yuchao Zhu; IEEE Internet of Things Journal; 2022-02-14; full text *
Service function chain resource allocation algorithm based on asynchronous advantage actor-critic learning; Tang Lun; Journal of Electronics & Information Technology; 2021-06-15; full text *
Virtual network function migration algorithm based on reinforcement learning for 5G network slicing; Tang Lun; Zhou Yu; Tan Qi; Wei Yannan; Chen Qianbin; Journal of Electronics & Information Technology; 2020-03-15 (No. 03); full text *
Service function chain migration mechanism based on deep reinforcement learning in operator networks; Chen Zhuo; Feng Gang; He Ying; Zhou Yang; Journal of Electronics & Information Technology; 2020-09-15 (No. 09); full text *

Also Published As

Publication number Publication date
CN116112938A (en) 2023-05-12

Similar Documents

Publication Publication Date Title
He et al. Blockchain-based edge computing resource allocation in IoT: A deep reinforcement learning approach
Oma et al. An energy-efficient model for fog computing in the internet of things (IoT)
Shao et al. A data replica placement strategy for IoT workflows in collaborative edge and cloud environments
CN113282368B (en) Edge computing resource scheduling method for substation inspection
JP3989443B2 (en) Method for controlling a web farm and web farm
CN107710237A (en) Deep neural network divides on server
CN114401532A (en) Intra-network pooled resource allocation optimization method based on contribution perception in computational power network
Wu et al. Optimal deploying IoT services on the fog computing: A metaheuristic-based multi-objective approach
Ren et al. An energy‐aware method for task allocation in the Internet of things using a hybrid optimization algorithm
Yadav et al. An opposition-based hybrid evolutionary approach for task scheduling in fog computing network
Cao et al. A deep reinforcement learning approach to multi-component job scheduling in edge computing
CN113190342B (en) Method and system architecture for multi-application fine-grained offloading of cloud-edge collaborative networks
Zanbouri et al. A new fog-based transmission scheduler on the Internet of multimedia things using a fuzzy-based quantum genetic algorithm
CN116112938B (en) SFC deployment method based on multi-agent reinforcement learning
Yongdong Bi-level programming optimization method for cloud manufacturing service composition based on harmony search
Zhang et al. You calculate and I provision: A DRL-assisted service framework to realize distributed and tenant-driven virtual network slicing
Di et al. In-network pooling: contribution-aware allocation optimization for computing power network in B5G/6G era
Lorido-Botran et al. Adaptive container scheduling in cloud data centers: a deep reinforcement learning approach
Mobasheri et al. Toward developing fog decision making on the transmission rate of various IoT devices based on reinforcement learning
CN112162837A (en) Software definition-based edge computing scheduling method and system
CN115225512B (en) Multi-domain service chain active reconfiguration mechanism based on node load prediction
CN115086249B (en) Cloud data center resource allocation method based on deep reinforcement learning
Zou et al. Efficient orchestration of virtualization resource in ran based on chemical reaction optimization and q-learning
CN115361453A (en) Load fair unloading and transferring method for edge service network
Duraisamy et al. High performance and energy efficient wireless NoC-enabled multicore architectures for graph analytics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20240329

Address after: Room 711, 7th Floor, Building D, Bantian International Center, No. 5 Huancheng South Road, Ma'antang Community, Bantian Street, Longgang District, Shenzhen City, Guangdong Province, 518000

Applicant after: Shenzhen Sailei Culture Media Co.,Ltd.

Country or region after: China

Address before: 400065 Chongqing Nan'an District huangjuezhen pass Chongwen Road No. 2

Applicant before: CHONGQING University OF POSTS AND TELECOMMUNICATIONS

Country or region before: China

GR01 Patent grant