CN116566891A - Delay-sensitive service function chain parallel route optimization method, device and medium


Info

Publication number: CN116566891A
Application number: CN202310532713.1A
Authority: CN (China)
Prior art keywords: network, service function, delay, state, action
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 郭松涛, 朱永东, 王紫恒, 刘贵燕, 曲鑫
Current assignee: Zhejiang Lab
Original assignee: Zhejiang Lab
Application filed by Zhejiang Lab; priority to CN202310532713.1A; publication of CN116566891A

Classifications

    • H04L 45/243: Multipath using M+N parallel active paths
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • G06N 3/08: Learning methods
    • H04L 45/08: Learning-based routing, e.g. using neural networks or artificial intelligence
    • H04L 45/302: Route determination based on requested QoS
    • H04L 45/76: Routing in software-defined topologies, e.g. routing between virtual machines


Abstract

The invention relates to a delay-sensitive service function chain parallel route optimization method, device and medium, belonging to the technical field of edge computing. The method comprises the following steps: constructing a service function chain parallel routing problem in an edge computing network based on network function virtualization and, taking real-time network state changes into account, modeling the service function chain parallel routing problem with a Markov decision process to obtain an MDP model; and solving the MDP model by an enhanced deep deterministic policy gradient algorithm based on a self-attention mechanism (SA-EDDPG), wherein the solving process comprises an offline training part and an online running part: the offline training part obtains a trained SA-EDDPG model, and the online running part determines an optimal routing scheme by running the trained SA-EDDPG model to perform service function chain parallel routing. Compared with the prior art, the invention has the advantages of taking real-time network changes into account, low delay, low resource consumption, and the like.

Description

Delay-sensitive service function chain parallel route optimization method, device and medium
Technical Field
The present invention relates to the field of edge computing technologies, and in particular, to a method, an apparatus, and a medium for optimizing parallel routes of a delay-sensitive service function chain.
Background
In recent years, with the rapid development of 5G and artificial intelligence technologies, a large number of delay-sensitive and computation-intensive applications have emerged in the network, and the conventional cloud computing mode can no longer effectively meet the demands of these applications. Therefore, the concept of edge computing has been proposed. Edge computing can make full use of resource-constrained edge devices to provide services such as computing, storage and communication at the edge side close to end users, thereby significantly reducing network latency and relieving the pressure on the data center network. With the proliferation of edge computing applications, a large number of network function elements that provide network services are deployed at the network edge. Therefore, how to manage these network functions flexibly and efficiently, so that they can be dynamically created, deleted and migrated and thereby meet the different service requirements of users under limited network edge resources, has become an urgent problem in the development of edge computing.
Traditionally, network functions have been implemented by dedicated network devices, which are costly and lack scalability. Network Function Virtualization (NFV) has been proposed as an emerging solution. Unlike the traditional approach, NFV implements network functions through virtualization technology on generic servers. Through the orchestration and scheduling mechanisms of NFV, service requests in edge computing can be realized through a Service Function Chain (SFC), i.e., an ordered set of Virtual Network Functions (VNFs), so as to solve the problem of network function management in edge computing and reduce service delay and operation cost.
However, a key problem with SFCs is that each VNF in the SFC needs to be placed on a physical node and the traffic flowing through the SFC has to be processed by each VNF in order, which may lead to high latency that cannot meet the requirements of delay-sensitive applications in an edge computing network. Processing service requests by parallel SFCs has become a solution to this problem. To implement this approach, each traffic flow needs to be split into multiple sub-flows, and VNFs with the same function need to be repeatedly instantiated on different server nodes. In this way, the repeated instances of each VNF can process the split sub-flows in parallel to share the processing load, thereby reducing the processing delay of the service. However, traffic splitting and repeated instantiation of VNFs consume additional resources; given the resource limitations of the edge computing network, controlling this consumption is crucial to the cost-effectiveness of the network operator.
Currently, there have been some studies on the SFC parallel routing problem. Since this type of problem is NP-hard, researchers often propose heuristic or approximation algorithms to solve it. However, these studies mostly assume that the network traffic is known and perform SFC parallel routing based on this assumption, ignoring the dynamic changes of the network state. Reinforcement Learning (RL) can automatically adjust its strategy to make optimal decisions based on historical experience and environmental feedback in complex and uncertain environments. Accordingly, some studies have used RL algorithms for SFC parallel routing in dynamic network environments. However, when the network scale is large, it is difficult for conventional RL algorithms to accurately describe complex network state changes, and they therefore become inefficient. Deep Reinforcement Learning (DRL), which combines deep learning with RL and uses Deep Neural Networks (DNNs) to handle the complex state changes of large-scale networks, is an emerging technique for solving the online SFC parallel routing problem. However, most research using DRL methods is carried out in data center networks and does not consider services with low latency requirements. On this basis, parallel routing of SFCs using DRL methods in a dynamic edge computing network can be considered, but some problems remain to be solved. On the one hand, the SFC parallel routing procedure involves discrete actions (e.g., VNF placement) and continuous actions (e.g., traffic splitting), and existing DRL-based SFC routing algorithms cannot effectively handle both types of actions simultaneously. On the other hand, the location distribution of server nodes in an edge computing network may be relatively dispersed, which may affect the efficiency of SFC routing.
Disclosure of Invention
The invention aims to provide a delay-sensitive service function chain parallel route optimization method, device and medium, which minimize the resource consumption of servers and links on the premise of meeting the delay and resource constraints, solve the online routing problem of SFC together with its discrete and continuous action problems, and improve the efficiency of SFC parallel routing.
The aim of the invention can be achieved by the following technical scheme:
a delay-sensitive service function chain parallel route optimization method comprises the following steps:
constructing a service function chain parallel routing problem in an edge computing network based on network function virtualization and, taking real-time network state changes into account, modeling the service function chain parallel routing problem with a Markov decision process to obtain an MDP model;
solving the MDP model by an enhanced deep deterministic policy gradient algorithm based on a self-attention mechanism (SA-EDDPG), wherein the solving process comprises an offline training part and an online running part: the offline training part obtains a trained SA-EDDPG model, and the online running part determines an optimal routing scheme by running the trained SA-EDDPG model to perform service function chain parallel routing.
The network model of the edge computing network is as follows:
the underlying physical network is represented as a connected graph $G=(N,L)$, where $N$ is the set of server nodes (including various devices with computing capability at the network edge) and $L$ is the set of physical links connecting pairs of server nodes; for each server node $n \in N$, $C_n$ denotes its resource capacity, and for each link $l \in L$, $B_l$ and $D_l$ denote its available bandwidth capacity and communication delay, respectively;
each service function chain request is defined as an ordered sequence of virtual network functions (VNFs), and the network traffic must be routed through each VNF in the service function chain in turn; the traffic routed through the service function chain is assumed to be splittable and is divided into $I$ sub-flows, each denoted by $i$; the set of service function chain requests arriving in real time is denoted by $R$, and each service function chain request $r(F, \varepsilon_r, D_r) \in R$ consists of the set $F$ of VNFs required by the request, the flow rate $\varepsilon_r$ of the traffic traversing the service function chain, and the delay constraint $D_r$ of the request; for each sub-flow $i$ of a service function chain request, the set of VNFs required by sub-flow $i$ is denoted $F_i = \{f_1, f_2, \dots, f_m, \dots, f_{|i|}\}$, where $f_m$ is the $m$-th VNF required by sub-flow $i$ and $|i|$ is the number of VNFs contained in sub-flow $i$; each VNF service instance $f_m \in F_i$ has a resource requirement, denoted $c^{i}_{f_m}$.
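As an illustration of the network and request model above, the following is a minimal sketch of how the physical graph $G=(N,L)$ and an SFC request $r(F, \varepsilon_r, D_r)$ could be represented in code. NetworkX is assumed only because the embodiment later generates its topologies with that tool; the topology generator, class and field names are illustrative and not details fixed by the invention.

```python
# Illustrative sketch of the network and request model; names and the
# Watts-Strogatz generator are assumptions, not part of the invention.
import random
from dataclasses import dataclass, field
from typing import Dict, List

import networkx as nx


@dataclass
class ServiceRequest:
    """One SFC request r(F, eps_r, D_r), split into I parallel sub-flows."""
    vnfs: List[str]                    # ordered VNF set F required by the request
    flow_rate: float                   # eps_r, traffic rate routed through the chain
    delay_limit: float                 # D_r, end-to-end delay constraint
    num_subflows: int = 2              # I, number of sub-flows the traffic is split into
    vnf_demand: Dict[str, float] = field(default_factory=dict)  # c^i_{f_m} per VNF instance


def build_network(num_nodes: int = 50) -> nx.Graph:
    """Connected graph G=(N, L) with node capacities C_n and link attributes B_l, D_l."""
    g = nx.connected_watts_strogatz_graph(num_nodes, k=4, p=0.3)
    for n in g.nodes:
        g.nodes[n]["capacity"] = random.randint(30, 100)         # C_n (units)
    for u, v in g.edges:
        g.edges[u, v]["bandwidth"] = random.uniform(100, 1000)   # B_l (Mbps)
        g.edges[u, v]["delay"] = random.uniform(1, 10)           # D_l (ms)
    return g
```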
The modeling of the service function chain parallel routing problem by using the Markov decision process comprises the following steps:
S1: Determine the constraint conditions;
S11: Determine the node capacity constraint, which ensures that the total resources consumed by the VNFs required by the requests do not exceed the resource limit of the server node $n$ on which they are deployed, expressed as:
$$\sum_{r \in R}\sum_{i \in I}\sum_{f_m \in F_i} x^{i,m}_{r,n}\, c^{i}_{f_m} \le C_n,\quad \forall n \in N$$
where $x^{i,m}_{r,n}$ is a 0-1 variable: if the $m$-th VNF $f_m$ required by sub-flow $i$ of request $r$ is placed on physical node $n$, the variable takes the value 1, and otherwise 0;
S12: Determine the link capacity constraint, which ensures that the total bandwidth required by all requests passing through a link $l \in L$ does not exceed the bandwidth capacity $B_l$ of the link, expressed as:
$$\sum_{r \in R}\sum_{i \in I} y^{i}_{r,l}\, \alpha^{i}_{r}\, \varepsilon_r \le B_l,\quad \forall l \in L$$
where $y^{i}_{r,l}$ is a 0-1 variable whose value is 1 if link $l$ is used to deliver sub-flow $i$ of request $r$ and 0 otherwise, and $\alpha^{i}_{r}$ is a continuous variable representing the fraction of traffic carried by sub-flow $i$ of request $r$, i.e. the traffic split ratio, taking values in the range $[0,1]$;
S13: Determine the delay constraint: for a request $r$, its delay consists of two parts, namely the processing delay at the server nodes and the communication delay of traffic transmission over the links; the total delay $D_r^{total}$ of request $r$ is expressed as:
$$D_r^{total} = \max_{i \in I}\Big(\sum_{n \in N}\sum_{f_m \in F_i} x^{i,m}_{r,n}\, d^{n}_{f_m} + \sum_{l \in L} y^{i}_{r,l}\, D_l\Big)$$
where $d^{n}_{f_m}$ denotes the processing delay of $f_m \in F_i$ on server node $n \in N$, and the maximum delay over all sub-flows $i$ is taken as the total delay of request $r$;
for each request $r$, if it is to be successfully accepted, the total delay $D_r^{total}$ of the request must not exceed its delay limit $D_r$, i.e. $D_r^{total} \le D_r$;
S14: Determine the placement constraint, which ensures that only one physical server is selected to place the $m$-th VNF required by sub-flow $i$ of request $r$ and that all VNFs in sub-flow $i$ of request $r$ can be served, expressed as:
$$\sum_{n \in N} x^{i,m}_{r,n} = 1,\quad \forall r \in R,\ \forall i,\ \forall f_m \in F_i$$
S2: Determine the optimization objective of the service function chain parallel routing problem, which is to minimize the joint resource consumption, i.e., to jointly minimize the resource consumption of the servers and the bandwidth consumption of the links, wherein,
the resource consumption $U_N$ of all servers is expressed as:
$$U_N = \sum_{n \in N}\sum_{r \in R}\sum_{i \in I}\sum_{f_m \in F_i} x^{i,m}_{r,n}\, c^{i}_{f_m}$$
the bandwidth consumption $U_L$ of all links is expressed as:
$$U_L = \sum_{l \in L}\sum_{r \in R}\sum_{i \in I} y^{i}_{r,l}\, \alpha^{i}_{r}\, \varepsilon_r$$
and the optimization objective of joint resource consumption is then expressed as:
$$\min\ \eta_1 U_N + \eta_2 U_L$$
where $\eta_1$ and $\eta_2$ are the weights of the server resource consumption and the link bandwidth consumption respectively, satisfying $\eta_1, \eta_2 \in (0,1)$ and $\eta_1 + \eta_2 = 1$.
The MDP model is defined as a five-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and the action space respectively, $\mathcal{P}$ is the state transition probability distribution, $\mathcal{R}$ is the reward function, and $\gamma \in [0,1]$ is the discount factor for future rewards; the elements are specifically defined as follows:
State space: the state $s_t \in \mathcal{S}$ at time $t$ is defined as a vector $G_t = (C_t, B_t, D_t)$, which characterizes the underlying physical network at time $t$, where $C_t$ denotes the currently available resources of all server nodes, $B_t$ denotes the currently available bandwidth of all physical links, and $D_t$ denotes the delays of all physical links;
Action space: the action space of the agent is the set $\mathcal{A}$, where each action $a \in \mathcal{A}$ represents VNF placement and traffic routing; an action is defined as $a_t = \big(x^{i,m}_{r,n}(t),\ y^{i}_{r,l}(t),\ \alpha^{i}_{r}(t)\big)$, where $x^{i,m}_{r,n}(t)$ is a discrete action indicating whether the $m$-th VNF of sub-flow $i$ of request $r$ is placed on server node $n$ at time $t$, $y^{i}_{r,l}(t)$ is a discrete action indicating whether sub-flow $i$ of request $r$ is routed through physical link $l$ at time $t$, and $\alpha^{i}_{r}(t)$ is a continuous action representing the traffic split ratio of request $r$ at time $t$;
State transition: a state transition is represented as $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ is the network state at the current time $t$, $a_t$ is the action handling VNF placement and traffic routing in sub-flow $i$ of request $r$, and $r_t$ and $s_{t+1}$ are, respectively, the instant reward obtained after executing action $a_t$ and the network state at the next time $t+1$; for each state, the state transition probability $p(s_{t+1} \mid s_t, a_t)$ denotes the probability that the network state transitions to $s_{t+1}$ after the agent executes action $a_t$ in network state $s_t$;
Reward function: based on the optimization objective of the service function chain parallel routing problem, the reward function is defined as the negative of the total resource consumption of the servers and links:
$$r_t = -(\eta_1 U_N + \eta_2 U_L)$$
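To make the reward concrete, the following is a minimal sketch of how $r_t$ could be computed from the consumption caused by one placement and routing decision; the function names, the argument structure and the default weights $\eta_1 = \eta_2 = 0.5$ are assumptions of this illustration.

```python
# Illustrative reward computation; argument structure and defaults are assumptions.
def joint_resource_consumption(placed_demands, routed_flows,
                               eta1: float = 0.5, eta2: float = 0.5) -> float:
    """U = eta1 * U_N + eta2 * U_L for one set of placement and routing decisions.

    placed_demands: iterable of c^i_{f_m} values consumed on server nodes (U_N term)
    routed_flows:   iterable of (split_ratio, flow_rate) pairs, one per traversed link (U_L term)
    """
    u_n = sum(placed_demands)                                 # server resource consumption U_N
    u_l = sum(ratio * rate for ratio, rate in routed_flows)   # link bandwidth consumption U_L
    return eta1 * u_n + eta2 * u_l


def step_reward(placed_demands, routed_flows, eta1: float = 0.5, eta2: float = 0.5) -> float:
    """r_t: the negative of the joint resource consumption defined above."""
    return -joint_resource_consumption(placed_demands, routed_flows, eta1, eta2)
```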
the offline training process comprises the following steps:
step 1) the agent and the environment interact to generate training data, wherein the environment refers to an edge computing network based on network function virtualization, and the agent firstly observes the current network state s t And the network state s is entered by self-attention mechanism t Transition to state s 'based on neighbor node information' t The agent performs action a based on its current policy t The environment is based on the current state s' t And received action a t Update state s t+1 And feed back a reward signal r to the intelligent agent t The intelligent agent receives r t Updating its own strategy to make a better decision at the next time t+1, generating training data(s) by cycling the above process t ,a t ,r t ,s t+1 );
Step 2) training data(s) t ,a t ,r t ,s t+1 ) Stored in an experience playback pool;
step 3) randomly selecting a batch of samples from the experience playback pool when the state transition samples are accumulated to a preset number, and inputting the samples into the SA-EDDPG model;
Step 4) training an SA-EDDPG model, wherein the input of the SA-EDDPG model is a set R of a bottom physical network G and a service request, and the output is VNFf m Adopts a network structure of double actors-critics, and totally comprises four neural networks, namely a main actor network mu (s|theta μ ) Main criticizing home network Q (s, a|θ Q ) Target actor network μ' (s|θ) μ′ ) And a target critics network Q' (s, a|θ) Q′), wherein θμ 、θ Q 、θ μ′ and θQ′ Is a parameter in a neural network, the actor network being responsible for the network at a given pointGenerating actions of VNF placement and traffic routing in a network state, wherein a critic network is responsible for evaluating actions generated by an actor network, and in the training process, a main actor network updates a parameter theta by a strategy gradient method μ The main commentator network updates the parameter theta through a gradient descent method based on time sequence differential errors Q The target actor network and the target critic network update the parameter theta in a soft update mode μ′ and θQ′
The online operation process comprises the following steps:
Step 5) Select the SA-EDDPG model trained in the offline training process to perform online routing of the service function chain;
Step 6) Input the network state $s_t$ into the trained SA-EDDPG model and use the model to evaluate each action $a_t$ to obtain the corresponding reward; at the same time, store the transition data $(s_t, a_t, r_t, s_{t+1})$ in the experience replay pool for updating the SA-EDDPG model;
Step 7) Execute, on the underlying physical network, the VNF placement and traffic routing actions that obtain the highest reward.
In the SA-EDDPG model described above,
the actor network has five layers, in order: an input layer, an attention layer, two hidden layers and an output layer. The main actor network $\mu(s \mid \theta^{\mu})$ and the target actor network $\mu'(s \mid \theta^{\mu'})$ have the same neural network structure. The input layer receives the network state vector $s_t$; the attention layer converts the state vector $s_t$ of each node into a vector $s'_t$ that takes the information of all nodes into account; the two fully connected hidden layers each contain 256 neurons and process the state information; and the output of the output layer is the action $a_t$. The output layer is divided into two parts for obtaining the discrete and continuous action decisions respectively: the $FC_1$ output layer is defined to obtain the discrete VNF placement and link routing decisions, i.e. $x^{i,m}_{r,n}(t)$ and $y^{i}_{r,l}(t)$, and the $FC_2$ output layer is defined to obtain the continuous traffic split ratio, i.e. $\alpha^{i}_{r}(t)$. The $FC_1$ output layer obtains the corresponding action value (0 or 1) through a Sigmoid activation function, while the $FC_2$ output layer directly outputs the raw continuous action value, to which noise or amplitude limiting is then applied to obtain the final action value.
The critic network approximates the action-value function $Q(s,a)$ with a DNN and also comprises five layers: an input layer, an attention layer, two hidden layers and an output layer. The main critic network $Q(s,a \mid \theta^{Q})$ and the target critic network $Q'(s,a \mid \theta^{Q'})$ have the same neural network structure. The input layer receives the current network state $s_t$ and the action $a_t$ output by the actor network; the attention layer converts the network state vector $s_t$ into $s'_t$; the two hidden layers each contain 256 neurons and process the state and action information; the output layer outputs the value of the action-value function $Q(s,a)$; and the critic network uses the ReLU activation function to introduce nonlinear features.
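To make the attention layer concrete, the following is a minimal single-head self-attention module of the kind described above: it re-encodes each server node's state as an attention-weighted combination of all node states, so that $s_t$ becomes $s'_t$ with neighbor node information. The single-head design and the dimensions are assumptions of this sketch, not details fixed by the invention.

```python
# Illustrative single-head self-attention over server-node states; dimensions are assumptions.
import torch
import torch.nn as nn


class NodeSelfAttention(nn.Module):
    def __init__(self, feat_dim: int):
        super().__init__()
        self.q = nn.Linear(feat_dim, feat_dim)
        self.k = nn.Linear(feat_dim, feat_dim)
        self.v = nn.Linear(feat_dim, feat_dim)
        self.scale = feat_dim ** 0.5

    def forward(self, node_states: torch.Tensor) -> torch.Tensor:
        # node_states: (batch, num_nodes, feat_dim), built from (C_t, B_t, D_t)
        q, k, v = self.q(node_states), self.k(node_states), self.v(node_states)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return attn @ v  # s'_t: each node's encoding now weights all node states
```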
The training process of the SA-EDDPG model comprises the parameter updating of the four neural networks, specifically as follows:
For the main critic network $Q(s, a \mid \theta^{Q})$, the parameters $\theta^{Q}$ are updated using the TD error as in the DQN algorithm; the loss function of the main critic network is calculated as:
$$L(\theta^{Q}) = \frac{1}{M}\sum_{i=1}^{M}\big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2$$
where $M$ is the sample batch size and $y_i$ is the target value, calculated as:
$$y_i = r_i + \gamma\, Q'\big(s_{i+1},\ \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big)$$
For the main actor network $\mu(s \mid \theta^{\mu})$, the parameters $\theta^{\mu}$ are updated with the sampled policy gradient; the update formula of the main actor network is:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{M}\sum_{i=1}^{M} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\ \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i}$$
For the target critic network $Q'(s, a \mid \theta^{Q'})$ and the target actor network $\mu'(s \mid \theta^{\mu'})$, the respective parameters are updated in a soft-update manner:
$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$$
$$\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$$
where the parameter $\tau < 1$.
The training process of the SA-EDDPG model specifically comprises the following steps:
Step 4-1) Initialize the main critic network $Q(s, a \mid \theta^{Q})$, the main actor network $\mu(s \mid \theta^{\mu})$, the target critic network $Q'(s, a \mid \theta^{Q'})$ and the target actor network $\mu'(s \mid \theta^{\mu'})$, and initialize the experience replay pool;
Step 4-2) In each training round, perform the following steps:
Step 4-2-1) Reset the network environment; the agent acquires the initial network state from the reset environment;
Step 4-2-2) Initialize a random process and add random noise to each output action;
Step 4-2-3) For each time slot $t$ in the training round, perform the following iteration:
Step 4-2-3-1) Based on the self-attention mechanism, convert the network state $s_t$ into $s'_t$ based on neighbor node information;
Step 4-2-3-2) Based on $s'_t$, the agent selects and executes an action according to $a_t = \mu(s'_t \mid \theta^{\mu}) + \mathcal{N}_t$, where $\mathcal{N}_t$ is random noise; executing action $a_t$ yields the reward $r_t$, and at the same time the network state transitions from $s_t$ to $s_{t+1}$;
Step 4-2-3-3) Store the state transition data $(s_t, a_t, r_t, s_{t+1})$ obtained from the interaction of the agent with the environment in the experience replay pool;
Step 4-2-3-4) Randomly select $M$ samples from the experience replay pool to train the SA-EDDPG model;
Step 4-2-3-5) Update the main and target networks: calculate the target value $y_i$, update the parameters $\theta^{Q}$ of the main critic network using the loss function of the main critic network, update the parameters $\theta^{\mu}$ of the main actor network using the update formula of the main actor network, and update the parameters of the target critic network and the target actor network based on the soft-update formula;
Step 4-3) After the preset number of training rounds is reached, training is complete.
A delay-sensitive service function chain parallel route optimization device, implemented based on the method described above, comprising:
a parallel routing problem construction module, used for constructing a service function chain parallel routing problem in the edge computing network based on network function virtualization and, taking real-time network state changes into account, modeling the service function chain parallel routing problem with a Markov decision process to obtain an MDP model;
an SA-EDDPG algorithm solving module, used for solving the MDP model by an enhanced deep deterministic policy gradient algorithm based on a self-attention mechanism, wherein the solving process comprises an offline training part and an online running part: the offline training part obtains a trained SA-EDDPG model, and the online running part determines an optimal routing scheme by running the trained SA-EDDPG model to perform service function chain parallel routing.
A delay-sensitive service function chain parallel route optimization device, comprising a memory, a processor, and a program stored in the memory, wherein the processor implements the method described above when executing the program.
A storage medium having stored thereon a program which when executed performs a method as described above.
Compared with the prior art, the invention has the following beneficial effects:
(1) The invention provides an Enhanced Deep Deterministic Policy Gradient (EDDPG) algorithm, which can effectively handle discrete and continuous actions simultaneously by improving the DNN structure in the DDPG algorithm.
(2) The invention introduces a self-attention mechanism into the EDDPG algorithm. By calculating attention values among the server nodes in the edge network, the self-attention mechanism enables the agent in the DRL to focus on the more valuable server nodes when executing actions and to reduce its attention to irrelevant server nodes, thereby reducing unnecessary exploration by the agent, helping to solve the SFC parallel routing problem over widely dispersed edge server nodes, accelerating the training and convergence of the DNN, and improving the efficiency of SFC parallel routing.
(3) The method of the invention can minimize the resource consumption of the server and the link on the premise of meeting the time delay and the resource constraint, and has lower time delay and lower resource consumption compared with other algorithms.
(4) The invention takes real-time network state changes into consideration, and models SFC parallel routing problems by using a Markov Decision Process (MDP).
Drawings
FIG. 1 is a flow chart of the method of the present invention;
FIG. 2 is a schematic diagram of the solution process of the enhanced deep deterministic policy gradient algorithm based on a self-attention mechanism;
FIG. 3 is a graph of rewards versus different algorithms during training in one embodiment;
FIG. 4 is a graph of the acceptance rate of different algorithms for different request numbers in one embodiment;
FIG. 5 is a graph of total resource consumption versus different algorithms for different request numbers in one embodiment;
FIG. 6 is a graph of average latency versus different algorithms for different request numbers in one embodiment;
FIG. 7 is a graph of average time delay versus different algorithms for different numbers of nodes in one embodiment.
Detailed Description
The invention will now be described in detail with reference to the drawings and specific examples. The present embodiment is implemented on the premise of the technical scheme of the present invention, and a detailed implementation manner and a specific operation process are given, but the protection scope of the present invention is not limited to the following examples.
The embodiment provides a delay-sensitive service function chain parallel route optimization method, as shown in fig. 1, comprising the following steps:
s1: and constructing a service function chain parallel routing problem in the edge computing network based on network function virtualization, and modeling the service function chain parallel routing problem by using a Markov decision process to obtain an MDP model in consideration of real-time network state change.
S11: SFC parallel routing problems in NFV-based edge computing networks are constructed with the aim of minimizing the resource consumption of servers and links while meeting latency and resource constraints.
First, for ease of reading, this embodiment summarizes in Table 1 the parameters commonly used in this embodiment.
Table 1 Parameter explanation
    G = (N, L)        underlying physical network, with server node set N and physical link set L
    C_n               resource capacity of server node n ∈ N
    B_l, D_l          available bandwidth capacity and communication delay of link l ∈ L
    R                 set of SFC requests arriving in real time
    r(F, ε_r, D_r)    a request with required VNF set F, flow rate ε_r and delay constraint D_r
    I, i              number of sub-flows and sub-flow index
    F_i, f_m          set of VNFs required by sub-flow i; the m-th VNF in F_i
    c^i_{f_m}         resource requirement of VNF instance f_m of sub-flow i
    x^{i,m}_{r,n}     0-1 variable: f_m of sub-flow i of request r is placed on node n
    y^{i}_{r,l}       0-1 variable: sub-flow i of request r is routed through link l
    α^{i}_{r}         continuous traffic split ratio of sub-flow i of request r
    η_1, η_2          weights of server resource consumption and link bandwidth consumption
The underlying physical network is represented as a connected graph $G=(N,L)$, where $N$ is the set of server nodes (including various devices with computing capability at the network edge) and $L$ is the set of physical links connecting pairs of server nodes. For each server node $n \in N$, $C_n$ denotes its resource capacity, i.e., the available computing and storage resources. Similarly, for each link $l \in L$, $B_l$ and $D_l$ denote its available bandwidth capacity and communication latency, respectively.
Each SFC request is defined as an ordered sequence of VNFs through which the network traffic must be routed in turn. The traffic routed through the SFC is assumed to be splittable and can be divided into $I$ sub-flows, each denoted by $i$. $R$ denotes the set of SFC requests arriving in real time. Each SFC request $r(F, \varepsilon_r, D_r) \in R$ consists of the set $F$ of VNFs required by the request, the flow rate $\varepsilon_r$ of the traffic routed through the SFC, and the delay constraint $D_r$ of the request. Furthermore, for each sub-flow $i$ of an SFC request, the set of VNFs required by sub-flow $i$ is denoted $F_i = \{f_1, f_2, \dots, f_m, \dots, f_{|i|}\}$, where $f_m$ is the $m$-th VNF required by sub-flow $i$ and $|i|$ denotes the number of VNFs contained in sub-flow $i$. Each VNF service instance $f_m \in F_i$ has a resource requirement, denoted $c^{i}_{f_m}$.
First, constraints considered in the SFC parallel routing problem are determined.
(1) Node capacity constraint. The node capacity constraint ensures that the total resources consumed by the VNFs required by the requests do not exceed the resource limit of the server node $n$ on which they are deployed, in the mathematical form shown in equation (1):
$$\sum_{r \in R}\sum_{i \in I}\sum_{f_m \in F_i} x^{i,m}_{r,n}\, c^{i}_{f_m} \le C_n,\quad \forall n \in N \tag{1}$$
where $x^{i,m}_{r,n}$ is a 0-1 variable; its specific meaning is that if the $m$-th VNF $f_m$ required by sub-flow $i$ of request $r$ is placed on physical node $n$, the variable takes the value 1, and otherwise it takes the value 0.
(2) Link capacity constraint. The link capacity constraint ensures that the total bandwidth required by all requests passing through a link $l \in L$ does not exceed the bandwidth capacity $B_l$ of the link, as expressed below:
$$\sum_{r \in R}\sum_{i \in I} y^{i}_{r,l}\, \alpha^{i}_{r}\, \varepsilon_r \le B_l,\quad \forall l \in L \tag{2}$$
In equation (2), $y^{i}_{r,l}$ is a 0-1 variable; its specific meaning is that if link $l$ is used to deliver sub-flow $i$ of request $r$, the variable takes the value 1, and otherwise 0. $\alpha^{i}_{r}$ is a continuous variable representing the fraction of traffic carried by sub-flow $i$ of request $r$ (hereinafter the traffic split ratio), taking values in the range $[0,1]$, i.e. $\alpha^{i}_{r} \in [0,1]$.
(3) Delay constraint. For a request $r$, its delay consists of two parts, namely the processing delay at the server nodes and the communication delay when the links transmit traffic. Thus, the total delay $D_r^{total}$ of request $r$ is expressed as the following formula:
$$D_r^{total} = \max_{i \in I}\Big(\sum_{n \in N}\sum_{f_m \in F_i} x^{i,m}_{r,n}\, d^{n}_{f_m} + \sum_{l \in L} y^{i}_{r,l}\, D_l\Big) \tag{3}$$
where $d^{n}_{f_m}$ denotes the processing delay of $f_m \in F_i$ on server node $n \in N$. The request $r$ contains $I$ sub-flows in total, and the delay of each sub-flow $i$ may differ, so the maximum delay over all sub-flows $i$ is taken as the total delay of request $r$.
For each request $r$, if it is to be successfully accepted, the total delay $D_r^{total}$ of the request must not exceed its delay limit $D_r$, as expressed in equation (4):
$$D_r^{total} \le D_r \tag{4}$$
(4) Placement constraint. The placement constraint ensures that only one physical server is selected to place the $m$-th VNF required by sub-flow $i$ of request $r$. In addition, the placement constraint ensures that all VNFs in sub-flow $i$ of request $r$ can be served. The specific constraints are expressed as the following formulas:
$$\sum_{n \in N} x^{i,m}_{r,n} \le 1,\quad \forall r \in R,\ \forall i,\ \forall f_m \in F_i \tag{5}$$
$$\sum_{f_m \in F_i}\sum_{n \in N} x^{i,m}_{r,n} = |i|,\quad \forall r \in R,\ \forall i \tag{6}$$
the goal of the SFC parallel routing problem is to jointly minimize the resource consumption of the server and the bandwidth consumption of the link.
Resource consumption U for all servers N This can be expressed by the formula (7):
resource consumption U for all links L This can be expressed by the formula (8):
the problem of optimizing the consumption of joint resources is then finally expressed in the form:
wherein η1 and η2 The server resource consumption and the link bandwidth consumption are weighted, respectively. They satisfy eta 12 E (0, 1) and eta 12 =1。
S12: the SFC parallel routing problem is modeled with a Markov Decision Process (MDP) considering real-time network state changes.
With the modeling of the SFC parallel routing problem described above, this embodiment continues to describe how it is converted into an MDP model. In general, an MDP model is defined as a five-tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state space and the action space respectively, $\mathcal{P}$ is the state transition probability distribution, $\mathcal{R}$ is the reward function, and $\gamma \in [0,1]$ is the discount factor for future rewards. To account for real-time network state changes caused by the random arrival and departure of service requests, this embodiment introduces discrete time periods $T$. The five-tuple of the SFC parallel routing problem is defined specifically as follows:
State space: the state $s_t \in \mathcal{S}$ at time $t$ is defined as a vector $G_t = (C_t, B_t, D_t)$. Specifically, $G_t$ is used to represent the characteristics of the underlying physical network at time $t$, where $C_t$ represents the currently available resources of all server nodes, $B_t$ represents the currently available bandwidth of all physical links, and $D_t$ represents the latency of all physical links.
Action space: the action space of the agent is the set $\mathcal{A}$, where each action $a \in \mathcal{A}$ represents VNF placement and traffic routing. Specifically, the action at time $t$ is defined as $a_t = \big(x^{i,m}_{r,n}(t),\ y^{i}_{r,l}(t),\ \alpha^{i}_{r}(t)\big)$, where $x^{i,m}_{r,n}(t)$ is a discrete action that indicates whether the $m$-th VNF of sub-flow $i$ of request $r$ is placed on server node $n$ at time $t$; $y^{i}_{r,l}(t)$ is also a discrete action, indicating whether sub-flow $i$ of request $r$ is routed over physical link $l$ at time $t$; and $\alpha^{i}_{r}(t)$ is a continuous action used to represent the traffic split ratio of request $r$ at time $t$.
State transition: a state transition of the MDP is denoted as $(s_t, a_t, r_t, s_{t+1})$, where $s_t$ is the network state at the current time $t$, $a_t$ is the action handling VNF placement and traffic routing in sub-flow $i$ of request $r$, and $r_t$ and $s_{t+1}$ are, respectively, the instant reward obtained after executing action $a_t$ and the network state at the next time $t+1$. For each state, the state transition probability $p(s_{t+1} \mid s_t, a_t)$ represents the probability that, after the agent executes action $a_t$ in network state $s_t$, the network state transitions to $s_{t+1}$.
Reward function: typically, the goal of reinforcement learning is to maximize the reward, i.e., to maximize the cumulative discounted reward. In this embodiment, the optimization objective of the SFC routing problem is to minimize the resource consumption of the servers and the bandwidth consumption of the links. Thus, the reward function should be defined as the negative of the total resource consumption of the servers and links, as shown in equation (10):
$$r_t = -(\eta_1 U_N + \eta_2 U_L) \tag{10}$$
S2: The MDP model is solved by the enhanced deep deterministic policy gradient algorithm based on a self-attention mechanism (Self-Attention Mechanism based EDDPG, SA-EDDPG), which solves the online routing problem of SFC, handles discrete and continuous actions at the same time, and introduces the self-attention mechanism into the DNN structure to reduce the attention paid to irrelevant server nodes, thereby reducing unnecessary exploration by the agent, accelerating the training and convergence of the DNN, and improving the efficiency of SFC parallel routing.
Fig. 2 is a schematic diagram of the SA-EDDPG algorithm proposed by the present invention, which includes two main processes: an off-line training process (step 1-step 4) and an on-line running process (step 5-step 7).
(1) Training process. The offline training process of the architecture mainly includes steps 1-4 in Fig. 2. The goal of the training process is to generate an SA-EDDPG model.
Step 1: The agent interacts with the environment to generate training data. In this process, the environment refers to the NFV-based edge computing network. The agent first observes the current network state $s_t$ and then, through the self-attention mechanism, converts the network state $s_t$ into a state $s'_t$ based on neighbor node information. Next, the agent executes an action $a_t$ based on its current policy. The environment updates the state to $s_{t+1}$ according to the current state $s'_t$ and the received action $a_t$ and feeds back a reward signal $r_t$ to the agent. The agent then receives $r_t$ and updates its own policy so that a better decision is made at the next time $t+1$. The whole process is carried out cyclically in this way, and a large amount of training data $(s_t, a_t, r_t, s_{t+1})$ is generated. To break the correlation between adjacent samples in the reinforcement learning algorithm and to improve the utilization efficiency of the training samples, this embodiment introduces the experience replay technique.
Step 2: The training data $(s_t, a_t, r_t, s_{t+1})$ are stored in an experience replay pool.
Step 3: When the state transition samples accumulate to a certain number (depending on the size of the experience replay pool), a small batch of samples is randomly selected from the experience replay pool and passed to the SA-EDDPG model.
Step 4: Train the SA-EDDPG model. SA-EDDPG adopts a dual actor-critic (AC) network structure, so it comprises four neural networks in total: the main actor network $\mu(s \mid \theta^{\mu})$, the main critic network $Q(s, a \mid \theta^{Q})$, the target actor network $\mu'(s \mid \theta^{\mu'})$ and the target critic network $Q'(s, a \mid \theta^{Q'})$, where $\theta^{\mu}$, $\theta^{Q}$, $\theta^{\mu'}$ and $\theta^{Q'}$ are the parameters of the neural networks. The actor networks are responsible for generating VNF placement and traffic routing actions for a given network state, and the critic networks are responsible for evaluating the actions generated by the actor networks. During training, the main actor network updates its parameters $\theta^{\mu}$ by means of the policy gradient, the main critic network updates its parameters $\theta^{Q}$ by gradient descent based on the temporal-difference error (TD error), while the target actor network and the target critic network update their parameters $\theta^{\mu'}$ and $\theta^{Q'}$ in a soft-update manner.
(2) Running process. The online running process of the architecture includes steps 5-7 in Fig. 2.
Step 5: The SA-EDDPG model trained in the training process is selected to perform online routing of SFCs.
Step 6: The network state $s_t$ is input into the trained SA-EDDPG model, and the model is then used to evaluate each action $a_t$ to obtain the corresponding reward. To further update the SA-EDDPG model, this embodiment stores the data $(s_t, a_t, r_t, s_{t+1})$ in the experience replay pool.
Step 7: The VNF placement and traffic routing actions that obtain the highest reward are executed on the underlying physical network. In addition, once the accuracy of the trained model degrades significantly, the model needs to be retrained. Specifically, the previously trained SA-EDDPG model is replicated and trained using the newly acquired training data. Then, after the training procedure is completed, the old model is replaced with the new SA-EDDPG model to improve the performance of the model.
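The online running steps 5-7 can be illustrated as follows, assuming an actor network with the two output heads described earlier (Sigmoid placement/routing outputs and a raw split-ratio output); the function and variable names are hypothetical.

```python
# Illustrative online decision step; assumes an actor returning (placement, routing, split) heads.
import torch


@torch.no_grad()
def online_route(actor, state_t: torch.Tensor):
    """One online routing decision for the current network state s_t (steps 5-7 sketch)."""
    place_prob, route_prob, split_raw = actor(state_t.unsqueeze(0))
    placement = (place_prob > 0.5).long().squeeze(0)     # discrete VNF placement decisions
    routing = (route_prob > 0.5).long().squeeze(0)       # discrete link routing decisions
    split_ratio = split_raw.clamp(0.0, 1.0).squeeze(0)   # continuous traffic split ratio
    return placement, routing, split_ratio
```

The transition $(s_t, a_t, r_t, s_{t+1})$ observed while running online would still be written to the experience replay pool, so that the model can keep being updated as described in step 6.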
The SA-EDDPG algorithm provided by the invention adopts an actor-critic network structure, and the actor network and the critic network each comprise a main network and a target network, so the SA-EDDPG algorithm comprises four neural networks in total; their structures are described in detail below.
The actor network is a policy-based deep reinforcement learning method that uses a DNN to learn a deterministic policy and directly uses this policy to generate deterministic actions without sampling from a stochastic policy. The input layer of the actor network is the network state vector $s_t$. In order to obtain the information of neighboring nodes, this embodiment also introduces an attention layer into the DNN structure of the actor network. Through the attention layer, the state vector $s_t$ of each node is converted into a vector $s'_t$ that takes the information of all nodes into account. The attention layer is followed by two fully connected hidden layers, both containing 256 neurons, for processing the state information. The output layer of the actor network is the action $a_t$.
Unlike the output layer in a traditional actor network, this embodiment divides the last layer of the actor network into two parts for obtaining the discrete and continuous action decisions respectively. Specifically, the $FC_1$ output layer is defined to obtain the discrete VNF placement and link routing decisions (i.e. $x^{i,m}_{r,n}(t)$ and $y^{i}_{r,l}(t)$), and the $FC_2$ output layer is defined to obtain the continuous traffic split ratio (i.e. $\alpha^{i}_{r}(t)$). The $FC_1$ output layer obtains the corresponding action value (i.e. 0 or 1) through a Sigmoid activation function. The $FC_2$ output layer directly outputs the raw continuous action value, to which noise or amplitude limiting is then applied to obtain the final action value. Thus, the neural network structure of the actor network has five layers, namely an input layer, an attention layer, two hidden layers and an output layer, and the main actor network $\mu(s \mid \theta^{\mu})$ and the target actor network $\mu'(s \mid \theta^{\mu'})$ have the same neural network structure.
The critic network is a value-function-based deep reinforcement learning method, which approximates the action-value function $Q(s,a)$ with a DNN. The input of the critic network is the current network state $s_t$ and the action $a_t$ output by the actor network. Similarly, an attention layer is also introduced into the DNN of the critic network, converting the network state vector $s_t$ into $s'_t$. The attention layer is likewise followed by two hidden layers containing 256 neurons for processing the state and action information. The output layer of the critic network outputs the value of the action-value function $Q(s,a)$. Furthermore, the critic network uses the ReLU activation function to introduce nonlinear features, thereby enhancing the processing capability of the DNN. Therefore, the neural network structure of the critic network also comprises five layers, and the main critic network $Q(s,a \mid \theta^{Q})$ and the target critic network $Q'(s,a \mid \theta^{Q'})$ also have the same neural network structure.
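Putting the pieces together, the following is a minimal PyTorch sketch of the dual-head actor and of the critic described above, reusing the NodeSelfAttention layer sketched earlier. The layer sizes follow the description (one attention layer, two 256-neuron hidden layers, Sigmoid on the discrete heads, ReLU elsewhere), while the flattening of the node features and the head dimensions are assumptions of this illustration.

```python
# Illustrative actor/critic structures; layer widths follow the text, other details are assumptions.
import torch
import torch.nn as nn


class Actor(nn.Module):
    """mu(s | theta^mu): attention layer, two 256-neuron hidden layers, FC1/FC2 output heads."""

    def __init__(self, num_nodes: int, feat_dim: int, n_place: int, n_route: int, n_split: int):
        super().__init__()
        self.attn = NodeSelfAttention(feat_dim)              # converts s_t into s'_t
        self.hidden = nn.Sequential(
            nn.Linear(num_nodes * feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.fc1_place = nn.Linear(256, n_place)             # FC1 head: discrete VNF placement
        self.fc1_route = nn.Linear(256, n_route)             # FC1 head: discrete link routing
        self.fc2_split = nn.Linear(256, n_split)             # FC2 head: continuous split ratio

    def forward(self, node_states: torch.Tensor):
        s_prime = self.attn(node_states).flatten(start_dim=1)
        h = self.hidden(s_prime)
        place = torch.sigmoid(self.fc1_place(h))             # thresholded to 0/1 outside
        route = torch.sigmoid(self.fc1_route(h))
        split = self.fc2_split(h)                            # raw value; noise/clipping applied outside
        return place, route, split


class Critic(nn.Module):
    """Q(s, a | theta^Q): attention layer, two 256-neuron hidden layers, scalar Q output."""

    def __init__(self, num_nodes: int, feat_dim: int, action_dim: int):
        super().__init__()
        self.attn = NodeSelfAttention(feat_dim)
        self.net = nn.Sequential(
            nn.Linear(num_nodes * feat_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, node_states: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        s_prime = self.attn(node_states).flatten(start_dim=1)
        return self.net(torch.cat([s_prime, action], dim=-1))
```

The target actor and target critic would be structurally identical copies of these two modules, as stated above.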
As described above, the SA-EDDPG algorithm comprises four neural networks in total, and the operation of the SA-EDDPG algorithm involves updating the parameters of these four networks, which is described below.
For the main critic network $Q(s, a \mid \theta^{Q})$, the update of its parameters $\theta^{Q}$ uses the TD error as in the DQN algorithm. Therefore, the loss function of the main critic network is calculated as:
$$L(\theta^{Q}) = \frac{1}{M}\sum_{i=1}^{M}\big(y_i - Q(s_i, a_i \mid \theta^{Q})\big)^2 \tag{11}$$
where $M$ is the sample batch size and $y_i$ is the target value. $y_i$ is calculated as:
$$y_i = r_i + \gamma\, Q'\big(s_{i+1},\ \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'}\big) \tag{12}$$
For the main actor network $\mu(s \mid \theta^{\mu})$, the update of its parameters $\theta^{\mu}$ uses the sampled policy gradient method. The update formula of the main actor network is expressed as:
$$\nabla_{\theta^{\mu}} J \approx \frac{1}{M}\sum_{i=1}^{M} \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)}\ \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_i} \tag{13}$$
The parameters of the target critic network $Q'(s, a \mid \theta^{Q'})$ and the target actor network $\mu'(s \mid \theta^{\mu'})$ are updated in a soft-update manner, with the specific formula:
$$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'} \tag{14}$$
where the parameter $\tau < 1$, which makes the parameter updates of the target networks slow and stable and improves the learning stability.
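The four updates above can be sketched as a single training step following equations (11)-(14); the optimizer settings, the default $\gamma$ and $\tau$ values, and the concatenation of the actor's three outputs into one action vector are assumptions of this illustration.

```python
# Illustrative SA-EDDPG parameter update; hyperparameter defaults are assumptions.
import torch
import torch.nn.functional as F


def update_step(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma: float = 0.99, tau: float = 0.005):
    """One parameter update on a mini-batch of transitions (s, a, r, s')."""
    s, a, r, s_next = batch

    # (12): target value y_i from the target actor and target critic
    with torch.no_grad():
        a_next = torch.cat(target_actor(s_next), dim=-1)
        y = r.view(-1, 1) + gamma * target_critic(s_next, a_next)

    # (11): main critic update, mean squared TD error minimized by gradient descent
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # (13): main actor update, sampled policy gradient (ascend Q of the actor's own actions)
    actor_loss = -critic(s, torch.cat(actor(s), dim=-1)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # (14): soft update of the target networks with tau < 1
    for tgt, src in ((target_critic, critic), (target_actor, actor)):
        for p_t, p in zip(tgt.parameters(), src.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)
```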
Table 2 describes the training flow of the SA-EDDPG based SFC parallel routing method. Specifically, the input of the model is the underlying physical network $G$ and the set of service requests $R$, while the output is the placement locations of the VNFs $f_m$ and the routing paths of the traffic. First, the network weight parameters of the main critic network $Q(s, a \mid \theta^{Q})$, the main actor network $\mu(s \mid \theta^{\mu})$, the target critic network $Q'(s, a \mid \theta^{Q'})$ and the target actor network $\mu'(s \mid \theta^{\mu'})$ are initialized (lines 1-2), and then the experience replay pool is initialized (line 3). In each training round, the network environment should be reset before training begins, and the agent acquires the initial network state from the reset environment (line 5). In order to balance exploration and exploitation of the agent, this embodiment introduces a random process into the training process of the SA-EDDPG algorithm, which is achieved by adding random noise to each output action (line 6). For each time slot $t$, the self-attention mechanism is first applied to convert the network state $s_t$ into $s'_t$ based on neighbor node information (line 8). Then, based on $s'_t$, the action $a_t$ is selected according to the formula $a_t = \mu(s'_t \mid \theta^{\mu}) + \mathcal{N}_t$, where $\mathcal{N}_t$ is random noise that allows the agent to better explore potentially optimal actions (line 9). After executing action $a_t$, the reward $r_t$ is obtained; at the same time, the network state transitions from $s_t$ to $s_{t+1}$. Then, the state transition data $(s_t, a_t, r_t, s_{t+1})$ obtained from the interaction of the agent with the environment are stored in the experience replay pool (lines 10-11). Thereafter, $M$ samples are randomly selected from the replay pool to train the SA-EDDPG model (line 12).
Next, the main networks and the target networks are updated. First, the target value $y_i$ is calculated according to formula (12) (line 13). Subsequently, the parameters $\theta^{Q}$ of the main critic network are updated using formula (11) (line 14); specifically, the update minimizes the mean square error between the target value and the Q value of the main critic network using gradient descent. Then, the parameters $\theta^{\mu}$ of the main actor network are updated using equation (13) (line 15). For the target critic network and the target actor network, their parameters $\theta^{Q'}$ and $\theta^{\mu'}$ are updated according to formula (14) (line 16). As can be seen from equation (14), the SA-EDDPG algorithm employs a soft-update manner, which makes the update of the target networks more stable, unlike the DQN algorithm, which periodically copies the parameters from the main network to update the target network.
TABLE 2 SFC routing algorithm based on SA-EDDPG
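The training flow of Table 2 can be illustrated with the following minimal sketch, which combines the environment, the actor/critic modules and the update_step function sketched earlier; the environment interface, the noise scale, the replay-pool size and the batch size are assumptions of this illustration.

```python
# Illustrative offline training loop for SA-EDDPG; the env interface and constants are assumptions.
import copy
import random
from collections import deque

import torch


def train_sa_eddpg(env, actor, critic, episodes: int = 500, batch_size: int = 64):
    """Offline training of the SA-EDDPG based SFC parallel routing method (sketch of Table 2)."""
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
    replay = deque(maxlen=100_000)                       # experience replay pool

    for _ in range(episodes):
        s = env.reset()                                  # reset environment, get initial state
        done = False
        while not done:
            with torch.no_grad():
                a = torch.cat(actor(s.unsqueeze(0)), dim=-1).squeeze(0)
            a = a + 0.1 * torch.randn_like(a)            # exploration noise N_t
            s_next, r, done = env.step(a)                # execute action, observe reward r_t
            replay.append((s, a, r, s_next))             # store transition in the replay pool
            s = s_next

            if len(replay) >= batch_size:
                samples = random.sample(replay, batch_size)
                batch = tuple(
                    torch.stack([torch.as_tensor(x[k], dtype=torch.float32) for x in samples])
                    for k in range(4)
                )
                update_step(batch, actor, critic, target_actor, target_critic,
                            actor_opt, critic_opt)
    return actor
```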
In this example, the performance of the SA-EDDPG algorithm was evaluated by simulation experiments. First, the relevant parameters of the simulation experiments are set. Then, SA-EDDPG is compared with other existing algorithms on various metrics, and the experimental results are analyzed.
Experimental setup
This embodiment generates a physical network with 50 nodes using the NetworkX tool, with one link between each pair of nodes. In addition, the other parameters in the network are generated following some previous related studies. The capacity of each node in the network is randomly selected from [30,100] units. The bandwidth of each link in the physical network is randomly assigned from [100,1000] Mbps, and its delay is randomly valued from [1,10] ms. Each SFC contains between 2 and 8 VNFs. The amount of resources requested by a VNF obeys a uniform distribution over [1,5] units. The bandwidth requirement of an SFC obeys a uniform distribution over [10,50] Mbps.
All simulation experiments were performed on a computer equipped with an Intel(R) Core(TM) i7-11700 CPU and an NVIDIA GeForce RTX 3050 GPU. In addition, this example uses Python 3.7 and PyTorch 1.8 to complete all experiments. As described above, the deep neural network structure is composed of an input layer, one attention layer, two hidden layers, and an output layer, and the number of neurons in each hidden layer is 256. In order to handle both discrete and continuous actions, two different output layers are used in the actor network, with the Sigmoid activation function. The other layers of the actor network and the critic network use ReLU as the activation function. The other experimental parameter settings are shown in Table 3.
TABLE 3 experimental parameters
(II) Performance analysis
To evaluate the performance of the SA-EDDPG algorithm, this embodiment compares it with the following two algorithms.
SFC-DOP: SFC-DOP is an SFC online orchestration algorithm based on DDPG. It comprises VNF placement and traffic routing modules and can realize dynamic orchestration of SFCs in complex and dynamic networks. In addition, it can effectively handle the continuous actions in the SFC orchestration problem. However, it does not take the information of other server nodes into account when placing VNFs.
A-DDPG: A-DDPG adopts a DRL algorithm based on DDPG to solve the dynamic VNF placement and traffic routing problems. Unlike SFC-DOP, the A-DDPG algorithm introduces an attention mechanism, which takes into account the influence of the states of neighbor nodes on the placement decision when placing VNFs and can speed up the training and convergence of the algorithm. However, compared with the SA-EDDPG algorithm of the present invention, it cannot handle both discrete and continuous actions.
Figure 3 shows the reward values of all algorithms during training. As can be seen from the figure, the reward value of each algorithm gradually converges as the number of rounds increases. In particular, it is observed that the SA-EDDPG algorithm always receives the highest rewards and converges faster than the other two algorithms. Specifically, SFC-DOP uses the DDPG algorithm to obtain the optimal policy, but it does not consider the influence of the information of other server nodes on VNF placement, so it has the lowest rewards and the slowest convergence speed. In contrast, A-DDPG is significantly better than SFC-DOP in reward value and convergence speed. Because of the introduction of the attention mechanism, A-DDPG considers the additional state information of neighbor nodes when selecting actions, so the agent can concentrate on more valuable nodes when deciding and obtain positive rewards more quickly during training, thereby accelerating training and convergence. However, the A-DDPG algorithm performs worse than SA-EDDPG because it cannot handle both discrete and continuous decision actions. As described above, the SA-EDDPG algorithm provided by the present invention improves the structure of the neural network and can process discrete VNF placement decisions and continuous traffic splitting decisions at the same time. Thus, in a dynamic NFV-based edge computing network, the agent of SA-EDDPG is more intelligent.
Fig. 4 shows the request acceptance rates of all algorithms for different numbers of requests. It can be observed that as the number of requests increases, the request acceptance rates of the three algorithms all show a decreasing trend. This is because as the number of service requests increases, the underlying physical network resources become increasingly occupied, resulting in subsequent requests being rejected. More specifically, the request acceptance rate of SFC-DOP is significantly lower than that of the other two algorithms because the impact of the underlying server resources and link resources on the SFC routing policy is not considered in its reward function. In contrast, the A-DDPG algorithm and the SA-EDDPG algorithm proposed by the present invention take the impact of the underlying resources on SFC routing into account. In addition, due to the introduction of the self-attention mechanism, the A-DDPG and SA-EDDPG algorithms can obtain information from other server nodes, so the agent can better capture the dynamic changes of the underlying physical network resource state and thus achieve a higher request acceptance rate. However, the SA-EDDPG algorithm still has a higher request acceptance rate than the A-DDPG algorithm because it is able to handle both discrete and continuous actions in the SFC routing problem, thereby utilizing the resources of the underlying network more efficiently to accept requests. Compared with the other two algorithms, the request acceptance rate of the SA-EDDPG algorithm is improved by 15% and 7%, respectively.
Fig. 5 shows the total resource consumption of three algorithms at different request numbers, including server resource consumption and link bandwidth consumption. It can be observed from the figure that as the number of requests increases, the total resource consumption of the three algorithms also tends to increase. The SA-EDDPG algorithm has minimal total resource consumption compared to other algorithms. Specifically, it is 38% and 20% less than the SFC-DOP algorithm and the A-DDPG algorithm, respectively. This is because the inclusion of a resource optimization incentive mechanism in the reward function of the SA-EDDPG algorithm can motivate the agent to select actions that consume less resources, thereby reducing overall resource consumption. In contrast, the a-DDPG algorithm consumes more total resources than the SA-EDDPG because it cannot efficiently handle the discrete decision actions in the SFC routing problem, resulting in a waste of resources. While SFC-DOP consumes the most total resources, on the one hand because it does not take into account resource optimization when performing SFC routing, and on the other hand because SFC-DOP does not fully utilize information from other server nodes compared to the other two algorithms, which makes the agent not efficiently adaptable to NFV-based edge computing network environments.
Fig. 6 shows the average delay for all algorithms at different request numbers. As the number of requests increases, the average latency of the three algorithms increases to some extent. However, as shown in fig. 4, the request acceptance rate decreases as the number of requests increases, which results in a decrease in the number of accepted requests. Thus, the rate of increase of the average delay of all algorithms is slowed down. Specifically, the average time delay of the SFC-DOP algorithm is significantly higher than that of the A-DDPG algorithm and the SA-EDDPG algorithm. This is because the SFC-DOP algorithm does not consider the state information of the neighbor nodes, increasing unnecessary exploration by the agent, resulting in higher latency. By introducing the attention mechanism, the A-DDPG algorithm can fully utilize the information of other server nodes, and reduce the inefficient exploration of the intelligent agent, thereby reducing the average time delay. Compared with the other two algorithms, the SA-EDDPG algorithm provided by the invention has the lowest average time delay which is 40% and 24% lower than that of the SFC-DOP algorithm and the A-DDPG algorithm respectively. This is because the SA-EDDPG algorithm adopts a parallel routing strategy, which can effectively reduce the delay of the service request.
Fig. 7 shows the average delay of all algorithms for different numbers of physical nodes. As the number of nodes increases, the average delay of all three algorithms tends to decrease. This is because, with more physical nodes, there are sufficient resources and suitable paths in the network to accommodate the service requests. The average delay of the SFC-DOP algorithm is the highest of the three. This is because, as the number of nodes increases, the network topology becomes more complex, and since the SFC-DOP algorithm ignores the state information of other physical nodes, the agent spends more time exploring, thereby incurring higher delay. In contrast, both the A-DDPG and SA-EDDPG algorithms achieve lower average delay than the SFC-DOP algorithm because both add an attention mechanism that enables the agent to focus on more valuable nodes when performing actions. In addition, the SA-EDDPG algorithm performs parallel SFC routing by means of traffic splitting and repeated instantiation of VNFs, and simultaneously handles discrete and continuous actions through its improved neural network structure, and therefore has the lowest average delay. The experimental results also show that the SA-EDDPG algorithm can achieve lower delay while consuming fewer resources.
This embodiment mainly addresses the SFC parallel routing problem in NFV-based edge computing networks. To meet the demands of delay-sensitive applications in edge computing networks, the invention proposes an optimization method that implements SFC parallel routing through traffic splitting and repeated instantiation of VNFs. Specifically, a model of the SFC parallel routing problem is first constructed that considers the resource limitations in the edge computing network; its goal is to minimize the resource consumption of servers and links while satisfying the delay and resource constraints. To cope with the dynamically changing network state and the geographically dispersed edge server nodes, a DRL-based SA-EDDPG algorithm is proposed to solve the problem. By improving the neural network structure and using a self-attention mechanism, the algorithm can simultaneously handle the discrete and continuous actions in the SFC parallel routing problem and reduce unnecessary exploration by the agent, thereby improving the efficiency of SFC parallel routing. Finally, a large number of simulation experiments are carried out in this embodiment, and the performance of the SA-EDDPG algorithm is evaluated in terms of request acceptance rate, total resource consumption, and delay. The experimental results show that, compared with existing algorithms, the proposed SA-EDDPG algorithm can effectively improve the acceptance rate of service requests while reducing their delay and total resource consumption.
The above functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
The foregoing describes in detail preferred embodiments of the present invention. It should be understood that numerous modifications and variations can be made in accordance with the concepts of the invention by one of ordinary skill in the art without undue burden. Therefore, all technical solutions which can be obtained by logic analysis, reasoning or limited experiments based on the prior art by a person skilled in the art according to the inventive concept shall be within the scope of protection defined by the claims.

Claims (11)

1. A delay-sensitive service function chain parallel route optimization method, characterized by comprising the following steps:
constructing a service function chain parallel routing problem in an edge computing network based on network function virtualization, and, taking real-time network state changes into consideration, modeling the service function chain parallel routing problem with a Markov decision process to obtain an MDP model;
solving the MDP model with a self-attention-mechanism-based enhanced deep deterministic policy gradient (SA-EDDPG) algorithm, wherein the solving process comprises an offline training part and an online running part; the offline training part obtains a trained SA-EDDPG model, and the online running part determines an optimal routing scheme by running the trained SA-EDDPG model to perform service function chain parallel routing.
2. The method for optimizing parallel routes of a delay-sensitive service function chain according to claim 1, wherein the network model of the edge computing network is:
the underlying physical network is represented as a connected graph $G=(N, L)$, where $N$ is the set of server nodes, including various devices with computing capabilities at the network edge, and $L$ is the set of physical links connecting two server nodes; for each server node $n \in N$, $C_n$ denotes its resource capacity; for each link $l \in L$, $B_l$ and $D_l$ denote its available bandwidth capacity and communication delay, respectively;
each service function chain request is defined as a series of ordered virtual network functions (VNFs), and network traffic needs to be routed through each VNF in the service function chain in turn; the traffic routed through the service function chain is assumed to be splittable and is divided into $I$ sub-flows, each denoted by $i$; the set of service function chain requests arriving in real time is denoted by $R$; for each service function chain request $r(F, \varepsilon_r, D_r) \in R$, $F$ denotes the set of VNFs required by the request, $\varepsilon_r$ denotes the traffic rate of the flow traversing the service function chain, and $D_r$ denotes the delay constraint of the request; for each sub-flow $i$ of a service function chain request, the set of VNFs required by sub-flow $i$ is denoted $F_i=\{f_1, f_2, \ldots, f_m, \ldots, f_{|i|}\}$, where $f_m$ is the $m$-th VNF required by sub-flow $i$ and $|i|$ denotes the number of VNFs contained in sub-flow $i$; each VNF service instance $f_m \in F_i$ has an associated resource requirement.
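As a concrete illustration of the network model in claim 2 above, the following Python sketch shows one possible in-memory representation of the underlying physical network and a service function chain request. The class names, attribute names, and example values are assumptions introduced here for illustration and do not appear in the original text.

```python
from dataclasses import dataclass

@dataclass
class PhysicalNetwork:
    """Connected graph G = (N, L): server nodes with capacities C_n,
    links with available bandwidth B_l and communication delay D_l."""
    node_capacity: dict    # n -> C_n (available server resources)
    link_bandwidth: dict   # (u, v) -> B_l (available bandwidth)
    link_delay: dict       # (u, v) -> D_l (communication delay)

@dataclass
class SFCRequest:
    """Request r(F, eps_r, D_r): ordered VNFs, traffic rate, delay bound."""
    vnfs: list             # F: ordered list of required VNF types
    traffic_rate: float    # eps_r
    delay_bound: float     # D_r
    num_subflows: int = 1  # I: number of sub-flows after traffic splitting

# Example usage (toy topology with three edge servers).
net = PhysicalNetwork(
    node_capacity={0: 10.0, 1: 8.0, 2: 12.0},
    link_bandwidth={(0, 1): 100.0, (1, 2): 80.0},
    link_delay={(0, 1): 2.0, (1, 2): 3.0},
)
req = SFCRequest(vnfs=["fw", "nat", "dpi"], traffic_rate=20.0, delay_bound=15.0)
```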
3. The delay-sensitive service function chain parallel route optimization method according to claim 2, wherein modeling the service function chain parallel routing problem with a Markov decision process comprises the following steps:
S1: determining constraint conditions;
S11: determining a node capacity constraint, which ensures that the total resources consumed by the VNFs required by the requests and deployed on a server node $n$ do not exceed the resource limit of that node, expressed as:
wherein the placement variable is a 0-1 variable that takes the value 1 if the $m$-th VNF $f_m$ required by sub-flow $i$ of request $r$ is placed on physical node $n$, and 0 otherwise;
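The constraint formula referenced in S11 is not reproduced in the text above. A plausible LaTeX reconstruction, using the assumed symbols $x^{r,i}_{m,n}$ for the 0-1 placement variable and $c_{f_m}$ for the resource requirement of VNF $f_m$ (both introduced here for illustration only), is:

```latex
\sum_{r \in R} \sum_{i=1}^{I} \sum_{f_m \in F_i} x^{r,i}_{m,n} \, c_{f_m} \;\le\; C_n , \qquad \forall n \in N
```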
S12: determining a link capacity constraint, which ensures that the total bandwidth required by all requests passing through a link $l \in L$ does not exceed the bandwidth capacity $B_l$ of the link, expressed as:
wherein the link routing variable is a 0-1 variable that takes the value 1 if link $l$ is used to carry sub-flow $i$ of request $r$, and 0 otherwise; the traffic splitting ratio is a continuous variable representing the fraction of the traffic of request $r$ assigned to sub-flow $i$, and takes a value between 0 and 1;
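The link capacity formula referenced in S12 is likewise missing from the text. A plausible reconstruction, using the assumed symbols $y^{r,i}_{l}$ for the 0-1 link routing variable and $\rho^{r}_{i}$ for the traffic splitting ratio of sub-flow $i$, is:

```latex
\sum_{r \in R} \sum_{i=1}^{I} y^{r,i}_{l} \, \rho^{r}_{i} \, \varepsilon_r \;\le\; B_l , \qquad \forall l \in L
```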
S13: determining a delay constraint: for a request $r$, its delay comprises two parts, namely the processing delay at the server nodes and the communication delay of traffic transmission over the links; the total delay $D^{r}_{total}$ of request $r$ is expressed as:
wherein the processing delay term denotes the processing delay of $f_m \in F_i$ on server node $n \in N$, and the maximum delay over all sub-flows $i$ is taken as the total delay of request $r$;
for each request $r$, if it is to be successfully accepted, its total delay $D^{r}_{total}$ cannot exceed its delay limit $D_r$;
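The total delay expression and delay constraint referenced in S13 can be reconstructed along the following lines, with $d^{n}_{f_m}$ assumed here to denote the processing delay of VNF $f_m$ on node $n$ and the placement and routing variables as above:

```latex
D^{r}_{total} = \max_{i}\Big( \sum_{f_m \in F_i} \sum_{n \in N} x^{r,i}_{m,n}\, d^{n}_{f_m}
      \;+\; \sum_{l \in L} y^{r,i}_{l}\, D_l \Big), \qquad D^{r}_{total} \le D_r
```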
S14: determining a placement constraint, which ensures that exactly one physical server is selected to place the $m$-th VNF required by sub-flow $i$ of request $r$, so that all VNFs in sub-flow $i$ of request $r$ can be served, expressed as:
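A plausible reconstruction of the placement constraint in S14, in the same assumed notation, is:

```latex
\sum_{n \in N} x^{r,i}_{m,n} = 1 , \qquad \forall r \in R,\; \forall i,\; \forall f_m \in F_i
```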
S2: determining the optimization objective of the service function chain parallel routing problem, which is to minimize the joint resource consumption, i.e., to minimize the resource consumption of servers and the bandwidth consumption of links, wherein:
the resource consumption $U_N$ of all servers is expressed as:
the bandwidth consumption $U_L$ of all links is expressed as:
then, the optimization objective of joint resource consumption is expressed as:
wherein $\eta_1$ and $\eta_2$ are the weights of server resource consumption and link bandwidth consumption, respectively, satisfying $\eta_1, \eta_2 \in (0, 1)$ and $\eta_1 + \eta_2 = 1$.
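The three expressions referenced in S2 (server resource consumption, link bandwidth consumption, and the joint objective) are not reproduced above. A plausible reconstruction in the same assumed notation is:

```latex
U_N = \sum_{n \in N} \sum_{r \in R} \sum_{i=1}^{I} \sum_{f_m \in F_i} x^{r,i}_{m,n}\, c_{f_m},
\qquad
U_L = \sum_{l \in L} \sum_{r \in R} \sum_{i=1}^{I} y^{r,i}_{l}\, \rho^{r}_{i}\, \varepsilon_r,
\qquad
\min \;\; \eta_1 U_N + \eta_2 U_L
```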
4. The delay-sensitive service function chain parallel route optimization method according to claim 3, wherein the MDP model is defined as a five-tuple consisting of the state space, the action space, the state transition probability distribution, the reward function $R$, and a discount factor $\gamma \in [0, 1]$ for future rewards, which are specifically defined as follows:
State space: the state at time $t$ is defined as a vector $G_t=(C_t, B_t, D_t)$, which characterizes the underlying physical network at time $t$, wherein $C_t$ represents the currently available resources of all server nodes, $B_t$ represents the currently available bandwidth of all physical links, and $D_t$ represents the delays of all physical links;
Action space: the action space of the agent is a set in which each action $a$ represents VNF placement and traffic routing; the action $a_t$ at time $t$ consists of three components: a discrete action indicating whether the $m$-th VNF of sub-flow $i$ of request $r$ is placed on server node $n$ at time $t$; a discrete action indicating whether sub-flow $i$ of request $r$ is routed through physical link $l$ at time $t$; and a continuous action representing the traffic splitting ratio of request $r$ at time $t$;
State transition: a state transition is represented as $(s_t, a_t, r_t, s_{t+1})$, wherein $s_t$ is the network state at the current time $t$, $a_t$ is the action handling VNF placement and traffic routing for sub-flow $i$ of request $r$, and $r_t$ and $s_{t+1}$ are, respectively, the immediate reward obtained after executing action $a_t$ and the network state at the next time $t+1$; for each state, the state transition probability $p(s_{t+1} \mid s_t, a_t)$ represents the probability that the network state transitions to $s_{t+1}$ after the agent executes action $a_t$ in network state $s_t$;
Reward function: based on the optimization objective of the service function chain parallel routing problem, the reward function is defined as the negative value of the total resource consumption of servers and links:
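With the objective above, the reward formula referenced here would take the following form (a reconstruction consistent with the stated definition, not a verbatim reproduction):

```latex
r_t = -\,\big( \eta_1 U_N + \eta_2 U_L \big)
```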
5. The method for optimizing parallel routes of a delay-sensitive service function chain according to claim 4, wherein the offline training process comprises the following steps:
step 1) the agent interacts with the environment to generate training data, wherein the environment refers to the edge computing network based on network function virtualization; the agent first observes the current network state $s_t$ and converts it, by means of the self-attention mechanism, into a state $s'_t$ that incorporates neighbor node information; the agent executes an action $a_t$ based on its current policy; the environment updates the state to $s_{t+1}$ based on the current state $s'_t$ and the received action $a_t$, and feeds back a reward signal $r_t$ to the agent; the agent receives $r_t$ and updates its own policy so as to make a better decision at the next time $t+1$; the training data $(s_t, a_t, r_t, s_{t+1})$ are generated by cycling the above process;
step 2) storing the training data $(s_t, a_t, r_t, s_{t+1})$ in an experience replay pool;
step 3) when the state transition samples have accumulated to a preset number, randomly selecting a batch of samples from the experience replay pool and inputting them into the SA-EDDPG model;
step 4) training the SA-EDDPG model, wherein the input of the SA-EDDPG model is the underlying physical network $G$ and the set $R$ of service requests, and the output is the placement of each VNF $f_m$ and the corresponding traffic routing decision; the model adopts a dual actor-critic network structure comprising four neural networks in total, namely a main actor network $\mu(s|\theta^{\mu})$, a main critic network $Q(s, a|\theta^{Q})$, a target actor network $\mu'(s|\theta^{\mu'})$, and a target critic network $Q'(s, a|\theta^{Q'})$, wherein $\theta^{\mu}$, $\theta^{Q}$, $\theta^{\mu'}$, and $\theta^{Q'}$ are the parameters of the respective neural networks; the actor networks are responsible for generating VNF placement and traffic routing actions in a given network state, and the critic networks are responsible for evaluating the actions generated by the actor networks; during training, the main actor network updates the parameters $\theta^{\mu}$ by a policy gradient method, the main critic network updates the parameters $\theta^{Q}$ by gradient descent based on the temporal-difference error, and the target actor network and the target critic network update the parameters $\theta^{\mu'}$ and $\theta^{Q'}$ by soft updates;
The online operation process comprises the following steps:
Step 5), selecting the SA-EDDPG model which is trained in the offline training process to perform online routing of the service function chain;
step 6) inputting the network state $s_t$ into the trained SA-EDDPG model, evaluating each action $a_t$ with the model to obtain the corresponding reward, and at the same time storing the data $(s_t, a_t, r_t, s_{t+1})$ into the experience replay pool for updating the SA-EDDPG model;
step 7) executing, on the underlying physical network, the VNF placement and traffic routing actions that obtain the highest reward.
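As an illustration of the experience replay pool used in steps 2), 3), and 6) above, the following minimal Python sketch stores transitions and samples uniform mini-batches; the class and method names are assumptions introduced for illustration.

```python
import random
from collections import deque

class ReplayPool:
    """Minimal experience replay pool: stores (s_t, a_t, r_t, s_{t+1}) transitions."""
    def __init__(self, capacity=100_000):
        # oldest transitions are discarded automatically once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # uniform random mini-batch, as in step 3) of the offline training process
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```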
6. The delay-sensitive service function chain parallel route optimization method according to claim 5, wherein, in the SA-EDDPG model,
the actor network has five layers, in order: an input layer, an attention layer, two hidden layers, and an output layer; the main actor network $\mu(s|\theta^{\mu})$ and the target actor network $\mu'(s|\theta^{\mu'})$ have the same neural network structure, wherein the input layer receives the network state vector $s_t$; the attention layer converts the state vector $s_t$ of each node into a vector $s'_t$ that takes the information of all nodes into account; both fully connected hidden layers contain 256 neurons for processing the state information; the output of the output layer is the action $a_t$, and the output layer is divided into two parts for obtaining the discrete and continuous action decisions; specifically, an $FC_1$ output layer yields the discrete VNF placement and link routing decisions, and an $FC_2$ output layer yields the continuous traffic splitting ratio; the $FC_1$ output layer obtains the corresponding action values, namely 0 or 1, through a Sigmoid activation function, while the $FC_2$ output layer directly outputs the raw continuous action value, to which noise or amplitude clipping is applied to obtain the final action value;
the critic network approximates the action-value function $Q(s, a)$ with a DNN and likewise comprises five layers, namely an input layer, an attention layer, two hidden layers, and an output layer; the main critic network $Q(s, a|\theta^{Q})$ and the target critic network $Q'(s, a|\theta^{Q'})$ have the same neural network structure, wherein the input layer receives the current network state $s_t$ and the action $a_t$ output by the actor network; the attention layer converts the network state vector $s_t$ into $s'_t$; the two hidden layers each contain 256 neurons for processing the state and action information; the output layer outputs the value of the action-value function $Q(s, a)$; the critic network uses the ReLU activation function to introduce nonlinearity.
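The actor structure described in claim 6 can be sketched as follows in PyTorch. The layer sizes, the use of nn.MultiheadAttention for the attention layer, and the head dimensions are illustrative assumptions; only the overall shape (input, attention, two 256-unit hidden layers, Sigmoid discrete head $FC_1$, raw continuous head $FC_2$) follows the description above.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Five-layer actor sketch: input, self-attention, two 256-unit hidden layers,
    and a dual-head output (discrete FC1 via Sigmoid, continuous FC2 raw)."""
    def __init__(self, num_nodes, feat_dim, n_discrete, n_continuous, hidden=256):
        super().__init__()
        # attention layer: each node's state vector attends to all nodes' vectors
        self.attn = nn.MultiheadAttention(embed_dim=feat_dim, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(num_nodes * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.fc1_discrete = nn.Linear(hidden, n_discrete)      # VNF placement / link routing head
        self.fc2_continuous = nn.Linear(hidden, n_continuous)  # traffic splitting ratio head

    def forward(self, node_states):
        # node_states: (batch, num_nodes, feat_dim) per-node state vectors s_t
        s_prime, _ = self.attn(node_states, node_states, node_states)  # s_t -> s'_t
        h = self.mlp(s_prime.flatten(start_dim=1))
        discrete = torch.sigmoid(self.fc1_discrete(h))   # in (0, 1); thresholded to 0/1 downstream
        continuous = self.fc2_continuous(h)              # raw value; noise/clipping applied downstream
        return discrete, continuous
```

A critic of the same five-layer shape would take the state together with the action as input and output a single Q-value, using ReLU activations as stated in the claim.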
7. The delay-sensitive service function chain parallel route optimization method according to claim 1, wherein the training process of the SA-EDDPG model comprises parameter updating of the four neural networks, specifically:
for the main critic network $Q(s, a|\theta^{Q})$, the parameters $\theta^{Q}$ are updated using the TD error as in the DQN algorithm; the loss function of the main critic network is calculated as follows:
where $M$ is the sample batch size and $y_i$ is the target value, which is calculated as follows:
$y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
for the main actor network $\mu(s|\theta^{\mu})$, the parameters $\theta^{\mu}$ are updated using the sampled policy gradient; the update formula of the main actor network is as follows:
for the target critic network $Q'(s, a|\theta^{Q'})$ and the target actor network $\mu'(s|\theta^{\mu'})$, the respective parameters are updated in a soft update manner, with the following formulas:
$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1-\tau)\theta^{Q'}$
$\theta^{\mu'} \leftarrow \tau\theta^{\mu} + (1-\tau)\theta^{\mu'}$
wherein the parameter $\tau < 1$.
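The critic loss and actor gradient formulas referenced above are not reproduced in the text. In standard DDPG form, consistent with the target value $y_i$ and the soft updates given here, they would read (a reconstruction, not a verbatim reproduction):

```latex
L(\theta^{Q}) = \frac{1}{M} \sum_{i=1}^{M} \big( y_i - Q(s_i, a_i \mid \theta^{Q}) \big)^2 ,
\qquad
\nabla_{\theta^{\mu}} J \approx \frac{1}{M} \sum_{i=1}^{M}
  \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_i,\, a=\mu(s_i)} \;
  \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s_i}
```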
8. The method for optimizing parallel routes of a delay-sensitive service function chain according to claim 1, wherein the training process of the SA-EDDPG model specifically comprises the following steps:
step 4-1) initializing the main critic network $Q(s, a|\theta^{Q})$, the main actor network $\mu(s|\theta^{\mu})$, the target critic network $Q'(s, a|\theta^{Q'})$, and the target actor network $\mu'(s|\theta^{\mu'})$, and initializing the experience replay pool;
Step 4-2) in each training round, the following steps are performed:
step 4-2-1) resetting the network environment, the agent acquiring the initial network state from the reset environment;
step 4-2-2) initializing a random process, and adding random noise to each output action;
Step 4-2-3) for each time slot t in the training round, the following iterations are performed:
step 4-2-3-1) converting the input network state $s_t$, via the self-attention mechanism, into $s'_t$ based on neighbor node information;
step 4-2-3-2) based on the information in $s'_t$, the agent selects an action $a_t$ according to its current policy with added random noise and executes it; executing action $a_t$ yields the reward $r_t$, and at the same time the network state transitions from $s_t$ to $s_{t+1}$;
step 4-2-3-3) storing the state transition data $(s_t, a_t, r_t, s_{t+1})$ obtained from the interaction of the agent with the environment into the experience replay pool;
step 4-2-3-4) randomly selecting $M$ samples from the experience replay pool to train the SA-EDDPG model;
step 4-2-3-5) updating the main and target networks: calculating the target value $y_i$, updating the parameters $\theta^{Q}$ of the main critic network using its loss function, updating the parameters $\theta^{\mu}$ of the main actor network using its update formula, and updating the parameters of the target critic network and the target actor network based on the soft update formulas;
step 4-3) after the preset number of training rounds is reached, the training is completed.
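The following Python sketch ties steps 4-1) to 4-3) together as a DDPG-style training loop. The env, actor, critic, and pool objects, all hyperparameter values, and the method names (act, step, sample_tensors, etc.) are assumptions introduced for illustration; they are not defined in the patent.

```python
import copy
import torch

def train(env, actor, critic, pool, episodes=500, slots_per_episode=100,
          batch_size=64, gamma=0.99, tau=0.005, noise_std=0.1):
    # step 4-1): main networks are given; target networks start as copies of them
    target_actor, target_critic = copy.deepcopy(actor), copy.deepcopy(critic)
    actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
    critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

    for _ in range(episodes):                            # step 4-2): one training round
        s = env.reset()                                  # step 4-2-1): reset, get initial state
        for _ in range(slots_per_episode):               # step 4-2-3): iterate over time slots
            a = actor.act(s)                             # includes the self-attention conversion s -> s'
            a = a + noise_std * torch.randn_like(a)      # step 4-2-2): exploration noise
            s_next, r = env.step(a)                      # step 4-2-3-2): execute, observe reward
            pool.store(s, a, r, s_next)                  # step 4-2-3-3): store transition
            if len(pool) >= batch_size:                  # step 4-2-3-4): sample M transitions
                states, actions, rewards, next_states = pool.sample_tensors(batch_size)
                with torch.no_grad():                    # target value y_i (step 4-2-3-5)
                    y = rewards + gamma * target_critic(next_states, target_actor(next_states))
                critic_loss = ((y - critic(states, actions)) ** 2).mean()
                critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
                actor_loss = -critic(states, actor(states)).mean()
                actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
                for tgt, src in ((target_critic, critic), (target_actor, actor)):
                    for tp, sp in zip(tgt.parameters(), src.parameters()):
                        tp.data.mul_(1 - tau).add_(tau * sp.data)   # soft update
            s = s_next
    return actor                                         # step 4-3): trained policy after all rounds
```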
9. A delay-sensitive service function chaining parallel route optimization device, implemented based on the method of any of claims 1-8, comprising:
The parallel routing problem construction module is used for constructing a service function chain parallel routing problem in the edge computing network based on network function virtualization and, taking real-time network state changes into consideration, modeling the service function chain parallel routing problem with a Markov decision process to obtain an MDP model;
The SA-EDDPG algorithm solving module is used for solving the MDP model with the self-attention-mechanism-based enhanced deep deterministic policy gradient algorithm, wherein the solving process comprises an offline training part and an online running part; the offline training part obtains a trained SA-EDDPG model, and the online running part determines an optimal routing scheme by running the trained SA-EDDPG model to perform service function chain parallel routing.
10. A delay-sensitive service function chaining parallel route optimization device comprising a memory, a processor and a program stored in the memory, characterized in that the processor implements the method according to any of claims 1-8 when executing the program.
11. A storage medium having a program stored thereon, wherein the program, when executed, implements the method of any of claims 1-8.
CN202310532713.1A 2023-05-10 2023-05-10 Delay-sensitive service function chain parallel route optimization method, device and medium Pending CN116566891A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310532713.1A CN116566891A (en) 2023-05-10 2023-05-10 Delay-sensitive service function chain parallel route optimization method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310532713.1A CN116566891A (en) 2023-05-10 2023-05-10 Delay-sensitive service function chain parallel route optimization method, device and medium

Publications (1)

Publication Number Publication Date
CN116566891A true CN116566891A (en) 2023-08-08

Family

ID=87496043

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310532713.1A Pending CN116566891A (en) 2023-05-10 2023-05-10 Delay-sensitive service function chain parallel route optimization method, device and medium

Country Status (1)

Country Link
CN (1) CN116566891A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117938669A (en) * 2024-03-25 2024-04-26 贵州大学 Network function chain self-adaptive arrangement method for 6G general intelligent service

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination