CN110233763B - Virtual network embedding algorithm based on temporal difference learning - Google Patents


Info

Publication number
CN110233763B
Authority
CN
China
Prior art keywords
vne
function
state
node
vnr
Prior art date
Legal status
Active
Application number
CN201910527020.7A
Other languages
Chinese (zh)
Other versions
CN110233763A (en)
Inventor
王森
张标
Current Assignee
Chongqing University
Original Assignee
Chongqing University
Priority date
Filing date
Publication date
Application filed by Chongqing University filed Critical Chongqing University
Priority to CN201910527020.7A priority Critical patent/CN110233763B/en
Publication of CN110233763A publication Critical patent/CN110233763A/en
Application granted granted Critical
Publication of CN110233763B publication Critical patent/CN110233763B/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/12 Discovery or management of network topologies
    • H04L41/14 Network analysis or design
    • H04L41/145 Network analysis or design involving simulating, designing, planning or modelling of a network

Abstract

The invention relates to a virtual network embedding algorithm based on temporal difference learning. It models the VNE problem as a Markov decision process (MDP) and builds a neural network to approximate the value function of VNE states. On this basis, an algorithm named VNE-TD, based on temporal difference (TD) learning, a reinforcement learning method, is proposed. In VNE-TD, multiple embedding candidates for node mapping are generated probabilistically, and TD learning is used to evaluate the long-term potential of each candidate. Extensive simulation results show that VNE-TD significantly outperforms previous algorithms in both blocking ratio and revenue.

Description

Virtual network embedding algorithm based on temporal difference learning
Technical Field
The invention relates to computer networks, and in particular to a virtual network embedding algorithm based on temporal difference learning.
Background
In recent years, network virtualization has received much attention from both the research community and industry, because it offers a promising path toward future networks. It is regarded as a tool for overcoming the current Internet's resistance to fundamental architectural change, and it is also a key enabler of cloud computing. The main entity of network virtualization is the virtual network (VN). As shown in Fig. 1, a VN is a combination of virtual nodes and links embedded on a substrate network (SN), where the number on a node or under a link denotes the node capacity or the link bandwidth, respectively. Virtual nodes are interconnected by virtual links mapped onto one or more SN paths. By virtualizing the node and link resources of one SN, multiple VNs with widely different characteristics can be hosted simultaneously on the same physical hardware. Given a set of virtual network requests (VNRs), each with resource requirements on both nodes and links, the problem of finding a subset of nodes and links in the SN that satisfies each VNR is known as the virtual network embedding (VNE) problem. In most settings the VNE problem must be treated as an online problem: VNRs are not known in advance; instead, they arrive at the system dynamically and may stay in the SN for a period of time. The VNE algorithm must therefore handle each VNR on arrival, rather than a batch of VNRs at once (offline VNE). In making online embedding decisions, the infrastructure provider (InP), typically the owner of the SN, aims to maximize its long-term revenue, which makes the VNE problem more challenging.
Disclosure of Invention
The technical problem to be solved by the invention is to achieve a better balance between performance and computational complexity when embedding virtual networks.
In order to achieve this purpose, the invention adopts the following technical scheme: a virtual network embedding algorithm based on temporal difference learning, comprising the following steps:
S101: establishing a VNE model
The substrate network SN is modeled as a weighted undirected graph, denoted $G_s(V_s, E_s)$, where $V_s$ is the set of substrate nodes and $E_s$ is the set of substrate links. Each substrate node $v_s \in V_s$ has computing capacity $c_{v_s}$, and each substrate link $e_s \in E_s$ has bandwidth $b_{e_s}$.
$VNR_k$ is modeled as an undirected graph, denoted $G_k(V_k, E_k)$, where $V_k$ is the set of virtual nodes and $E_k$ is the set of virtual links. Each virtual node $v_k \in V_k$ has a computing requirement $c_{v_k}$, and each virtual link $e_k \in E_k$ has a bandwidth requirement $b_{e_k}$.
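The graph model of S101 can be encoded directly. The following minimal Python sketch uses plain dicts; the node names and numbers are invented for illustration and are not taken from the patent's figures.

```python
# Illustrative encoding of the S101 model as plain Python dicts.
def make_graph(node_caps, link_bws):
    """node_caps: {node: capacity}; link_bws: {(u, v): bandwidth} (undirected)."""
    return {"nodes": dict(node_caps), "links": dict(link_bws)}

# A substrate network G_s(V_s, E_s): node computing capacities, link bandwidths.
sn = make_graph({"A": 10, "B": 8, "C": 6}, {("A", "B"): 5, ("B", "C"): 7})

# A VNR G_k(V_k, E_k): node computing requirements, link bandwidth requirements.
vnr = make_graph({"a": 3, "b": 2}, {("a", "b"): 4})
```

The same dict layout serves for both the SN and each VNR, since both are weighted undirected graphs with node and link attributes.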
S102: defining states
S102a: defining a reward function for $VNE_k$, where $VNE_k$ denotes the embedding procedure for the k-th VNR, as in equation (1):
$$R_{vn}(k) = \eta \sum_{v \in V_k} c_v + \beta \sum_{e \in E_k} b_e \qquad (1)$$
where $c_v$ is the node capacity of node $v$, $b_e$ is the link bandwidth of link $e$, $\eta$ is the unit price of computing resources, and $\beta$ is the unit price of bandwidth resources. It is therefore natural to define the immediate reward after processing $VNR_k$ as $R_{vn}(k)$, i.e., $r_k = R_{vn}(k)$.
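The immediate reward of S102a can be sketched as follows; the dict layout is illustrative, and the default unit prices are assumptions for the example.

```python
# Sketch of equation (1): the revenue (and immediate reward r_k) of VNR k is
# eta times its total node requirement plus beta times its total bandwidth
# requirement.
def rvn(vnr, eta=1.0, beta=1.0):
    return eta * sum(vnr["nodes"].values()) + beta * sum(vnr["links"].values())

vnr = {"nodes": {"a": 3, "b": 2}, "links": {("a", "b"): 4}}
print(rvn(vnr))  # 1.0*(3+2) + 1.0*4 = 9.0
```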
S102b: defining an operation set for the VNE: the operation set of the VNE is defined as the set of all possible node mappings.
S102c: defining a Markov state for the VNE:
The state $s_k$ is represented by the normalized remaining node capacities and link bandwidths of the SN, written $\hat{c}_{v_s}$ and $\hat{b}_{e_s}$. $s_k$ is an ordered set, as shown in equation (3):
$$s_k = \left( \hat{c}_{v_1}, \ldots, \hat{c}_{v_{|V_s|}}, \hat{b}_{e_1}, \ldots, \hat{b}_{e_{|E_s|}} \right) \qquad (3)$$
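Constructing this state vector amounts to normalizing each remaining resource by its initial value, in a fixed order. A minimal sketch (resource key names are illustrative):

```python
# Sketch of the Markov state of S102c: normalized remaining node capacities
# and link bandwidths of the SN in a fixed (sorted-key) order.
def vne_state(remaining, initial):
    """remaining, initial: {resource_id: value} over the same keys."""
    return [remaining[r] / initial[r] for r in sorted(initial)]

initial = {"cA": 10.0, "cB": 8.0, "bAB": 5.0}
remaining = {"cA": 7.0, "cB": 8.0, "bAB": 1.0}
print(vne_state(remaining, initial))  # [0.2, 0.7, 1.0] (key order: bAB, cA, cB)
```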
In RL, a state signal that successfully retains all relevant information is called Markov. If the state signal has the Markov property, then the response of the environment at step k+1 depends only on the state and action at step k, in which case the dynamics of the environment can be determined by specifying only:
$$\Pr\{s_{k+1} = s', r_{k+1} = r \mid s_k, a_k\} \qquad (5)$$
S103: modeling the VNE as a Markov decision process (MDP).
S103a: defining the policy and value function: the policy of the VNE agent is a mapping from each state $s$ and action $a$ to the probability of taking action $a$ in state $s$. Given a policy $\pi$, the value function of the VNE is a function of the VNE state, denoted $V^{\pi}(s), s \in S$. $V^{\pi}(s)$ can be viewed as the potential to accommodate future VNRs and generate long-term revenue; it is defined as equation (8):
$$V^{\pi}(s) = E_{\pi}\{R_k \mid s_k = s\} = E_{\pi}\left\{\sum_{i=0}^{\infty} \gamma^{i} r_{k+i+1} \,\Big|\, s_k = s\right\} \qquad (8)$$
where $R_k$ is the discounted sum of all rewards from $VNR_k$ onward and $\gamma$ is the discount rate that determines the present value of future rewards.
S103b: defining the optimal value function:
The objective of studying the VNE problem from the RL point of view is to find an optimal policy that yields the maximum return in the long run. Let $\pi^*$ be an optimal policy: $\pi^* \geq \pi$ for any policy $\pi$ if and only if, for all $s \in S$,
$$V^{\pi^*}(s) \geq V^{\pi}(s)$$
The optimal value function is defined as
$$V^*(s) = \max_{\pi} V^{\pi}(s)$$
For the optimal value function $V^*(s)$, the following recursive (Bellman optimality) expression holds:
$$V^*(s) = \max_{a} E\left\{ r_{k+1} + \gamma V^*(s_{k+1}) \mid s_k = s, a_k = a \right\}$$
S104: approximating the optimal value function $V^*(s)$, the value function under the optimal policy, using a neural network:
A standard feedforward neural network with two fully connected (fc) layers is used to approximate $V^*(s)$. Layers fc1 and fc2 have the same number of nodes, denoted H, and use the rectifier as the activation function. The input of the neural network is the state $s$ of equation (3); the network takes the state $s$ as input and outputs a value $V(s)$ that is expected to approximate $V^*(s)$.
Supervised learning of the approximation function $V(s)$ adjusts the neural network parameters $\vec{\theta}$ so as to minimize the difference between $V(s)$ and $V^*(s)$, which can be expressed as:
$$\min_{\vec{\theta}} \left[ V^*(s) - V(s, \vec{\theta}) \right]^2$$
As the RL process proceeds, $V^*(s_k)$ can be regarded as a training sample for the approximation function $V(s)$ learned in parallel by supervised learning. Following the gradient descent method, for $VNR_k$ the parameters $\vec{\theta}$ are updated as:
$$\vec{\theta} \leftarrow \vec{\theta} + \alpha \left[ V^*(s_k) - V(s_k, \vec{\theta}) \right] \nabla_{\vec{\theta}} V(s_k, \vec{\theta})$$
where $\alpha$ is a positive step-size parameter controlling the learning speed.
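As an illustration of S104, the following NumPy sketch builds a two-fully-connected-layer rectifier network that outputs a scalar V(s) and applies the gradient update above toward a target sample. The layer sizes, initialization, and learning rate are assumptions for the example, not values from the patent.

```python
import numpy as np

# Illustrative two-fc-layer value network for S104: input is the state
# vector s, output is the scalar V(s). All sizes and constants are assumed.
rng = np.random.default_rng(0)
n_in, H = 3, 8
W1, b1 = rng.normal(0.0, 0.1, (H, n_in)), np.zeros(H)   # fc1
W2, b2 = rng.normal(0.0, 0.1, (H, H)), np.zeros(H)      # fc2
w3, b3 = rng.normal(0.0, 0.1, H), 0.0                   # linear output head

def forward(s):
    h1 = np.maximum(0.0, W1 @ s + b1)    # fc1 with rectifier activation
    h2 = np.maximum(0.0, W2 @ h1 + b2)   # fc2 with rectifier activation
    return h1, h2, float(w3 @ h2 + b3)   # V(s)

def train_step(s, target, alpha=0.01):
    """theta <- theta + alpha * (target - V(s)) * grad_theta V(s)."""
    global W1, b1, W2, b2, w3, b3
    h1, h2, v = forward(s)
    delta = target - v                   # error toward the V*(s_k) sample
    g2 = w3 * (h2 > 0)                   # gradient at fc2 pre-activation
    g1 = (W2.T @ g2) * (h1 > 0)          # gradient at fc1 pre-activation
    w3 = w3 + alpha * delta * h2
    b3 = b3 + alpha * delta
    W2 = W2 + alpha * delta * np.outer(g2, h1)
    b2 = b2 + alpha * delta * g2
    W1 = W1 + alpha * delta * np.outer(g1, s)
    b1 = b1 + alpha * delta * g1
    return v

s = np.array([0.2, 0.7, 1.0])
v0 = forward(s)[2]
for _ in range(200):
    train_step(s, 1.0)
v1 = forward(s)[2]  # V(s) has moved toward the target 1.0
```

Repeated steps drive V(s) toward the target, mirroring how samples of the optimal value function train the approximator in parallel with the RL process.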
S105: in VNE, given a VNR, the possible operations and the corresponding next states are known; hence $P^{a}_{ss'}$ and $R^{a}_{ss'}$ are deterministic and known. The possible node-mapping matchings are traversed and used as the operation set, and the set of resulting states obtained by simulated embedding of these operations is fed as input to the neural network of S104, yielding a value of the optimal value function for each candidate. Since the optimal policy $\pi(s)$ can be expressed as:
$$\pi(s) = \arg\max_{a} \left[ r + \gamma V^*(s') \right]$$
the candidate with the maximum value satisfies the optimal policy.
S106: the matching corresponding to the optimal value function with the maximum value is selected and the VNR is actually embedded accordingly; then the shortest path between the two SN nodes with sufficient bandwidth is found to match each VN link.
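The selection in S105 and S106 reduces to scoring each candidate's simulated resulting state with the value function and taking the maximum. A toy sketch, where the value function and states are simple stand-ins:

```python
# Sketch of the S105/S106 selection: simulate embedding each node-mapping
# candidate, score the resulting SN state with the value function, and keep
# the candidate with the largest value. value_fn stands in for the S104 net.
def choose_candidate(candidates, simulate_embed, value_fn):
    return max(candidates, key=lambda m: value_fn(simulate_embed(m)))

# Toy stand-ins: resulting states are resource vectors, value = their sum.
result_states = {"m1": [0.2, 0.8], "m2": [0.5, 0.7], "m3": [0.1, 0.4]}
best = choose_candidate(list(result_states), result_states.get, sum)
print(best)  # "m2", whose resulting state has the largest value, 1.2
```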
As an improvement, in S105, when traversing the node-mapping matchings, the following reduction is performed first: a probabilistic method is used to generate multiple node-mapping candidates, using the metric RW and a uniform value, so that candidates are drawn with RW-based and uniform selection probabilities.
Compared with the prior art, the invention has at least the following advantages:
1. Using a neural network to approximate the value function of VNE states helps generalize from previously experienced states to never-seen states, which matters for VNE problems with large state spaces.
2. Based on temporal difference learning, passive load balancing is abandoned; the contradiction between online embedding decisions and the long-term objective is overcome through active learning and online decision making based on past experience, solving the resource-allocation problem more effectively and improving resource utilization.
Drawings
Fig. 1 is an example of the VNE problem.
Fig. 2 is an example topology.
Fig. 3 shows an example of the VNE problem.
Fig. 4 shows the embedding results of the example.
Fig. 5 illustrates the VNE process in RL terms.
Fig. 6 shows the neural network approximating the optimal value function.
Fig. 7(a) is a graph of the blocking ratio versus the parameter d for different algorithms, and fig. 7(b) is a graph of the benefit per second versus the parameter d for different algorithms.
Fig. 8(a) is a graph of blocking ratio versus time for different algorithms, fig. 8(b) is a graph of revenue per second versus time for different algorithms, and fig. 8(c) is a graph of WAPL versus time for different algorithms.
Fig. 9 is a graph of loss versus the number of training iterations.
Fig. 10(a) is a graph of blocking ratio versus workload for different algorithms, fig. 10(b) is a graph of revenue per second versus workload for different algorithms, and fig. 10(c) is a graph of WAPL versus workload for different algorithms.
Fig. 11(a) is a graph showing an influence of the blocking ratio on the number of node mapping candidates, and fig. 11(b) is a graph showing an influence of the profit per second on the number of node mapping candidates.
Fig. 12(a) is a graph of blocking ratio versus VNRs link connectivity for different algorithms, and fig. 12(b) is a graph of revenue per second versus VNRs link connectivity for different algorithms.
Detailed Description
The present invention is described in further detail below.
The main challenge of the VNE problem is the contradiction between online decision making and the pursuit of long-term goals. Prior work attempts to overcome this challenge by balancing the SN workload, in the hope of accommodating more future VNRs. The problem is that a node's connectivity is coupled with that of other nodes: consuming connectivity capability at one node does not necessarily reduce only its own capability. In Fig. 3, a SN is shown together with a VNR that needs to be embedded into it. Take a node-level metric from the prior art (named GRC) as an example. With the parameter d set to 0.85, the GRC values of the SN nodes are shown as "Origin" in Fig. 4. To balance the SN workload, GRC-VNE selects the two nodes with the strongest connectivity capability as measured by GRC, namely node B and node G, to match the two nodes of the VNR (node a and node b). The remaining GRC values are shown in Fig. 4 as "After VNR embedded by GRC-VNE", and the variance of these values is 0.0032. In contrast, the VNE-TD algorithm proposed by the invention selects node B and node C. The remaining GRC values are shown in Fig. 4 as "After VNR embedded by VNE-TD", and their variance is 0.0016. This shows that the basic assumption of prior work on balancing SN workloads is problematic: it brings neither a more balanced workload nor more remaining resources.
A virtual network embedding algorithm based on temporal difference learning comprises the following steps:
S101: establishing a VNE model
The substrate network SN is modeled as a weighted undirected graph, denoted $G_s(V_s, E_s)$, where $V_s$ is the set of substrate nodes and $E_s$ is the set of substrate links. Each substrate node $v_s \in V_s$ has computing capacity $c_{v_s}$ (e.g., CPU cycles), and each substrate link $e_s \in E_s$ has bandwidth $b_{e_s}$. An example of a SN is given at the bottom of Fig. 1; the numbers around the nodes and links are their available resources.
$VNR_k$ is modeled as an undirected graph, denoted $G_k(V_k, E_k)$, where $V_k$ is the set of virtual nodes and $E_k$ is the set of virtual links. Each virtual node $v_k \in V_k$ has a computing requirement $c_{v_k}$, and each virtual link $e_k \in E_k$ has a bandwidth requirement $b_{e_k}$.
An example of a VNR is given at the top of Fig. 1. For $VNR_k$, $t_k$ is its arrival time and $l_k$ is its lifetime.
S102: defining states
S102a: defining a reward function for $VNE_k$, where $VNE_k$ denotes the embedding procedure for the k-th VNR, as in equation (1):
$$R_{vn}(k) = \eta \sum_{v \in V_k} c_v + \beta \sum_{e \in E_k} b_e \qquad (1)$$
where $c_v$ is the node capacity of node $v$, $b_e$ is the link bandwidth of link $e$, $\eta$ is the unit price of computing resources, and $\beta$ is the unit price of bandwidth resources.
The goal is to maximize the long-term time-averaged revenue of the InP, as follows:
$$\max \lim_{T \to \infty} \frac{\sum_{k \in K_T} R_{vn}(k)}{T} \qquad (2)$$
where $K_T = \{k \mid 0 < t_k < T\}$ denotes the set of VNRs arriving before time instance T.
The reward function is intended to provide an immediate measure of how good a certain action is in a given state. As equation (2) shows, the objective of the VNE problem is to maximize the long-term time-averaged revenue of the InP. It is therefore natural to define the immediate reward after processing $VNR_k$ as $R_{vn}(k)$, i.e., $r_k = R_{vn}(k)$.
S102b: defining an operation set for the VNE: the operation set of the VNE is defined as the set of all possible node mappings.
S102c: defining a Markov state for the VNE:
The state $s_k$ is represented by the normalized remaining node capacities and link bandwidths of the SN, written $\hat{c}_{v_s}$ and $\hat{b}_{e_s}$. $s_k$ is an ordered set, as shown in equation (3):
$$s_k = \left( \hat{c}_{v_1}, \ldots, \hat{c}_{v_{|V_s|}}, \hat{b}_{e_1}, \ldots, \hat{b}_{e_{|E_s|}} \right) \qquad (3)$$
In RL, a state signal that successfully retains all relevant information is called Markov. For a Markov state, all that matters is the current state signal; its meaning is independent of the path or history leading to it. In the most general causal setting, the response of the environment may depend on everything that has occurred before, and in most RL problems the transition function is probabilistic. In that case the dynamics can only be represented by specifying the complete probability distribution:
$$\Pr\{s_{k+1} = s', r_{k+1} = r \mid s_k, a_k, r_k, s_{k-1}, a_{k-1}, \ldots, r_1, s_0, a_0\} \qquad (4)$$
If, on the other hand, the state signal has the Markov property, then the response of the environment at step k+1 depends only on the state and action at step k, in which case the dynamics can be determined by specifying only:
$$\Pr\{s_{k+1} = s', r_{k+1} = r \mid s_k, a_k\} \qquad (5)$$
S103: modeling the VNE as a Markov decision process (MDP).
S103a: defining the policy and value function:
A reinforcement learning task satisfying the Markov property is called a Markov decision process. Since the VNE state defined above is a Markov state, the decision process of the VNE problem can be modeled exactly as an MDP.
In an MDP, given any state $s$ and action $a$, the probability of each possible next state $s'$ is expressed as:
$$P^{a}_{ss'} = \Pr\{s_{k+1} = s' \mid s_k = s, a_k = a\} \qquad (6)$$
These quantities are called transition probabilities. Likewise, the expected value of the next reward is written as:
$$R^{a}_{ss'} = E\{r_{k+1} \mid s_k = s, a_k = a, s_{k+1} = s'\} \qquad (7)$$
From the RL perspective, the objective of the VNE is to find a policy whose action choice is optimal at any time and in any state. The policy of the VNE agent is a mapping from each state $s$ and action $a$ to the probability of taking action $a$ in state $s$; the policy and the corresponding probability are denoted $\pi$ and $\pi(s, a)$. Almost all reinforcement learning algorithms estimate how good it is for the agent to be in a given state by estimating a value function of the state. Given a policy $\pi$, the value function of the VNE is a function of the VNE state, denoted $V^{\pi}(s), s \in S$. $V^{\pi}(s)$ can be viewed as the potential to accommodate future VNRs and generate long-term revenue, formally defined as equation (8):
$$V^{\pi}(s) = E_{\pi}\{R_k \mid s_k = s\} = E_{\pi}\left\{\sum_{i=0}^{\infty} \gamma^{i} r_{k+i+1} \,\Big|\, s_k = s\right\} \qquad (8)$$
where $R_k$ is the discounted sum of all rewards from $VNR_k$ onward and $\gamma$ is the discount rate that determines the present value of future rewards.
S103b: defining the optimal value function:
The objective of studying the VNE problem from the RL point of view is to find an optimal policy that yields the maximum return in the long run. Let $\pi^*$ be an optimal policy: $\pi^* \geq \pi$ for any policy $\pi$ if and only if, for all $s \in S$,
$$V^{\pi^*}(s) \geq V^{\pi}(s)$$
The optimal value function is defined as
$$V^*(s) = \max_{\pi} V^{\pi}(s)$$
For the optimal value function $V^*(s)$, the following recursive (Bellman optimality) expression holds:
$$V^*(s) = \max_{a} E\left\{ r_{k+1} + \gamma V^*(s_{k+1}) \mid s_k = s, a_k = a \right\}$$
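Because VNE transitions are deterministic (the next SN state after an embedding is known), the recursive expression for $V^*(s)$ can be iterated to a fixed point on a small example. The tiny two-state MDP below is invented purely for illustration.

```python
# Toy illustration of the recursive optimality expression for V*(s): with
# deterministic transitions (as in VNE), iterate
#   V(s) <- max_a [ r(s, a) + gamma * V(s') ]
# to a fixed point.
gamma = 0.9
# transitions[s][a] = (reward, next_state); "T" is absorbing.
transitions = {
    "s0": {"accept": (5.0, "s1"), "reject": (0.0, "T")},
    "s1": {"accept": (3.0, "T"), "reject": (0.0, "T")},
    "T": {},
}
V = {s: 0.0 for s in transitions}
for _ in range(100):  # sweep until the fixed point is reached
    for s, acts in transitions.items():
        if acts:
            V[s] = max(r + gamma * V[ns] for r, ns in acts.values())
print(V["s1"], V["s0"])  # 3.0 and 5.0 + 0.9*3.0 = 7.7
```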
S104: solving $V(s)$ with the neural network so that $V(s)$ approximates the optimal value function $V^*(s)$:
A standard feedforward neural network with two fully connected (fc) layers is used to approximate $V^*(s)$, as shown in Fig. 6. Layers fc1 and fc2 have the same number of nodes, denoted H, and use the rectifier as the activation function. The input of the neural network is the state $s$ of equation (3); the network takes the state $s$ as input and outputs a value $V(s)$ that is expected to approximate $V^*(s)$.
Supervised learning of the approximation function $V(s)$ adjusts the neural network parameters $\vec{\theta}$ so as to minimize the difference between $V(s)$ and $V^*(s)$, which can be expressed as:
$$\min_{\vec{\theta}} \left[ V^*(s) - V(s, \vec{\theta}) \right]^2$$
As the RL process proceeds, $V^*(s_k)$ can be regarded as a training sample for the approximation function $V(s)$ learned in parallel by supervised learning. Following the gradient descent method, for $VNR_k$ the parameters $\vec{\theta}$ are updated as:
$$\vec{\theta} \leftarrow \vec{\theta} + \alpha \left[ V^*(s_k) - V(s_k, \vec{\theta}) \right] \nabla_{\vec{\theta}} V(s_k, \vec{\theta})$$
where $\alpha$ is a positive step-size parameter controlling the learning speed.
S105: in VNE, given a VNR, the possible operations and the corresponding next states are known; hence $P^{a}_{ss'}$ and $R^{a}_{ss'}$ are deterministic and known. The possible node-mapping matchings are traversed and used as the operation set, and the set of resulting states obtained by simulated embedding of these operations is fed as input to the neural network of S104, yielding a value of the optimal value function for each candidate. Since the optimal policy $\pi(s)$ can be expressed as:
$$\pi(s) = \arg\max_{a} \left[ r + \gamma V^*(s') \right]$$
the candidate with the maximum value satisfies the optimal policy.
S106: the node mapping corresponding to the value function with the maximum value is selected and the VNR is actually embedded accordingly; then the shortest path between the two SN nodes with sufficient bandwidth is found to match each VN link.
The invention uses an RL method, temporal difference (TD) learning, to update the estimate of the optimal value function and to make embedding decisions based on that estimate. Specifically, TD learning updates its estimate of $V^*(s)$ as follows:
$$V(s_k) \leftarrow V(s_k) + \alpha \left[ r_{k+1} + \gamma V(s_{k+1}) - V(s_k) \right] \qquad (11)$$
As mentioned above, $V^*(s)$ is approximated by a neural network; combining equation (11) with the TD algorithm, the parameter update becomes:
$$\vec{\theta} \leftarrow \vec{\theta} + \alpha \left[ r_{k+1} + \gamma V(s_{k+1}) - V(s_k) \right] \nabla_{\vec{\theta}} V(s_k, \vec{\theta}) \qquad (12)$$
According to the above update rules, $V^*(s)$ and $V(s)$ are estimated by TD learning and supervised learning, respectively, and the two processes proceed simultaneously.
The algorithm VNE-TD is a function that makes the embedding decision when a VNR arrives. As in algorithm VNE-TD, the input states of the neural network are the resulting states of the simulated embedding of each node-mapping candidate, and the candidate with the largest value is selected to actually embed the VNR. After the node mapping is established, the shortest path between the two SN nodes with sufficient bandwidth is found to match each VN link. If splittable flows are allowed, the same multi-commodity flow algorithm as in [12] is used to map the virtual links. According to expression (12), the matching j that maximizes $r + \gamma V(s_j)$ should be selected; since the reward $r = R_{vn}(VNR)$ is the same for all candidates, the matching j that maximizes $V(s_j)$ can be chosen. When the lifetime of a VNR ends, it leaves the SN and the resources allocated to it are released, as described earlier, so the state of the SN changes. However, the parameters of the neural network are not updated upon VNR departures.
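The TD update quoted above can be sketched in its simplest (tabular) form; the dict-based table and the constants are illustrative stand-ins for the neural approximator.

```python
# Sketch of the TD update used by VNE-TD: after handling a VNR with reward r
# and moving the SN from state s to s_next, nudge the estimate by the TD error.
def td_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    V[s] += alpha * (r + gamma * V.get(s_next, 0.0) - V[s])
    return V[s]

V = {"s": 0.0, "s_next": 2.0}
td_update(V, "s", 1.0, "s_next")  # TD error: 1.0 + 0.9*2.0 - 0.0 = 2.8
print(V["s"])  # 0.1 * 2.8 = 0.28
```

In the invention the table lookup is replaced by the neural network, and the TD error multiplies the gradient of the network output, as in expression (12).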
As an improvement, in S105, when determining the optimal value function with the maximum value, the set of possible operations may be too large to traverse. The operation set is therefore first reduced as follows: a probabilistic method is used to generate multiple node-mapping candidates, using the metric RW and a uniform value, so that candidates are drawn with RW-based and uniform selection probabilities.
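One plausible reading of this reduction step is a roulette-wheel draw of candidates. The sketch below selects each substrate node with probability proportional to a per-node score; the scores are illustrative inputs standing in for the patent's RW metric, whose exact formula is not reproduced here.

```python
import random

# Sketch of probabilistic candidate generation: draw node-mapping candidates
# instead of enumerating all of them, picking each substrate node with
# probability proportional to its score (a stand-in for the RW metric).
def draw_mapping(virtual_nodes, sn_scores, rng):
    chosen, mapping = set(), {}
    for v in virtual_nodes:
        pool = [(n, w) for n, w in sn_scores.items() if n not in chosen]
        nodes, weights = zip(*pool)
        picked = rng.choices(nodes, weights=weights)[0]  # roulette-wheel draw
        chosen.add(picked)  # keep the node mapping one-to-one
        mapping[v] = picked
    return mapping

rng = random.Random(42)
m = draw_mapping(["a", "b"], {"A": 0.5, "B": 0.3, "C": 0.2}, rng)
# m maps "a" and "b" to two distinct substrate nodes.
```

Calling `draw_mapping` several times yields the multiple candidates that S105 then scores with the value function.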
The method of the invention is described in detail as follows:
TABLE 1 symbols and notations used in the present invention
1.1 VNE model
The SN is modeled as a weighted undirected graph, denoted $G_s(V_s, E_s)$, where $V_s$ is the set of substrate nodes and $E_s$ is the set of substrate links. Each substrate node $v_s \in V_s$ has computing capacity $c_{v_s}$ (e.g., CPU cycles), and each substrate link $e_s \in E_s$ has bandwidth $b_{e_s}$. An example of a SN is given at the bottom of Fig. 1; the numbers around the nodes and links are their available resources.
1.1.1 Virtual network request
A $VNR_k$ can also be modeled as an undirected graph, denoted $G_k(V_k, E_k)$, where $V_k$ is the set of virtual nodes and $E_k$ is the set of virtual links. Each virtual node $v_k \in V_k$ has a computing requirement $c_{v_k}$, and each virtual link $e_k \in E_k$ has a bandwidth requirement $b_{e_k}$. An example of a VNR is given at the top of Fig. 1. For $VNR_k$, $t_k$ is its arrival time and $l_k$ is its lifetime.
2.2 VNE process
For $VNR_k$, the VNE flow consists of two key components: node mapping and link mapping.
2.2.1 Node mapping
The node mapping can be described as a one-to-one mapping $M_N: V_k \to V_s$. For $M_N(v_k) = v_s$, with $v_k \in V_k$ and $v_s \in V_s$, the following two conditions must be satisfied:
(1) if $u_k \neq v_k$, then $M_N(u_k) \neq M_N(v_k)$;
(2) $c_{v_k} \leq c_{M_N(v_k)}$.
The first constraint ensures that any two nodes of the VNR map to two different nodes of the SN, and the second requires that each VN node maps to a SN node with sufficient node capacity.
2.2.2 Link mapping
In the link-mapping phase, for each virtual link of the VNR, a set of paths must be found between the two mapped nodes in the SN whose total available bandwidth is at least the requirement of the virtual link. The invention considers only single-path mapping, i.e., one virtual link can map to only one SN path. In the single-path case, the link mapping can be represented as $M_L: E_k \to \mathcal{P}_s$, where $\mathcal{P}_s$ is the set of all paths in $G_s$. For $M_L(e_k) = p_s \in \mathcal{P}_s$, the following condition must be satisfied:
$$b_{e_k} \leq \min_{e_s \in p_s} b_{e_s}$$
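The single-path condition above (every substrate link on the chosen path must carry at least the required bandwidth) can be checked while searching for the shortest feasible path by hop count. A BFS sketch over an illustrative link dict:

```python
from collections import deque

# Sketch of single-path link mapping: a virtual link may map to an SN path
# only if every substrate link on the path has at least the required
# bandwidth. Shortest feasible path (by hop count) via BFS.
def shortest_feasible_path(links, src, dst, bw_req):
    """links: {(u, v): bandwidth}, undirected; returns a node list or None."""
    adj = {}
    for (u, v), bw in links.items():
        if bw >= bw_req:  # keep only sufficiently wide links
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
    prev, queue = {src: None}, deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:  # walk predecessors back to src
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for w in adj.get(u, []):
            if w not in prev:
                prev[w] = u
                queue.append(w)
    return None

links = {("A", "B"): 5, ("B", "C"): 2, ("A", "C"): 1}
print(shortest_feasible_path(links, "A", "C", 2))  # ['A', 'B', 'C']
print(shortest_feasible_path(links, "A", "C", 3))  # None: no wide-enough path
```

Filtering the links before the search guarantees that any returned path satisfies the bandwidth condition by construction.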
the VNE problem must be handled as an online problem. VNRs arrive dynamically to the system, and the VNE algorithm must handle VNRs as they arrive.
2.3 VNE revenue model and objective
Similar to prior VNE revenue models, the revenue generated for the InP is represented by:
$$R_{vn}(k) = \eta \sum_{v \in V_k} c_v + \beta \sum_{e \in E_k} b_e \qquad (1)$$
where $\eta$ and $\beta$ represent the unit prices of computing resources and bandwidth resources, respectively.
The goal is to maximize the long-term time-averaged revenue of the InP, as follows:
$$\max \lim_{T \to \infty} \frac{\sum_{k \in K_T} R_{vn}(k)}{T} \qquad (2)$$
where $K_T = \{k \mid 0 < t_k < T\}$ denotes the set of VNRs arriving before time instance T.
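Over a finite horizon, the time-averaged objective can be computed directly by summing the revenue of VNRs in $K_T$ and dividing by T. A small sketch with invented arrival data:

```python
# Sketch of objective (2): the long-term time-averaged revenue, approximated
# over a finite horizon T by summing R_vn(k) over VNRs with 0 < t_k < T.
def time_averaged_revenue(arrivals, T):
    """arrivals: list of (t_k, revenue_k) pairs for accepted VNRs."""
    return sum(r for t, r in arrivals if 0 < t < T) / T

arrivals = [(1.0, 9.0), (4.0, 6.0), (12.0, 8.0)]
print(time_averaged_revenue(arrivals, 10.0))  # (9.0 + 6.0) / 10.0 = 1.5
```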
3. Fitting the VNE into the RL model
RL concerns how a learning agent maps situations to actions so as to maximize a numerical reward signal. As shown in Fig. 5, the agent is the subject of learning and the environment is the object being learned about. The agent can perform operations; performing an operation may leave the agent in the current state or cause a transition to another state in the state space. The transition function may be probabilistic or deterministic. The environment generates a reward for the agent as a result of the agent's action. Typically, the value of the reward is calculated by a predetermined reward function, which is used to steer the reinforcement process of the agent.
The purpose of the reward function is to provide an immediate measure of how good a certain action is in a particular state. The reward of each action depends on whether the new state is better than the current one. Over time, the agent attempts to learn the best operation to perform in each particular state, i.e., the operation that maximizes the overall long-term return. RL therefore involves a function that indicates what is best in the long run by accumulating the relevant immediate rewards over some horizon.
As explained in more detail below, on the one hand the objective of the VNE problem is to maximize the long-term time-averaged revenue of the InP; on the other hand, an embedding decision must be made immediately when a VNR appears, based on the present situation and past experience. Having both a long-term goal and online decisions, the nature of the VNE problem provides a good setting for RL. Fig. 5 shows how the VNE problem fits into the RL model: from the RL point of view, the SN and the ever-arriving VNRs together constitute the environment, and processing one VNR forms one RL cycle. For $VNR_{k+1}$, the VNE agent, depending on the current state $s_k$ and previous experience (which may include all previous states and rewards), gives an embedding action $a_k$. After action $a_k$, the environment gives the resulting state $s_{k+1}$ and a reward $r_{k+1}$.
3.2 Defining a reward function for the VNE
As previously mentioned, the reward function is intended to provide an immediate measure of how good a certain behavior is in a given state. As equation (2) shows, the objective of the VNE problem is to maximize the long-term time-averaged revenue of the InP. It is therefore natural to define the immediate reward after processing $VNR_k$ as $R_{vn}(k)$, i.e., $r_k = R_{vn}(k)$.
Clearly, such a reward function can easily be adapted to other VNE objectives, which means that solving the VNE problem with RL is very flexible. For example, if the objective of the VNE is to minimize the blocking ratio, the reward can be set to 1 if the VNR is successfully embedded and 0 otherwise.
3.3 defining an operation set and a Markov State for the VNE
How states and behaviors are defined is the key to the performance of the RL. In the present invention, the set of operations of the VNE is defined as the set of all possible node mappings. If the embedding is unsuccessful according to the action of the node mapping, the VNR will be blocked and no action is done on the SN.
In the VNE problem, we know the current VNR but not the next one. Therefore, before the next VNR arrives, if the VNR state representing the environment is included, the next state of the environment cannot be determined. Thus, while the environment of the VNE problem includes a SN and multiple VNRs as shown in fig. 5, we use only the state of the SN to represent the environment.
We use the normalized remaining node capacity and link bandwidth of the SN to represent state skIn the form of
Figure BDA0002098520450000121
And
Figure BDA0002098520450000122
skis an ordered set, as follows:
Figure BDA0002098520450000131
in the RL, the state signal that successfully retains all relevant information is referred to as Markov.
For markov states, all that is important is the current state signal; its meaning is independent of the path or history to it. More specifically, in the most common cause and effect relationships, the response of the environment may depend on everything that has occurred before. In most RL problems, the transfer function is a probability function. In this case, the dynamics can only be represented by specifying a complete probability distribution:
Pr{s_{k+1}=s', r_{k+1}=r | s_k, a_k, r_k, s_{k-1}, a_{k-1}, ..., r_1, s_0, a_0} (4)
On the other hand, if the state signal has the Markov property, then the response of the environment at step k+1 depends only on the state and action at step k. In this case the dynamics of the environment can be determined by specifying only:

Pr{s_{k+1}=s', r_{k+1}=r | s_k, a_k} (5)
3.4 Modeling the VNE as a Markov decision process
A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP). Since the VNE state defined by the present invention is a Markov state, the decision process of the VNE problem can be modeled exactly as an MDP.
In MDP, given an arbitrary state s and action a, the probability of each possible next state s' is expressed as:
Figure BDA0002098520450000132
these quantities are called transition probabilities. Likewise, the expected value of the next prize is noted as:
Figure BDA0002098520450000133
From the RL point of view, the goal of the VNE is to find an optimal policy that selects the best action at any time and in any state.
Definition: the policy of the VNE agent is a mapping from each state s and action a to the probability of taking action a in state s. We denote the policy and the corresponding probability as π and π(s, a).
Definition: given a policy π, the value function of the VNE is a function of the VNE state, denoted V^π(s), s ∈ S. V^π(s) can be viewed as the potential to accommodate future VNRs and generate long-term revenue. It is formally defined as:

V^π(s) = E_π{R_k | s_k = s} = E_π{ Σ_{i=0..∞} γ^i r_{k+i+1} | s_k = s } (8)

where R_k is the discounted sum of all rewards from VNR k onward and γ is the discount rate that determines the present value of future rewards.
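The return R_k that the value function estimates can be illustrated with a short helper (a generic discounted sum, not patent-specific code):

```python
def discounted_return(rewards, gamma=1.0):
    """R_k = r_{k+1} + gamma*r_{k+2} + gamma^2*r_{k+3} + ... :
    the discounted sum of rewards that V_pi(s) estimates in
    expectation (eq. (8))."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```

With γ = 1 (the setting later used in the evaluation), the return is simply the plain sum of all future rewards.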
The objective of studying the VNE problem from the RL point of view is to find an optimal policy π* that yields the maximum return in the long term.
Definition: π* is an optimal policy if and only if, for any policy π, π* ≥ π, meaning that for all s ∈ S:

V^{π*}(s) ≥ V^π(s)
Definition: the optimal value function is defined as

V*(s) = max_π V^π(s)
Proposition: for the optimal value function V*(s), we have the following iterative expression:

V*(s) = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')] (9)

Proof: V*(s) = max_a E{r_{k+1} + γ V*(s_{k+1}) | s_k = s, a_k = a} = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')].
Equation (9) relates the optimal value of the current state to the optimal values of the possible next states; given the optimal value function, it shows how to obtain the optimal action.
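Equation (9) can be exercised as a one-step backup on a toy MDP; the transition and reward tables below are purely illustrative:

```python
def bellman_backup(V, P, R, gamma=1.0):
    """One application of eq. (9):
    V*(s) = max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V[s'])."""
    return {
        s: max(
            sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in P[s][a].items())
            for a in P[s]
        )
        for s in P
    }

# Toy MDP: from s0, action "embed" earns reward 2 and reaches s1,
# which is absorbing with zero reward.
P = {"s0": {"embed": {"s1": 1.0}}, "s1": {"stay": {"s1": 1.0}}}
R = {"s0": {"embed": {"s1": 2.0}}, "s1": {"stay": {"s1": 0.0}}}
V = bellman_backup({"s0": 0.0, "s1": 0.0}, P, R, gamma=0.9)
```

Iterating this backup to a fixed point is classical value iteration; the patent instead approximates V* with a neural network, as described next.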
3.5 Approximating the optimal value function
In the present invention, we approximate the optimal value function V*(s) using a standard feedforward neural network with two fully-connected (fc) layers, as shown in FIG. 6. fc1 and fc2 have the same number of nodes, denoted H. A rectifier is used as the activation function, which is probably the most common activation function for deep neural networks as of 2018. The input to the neural network is the state s of equation (3); the network outputs a value V(s) that is expected to approximate V*(s).
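A pure-Python sketch of such a two-fc-layer value network follows; the final linear readout to a scalar is an assumption, since the text only specifies the two hidden layers (the actual implementation used TensorFlow):

```python
def relu(xs):
    # Rectifier activation, applied elementwise.
    return [max(0.0, x) for x in xs]

def dense(x, W, b):
    # Fully-connected layer: y_i = sum_j W[i][j] * x[j] + b[i]
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def value_net(s, params):
    """V(s; theta): state -> fc1 (H units, ReLU) -> fc2 (H units, ReLU)
    -> linear readout to a scalar value."""
    W1, b1, W2, b2, w_out, b_out = params
    h1 = relu(dense(s, W1, b1))
    h2 = relu(dense(h1, W2, b2))
    return sum(w * h for w, h in zip(w_out, h2)) + b_out
```

In the patent's setup the input size equals |V_s| + |E_s| and H is set to 300 (see the simulation setup below).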
Supervised learning of the approximation function V(s) is the process of adjusting the neural-network parameters θ. The objective is to minimize the difference between V(s) and V*(s), which can be expressed as:

min_θ [V*(s) − V(s; θ)]^2 (10)
As the RL process proceeds, V*(s_k) can be regarded as a training sample for the parallel supervised learning of the approximation function V(s). Following gradient descent, for VNR k the parameters θ are updated as:

θ_{k+1} = θ_k + α [V*(s_k) − V(s_k)] ∇_θ V(s_k) (11)
where α is a positive step size parameter that controls the learning rate.
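The update of equation (11) can be illustrated for the simplest case of a linear value function V(s) = θ·s, for which ∇_θ V(s) = s (a deliberate simplification of the neural-network case):

```python
def sgd_step(theta, s, target, alpha=0.01):
    """One step of eq. (11) for a linear V(s) = theta . s:
    theta <- theta + alpha * (target - V(s)) * grad V(s),
    with grad_theta V(s) = s for the linear case."""
    v = sum(t * x for t, x in zip(theta, s))
    err = target - v          # V*(s_k) - V(s_k)
    return [t + alpha * err * x for t, x in zip(theta, s)]
```

For the actual network, the same rule applies with the gradient obtained by backpropagation.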
3.6 Solving the VNE problem with TD learning
During learning, V*(s) is computed approximately by the neural network. In VNE, given a VNR, we know the possible actions and the corresponding next states. Therefore, P^a_{ss'} and R^a_{ss'} are deterministic and known. The optimal action π*(s) can be calculated from:

π*(s) = argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')] (12)
However, the set of possible actions is too large to traverse, so the search space must be reduced significantly. As shown in the algorithm GC_GRC below, a probabilistic method of generating multiple node-mapping candidates was developed using a node-ranking metric called GRC. The algorithm of the present invention is, however, independent of the GRC metric; two other metrics are also considered, namely RW and a uniform value. The algorithms that generate node-mapping candidates with RW and uniform selection probabilities are GC_RW and GC_UNI, respectively. In GC_GRC, the parameter L is the number of node-mapping candidates generated.
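A hedged sketch of this candidate-generation scheme; the function name and the `score` callable are hypothetical, and plugging in a GRC, RW, or constant score would correspond to GC_GRC, GC_RW, and GC_UNI respectively:

```python
import random

def generate_candidates(vnr_nodes, sn_nodes, score, L=40, rng=None):
    """Draw L node-mapping candidates, choosing each substrate host
    with probability proportional to a ranking metric.
    `score` maps an SN node to its nonnegative rank."""
    rng = rng or random.Random()
    candidates = []
    for _ in range(L):
        hosts = list(sn_nodes)
        weights = [score(n) for n in hosts]
        mapping = {}
        for v in vnr_nodes:
            # Weight-proportional sampling without replacement,
            # so no two virtual nodes share a substrate host.
            i = rng.choices(range(len(hosts)), weights=weights, k=1)[0]
            mapping[v] = hosts.pop(i)
            weights.pop(i)
        candidates.append(mapping)
    return candidates
```

Each candidate is then scored by the value network on its simulated post-embedding state, as described in the VNE-TD algorithm below.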
In the present invention, an RL method, temporal-difference (TD) learning, is used to update the estimate of the optimal value function and to make embedding decisions based on that estimate. Specifically, TD learning updates its estimate of V*(s) as follows:

V*(s_k) ← V*(s_k) + α [r_{k+1} + γ V*(s_{k+1}) − V*(s_k)] (13)

Here V*(s) is approximated by the neural network. Combining equation (11) with the TD update, the parameter update becomes:

θ_{k+1} = θ_k + α [r_{k+1} + γ V(s_{k+1}) − V(s_k)] ∇_θ V(s_k) (14)
According to the above update rules, the TD learning of V*(s) and the supervised learning of V(s) proceed simultaneously.
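The tabular form of the TD(0) update (13) is compact enough to sketch directly:

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Tabular TD(0) update: move V(s) toward the bootstrap target
    r + gamma * V(s') by a fraction alpha of the TD error."""
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * delta
    return V
```

With function approximation, the same TD error multiplies the gradient of V instead of updating a table entry, as in equation (14).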
The algorithm VNE-TD is the embedding-decision function invoked when a VNR arrives. In VNE-TD, the neural-network parameters θ are initialized according to a normal distribution. As shown in the algorithm VNE-TD, the input states to the neural network are the result states of simulating the embedding of each node-mapping candidate, and the candidate with the largest value is selected to actually embed the VNR. After the node mapping is established, the shortest path with sufficient bandwidth between the two SN nodes is found to map each VN link. If splittable flows are allowed, the virtual links are mapped using a multi-commodity flow algorithm. According to expression (12), the mapping j that maximizes r + γ V(s_j^n) should be selected; since the reward r = Rvn(VNR) is the same for all candidates, one can simply choose the mapping j that maximizes V(s_j^n). After embedding the VNR, algorithm VNE-TD stores the triple <s_c, r, s_n> in memory, as shown on line 26. The maximum number of triples the memory can store is set to 1000, and the memory follows the FIFO (first in, first out) replacement rule. To make the training of the neural network smoother, instead of the single-step mode of expression (14), the parameters θ are updated in batches: VNE-TD randomly draws a batch of triples from the memory and trains the neural network with them. As in equation (14), the training error of a triple <s_c, r, s_n> is r + γ V_k(s_n) − V_k(s_c), and the goal of batch training is to minimize the mean square error (loss) over the batch. As shown on line 2, VNE-TD may use any of the three candidate-generation algorithms GC_GRC, GC_RW, or GC_UNI; the resulting algorithms are named VNE-TD-GRC, VNE-TD-RW, and VNE-TD-UNI, respectively.
[The pseudocode of algorithm VNE-TD is given as a figure in the original document.]
When a VNR's lifecycle ends, it leaves the SN and releases the resources previously allocated to it, and the state of the SN changes accordingly. However, the parameters of the neural network are not updated at VNR departures.
Evaluation of
1. Benchmark test and performance index
The VNE-TD is compared to prior art algorithms.
VNE-TD is compared with other algorithms using three main performance indicators: (1) the blocking ratio, the number of blocked VNRs divided by the total number of VNRs; (2) revenue per second, the total revenue obtained so far divided by the number of seconds elapsed; (3) the weighted average path length (WAPL), the sum of all bandwidth actually allocated in the SN divided by the sum of the link bandwidths of all VNRs, i.e., the weighted average length of all paths to which VNR links are mapped.
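The three indicators reduce to simple ratios; a sketch with assumed argument names:

```python
def blocking_ratio(num_blocked, num_total):
    """(1) Fraction of VNRs that could not be embedded."""
    return num_blocked / num_total

def revenue_per_second(total_revenue, elapsed_seconds):
    """(2) Total revenue obtained so far per elapsed second."""
    return total_revenue / elapsed_seconds

def wapl(allocated_sn_bandwidth, requested_vnr_bandwidth):
    """(3) Weighted average path length: SN bandwidth actually
    consumed divided by the bandwidth the VNR links requested;
    1.0 means every virtual link mapped to a single-hop path."""
    return allocated_sn_bandwidth / requested_vnr_bandwidth
```

A WAPL above 1.0 indicates that virtual links consume extra substrate bandwidth by traversing multi-hop paths.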
2. Simulation setup
An event-driven simulation environment is implemented in Python. The neural network and its training are implemented with TensorFlow, a popular open-source software library for machine-learning applications such as neural networks. In the simulation, the topologies of the SN and the VNs are randomly generated using the GT-ITM tool. The SN has 60 nodes and 150 links. The number of VN nodes is uniformly distributed between 2 and 20, and the link connectivity between any two VN nodes is 0.2. 4000 VNRs need to be embedded into the SN. For both the SN and the VNs, the initial node capacities and link bandwidths are chosen randomly from uniform distributions with the same average; the averages for the SN are 40 times those of the VNs. VNRs arrive one after another, forming a Poisson process with an average arrival rate of one request per second. VNR lifetimes follow an exponential distribution with mean μ = 70 seconds. The parameters η and β of the revenue model in expression (1) are set to 1. The discount rate γ in equation (8) is set to 1, because we find that γ = 1 makes the neural network converge more smoothly and faster. For the neural network, we set the number of hidden-layer nodes H to 300, the same size as the input of the neural network. The batch size in the following evaluation subsections is empirically set to 50, and the number of node-mapping candidates L is set to 40. Unless otherwise stated, these parameters are not altered in the following subsections.
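The arrival process can be sketched as follows; the function name and seed are illustrative:

```python
import random

def generate_vnr_events(n, arrival_rate=1.0, mean_lifetime=70.0, seed=0):
    """(arrival, departure) times matching the stated setup: Poisson
    arrivals at one request per second (exponential inter-arrival
    gaps) and exponentially distributed lifetimes with mean 70 s."""
    rng = random.Random(seed)
    t, events = 0.0, []
    for _ in range(n):
        t += rng.expovariate(arrival_rate)             # next arrival
        lifetime = rng.expovariate(1.0 / mean_lifetime)
        events.append((t, t + lifetime))
    return events
```

In an event-driven simulator, the merged, time-sorted stream of these arrival and departure events drives the embedding and resource-release logic.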
Each simulation series in the following subsections, except subsection 4, is run three times, each time with the same SN and VNR topologies described above but a different set of random node capacities and link bandwidths. The standard deviation of the three runs is represented by error bars in the simulation results below.
1. Robustness of the GRC parameter d
In general, the computation of GRC is based on two factors, node capacity and connectivity to other nodes, which are balanced by the GRC parameter d. Fig. 7(a) shows the blocking ratio of the different algorithms, and fig. 7(b) shows the revenue per second. As can be seen from FIG. 7, VNE-TD-GRC is insensitive to the parameter d, while the performance of GRC-VNE depends significantly on it. Furthermore, when d is relatively small, the deviation of GRC-VNE is very large, whereas the deviation of VNE-TD-GRC is small and stable. Under the load set in the simulation, link bandwidth is in greater demand than node capacity and is therefore more critical; for GRC-VNE, the parameter d thus needs to be tuned close to 1.00 to favor the connectivity factor while almost ignoring the node-capacity factor. In contrast, VNE-TD-GRC uses the GRC metric only to narrow the search range and relies on the value function to make the final node-mapping decision. This is why VNE-TD-GRC is insensitive to d compared with GRC-VNE. This is clearly a very desirable property, since VNRs are not known in advance and can vary greatly over time.
Therefore, the present invention sets d to 0.95 for VNE-TD-GRC and 0.995 for GRC-VNE.
2. Influence of TD learning
To show the effect of TD learning, we compare VNE-TD-GRC with the algorithm Rand-GRC (random-selection GRC). Like VNE-TD-GRC, Rand-GRC uses GC_GRC to probabilistically generate L node-mapping candidates. The difference is that instead of selecting the candidate with the maximum V(s), Rand-GRC randomly selects one of the candidates that can be successfully embedded. This means that Rand-GRC has no learning ability compared with VNE-TD-GRC. In the simulations of this subsection, L is set to 10.
As can be seen from fig. 8(a), although its node mapping is probabilistic, the blocking ratio of Rand-GRC is better than that of GRC-VNE thanks to the multiple candidates. This means that even during training, VNE-TD-GRC can still outperform GRC-VNE. Furthermore, when TD learning is used to select the best of the multiple candidates, the blocking ratio improves significantly, by 67.2% at the 3900th VNR compared with GRC-VNE. As can be seen from FIG. 8(b), VNE-TD-GRC increases revenue per second by 13.9% at the 3900th VNR compared with GRC-VNE. Interestingly, Rand-GRC is only about as good as GRC-VNE in revenue per second, although it beats GRC-VNE in blocking ratio; it appears that Rand-GRC is only good at embedding VNRs that are of low revenue and relatively easy to handle. As can be seen from FIG. 8(c), Rand-GRC noticeably worsens the WAPL compared with GRC-VNE because of its probabilistic node mapping, a drawback that VNE-TD-GRC effectively overcomes. This means that TD learning helps increase revenue per second by keeping both the blocking ratio and the WAPL low.
In fig. 9, we show how the loss varies as training proceeds. The loss is the mean square error of the training batch, which is the minimization objective of the training process. As can be seen from fig. 9, the loss converges to a local optimum at around the 700th training step, i.e., after processing the 700th VNR. At the local optimum the loss is about 400 (an error of about 20). Since the average reward is about 92, the loss at the local optimum is relatively small, which suggests that the proposed neural-network approximation works well.
3. Effects of workload
We demonstrate the effect of workload by varying the mean lifetime of VNRs from 40 seconds to 100 seconds. We also add the algorithm LC-GRC (lowest-cost GRC) as a contrast: it uses GC_GRC to generate L node-mapping candidates and selects the candidate with the lowest cost in the SN, whereas our algorithm selects the candidate with the largest value.
As can be seen from fig. 10, the blocking ratio and revenue per second of the three proposed VNE-TD algorithms improve steadily with increasing workload relative to the other algorithms. At the highest workload, the revenue per second of VNE-TD-GRC is 24.8% and 17.1% higher than that of GRC-VNE and RW-MM-SP, respectively.
Among the three versions of VNE-TD, VNE-TD-GRC performs best, while VNE-TD-UNI performs worst and has the greatest variance. This means that the two metrics GRC and RW do help VNE-TD focus on a more promising search area, although the magnitude of the improvement is not large. It also shows the potential of combining VNE-TD with other VNE algorithms.
4. Influence of the parameter L
In fig. 11(a) and (b), we show the effect of the number of node-mapping candidates, i.e., the parameter L. As L increases from 40 to 60, the improvements of VNE-TD-GRC over GRC-VNE in blocking ratio and revenue per second grow from 79.6% and 17.4% to 82.3% and 18.3%, respectively. According to the computational-complexity analysis of VNE-TD in section 3.7, increasing L from 40 to 60 does not cause an unacceptable increase in computation time.
5. Influence of topological properties
In fig. 12, we demonstrate the effect of VN link connectivity. As link connectivity increases, the degree of the VN nodes also increases, which makes embedding more difficult. As can be seen from fig. 12, VNE-TD-GRC outperforms GRC-VNE more clearly when link connectivity is higher: at a link connectivity of 0.5, the revenue per second of VNE-TD-GRC is 23.1% higher than that of GRC-VNE.
Finally, the above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present invention without departing from their spirit and scope, and all such modifications should be covered by the claims of the present invention.

Claims (2)

1. A virtual network embedding method based on time sequence difference learning is characterized in that: the method comprises the following steps:
s101: establishing a VNE model
The underlying network SN is modeled as a weighted undirected graph denoted G_s(V_s, E_s), where V_s is the set of substrate nodes and E_s is the set of substrate links; each substrate node v_s ∈ V_s has computing capacity c_{v_s}, and each substrate link e_s ∈ E_s has bandwidth b_{e_s};
VNR_k is modeled as an undirected graph denoted G_k(V_k, E_k), where V_k is the set of virtual nodes and E_k is the set of virtual links; each virtual node v_k ∈ V_k has a computing-capacity requirement c_{v_k}, and each virtual link e_k ∈ E_k has a bandwidth requirement b_{e_k};
s102: defining states
S102a: defining a reward function for VNE_k, as in formula (1), where VNE_k denotes the embedding procedure for the k-th VNR:
Rvn(k) = η Σ_{v∈V_k} c_v + β Σ_{e∈E_k} b_e (1)
where c_v denotes the node capacity of node v, b_e denotes the link bandwidth of link e, η denotes the unit price of computing resources, and β denotes the unit price of bandwidth resources; it is therefore natural to define the immediate reward after processing VNR_k as Rvn(k), i.e., r_k = Rvn(k);
S102 b: define a set of operations for the VNE: the operational set of the VNE is defined as the set of all possible node mappings;
S102c: defining a Markov state for the VNE:
the normalized remaining node capacities c̄_v and link bandwidths b̄_e of the SN are used to represent the state s_k; s_k is an ordered set, as in formula (3):
s_k = (c̄_{v_1}, ..., c̄_{v_|Vs|}, b̄_{e_1}, ..., b̄_{e_|Es|}) (3)
in the RL, the state signal that successfully retains all relevant information is called Markov;
if the state signal has the Markov property, then the response of the environment at step k+1 depends only on the state and action at step k, in which case the dynamics of the environment can be determined by specifying only:
Pr{s_{k+1}=s', r_{k+1}=r | s_k, a_k} (5)
s103: modeling the VNE as a Markov decision process MDP;
S103a: defining a policy and a value function: a policy of the VNE agent is a mapping from each state s and action a to the probability π(s, a) of taking action a in state s; given a policy π, the value function of the VNE is a function of the VNE state, denoted V^π(s), s ∈ S; V^π(s) can be viewed as the potential to accommodate future VNRs and generate long-term revenue, and measures the quality of the current state; it is defined as in formula (8):
V^π(s) = E_π{R_k | s_k = s} = E_π{ Σ_{i=0..∞} γ^i r_{k+i+1} | s_k = s } (8)
where R_k is the sum of all rewards from VNR_k onward and γ is the discount rate that determines the present value of future rewards;
S103b: defining the optimal value function:
the objective of studying the VNE problem from the RL point of view is to find an optimal policy that yields the maximum return in the long term;
let π* be an optimal policy if and only if, for any policy π, π* ≥ π means that for all s ∈ S, V^{π*}(s) ≥ V^π(s);
the optimal value function is defined as
V*(s) = max_π V^π(s)
and for the optimal value function V*(s) there is the following iterative expression:
V*(s) = max_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')] (9)
S104: approximating the optimal value function V*(s), i.e., the value function under the optimal policy, using a neural network:
the optimal value function V*(s) is approximated using a standard feedforward neural network with two fully-connected (fc) layers; fc1 and fc2 have the same number of nodes, denoted H, and a rectifier is used as the activation function; the input of the neural network is the state s of formula (3); by calculation, the neural network takes the state s as input and outputs a value V(s) that is expected to approximate V*(s);
supervised learning of the approximation function V(s) is the adjustment of the neural-network parameters θ in order to minimize the difference between V(s) and V*(s), which can be expressed as:
min_θ [V*(s) − V(s; θ)]^2 (10)
as the RL process proceeds, V*(s_k) can be regarded as a training sample of the approximation function V(s) for parallel supervised learning; following gradient descent, for VNR k the parameters θ are updated as:
θ_{k+1} = θ_k + α [V*(s_k) − V(s_k)] ∇_θ V(s_k) (11)
wherein α is a positive step size parameter controlling learning speed;
S105: in VNE, given a VNR, the possible actions and the corresponding next states are known; therefore, P^a_{ss'} and R^a_{ss'} are deterministic and known; the matchings of each node mapping are traversed as the action set, and the set of result states of simulated embedding for this action set is used as the input of the neural network of S104 to obtain a plurality of values of the optimal value function; since the optimal policy π*(s) can be expressed as:
π*(s) = argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V*(s')] (12)
the mapping with the maximum value satisfies the optimal policy;
S106: selecting the mapping corresponding to the maximum value of the optimal value function to actually embed the VNR, and then finding the shortest path with sufficient bandwidth between the two SN nodes to map each VN link.
2. The virtual network embedding method based on time sequence difference learning of claim 1, wherein in S105, before traversing the matchings of each node mapping, the following reduction is performed first:
a probabilistic method that generates a plurality of node-mapping candidates is used, generating node-mapping candidates with selection probabilities given by the metric RW or by a uniform value.
CN201910527020.7A 2019-07-19 2019-07-19 Virtual network embedding algorithm based on time sequence difference learning Active CN110233763B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910527020.7A CN110233763B (en) 2019-07-19 2019-07-19 Virtual network embedding algorithm based on time sequence difference learning


Publications (2)

Publication Number Publication Date
CN110233763A CN110233763A (en) 2019-09-13
CN110233763B true CN110233763B (en) 2021-06-18






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant