CN106850289B - Service combination method combining Gaussian process and reinforcement learning - Google Patents


Info

Publication number
CN106850289B
Authority
CN
China
Prior art keywords
state
value
service
gaussian
learning
Prior art date
Legal status
Active
Application number
CN201710055817.2A
Other languages
Chinese (zh)
Other versions
CN106850289A (en)
Inventor
王红兵
李佳杰
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN201710055817.2A priority Critical patent/CN106850289B/en
Publication of CN106850289A publication Critical patent/CN106850289A/en
Application granted granted Critical
Publication of CN106850289B publication Critical patent/CN106850289B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00: Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14: Network analysis or design
    • H04L41/145: Network analysis or design involving simulating, designing, planning or modelling of a network
    • H04L41/50: Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L67/00: Network arrangements or protocols for supporting network services or applications
    • H04L67/01: Protocols
    • H04L67/02: Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a service combination method combining a Gaussian process with reinforcement learning, which comprises the following steps: 1. modeling the service combination problem as a four-tuple Markov decision process; 2. solving the four-tuple Markov decision process with a Q-learning-based reinforcement learning method to obtain an optimal strategy, wherein the Q value is updated through a Gaussian prediction model of the Q value; 3. mapping the optimal strategy into the workflow of the web service combination. Because the method models the learning of the Q value with a Gaussian process, it achieves better accuracy and generalization.

Description

Service combination method combining Gaussian process and reinforcement learning
Technical Field
The invention relates to a method for combining Web services by using a computer, belonging to the field of artificial intelligence.
Background
With the development of computer technology, the requirements placed on software systems have become increasingly complex and changeable. Alongside the growth of the Internet and information technology, the Service-Oriented Architecture (SOA) has gradually emerged: software or components implementing certain functions are published on the Internet as web services, and users can use those functions by communicating with the web services through a messaging protocol. A new software system satisfying a given requirement is then constructed by combining various web services. Weather services and map positioning services are common examples of web services today.
For a given function there are generally multiple services that are similar in function but different in quality of service (QoS). A type of service that can fulfil a certain function is called an abstract service, and the concrete services that fulfil it are called the candidates of that abstract service. For a user requirement, selecting the best-quality service from the candidate services and finally obtaining the best combination of services is the service combination problem; selecting and optimizing the combination according to the QoS attributes of different services is called QoS-aware service combination.
Because the Internet environment is highly dynamic, the QoS attributes of a service may fluctuate or change over time and with the environment, so a service combination method needs a certain adaptivity to cope with environmental change. At the same time, as candidate services keep increasing and business requirements grow more complex, a complex user requirement often involves many abstract services and corresponding candidate services, so the method must also face the challenge of large-scale service combination.
Based on these two problems, some scholars have proposed service combination methods based on Markov Decision Processes (MDP) and reinforcement learning. MDP is a decision-planning technique: in a service combination, the current network environment and context are modeled as states of the MDP, the candidate services available in the current state are modeled as the actions available in the MDP, and after an action is performed the process moves to a new state for the next round of selection, until the whole service combination is completed. Once the service combination process is modeled as an MDP, the search for the optimal service combination becomes the problem of solving the MDP model, for which reinforcement learning is an effective method. In particular, in the large-scale, dynamic environment of the service combination problem, reinforcement learning, which learns through iterative interaction with the environment, is naturally adaptive and well suited to service combination in a network environment. However, in the traditional reinforcement learning algorithm Q-learning, the Q value is recorded in a value table; this lacks generalization ability, yields results that are not accurate enough, and is strongly affected by noise.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention discloses a service combination method combining a Gaussian process and reinforcement learning, in which the Gaussian process is used to model the learning of the Q value, so that the Q-value estimate has better accuracy and generalization.
The technical scheme adopted by the invention is as follows:
a service combination method combining Gaussian process and reinforcement learning comprises the following steps:
(1) modeling a service composition problem as a four-tuple Markov decision process;
(2) solving a four-tuple Markov decision process by applying a Q-learning-based reinforcement learning method to obtain an optimal strategy;
(3) and mapping the optimal strategy into the workflow of the web service combination.
Specifically, the service composition problem is modeled in step (1) as a four-tuple Markov decision process as follows:
M=<S,A,P,R>
where S is the set of finite states of the environment; A is the set of callable actions, and A(s) represents the set of actions that can be performed in state s; P is the function describing the MDP state transition, and P(s'|s, a) represents the probability of transitioning to state s' after invoking action a in state s; R is a reward value function, and R(s, a) represents the reward value resulting from invoking action a in state s.
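For readers who prefer code, this four-tuple can be written down as a small data structure. The following Python sketch is purely illustrative; the class name, field names and type aliases are assumptions of the sketch, not part of the patent:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = str
Service = str  # a concrete candidate service, i.e. an action in the MDP

@dataclass
class ServiceCompositionMDP:
    """Four-tuple M = <S, A, P, R> for QoS-aware service combination."""
    states: List[State]                                   # S: finite set of environment states
    actions: Dict[State, List[Service]]                   # A(s): services callable in state s
    transition: Callable[[State, Service, State], float]  # P(s' | s, a): transition probability
    reward: Callable[[State, Service], float]             # R(s, a): reward derived from observed QoS
```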
Specifically, the step (2) of solving the four-tuple Markov decision process by applying a Q-learning-based reinforcement learning method to obtain the optimal strategy comprises the following steps:
(21) taking the state action pair z as input and the corresponding Q value Q (z) as output, and establishing a Gaussian prediction model of the Q value;
(22) initializing the learning rate σ, the discount rate γ and the greedy strategy probability ε in Q-learning, and setting the current state s_t = 0 and the current time step t = 0;
(23) selecting one service as the current service a_t using a greedy strategy with probability ε, and executing it;
(24) recording the return value r_t obtained by executing the current service a_t in the current state s_t, and the state s_{t+1} after the service is executed; calculating the Q value of the state-action pair z_t = <s_t, a_t> according to:

Q(z_t) ← Q(z_t) + σ·[ r_t + γ·Q(s_{t+1}, a_{t+1}) − Q(z_t) ]

where Q(z_t) is the Q value of the state-action pair z_t = <s_t, a_t>, σ is the learning rate, r_t is the return value, γ is the discount rate, s_{t+1} is the subsequent state reached from the current state s_t after executing service a_t, a_{t+1} is the service selected in state s_{t+1}, and Q(s_{t+1}, a_{t+1}) denotes the Q value of the state-action pair <s_{t+1}, a_{t+1}>;
(25) updating the Q value according to a Gaussian prediction model:
Q(z_{t+1}) = K(Z, z_{t+1})^T · [ K(Z, Z) + ω_n²·I ]^{−1} · f

where I is the identity matrix, ω_n is an uncertainty parameter, Z is the set of historical state-action pairs, f is the set of historical Q values corresponding to Z, K(Z, Z) is the covariance matrix between the historical state-action pairs, whose element in row i and column j is k(z_i, z_j), k(·) being a kernel function, and K(Z, z_{t+1}) is the covariance vector between the historical state-action pairs and the newly entered state-action pair z_{t+1};
the Gaussian prediction model is then updated according to the state-action pair z_{t+1} = <s_{t+1}, a_{t+1}> and the corresponding Q value Q(z_{t+1});
(26) updating the current state: s_t = s_{t+1}; when s_t is the termination state and the convergence condition is met, reinforcement learning ends and the optimal strategy is obtained; otherwise, go to step (23).
Specifically, the kernel function k(·) in the Gaussian prediction model is a Gaussian kernel function:

k(z_i, z_j) = exp( −‖z_i − z_j‖² / (2σ_k²) )

where σ_k is the width of the Gaussian kernel function.
Specifically, the convergence condition in step (26) is that the change in the Q value is smaller than the Q-value threshold Q_th, namely: |Q(z_t) − Q(z_{t+1})| < Q_th.
Beneficial effects: compared with the prior art, the service combination method disclosed by the invention has the following advantages. When the reinforcement-learning Q value is calculated, the traditional practice of recording and looking up the Q value in a value table is improved: the service selected and invoked at each step and the observed QoS attributes are taken as the input and output of an unknown function, and during the iteration the Q value is estimated through a Gaussian process instead of being looked up in a value table, while the parameters of the Gaussian process are learned and updated. This makes the prediction of the Q value more accurate and finally yields a better service combination result. Moreover, because the Gaussian process model can be trained from existing data and can therefore predict and estimate new data, the method has good generalization ability and can better adapt to the dynamic and changeable web service combination environment.
Drawings
FIG. 1 is a basic service composition model;
FIG. 2 is a schematic diagram of service composition modeled by MDP;
FIG. 3 is a schematic diagram of a basic Gaussian process;
FIG. 4 is a flow diagram of a method of service composition incorporating a Gaussian process and reinforcement learning.
Detailed Description
The invention is further elucidated with reference to the drawings and the detailed description.
The basic model of service composition is shown in FIG. 1: a complex software system can be viewed as a workflow composed of a number of components or subsystems, which in the field of service composition are web services. Thus, when combining services, the user's requirement can be modeled as an abstract task workflow diagram in which the individual components are abstract services. For each abstract service there may be multiple candidate services with similar functionality but different QoS (quality of service), so a suitable concrete service can be selected from the candidates according to the QoS attributes, finally combining them into a usable service composition system.
The invention discloses a service combination method combining a Gaussian process and reinforcement learning, which comprises the following steps:
step 1, modeling a service combination problem into a four-tuple Markov decision process:
M=<S,A,P,R>
where S is the set of finite states of the environment; A is the set of callable actions, and A(s) represents the set of actions that can be performed in state s; P is the function describing the MDP state transition, and P(s'|s, a) represents the probability of transitioning to state s' after invoking action a in state s; R is a reward value function, and R(s, a) represents the reward value resulting from invoking action a in state s.
FIG. 2 shows an example of service composition modeled as an MDP, describing the service composition process for a travel scenario. In the MDP model, the candidate services that can be invoked are modeled as distinct actions. Invoking different actions may lead to different states, and the new state determines the set of services that can be invoked next. For each invoked service, its quality, i.e. the reward function in the MDP model, is evaluated through the observed QoS attributes. In this way a service composition problem is transformed into an MDP model, which can then be solved and optimized by a reinforcement learning method.
Step 2, solving a four-tuple Markov decision process by applying a Q-learning-based reinforcement learning method to obtain an optimal strategy;
and solving the MDP model to find the optimal service selection strategy in each state, so that the final combined result is more optimal. In the MDP model, the goodness or badness of an action is not only dependent on the immediate return value generated by the action, but also related to the subsequent state and return caused by the action, and in the reinforcement learning algorithm Q-learning, the estimated value of selecting action a in state s is evaluated by using a Q-value function Q (s, a), and the iterative formula is as follows:
Figure BDA0001219062010000051
wherein σ is a learning rate for controlling the magnitude of the degree of change at each update of the Q value; gamma is discount rate, which is used to control the influence degree of future state; reinforcement learning theory considers that the effect of an immediate return value should be greater than the future possible return values, so that the value of γ is between 0 and 1. R is R (s, a), which is the report value for performing action a in state s. Q (s ', a') represents the Q value selected for a 'after transition to state s' after performing action a, and is used to represent a future prize value.
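For orientation, the table-based update just described can be sketched in a few lines of Python. This is a minimal illustration of classical Q-learning with a value table (the function name, default parameter values and the greedy choice of a' are assumptions of the sketch), not the Gaussian-process variant introduced below:

```python
from collections import defaultdict

# Value table mapping (state, service) pairs to Q values, defaulting to 0.0.
q_table = defaultdict(float)

def q_learning_update(s, a, r, s_next, next_services, sigma=0.1, gamma=0.9):
    """One tabular update: Q(s,a) <- Q(s,a) + sigma * [R + gamma * Q(s',a') - Q(s,a)],
    taking a' greedily over the services available in the next state s'."""
    future = max((q_table[(s_next, a2)] for a2 in next_services), default=0.0)
    q_table[(s, a)] += sigma * (r + gamma * future - q_table[(s, a)])
```

The dictionary q_table is exactly the value table whose limitations are discussed next.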
In the traditional reinforcement learning process, the calculated Q values are recorded, and when Q is later updated, Q(s', a') is obtained by looking it up in the previously computed Q-value table; this is sufficient in some application scenarios. In a highly dynamic service combination scenario, however, this approach lacks generalization ability and cannot cope with data changes in a real environment. Moreover, as the scale of the service combination grows, the space and time required to store and query the value table consume considerable computing power, so the real-time requirement cannot be met well. Therefore, the invention proposes modeling the estimation of the Q value with a Gaussian process, thereby improving generalization ability, coping better with the dynamic environment and obtaining better results in practical applications.
As shown in fig. 4, the method specifically includes the following steps:
(21) taking the state action pair z as input and the corresponding Q value Q (z) as output, and establishing a Gaussian prediction model of the Q value;
the gaussian process is schematically shown in fig. 3, a gaussian process model is trained according to known input and output data, and when a new input arrives, the corresponding output is predicted through the model. The Gaussian process model is uniquely determined by the mean function and the covariance function, so that the adjustment and optimization are easy, and the iterative convergence is relatively fast.
Specifically, a group of n training samples {(z_i = (s_i, a_i), Q(z_i)) | i = 1, ..., n} is selected, where the state-action pair z_i = (s_i, a_i) is the input and Q(z_i), the Q value corresponding to that state-action pair, is the output. z_* and Q_* denote the data to be predicted. The Gaussian process assumes that the inputs and outputs satisfy a joint probability distribution; K(X, X_*) denotes the n × n_* covariance matrix between all training points X and the test points X_* (n is the number of training points, n_* the number of test points), whose element in row i and column j is k(X_i, X_*j), X_i being the i-th element of the set X.
K(X, X), K(X_*, X) and K(X_*, X_*) are defined similarly, and the joint distribution of the training outputs and the test outputs is:

(f, f_*) ~ N( 0, [ [K(X, X) + ω_n²·I, K(X, X_*)], [K(X_*, X), K(X_*, X_*)] ] )

The expectation of the predicted Q(z_*) is α_*^T · K(Z, z_*), where

α_* = [ K(Z, Z) + ω_n²·I ]^{−1} · f

where ω_n denotes an uncertainty parameter, taken to be 1 in this embodiment; I is the identity matrix; Z is the set of historical state-action pairs; f is the set of historical Q values corresponding to Z; K(Z, Z) is the covariance matrix between the historical state-action pairs, whose element in row i and column j is k(z_i, z_j), k(·) being a kernel function; and K(Z, z_{t+1}) is the covariance vector between the historical state-action pairs and the newly entered state-action pair z_{t+1}.
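A minimal NumPy sketch of this Gaussian-process prediction (Gaussian kernel and posterior mean only) might look as follows; the function names and default parameter values are assumptions of the sketch rather than the patent's reference implementation:

```python
import numpy as np

def gaussian_kernel(z1, z2, sigma_k=1.0):
    """k(z_i, z_j) = exp(-||z_i - z_j||^2 / (2 * sigma_k^2))."""
    diff = np.asarray(z1, dtype=float) - np.asarray(z2, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * sigma_k ** 2)))

def gp_predict_q(Z, f, z_new, omega_n=1.0, sigma_k=1.0):
    """Posterior mean Q(z_new) = K(Z, z_new)^T [K(Z, Z) + omega_n^2 I]^{-1} f.

    Z     : historical state-action pairs, each encoded as a numeric vector
    f     : the corresponding historical Q values
    z_new : the newly entered state-action pair (numeric vector)
    """
    n = len(Z)
    K = np.array([[gaussian_kernel(Z[i], Z[j], sigma_k) for j in range(n)] for i in range(n)])
    k_new = np.array([gaussian_kernel(Z[i], z_new, sigma_k) for i in range(n)])
    alpha = np.linalg.solve(K + (omega_n ** 2) * np.eye(n), np.asarray(f, dtype=float))
    return float(k_new @ alpha)
```

For example, gp_predict_q([[0.0, 1.0], [1.0, 2.0]], [0.5, 0.8], [1.0, 1.0]) returns the predicted Q value of the new pair under these assumptions.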
(22) initializing the learning rate σ, the discount rate γ and the greedy strategy probability ε in Q-learning, and setting the current state s_t = 0 and the current time step t = 0;
(23) selecting one service as the current service a_t using a greedy strategy with probability ε and executing it, specifically: randomly generating a random number υ in the interval (0, 1); if υ > ε, a new service a is selected at random; if υ ≤ ε, the service with the largest current Q value is selected as the new service a; this avoids falling into local optima;
(24) recording the return value r_t obtained by executing the current service a_t in the current state s_t, and the state s_{t+1} after the service is executed; calculating the Q value of the state-action pair z_t = <s_t, a_t> according to:

Q(z_t) ← Q(z_t) + σ·[ r_t + γ·Q(s_{t+1}, a_{t+1}) − Q(z_t) ]

where Q(z_t) is the Q value of the state-action pair z_t = <s_t, a_t>, σ is the learning rate, r_t is the return value, γ is the discount rate, s_{t+1} is the subsequent state reached from the current state s_t after executing service a_t, a_{t+1} is the service selected in state s_{t+1}, and Q(s_{t+1}, a_{t+1}) denotes the Q value of the state-action pair <s_{t+1}, a_{t+1}>;
(25) updating the Q value according to a Gaussian prediction model:
Q(z_{t+1}) = K(Z, z_{t+1})^T · [ K(Z, Z) + ω_n²·I ]^{−1} · f

where I is the identity matrix, ω_n is the uncertainty parameter, Z is the set of historical state-action pairs, f is the set of historical Q values corresponding to Z, K(Z, Z) is the covariance matrix between the historical state-action pairs, whose element in row i and column j is k(z_i, z_j), k(·) being a kernel function, and K(Z, z_{t+1}) is the covariance vector between the historical state-action pairs and the newly entered state-action pair z_{t+1}. Many kinds of kernel function can be used; in this embodiment the kernel function k is a Gaussian kernel:

k(z_i, z_j) = exp( −‖z_i − z_j‖² / (2σ_k²) )

where σ_k is the width of the Gaussian kernel function.
Since the Gaussian model has changed with the newly added data point, the Gaussian prediction model must be updated according to the state-action pair z_{t+1} = <s_{t+1}, a_{t+1}> and the corresponding Q value Q(z_{t+1}), for use in the next iterative update of the Q value;
(26) updating the current state: s_t = s_{t+1}; when s_t is the termination state and the convergence condition is met, reinforcement learning ends and the optimal strategy is obtained; otherwise, go to step (23).
The convergence condition in this embodiment is that the Q value has become stable, i.e. the change in the Q value is smaller than the Q-value threshold Q_th: |Q(z_t) − Q(z_{t+1})| < Q_th. The optimal strategy is obtained at that point, and the final service combination result is derived from the optimal strategy.
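Putting steps (22) to (26) together, the following outline shows how the ε-greedy selection, the Q-value target and the Gaussian prediction model could interact in one learning loop. It builds on the gp_predict_q sketch above; the environment interface, the encoding of state-action pairs and all parameter values are hypothetical, so this illustrates the control flow rather than the patent's implementation:

```python
import random

# Assumes gp_predict_q(...) from the Gaussian-process sketch above; env, encode and
# the parameter defaults below are hypothetical and only illustrate the control flow.

def run_gp_q_learning(env, encode, sigma=0.1, gamma=0.9, epsilon=0.8,
                      omega_n=1.0, sigma_k=1.0, q_th=1e-3, max_episodes=100):
    """Illustrative Gaussian-process Q-learning loop for service combination.

    env    : hypothetical environment exposing reset(), actions(s) and
             step(s, a) -> (reward, next_state, done)
    encode : maps a state-action pair <s, a> to a numeric vector (the GP input z)
    """
    Z, f = [], []          # historical state-action pairs and their Q values
    converged = False

    def q(s, a):
        # Q value predicted by the Gaussian model; 0 before any data exists.
        return gp_predict_q(Z, f, encode(s, a), omega_n, sigma_k) if Z else 0.0

    def greedy(s):
        return max(env.actions(s), key=lambda act: q(s, act))

    for _ in range(max_episodes):
        if converged:
            break
        s, done, prev_q = env.reset(), False, None
        while not done:
            # (23) epsilon-greedy: exploit with probability epsilon, otherwise explore.
            a = greedy(s) if random.random() <= epsilon else random.choice(env.actions(s))
            # (24) execute the service, observe the return value and the next state.
            r, s_next, done = env.step(s, a)
            future = 0.0 if done else q(s_next, greedy(s_next))
            new_q = q(s, a) + sigma * (r + gamma * future - q(s, a))
            # (25) record the new observation, which updates the Gaussian prediction model.
            Z.append(encode(s, a))
            f.append(new_q)
            # (26) advance the state; stop once consecutive Q values have stabilised.
            if done and prev_q is not None and abs(new_q - prev_q) < q_th:
                converged = True
            prev_q, s = new_q, s_next

    # Step 3: map the learned optimal strategy to one concrete service per state.
    return {st: greedy(st) for st in getattr(env, "states", []) if env.actions(st)}
```

In practice the growing kernel matrix would need to be updated incrementally or sparsified; that bookkeeping is omitted here for brevity.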

Claims (3)

1. A service combination method combining Gaussian process and reinforcement learning is characterized by comprising the following steps:
(1) modeling the service composition problem as a four-tuple Markov decision process as follows:
M=<S,A,P,R>
where S is the set of finite states of the environment; A is the set of callable actions, and A(s) represents the set of actions that can be performed in state s; P is the function describing the MDP state transition, and P(s'|s, a) represents the probability of transitioning to state s' after invoking action a in state s; R is a return value function, and R(s, a) represents the return value obtained by invoking action a in state s;
(2) solving a four-tuple Markov decision process by applying a Q-learning-based reinforcement learning method to obtain an optimal strategy;
(3) mapping the optimal strategy into a workflow of a web service combination;
the step (2) of solving the four-tuple Markov decision process by applying a Q-learning-based reinforcement learning method to obtain an optimal strategy comprises the following steps:
(21) taking the state action pair z as input and the corresponding Q value Q (z) as output, and establishing a Gaussian prediction model of the Q value;
(22) initializing the learning rate σ, the discount rate γ and the greedy strategy probability ε in Q-learning, and setting the current state s_t = 0 and the current time step t = 0;
(23) selecting the current service a_t using a greedy strategy with probability ε, and executing it;
(24) recording the return value r_t obtained by executing the current service a_t in the current state s_t, and the state s_{t+1} after the current service a_t is executed; calculating the Q value of the state-action pair z_t = <s_t, a_t> according to:

Q(z_t) ← Q(z_t) + σ·[ r_t + γ·Q(s_{t+1}, a_{t+1}) − Q(z_t) ]

where Q(z_t) is the Q value of the state-action pair z_t = <s_t, a_t>, σ is the learning rate, r_t is the return value, γ is the discount rate, s_{t+1} is the subsequent state reached from the current state s_t after executing service a_t, a_{t+1} is the service selected in state s_{t+1}, and Q(s_{t+1}, a_{t+1}) denotes the Q value of the state-action pair <s_{t+1}, a_{t+1}>;
(25) updating the Q value according to a Gaussian prediction model:

Q(z_{t+1}) = K(Z, z_{t+1})^T · [ K(Z, Z) + ω_n²·I ]^{−1} · f

where I is the identity matrix, ω_n is an uncertainty parameter, Z is the set of historical state-action pairs, f is the set of historical Q values corresponding to Z, K(Z, Z) is the covariance matrix between the historical state-action pairs, whose element in row i and column j is k(z_i, z_j), k(·) being a kernel function, and K(Z, z_{t+1}) is the covariance vector between the historical state-action pairs and the newly entered state-action pair z_{t+1};
updating the Gaussian prediction model according to the state-action pair z_{t+1} = <s_{t+1}, a_{t+1}> and the corresponding Q value Q(z_{t+1});
(26) updating the current state: s_t = s_{t+1}; when s_t is the termination state and the convergence condition is met, reinforcement learning ends and the optimal strategy is obtained; otherwise, go to step (23).
2. The service combination method combining Gaussian process and reinforcement learning according to claim 1, wherein the kernel function k(·) in the Gaussian prediction model is a Gaussian kernel function:

k(z_i, z_j) = exp( −‖z_i − z_j‖² / (2σ_k²) )

where σ_k is the width of the Gaussian kernel function.
3. The service combination method combining Gaussian process and reinforcement learning according to claim 1, wherein the convergence condition in step (26) is that the change in the Q value is smaller than the Q-value threshold Q_th, namely: |Q(z_t) − Q(z_{t+1})| < Q_th.
CN201710055817.2A 2017-01-25 2017-01-25 Service combination method combining Gaussian process and reinforcement learning Active CN106850289B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710055817.2A CN106850289B (en) 2017-01-25 2017-01-25 Service combination method combining Gaussian process and reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710055817.2A CN106850289B (en) 2017-01-25 2017-01-25 Service combination method combining Gaussian process and reinforcement learning

Publications (2)

Publication Number Publication Date
CN106850289A CN106850289A (en) 2017-06-13
CN106850289B true CN106850289B (en) 2020-04-24

Family

ID=59120622

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710055817.2A Active CN106850289B (en) 2017-01-25 2017-01-25 Service combination method combining Gaussian process and reinforcement learning

Country Status (1)

Country Link
CN (1) CN106850289B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108319852B (en) * 2018-02-08 2022-05-06 北京安信天行科技有限公司 Event discrimination strategy creating method and device
CN108972546B (en) * 2018-06-22 2021-07-20 华南理工大学 Robot constant force curved surface tracking method based on reinforcement learning
CN108958916B (en) * 2018-06-29 2021-06-22 杭州电子科技大学 Workflow unloading optimization method under mobile edge environment
CN109388484B (en) * 2018-08-16 2020-07-28 广东石油化工学院 Multi-resource cloud job scheduling method based on Deep Q-network algorithm
CN109670637A (en) * 2018-12-06 2019-04-23 苏州科技大学 Building energy consumption prediction technique, storage medium, device and system
KR102251316B1 (en) * 2019-06-17 2021-05-12 (주)브이엠에스 솔루션스 Reinforcement learning and simulation based dispatching method within a factory, and an apparatus thereof
CN113065284B (en) * 2021-03-31 2022-11-01 天津国科医工科技发展有限公司 Triple quadrupole mass spectrometer parameter optimization strategy calculation method based on Q learning


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103248693A (en) * 2013-05-03 2013-08-14 东南大学 Large-scale self-adaptive composite service optimization method based on multi-agent reinforced learning
CN103646008A (en) * 2013-12-13 2014-03-19 东南大学 Web service combination method
CN105046351A (en) * 2015-07-01 2015-11-11 内蒙古大学 Reinforcement learning-based service combination method and system in uncertain environment

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Integrating Gaussian Process with Reinforcement Learning for Adaptive Service Composition";Hongbing Wang 等;《Lecture Notes in Computer Science》;20151125;第203-217页 *
"基于多Agent学习机制的服务组合";赵海燕 等;《计算机工程与科学》;20130915;第35卷(第9期);正文第3节 *
"基于强化学习的QoS感知的服务组合优化方案研究";吴琴;《中国优秀硕士学位论文全文数据库 信息科技辑》;20160815;I139-123 *

Also Published As

Publication number Publication date
CN106850289A (en) 2017-06-13

Similar Documents

Publication Publication Date Title
CN106850289B (en) Service combination method combining Gaussian process and reinforcement learning
Wang et al. Adaptive and large-scale service composition based on deep reinforcement learning
CN108932671A (en) A kind of LSTM wind-powered electricity generation load forecasting method joined using depth Q neural network tune
CN110235148A (en) Training action selects neural network
EP3593289A1 (en) Training action selection neural networks using a differentiable credit function
CN108962238A (en) Dialogue method, system, equipment and storage medium based on structural neural networks
CN112541302A (en) Air quality prediction model training method, air quality prediction method and device
CN107241213A (en) A kind of web service composition method learnt based on deeply
CN109840595B (en) Knowledge tracking method based on group learning behavior characteristics
CN111310987B (en) Method and device for predicting free parking space of parking lot, electronic equipment and storage medium
CN112329948A (en) Multi-agent strategy prediction method and device
CN112154458A (en) Reinforcement learning using proxy courses
CN115905691B (en) Preference perception recommendation method based on deep reinforcement learning
CN110235149A (en) Neural plot control
CN112930541A (en) Determining a control strategy by minimizing delusional effects
CN115293623A (en) Training method and device for production scheduling model, electronic equipment and medium
CN114637911A (en) Next interest point recommendation method of attention fusion perception network
US20220269835A1 (en) Resource prediction system for executing machine learning models
CN113379536A (en) Default probability prediction method for optimizing recurrent neural network based on gravity search algorithm
CN113095513A (en) Double-layer fair federal learning method, device and storage medium
CN112541570A (en) Multi-model training method and device, electronic equipment and storage medium
CN114861917A (en) Knowledge graph inference model, system and inference method for Bayesian small sample learning
CN115098613A (en) Method, device and medium for tracking and predicting trajectory data
CN115098583A (en) User portrait depicting method for energy user
CN114528992A (en) Block chain-based e-commerce business analysis model training method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant