CN110209467B - Elastic resource expansion method and system based on machine learning

Info

Publication number: CN110209467B
Application number: CN201910437262.7A
Authority: CN
Inventors: 刘方明; 金海; 李羿
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2019-05-23
Filing date: 2019-05-23
Publication date: 2021-02-05
Anticipated expiration: 2039-05-23
Also published as: CN110209467A

Abstract

The invention discloses a machine learning-based elastic resource expansion method and system, belonging to the fields of cloud computing and deep learning. The method includes: given the running deadline t_d and the task computation amount of a task to be run, calculating the minimum total amount of resources required to complete the task using a regression model; continuously collecting the current running state and the resource utilization of the task while it runs, and inputting the minimum total amount of resources, the current running state of the task, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task; if T_c > t_d, calculating the minimum total amount of resources for which the final predicted completion time T_c′ < t_d; if the task is not completed, continuing the collection, and if the task is completed, stopping the collection. By calculating the minimum total amount of resources with a regression model, the method ensures that the task can be completed on time; it predicts the completion time while the task runs and automatically performs elastic expansion of computing resources when the predicted completion time exceeds the running deadline.

Description

Elastic resource expansion method and system based on machine learning

Technical Field

The invention belongs to the technical field of cloud computing and the field of deep learning, and particularly relates to a machine learning-based elastic resource expansion method and system.

Background

In the cloud computing service model, a tenant first tells the cloud service provider how many cloud computing resources it wants to apply for, and the provider then allocates resources according to the request. In this model, the tenant needs to estimate the total amount of resources required based on its own workload. However, because cloud tenants lack knowledge of how the provider's service infrastructure is implemented, they have difficulty estimating the amount of computing resources needed in the virtual environment of the cloud platform from their previous experience of running the service locally. Tenants have therefore adopted a workaround: first apply to the cloud service provider for a smaller amount of resources and, when the computing task cannot meet its deadline, apply for more. A system that manages and performs such automatic capacity expansion is called an elastic resource management system.

However, the elastic resource management systems currently offered by cloud service providers are generally rule-based. The tenant must define the rules that trigger resource expansion, for example that CPU usage stays above a threshold for longer than a given time. Formulating these rules is itself difficult for the tenant, who must spend a long time tuning the thresholds in the expansion rules to achieve the desired effect.

For some cloud computing tasks based on the MapReduce framework, elastic resource management is even more difficult. First, a threshold-based expansion strategy may fail: because a MapReduce task is computationally intensive, the utilization of the virtual machines' computing resources stays close to 100% throughout the computation, and it is difficult for a user to choose a threshold that triggers expansion at the right time. Second, a MapReduce computing task comprises several operation processes (the Map process and the Reduce process), each with different resource requirements, and these processes influence each other. It is therefore difficult for the tenant to estimate how much time is required to complete the task by analyzing its operation processes. Finally, even if the tenant can estimate the completion time from some task-related parameters, the accuracy of such an estimate may be low in a virtual environment. The computing performance of a virtual machine fluctuates continuously in the cloud environment, and when that performance drops, the task execution speed is greatly affected, which in turn degrades the estimation accuracy.

Therefore, the prior art has the technical problems of difficult elastic resource management and low estimation accuracy.

Disclosure of Invention

In view of the above defects or improvement needs of the prior art, the present invention provides a method and a system for elastic resource expansion based on machine learning, so as to solve the technical problems of difficult elastic resource management and low estimation accuracy in the prior art.

To achieve the above object, according to an aspect of the present invention, there is provided a machine learning-based elastic resource expansion method, including the following steps:

(1) given the running deadline t_d and the task computation amount of the task to be run, calculating the minimum total amount of resources required to complete the task using a regression model;

(2) while the task is running, continuously collecting the current running state and the resource utilization of the task, and inputting the minimum total amount of resources, the current running state of the task, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task;

(3) if T_c ≤ t_d, proceeding to step (4); if T_c > t_d, calculating the minimum total amount of resources for which the final predicted completion time T_c′ < t_d;

(4) if the task is not completed, returning to step (2); if the task is completed, stopping the collection;

wherein the regression model is obtained by calculating the correlation coefficients between the computing resources used to run historical tasks and the completion times of those tasks, and then fitting a regression equation;

and the prediction model is obtained by training on information about historical task runs, the information comprising: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks.

Further, the computing resources used to run the historical tasks include: the total amount of memory r in the virtual machine, the type h of storage medium in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster.

Further, the regression model is:

where t is the predicted task completion time, w is the task computation amount, and b₀, b₁, b₂, b₃, b₄, b₅ and b₆ denote the zeroth through sixth fitting parameters, respectively.

Further, the zeroth through sixth fitting parameters are obtained by logistic regression fitting, such that the error of the regression model with these fitting parameters is minimized.

Further, the running log of the historical task comprises: the percentage of subtasks completed, the completion speed of the completed subtasks, the percentage of completion of the historical tasks, and the time difference between two completed subtasks.

Further, the resource utilization of the historical tasks includes: the CPU utilization of the head node, the operation load of the head node, the memory usage of the head node, the CPU utilization of the compute nodes, the operation load of the compute nodes, and the memory usage of the compute nodes.
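As a rough illustration, the three groups of inputs named above could be laid out as plain records. This is only a sketch: the class names, field names and the `to_feature_row` helper are hypothetical, and only the listed quantities come from the text.

```python
from dataclasses import dataclass, astuple
from typing import List


@dataclass(frozen=True)
class ComputeResources:
    """Computing resources used to run a task (names are illustrative)."""
    n: int      # number of virtual machines in the cluster
    r: float    # total amount of memory in the virtual machine
    h: str      # type of storage medium
    u: str      # resource sharing mode
    g: str      # CPU architecture


@dataclass(frozen=True)
class RunLogSample:
    """One sample of the running log."""
    subtasks_done_pct: float     # percentage of subtasks completed
    subtask_speed: float         # completion speed of completed subtasks
    task_done_pct: float         # completion percentage of the whole task
    completion_gap: float        # time difference between two completed subtasks


@dataclass(frozen=True)
class UtilizationSample:
    """One sample of resource utilization on the head node and a compute node."""
    head_cpu: float
    head_load: float
    head_mem: float
    worker_cpu: float
    worker_load: float
    worker_mem: float


def to_feature_row(log: RunLogSample, util: UtilizationSample) -> List[float]:
    """Flatten one time step into a numeric row for a sequence model."""
    return list(astuple(log)) + list(astuple(util))


# Toy values, not measurements.
row = to_feature_row(RunLogSample(40.0, 1.5, 35.0, 2.0),
                     UtilizationSample(0.9, 3.2, 0.5, 0.99, 7.8, 0.7))
```

Keeping the time-varying samples separate from the static `ComputeResources` record mirrors the distinction the description later draws between temporal and non-temporal features.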

Further, the training of the predictive model includes:

constructing a multi-modal neural network comprising a feature extraction layer, a feature fusion layer and a regression layer;

extracting features from the running logs of the historical tasks and the resource utilization of the historical tasks using the feature extraction layer;

inputting the extracted features, the task computation amount and the computing resources used to run the historical tasks into the feature fusion layer, where they are successively fused, denoised and reduced in dimension to obtain a new feature vector;

and inputting the new feature vector into the regression layer for regression training, finally obtaining the prediction model.
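The last of these steps, regression training on the fused feature vector, can be illustrated with a deliberately small stand-in: a single linear layer fitted by batch gradient descent on mean squared error. This is not the multi-modal network itself. Feature extraction and fusion are assumed to have already produced the vectors, and every name and number below is hypothetical.

```python
def predict(weights, bias, x):
    """Linear regression layer: weighted sum of the fused features plus a bias."""
    return sum(w_j * x_j for w_j, x_j in zip(weights, x)) + bias


def mse(weights, bias, features, targets):
    """Mean squared error between predicted and observed completion times."""
    return sum((predict(weights, bias, x) - y) ** 2
               for x, y in zip(features, targets)) / len(targets)


def train_regression_layer(features, targets, lr=0.05, epochs=2000):
    """Fit the layer by batch gradient descent on the mean squared error."""
    dim, m = len(features[0]), len(features)
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        # Residuals are computed once per epoch, before any parameter changes.
        errors = [predict(weights, bias, x) - y for x, y in zip(features, targets)]
        grads = [2.0 / m * sum(e * x[j] for e, x in zip(errors, features))
                 for j in range(dim)]
        weights = [w_j - lr * g_j for w_j, g_j in zip(weights, grads)]
        bias -= lr * 2.0 / m * sum(errors)
    return weights, bias


# Hypothetical fused feature vectors and observed completion times (toy values).
fused = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
observed = [1.0, 3.0, 4.0, 6.0]
weights, bias = train_regression_layer(fused, observed)
```

In a real network the gradients would flow back through the fusion and extraction layers as well; the sketch only shows what minimizing the regression error means for the final layer.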

Further, the step (1) comprises the following steps:

(11) given the running deadline t_d and the task computation amount of the task to be run, setting the maximum number n_max of virtual machines in the cluster;

(12) under the constraint that the number of virtual machines in the cluster is less than n_max, traversing all combinations of the total amount of memory in the virtual machine, the type of storage medium in the virtual machine, the resource sharing mode of the virtual machine, the CPU architecture of the virtual machine, and the number of virtual machines in the cluster;

(13) calculating the completion time of each combination using the regression model, and, among the combinations whose completion time is less than or equal to t_d, taking the one with the least total amount of resources as the minimum total amount of resources needed to complete the task.
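Steps (11) to (13) amount to an exhaustive search over resource combinations. A minimal sketch follows, assuming a callable `predict_time` that stands in for the fitted regression model and a callable `cost` that ranks combinations by total resources. Both are hypothetical, as are the toy model, the option lists and the speed table; the text does not specify how the total amount of resources is compared.

```python
from itertools import product


def min_resource_config(predict_time, cost, w, t_d, n_max,
                        r_opts, h_opts, u_opts, g_opts):
    """Return the cheapest (n, r, h, u, g) whose predicted time meets t_d, or None."""
    best = None
    # Step (12): traverse every combination with fewer than n_max virtual machines.
    for combo in product(range(1, n_max), r_opts, h_opts, u_opts, g_opts):
        # Step (13): keep the feasible combination with the least resources.
        if predict_time(w, *combo) <= t_d:
            if best is None or cost(*combo) < cost(*best):
                best = combo
    return best


# Toy stand-ins, not the fitted model: time falls with n and with faster storage.
SPEED = {"hdd": 1.0, "ssd": 2.0}


def toy_predict_time(w, n, r, h, u, g):
    return 10.0 + w / (n * SPEED[h])


def toy_cost(n, r, h, u, g):
    return n * r  # total memory across the cluster as a crude resource measure


best = min_resource_config(toy_predict_time, toy_cost, w=400, t_d=60, n_max=8,
                           r_opts=[4, 8], h_opts=["hdd", "ssd"],
                           u_opts=["shared"], g_opts=["x86"])
```

With these toy inputs the search settles on four SSD-backed machines with the smaller memory size. Returning `None` when nothing is feasible is a design choice of the sketch, not something the text prescribes.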

Further, the step (3) comprises the following steps:

(31) if T_c ≤ t_d, proceeding to step (4);

(32) if T_c > t_d, increasing by one the number of virtual machines in the cluster in the minimum total amount of resources to obtain a new minimum total amount of resources, and calculating the new task completion time T_c′ using the new minimum total amount of resources;

(33) if T_c′ > t_d, repeating step (32); if T_c′ ≤ t_d, recording the number of virtual machines in the cluster at that point and taking the new minimum total amount of resources at that point as the final minimum total amount of resources.
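The increment-and-repredict loop of steps (31) to (33) can be sketched as follows. The `predict_time` callable stands in for the prediction model and is hypothetical. The `n_max` cap and the exception raised when it is reached are assumptions added so the loop always terminates; the steps above do not say what happens in that case.

```python
def scale_out(predict_time, n_current, t_d, n_max):
    """Add virtual machines one at a time until the predicted time meets t_d."""
    n = n_current
    t_c = predict_time(n)
    while t_c > t_d:                      # step (33): still too slow, repeat
        if n >= n_max:                    # assumed safety cap, not from the text
            raise RuntimeError("deadline cannot be met within n_max virtual machines")
        n += 1                            # step (32): one more virtual machine
        t_c = predict_time(n)             # re-predict with the enlarged cluster
    return n, t_c


# Toy predictor (hypothetical): a fixed amount of work split evenly across n machines.
n_final, t_final = scale_out(lambda n: 3000.0 / n, n_current=2, t_d=700.0, n_max=10)
```

Here the predicted time drops from 1500 to 1000, 750 and finally 600 as the cluster grows from two to five machines, so the loop stops at five.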

According to another aspect of the present invention, there is provided a machine learning-based elastic resource expansion system, including:

a minimum total resource calculation module, configured to calculate, given the running deadline t_d and the task computation amount of the task to be run, the minimum total amount of resources required to complete the task using a regression model;

a completion time prediction module, configured to continuously collect the current running state and the resource utilization of the task while it runs, and to input the minimum total amount of resources, the current running state of the task, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task;

a completion time comparison module, configured to compare the completion time of the task with the running deadline: if T_c ≤ t_d, the task completion judgment module is executed; if T_c > t_d, the minimum total amount of resources for which the final predicted completion time T_c′ < t_d is calculated;

a task completion judgment module, configured to execute the completion time prediction module if the task is not completed, and to stop the collection if the task is completed;

a regression model training module, configured to calculate the correlation coefficients between the computing resources used to run historical tasks and the historical task completion times and then fit a regression equation to obtain the regression model;

a prediction model training module, configured to train on information about historical task runs to obtain the prediction model, the information comprising: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks.

In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:

(1) Compared with traditional prediction methods based on static models, the invention obtains the regression model by calculating the correlation coefficients between the computing resources used to run historical tasks and the historical task completion times and then fitting a regression equation. It then uses the regression model to calculate the minimum total amount of resources required to complete the task, so that the task can be completed on time, predicts the completion time while the task runs, and automatically performs elastic expansion of computing resources when the predicted completion time exceeds the running deadline. The invention thus achieves dynamic prediction and elastic resource management with high estimation accuracy.

(2) The invention establishes the functional relation between task completion time and required resources through the regression model, and thereby recommends a reasonable amount of cloud computing resources to tenants. Compared with existing methods, this modeling approach can be applied directly to existing cloud platforms. In addition, the invention does not need to request redundant computing resources to guarantee that the task completes on time.

(3) The invention provides a prediction model based on a multi-modal neural network for dynamically predicting the completion time of a task. Compared with existing methods, it monitors the running speed and resource utilization of the task in real time and expands the cluster promptly when the deadline cannot be met, which effectively mitigates the slowdown caused by performance fluctuation of virtual machines and ensures that the task is completed on time.

Drawings

FIG. 1 is a flowchart illustrating elastic resource scaling in a cloud environment according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for training and using a regression model according to an embodiment of the present invention;

FIG. 3(a) is a schematic diagram of the relationship between the number of virtual machines and the completion time when constructing the regression model according to the embodiment of the present invention;

FIG. 3(b) is a schematic diagram of the relationship between the calculated amount and the completion time when constructing the regression model according to the embodiment of the present invention;

FIG. 4 is a flow chart of a method of training and using a multi-modal neural network provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a multi-modal neural network provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of the actual effect of safeguarding the running of a MapReduce task, provided in Embodiment 1 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

The method monitors the running information of the task on the cloud platform in real time and, when the computing performance of the virtual machines on the cloud platform drops, automatically performs elastic expansion of computing resources to ensure that the computing task is still completed on time. Compared with traditional prediction methods based on static models, the method can reduce the cost to cloud tenants of renting virtual machines by up to 30.8%.

As shown in FIG. 1, a machine learning-based elastic resource expansion method includes the following steps:

(1) given the running deadline t_d and the task computation amount of the task to be run, calculating the minimum total amount of resources required to complete the task using a regression model;

The regression model is obtained by calculating the correlation coefficients between the computing resources G1 used to run historical tasks and the completion times of those tasks, and then fitting a regression equation. The computing resources used to run the historical tasks include: the total amount of memory r in the virtual machine, the type h of storage medium in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster.

The regression model is:

where t is the predicted task completion time, w is the task computation amount, and b₀, b₁, b₂, b₃, b₄, b₅ and b₆ denote the zeroth through sixth fitting parameters, respectively.

The zeroth through sixth fitting parameters are obtained by logistic regression fitting, such that the error of the regression model with these fitting parameters is minimized.
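To make concrete what choosing fitting parameters that minimize the model error involves, the sketch below fits a hypothetical two-parameter stand-in, t = b0 + b1 * (w / n), by ordinary least squares. The model form, the fitting method (ordinary least squares is swapped in for the logistic regression named in the text) and the data are all assumptions made for illustration; none of this is the seven-parameter model of the invention.

```python
def fit_least_squares(xs, ts):
    """Closed-form least-squares fit of t = b0 + b1 * x for a single regressor x."""
    m = len(xs)
    x_mean = sum(xs) / m
    t_mean = sum(ts) / m
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    b1 = s_xt / s_xx              # slope that minimizes the squared error
    b0 = t_mean - b1 * x_mean     # intercept follows from the sample means
    return b0, b1


# Hypothetical history of (computation amount w, virtual machines n, completion time t).
history = [(100, 4, 280.0), (200, 4, 530.0), (200, 8, 280.0),
           (400, 16, 280.0), (400, 8, 530.0)]
work_per_vm = [w / n for w, n, _ in history]
times = [t for _, _, t in history]
b0, b1 = fit_least_squares(work_per_vm, times)
```

Using w / n as the single regressor encodes the observation made later in the description that completion time grows with the computation amount and shrinks with the number of virtual machines.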

When the task to be run is a MapReduce task and the historical tasks are also MapReduce tasks, as shown in FIG. 2, the training and use of the regression model includes:

(1) The computing resources G1 used by tenants to run historical MapReduce tasks on the cloud platform are collected.

(2) The computing resources G1 should include the following features: the total amount of memory r in the virtual machine, the type h of storage medium in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster, i.e. G1 = <n, r, h, u, g>.

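(3) The Pearson correlation coefficient between each feature in G1 and the completion time is calculated. The calculation formula is as follows:

$$\rho_{X,t} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{t_i-\bar{t}}{\sigma_t}\right)$$

where n is the total number of samples collected in the training set, X_i is the value of a given feature of G1 in the i-th sample, t_i is the completion time corresponding to that feature, $(X_i-\bar{X})/\sigma_X$ is the standard score of the sample, $\bar{X}$ is the sample mean, and σ_X is the sample standard deviation; $\bar{t}$ and σ_t are defined in the same way for the completion time.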

(4) The linear relationships between the completion time and the features with higher Pearson correlation coefficients, w and n, are visualized.

(5) As shown in FIG. 3(a), w is directly proportional to the completion time, and as shown in FIG. 3(b), n is inversely proportional to the completion time. Then, assuming that the features h, u, r and g determine the computing power of a single VM, the regression equation t = f(w, n, h, u, r, g) is finally determined:

(6) Suitable fitting parameters are obtained through the logistic regression algorithm, so that the error of the regression model is minimized.

(7) The tenant provides a running deadline T, a task computation amount W, and the maximum number N of virtual machines allowed in the cluster.

(8) While n ≤ N, all combinations of <n, r, h, u, g> are traversed.

(9) If there are combinations such that t = f(w, n, h, u, r, g) < T, all such combinations are recorded.

(10) Let n = n − 1 and jump to step (8).

(11) If no combination satisfies t = f(w, n, h, u, r, g) < T, the combination with the least amount of resources among all recorded combinations is selected as the initial size of the cluster.
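The correlation screening of step (3) computes, for each feature, the averaged product of standard scores between that feature and the completion time. A minimal sketch using only the Python standard library follows; the function name and the sample data are hypothetical.

```python
from statistics import mean, stdev


def pearson(xs, ts):
    """Pearson coefficient as the averaged product of standard scores."""
    n = len(xs)
    x_bar, t_bar = mean(xs), mean(ts)
    s_x, s_t = stdev(xs), stdev(ts)   # sample standard deviations (n - 1 denominator)
    return sum(((x - x_bar) / s_x) * ((t - t_bar) / s_t)
               for x, t in zip(xs, ts)) / (n - 1)


# Hypothetical history: number of virtual machines and observed completion times.
vm_counts = [4, 8, 12, 16, 20]
times = [1500.0, 800.0, 560.0, 430.0, 360.0]
rho_n = pearson(vm_counts, times)     # negative: more machines, shorter runs
```

A coefficient close to +1 or −1 marks a feature as a strong linear predictor of completion time, which is the basis on which features are carried forward into the regression equation.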

As shown in FIG. 4, the method provides elastic resource expansion for a tenant running a MapReduce task, whose subtasks comprise Map tasks and Reduce tasks. The specific implementation steps are as follows:

(1) Information about tenants running MapReduce tasks on the cloud platform is collected, including the computing resources G1 used to run the task, the running log G2 of the task, and the resource utilization G3.

The running log G2 of the task should include the following features: the percentage M_p of Map tasks completed, the percentage R_p of Reduce tasks completed, the completion speed M_s of Map tasks, the completion speed R_s of Reduce tasks, the completion percentage T_p of the entire computation, and the time difference T_i between two completed tasks, i.e. G2 = <M_p, R_p, M_s, R_s, T_p, T_i>.

The resource utilization G3 should include the following features: the CPU utilization H_C of the head node, the operation load H_L of the head node, the memory usage H_M of the head node, the CPU utilization W_C of the compute nodes, the operation load W_L of the compute nodes, and the memory usage W_M of the compute nodes, i.e. G3 = <H_C, H_L, H_M, W_C, W_L, W_M>.

(2) As shown in FIG. 5, a multi-modal neural network is constructed, comprising a feature extraction layer, a feature fusion layer and a regression layer; an LSTM neural network serves as the feature extraction layer and a deep principal component autoencoder serves as the feature fusion layer.

The features in G2 and G3 are extracted by LSTM neural networks, giving G2_m = F2_LSTM(G2) and G3_m = F3_LSTM(G3). These, together with non-temporal features such as w and G1, are input to the deep principal component autoencoder, which fuses the features, reduces noise and reduces dimensionality to obtain a new feature vector G_f, G_f = F_e(w, G1, G2_m, G3_m).

The feature vector is fed to a regression layer based on minimum mean square error for regression training, and the loss function is as follows:

where C is the cost given by the loss function, y(·) denotes the observed actual value corresponding to G_f, and F_r(·) is the network to be trained. Finally, T_c is obtained as T_c = F_r(G_f).

Training finally yields the multi-modal neural network, i.e. the prediction model, which captures the relation between the time t required to complete the computing task, the task computation amount w, and the three groups of information, i.e. t = f(w, G1, G2, G3).

(3) The tenant provides the running deadline T and the task computation amount W.

(4) While the task is running, the current running state G2_c and the resource utilization G3_c of the task are continuously collected.

(5) The completion time t of the task is predicted by the multi-modal neural network.

(6) If t > T, let n = n + 1 in G1 and jump to step (5).

(7) Otherwise, the difference between the values of n before and after the update is calculated, and the corresponding number of virtual machines is added to the cluster.

(8) If the task is not completed, jump to step (4).

(9) If the task is completed, stop.
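Steps (4) to (9) form a monitoring loop around the prediction model. The sketch below shows one way that loop might be wired together. All four callables (`collect_state`, `predict_time`, `task_done`, `add_vms`) are hypothetical hooks standing in for the cloud platform and the trained network, and the `n_max` cap is an added assumption so the inner loop cannot run forever.

```python
def guard_deadline(collect_state, predict_time, task_done, add_vms, n, T, n_max):
    """Monitor a running task and enlarge the cluster when the deadline is at risk."""
    while not task_done():                                # steps (8) and (9)
        g2_c, g3_c = collect_state()                      # step (4)
        n_new = n
        # Steps (5) and (6): raise n until the predicted time fits (capped at n_max).
        while predict_time(n_new, g2_c, g3_c) > T and n_new < n_max:
            n_new += 1
        if n_new > n:                                     # step (7)
            add_vms(n_new - n)
            n = n_new
    return n


# Toy simulation: three monitoring samples, with a slowdown in the second one.
samples = iter([
    ({"elapsed": 0.0, "remaining": 180.0}, {"speed": 1.0}),
    ({"elapsed": 40.0, "remaining": 100.0}, {"speed": 0.5}),
    ({"elapsed": 70.0, "remaining": 40.0}, {"speed": 0.5}),
])
pending = [3]   # samples left before the simulated task finishes
added = []      # records every scale-out request


def collect_state():
    pending[0] -= 1
    return next(samples)


def toy_predict(n, g2, g3):
    # Hypothetical predictor: elapsed time plus remaining work over cluster speed.
    return g2["elapsed"] + g2["remaining"] / (n * g3["speed"])


final_n = guard_deadline(collect_state, toy_predict, lambda: pending[0] == 0,
                         added.append, n=2, T=100.0, n_max=10)
```

In the toy run the slowdown pushes the prediction from 90 to 140 time units, and the loop requests two extra machines so that the prediction returns to 90. A production loop would also wait between samples, for instance for one sampling period.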

Embodiment 1

To verify the feasibility and effectiveness of the method, it was tested in a real environment. The experimental preparation included establishing a cluster with a maximum of 40 virtual machines on the Alibaba Cloud platform. A total of 300 MapReduce tasks of different kinds and workloads were run, including WordCount, TeraSort and PageRank. Running state data and resource utilization data were collected with a sampling period of 5 seconds, finally yielding 30000 groups of data for constructing the multi-modal neural network.

Finally, to verify the effect of the system, a WordCount task with a computation amount of 400 GB and a completion time limit of 1700 seconds was submitted to a cluster with an initial size of 16 virtual machines. As shown in FIG. 6, AS-M and AS-R in the legend represent the progress of the Map process and the Reduce process of the MapReduce task when the present invention is used, and NAS-M and NAS-R represent the progress of the Map process and the Reduce process when the present invention is not used. When the task had run for 675 seconds, the running speed of the virtual machines dropped, and the elastic scaling system determined that an expansion of 24 virtual machines was needed to ensure that the task completed on time. At 975 seconds, the additional computing resources joined the cluster, after which the running speed of the task improved greatly. In the end, the method of the present invention allowed the task to complete at 1650 seconds, whereas without it the task completed at 1950 seconds. Embodiment 1 shows that a MapReduce task that would otherwise exceed its completion time limit can be completed on time by adding computing resources to the cluster.
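The timings reported for this experiment can be cross-checked with simple arithmetic. The variable names below are illustrative; the numbers are the ones quoted above.

```python
deadline = 1700            # completion time limit, seconds
finish_with = 1650         # completion time with elastic expansion
finish_without = 1950      # completion time without elastic expansion
decided_at = 675           # slowdown detected and expansion decided
joined_at = 975            # additional resources joined the cluster
initial_vms, max_vms = 16, 40

slack = deadline - finish_with               # 50 s of margin before the deadline
overrun = finish_without - deadline          # 250 s past the deadline without expansion
time_saved = finish_without - finish_with    # 300 s saved by expanding
join_gap = joined_at - decided_at            # 300 s between decision and resources joining
headroom = max_vms - initial_vms             # 24, equal to the reported expansion size
```

The 24-machine expansion therefore coincides with the full headroom between the initial cluster size and the 40-machine maximum.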

Experiments show that, compared with the traditional method, the machine learning-based elastic resource expansion mechanism can reduce the cost to cloud tenants of renting virtual machines by up to 30.8% while providing a reliable guarantee on task completion time. The method can substantially reduce the cost of using the public cloud and is of great value to users who frequently run MapReduce tasks on it.

It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A machine learning-based elastic resource expansion method, characterized by comprising the following steps:

the prediction model is obtained by training on information about historical task runs, the information comprising: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks;

the step (3) comprises the following steps:

(31) if T_c ≤ t_d, proceeding to step (4);

2. The machine learning-based elastic resource expansion method according to claim 1, wherein the computing resources used to run the historical tasks comprise: the total amount of memory r in the virtual machine, the type h of storage medium in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster.

3. The machine learning-based elastic resource expansion method according to claim 2, wherein the regression model is:

where t is the predicted task completion time, w is the task computation amount, and b₀, b₁, b₂, b₃, b₄, b₅ and b₆ denote the zeroth through sixth fitting parameters, respectively.

4. The machine learning-based elastic resource expansion method according to claim 3, wherein the zeroth through sixth fitting parameters are obtained by logistic regression fitting, such that the error of the regression model with these fitting parameters is minimized.

5. The machine learning-based elastic resource expansion method according to any one of claims 2 to 4, wherein the running log of the historical task comprises: the percentage of subtasks completed, the completion speed of the completed subtasks, the percentage of completion of the historical tasks, and the time difference between two completed subtasks.

6. The machine learning-based elastic resource expansion method according to any one of claims 2 to 4, wherein the resource utilization of the historical task comprises: the CPU utilization of the head node, the operation load of the head node, the memory usage of the head node, the CPU utilization of the compute nodes, the operation load of the compute nodes, and the memory usage of the compute nodes.

7. The machine learning-based elastic resource expansion method according to any one of claims 2 to 4, wherein the training of the prediction model comprises:

8. The machine learning-based elastic resource expansion method according to any one of claims 1 to 4, wherein the step (1) comprises the following steps:

(13) calculating the completion time of each combination using the regression model, and, among the combinations whose completion time is less than or equal to t_d, taking the one with the least total amount of resources as the minimum total amount of resources needed to complete the task.

9. A machine learning-based elastic resource expansion system, comprising:

a prediction model training module, configured to train on information about historical task runs to obtain a prediction model, the information comprising: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks;

the completion time prediction module includes:

if T_c ≤ t_d, executing the task completion judgment module;

if T_c > t_d, increasing by one the number of virtual machines in the cluster in the minimum total amount of resources to obtain a new minimum total amount of resources, and calculating the new task completion time T_c′ using the new minimum total amount of resources;

if T_c′ > t_d, repeating the previous step; if T_c′ ≤ t_d, recording the number of virtual machines in the cluster at that point and taking the new minimum total amount of resources at that point as the final minimum total amount of resources.