CN110209467B - Elastic resource expansion method and system based on machine learning - Google Patents

Elastic resource expansion method and system based on machine learning

Info

Publication number
CN110209467B
Authority
CN
China
Prior art keywords
task
running
resource
completion time
resources
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910437262.7A
Other languages
Chinese (zh)
Other versions
CN110209467A (en)
Inventor
刘方明
金海
李羿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology
Priority to CN201910437262.7A
Publication of CN110209467A
Application granted
Publication of CN110209467B
Current legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5061Partitioning or combining of resources
    • G06F9/5072Grid computing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/44Arrangements for executing specific programs
    • G06F9/455Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F9/45533Hypervisors; Virtual machine monitors
    • G06F9/45558Hypervisor-specific management and integration aspects
    • G06F2009/45595Network integration; Enabling network access in virtual machine instances

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a machine learning-based elastic resource expansion method and system, belonging to the fields of cloud computing and deep learning. The method includes: given the running deadline t_d and the task computation amount of a task to be run, calculating with a regression model the minimum total amount of resources required to complete the task; during the running of the task, continuously collecting the current running state and the resource utilization of the task, and inputting the minimum total amount of resources, the current running state, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task; if T_c > t_d, calculating the minimum total amount of resources for which the final task completion time satisfies T_c′ < t_d; continuing collection if the task is not completed, and stopping collection once it is completed. The method calculates the minimum total amount of resources with a regression model so that the task can be completed on time, predicts the completion time while the task runs, and automatically performs elastic expansion of computing resources when the predicted completion time exceeds the running deadline.

Description

Elastic resource expansion method and system based on machine learning
Technical Field
The invention belongs to the technical field of cloud computing and the field of deep learning, and particularly relates to a machine learning-based elastic resource expansion method and system.
Background
In the cloud computing service model, a tenant first informs the cloud service provider of the amount of cloud computing resources it needs, and the provider then allocates resources according to the tenant's request. In this model, the tenant must estimate the total amount of resources it needs based on its own business. However, because cloud tenants lack knowledge of how the provider's service infrastructure is implemented, they have difficulty estimating the amount of computing resources needed in the virtual environment of the cloud platform from prior experience of running the service locally. One solution available to tenants is therefore to first apply for a smaller amount of resources and then apply for more when the computing task cannot meet its deadline. A system that manages and performs such automatic capacity expansion is called an elastic resource management system.
However, the elastic resource management systems currently offered by cloud service providers are generally rule-based. The tenant must define the rules that trigger resource expansion, for example that CPU usage stays above a threshold for longer than a certain time. Formulating these rules is still difficult for the tenant, who must spend a long time tuning the thresholds in the expansion rules to achieve the desired effect.
For cloud computing tasks based on the MapReduce framework, elastic resource management is even more difficult. First, a threshold-based expansion strategy may fail: because MapReduce tasks are computationally intensive, the utilization of the virtual machines' computing resources stays close to 100% throughout the computation, and it is difficult for a user to select an appropriate threshold to trigger expansion. Second, a MapReduce computing task comprises several running phases (the Map process and the Reduce process), each with different resource requirements, and the phases influence one another. It is therefore difficult for the tenant to estimate how much time the task needs by analyzing its running process. Finally, even if the tenant can estimate the completion time from some task-related parameters, the accuracy of such an estimate may be low in a virtual environment: the computing performance of virtual machines fluctuates continuously in the cloud, and when that performance drops, the task execution speed is greatly affected, which degrades the estimate.
Therefore, the prior art has the technical problems of difficult elastic resource management and low estimation accuracy.
Disclosure of Invention
In view of the above defects or improvement needs of the prior art, the present invention provides a method and a system for elastic resource expansion based on machine learning, so as to solve the technical problems of difficult elastic resource management and low estimation accuracy in the prior art.
To achieve the above object, according to one aspect of the present invention, there is provided a machine learning-based elastic resource expansion method, including the following steps:
(1) given the running deadline t_d and the task computation amount of a task to be run, calculating the minimum total amount of resources required to complete the task by using a regression model;
(2) during the running of the task, continuously collecting the current running state and the resource utilization of the task, and inputting the minimum total amount of resources, the current running state of the task, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task;
(3) if T_c ≤ t_d, proceeding to step (4); if T_c > t_d, calculating the minimum total amount of resources for which the final task completion time satisfies T_c′ < t_d;
(4) if the task is not completed, returning to step (2); if the task is completed, stopping collection;
wherein the regression model is obtained by calculating correlation coefficients between the computing resources used to run historical tasks and the completion times of the historical tasks, and then fitting a regression equation;
and the prediction model is obtained by training on information related to historical task runs, the related information including: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks.
Further, the computing resources used to run the historical tasks include: the total amount of memory r in a virtual machine, the type of storage medium h in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster.
Further, the regression model is:
[Equation image BDA0002070337760000031: the regression model t = f(w, n, h, u, r, g); the formula is not reproduced in the extracted text]
where t is the predicted task completion time, w is the task computation amount, and b_0, b_1, b_2, b_3, b_4, b_5 and b_6 denote the zeroth through sixth fitting parameters, respectively.
Further, the zeroth through sixth fitting parameters are obtained by logistic regression fitting, such that the error of the regression model corresponding to these fitting parameters is minimized.
Further, the running log of the historical task comprises: the percentage of subtasks completed, the completion speed of the completed subtasks, the percentage of completion of the historical tasks, and the time difference between two completed subtasks.
Further, the resource utilization of the historical tasks includes: the CPU utilization of the head node, the load of the head node, the memory usage of the head node, the CPU utilization of the compute nodes, the load of the compute nodes, and the memory usage of the compute nodes.
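The monitored quantities listed above map naturally onto fixed-width records. The following is a minimal sketch only: the class names, field names and units are hypothetical (the patent names the quantities but not a data layout), and it assumes each sample is flattened to a numeric vector before being passed to a model.

```python
from dataclasses import dataclass, astuple


@dataclass
class RunLogSample:
    """One running-log sample (field names are illustrative assumptions)."""
    subtasks_done_pct: float   # percentage of subtasks completed
    subtask_speed: float       # completion speed of completed subtasks
    task_done_pct: float       # completion percentage of the whole task
    subtask_gap_s: float       # time difference between two completed subtasks


@dataclass
class UtilizationSample:
    """One resource-utilization sample (field names are illustrative assumptions)."""
    head_cpu: float      # CPU utilization of the head node
    head_load: float     # load of the head node
    head_mem: float      # memory usage of the head node
    worker_cpu: float    # CPU utilization of the compute nodes
    worker_load: float   # load of the compute nodes
    worker_mem: float    # memory usage of the compute nodes


def as_vector(sample) -> list:
    """Flatten a sample record into a list of floats, in field order."""
    return [float(x) for x in astuple(sample)]
```

A sequence of such vectors, one per sampling period, forms the time series from which temporal features can be extracted.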
Further, the training of the predictive model includes:
constructing a multi-modal neural network comprising a feature extraction layer, a feature fusion layer and a regression layer;
extracting features from the running logs of the historical tasks and the resource utilization of the historical tasks by using the feature extraction layer;
inputting the extracted features, the task computation amount and the computing resources used to run the historical tasks into the feature fusion layer, where they are successively fused, denoised and reduced in dimension to obtain a new feature vector;
and inputting the new feature vector into the regression layer for regression training, finally obtaining the prediction model.
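The three-stage structure above (feature extraction, feature fusion, regression) can be illustrated with a forward pass. This is a sketch under stated assumptions, not the patented network: it uses NumPy with random, untrained weights; the fusion stage is replaced by a single dense tanh layer as a stand-in for whatever fusion layer is actually used; the layer sizes and the numeric encoding of the static inputs are illustrative choices; and all function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_last_hidden(seq, W, U, b):
    """Run one LSTM layer over seq (T x d_in) and return the final hidden state."""
    hidden = U.shape[1]
    h, c = np.zeros(hidden), np.zeros(hidden)
    for x in seq:
        # Stacked pre-activations: input, forget, output gates and candidate.
        i, f, o, g = np.split(W @ x + U @ h + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def make_params(rng, d_log=4, d_util=6, d_static=6, hidden=8, fused=4):
    """Random, untrained weights; every size here is an illustrative assumption."""
    def lstm(d_in):
        return (rng.normal(0, 0.1, (4 * hidden, d_in)),
                rng.normal(0, 0.1, (4 * hidden, hidden)),
                np.zeros(4 * hidden))
    return {
        "lstm_log": lstm(d_log),
        "lstm_util": lstm(d_util),
        "fusion": (rng.normal(0, 0.1, (fused, d_static + 2 * hidden)), np.zeros(fused)),
        "regression": (rng.normal(0, 0.1, fused), 0.0),
    }


def predict_completion_time(log_seq, util_seq, static, params):
    """Extract temporal features, fuse them with static inputs, then regress."""
    f_log = lstm_last_hidden(log_seq, *params["lstm_log"])
    f_util = lstm_last_hidden(util_seq, *params["lstm_util"])
    W_f, b_f = params["fusion"]
    # Stand-in fusion layer: one dense tanh layer that reduces the dimension.
    fused = np.tanh(W_f @ np.concatenate([static, f_log, f_util]) + b_f)
    w_r, b_r = params["regression"]
    return float(w_r @ fused + b_r)


def mse_loss(predicted, observed):
    """Mean squared error between predicted and observed completion times."""
    predicted = np.asarray(predicted, float)
    observed = np.asarray(observed, float)
    return float(np.mean((predicted - observed) ** 2))
```

In this sketch `static` is assumed to hold the task computation amount followed by a numeric encoding of the cluster configuration; training (minimizing `mse_loss` over historical runs) is deliberately left out.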
Further, the step (1) comprises the following steps:
(11) given the running deadline t_d and the task computation amount of the task to be run, setting the maximum number n_max of virtual machines in the cluster;
(12) under the constraint that the number of virtual machines in the cluster is less than n_max, traversing all combinations of the total amount of memory in a virtual machine, the type of storage medium in the virtual machine, the resource sharing mode of the virtual machine, the CPU architecture of the virtual machine, and the number of virtual machines in the cluster;
(13) calculating the completion time of each combination by using the regression model, and, among the combinations whose completion time is less than or equal to t_d, taking the combination with the least total amount of resources as the minimum total amount of resources required to complete the task.
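Steps (11) to (13) amount to an exhaustive search over cluster configurations. A minimal sketch, assuming a caller-supplied `predict_time` function standing in for the regression model and a caller-supplied `cost` function standing in for "total amount of resources" (the text does not define how that total is measured); both names, and the option lists, are hypothetical.

```python
from itertools import product


def minimum_configuration(predict_time, cost, w, t_d, n_max,
                          r_options, h_options, u_options, g_options):
    """Return the cheapest (n, r, h, u, g) predicted to finish by t_d, or None."""
    feasible = []
    # Step (12): the VM count is kept strictly below n_max, as that step states.
    for n, r, h, u, g in product(range(1, n_max), r_options,
                                 h_options, u_options, g_options):
        # Step (13): keep combinations whose predicted time meets the deadline.
        if predict_time(w, n, r, h, u, g) <= t_d:
            feasible.append((n, r, h, u, g))
    if not feasible:
        return None
    return min(feasible, key=cost)
```

Returning `None` when nothing is feasible is a design choice of this sketch; the text does not say what happens in that case.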
Further, the step (3) comprises the following steps:
(31) if T_c ≤ t_d, proceeding to step (4);
(32) if T_c > t_d, increasing by one the number of virtual machines in the cluster specified in the minimum total amount of resources to obtain a new minimum total amount of resources, and calculating a new task completion time T_c′ by using the new minimum total amount of resources;
(33) if T_c′ > t_d, repeating step (32); if T_c′ ≤ t_d, recording the number of virtual machines in the cluster at that point, and taking the new minimum total amount of resources at that point as the final minimum total amount of resources.
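Steps (31) to (33) describe an incremental search on the number of virtual machines. A minimal sketch, assuming a hypothetical `predict_time(n)` callable that wraps the prediction model for the current task state; the upper bound `n_limit` is an added safeguard that the steps above do not mention.

```python
def expand_for_deadline(predict_time, n_current, t_d, n_limit):
    """Return (n, T_c') for the smallest n >= n_current with T_c' <= t_d."""
    n = n_current
    t_c = predict_time(n)
    while t_c > t_d:                      # step (33): repeat while the deadline is missed
        if n >= n_limit:
            raise RuntimeError("deadline cannot be met within the VM limit")
        n += 1                            # step (32): one more virtual machine
        t_c = predict_time(n)
    return n, t_c
```

If the first prediction already meets the deadline, the function returns the current cluster size unchanged, which corresponds to step (31).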
According to another aspect of the present invention, there is provided a machine learning-based elastic resource expansion system, including:
a minimum total resource calculation module, configured to, given the running deadline t_d and the task computation amount of a task to be run, calculate the minimum total amount of resources required to complete the task by using a regression model;
a completion time prediction module, configured to continuously collect the current running state and the resource utilization of the task during the running of the task, and to input the minimum total amount of resources, the current running state of the task, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task;
a completion time comparison module, configured to compare the completion time of the task with the running deadline: if T_c ≤ t_d, the task completion judgment module is executed; if T_c > t_d, the minimum total amount of resources for which the final task completion time satisfies T_c′ < t_d is calculated;
a task completion judgment module, configured to execute the completion time prediction module if the task is not completed, and to stop collection if the task is completed;
a regression model training module, configured to calculate correlation coefficients between the computing resources used to run historical tasks and the completion times of the historical tasks, and then fit a regression equation to obtain the regression model;
and a prediction model training module, configured to train on information related to historical task runs to obtain the prediction model, the related information including: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) Compared with traditional prediction methods based on a static model, the invention obtains the regression model by calculating correlation coefficients between the computing resources used to run historical tasks and the historical task completion times and then fitting a regression equation. It then uses the regression model to calculate the minimum total amount of resources required to complete the task so that the task can be completed on time, predicts the completion time while the task runs, and automatically performs elastic expansion of computing resources when the predicted completion time exceeds the running deadline. The invention thus achieves dynamic prediction and elastic resource management with high estimation accuracy.
(2) The invention establishes the functional relationship between task completion time and required resources through the regression model, so that a reasonable amount of cloud computing resources can be recommended to tenants. Compared with existing methods, this modeling approach can be applied directly to existing cloud platforms. In addition, the invention does not need to apply for redundant computing resources to guarantee that the task completes on time.
(3) The invention provides a prediction model based on a multi-modal neural network for dynamically predicting the completion time of a task. Compared with existing methods, it monitors the running speed and resource utilization of the task in real time and expands the cluster promptly when the completion time cannot be met, which effectively addresses the slowdown caused by performance fluctuations of virtual machines and ensures that the task completes on time.
Drawings
Fig. 1 is a flowchart illustrating elastic resource scaling in a cloud environment according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for training and using a regression model according to an embodiment of the present invention;
FIG. 3(a) is a schematic diagram of the relationship between the number of virtual machines and the completion time when constructing the regression model according to the embodiment of the present invention;
FIG. 3(b) is a schematic diagram of the relationship between the calculated amount and the completion time when constructing the regression model according to the embodiment of the present invention;
FIG. 4 is a flow chart of a method of training and using a multi-modal neural network provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a multi-modal neural network provided by an embodiment of the present invention;
fig. 6 is a schematic diagram of an actual effect of guaranteeing operation of the MapReduce task provided in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The method monitors the running information of the task on the cloud platform in real time and, when the computing performance of the virtual machines on the cloud platform drops, automatically performs elastic expansion of computing resources to ensure that the computing task is completed on time. Compared with traditional prediction methods based on a static model, the method can reduce cloud tenants' virtual machine rental costs by up to 30.8%.
As shown in fig. 1, a method for elastic resource expansion based on machine learning includes the following steps:
(1) given the running deadline t_d and the task computation amount of a task to be run, calculating the minimum total amount of resources required to complete the task by using a regression model;
(2) during the running of the task, continuously collecting the current running state and the resource utilization of the task, and inputting the minimum total amount of resources, the current running state of the task, the resource utilization and the task computation amount into a prediction model to obtain the predicted completion time T_c of the task;
(3) if T_c ≤ t_d, proceeding to step (4); if T_c > t_d, calculating the minimum total amount of resources for which the final task completion time satisfies T_c′ < t_d;
(4) if the task is not completed, returning to step (2); if the task is completed, stopping collection;
wherein the regression model is obtained by calculating correlation coefficients between the computing resources used to run historical tasks and the completion times of the historical tasks, and then fitting a regression equation;
and the prediction model is obtained by training on information related to historical task runs, the related information including: the computing resources used to run the historical tasks, the running logs of the historical tasks, and the resource utilization of the historical tasks.
The regression model is obtained by calculating correlation coefficients between the computing resources G1 used to run the historical tasks and the completion times of the historical tasks, and then fitting a regression equation. The computing resources used to run the historical tasks include: the total amount of memory r in a virtual machine, the type of storage medium h in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster.
The regression model is:
[Equation image BDA0002070337760000071: the regression model t = f(w, n, h, u, r, g); the formula is not reproduced in the extracted text]
where t is the predicted task completion time, w is the task computation amount, and b_0, b_1, b_2, b_3, b_4, b_5 and b_6 denote the zeroth through sixth fitting parameters, respectively.
The zeroth through sixth fitting parameters are obtained by logistic regression fitting, such that the error of the regression model corresponding to these fitting parameters is minimized.
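The idea of choosing fitting parameters that minimize the model error can be illustrated on a deliberately simplified model. The sketch below is not the patent's regression equation, which is not reproduced in the text: it assumes a two-parameter form t = b0 + b1 · (w / n) and swaps in ordinary closed-form least squares as the fitting procedure. The function name and the sample layout are hypothetical.

```python
def fit_w_over_n(samples):
    """Least-squares fit of t = b0 + b1 * (w / n) to (w, n, t) samples.

    The model form is an illustrative assumption, not the patent's equation.
    """
    xs = [w / n for w, n, _ in samples]
    ts = [t for _, _, t in samples]
    m = len(xs)
    x_mean = sum(xs) / m
    t_mean = sum(ts) / m
    # Closed-form simple linear regression: slope = S_xt / S_xx.
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xt = sum((x - x_mean) * (t - t_mean) for x, t in zip(xs, ts))
    b1 = s_xt / s_xx
    b0 = t_mean - b1 * x_mean
    return b0, b1
```

A fuller model would add terms for r, h, u and g; the fitting principle of minimizing the squared error over historical runs stays the same.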
When the task to be run is a MapReduce task and the historical tasks are historical MapReduce tasks, as shown in FIG. 2, training and using the regression model includes:
(1) Collect the computing resources G1 used by tenants to run historical MapReduce tasks on the cloud platform.
(2) The computing resources G1 should include the following features: the total amount of memory r in a virtual machine, the type of storage medium h in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster, i.e. G1 = <n, r, h, u, g>.
(3) Calculate the Pearson correlation coefficient between each feature in G1 and the completion time. The calculation formula is:
$$\rho_{X,t} = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{t_i-\bar{t}}{\sigma_t}\right)$$
where n is the total number of samples collected in the training set, X_i is a feature of G1, t_i is the completion time corresponding to that feature, (X_i − X̄)/σ_X is the standard score of the sample, X̄ is the sample mean, and σ_X is the sample standard deviation.
(4) Visualize the linear relationship between the completion time and the features with higher Pearson correlation coefficients, w and n.
(5) As shown in FIG. 3(a), w is directly proportional to the completion time, and as shown in FIG. 3(b), n is inversely proportional to the completion time. Then, assuming that the features h, u, r and g determine the computing power of a single VM, the regression equation t = f(w, n, h, u, r, g) is finally determined:
[Equation image BDA0002070337760000084: the regression equation t = f(w, n, h, u, r, g); the formula is not reproduced in the extracted text]
(6) Through the logistic regression algorithm, suitable fitting parameters are obtained so that the error of the regression model is minimized.
(7) The tenant provides a running deadline T, a task computation amount W, and the maximum number N of virtual machines allowed in the cluster.
(8) While n ≤ N, traverse all combinations of <n, r, h, u, g>.
(9) If there are combinations for which t = f(w, n, h, u, r, g) < T, record all of them.
(10) Let n = n − 1 and jump to step (8).
(11) If no combination satisfies t = f(w, n, h, u, r, g) < T, select the combination with the least amount of resources among all recorded combinations as the initial size of the cluster.
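The Pearson screening in step (3) of the procedure above can be sketched with the standard library alone. This computes the sample correlation as the averaged product of standard scores; the function name is hypothetical and the inputs are assumed to be equal-length numeric sequences with non-zero spread.

```python
from statistics import mean, stdev


def pearson(xs, ts):
    """Sample Pearson correlation between a feature column xs and times ts."""
    n = len(xs)
    x_bar, t_bar = mean(xs), mean(ts)
    s_x, s_t = stdev(xs), stdev(ts)   # sample standard deviations (n - 1 divisor)
    # Sum of products of standard scores, divided by n - 1.
    return sum(((x - x_bar) / s_x) * ((t - t_bar) / s_t)
               for x, t in zip(xs, ts)) / (n - 1)
```

A coefficient near +1 or −1 indicates a strong linear relationship between that feature and the completion time, which is what motivates inspecting w and n more closely in step (4).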
As shown in fig. 4, the method can provide a resource elastic expansion function for a tenant to run a MapReduce task, and subtasks of the MapReduce task include a Map task and a Reduce task, and the method includes the following specific implementation steps:
(1) Collect information related to tenants running MapReduce tasks on the cloud platform, including the computing resources used to run the task, the running log G2 of the task, and the resource utilization G3.
The running log G2 of the task should include the following features: the percentage of completed Map tasks M_p, the percentage of completed Reduce tasks R_p, the completion speed of Map tasks M_s, the completion speed of Reduce tasks R_s, the completion percentage of the entire computation T_p, and the time difference between two completed tasks T_i, i.e. G2 = <M_p, R_p, M_s, R_s, T_p, T_i>.
The resource utilization G3 should include the following features: the CPU utilization of the head node H_C, the load of the head node H_L, the memory usage of the head node H_M, the CPU utilization of the compute nodes W_C, the load of the compute nodes W_L, and the memory usage of the compute nodes W_M, i.e. G3 = <H_C, H_L, H_M, W_C, W_L, W_M>.
(2) As shown in FIG. 5, construct a multi-modal neural network comprising a feature extraction layer, a feature fusion layer and a regression layer, with an LSTM neural network as the feature extraction layer and a deep principal component autoencoder as the feature fusion layer.
The features in G2 and G3 are extracted by LSTM neural networks. The results G2_m = F2_LSTM(G2) and G3_m = F3_LSTM(G3), together with non-temporal features such as w and G1, serve as the input to the deep principal component autoencoder. The deep principal component autoencoder fuses, denoises and reduces the dimensionality of the features to obtain a new feature vector G_f, where G_f = F_e(w, G1, G2_m, G3_m).
The feature vector is fed to the regression layer, which is trained by regression based on the minimum mean square error. The loss function is:
[Equation image BDA0002070337760000091: the mean-square-error loss C between y(G_f) and F_r(G_f); the formula is not reproduced in the extracted text]
where C is the cost given by the loss function, y(·) denotes the observed actual value corresponding to G_f, and F_r(·) is the network to be trained. Finally, T_c is obtained as T_c = F_r(G_f).
Training finally yields the multi-modal neural network, i.e. the prediction model, which captures the relationship between the time t required to complete the computing task, the task computation amount w, and the three groups of information, i.e. t = f(w, G1, G2, G3).
(3) The tenant provides the running deadline T and the task computation amount W.
(4) While the task is running, continuously collect the current running state G2_c and the resource utilization G3_c of the task.
(5) Predict the completion time t of the task with the multi-modal neural network.
(6) If t > T, let n = n + 1 in G1 and jump to step (5).
(7) Otherwise, calculate the difference between the values of n before and after the update, and add the corresponding number of virtual machines to the cluster.
(8) If the task is not completed, jump to step (4).
(9) If the task is completed, stop.
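Steps (4) to (9) form a monitor, predict and scale loop. The sketch below assumes four hypothetical callables supplied by the caller (`collect_state`, `predict_time`, `add_vms`, `task_done`); none of these names come from the patent. The `max_n` cap is an added safeguard so the inner search always terminates, and the sampling delay between iterations is omitted.

```python
def run_elastic_loop(collect_state, predict_time, add_vms, task_done, n, T, max_n):
    """Monitor a running task and grow the cluster until it completes.

    Returns the final number of virtual machines.
    """
    while not task_done():                       # steps (8) and (9)
        state = collect_state()                  # step (4): current G2_c and G3_c
        n_new = n
        # Steps (5) and (6): raise n until the predicted time meets the deadline.
        while predict_time(state, n_new) > T and n_new < max_n:
            n_new += 1
        if n_new > n:                            # step (7): add the difference
            add_vms(n_new - n)
            n = n_new
    return n
```

Adding all missing virtual machines in one request, rather than one at a time, follows step (7), which computes the difference in n before and after the update.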
Example 1
To verify the feasibility and effectiveness of the method, it was evaluated in a real environment. The experimental preparation was as follows: a cluster with a maximum of 40 virtual machines was established on the Alibaba Cloud platform. A total of 300 MapReduce tasks of different kinds and workloads were run, including WordCount, TeraSort and PageRank. Running-state data and resource-utilization data were collected with a sampling period of 5 seconds, finally yielding 30000 groups of data for constructing the multi-modal neural network.
Finally, to verify the effect of the system, a WordCount task with a computation amount of 400 GB and a completion time limit of 1700 seconds was submitted to a cluster with an initial size of 16 virtual machines. As shown in FIG. 6, AS-M and AS-R in the legend represent the running of the Map process and the Reduce process of the MapReduce task when the present invention is used, and NAS-M and NAS-R represent the running of the Map process and the Reduce process when the present invention is not used. When the task had run for 675 seconds, the running speed of the virtual machines dropped, and the elastic scaling system determined that an expansion of 24 virtual machines was needed to ensure that the task completed on time. At 975 seconds, the additional computing resources were added to the cluster, after which the running speed of the task improved greatly. With the method of the present invention, the task completed at 1650 seconds, whereas without it the completion time was 1950 seconds. Embodiment 1 shows that a MapReduce task that would otherwise exceed its completion time limit can be completed on time by adding computing resources to the cluster.
Experiments show that, compared with the traditional method, the machine learning-based elastic resource expansion mechanism can reduce cloud tenants' virtual machine rental costs by up to 30.8% while providing a reliable guarantee on task completion time. The method can substantially reduce the costs of cloud tenants using public clouds, and is of great value to users who need to run MapReduce tasks frequently on public clouds.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (9)

1. A machine learning-based elastic resource expansion method, characterized by comprising the following steps:
(1) given the running deadline t_d and the task calculation amount of a task to be run, calculating the minimum total amount of resources required to complete the task by using a regression model;
(2) during the running of the task, continuously collecting the current running state and the resource utilization rate of the task, and inputting the minimum total amount of resources, the current running state of the task, the resource utilization rate and the task calculation amount into a prediction model to obtain the predicted completion time T_c of the task;
(3) if T_c ≤ t_d, proceeding to step (4); if T_c > t_d, calculating the minimum total amount of resources for which the final task completion time T_c′ < t_d;
(4) if the task is not finished, returning to step (2); if the task is finished, stopping the collection;
the regression model is obtained by fitting a regression equation after calculating a correlation coefficient between the computing resources used for running the historical tasks and the completion times of the historical tasks;
the prediction model is obtained by training relevant information of a running historical task, wherein the relevant information comprises: computing resources used for running the historical tasks, running logs of the historical tasks and resource utilization rates of the historical tasks;
the step (3) comprises the following steps:
(31) if T_c ≤ t_d, proceeding to step (4);
(32) if T_c > t_d, increasing by one the number of virtual machines in the cluster in the minimum total amount of resources to obtain a new minimum total amount of resources, and calculating a new task completion time T_c′ by using the new minimum total amount of resources;
(33) if T_c′ > t_d, repeating step (32); if T_c′ ≤ t_d, recording the number of virtual machines in the cluster at that moment and taking the new minimum total amount of resources at that moment as the final minimum total amount of resources.
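The monitoring loop of steps (2) to (4), together with the one-at-a-time virtual machine increment of steps (31) to (33), can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the `predict`, `collect_state` and `task_finished` callables, and the `max_vms` safety bound are all hypothetical and do not appear in the claim.

```python
from typing import Callable, Dict

# All names here are hypothetical; the claim does not prescribe an API.
State = Dict[str, float]
Predictor = Callable[[int, State, float], float]  # (n_vms, state, workload) -> T_c


def required_vms(n_vms: int, state: State, workload: float, deadline: float,
                 predict: Predictor, max_vms: int = 1000) -> int:
    """Steps (31)-(33): add one VM at a time until the prediction meets t_d."""
    t_c = predict(n_vms, state, workload)
    # max_vms is an assumed safety bound so the loop always terminates.
    while t_c > deadline and n_vms < max_vms:
        n_vms += 1
        t_c = predict(n_vms, state, workload)  # T_c' for the enlarged cluster
    return n_vms


def monitor(n_vms: int, workload: float, deadline: float, predict: Predictor,
            collect_state: Callable[[], State],
            task_finished: Callable[[], bool]) -> int:
    """Steps (2)-(4): re-predict while the task runs, scaling out when needed."""
    while not task_finished():
        state = collect_state()  # current running state and resource utilization
        n_vms = required_vms(n_vms, state, workload, deadline, predict)
    return n_vms
```

With a toy predictor such as `lambda n, s, w: w / (n * s["speed"])`, the loop grows the cluster only until the predicted completion time first drops to the deadline or below, which mirrors the stopping condition of step (33).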
2. The machine learning-based elastic resource extension method according to claim 1, wherein the computing resources used for running the historical task comprise: the total amount r of memory in the virtual machine, the type h of storage medium in the virtual machine, the resource sharing mode u of the virtual machine, the CPU architecture g of the virtual machine, and the number n of virtual machines in the cluster.
3. The machine learning-based elastic resource extension method according to claim 2, wherein the regression model is:
Figure FDA0002782145820000021
where t is the predicted task completion time, w is the task calculation amount, and b_0, b_1, b_2, b_3, b_4, b_5 and b_6 denote the zeroth, first, second, third, fourth, fifth and sixth fitting parameters, respectively.
4. The machine learning-based elastic resource extension method according to claim 3, wherein the zeroth, first, second, third, fourth, fifth and sixth fitting parameters are obtained by logistic regression fitting, and are the values for which the error of the regression model is minimal.
5. The machine learning-based elastic resource extension method according to any one of claims 2 to 4, wherein the running log of the historical task comprises: the percentage of subtasks completed, the completion speed of the completed subtasks, the percentage of completion of the historical tasks, and the time difference between two completed subtasks.
6. The machine learning-based elastic resource extension method according to any one of claims 2 to 4, wherein the resource utilization rate of the historical task comprises: the CPU utilization rate of the head node, the running load of the head node, the memory usage of the head node, the CPU utilization rate of the computing nodes, the running load of the computing nodes, and the memory usage of the computing nodes.
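As an illustration only, the monitored quantities listed in claims 5 and 6 could be grouped into two records and flattened into a single numeric input vector. The class names, field names and units below are hypothetical; the claims name the quantities but do not define a data layout.

```python
from dataclasses import astuple, dataclass
from typing import List


@dataclass
class RunningLog:
    """Progress quantities named in claim 5 (field names are illustrative)."""
    subtasks_completed_pct: float
    subtask_completion_speed: float
    task_completed_pct: float
    seconds_between_completions: float


@dataclass
class ResourceUtilization:
    """Utilization quantities named in claim 6 (field names are illustrative)."""
    head_cpu: float
    head_load: float
    head_memory: float
    compute_cpu: float
    compute_load: float
    compute_memory: float


def to_feature_vector(log: RunningLog, util: ResourceUtilization) -> List[float]:
    """Concatenate both records in declaration order into one flat vector."""
    return list(astuple(log)) + list(astuple(util))
```

Keeping the two records separate makes it straightforward to feed each one to its own feature extractor before the values are combined.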
7. The machine learning-based elastic resource extension method according to any one of claims 2-4, wherein the training of the predictive model comprises:
constructing a multi-modal neural network, which comprises a feature extraction layer, a feature fusion layer and a regression layer;
extracting features from the running logs of the historical tasks and the resource utilization rates of the historical tasks by using the feature extraction layer;
inputting the extracted features, the task calculation amount and the computing resources used for running the historical tasks into the feature fusion layer, where they are sequentially subjected to fusion, noise reduction and dimension reduction to obtain a new feature vector;
and inputting the new feature vector into the regression layer for regression training to finally obtain the prediction model.
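The three-layer structure of claim 7 can be pictured with a small forward-pass sketch. Everything here is an assumption made for illustration: the layer sizes, the ReLU activation, the use of plain concatenation for fusion, and the narrow fully connected layer standing in for noise reduction and dimension reduction are not specified by the claim. The weights are random and untrained, so the output only demonstrates the data flow.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]
Layer = Tuple[List[Vector], Vector]  # (weight rows, biases)


def dense(x: Sequence[float], weights: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer with ReLU: one output per weight row."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def init(n_out: int, n_in: int, rng: random.Random) -> Layer:
    """Random, untrained parameters used only to show shapes."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class MultiModalNet:
    """Hypothetical layout: two extractors, one fusion layer, one regression output."""

    def __init__(self, log_dim: int, util_dim: int, static_dim: int,
                 hidden: int = 4, fused: int = 3, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.log_layer = init(hidden, log_dim, rng)        # feature extraction (logs)
        self.util_layer = init(hidden, util_dim, rng)      # feature extraction (utilization)
        self.fusion_layer = init(fused, 2 * hidden + static_dim, rng)
        self.regression_layer = init(1, fused, rng)

    def forward(self, log_features: Sequence[float],
                util_features: Sequence[float],
                static_features: Sequence[float]) -> float:
        log_repr = dense(log_features, *self.log_layer)
        util_repr = dense(util_features, *self.util_layer)
        # Fusion by concatenation with the workload and resource description.
        fused_in = log_repr + util_repr + list(static_features)
        # Projection to a shorter vector stands in for denoising and dimension reduction.
        fused = dense(fused_in, *self.fusion_layer)
        weights, bias = self.regression_layer
        return sum(w * v for w, v in zip(weights[0], fused)) + bias[0]
```

In this sketch the static input holds the task calculation amount and a numeric encoding of the computing resources; how categorical items such as storage type or CPU architecture would be encoded is left open.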
8. The method for machine learning-based elastic resource expansion according to any one of claims 1-4, wherein the step (1) comprises the steps of:
(11) given the running deadline t_d and the task calculation amount of the task to be run, setting the maximum number n_max of virtual machines in the cluster;
(12) under the constraint that the number of virtual machines in the cluster is less than n_max, traversing all combinations of the total amount of memory in the virtual machine, the type of storage medium in the virtual machine, the resource sharing mode of the virtual machine, the CPU architecture of the virtual machine and the number of virtual machines in the cluster;
(13) calculating the completion time of each combination by using the regression model, and taking the combination whose completion time is less than or equal to t_d and whose total amount of resources is the least as the minimum total amount of resources required to complete the task.
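Steps (11) to (13) amount to an exhaustive search over resource combinations, which might be sketched as below. The regression model is passed in as a callable because its equation is not reproduced here, and the `cost` function that ranks combinations by "total amount of resources" is a hypothetical stand-in, since the claim does not say how different resource types are totalled.

```python
from itertools import product
from typing import Callable, Optional, Sequence, Tuple

# (r, h, u, g, n) in the notation of claim 2.
Config = Tuple[float, str, str, str, int]


def minimum_resources(workload: float, deadline: float, n_max: int,
                      memory_options: Sequence[float],
                      storage_options: Sequence[str],
                      sharing_options: Sequence[str],
                      cpu_options: Sequence[str],
                      regression: Callable[[float, Config], float],
                      cost: Callable[[Config], float]) -> Optional[Config]:
    """Return the cheapest combination predicted to finish by the deadline."""
    best: Optional[Config] = None
    best_cost = float("inf")
    # range(1, n_max) enforces the constraint n < n_max from step (12).
    for config in product(memory_options, storage_options, sharing_options,
                          cpu_options, range(1, n_max)):
        if regression(workload, config) <= deadline:  # step (13) feasibility
            c = cost(config)
            if c < best_cost:  # ties keep the first combination found
                best, best_cost = config, c
    return best  # None when no combination meets the deadline
```

The function returns `None` when no combination under `n_max` is predicted to meet the deadline, a case the claim does not address; a caller would then need to raise `n_max` or relax the deadline.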
9. A machine learning based elastic resource extension system, comprising:
a minimum resource total amount calculation module, configured to, given the running deadline t_d and the task calculation amount of a task to be run, calculate the minimum total amount of resources required to complete the task by using a regression model;
a completion time prediction module, configured to continuously collect the current running state and the resource utilization rate of the task during the running of the task, and to input the minimum total amount of resources, the current running state of the task, the resource utilization rate and the task calculation amount into a prediction model to obtain the predicted completion time T_c of the task;
a completion time comparison module, configured to compare the completion time of the task with the running deadline: if T_c ≤ t_d, executing the task completion judgment module; if T_c > t_d, calculating the minimum total amount of resources for which the final task completion time T_c′ < t_d;
a task completion judgment module, configured to execute the completion time prediction module if the task is not finished, and to stop the collection if the task is finished;
a regression model training module, configured to fit a regression equation after calculating a correlation coefficient between the computing resources used for running the historical tasks and the completion times of the historical tasks, to obtain the regression model;
a prediction model training module, configured to train on relevant information of historical tasks that have been run to obtain the prediction model, wherein the relevant information comprises: computing resources used for running the historical tasks, running logs of the historical tasks and resource utilization rates of the historical tasks;
the completion time prediction module includes:
if T_c ≤ t_d, executing the task completion judgment module;
if T_c > t_d, increasing by one the number of virtual machines in the cluster in the minimum total amount of resources to obtain a new minimum total amount of resources, and calculating a new task completion time T_c′ by using the new minimum total amount of resources;
if T_c′ > t_d, repeating the previous step; if T_c′ ≤ t_d, recording the number of virtual machines in the cluster at that moment and taking the new minimum total amount of resources at that moment as the final minimum total amount of resources.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910437262.7A CN110209467B (en) 2019-05-23 2019-05-23 Elastic resource expansion method and system based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910437262.7A CN110209467B (en) 2019-05-23 2019-05-23 Elastic resource expansion method and system based on machine learning

Publications (2)

Publication Number Publication Date
CN110209467A CN110209467A (en) 2019-09-06
CN110209467B true CN110209467B (en) 2021-02-05

Family

ID=67788551

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910437262.7A Active CN110209467B (en) 2019-05-23 2019-05-23 Elastic resource expansion method and system based on machine learning

Country Status (1)

Country Link
CN (1) CN110209467B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102320317B1 (en) * 2019-11-11 2021-11-02 한국전자기술연구원 Method for selecting predict-based migration candidate and target on cloud edge
CN111625352A (en) * 2020-05-18 2020-09-04 杭州数澜科技有限公司 Scheduling method, device and storage medium
CN111749675A (en) * 2020-05-25 2020-10-09 中国地质大学(武汉) Stratum drillability prediction method and system based on cascade model algorithm
CN112346860B (en) * 2020-10-27 2022-02-08 四川长虹电器股份有限公司 Method and system for elastically deploying service based on machine learning
CN117290083A (en) * 2022-06-20 2023-12-26 华为云计算技术有限公司 Resource adjustment method and device, computing device cluster and readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108009023A (en) * 2017-11-29 2018-05-08 武汉理工大学 Method for scheduling task based on BP neural network time prediction in mixed cloud
CN108958919A (en) * 2018-07-13 2018-12-07 湘潭大学 More DAG task schedule expense fairness assessment models of limited constraint in a kind of cloud computing
CN109766181A (en) * 2018-12-06 2019-05-17 北京航空航天大学 A kind of RMS schedulability determination method and device based on deep learning

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2012027478A1 (en) * 2010-08-24 2012-03-01 Jay Moorthi Method and apparatus for clearing cloud compute demand
CN105071983B (en) * 2015-07-16 2017-02-01 清华大学 Abnormal load detection method for cloud calculation on-line business
CN106961351A (en) * 2017-03-03 2017-07-18 南京邮电大学 Intelligent elastic telescopic method based on Docker container clusters
CN107193652B (en) * 2017-04-27 2019-11-12 华中科技大学 The flexible resource dispatching method and system of flow data processing system in container cloud environment
CN107995039B (en) * 2017-12-07 2020-11-03 福州大学 Resource self-learning and self-adaptive distribution method for cloud software service

Also Published As

Publication number Publication date
CN110209467A (en) 2019-09-06

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant