CN113032367A - Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system - Google Patents

Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Info

Publication number
CN113032367A
Authority
CN
China
Prior art keywords
configuration, workload, parameter, performance, model
Prior art date
Legal status: Pending (assumed; Google has not performed a legal analysis)
Application number
CN202110313931.7A
Other languages
Chinese (zh)
Inventor
窦晖 (Dou Hui)
贾成成 (Jia Chengcheng)
张以文 (Zhang Yiwen)
Current Assignee
Anhui University
Original Assignee
Anhui University
Application filed by Anhui University
Priority to CN202110313931.7A
Publication of CN113032367A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/21 - Design, administration or maintenance of databases
    • G06F16/217 - Database tuning
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 - Techniques for rebalancing the load in a distributed system
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 - Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method and system for collaborative tuning of the cross-layer configuration parameters of a big data system, suitable for dynamic workload scenarios.

Description

Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
Technical Field
The invention belongs to the technical field of application software performance optimization, and in particular relates to a cross-layer configuration parameter collaborative tuning method and system for big data systems that is suited to dynamic workload scenarios.
Background
With the explosive growth of network services such as social networking, instant messaging, and e-commerce, internet and mobile network users generate enormous amounts of data every day, and the big data era has arrived. Given the '4V' characteristics of big data, processing and analysis are necessary to extract valuable information from massive and variable data. As the underlying technology for big data processing and analysis, big data systems are therefore widely deployed to store data, schedule computing resources, and process and analyze data. A big data system generally consists of application software at the following three layers:
Data storage layer: responsible for the persistent storage of the data to be processed, intermediate data produced during processing, and result data after processing. For example, HDFS, as storage layer software, stores the data required by and generated by the processing layer software.
Resource scheduling layer: responsible for allocating hardware resources to specific data processing tasks according to a scheduling policy. For example, Yarn, as resource scheduling layer software, allocates hardware resources to the data processing layer software that executes data processing tasks.
Data processing layer: responsible for executing the specific data processing tasks. For example, Spark can efficiently process large-scale data using the machine resources allocated by Yarn.
HDFS, Yarn, and Spark are popular general-purpose software in the big data ecosystem. Each component is responsible for only one link in the chain of data storage, resource scheduling, and data processing, so a single big data processing task usually requires the cooperation of multiple cross-layer software components. As shown in Fig. 1, the data stored in HDFS is transmitted to Spark for processing and analysis, Yarn allocates enough machine resources to Spark to ensure the task completes normally, and the data processed and analyzed by Spark is persisted back to HDFS.
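To make the cross-layer parameter space concrete, a few representative configuration keys from each layer are shown below. The parameter names are real HDFS/Yarn/Spark keys, but the values are illustrative defaults, not tuned values from the invention:

```properties
# Data storage layer (HDFS: hdfs-site.xml / core-site.xml)
dfs.blocksize=134217728
dfs.replication=3
io.file.buffer.size=4096

# Resource scheduling layer (Yarn: yarn-site.xml)
yarn.nodemanager.resource.memory-mb=8192
yarn.scheduler.maximum-allocation-mb=8192

# Data processing layer (Spark: spark-defaults.conf)
spark.executor.memory=1g
spark.executor.cores=1
spark.sql.shuffle.partitions=200
```

Tuning any one of these in isolation can simply shift the bottleneck to another layer, which is why the invention tunes them jointly.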
To accomplish big data processing tasks arising from different demand scenarios, the load faced by a big data system typically changes dynamically. Big data software in the data storage, resource scheduling, and data processing layers therefore usually exposes a large number of modifiable configuration parameters to adapt to the differing performance demands of different processing tasks. By adjusting the performance-related parameters appropriately, the performance of the big data system under different load scenarios can be optimized. To date, academia and industry have optimized the configuration parameters of big data system software in two main ways: (1) manually tuning the configuration parameters based on expert experience and test results; because different software in a big data system has different configuration parameters and the relationship between configuration parameters and performance is complex, manually searching for the optimal configuration is time-consuming and does not generalize; (2) to overcome the drawbacks of manual tuning, researchers have turned to model-based methods for automated tuning: performance indicators are collected for different configurations of one specific big data software component under a specific load, a model between configuration parameters and performance indicators is built, and a search algorithm then finds the optimal configuration according to the performance model. However, existing approaches are generally applicable only to one specific software component operating under one specific load.
A deep-learning-based method and system for tuning the configuration parameters of a big data system is disclosed in application No. CN201710361578.3. The method comprises: a neural network training step, in which a deep neural network is first constructed, with at least one MapReduce parameter as input, the optimal configuration parameters to be predicted as output, and historical data of the big data system as the training sample set; the MapReduce completion time then serves as the metric for the network, and the weights of each layer of neurons are adjusted by a back-propagation learning rule until the completion time meets the time cost requirement; and a configuration parameter prediction step, in which an initial value of at least one MapReduce parameter is set, current test data is read, and the data is fed into the trained network to obtain the configuration parameters. This method optimizes the configuration parameters of the MapReduce framework through a deep neural network, avoids manual adjustment, and predicts parameters with good practical effect. However, it tunes the configuration parameters of a big data system with a single model and is not suitable for collaborative cross-layer configuration parameter tuning of a big data system under dynamic load.
In real big data processing scenarios, different processing tasks require different configuration parameters to achieve optimal performance, and the configuration parameters of the software in the data storage, resource scheduling, and data processing layers interact in complex ways; adjusting the configuration parameters of a single software component cannot optimize the performance of a big data processing task. Collaborative cross-layer tuning of the big data system software, oriented to dynamic load scenarios, is therefore essential for optimizing the performance of big data processing tasks.
However, solving this problem mainly faces the following challenges:
Ultra-high-dimensional configuration parameter search space: optimizing the configuration parameters of a single software component is inherently limited and cannot optimize the performance of a whole big data processing task. Taking HDFS, Yarn, and Spark as an example, Spark alone has more than one hundred configuration parameters, and the other components are similar; tuning the parameters of all three in combination yields an ultra-high-dimensional parameter space. The configuration space is therefore enormous: direct modeling over it requires a very large data set, and executing configuration tests takes too long. Directly tuning all parameters together is thus infeasible in practice.
Long execution time of a single configuration test: a big data software system has a complex execution flow. Testing the performance of a workload under a given configuration requires first loading the configuration parameters, then executing the workload under that configuration scheme, and finally collecting the performance indicators, which is a very time-consuming process. In cross-layer multi-software collaborative tuning, the search space is large and the configuration feature dimensionality is high; if every candidate configuration were loaded into the software system and executed under a specific workload, the time cost would be even greater, since many configuration schemes must be tested while searching for the optimal parameters. Testing each configuration directly in a real system is therefore infeasible for multi-software collaborative tuning.
If fully online tuning were adopted, the actual production workload would have to be executed while the optimal configuration parameters are sought during its execution; given the long execution time of a single configuration, searching for the optimal configuration in a fully online fashion would also consume a large amount of time.
Dynamically changing workloads: big data processing systems typically face changing workload types in actual deployment. Production workloads are not constant; there are CPU-intensive workloads and IO-intensive workloads. Different workloads have different resource preferences: some need large amounts of storage resources, others require heavy CPU processing. A configuration scheme that is optimal for one type of workload across the whole big data processing framework may perform poorly on other types. No optimal configuration is universal across workload types, and the differences between workloads must be taken into account.
Disclosure of Invention
The technical problem to be solved by the invention is the absence in the prior art of a cross-layer configuration parameter collaborative tuning method for big data systems oriented to dynamic load scenarios.
The invention solves the technical problems through the following technical means:
a dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method for a big data system comprises the following steps:
S1: rank the parameters of each software component in the data storage layer, resource scheduling layer, and data processing layer by importance, and select parameters of a preset dimensionality; assign random values to the selected parameters to generate multiple groups of configurations;
S2: execute the groups of configurations in the big data system to generate a performance label for each specific configuration, finally obtaining a configuration performance matrix;
S3: build a performance model from the configuration performance matrix using a random forest, and store the target performance model for model migration to new workloads;
S4: substitute the target performance model into a genetic algorithm, and use the genetic algorithm to find the set of configuration parameters under which the software of the data storage, resource scheduling, and data processing layers cooperates to yield the best overall system performance for the workload;
S5: when a new workload arrives, compute the similarity between the current new workload and each original workload; according to the result, one of two cases applies:
(1) if the maximum similarity exceeds the threshold, the new workload is considered similar to that original workload; the performance model of the original workload is migrated to the current new workload through a compensation mechanism, and a genetic algorithm then searches that model for the optimal configuration of the new workload;
(2) if no similarity reaches the threshold, the current new workload is considered an unrecorded new type of load; Bayesian optimization is used to select values for the preset-dimensionality parameters online and find a satisfactory configuration under which the new load can execute, while steps S1-S4 are performed for the new load; once the optimal configuration is found, execution switches to it.
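The two-branch dispatch in S5 can be sketched as follows. This is a minimal Python sketch; the returned action tags are illustrative, standing in for the migration procedure of case (1) and the Bayesian-optimization quick start of case (2):

```python
SIM_THRESHOLD = 0.75  # similarity threshold used by the invention

def dispatch(new_workload_sims):
    """Decide how to handle a new workload given its cosine similarity to
    each recorded workload type (a dict: workload name -> similarity)."""
    best_type, best_sim = max(new_workload_sims.items(), key=lambda kv: kv[1])
    if best_sim > SIM_THRESHOLD:
        # Case (1): migrate the recorded workload's performance model.
        return ("migrate", best_type)
    # Case (2): unrecorded load type -> Bayesian-optimization quick start,
    # while steps S1-S4 build a fresh model offline.
    return ("quick_start", None)
```
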
According to the method, 20-dimensional parameters are extracted through importance ranking, which reduces the dimensionality of the configuration parameters and increases the interpretability of the model. The invention uses an ensemble learning method (the random forest algorithm) to build a performance model, taking the configuration parameters of each software component as input and the performance indicator of the workload as output. Note that execution time is not the only possible model output; system throughput or software response time can equally serve as the output. Once the performance model for a specific workload is built, inputting a set of configuration parameters directly yields the performance indicator of the multi-layer big data software for that workload under that configuration, avoiding the large time cost of actually executing each configuration scheme in the multi-layer software to test its performance.
The invention reduces the processing time for new workloads with a performance model migration scheme based on the similarity between workloads. The similarity value between workloads is compared with a pre-established threshold; when it exceeds the threshold, the workloads are considered similar, and a performance model suited to the new workload is obtained by a simple migration of the existing model, avoiding the time cost of retraining. When the threshold is not exceeded, an online search quickly finds a configuration scheme suited to the new workload while an offline performance model is built; once the performance model is ready, the optimal configuration is searched for on that model.
Further, the step S1 includes:
S11: select the most important 20-dimensional parameters using the Lasso algorithm, whose minimization objective is:

F_Lasso = ||y - Xw||_2^2 + α||w||_1

Important characteristic parameters are extracted by adding an L1 penalty term to the least squares objective; w holds the coefficient of each characteristic parameter, and the larger |w| is, the more important the parameter. When the coefficient w of a characteristic parameter is 0, that parameter is discarded. In this way the 20 dimensions with the greatest influence on the model output are extracted;
s12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated; if the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]

where d_p is the default value of parameter p and x is a fixed scaling factor, set to 10 in the invention.
If the configuration parameter takes Boolean or enumeration values, the parameter values are encoded with One-Hot encoding and a value is then selected at random from the value range.
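Steps S11-S12 can be sketched with a minimal NumPy implementation. ISTA (iterative soft-thresholding) is used here as a simple Lasso solver, and the data are synthetic; both are stand-ins for whatever solver and measurements an actual deployment would use:

```python
import numpy as np

def lasso_ista(X, y, alpha=0.1, n_iter=1000):
    """Minimize ||y - Xw||_2^2 + alpha * ||w||_1 by iterative soft-thresholding."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)              # gradient of the squared loss
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)  # soft threshold
    return w

def top_k_params(w, k=20):
    """Indices of the k parameters with the largest |coefficient| (importance ranking)."""
    return np.argsort(-np.abs(w))[:k]

def sample_numeric(default, x=10, rng=None):
    """Random value for a numeric parameter in range = [d_p / x, d_p * x]."""
    rng = rng or np.random.default_rng()
    return rng.uniform(default / x, default * x)
```

Boolean and enumeration parameters would additionally be One-Hot encoded before sampling, as described above.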
Further, the step S2 includes:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
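Steps S21-S23 amount to: deploy each configuration, run the workload under it, record the performance label, and stack the results. A minimal sketch, in which the `run_workload` callable is a hypothetical stand-in for deploying the parameter values to HDFS/Yarn/Spark and timing the job:

```python
import numpy as np

def build_config_performance_matrix(configs, run_workload):
    """configs: (m, d) array of m configurations over d selected parameters.
    run_workload: callable mapping one configuration to a performance label
    (e.g. execution time). Returns the (m, d+1) configuration-performance
    matrix with the label appended as the last column."""
    labels = np.array([run_workload(c) for c in configs])
    return np.column_stack([configs, labels])
```
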
Further, the step S3 includes:
S31: divide the configuration performance matrix into three parts in a 60% : 20% : 20% ratio, used respectively as the training, validation, and test data sets for random forest performance modeling;
S32: for the data in the training set, use a Bootstrap strategy to draw n sample sets of the same dimensionality; the n sample sets contain the same number of samples, though not necessarily the same samples, and each sample has configuration parameter features of the same dimensionality, though not necessarily the same features;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
S34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average is taken as the final output of the model:

P = (1/n) · (pt_1 + pt_2 + … + pt_n)
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
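Steps S32-S34 (bootstrap sampling, one regression tree per sample set, averaged output) can be illustrated with a deliberately small stand-in: an ensemble of depth-1 regression trees (stumps) trained on bootstrap samples. A real implementation would grow full regression trees, but the bagging-and-average structure is the same:

```python
import numpy as np

def fit_stump(X, y):
    """Best single-split regression tree: (feature, threshold, left_mean, right_mean)."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t, left.mean(), right.mean())
    return best

def fit_forest(X, y, n_trees=25, rng=None):
    """Bagging: one stump per bootstrap sample (step S32-S33)."""
    rng = rng or np.random.default_rng(0)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        trees.append(fit_stump(X[idx], y[idx]))
    return trees

def predict(trees, x):
    """Average the n tree outputs {pt_1, ..., pt_n} (step S34)."""
    outs = [(lm if x[j] <= t else rm) for j, t, lm, rm in trees]
    return float(np.mean(outs))
```
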
Further, the step S4 includes:
S41: determine the role of each quantity in the genetic algorithm: the execution time produced by a specific configuration serves as the fitness value, a set of configuration parameters serves as a chromosome, and individual parameter values serve as genes. Extract n groups of parameter configurations from the test set as the initial population of the genetic algorithm:

{C_1, C_2, …, C_n}

where C_1, C_2, …, C_n each denote a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
S43: feed the parameter configurations {C_1, C_2, …, C_n} into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness values corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:

{C_1', C_2', …, C_n'}
S44: substitute the newly generated group of configuration parameters into the performance model, compute the execution time corresponding to each group again, and repeat steps S42-S44 until the optimal configuration is found.
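Steps S41-S44 can be sketched as a minimal real-coded genetic algorithm searching a surrogate model. In this sketch a simple quadratic function stands in for the random-forest performance model, and lower fitness (shorter predicted execution time) is better; the operator choices (uniform crossover, Gaussian mutation, truncation selection) are illustrative, not the invention's exact operators:

```python
import numpy as np

def ga_search(surrogate, dim, pop_size=30, n_gen=60, bounds=(0.0, 1.0), rng=None):
    """surrogate: callable mapping a configuration vector to predicted
    execution time. Returns the best configuration found."""
    rng = rng or np.random.default_rng(0)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, dim))     # initial population {C_1..C_n}
    for _ in range(n_gen):
        fit = np.array([surrogate(c) for c in pop])     # fitness values {P_1..P_n}
        parents = pop[np.argsort(fit)[: pop_size // 2]] # keep the better half (elitism)
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.integers(len(parents), size=2)
            a, b = parents[i], parents[j]
            mask = rng.random(dim) < 0.5                # uniform crossover
            child = np.where(mask, a, b) + rng.normal(0.0, 0.05, size=dim)  # mutation
            children.append(np.clip(child, lo, hi))
        pop = np.vstack([parents, np.array(children)])  # next generation {C'_1..C'_n}
    fit = np.array([surrogate(c) for c in pop])
    return pop[np.argmin(fit)]
```
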
Further, the step S5 includes:
S51: when a new task arrives, execute it once under a set of initial configurations, and form a vector Vec_i from the system performance indicators collected during this execution;
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
where Vec_i is the vector of system performance indicators collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j. The highest cosine similarity between the new workload and any previous workload is compared with 0.75, which is taken as the similarity threshold;
S53: if the highest cosine similarity exceeds 0.75, the two workloads are considered similar. Suppose that, after the cosine similarity between the newly arrived workload i and all recorded workload types has been computed, the similarity with workload type j is the largest and exceeds 0.75. The optimal configuration C previously found for workload j is then executed under workload i in the system to collect the execution time ET_i; the execution time of workload j under configuration C is ET_j. The difference between the two is used as a compensation to apply the performance model of workload j to workload i, i.e., the predicted execution time of workload i under a configuration C is T_i(C) = T_j(C) + (ET_i - ET_j). A genetic algorithm then searches this compensated model for the optimal configuration;
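S51-S53 amount to a cosine similarity between performance-indicator vectors plus a constant compensation offset. A minimal sketch, assuming the compensated prediction takes the form T_i(C) = T_j(C) + (ET_i - ET_j):

```python
import numpy as np

def cosine_similarity(vec_i, vec_j):
    """sim(i, j) = (Vec_i . Vec_j) / (||Vec_i|| * ||Vec_j||)."""
    return float(np.dot(vec_i, vec_j) /
                 (np.linalg.norm(vec_i) * np.linalg.norm(vec_j)))

def migrated_prediction(model_j, config, et_i, et_j):
    """Apply workload j's performance model to workload i, corrected by the
    constant offset ET_i - ET_j measured at a shared reference configuration."""
    return model_j(config) + (et_i - et_j)
```
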
regarding the workload which does not reach the similarity threshold value, the workload is considered as an unrecorded new type load; aiming at the situation, the invention provides a Bayesian optimization-based quick start scheme, which specifically comprises the following steps:
and (3) selecting the 20-dimensional parameters on line through Bayesian optimization, quickly finding a good enough configuration scheme to execute the workload, executing the steps S1-S4 on the workload, and executing the optimal configuration when the optimal configuration is found.
The invention also provides a dynamic load scene-oriented big data system cross-layer configuration parameter collaborative tuning system, which comprises:
the software configuration parameter acquisition module is used for sorting the importance of the parameters of each software in the data storage layer, the resource scheduling layer and the data processing layer, randomly taking values of the extracted pre-set dimensional parameters and generating a plurality of groups of configurations;
the configuration performance matrix calculation module is used for bringing the configuration into a system to execute, generating a performance label under specific configuration and finally obtaining a configuration performance matrix;
the model building module is used for building a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
the optimal configuration parameter searching module is used for substituting a performance model generated by a random forest into a genetic algorithm, and finding out a set of configuration parameters which enable the overall performance of the system to be best in cooperation with software in a data storage layer, a resource scheduling layer and a data processing layer under the working load through the genetic algorithm;
and the model migration module is used for calculating the similarity between the current new workload and each original workload when the new workload comes, and the following two conditions occur according to the similarity calculation result:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new working load is considered to be the unrecorded new type load, the Bayesian optimization is adopted to select the set dimension parameters on line, the configuration meeting the requirements is found to enable the current new type load to be executed, meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are executed on the new type load, and when the optimal configuration is found, the optimal configuration is used for execution.
Further, the specific execution process of the module for obtaining software configuration parameters includes:
s11: selecting the most important 20-dimensional parameters by using a Lasso algorithm; the objective function for its minimization is:
F_Lasso = ||y - Xw||_2^2 + α||w||_1
extracting important characteristic parameters by adding a penalty term to a least square method; w is the coefficient of each characteristic parameter, the larger w is, the more important the parameter is, when the coefficient w of a certain characteristic parameter is 0, the parameter is rejected, and the 20-dimensional coefficient which has the largest influence on the output of the model is extracted by the method;
s12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated; if the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]

where d_p is the default value of parameter p and x is a fixed scaling factor, set to 10 in the invention.
If the configuration parameter takes Boolean or enumeration values, the parameter values are encoded with One-Hot encoding and a value is then selected at random from the value range.
Further, the specific execution process of the configuration performance matrix calculation module includes:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the workload:
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
The specific execution process of the model building module comprises the following steps:
S31: divide the configuration performance matrix into three parts in a 60% : 20% : 20% ratio, used respectively as the training, validation, and test data sets for random forest performance modeling;
S32: for the data in the training set, use a Bootstrap strategy to draw n sample sets of the same dimensionality; the n sample sets contain the same number of samples, though not necessarily the same samples, and each sample has configuration parameter features of the same dimensionality, though not necessarily the same features;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
S34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average is taken as the final output of the model:

P = (1/n) · (pt_1 + pt_2 + … + pt_n)
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
The specific execution process of the optimal configuration parameter searching module comprises the following steps:
s41: determining the value of each parameter in the genetic algorithm; the execution time generated by specific configuration is used as a fitness value in the genetic algorithm, a set of configuration parameters are used as chromosomes in the genetic algorithm, and specific parameter values are used as genes in the genetic algorithm. Extracting n groups of parameter configurations from the test set as initial populations of the genetic algorithm:
{C_1, C_2, …, C_n}
C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
s43: the configuration parameters {C_1, C_2, …, C_n} are fed into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:
{C_1′, C_2′, …, C_n′}
s44: and substituting the newly generated group of configuration parameters into the performance model, calculating the execution time corresponding to each group of configuration parameters again, and repeating the steps S42-S44. Until an optimal configuration is found.
Further, the specific execution process of the model migration module includes:
s51: when a new task arrives, the new task is initially matched in a setSetting up execution once, and forming vector Vec according to system performance index in the execution period of this execution collectioni
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
Vec_i is the vector of system performance indexes collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between the workload and the previous workloads is taken and compared with 0.75, which is used here as the similarity threshold;
s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered similar. Suppose that, after the cosine similarity between the newly arrived workload i and all recorded types of workloads has been calculated, the similarity with workload type j is the largest and greater than 0.75. The optimal configuration C previously found for workload j is executed in the system under workload i to collect the execution time ET_i; the execution time of workload j under configuration C is ET_j. The difference ET_i - ET_j between the two is calculated as a compensation term for applying the performance model of workload j to workload i, i.e. the predicted execution time of workload i under a configuration C′ is T_i(C′) = T_j(C′) + ET_i - ET_j. A genetic algorithm is then used to search this compensated model for the optimal configuration;
A workload that does not reach the similarity threshold is regarded as a new, unrecorded type of load; for this situation the invention provides a quick-start scheme based on Bayesian optimization, specifically as follows:
the 20-dimensional parameters are tuned online through Bayesian optimization to quickly find a configuration scheme good enough for the workload to execute; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are run for this workload, and once the optimal configuration is found, it is used for execution.
The invention has the advantages that:
the invention designs a dynamic-load-scenario-oriented collaborative tuning method for cross-layer configuration parameters of a big data system: parameters are extracted after importance ranking across the multiple software layers, an optimal configuration scheme is searched after the performance of a workload running on the multi-layer big data software has been modeled, and when a new workload arrives, model migration is performed according to its characteristics so as to adapt to scenarios under different workloads. Because parameters influence one another in intricate ways, cross-layer collaborative tuning avoids the situation where the configuration parameters of certain software are individually optimal while the overall performance of the system is not.
According to the method, the top-ranked parameters are extracted through importance ranking, which reduces the dimensionality of the configuration parameters and increases the interpretability of the model. The invention adopts an ensemble learning method (the random forest algorithm) to establish a performance model that takes the configuration parameters of each software layer as input and a performance index of the workload as output. It should be noted that execution time is not the only metric that can serve as the model output; system throughput and software response time can also be used. Once the performance model for a specific workload is established, the performance index of the multi-layer big data software under a set of configuration parameters can be generated directly from those parameters, avoiding the large time cost of actually executing each configuration scheme on the multi-layer software to test its performance.
The present invention reduces the processing time of new workloads by migrating performance models based on the similarity between workloads. The similarity value between workloads is compared with a pre-established threshold; when the value exceeds the threshold, the workloads are considered similar, and a performance model suitable for the new workload is obtained through a simple migration of the existing performance model, avoiding the time cost of retraining. When the threshold is not exceeded, a scheme suitable for the new workload is quickly found through an online search method while an offline performance model is built; once the performance model is built, the optimal configuration is searched on that model.
Drawings
FIG. 1 is a cross-layer software collaboration flow chart as described in the background of the invention;
FIG. 2 is a flow chart of a cross-layer configuration parameter co-optimization method for a big data system oriented to a dynamic load scenario in an embodiment of the present invention;
FIG. 3 is an exemplary diagram of 200 sets of configuration files produced in an embodiment of the present invention;
FIG. 4 is a flow chart of random forest modeling in an embodiment of the present invention;
fig. 5 is a flowchart illustrating the step S4 of finding the optimal configuration parameters according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a dynamic load scene-oriented cross-layer configuration parameter collaborative optimization method for a big data system, which is different from most other schemes in that the method is suitable for load change, comprehensively adjusts configuration parameters by cooperatively considering big data cross-layer multi-software, and can search optimal configuration by using various measurement standards (such as workload execution time, system throughput and the like). By using the framework, long-time manual test is avoided, and the optimal configuration can be quickly found. As shown in fig. 2, the method specifically includes the following steps:
step S1: and (4) sorting the importance of the parameters of the HDFS, Yarn and Spark, and selecting the most important 20-dimensional parameters as the input features of the model in the step S3. And randomly taking values of the extracted 20-dimensional parameters to generate a plurality of groups of configurations.
As shown in fig. 2, the step S1 specifically executes the following process:
s11: the most important 20-dimensional parameters are selected using the Lasso algorithm. The objective function for its minimization is:
F_Lasso = ||y - Xw||_2^2 + α||w||_1
Important characteristic parameters are extracted by adding an L1 penalty term to the least-squares objective. w contains the coefficient of each characteristic parameter: a larger |w| indicates a more important parameter, and when the coefficient w of a characteristic parameter is 0, the parameter is rejected. By this method, the 20 parameters with the largest influence on the model output are extracted.
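As an illustrative sketch of the Lasso-based ranking of S11 (using scikit-learn; the synthetic data, the choice of alpha, and the assumption that exactly five parameters dominate are all hypothetical, not values fixed by the invention):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in for real configuration/performance data: 200 measured
# configurations over 50 candidate parameters, where (by construction here)
# only the first 5 parameters actually affect performance.
rng = np.random.default_rng(0)
n_samples, n_params = 200, 50
X = rng.normal(size=(n_samples, n_params))
true_w = np.zeros(n_params)
true_w[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# L1-penalised least squares: F = ||y - Xw||_2^2 + alpha * ||w||_1.
lasso = Lasso(alpha=0.1).fit(X, y)

# Rank parameters by |w|; a zero coefficient means the parameter is dropped.
k = 20
top_k = np.argsort(-np.abs(lasso.coef_))[:k]
```

With a sufficiently large alpha, coefficients of irrelevant parameters shrink exactly to zero, so the top-k indices recover the influential parameters.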
S12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated. If the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]
wherein d_p is the default value of the parameter p and x is a fixed scaling factor, set to 10 in the present invention.
If the value of a configuration parameter is a Boolean or enumeration type, the parameter value is encoded in a One-Hot manner and then randomly selected within its value range.
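The random generation of 200 configurations in S12 can be sketched as follows; the parameter names, defaults, and types are hypothetical examples, not values taken from the invention:

```python
import random

# Hypothetical parameter specifications: numeric parameters are sampled from
# [default / x, default * x] with x = 10; boolean/enumeration parameters are
# sampled uniformly from their value set.
PARAM_SPECS = {
    "spark.executor.memory.mb": {"type": "numeric", "default": 2048},
    "dfs.replication": {"type": "numeric", "default": 3},
    "yarn.vmem.check.enabled": {"type": "enum", "values": ["true", "false"]},
}
X_FACTOR = 10

def sample_config(specs, rng):
    cfg = {}
    for name, spec in specs.items():
        if spec["type"] == "numeric":
            d = spec["default"]
            # Random value in [d / x, d * x].
            cfg[name] = rng.uniform(d / X_FACTOR, d * X_FACTOR)
        else:
            # Uniform choice over the enumerated values.
            cfg[name] = rng.choice(spec["values"])
    return cfg

rng = random.Random(42)
configs = [sample_config(PARAM_SPECS, rng) for _ in range(200)]
```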
Step S2: the configuration is brought into the system to be executed, a performance label under the specific configuration is generated, and the execution time of the workload is used as the performance label (other indexes can also be used as the performance label, such as the system throughput). This step ultimately results in a configured performance matrix. The specific execution process comprises the following steps:
s21: and updating corresponding parameter values to the Yarn, Spark and HDFS software through the automatic deployment code according to the 200 sets of configuration parameters generated by the S1.
S22: and adding the workload into a big data processing flow, and acquiring the execution time corresponding to the configuration under the workload.
S23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding execution time.
Step S3: the configuration performance matrix is used to build a model using a random forest. After the model is trained, the corresponding execution time can be output according to an input set of specific configuration parameters. The storage model is migrated with a later model of a new workload. As shown in fig. 4, the specific execution process of step S3 is as follows:
s31: the process configures a performance matrix in 60%: 20%: the configured performance matrix is divided into three parts according to the proportion of 20%, and the three parts are respectively used as a training data set, a verification data set and a test data set for the random forest to perform performance model modeling.
S32: and for the data in the training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimension, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has the configuration parameter features with the same dimension, but the configuration parameter features are not necessarily the same.
S33: and establishing a regression tree model for each sample set, wherein n regression trees are calculated in total.
S34: the output of the n regression trees is { pt1,pt2…ptnAnd taking the average value thereof as the final output of the model.
Figure BDA0002990348620000121
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
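Steps S31-S34 can be sketched with scikit-learn, whose RandomForestRegressor performs the Bootstrap sampling and tree averaging internally; the synthetic data, the response function, and the hyperparameters below are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic configuration-performance matrix: 200 configurations of 20
# parameters with an assumed execution-time response.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 20))
y = 50.0 + 100.0 * X[:, 0] + 30.0 * X[:, 1]

# 60% / 20% / 20% split into training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# Bootstrap sampling of n trees and averaging of their outputs (S32-S34)
# is what RandomForestRegressor does internally.
model = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
```

Once fitted, the model predicts an execution time for any configuration vector without actually deploying and running it.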
Step S4: and substituting the performance model generated by the random forest into a genetic algorithm, and finding out the optimal configuration parameters of Yarn, Spark and HDFS under the working load by the genetic algorithm. As shown in fig. 5, the specific implementation procedure of step S4 is as follows:
s41: and determining the value of each parameter in the genetic algorithm. The execution time generated by specific configuration is used as a fitness value in the genetic algorithm, a set of configuration parameters are used as chromosomes in the genetic algorithm, and specific parameter values are used as genes in the genetic algorithm. Extracting n groups of parameter configurations from the test set as initial populations of the genetic algorithm:
{C_1, C_2, …, C_n}
C_1, C_2, …, C_n each represent a specific set of configuration schemes, totaling n sets.
S42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
s43: the configuration parameters {C_1, C_2, …, C_n} are fed into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each set of configurations. The genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:
{C_1′, C_2′, …, C_n′}
s44: and substituting the newly generated group of configuration parameters into the performance model, calculating the execution time corresponding to each group of configuration parameters again, and repeating the steps S42-S44. Until an optimal configuration is found.
Step S5: and when a new workload comes, migrating the performance model according to the similarity.
The specific execution process of step S5 includes:
s51: when a new task arrives, the new task is executed once on a set of initial configuration, and according to the execution, system performance indexes such as 'CPUs affected', 'context-switches', 'cpu-migrations', 'cycles' and the like during execution are collected to form a vector Veci
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
Vec_i is the vector of system performance indexes collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j. The highest cosine similarity between the workload and the previous workloads is taken and compared with 0.75, which is used here as the similarity threshold.
S53: if the highest cosine similarity is greater than 0.75, the two workloads are considered to have similarity. After the cosine similarity between the newly arrived workload i and all types of workloads is calculated, the cosine similarity with the j type of workloads is the maximum and is larger than 0.75. The optimal configuration C found previously by the workload j is placed under the workload i to be executed in the system to collect the execution time ETiThe execution time of the workload j under the configuration C is ETjCalculating the difference ET between the twoj-ETiAs a compensation to apply a performance model of workload j to workload i. I.e. predicted execution time T of workload i under configuration Ci(C)=Ti(C)+ETj-ETi. A genetic algorithm is then used to search this model for its optimal configuration.
A workload that does not reach the similarity threshold is regarded as a new, unrecorded type of load. The invention provides a quick-start scheme based on Bayesian optimization for this situation.
The idea of Bayesian optimization is to generate an initial candidate solution set, then find the next most likely extreme point according to the points, add the point into the set, and repeat the steps until the iteration is terminated. And finally, finding out the point with the maximum function value from the points to be used as the solution of the problem. The method is more effective than grid search and random search because the information of the searched points is utilized in the solving process.
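The Bayesian-optimization loop described above can be sketched as follows; the Gaussian-process surrogate, the lower-confidence-bound acquisition rule, the 5-dimensional search space, and the measured_time stand-in are all assumptions made for illustration (the invention does not fix these choices):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Stand-in for actually running the workload under configuration x and
# measuring its execution time (5 parameters instead of 20, for brevity).
def measured_time(x):
    return float(np.sum((x - 0.4) ** 2))

rng = np.random.default_rng(0)
dim = 5

# Initial candidate solution set.
X_obs = rng.uniform(size=(5, dim))
y_obs = np.array([measured_time(x) for x in X_obs])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):
    gp.fit(X_obs, y_obs)
    cand = rng.uniform(size=(256, dim))
    mu, sigma = gp.predict(cand, return_std=True)
    # Lower-confidence-bound acquisition: prefer low predicted time but
    # reward uncertainty, so information from searched points is reused
    # while unexplored regions still get sampled.
    x_next = cand[np.argmin(mu - 1.5 * sigma)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, measured_time(x_next))

best_config = X_obs[np.argmin(y_obs)]
```

Each iteration adds the most promising point to the set, which is why this converges faster than grid or purely random search.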
The 20-dimensional parameters are tuned online through Bayesian optimization to quickly find a good enough configuration scheme for the workload to execute; meanwhile, steps S1-S4 are executed on the workload, and once the optimal configuration is found, it is used for execution.
The invention designs a dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method for a big data system.
According to the method, the 20-dimensional parameters are extracted through importance ranking, which reduces the dimensionality of the configuration parameters and increases the interpretability of the model. The invention adopts an ensemble learning method (the random forest algorithm) to establish a performance model that takes the configuration parameters of each software layer as input and a performance index of the workload as output. It should be noted that execution time is not the only metric that can serve as the model output; system throughput and software response time can also be used. Once the performance model for a specific workload is established, the performance index of the multi-layer big data software under a set of configuration parameters can be generated directly from those parameters, avoiding the large time cost of actually executing each configuration scheme on the multi-layer software to test its performance.
The present invention reduces the processing time of new workloads by migrating performance models based on the similarity between workloads. The similarity value between workloads is compared with a pre-established threshold; when the value exceeds the threshold, the workloads are considered similar, and a performance model suitable for the new workload is obtained through a simple migration of the existing performance model, avoiding the time cost of retraining. When the threshold is not exceeded, a scheme suitable for the new workload is quickly found through an online search method while an offline performance model is built; once the performance model is built, the optimal configuration is searched on that model.
Based on the above method, this embodiment further provides a dynamic load scenario-oriented collaborative tuning system for cross-layer configuration parameters of a big data system, including:
the software configuration parameter acquisition module is used for sorting the importance of the parameters of each software in the data storage layer, the resource scheduling layer and the data processing layer, randomly taking values of the extracted pre-set dimensional parameters and generating a plurality of groups of configurations;
the configuration performance matrix calculation module is used for bringing the configuration into a system to execute, generating a performance label under specific configuration and finally obtaining a configuration performance matrix;
the model building module is used for building a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
the optimal configuration parameter searching module is used for substituting a performance model generated by a random forest into a genetic algorithm, and finding out a set of configuration parameters which enable the overall performance of the system to be best in cooperation with software in a data storage layer, a resource scheduling layer and a data processing layer under the working load through the genetic algorithm;
and the model migration module is used for calculating the similarity between the current new workload and each original workload when the new workload comes, and the following two conditions occur according to the similarity calculation result:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new working load is considered to be the unrecorded new type load, the Bayesian optimization is adopted to select the set dimension parameters on line, the configuration meeting the requirements is found to enable the current new type load to be executed, meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are executed on the new type load, and when the optimal configuration is found, the optimal configuration is used for execution.
The specific execution process of the module for acquiring the software configuration parameters comprises the following steps:
s11: selecting the most important 20-dimensional parameters by using a Lasso algorithm; the objective function for its minimization is:
F_Lasso = ||y - Xw||_2^2 + α||w||_1
Important characteristic parameters are extracted by adding an L1 penalty term to the least-squares objective; w contains the coefficient of each characteristic parameter: a larger |w| indicates a more important parameter, and when the coefficient w of a characteristic parameter is 0, the parameter is rejected; by this method, the 20 parameters with the largest influence on the model output are extracted;
s12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated; if the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]
wherein d_p is the default value of the parameter p and x is a fixed scaling factor, set to 10 in the present invention.
If the value of a configuration parameter is a Boolean or enumeration type, the parameter value is encoded in a One-Hot manner and then randomly selected within its value range.
The specific execution process of the configuration performance matrix calculation module comprises the following steps:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
The execution process of the model building module is specifically as follows:
s31: the process configures a performance matrix in 60%: 20%: the configured performance matrix is divided into three parts according to the proportion of 20%, and the three parts are respectively used as a training data set, a verification data set and a test data set for the random forest to perform performance model modeling.
S32: and for the data in the training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimension, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has the configuration parameter features with the same dimension, but the configuration parameter features are not necessarily the same.
S33: and establishing a regression tree model for each sample set, wherein n regression trees are calculated in total.
S34: the output of the n regression trees is { pt1,pt2…ptnAnd taking the average value thereof as the final output of the model.
Figure BDA0002990348620000161
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
The specific execution process of the optimal configuration parameter searching module comprises the following steps:
s41: determining the value of each parameter in the genetic algorithm; the execution time generated by specific configuration is used as a fitness value in the genetic algorithm, a set of configuration parameters are used as chromosomes in the genetic algorithm, and specific parameter values are used as genes in the genetic algorithm. Extracting n groups of parameter configurations from the test set as initial populations of the genetic algorithm:
{C_1, C_2, …, C_n}
C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
s43: the configuration parameters {C_1, C_2, …, C_n} are fed into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:
{C_1′, C_2′, …, C_n′}
s44: and substituting the newly generated group of configuration parameters into the performance model, calculating the execution time corresponding to each group of configuration parameters again, and repeating the steps S42-S44. Until an optimal configuration is found.
The specific execution process of the model migration module comprises the following steps:
s51: when a new task arrives, the new task is executed once on a set of initial configuration, and a vector Vec is formed according to the system performance indexes during the execution and collection of the executioni
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
Vec_i is the vector of system performance indexes collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between the workload and the previous workloads is taken and compared with 0.75, which is used here as the similarity threshold;
s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered to have similarity. After the cosine similarity between the newly arrived workload i and all types of workloads is calculated, the cosine similarity between the newly arrived workload i and j types of workloads is maximum and is larger than 0.75; the optimal configuration C found previously by the workload j is placed under the workload i to be executed in the system to collect the execution time ETiThe execution time of the workload j under the configuration C is ETjCalculating the difference ET between the twoj-ETiAs a compensation to apply a performance model of workload j to workload i. I.e. predicted execution time T of workload i under configuration Ci(C)=Ti(C)+ETj-ETi. Then searching the optimal configuration of the model by using a genetic algorithm;
A workload that does not reach the similarity threshold is regarded as a new, unrecorded type of load; for this situation the invention provides a quick-start scheme based on Bayesian optimization, specifically as follows:
the 20-dimensional parameters are tuned online through Bayesian optimization to quickly find a configuration scheme good enough for the workload to execute; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are run for this workload, and once the optimal configuration is found, it is used for execution.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method for a big data system is characterized by comprising the following steps:
s1: sorting the importance of parameters of each software in a data storage layer, a resource scheduling layer and a data processing layer, and selecting a pre-set dimensional parameter; randomly taking values of the extracted pre-set dimensional parameters to generate a plurality of groups of configurations;
s2: bringing a plurality of groups of configurations into a big data system for execution, generating a performance label under a specific configuration, and finally obtaining a configuration performance matrix;
s3: establishing a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
s4: substituting the target performance model into a genetic algorithm, and finding out a set of configuration parameters which enable the overall performance of the system to be best to be expressed by software cooperation in a data storage layer, a resource scheduling layer and a data processing layer under the working load through the genetic algorithm;
s5: when a new workload comes, calculating the similarity between the current new workload and each original workload, and according to the similarity calculation result, the following two situations occur:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new workload is considered an unrecorded new type of load; Bayesian optimization is used to select the preset parameter dimensions online and to find a configuration meeting requirements so that the new type of load can execute; meanwhile, steps S1-S4 are performed for the new type of load, and once the optimal configuration is found, it is used for execution.
2. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S1 includes:
s11: selecting the 20 most important parameter dimensions by using the Lasso algorithm; the objective function it minimizes is:

F_Lasso = ||y - Xw||_2^2 + α||w||_1

that is, a least-squares term plus an L1 penalty term, which extracts the important characteristic parameters; w is the coefficient of each characteristic parameter, and the larger |w| is, the more important the parameter; when the coefficient w of a characteristic parameter is 0 the parameter is rejected, and in this way the 20 dimensions with the largest influence on the model output are extracted;
s12: for these 20 parameter dimensions, randomly generating 200 groups of configuration parameters; if a parameter is numerical, randomly taking a value in the range determined by the following formula:

range = [d_p / x, d_p · x]

wherein d_p is the default value of parameter p, and x is a fixed scaling coefficient with x = 10;
and if a configuration parameter takes a Boolean or enumerated value, encoding the parameter values in One-Hot encoding mode and then randomly selecting a value from the value range.
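Steps S11-S12 can be sketched in Python, assuming scikit-learn's `Lasso`; the data set, the default value of 1024, and the top-k of 3 (instead of 20) are invented illustrations, not the patent's actual parameter set:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for a sampled configuration/performance data set:
# 100 configurations over 50 candidate parameters, of which only
# parameters 3, 7 and 21 actually influence the performance label y.
X = rng.normal(size=(100, 50))
true_w = np.zeros(50)
true_w[[3, 7, 21]] = [5.0, -4.0, 3.0]
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Lasso minimizes ||y - Xw||_2^2 + alpha * ||w||_1; parameters whose
# coefficient shrinks to exactly 0 are rejected, the rest ranked by |w|.
lasso = Lasso(alpha=0.1).fit(X, y)
k = 3  # the claim keeps the top 20 dimensions; 3 suffices for this toy data
top_k = np.argsort(-np.abs(lasso.coef_))[:k]

# S12: a numeric parameter is sampled uniformly in [d_p / x, d_p * x], x = 10.
def sample_numeric(default, x=10):
    return rng.uniform(default / x, default * x)

# 200 random configurations over the selected dimensions.
configs = [[sample_numeric(1024) for _ in top_k] for _ in range(200)]
```

One-Hot-encoded Boolean and enumerated parameters would be sampled by index instead of by range, as S12 notes.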
3. The big data system cross-layer configuration parameter collaborative tuning method for the dynamic load scenario as claimed in claim 2, wherein the step S2 includes:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
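The S21-S23 loop reduces to: deploy each configuration, time the workload under it, and append the resulting performance label as the last matrix column. A minimal sketch, with a synthetic `run_workload` standing in for the actual cross-layer deployment and timing:

```python
import numpy as np

def run_workload(config):
    # Stand-in for deploying `config` to the storage, scheduling and
    # processing layers and timing one workload run; returns a synthetic
    # "execution time" so the example stays self-contained.
    return float(sum(v * v for v in config))

# Three toy configurations over two parameters.
configs = [[0.5, 1.0], [1.5, 2.0], [3.0, 0.1]]

# One row per configuration: parameter values plus the performance label.
perf_matrix = np.array([cfg + [run_workload(cfg)] for cfg in configs])
```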
4. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S3 includes:
s31: processing the configuration-performance matrix by dividing it into three parts in the proportion 60%:20%:20%, used respectively as the training data set, validation data set and test data set for random-forest performance model modeling;
s32: for data in a training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimensionality, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has configuration parameter characteristics with the same dimensionality, but the configuration parameter characteristics are not necessarily the same;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
s34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average value is taken as the final output of the model:

P = (pt_1 + pt_2 + … + pt_n) / n
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
5. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S4 includes:
s41: determining the value of each element in the genetic algorithm; taking the execution time generated by a specific configuration as the fitness value in the genetic algorithm, taking a group of configuration parameters as a chromosome in the genetic algorithm, and taking a specific parameter value as a gene in the genetic algorithm; extracting n groups of parameter configurations from the test set as the initial population of the genetic algorithm:

{C_1, C_2, …, C_n}

C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into the performance model generated by the random forest to calculate the execution time of each group of parameter configurations:

{P_1, P_2, …, P_n}

s43: feeding the configuration parameters {C_1, C_2, …, C_n} into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next generation of configuration parameters:

{C_1', C_2', …, C_n'}

s44: substituting the newly generated group of configuration parameters into the performance model, calculating again the execution time corresponding to each group of configuration parameters, and repeating steps S42-S44 until the optimal configuration is found.
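The S41-S44 loop can be sketched compactly, with a toy quadratic `predicted_time` standing in for the random-forest performance model; the population size, crossover and mutation settings are illustrative, not taken from the patent:

```python
import random

random.seed(0)

def predicted_time(cfg):
    # Stand-in for the random-forest performance model: a toy surface
    # whose minimum execution time lies at cfg = (2, 3).
    return (cfg[0] - 2.0) ** 2 + (cfg[1] - 3.0) ** 2

def evolve(pop, generations=60):
    for _ in range(generations):
        pop.sort(key=predicted_time)            # lower predicted time = fitter
        parents = pop[: len(pop) // 2]          # elitist selection
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:           # mutate one gene
                i = random.randrange(len(child))
                child[i] += random.gauss(0.0, 0.5)
            children.append(child)
        pop = parents + children
    return min(pop, key=predicted_time)

# Initial population: 30 random two-parameter configurations.
population = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(30)]
best = evolve(population)
```

Because each generation is re-scored against the model rather than the live system, the search itself costs no cluster time; only the final candidate configuration needs a real run.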
6. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S5 includes:
s51: when a new task arrives, executing it once under a set of initial configurations, and forming a vector Vec_i from the system performance indicators collected during execution;

s52: calculating in turn the cosine similarity between this workload and each previous type of workload, with the calculation formula:

sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)

wherein Vec_i is the vector of system performance indicators collected when workload i executes under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between this workload and the previous workloads is compared with 0.75, where 0.75 is taken as the similarity threshold;

s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered similar; suppose that, after calculating the cosine similarity between the newly arrived workload i and all types of workloads, the similarity with workload type j is the largest and exceeds 0.75; the optimal configuration C previously found for workload j is executed in the system under workload i to collect the execution time ET_i, and the execution time of workload j under configuration C is ET_j; the difference ET_j - ET_i between the two is used as a compensation for applying workload j's performance model to workload i, i.e. the predicted execution time of workload i under a configuration C' is T_i(C') = T_j(C') - (ET_j - ET_i); the genetic algorithm is then used to search this model for the optimal configuration;
a workload that does not reach the similarity threshold is considered an unrecorded new type of load; for this situation a Bayesian-optimization-based quick-start scheme is provided, specifically:
selecting the 20 parameter dimensions online through Bayesian optimization and quickly finding a good-enough configuration scheme under which the workload executes; meanwhile, steps S1-S4 are performed on the workload, and once the optimal configuration is found, it is used for execution.
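One reading of S51-S53 as code; the workload names, indicator vectors and timings are invented placeholders, and the compensation is applied by subtracting ET_j - ET_i from workload j's prediction, consistent with the difference defined in S53:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

THRESHOLD = 0.75  # the similarity threshold from S52

# Invented indicator vectors collected under one fixed initial configuration;
# the workload names are placeholders, not from the patent.
known = {"wordcount": [0.9, 0.2, 0.4], "terasort": [0.1, 0.8, 0.6]}
new_vec = [0.85, 0.25, 0.35]

best_name, best_sim = max(((name, cosine(new_vec, vec)) for name, vec in known.items()),
                          key=lambda t: t[1])

if best_sim > THRESHOLD:
    # S53: migrate workload j's model, subtracting the compensation ET_j - ET_i.
    et_i, et_j = 100.0, 112.0            # invented timings under shared configuration C
    t_j_pred = 95.0                      # workload j's model prediction for some C'
    t_i_pred = t_j_pred - (et_j - et_i)  # compensated prediction for workload i
else:
    t_i_pred = None  # unseen load type: fall back to the Bayesian-optimization cold start
```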
7. A big data system cross-layer configuration parameter collaborative tuning system oriented to a dynamic load scene is characterized by comprising the following components:
the software configuration parameter acquisition module is used for sorting the importance of the parameters of each software in the data storage layer, the resource scheduling layer and the data processing layer, randomly taking values of the extracted pre-set dimensional parameters and generating a plurality of groups of configurations;
the configuration performance matrix calculation module is used for bringing the configuration into a system to execute, generating a performance label under specific configuration and finally obtaining a configuration performance matrix;
the model building module is used for building a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
the optimal configuration parameter searching module is used for substituting the performance model generated by the random forest into a genetic algorithm, and finding, through the genetic algorithm, the set of configuration parameters under which the software in the data storage layer, resource scheduling layer and data processing layer cooperates to deliver the best overall system performance under the workload;
and the model migration module is used for calculating the similarity between the current new workload and each original workload when the new workload comes, and the following two conditions occur according to the similarity calculation result:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new workload is considered an unrecorded new type of load; Bayesian optimization is used to select the preset parameter dimensions online and to find a configuration meeting requirements so that the new type of load can execute; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model building module and the optimal configuration parameter searching module are executed for the new type of load, and once the optimal configuration is found, it is used for execution.
8. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific execution process of the software configuration parameter obtaining module includes:
s11: selecting the 20 most important parameter dimensions by using the Lasso algorithm; the objective function it minimizes is:

F_Lasso = ||y - Xw||_2^2 + α||w||_1

that is, a least-squares term plus an L1 penalty term, which extracts the important characteristic parameters; w is the coefficient of each characteristic parameter, and the larger |w| is, the more important the parameter; when the coefficient w of a characteristic parameter is 0 the parameter is rejected, and in this way the 20 dimensions with the largest influence on the model output are extracted;
s12: for these 20 parameter dimensions, randomly generating 200 groups of configuration parameters; if a parameter is numerical, randomly taking a value in the range determined by the following formula:

range = [d_p / x, d_p · x]

wherein d_p is the default value of parameter p, and x is a fixed scaling coefficient with x = 10;
and if a configuration parameter takes a Boolean or enumerated value, encoding the parameter values in One-Hot encoding mode and then randomly selecting a value from the value range.
9. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the configuration performance matrix calculation module specifically executes a process including:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
10. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific implementation process of the model building module includes:
s31: dividing the configuration-performance matrix into three parts in the proportion 60%:20%:20%, used respectively as the training data set, validation data set and test data set for random-forest performance model modeling;
s32: for data in a training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimensionality, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has configuration parameter characteristics with the same dimensionality, but the configuration parameter characteristics are not necessarily the same;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
s34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average value is taken as the final output of the model:

P = (pt_1 + pt_2 + … + pt_n) / n
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
11. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific implementation process of the optimal configuration parameter searching module includes:
s41: determining the value of each element in the genetic algorithm; taking the execution time generated by a specific configuration as the fitness value in the genetic algorithm, taking a group of configuration parameters as a chromosome in the genetic algorithm, and taking a specific parameter value as a gene in the genetic algorithm; extracting n groups of parameter configurations from the test set as the initial population of the genetic algorithm:

{C_1, C_2, …, C_n}

C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into the performance model generated by the random forest to calculate the execution time of each group of parameter configurations:

{P_1, P_2, …, P_n}

s43: feeding the configuration parameters {C_1, C_2, …, C_n} into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next generation of configuration parameters:

{C_1', C_2', …, C_n'}

s44: substituting the newly generated group of configuration parameters into the performance model, calculating again the execution time corresponding to each group of configuration parameters, and repeating steps S42-S44 until the optimal configuration is found.
12. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific execution process of the model migration module includes:
s51: when a new task arrives, executing it once under a set of initial configurations, and forming a vector Vec_i from the system performance indicators collected during execution;

s52: calculating in turn the cosine similarity between this workload and each previous type of workload, with the calculation formula:

sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)

wherein Vec_i is the vector of system performance indicators collected when workload i executes under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between this workload and the previous workloads is compared with 0.75, where 0.75 is taken as the similarity threshold;

s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered similar; suppose that, after calculating the cosine similarity between the newly arrived workload i and all types of workloads, the similarity with workload type j is the largest and exceeds 0.75; the optimal configuration C previously found for workload j is executed in the system under workload i to collect the execution time ET_i, and the execution time of workload j under configuration C is ET_j; the difference ET_j - ET_i between the two is used as a compensation for applying workload j's performance model to workload i, i.e. the predicted execution time of workload i under a configuration C' is T_i(C') = T_j(C') - (ET_j - ET_i); the genetic algorithm is then used to search this model for the optimal configuration;
a workload that does not reach the similarity threshold is considered an unrecorded new type of load; for this situation a Bayesian-optimization-based quick-start scheme is provided, specifically:
selecting the 20 parameter dimensions online through Bayesian optimization and quickly finding a good-enough configuration scheme under which the workload executes; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model building module and the optimal configuration parameter searching module are executed on the workload, and once the optimal configuration is found, it is used for execution.
CN202110313931.7A 2021-03-24 2021-03-24 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system Pending CN113032367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313931.7A CN113032367A (en) 2021-03-24 2021-03-24 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system


Publications (1)

Publication Number Publication Date
CN113032367A true CN113032367A (en) 2021-06-25

Family

ID=76473439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313931.7A Pending CN113032367A (en) 2021-03-24 2021-03-24 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Country Status (1)

Country Link
CN (1) CN113032367A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023097568A1 (en) * 2021-12-01 2023-06-08 中国科学院深圳先进技术研究院 Method for adjusting and optimizing configuration parameters of stream data processing system on basis of bayesian optimization
CN117130460A (en) * 2023-04-14 2023-11-28 荣耀终端有限公司 Method, device, server and storage medium for reducing power consumption
WO2024045836A1 (en) * 2022-08-30 2024-03-07 华为技术有限公司 Parameter adjustment method and related device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080215512A1 (en) * 2006-09-12 2008-09-04 New York University System, method, and computer-accessible medium for providing a multi-objective evolutionary optimization of agent-based models
CN103064664A (en) * 2012-11-28 2013-04-24 华中科技大学 Hadoop parameter automatic optimization method and system based on performance pre-evaluation
WO2013070940A1 (en) * 2011-11-08 2013-05-16 Mettler-Toledo, LLC Configuration of a metrologically sealed device via a passive rf interface
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
CN108229547A (en) * 2017-12-27 2018-06-29 东南大学 A kind of gear distress recognition methods based on partial model transfer learning
CN108234177A (en) * 2016-12-21 2018-06-29 深圳先进技术研究院 A kind of HBase configuration parameter automated tunings method and device, user equipment
CN108491226A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Spark based on cluster scaling configures parameter automated tuning method
CN110727506A (en) * 2019-10-18 2020-01-24 北京航空航天大学 SPARK parameter automatic tuning method based on cost model
CN111176832A (en) * 2019-12-06 2020-05-19 重庆邮电大学 Performance optimization and parameter configuration method based on memory computing framework Spark
CN112463763A (en) * 2020-11-19 2021-03-09 东北大学 RF algorithm-based MySQL database parameter screening method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YE, QW (YE, QIANWEN) et al.: "Profiling-Based Big Data Workflow Optimization in a Cross-layer Coupled Design Framework", Algorithms and Architectures for Parallel Processing *
HUA, Xingcheng: "Research on Application Performance Optimization Methods for Big Data Processing", China Doctoral Dissertations Full-text Database, Information Science and Technology *
ZHANG, Ni: "Research and Implementation of Performance Tuning Methods for the Distributed Storage System HBase", China Master's Theses Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210625