CN113032367A - Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system - Google Patents

Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Info

Publication number
CN113032367A
Authority
CN
China
Prior art keywords
configuration, workload, parameter, performance, model
Prior art date
Legal status: Pending (assumed; Google has not performed a legal analysis)
Application number
CN202110313931.7A
Other languages
Chinese (zh)
Inventor
窦晖 (Dou Hui)
贾成成 (Jia Chengcheng)
张以文 (Zhang Yiwen)
Current Assignee
Anhui University
Original Assignee
Anhui University
Application filed by Anhui University
Priority to CN202110313931.7A
Publication of CN113032367A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 - Information retrieval of structured data, e.g. relational data
    • G06F16/21 - Design, administration or maintenance of databases
    • G06F16/217 - Database tuning
    • G06F16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F9/00 - Arrangements for program control, e.g. control units
    • G06F9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 - Multiprogramming arrangements
    • G06F9/50 - Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083 - Techniques for rebalancing the load in a distributed system
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/004 - Artificial life, i.e. computing arrangements simulating life
    • G06N3/006 - Artificial life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a method and system for collaborative tuning of the cross-layer configuration parameters of a big data system, suitable for dynamic workload scenarios.

Description

Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system
Technical Field
The invention belongs to the technical field of application software performance optimization, and in particular relates to a cross-layer configuration parameter collaborative tuning method and system for big data systems that is suited to dynamic workload scenarios.
Background
With the explosive growth of network services such as social networking, instant messaging, and e-commerce, internet and mobile network users generate enormous amounts of data every day, and the big data era has arrived. Given the '4V' characteristics of big data, processing and analysis are necessary to extract valuable information from massive and variable data. As the underlying technology for big data processing and analysis, big data systems are therefore widely deployed to store data, schedule computing resources, and process and analyze data. A big data system generally consists of application software at the following three layers:
Data storage layer: responsible for the persistent storage of the data to be processed, intermediate data produced during processing, and result data after processing. For example, HDFS, as storage layer software, stores the data required by and generated by the processing layer software.
Resource scheduling layer: responsible for allocating hardware resources to specific data processing tasks according to a scheduling policy. For example, Yarn, as resource scheduling layer software, allocates hardware resources to the data processing layer software that executes data processing tasks.
Data processing layer: responsible for executing the specific data processing tasks. For example, Spark can efficiently process large-scale data using the machine resources allocated by Yarn.
HDFS, Yarn, and Spark are popular general-purpose software in the big data ecosystem. Each component is responsible for only one link in the chain of data storage, resource scheduling, and data processing, so a single big data processing task usually requires the cooperation of multiple cross-layer software components. As shown in Fig. 1, the data stored in HDFS is transmitted to Spark for processing and analysis, Yarn allocates enough machine resources to Spark to ensure the task completes normally, and the data processed and analyzed by Spark is persisted back to HDFS.
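To make the cross-layer parameter space concrete, a few representative configuration keys from each layer are shown below. The parameter names are real HDFS/Yarn/Spark keys, but the values are illustrative defaults, not tuned values from the invention:

```properties
# Data storage layer (HDFS: hdfs-site.xml / core-site.xml)
dfs.blocksize=134217728
dfs.replication=3
io.file.buffer.size=4096

# Resource scheduling layer (Yarn: yarn-site.xml)
yarn.nodemanager.resource.memory-mb=8192
yarn.scheduler.maximum-allocation-mb=8192

# Data processing layer (Spark: spark-defaults.conf)
spark.executor.memory=1g
spark.executor.cores=1
spark.sql.shuffle.partitions=200
```

Tuning any one of these in isolation can simply shift the bottleneck to another layer, which is why the invention tunes them jointly.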
To accomplish big data processing tasks arising from different demand scenarios, the load faced by a big data system typically changes dynamically. Big data software in the data storage, resource scheduling, and data processing layers therefore usually exposes a large number of modifiable configuration parameters to adapt to the differing performance demands of different processing tasks. By adjusting the performance-related parameters appropriately, the performance of the big data system under different load scenarios can be optimized. To date, academia and industry have optimized the configuration parameters of big data system software in two main ways: (1) manually tuning the configuration parameters based on expert experience and test results; because different software in a big data system has different configuration parameters and the relationship between configuration parameters and performance is complex, manually searching for the optimal configuration is time-consuming and does not generalize; (2) to overcome the drawbacks of manual tuning, researchers have turned to model-based methods for automated tuning: performance indicators are collected for different configurations of one specific big data software component under a specific load, a model between configuration parameters and performance indicators is built, and a search algorithm then finds the optimal configuration according to the performance model. However, existing approaches are generally applicable only to one specific software component operating under one specific load.
A deep-learning-based method and system for tuning the configuration parameters of a big data system is disclosed in application No. CN201710361578.3. The method comprises: a neural network training step, in which a deep neural network is first constructed, with at least one MapReduce parameter as input, the optimal configuration parameters to be predicted as output, and historical data of the big data system as the training sample set; the MapReduce completion time then serves as the metric for the network, and the weights of each layer of neurons are adjusted by a back-propagation learning rule until the completion time meets the time cost requirement; and a configuration parameter prediction step, in which an initial value of at least one MapReduce parameter is set, current test data is read, and the data is fed into the trained network to obtain the configuration parameters. This method optimizes the configuration parameters of the MapReduce framework through a deep neural network, avoids manual adjustment, and predicts parameters with good practical effect. However, it tunes the configuration parameters of a big data system with a single model and is not suitable for collaborative cross-layer configuration parameter tuning of a big data system under dynamic load.
In real big data processing scenarios, different processing tasks require different configuration parameters to achieve optimal performance, and the configuration parameters of the software in the data storage, resource scheduling, and data processing layers interact in complex ways; adjusting the configuration parameters of a single software component cannot optimize the performance of a big data processing task. Collaborative cross-layer tuning of the big data system software, oriented to dynamic load scenarios, is therefore essential for optimizing the performance of big data processing tasks.
However, solving this problem mainly faces the following challenges:
Ultra-high-dimensional configuration parameter search space: optimizing the configuration parameters of a single software component is inherently limited and cannot optimize the performance of a whole big data processing task. Taking HDFS, Yarn, and Spark as an example, Spark alone has more than one hundred configuration parameters, and the other components are similar; tuning the parameters of all three in combination yields an ultra-high-dimensional parameter space. The configuration space is therefore enormous: direct modeling over it requires a very large data set, and executing configuration tests takes too long. Directly tuning all parameters together is thus infeasible in practice.
Long execution time of a single configuration test: a big data software system has a complex execution flow. Testing the performance of a workload under a given configuration requires first loading the configuration parameters, then executing the workload under that configuration scheme, and finally collecting the performance indicators, which is a very time-consuming process. In cross-layer multi-software collaborative tuning, the search space is large and the configuration feature dimensionality is high; if every candidate configuration were loaded into the software system and executed under a specific workload, the time cost would be even greater, since many configuration schemes must be tested while searching for the optimal parameters. Testing each configuration directly in a real system is therefore infeasible for multi-software collaborative tuning.
If fully online tuning were adopted, the actual production workload would have to be executed while the optimal configuration parameters are sought during its execution; given the long execution time of a single configuration, searching for the optimal configuration in a fully online fashion would also consume a large amount of time.
Dynamically changing workloads: big data processing systems typically face changing workload types in actual deployment. Production workloads are not constant; there are CPU-intensive workloads and IO-intensive workloads. Different workloads have different resource preferences: some need large amounts of storage resources, others require heavy CPU processing. A configuration scheme that is optimal for one type of workload across the whole big data processing framework may perform poorly on other types. No optimal configuration is universal across workload types, and the differences between workloads must be taken into account.
Disclosure of Invention
The technical problem to be solved by the invention is the absence in the prior art of a cross-layer configuration parameter collaborative tuning method for big data systems oriented to dynamic load scenarios.
The invention solves the technical problems through the following technical means:
a dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method for a big data system comprises the following steps:
S1: rank the parameters of each software component in the data storage layer, resource scheduling layer, and data processing layer by importance, and select parameters of a preset dimensionality; assign random values to the selected parameters to generate multiple groups of configurations;
S2: execute the groups of configurations in the big data system to generate a performance label for each specific configuration, finally obtaining a configuration performance matrix;
S3: build a performance model from the configuration performance matrix using a random forest, and store the target performance model for model migration to new workloads;
S4: substitute the target performance model into a genetic algorithm, and use the genetic algorithm to find the set of configuration parameters under which the software of the data storage, resource scheduling, and data processing layers cooperates to yield the best overall system performance for the workload;
S5: when a new workload arrives, compute the similarity between the current new workload and each original workload; according to the result, one of two cases applies:
(1) if the maximum similarity exceeds the threshold, the new workload is considered similar to that original workload; the performance model of the original workload is migrated to the current new workload through a compensation mechanism, and a genetic algorithm then searches that model for the optimal configuration of the new workload;
(2) if no similarity reaches the threshold, the current new workload is considered an unrecorded new type of load; Bayesian optimization is used to select values for the preset-dimensionality parameters online and find a satisfactory configuration under which the new load can execute, while steps S1-S4 are performed for the new load; once the optimal configuration is found, execution switches to it.
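The two-branch dispatch in S5 can be sketched as follows. This is a minimal Python sketch; the returned action tags are illustrative, standing in for the migration procedure of case (1) and the Bayesian-optimization quick start of case (2):

```python
SIM_THRESHOLD = 0.75  # similarity threshold used by the invention

def dispatch(new_workload_sims):
    """Decide how to handle a new workload given its cosine similarity to
    each recorded workload type (a dict: workload name -> similarity)."""
    best_type, best_sim = max(new_workload_sims.items(), key=lambda kv: kv[1])
    if best_sim > SIM_THRESHOLD:
        # Case (1): migrate the recorded workload's performance model.
        return ("migrate", best_type)
    # Case (2): unrecorded load type -> Bayesian-optimization quick start,
    # while steps S1-S4 build a fresh model offline.
    return ("quick_start", None)
```
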
According to the method, 20-dimensional parameters are extracted through importance ranking, which reduces the dimensionality of the configuration parameters and increases the interpretability of the model. The invention uses an ensemble learning method (the random forest algorithm) to build a performance model, taking the configuration parameters of each software component as input and the performance indicator of the workload as output. Note that execution time is not the only possible model output; system throughput or software response time can equally serve as the output. Once the performance model for a specific workload is built, inputting a set of configuration parameters directly yields the performance indicator of the multi-layer big data software for that workload under that configuration, avoiding the large time cost of actually executing each configuration scheme in the multi-layer software to test its performance.
The invention reduces the processing time for new workloads with a performance model migration scheme based on the similarity between workloads. The similarity value between workloads is compared with a pre-established threshold; when it exceeds the threshold, the workloads are considered similar, and a performance model suited to the new workload is obtained by a simple migration of the existing model, avoiding the time cost of retraining. When the threshold is not exceeded, an online search quickly finds a configuration scheme suited to the new workload while an offline performance model is built; once the performance model is ready, the optimal configuration is searched for on that model.
Further, the step S1 includes:
S11: select the most important 20-dimensional parameters using the Lasso algorithm, whose minimization objective is:

F_Lasso = ||y - Xw||_2^2 + α||w||_1

Important characteristic parameters are extracted by adding an L1 penalty term to the least squares objective; w holds the coefficient of each characteristic parameter, and the larger |w| is, the more important the parameter. When the coefficient w of a characteristic parameter is 0, that parameter is discarded. In this way the 20 dimensions with the greatest influence on the model output are extracted;
s12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated; if the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]

where d_p is the default value of parameter p and x is a fixed scaling factor, set to 10 in the invention.
If the configuration parameter takes Boolean or enumeration values, the parameter values are encoded with One-Hot encoding and a value is then selected at random from the value range.
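Steps S11-S12 can be sketched with a minimal NumPy implementation. ISTA (iterative soft-thresholding) is used here as a simple Lasso solver, and the data are synthetic; both are stand-ins for whatever solver and measurements an actual deployment would use:

```python
import numpy as np

def lasso_ista(X, y, alpha=0.1, n_iter=1000):
    """Minimize ||y - Xw||_2^2 + alpha * ||w||_1 by iterative soft-thresholding."""
    step = 1.0 / (2 * np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)              # gradient of the squared loss
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)  # soft threshold
    return w

def top_k_params(w, k=20):
    """Indices of the k parameters with the largest |coefficient| (importance ranking)."""
    return np.argsort(-np.abs(w))[:k]

def sample_numeric(default, x=10, rng=None):
    """Random value for a numeric parameter in range = [d_p / x, d_p * x]."""
    rng = rng or np.random.default_rng()
    return rng.uniform(default / x, default * x)
```

Boolean and enumeration parameters would additionally be One-Hot encoded before sampling, as described above.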
Further, the step S2 includes:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
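Steps S21-S23 amount to: deploy each configuration, run the workload under it, record the performance label, and stack the results. A minimal sketch, in which the `run_workload` callable is a hypothetical stand-in for deploying the parameter values to HDFS/Yarn/Spark and timing the job:

```python
import numpy as np

def build_config_performance_matrix(configs, run_workload):
    """configs: (m, d) array of m configurations over d selected parameters.
    run_workload: callable mapping one configuration to a performance label
    (e.g. execution time). Returns the (m, d+1) configuration-performance
    matrix with the label appended as the last column."""
    labels = np.array([run_workload(c) for c in configs])
    return np.column_stack([configs, labels])
```
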
Further, the step S3 includes:
S31: divide the configuration performance matrix into three parts in a 60% : 20% : 20% ratio, used respectively as the training, validation, and test data sets for random forest performance modeling;
S32: for the data in the training set, use a Bootstrap strategy to draw n sample sets of the same dimensionality; the n sample sets contain the same number of samples, though not necessarily the same samples, and each sample has configuration parameter features of the same dimensionality, though not necessarily the same features;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
S34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average is taken as the final output of the model:

P = (1/n) · (pt_1 + pt_2 + … + pt_n)
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
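Steps S32-S34 (bootstrap sampling, one regression tree per sample set, averaged output) can be illustrated with a deliberately small stand-in: an ensemble of depth-1 regression trees (stumps) trained on bootstrap samples. A real implementation would grow full regression trees, but the bagging-and-average structure is the same:

```python
import numpy as np

def fit_stump(X, y):
    """Best single-split regression tree: (feature, threshold, left_mean, right_mean)."""
    best, best_sse = None, np.inf
    for j in range(X.shape[1]):
        for t in np.unique(X[:, j]):
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_sse, best = sse, (j, t, left.mean(), right.mean())
    return best

def fit_forest(X, y, n_trees=25, rng=None):
    """Bagging: one stump per bootstrap sample (step S32-S33)."""
    rng = rng or np.random.default_rng(0)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap sample
        trees.append(fit_stump(X[idx], y[idx]))
    return trees

def predict(trees, x):
    """Average the n tree outputs {pt_1, ..., pt_n} (step S34)."""
    outs = [(lm if x[j] <= t else rm) for j, t, lm, rm in trees]
    return float(np.mean(outs))
```
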
Further, the step S4 includes:
S41: determine the role of each quantity in the genetic algorithm: the execution time produced by a specific configuration serves as the fitness value, a set of configuration parameters serves as a chromosome, and individual parameter values serve as genes. Extract n groups of parameter configurations from the test set as the initial population of the genetic algorithm:

{C_1, C_2, …, C_n}

where C_1, C_2, …, C_n each denote a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
S43: feed the parameter configurations {C_1, C_2, …, C_n} into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness values corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:

{C_1', C_2', …, C_n'}
S44: substitute the newly generated group of configuration parameters into the performance model, compute the execution time corresponding to each group again, and repeat steps S42-S44 until the optimal configuration is found.
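Steps S41-S44 can be sketched as a minimal real-coded genetic algorithm searching a surrogate model. In this sketch a simple quadratic function stands in for the random-forest performance model, and lower fitness (shorter predicted execution time) is better; the operator choices (uniform crossover, Gaussian mutation, truncation selection) are illustrative, not the invention's exact operators:

```python
import numpy as np

def ga_search(surrogate, dim, pop_size=30, n_gen=60, bounds=(0.0, 1.0), rng=None):
    """surrogate: callable mapping a configuration vector to predicted
    execution time. Returns the best configuration found."""
    rng = rng or np.random.default_rng(0)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, size=(pop_size, dim))     # initial population {C_1..C_n}
    for _ in range(n_gen):
        fit = np.array([surrogate(c) for c in pop])     # fitness values {P_1..P_n}
        parents = pop[np.argsort(fit)[: pop_size // 2]] # keep the better half (elitism)
        children = []
        while len(children) < pop_size - len(parents):
            i, j = rng.integers(len(parents), size=2)
            a, b = parents[i], parents[j]
            mask = rng.random(dim) < 0.5                # uniform crossover
            child = np.where(mask, a, b) + rng.normal(0.0, 0.05, size=dim)  # mutation
            children.append(np.clip(child, lo, hi))
        pop = np.vstack([parents, np.array(children)])  # next generation {C'_1..C'_n}
    fit = np.array([surrogate(c) for c in pop])
    return pop[np.argmin(fit)]
```
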
Further, the step S5 includes:
S51: when a new task arrives, execute it once under a set of initial configurations, and form a vector Vec_i from the system performance indicators collected during this execution;
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
where Vec_i is the vector of system performance indicators collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j. The highest cosine similarity between the new workload and any previous workload is compared with 0.75, which is taken as the similarity threshold;
S53: if the highest cosine similarity exceeds 0.75, the two workloads are considered similar. Suppose that, after the cosine similarity between the newly arrived workload i and all recorded workload types has been computed, the similarity with workload type j is the largest and exceeds 0.75. The optimal configuration C previously found for workload j is then executed under workload i in the system to collect the execution time ET_i; the execution time of workload j under configuration C is ET_j. The difference between the two is used as a compensation to apply the performance model of workload j to workload i, i.e., the predicted execution time of workload i under a configuration C is T_i(C) = T_j(C) + (ET_i - ET_j). A genetic algorithm then searches this compensated model for the optimal configuration;
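S51-S53 amount to a cosine similarity between performance-indicator vectors plus a constant compensation offset. A minimal sketch, assuming the compensated prediction takes the form T_i(C) = T_j(C) + (ET_i - ET_j):

```python
import numpy as np

def cosine_similarity(vec_i, vec_j):
    """sim(i, j) = (Vec_i . Vec_j) / (||Vec_i|| * ||Vec_j||)."""
    return float(np.dot(vec_i, vec_j) /
                 (np.linalg.norm(vec_i) * np.linalg.norm(vec_j)))

def migrated_prediction(model_j, config, et_i, et_j):
    """Apply workload j's performance model to workload i, corrected by the
    constant offset ET_i - ET_j measured at a shared reference configuration."""
    return model_j(config) + (et_i - et_j)
```
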
regarding the workload which does not reach the similarity threshold value, the workload is considered as an unrecorded new type load; aiming at the situation, the invention provides a Bayesian optimization-based quick start scheme, which specifically comprises the following steps:
and (3) selecting the 20-dimensional parameters on line through Bayesian optimization, quickly finding a good enough configuration scheme to execute the workload, executing the steps S1-S4 on the workload, and executing the optimal configuration when the optimal configuration is found.
The invention also provides a dynamic load scene-oriented big data system cross-layer configuration parameter collaborative tuning system, which comprises:
the software configuration parameter acquisition module is used for sorting the importance of the parameters of each software in the data storage layer, the resource scheduling layer and the data processing layer, randomly taking values of the extracted pre-set dimensional parameters and generating a plurality of groups of configurations;
the configuration performance matrix calculation module is used for bringing the configuration into a system to execute, generating a performance label under specific configuration and finally obtaining a configuration performance matrix;
the model building module is used for building a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
the optimal configuration parameter searching module is used for substituting a performance model generated by a random forest into a genetic algorithm, and finding out a set of configuration parameters which enable the overall performance of the system to be best in cooperation with software in a data storage layer, a resource scheduling layer and a data processing layer under the working load through the genetic algorithm;
and the model migration module is used for calculating the similarity between the current new workload and each original workload when the new workload comes, and the following two conditions occur according to the similarity calculation result:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new working load is considered to be the unrecorded new type load, the Bayesian optimization is adopted to select the set dimension parameters on line, the configuration meeting the requirements is found to enable the current new type load to be executed, meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are executed on the new type load, and when the optimal configuration is found, the optimal configuration is used for execution.
Further, the specific execution process of the module for obtaining software configuration parameters includes:
s11: selecting the most important 20-dimensional parameters by using a Lasso algorithm; the objective function for its minimization is:
F_Lasso = ||y - Xw||_2^2 + α||w||_1
extracting important characteristic parameters by adding a penalty term to a least square method; w is the coefficient of each characteristic parameter, the larger w is, the more important the parameter is, when the coefficient w of a certain characteristic parameter is 0, the parameter is rejected, and the 20-dimensional coefficient which has the largest influence on the output of the model is extracted by the method;
s12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated; if the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]

where d_p is the default value of parameter p and x is a fixed scaling factor, set to 10 in the invention.
If the configuration parameter takes Boolean or enumeration values, the parameter values are encoded with One-Hot encoding and a value is then selected at random from the value range.
Further, the specific execution process of the configuration performance matrix calculation module includes:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the workload:
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
The specific execution process of the model building module comprises the following steps:
S31: divide the configuration performance matrix into three parts in a 60% : 20% : 20% ratio, used respectively as the training, validation, and test data sets for random forest performance modeling;
S32: for the data in the training set, use a Bootstrap strategy to draw n sample sets of the same dimensionality; the n sample sets contain the same number of samples, though not necessarily the same samples, and each sample has configuration parameter features of the same dimensionality, though not necessarily the same features;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
S34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average is taken as the final output of the model:

P = (1/n) · (pt_1 + pt_2 + … + pt_n)
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
The specific execution process of the optimal configuration parameter searching module comprises the following steps:
s41: determining the value of each parameter in the genetic algorithm; the execution time generated by specific configuration is used as a fitness value in the genetic algorithm, a set of configuration parameters are used as chromosomes in the genetic algorithm, and specific parameter values are used as genes in the genetic algorithm. Extracting n groups of parameter configurations from the test set as initial populations of the genetic algorithm:
{C_1, C_2, …, C_n}
C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
s43: the configuration parameters {C_1, C_2, …, C_n} are fed into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:
{C_1′, C_2′, …, C_n′}
s44: and substituting the newly generated group of configuration parameters into the performance model, calculating the execution time corresponding to each group of configuration parameters again, and repeating the steps S42-S44. Until an optimal configuration is found.
Further, the specific execution process of the model migration module includes:
s51: when a new task arrives, the new task is initially matched in a setSetting up execution once, and forming vector Vec according to system performance index in the execution period of this execution collectioni
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
Vec_i is the vector of system performance indexes collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between the workload and the previous workloads is taken and compared with 0.75, which is used here as the similarity threshold;
s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered similar. Suppose that, after the cosine similarity between the newly arrived workload i and all recorded types of workloads has been calculated, the similarity with workload type j is the largest and greater than 0.75. The optimal configuration C previously found for workload j is executed in the system under workload i to collect the execution time ET_i; the execution time of workload j under configuration C is ET_j. The difference ET_i - ET_j between the two is calculated as a compensation term for applying the performance model of workload j to workload i, i.e. the predicted execution time of workload i under a configuration C′ is T_i(C′) = T_j(C′) + ET_i - ET_j. A genetic algorithm is then used to search this compensated model for the optimal configuration;
A workload that does not reach the similarity threshold is regarded as a new, unrecorded type of load; for this situation the invention provides a quick-start scheme based on Bayesian optimization, specifically as follows:
the 20-dimensional parameters are tuned online through Bayesian optimization to quickly find a configuration scheme good enough for the workload to execute; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are run for this workload, and once the optimal configuration is found, it is used for execution.
The invention has the advantages that:
the invention designs a dynamic-load-scenario-oriented collaborative tuning method for cross-layer configuration parameters of a big data system: parameters are extracted after importance ranking across the multiple software layers, an optimal configuration scheme is searched after the performance of a workload running on the multi-layer big data software has been modeled, and when a new workload arrives, model migration is performed according to its characteristics so as to adapt to scenarios under different workloads. Because parameters influence one another in intricate ways, cross-layer collaborative tuning avoids the situation where the configuration parameters of certain software are individually optimal while the overall performance of the system is not.
According to the method, the top-ranked parameters are extracted through importance ranking, which reduces the dimensionality of the configuration parameters and increases the interpretability of the model. The invention adopts an ensemble learning method (the random forest algorithm) to establish a performance model that takes the configuration parameters of each software layer as input and a performance index of the workload as output. It should be noted that execution time is not the only metric that can serve as the model output; system throughput and software response time can also be used. Once the performance model for a specific workload is established, the performance index of the multi-layer big data software under a set of configuration parameters can be generated directly from those parameters, avoiding the large time cost of actually executing each configuration scheme on the multi-layer software to test its performance.
The present invention reduces the processing time of new workloads by migrating performance models based on the similarity between workloads. The similarity value between workloads is compared with a pre-established threshold; when the value exceeds the threshold, the workloads are considered similar, and a performance model suitable for the new workload is obtained through a simple migration of the existing performance model, avoiding the time cost of retraining. When the threshold is not exceeded, a scheme suitable for the new workload is quickly found through an online search method while an offline performance model is built; once the performance model is built, the optimal configuration is searched on that model.
Drawings
FIG. 1 is a cross-layer software collaboration flow chart as described in the background of the invention;
FIG. 2 is a flow chart of a cross-layer configuration parameter co-optimization method for a big data system oriented to a dynamic load scenario in an embodiment of the present invention;
FIG. 3 is an exemplary diagram of 200 sets of configuration files produced in an embodiment of the present invention;
FIG. 4 is a flow chart of random forest modeling in an embodiment of the present invention;
fig. 5 is a flowchart illustrating the step S4 of finding the optimal configuration parameters according to the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides a dynamic load scene-oriented cross-layer configuration parameter collaborative optimization method for a big data system, which is different from most other schemes in that the method is suitable for load change, comprehensively adjusts configuration parameters by cooperatively considering big data cross-layer multi-software, and can search optimal configuration by using various measurement standards (such as workload execution time, system throughput and the like). By using the framework, long-time manual test is avoided, and the optimal configuration can be quickly found. As shown in fig. 2, the method specifically includes the following steps:
step S1: and (4) sorting the importance of the parameters of the HDFS, Yarn and Spark, and selecting the most important 20-dimensional parameters as the input features of the model in the step S3. And randomly taking values of the extracted 20-dimensional parameters to generate a plurality of groups of configurations.
As shown in fig. 2, the step S1 specifically executes the following process:
s11: the most important 20-dimensional parameters are selected using the Lasso algorithm. The objective function for its minimization is:
F_Lasso = ||y - Xw||_2^2 + α||w||_1
Important characteristic parameters are extracted by adding an L1 penalty term to the least-squares objective. w contains the coefficient of each characteristic parameter: a larger |w| indicates a more important parameter, and when the coefficient w of a characteristic parameter is 0, the parameter is rejected. By this method, the 20 parameters with the largest influence on the model output are extracted.
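As an illustrative sketch of the Lasso-based ranking of S11 (using scikit-learn; the synthetic data, the choice of alpha, and the assumption that exactly five parameters dominate are all hypothetical, not values fixed by the invention):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic stand-in for real configuration/performance data: 200 measured
# configurations over 50 candidate parameters, where (by construction here)
# only the first 5 parameters actually affect performance.
rng = np.random.default_rng(0)
n_samples, n_params = 200, 50
X = rng.normal(size=(n_samples, n_params))
true_w = np.zeros(n_params)
true_w[:5] = [3.0, -2.0, 1.5, 2.5, -1.0]
y = X @ true_w + rng.normal(scale=0.1, size=n_samples)

# L1-penalised least squares: F = ||y - Xw||_2^2 + alpha * ||w||_1.
lasso = Lasso(alpha=0.1).fit(X, y)

# Rank parameters by |w|; a zero coefficient means the parameter is dropped.
k = 20
top_k = np.argsort(-np.abs(lasso.coef_))[:k]
```

With a sufficiently large alpha, coefficients of irrelevant parameters shrink exactly to zero, so the top-k indices recover the influential parameters.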
S12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated. If the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]
wherein d_p is the default value of the parameter p and x is a fixed scaling factor, set to 10 in the present invention.
If the value of a configuration parameter is a Boolean or enumeration type, the parameter value is encoded in a One-Hot manner and then randomly selected within its value range.
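The random generation of 200 configurations in S12 can be sketched as follows; the parameter names, defaults, and types are hypothetical examples, not values taken from the invention:

```python
import random

# Hypothetical parameter specifications: numeric parameters are sampled from
# [default / x, default * x] with x = 10; boolean/enumeration parameters are
# sampled uniformly from their value set.
PARAM_SPECS = {
    "spark.executor.memory.mb": {"type": "numeric", "default": 2048},
    "dfs.replication": {"type": "numeric", "default": 3},
    "yarn.vmem.check.enabled": {"type": "enum", "values": ["true", "false"]},
}
X_FACTOR = 10

def sample_config(specs, rng):
    cfg = {}
    for name, spec in specs.items():
        if spec["type"] == "numeric":
            d = spec["default"]
            # Random value in [d / x, d * x].
            cfg[name] = rng.uniform(d / X_FACTOR, d * X_FACTOR)
        else:
            # Uniform choice over the enumerated values.
            cfg[name] = rng.choice(spec["values"])
    return cfg

rng = random.Random(42)
configs = [sample_config(PARAM_SPECS, rng) for _ in range(200)]
```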
Step S2: the configuration is brought into the system to be executed, a performance label under the specific configuration is generated, and the execution time of the workload is used as the performance label (other indexes can also be used as the performance label, such as the system throughput). This step ultimately results in a configured performance matrix. The specific execution process comprises the following steps:
s21: and updating corresponding parameter values to the Yarn, Spark and HDFS software through the automatic deployment code according to the 200 sets of configuration parameters generated by the S1.
S22: and adding the workload into a big data processing flow, and acquiring the execution time corresponding to the configuration under the workload.
S23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding execution time.
Step S3: the configuration performance matrix is used to build a model using a random forest. After the model is trained, the corresponding execution time can be output according to an input set of specific configuration parameters. The storage model is migrated with a later model of a new workload. As shown in fig. 4, the specific execution process of step S3 is as follows:
s31: the process configures a performance matrix in 60%: 20%: the configured performance matrix is divided into three parts according to the proportion of 20%, and the three parts are respectively used as a training data set, a verification data set and a test data set for the random forest to perform performance model modeling.
S32: and for the data in the training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimension, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has the configuration parameter features with the same dimension, but the configuration parameter features are not necessarily the same.
S33: and establishing a regression tree model for each sample set, wherein n regression trees are calculated in total.
S34: the output of the n regression trees is { pt1,pt2…ptnAnd taking the average value thereof as the final output of the model.
Figure BDA0002990348620000121
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
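Steps S31-S34 can be sketched with scikit-learn, whose RandomForestRegressor performs the Bootstrap sampling and tree averaging internally; the synthetic data, the response function, and the hyperparameters below are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic configuration-performance matrix: 200 configurations of 20
# parameters with an assumed execution-time response.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 20))
y = 50.0 + 100.0 * X[:, 0] + 30.0 * X[:, 1]

# 60% / 20% / 20% split into training, validation, and test sets.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, random_state=0)

# Bootstrap sampling of n trees and averaging of their outputs (S32-S34)
# is what RandomForestRegressor does internally.
model = RandomForestRegressor(n_estimators=100, bootstrap=True, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)
```

Once fitted, the model predicts an execution time for any configuration vector without actually deploying and running it.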
Step S4: and substituting the performance model generated by the random forest into a genetic algorithm, and finding out the optimal configuration parameters of Yarn, Spark and HDFS under the working load by the genetic algorithm. As shown in fig. 5, the specific implementation procedure of step S4 is as follows:
s41: and determining the value of each parameter in the genetic algorithm. The execution time generated by specific configuration is used as a fitness value in the genetic algorithm, a set of configuration parameters are used as chromosomes in the genetic algorithm, and specific parameter values are used as genes in the genetic algorithm. Extracting n groups of parameter configurations from the test set as initial populations of the genetic algorithm:
{C_1, C_2, …, C_n}
C_1, C_2, …, C_n each represent a specific set of configuration schemes, totaling n sets.
S42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
s43: the configuration parameters {C_1, C_2, …, C_n} are fed into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each set of configurations. The genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:
{C_1′, C_2′, …, C_n′}
s44: and substituting the newly generated group of configuration parameters into the performance model, calculating the execution time corresponding to each group of configuration parameters again, and repeating the steps S42-S44. Until an optimal configuration is found.
Step S5: and when a new workload comes, migrating the performance model according to the similarity.
The specific execution process of step S5 includes:
s51: when a new task arrives, the new task is executed once on a set of initial configuration, and according to the execution, system performance indexes such as 'CPUs affected', 'context-switches', 'cpu-migrations', 'cycles' and the like during execution are collected to form a vector Veci
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
Vec_i is the vector of system performance indexes collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j. The highest cosine similarity between the workload and the previous workloads is taken and compared with 0.75, which is used here as the similarity threshold.
S53: if the highest cosine similarity is greater than 0.75, the two workloads are considered to have similarity. After the cosine similarity between the newly arrived workload i and all types of workloads is calculated, the cosine similarity with the j type of workloads is the maximum and is larger than 0.75. The optimal configuration C found previously by the workload j is placed under the workload i to be executed in the system to collect the execution time ETiThe execution time of the workload j under the configuration C is ETjCalculating the difference ET between the twoj-ETiAs a compensation to apply a performance model of workload j to workload i. I.e. predicted execution time T of workload i under configuration Ci(C)=Ti(C)+ETj-ETi. A genetic algorithm is then used to search this model for its optimal configuration.
A workload that does not reach the similarity threshold is regarded as a new, unrecorded type of load. The invention provides a quick-start scheme based on Bayesian optimization for this situation.
The idea of Bayesian optimization is to generate an initial candidate solution set, then find the next most likely extreme point according to the points, add the point into the set, and repeat the steps until the iteration is terminated. And finally, finding out the point with the maximum function value from the points to be used as the solution of the problem. The method is more effective than grid search and random search because the information of the searched points is utilized in the solving process.
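The Bayesian-optimization loop described above can be sketched as follows; the Gaussian-process surrogate, the lower-confidence-bound acquisition rule, the 5-dimensional search space, and the measured_time stand-in are all assumptions made for illustration (the invention does not fix these choices):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Stand-in for actually running the workload under configuration x and
# measuring its execution time (5 parameters instead of 20, for brevity).
def measured_time(x):
    return float(np.sum((x - 0.4) ** 2))

rng = np.random.default_rng(0)
dim = 5

# Initial candidate solution set.
X_obs = rng.uniform(size=(5, dim))
y_obs = np.array([measured_time(x) for x in X_obs])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
for _ in range(20):
    gp.fit(X_obs, y_obs)
    cand = rng.uniform(size=(256, dim))
    mu, sigma = gp.predict(cand, return_std=True)
    # Lower-confidence-bound acquisition: prefer low predicted time but
    # reward uncertainty, so information from searched points is reused
    # while unexplored regions still get sampled.
    x_next = cand[np.argmin(mu - 1.5 * sigma)]
    X_obs = np.vstack([X_obs, x_next])
    y_obs = np.append(y_obs, measured_time(x_next))

best_config = X_obs[np.argmin(y_obs)]
```

Each iteration adds the most promising point to the set, which is why this converges faster than grid or purely random search.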
The 20-dimensional parameters are tuned online through Bayesian optimization to quickly find a good enough configuration scheme for the workload to execute; meanwhile, steps S1-S4 are executed on the workload, and once the optimal configuration is found, it is used for execution.
The invention designs a dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method for a big data system.
According to the method, the 20-dimensional parameters are extracted through importance ranking, which reduces the dimensionality of the configuration parameters and increases the interpretability of the model. The invention adopts an ensemble learning method (the random forest algorithm) to establish a performance model that takes the configuration parameters of each software layer as input and a performance index of the workload as output. It should be noted that execution time is not the only metric that can serve as the model output; system throughput and software response time can also be used. Once the performance model for a specific workload is established, the performance index of the multi-layer big data software under a set of configuration parameters can be generated directly from those parameters, avoiding the large time cost of actually executing each configuration scheme on the multi-layer software to test its performance.
The present invention reduces the processing time of new workloads by migrating performance models based on the similarity between workloads. The similarity value between workloads is compared with a pre-established threshold; when the value exceeds the threshold, the workloads are considered similar, and a performance model suitable for the new workload is obtained through a simple migration of the existing performance model, avoiding the time cost of retraining. When the threshold is not exceeded, a scheme suitable for the new workload is quickly found through an online search method while an offline performance model is built; once the performance model is built, the optimal configuration is searched on that model.
Based on the above method, this embodiment further provides a dynamic load scenario-oriented collaborative tuning system for cross-layer configuration parameters of a big data system, including:
the software configuration parameter acquisition module is used for sorting the importance of the parameters of each software in the data storage layer, the resource scheduling layer and the data processing layer, randomly taking values of the extracted pre-set dimensional parameters and generating a plurality of groups of configurations;
the configuration performance matrix calculation module is used for bringing the configuration into a system to execute, generating a performance label under specific configuration and finally obtaining a configuration performance matrix;
the model building module is used for building a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
the optimal configuration parameter searching module is used for substituting a performance model generated by a random forest into a genetic algorithm, and finding out a set of configuration parameters which enable the overall performance of the system to be best in cooperation with software in a data storage layer, a resource scheduling layer and a data processing layer under the working load through the genetic algorithm;
and the model migration module is used for calculating the similarity between the current new workload and each original workload when the new workload comes, and the following two conditions occur according to the similarity calculation result:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new working load is considered to be the unrecorded new type load, the Bayesian optimization is adopted to select the set dimension parameters on line, the configuration meeting the requirements is found to enable the current new type load to be executed, meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are executed on the new type load, and when the optimal configuration is found, the optimal configuration is used for execution.
The specific execution process of the module for acquiring the software configuration parameters comprises the following steps:
s11: selecting the most important 20-dimensional parameters by using a Lasso algorithm; the objective function for its minimization is:
F_Lasso = ||y - Xw||_2^2 + α||w||_1
Important characteristic parameters are extracted by adding an L1 penalty term to the least-squares objective; w contains the coefficient of each characteristic parameter: a larger |w| indicates a more important parameter, and when the coefficient w of a characteristic parameter is 0, the parameter is rejected; by this method, the 20 parameters with the largest influence on the model output are extracted;
s12: for these 20-dimensional parameters, 200 sets of configuration parameters are randomly generated; if the parameter is a numerical parameter, randomly taking a value in a range determined by the following formula:
range = [d_p / x, d_p · x]
wherein d_p is the default value of the parameter p and x is a fixed scaling factor, set to 10 in the present invention.
If the value of a configuration parameter is a Boolean or enumeration type, the parameter value is encoded in a One-Hot manner and then randomly selected within its value range.
The specific execution process of the configuration performance matrix calculation module comprises the following steps:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
The execution process of the model building module is specifically as follows:
s31: the process configures a performance matrix in 60%: 20%: the configured performance matrix is divided into three parts according to the proportion of 20%, and the three parts are respectively used as a training data set, a verification data set and a test data set for the random forest to perform performance model modeling.
S32: and for the data in the training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimension, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has the configuration parameter features with the same dimension, but the configuration parameter features are not necessarily the same.
S33: and establishing a regression tree model for each sample set, wherein n regression trees are calculated in total.
S34: the output of the n regression trees is { pt1,pt2…ptnAnd taking the average value thereof as the final output of the model.
Figure BDA0002990348620000161
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
The specific execution process of the optimal configuration parameter searching module comprises the following steps:
s41: determining the value of each parameter in the genetic algorithm; the execution time generated by specific configuration is used as a fitness value in the genetic algorithm, a set of configuration parameters are used as chromosomes in the genetic algorithm, and specific parameter values are used as genes in the genetic algorithm. Extracting n groups of parameter configurations from the test set as initial populations of the genetic algorithm:
{C_1, C_2, …, C_n}
C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into a performance model generated by a random forest to calculate the execution time of each group of parameter configurations:
{P_1, P_2, …, P_n}
s43: the configuration parameters {C_1, C_2, …, C_n} are fed into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next group of configuration parameters:
{C_1′, C_2′, …, C_n′}
s44: and substituting the newly generated group of configuration parameters into the performance model, calculating the execution time corresponding to each group of configuration parameters again, and repeating the steps S42-S44. Until an optimal configuration is found.
The specific execution process of the model migration module comprises the following steps:
s51: when a new task arrives, the new task is executed once on a set of initial configuration, and a vector Vec is formed according to the system performance indexes during the execution and collection of the executioni
S52: the cosine similarity between the workload and the previous workloads of various types is calculated in sequence, and the calculation formula is as follows:
sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)
Vec_i is the vector of system performance indexes collected when workload i is executed under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between the workload and the previous workloads is taken and compared with 0.75, which is used here as the similarity threshold;
s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered to have similarity. After the cosine similarity between the newly arrived workload i and all types of workloads is calculated, the cosine similarity between the newly arrived workload i and j types of workloads is maximum and is larger than 0.75; the optimal configuration C found previously by the workload j is placed under the workload i to be executed in the system to collect the execution time ETiThe execution time of the workload j under the configuration C is ETjCalculating the difference ET between the twoj-ETiAs a compensation to apply a performance model of workload j to workload i. I.e. predicted execution time T of workload i under configuration Ci(C)=Ti(C)+ETj-ETi. Then searching the optimal configuration of the model by using a genetic algorithm;
A workload that does not reach the similarity threshold is regarded as a new, unrecorded type of load; for this situation the invention provides a quick-start scheme based on Bayesian optimization, specifically as follows:
the 20-dimensional parameters are tuned online through Bayesian optimization to quickly find a configuration scheme good enough for the workload to execute; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model establishment module and the optimal configuration parameter searching module are run for this workload, and once the optimal configuration is found, it is used for execution.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. A dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method for a big data system is characterized by comprising the following steps:
s1: sorting the importance of parameters of each software in a data storage layer, a resource scheduling layer and a data processing layer, and selecting a pre-set dimensional parameter; randomly taking values of the extracted pre-set dimensional parameters to generate a plurality of groups of configurations;
s2: bringing a plurality of groups of configurations into a big data system for execution, generating a performance label under a specific configuration, and finally obtaining a configuration performance matrix;
s3: establishing a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
s4: substituting the target performance model into a genetic algorithm, and finding out a set of configuration parameters which enable the overall performance of the system to be best to be expressed by software cooperation in a data storage layer, a resource scheduling layer and a data processing layer under the working load through the genetic algorithm;
s5: when a new workload comes, calculating the similarity between the current new workload and each original workload, and according to the similarity calculation result, the following two situations occur:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new workload is considered an unrecorded new type of load; Bayesian optimization is used to select the preset parameter dimensions online and to find a configuration meeting requirements so that the new type of load can execute; meanwhile, steps S1-S4 are performed for the new type of load, and once the optimal configuration is found, it is used for execution.
2. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S1 includes:
s11: selecting the 20 most important parameter dimensions by using the Lasso algorithm; the objective function it minimizes is:

F_Lasso = ||y - Xw||_2^2 + α||w||_1

that is, a least-squares term plus an L1 penalty term, which extracts the important characteristic parameters; w is the coefficient of each characteristic parameter, and the larger |w| is, the more important the parameter; when the coefficient w of a characteristic parameter is 0 the parameter is rejected, and in this way the 20 dimensions with the largest influence on the model output are extracted;
s12: for these 20 parameter dimensions, randomly generating 200 groups of configuration parameters; if a parameter is numerical, randomly taking a value in the range determined by the following formula:

range = [d_p / x, d_p · x]

wherein d_p is the default value of parameter p, and x is a fixed scaling coefficient with x = 10;
and if a configuration parameter takes a Boolean or enumerated value, encoding the parameter values in One-Hot encoding mode and then randomly selecting a value from the value range.
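Steps S11-S12 can be sketched in Python, assuming scikit-learn's `Lasso`; the data set, the default value of 1024, and the top-k of 3 (instead of 20) are invented illustrations, not the patent's actual parameter set:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic stand-in for a sampled configuration/performance data set:
# 100 configurations over 50 candidate parameters, of which only
# parameters 3, 7 and 21 actually influence the performance label y.
X = rng.normal(size=(100, 50))
true_w = np.zeros(50)
true_w[[3, 7, 21]] = [5.0, -4.0, 3.0]
y = X @ true_w + rng.normal(scale=0.1, size=100)

# Lasso minimizes ||y - Xw||_2^2 + alpha * ||w||_1; parameters whose
# coefficient shrinks to exactly 0 are rejected, the rest ranked by |w|.
lasso = Lasso(alpha=0.1).fit(X, y)
k = 3  # the claim keeps the top 20 dimensions; 3 suffices for this toy data
top_k = np.argsort(-np.abs(lasso.coef_))[:k]

# S12: a numeric parameter is sampled uniformly in [d_p / x, d_p * x], x = 10.
def sample_numeric(default, x=10):
    return rng.uniform(default / x, default * x)

# 200 random configurations over the selected dimensions.
configs = [[sample_numeric(1024) for _ in top_k] for _ in range(200)]
```

One-Hot-encoded Boolean and enumerated parameters would be sampled by index instead of by range, as S12 notes.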
3. The big data system cross-layer configuration parameter collaborative tuning method for the dynamic load scenario as claimed in claim 2, wherein the step S2 includes:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
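The S21-S23 loop reduces to: deploy each configuration, time the workload under it, and append the resulting performance label as the last matrix column. A minimal sketch, with a synthetic `run_workload` standing in for the actual cross-layer deployment and timing:

```python
import numpy as np

def run_workload(config):
    # Stand-in for deploying `config` to the storage, scheduling and
    # processing layers and timing one workload run; returns a synthetic
    # "execution time" so the example stays self-contained.
    return float(sum(v * v for v in config))

# Three toy configurations over two parameters.
configs = [[0.5, 1.0], [1.5, 2.0], [3.0, 0.1]]

# One row per configuration: parameter values plus the performance label.
perf_matrix = np.array([cfg + [run_workload(cfg)] for cfg in configs])
```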
4. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S3 includes:
s31: processing the configuration-performance matrix by dividing it into three parts in the proportion 60%:20%:20%, used respectively as the training data set, validation data set and test data set for random-forest performance model modeling;
s32: for data in a training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimensionality, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has configuration parameter characteristics with the same dimensionality, but the configuration parameter characteristics are not necessarily the same;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
s34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average value is taken as the final output of the model:

P = (pt_1 + pt_2 + … + pt_n) / n
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
5. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S4 includes:
s41: determining the value of each element in the genetic algorithm; taking the execution time generated by a specific configuration as the fitness value in the genetic algorithm, taking a group of configuration parameters as a chromosome in the genetic algorithm, and taking a specific parameter value as a gene in the genetic algorithm; extracting n groups of parameter configurations from the test set as the initial population of the genetic algorithm:

{C_1, C_2, …, C_n}

C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into the performance model generated by the random forest to calculate the execution time of each group of parameter configurations:

{P_1, P_2, …, P_n}

s43: feeding the configuration parameters {C_1, C_2, …, C_n} into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next generation of configuration parameters:

{C_1', C_2', …, C_n'}

s44: substituting the newly generated group of configuration parameters into the performance model, calculating again the execution time corresponding to each group of configuration parameters, and repeating steps S42-S44 until the optimal configuration is found.
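The S41-S44 loop can be sketched compactly, with a toy quadratic `predicted_time` standing in for the random-forest performance model; the population size, crossover and mutation settings are illustrative, not taken from the patent:

```python
import random

random.seed(0)

def predicted_time(cfg):
    # Stand-in for the random-forest performance model: a toy surface
    # whose minimum execution time lies at cfg = (2, 3).
    return (cfg[0] - 2.0) ** 2 + (cfg[1] - 3.0) ** 2

def evolve(pop, generations=60):
    for _ in range(generations):
        pop.sort(key=predicted_time)            # lower predicted time = fitter
        parents = pop[: len(pop) // 2]          # elitist selection
        children = []
        while len(parents) + len(children) < len(pop):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))   # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.3:           # mutate one gene
                i = random.randrange(len(child))
                child[i] += random.gauss(0.0, 0.5)
            children.append(child)
        pop = parents + children
    return min(pop, key=predicted_time)

# Initial population: 30 random two-parameter configurations.
population = [[random.uniform(0, 10), random.uniform(0, 10)] for _ in range(30)]
best = evolve(population)
```

Because each generation is re-scored against the model rather than the live system, the search itself costs no cluster time; only the final candidate configuration needs a real run.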
6. The method for collaborative tuning of cross-layer configuration parameters of big data system facing dynamic load scenario as claimed in claim 1, wherein said step S5 includes:
s51: when a new task arrives, executing it once under a set of initial configurations, and forming a vector Vec_i from the system performance indicators collected during execution;

s52: calculating in turn the cosine similarity between this workload and each previous type of workload, with the calculation formula:

sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)

wherein Vec_i is the vector of system performance indicators collected when workload i executes under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between this workload and the previous workloads is compared with 0.75, where 0.75 is taken as the similarity threshold;

s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered similar; suppose that, after calculating the cosine similarity between the newly arrived workload i and all types of workloads, the similarity with workload type j is the largest and exceeds 0.75; the optimal configuration C previously found for workload j is executed in the system under workload i to collect the execution time ET_i, and the execution time of workload j under configuration C is ET_j; the difference ET_j - ET_i between the two is used as a compensation for applying workload j's performance model to workload i, i.e. the predicted execution time of workload i under a configuration C' is T_i(C') = T_j(C') - (ET_j - ET_i); the genetic algorithm is then used to search this model for the optimal configuration;
a workload that does not reach the similarity threshold is considered an unrecorded new type of load; for this situation a Bayesian-optimization-based quick-start scheme is provided, specifically:
selecting the 20 parameter dimensions online through Bayesian optimization and quickly finding a good-enough configuration scheme under which the workload executes; meanwhile, steps S1-S4 are performed on the workload, and once the optimal configuration is found, it is used for execution.
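One reading of S51-S53 as code; the workload names, indicator vectors and timings are invented placeholders, and the compensation is applied by subtracting ET_j - ET_i from workload j's prediction, consistent with the difference defined in S53:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

THRESHOLD = 0.75  # the similarity threshold from S52

# Invented indicator vectors collected under one fixed initial configuration;
# the workload names are placeholders, not from the patent.
known = {"wordcount": [0.9, 0.2, 0.4], "terasort": [0.1, 0.8, 0.6]}
new_vec = [0.85, 0.25, 0.35]

best_name, best_sim = max(((name, cosine(new_vec, vec)) for name, vec in known.items()),
                          key=lambda t: t[1])

if best_sim > THRESHOLD:
    # S53: migrate workload j's model, subtracting the compensation ET_j - ET_i.
    et_i, et_j = 100.0, 112.0            # invented timings under shared configuration C
    t_j_pred = 95.0                      # workload j's model prediction for some C'
    t_i_pred = t_j_pred - (et_j - et_i)  # compensated prediction for workload i
else:
    t_i_pred = None  # unseen load type: fall back to the Bayesian-optimization cold start
```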
7. A big data system cross-layer configuration parameter collaborative tuning system oriented to a dynamic load scene is characterized by comprising the following components:
the software configuration parameter acquisition module is used for sorting the importance of the parameters of each software in the data storage layer, the resource scheduling layer and the data processing layer, randomly taking values of the extracted pre-set dimensional parameters and generating a plurality of groups of configurations;
the configuration performance matrix calculation module is used for bringing the configuration into a system to execute, generating a performance label under specific configuration and finally obtaining a configuration performance matrix;
the model building module is used for building a performance model by using a configuration performance matrix and utilizing a random forest, and storing a target performance model for carrying out model migration on a new workload;
the optimal configuration parameter searching module is used for substituting the performance model generated by the random forest into a genetic algorithm, and finding, through the genetic algorithm, the set of configuration parameters under which the software in the data storage layer, resource scheduling layer and data processing layer cooperates to deliver the best overall system performance under the workload;
and the model migration module is used for calculating the similarity between the current new workload and each original workload when the new workload comes, and the following two conditions occur according to the similarity calculation result:
(1) if the similarity is the maximum value and is larger than the threshold value, the new working load is considered to be similar to the original working load, the performance model of the original working load is transferred to the performance model of the current new working load through a certain compensation mechanism, and then the optimal configuration of the current new working load is searched on the performance model by using a genetic algorithm;
(2) if the similarity does not reach the threshold, the current new workload is considered an unrecorded new type of load; Bayesian optimization is used to select the preset parameter dimensions online and to find a configuration meeting requirements so that the new type of load can execute; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model building module and the optimal configuration parameter searching module are executed for the new type of load, and once the optimal configuration is found, it is used for execution.
8. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific execution process of the software configuration parameter obtaining module includes:
s11: selecting the 20 most important parameter dimensions by using the Lasso algorithm; the objective function it minimizes is:

F_Lasso = ||y - Xw||_2^2 + α||w||_1

that is, a least-squares term plus an L1 penalty term, which extracts the important characteristic parameters; w is the coefficient of each characteristic parameter, and the larger |w| is, the more important the parameter; when the coefficient w of a characteristic parameter is 0 the parameter is rejected, and in this way the 20 dimensions with the largest influence on the model output are extracted;
s12: for these 20 parameter dimensions, randomly generating 200 groups of configuration parameters; if a parameter is numerical, randomly taking a value in the range determined by the following formula:

range = [d_p / x, d_p · x]

wherein d_p is the default value of parameter p, and x is a fixed scaling coefficient with x = 10;
and if a configuration parameter takes a Boolean or enumerated value, encoding the parameter values in One-Hot encoding mode and then randomly selecting a value from the value range.
9. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the configuration performance matrix calculation module specifically executes a process including:
s21: according to the 200 sets of configuration parameters generated by S1, updating corresponding parameter values to corresponding software in a data storage layer, a resource scheduling layer and a data processing layer through an automatic deployment code;
s22: adding the workload into a big data processing flow, and acquiring a performance label corresponding to the configuration under the workload;
s23: and generating a configuration performance matrix according to the configuration obtained in the last step and the corresponding performance label.
10. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific implementation process of the model building module includes:
s31: dividing the configuration-performance matrix into three parts in the proportion 60%:20%:20%, used respectively as the training data set, validation data set and test data set for random-forest performance model modeling;
s32: for data in a training set, adopting a Bootstrap strategy to extract n groups of sample sets with the same dimensionality, wherein the n groups of sample sets have the same number of samples, but the samples are not necessarily the same, and each sample has configuration parameter characteristics with the same dimensionality, but the configuration parameter characteristics are not necessarily the same;
s33: establishing a regression tree model for each sample set, wherein n regression trees are calculated;
s34: the outputs of the n regression trees are {pt_1, pt_2, …, pt_n}; their average value is taken as the final output of the model:

P = (pt_1 + pt_2 + … + pt_n) / n
And storing the characteristics of the workload and its corresponding performance model in the system for future comparison and migration of new workloads.
11. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific implementation process of the optimal configuration parameter searching module includes:
s41: determining the value of each element in the genetic algorithm; taking the execution time generated by a specific configuration as the fitness value in the genetic algorithm, taking a group of configuration parameters as a chromosome in the genetic algorithm, and taking a specific parameter value as a gene in the genetic algorithm; extracting n groups of parameter configurations from the test set as the initial population of the genetic algorithm:

{C_1, C_2, …, C_n}

C_1, C_2, …, C_n each represent a specific configuration scheme, n groups in total;
s42: substituting the n groups of parameter configurations into the performance model generated by the random forest to calculate the execution time of each group of parameter configurations:

{P_1, P_2, …, P_n}

s43: feeding the configuration parameters {C_1, C_2, …, C_n} into the genetic algorithm, with {P_1, P_2, …, P_n} as the fitness value corresponding to each group of configurations; the genetic algorithm performs a series of crossover and mutation operations on the original n groups of configurations to generate the next generation of configuration parameters:

{C_1', C_2', …, C_n'}

s44: substituting the newly generated group of configuration parameters into the performance model, calculating again the execution time corresponding to each group of configuration parameters, and repeating steps S42-S44 until the optimal configuration is found.
12. The dynamic load scenario-oriented big data system cross-layer configuration parameter collaborative tuning system according to claim 7, wherein the specific execution process of the model migration module includes:
s51: when a new task arrives, executing it once under a set of initial configurations, and forming a vector Vec_i from the system performance indicators collected during execution;

s52: calculating in turn the cosine similarity between this workload and each previous type of workload, with the calculation formula:

sim(i, j) = (Vec_i · Vec_j) / (||Vec_i|| · ||Vec_j||)

wherein Vec_i is the vector of system performance indicators collected when workload i executes under the fixed initial configuration, and Vec_j is the corresponding vector for workload j; the highest cosine similarity between this workload and the previous workloads is compared with 0.75, where 0.75 is taken as the similarity threshold;

s53: if the highest cosine similarity is greater than 0.75, the two workloads are considered similar; suppose that, after calculating the cosine similarity between the newly arrived workload i and all types of workloads, the similarity with workload type j is the largest and exceeds 0.75; the optimal configuration C previously found for workload j is executed in the system under workload i to collect the execution time ET_i, and the execution time of workload j under configuration C is ET_j; the difference ET_j - ET_i between the two is used as a compensation for applying workload j's performance model to workload i, i.e. the predicted execution time of workload i under a configuration C' is T_i(C') = T_j(C') - (ET_j - ET_i); the genetic algorithm is then used to search this model for the optimal configuration;
a workload that does not reach the similarity threshold is considered an unrecorded new type of load; for this situation a Bayesian-optimization-based quick-start scheme is provided, specifically:
selecting the 20 parameter dimensions online through Bayesian optimization and quickly finding a good-enough configuration scheme under which the workload executes; meanwhile, the software configuration parameter acquisition module, the configuration performance matrix calculation module, the model building module and the optimal configuration parameter searching module are executed on the workload, and once the optimal configuration is found, it is used for execution.
CN202110313931.7A 2021-03-24 2021-03-24 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system Pending CN113032367A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110313931.7A CN113032367A (en) 2021-03-24 2021-03-24 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system


Publications (1)

Publication Number Publication Date
CN113032367A true CN113032367A (en) 2021-06-25

Family

ID=76473439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110313931.7A Pending CN113032367A (en) 2021-03-24 2021-03-24 Dynamic load scene-oriented cross-layer configuration parameter collaborative tuning method and system for big data system

Country Status (1)

Country Link
CN (1) CN113032367A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023097568A1 (en) * 2021-12-01 2023-06-08 中国科学院深圳先进技术研究院 Method for adjusting and optimizing configuration parameters of stream data processing system on basis of bayesian optimization
CN117130460A (en) * 2023-04-14 2023-11-28 荣耀终端有限公司 Method, device, server and storage medium for reducing power consumption
WO2024045836A1 (en) * 2022-08-30 2024-03-07 华为技术有限公司 Parameter adjustment method and related device

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080215512A1 (en) * 2006-09-12 2008-09-04 New York University System, method, and computer-accessible medium for providing a multi-objective evolutionary optimization of agent-based models
CN103064664A (en) * 2012-11-28 2013-04-24 华中科技大学 Hadoop parameter automatic optimization method and system based on performance pre-evaluation
WO2013070940A1 (en) * 2011-11-08 2013-05-16 Mettler-Toledo, LLC Configuration of a metrologically sealed device via a passive rf interface
CN106202431A (en) * 2016-07-13 2016-12-07 华中科技大学 A kind of Hadoop parameter automated tuning method and system based on machine learning
CN108229547A (en) * 2017-12-27 2018-06-29 东南大学 A kind of gear distress recognition methods based on partial model transfer learning
CN108234177A (en) * 2016-12-21 2018-06-29 深圳先进技术研究院 A kind of HBase configuration parameter automated tunings method and device, user equipment
CN108491226A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Spark based on cluster scaling configures parameter automated tuning method
CN110727506A (en) * 2019-10-18 2020-01-24 北京航空航天大学 SPARK parameter automatic tuning method based on cost model
CN111176832A (en) * 2019-12-06 2020-05-19 重庆邮电大学 Performance optimization and parameter configuration method based on memory computing framework Spark
CN112463763A (en) * 2020-11-19 2021-03-09 东北大学 RF algorithm-based MySQL database parameter screening method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YE, QW (YE, QIANWEN) et al.: "Profiling-Based Big Data Workflow Optimization in a Cross-layer Coupled Design Framework", Algorithms and Architectures for Parallel Processing *
HUA, Xingcheng: "Research on Application Performance Optimization Methods for Big Data Processing", China Doctoral Dissertations Full-text Database, Information Science and Technology *
ZHANG, Ni: "Research and Implementation of Performance Tuning Methods for the Distributed Storage System HBase", China Master's Theses Full-text Database, Information Science and Technology *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210625