WO2022111125A1 - Random-forest-based automatic optimization method for graphic data processing framework - Google Patents

Random-forest-based automatic optimization method for graphic data processing framework

Info

Publication number
WO2022111125A1
WO2022111125A1 (PCT/CN2021/124378; CN2021124378W)
Authority
WO
WIPO (PCT)
Prior art keywords
data set
configuration parameters
training
program
data processing
Prior art date
Application number
PCT/CN2021/124378
Other languages
French (fr)
Chinese (zh)
Inventor
陈超
辛锦瀚
杨永魁
王峥
喻之斌
郭伟钰
刘江佾
Original Assignee
深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院 (Shenzhen Institute of Advanced Technology)
Publication of WO2022111125A1 (patent/WO2022111125A1/en)

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/243 Classification techniques relating to the number of classes
    • G06F 18/24323 Tree-organised classifiers
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/48 Program initiating; Program switching, e.g. by interrupt
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/12 Computing arrangements based on biological models using genetic models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/12 Computing arrangements based on biological models using genetic models
    • G06N 3/126 Evolutionary algorithms, e.g. genetic algorithms or genetic programming

Definitions


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Genetics & Genomics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Medical Informatics (AREA)
  • Physiology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Disclosed is a random-forest-based automatic optimization method for a graph data processing framework. The method comprises: constructing a training data set, in which each sample represents a correspondence between a configuration parameter combination of the graph data processing framework, the size of an input data set, and a program running time; training, on the basis of the training data set, a random forest model that includes a plurality of decision trees, and taking the trained random forest model as a performance prediction model for predicting, in combination with the size of the input data set, the corresponding program running time for different configuration parameter combinations; and, in a search space of configuration parameters, using the performance prediction model to predict, for different input data set sizes, the performance of different configuration parameter combinations generated by a genetic algorithm, so as to obtain optimal configuration parameters. By means of the present invention, the size of an input data set can be sensed, and deep, high-performance automatic optimization of configuration parameters is realized.

Description

An Automatic Tuning Method for a Graph Data Processing Framework Based on Random Forest

Technical Field

The invention relates to the technical field of big data processing, and more particularly to an automatic tuning method for a graph data processing framework based on random forests.

Background Art

With the development of the Internet industry and its technologies, the scale and importance of graph data processing in the big data field keep growing. Take the Spark GraphX framework as an example: it is an embedded graph processing framework built on Apache Spark on top of a distributed dataflow system. Spark GraphX provides a familiar, configurable graph abstraction that is sufficient to represent existing graph structures and can be implemented with a few basic dataflow operators (such as join, map, and group-by). At the same time, Spark GraphX recasts specific graph optimizations as distributed join optimization and materialized view maintenance, and leverages the distributed dataflow framework to provide low-cost fault tolerance for graph processing.

The performance of Spark GraphX is mainly determined by its configuration parameters, and an unreasonable configuration severely degrades the performance of the framework. Spark officially recommends a set of default configuration parameters; however, in real graph data processing tasks the defaults cannot adapt to changes in computing resources and workloads, which limits the performance of Spark GraphX and wastes a large amount of computing resources. Spark GraphX has a large number of configuration parameters, and different parameters interact with one another, so manual tuning is difficult and costly. An automatic tuning method for Spark GraphX configuration parameters is therefore of great research value.
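The patent does not enumerate which configuration parameters are selected, so the search space below is purely illustrative: it lists a handful of Spark settings that are commonly tuned (all real Spark properties) together with assumed candidate value ranges, and it is reused by the sketches later in this description.

    # Illustrative configuration search space (an assumption, not taken from the patent).
    # Each key is a real Spark property; the candidate values are arbitrary examples.
    CONFIG_SPACE = {
        "spark.executor.cores":      [1, 2, 4, 8],
        "spark.executor.memory":     ["2g", "4g", "8g", "16g"],
        "spark.executor.instances":  [2, 4, 8, 16],
        "spark.default.parallelism": [50, 100, 200, 400],
        "spark.memory.fraction":     [0.4, 0.6, 0.75, 0.9],
    }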
The existing Spark GraphX framework optimization methods only implement a series of system-level optimizations around the limitations imposed by the graph-parallel abstraction and sparse graph structures. Their optimization targets are mainly graph properties and the graph system itself: building on classic techniques from traditional database systems, they optimize indexing, incremental view maintenance and joins, as well as the standard dataflow operators in Spark, achieving performance parity with dedicated graph processing systems. However, the existing Spark GraphX optimization methods only optimize for the characteristics of the graph data itself and the internal implementation of the Spark GraphX system; they do not consider the impact of runtime configuration parameters and input data set size on Spark GraphX performance, so the optimization effect is poor. Moreover, the machine learning algorithms used by the existing Spark GraphX optimization methods perform poorly and are not suited to the Spark GraphX parameter tuning scenario.
Summary of the Invention

The purpose of the present invention is to overcome the above-mentioned defects of the prior art and to provide an automatic tuning method for a graph data processing framework based on random forests, which can be applied to configuration parameter optimization of graph processing frameworks such as Spark GraphX.

The technical solution of the present invention is an automatic tuning method for a graph data processing framework based on random forests, comprising the following steps:

constructing a training data set, in which each sample represents the correspondence between a configuration parameter combination of the graph data processing framework, the input data set size, and the program running time;

training, on the training data set, a random forest model comprising multiple decision trees, where the training set of each decision tree is generated from the training data set by bootstrap aggregation (bagging), and using the trained random forest model as a performance prediction model that predicts the corresponding program running time for different configuration parameter combinations in combination with the input data set size;

in the search space of configuration parameters, using the performance prediction model to predict, for different input data set sizes, the performance of the different configuration parameter combinations generated by a genetic algorithm, and thereby obtaining the optimal configuration parameters.

Compared with the prior art, the present invention has the advantage that, on a heterogeneous machine cluster, it takes the configuration parameters of the graph processing framework as the optimization target, realizes automatic parameter tuning, automatically perceives the data set size, and finally finds the best configuration for running the program. Targeting the characteristics of parameter tuning for graph processing frameworks, the invention selects the random forest algorithm (RF) and combines it with a genetic algorithm (GA) while automatically perceiving the scale of the input data set, achieving deep and efficient parameter tuning.
Other features and advantages of the present invention will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Brief Description of the Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention.

Fig. 1 is a flowchart of an automatic tuning method for a graph data processing framework based on random forests according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the process of the automatic tuning method for a graph data processing framework based on random forests according to an embodiment of the present invention;

Fig. 3 is a comparison of the effect of the prior art and an embodiment of the present invention;

Fig. 4 shows the optimization effect of accelerating the running of Spark GraphX programs according to an embodiment of the present invention.
Detailed Description

Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that, unless specifically stated otherwise, the relative arrangement of components and steps, the numerical expressions and the numerical values set forth in these embodiments do not limit the scope of the invention.

The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or its uses.

Techniques, methods, and devices known to those of ordinary skill in the relevant art may not be discussed in detail, but, where appropriate, they should be considered part of the specification.

In all examples shown and discussed herein, any specific value should be construed as merely illustrative rather than limiting; other instances of the exemplary embodiments may therefore have different values.

It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item has been defined in one figure, it need not be discussed further in subsequent figures.

The present invention can be applied to various types of big data processing frameworks, such as Spark GraphX, PowerGraph, and TinkerPop. For ease of understanding, the Spark GraphX framework is used as an example in the following description.
As shown in Fig. 1 and Fig. 2, the provided automatic tuning method for a graph data processing framework based on random forests comprises the following steps.

Step S110: construct a training data set, in which each sample represents the correspondence between a configuration parameter combination of the graph data processing framework, the input data set size, and the program running time.

This step is the data collection part. It includes a parameter generator that automatically generates parameters for each run of the Spark GraphX program to be optimized; after each run, the program running time is collected automatically and combined with the configuration parameters used and the data set size to form one sample. After the Spark GraphX program has been run many times, a sample collection, i.e. the training data set, is finally obtained.

Specifically, the parameter generator (Conf Generator) first selects the parameters that significantly affect Spark GraphX performance; next, according to the selected parameters, it automatically generates and assigns parameter values for runs of the program to be optimized; then the program to be optimized is run automatically with several groups of generated parameters. After each run, the configuration parameters used by the Spark GraphX program and the input data set size at runtime are collected and combined with the Spark GraphX program running time as one sample in the training data set. In this way, a training data set consisting of many samples is obtained after multiple runs.
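A minimal sketch of this data collection loop follows, assuming the illustrative CONFIG_SPACE above. The helper run_graphx_job and the way the data set size is recorded are assumptions for illustration; the patent only specifies that each sample combines the configuration used, the input data set size, and the measured running time.

    import random
    import subprocess
    import time

    def sample_configuration(space):
        # Conf Generator: draw one random value for every selected parameter.
        return {name: random.choice(values) for name, values in space.items()}

    def run_graphx_job(app_jar, main_class, dataset_path, conf):
        # Hypothetical runner: submit the program with the chosen --conf flags
        # and return the measured wall-clock running time in seconds.
        cmd = ["spark-submit", "--class", main_class]
        for key, value in conf.items():
            cmd += ["--conf", f"{key}={value}"]
        cmd += [app_jar, dataset_path]
        start = time.time()
        subprocess.run(cmd, check=True)
        return time.time() - start

    def collect_training_data(app_jar, main_class, datasets, n_runs):
        # One sample = (configuration parameters, input data set size, running time).
        samples = []
        for _ in range(n_runs):
            conf = sample_configuration(CONFIG_SPACE)
            dataset_path, dataset_size = random.choice(datasets)  # size, e.g. number of edges
            runtime = run_graphx_job(app_jar, main_class, dataset_path, conf)
            samples.append({"conf": conf, "input_size": dataset_size, "runtime": runtime})
        return samples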
Step S120: use the training data set to train a random forest model comprising multiple decision trees as the performance prediction model.

This step uses the training data set produced in the data collection stage to build a model with a machine learning algorithm; the goal is a performance prediction model that reflects how different configuration parameters and different input data set sizes affect program execution performance.

Preferably, the random forest algorithm is used for modeling, and the performance prediction model is obtained through the following steps:

Step S121: using the bootstrap aggregation (bagging) algorithm, draw m samples from the training data set; repeat this sampling n_tree times (n_tree being the number of decision trees in the random forest), and generate n_tree training sets, or training subsets, from these samples.

Step S122: use these training subsets to train n_tree decision tree models.

Step S123: for a single decision tree model, first randomly select a subset containing k Spark GraphX attributes (for example, Spark GraphX parameters and the data set size) from the attribute set of the node, and then, at each split, select the best attribute from this subset according to information gain or the Gini index, as illustrated by the helper sketched after step S125.

Step S124: each decision tree is split according to this rule until all training samples at the node belong to the same class.

Step S125: the decision trees generated in this way form the random forest model. The output of the random forest model can be determined either by a vote of the tree classifiers for a final classification result, or by the mean of the trees' predicted values for a final prediction.
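The split criterion of step S123 can be illustrated with a small helper. This is a generic sketch of Gini-based attribute selection rather than code from the patent; when the prediction target is a continuous running time, a variance-reduction (mean-squared-error) criterion would play the same role.

    from collections import Counter

    def gini(labels):
        # Gini index of a set of class labels: 1 - sum(p_i ** 2).
        n = len(labels)
        return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

    def best_attribute(rows, labels, candidate_attrs):
        # Among the k randomly chosen attributes, pick the split that minimises
        # the weighted Gini index of the resulting child nodes.
        best, best_score = None, float("inf")
        for attr in candidate_attrs:
            score = 0.0
            for value in set(row[attr] for row in rows):
                idx = [i for i, row in enumerate(rows) if row[attr] == value]
                score += len(idx) / len(rows) * gini([labels[i] for i in idx])
            if score < best_score:
                best, best_score = attr, score
        return best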
The trained random forest model is the performance prediction model, or simply the performance model, and can be used to predict the Spark GraphX program running time for different configuration parameters combined with the input data set size.

Step S124 above can be understood as stopping the splitting once the target accuracy is reached. As the random forest algorithm proceeds, the change in execution time becomes smaller and smaller and the model becomes more and more accurate; increasing the number of decision trees helps to alleviate the overfitting problem. The values n_tree, m, and k mentioned above are integers greater than or equal to 2 and can be set appropriately according to the required accuracy or execution speed. An illustrative implementation of this training procedure is sketched below.
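The following sketch is one possible rendering of steps S121 to S125, built on scikit-learn's DecisionTreeRegressor; the library choice and the numeric encoding of parameter values (for example, "4g" mapped to 4) are assumptions, not something the patent prescribes. The final prediction is taken as the mean of the trees' predicted running times, matching the regression reading of step S125.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def encode(sample, param_names):
        # Feature vector = configuration parameter values + input data set size.
        # Assumes the parameter values have already been mapped to numbers.
        return [sample["conf"][p] for p in param_names] + [sample["input_size"]]

    def train_random_forest(samples, param_names, n_tree=100, m=None, k=None):
        X = np.array([encode(s, param_names) for s in samples], dtype=float)
        y = np.array([s["runtime"] for s in samples], dtype=float)
        m = m or len(X)                                  # bootstrap sample size (S121)
        k = k or max(2, int(np.sqrt(X.shape[1])))        # attributes per split (S123)
        trees = []
        for _ in range(n_tree):
            idx = np.random.randint(0, len(X), size=m)   # sampling with replacement (S121)
            tree = DecisionTreeRegressor(max_features=k) # random attribute subset per split
            tree.fit(X[idx], y[idx])                     # grow the tree (S122, S124)
            trees.append(tree)
        return trees

    def predict_runtime(trees, conf, input_size, param_names):
        # S125: the final prediction is the mean of the trees' predicted values.
        x = np.array([[conf[p] for p in param_names] + [input_size]], dtype=float)
        return float(np.mean([tree.predict(x)[0] for tree in trees]))

This predict_runtime function can then stand in for an actual Spark GraphX run when the genetic algorithm of step S130 evaluates candidate configurations.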
Step S130: in the search space of configuration parameters, use the performance prediction model to predict, for different input data set sizes, the performance of the different configuration parameter combinations generated by the genetic algorithm, and thereby obtain the optimal configuration parameters.

In this step, a genetic algorithm performs an iterative search based on the performance prediction model and finally selects the optimal configuration parameters.

During the search stage, the performance prediction model is used to predict how well the different configuration parameter combinations generated by the genetic algorithm perform in Spark GraphX under different input data set sizes. This avoids actually running the program, makes the search efficient, and finally yields optimal configuration parameters that can be used directly in Spark GraphX programs, thereby improving Spark GraphX performance.

Specifically, the search process for the optimal configuration comprises the following steps (an illustrative sketch follows step S134):

Step S131: randomly input a group of configuration parameters from the configuration parameter search space and compute the initial individual fitness standard A with the performance prediction model.

For example, the program execution time output by the performance prediction model for the actual input data set size is used as the individual fitness criterion.

Step S132: randomly select n groups of configuration parameters (for example, n greater than one fifth of the number of training samples) from the configuration parameter search space as the initial population P, and apply a random crossover operation and a mutation operation with a mutation rate of 0.02 to each individual in P.

The mutation rate can also be set to other values according to the size of the configuration parameter search space or the requirements on execution speed.

Step S133: use the performance prediction model to compute the fitness of the population P and its offspring, select the individuals whose fitness is higher than A to form a new population P', and take the fitness A' of the fittest individual as the new fitness standard.

Step S134: repeat S132 and S133 until no better individual can be generated; the current best individual is then the optimal configuration parameter combination found by the search.
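A minimal sketch of the search loop of steps S131 to S134 follows. It assumes the predict_runtime model sketched above and a numerically encoded search space; the individual encoding, uniform crossover, and stopping test are illustrative choices, with only the 0.02 mutation rate and the population-size rule coming from the text. Fitness is taken as the negative predicted running time, so a higher fitness means a faster configuration.

    import random

    def random_individual(space):
        return {name: random.choice(values) for name, values in space.items()}

    def crossover(a, b):
        # Uniform crossover: each parameter is inherited from one of the two parents.
        return {name: random.choice([a[name], b[name]]) for name in a}

    def mutate(individual, space, rate=0.02):
        # With probability `rate`, resample a parameter from its value range (S132).
        return {name: (random.choice(space[name]) if random.random() < rate else value)
                for name, value in individual.items()}

    def fitness(individual, input_size, trees, param_names):
        # Higher fitness = shorter predicted running time; no real Spark run is needed.
        return -predict_runtime(trees, individual, input_size, param_names)

    def ga_search(space, input_size, trees, param_names, pop_size, max_rounds=100):
        # pop_size: e.g. more than one fifth of the number of training samples (S132).
        population = [random_individual(space) for _ in range(pop_size)]
        best = max(population, key=lambda i: fitness(i, input_size, trees, param_names))
        best_fit = fitness(best, input_size, trees, param_names)          # standard A (S131)
        for _ in range(max_rounds):
            offspring = [mutate(crossover(*random.sample(population, 2)), space)
                         for _ in range(pop_size)]
            scored = [(fitness(i, input_size, trees, param_names), i)
                      for i in population + offspring]
            better = [(f, i) for f, i in scored if f > best_fit]          # keep fitter than A (S133)
            if not better:                                                # no improvement: stop (S134)
                break
            better.sort(key=lambda pair: pair[0], reverse=True)
            best_fit, best = better[0]
            population = [i for _, i in better][:pop_size]
            while len(population) < 2:
                population.append(best)
        return best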
The invention automatically collects run data of the program to be optimized and combines random forest modeling with a genetic algorithm to achieve high-performance automatic parameter tuning. The crossover and mutation operations of the genetic algorithm prevent the search from getting stuck in a local optimum while preserving excellent search performance.

To further verify the effect of the present invention, experiments were carried out. The configuration parameters of the Spark GraphX framework were automatically optimized for the Spark GraphX test programs officially provided by Spark, namely PageRank (PR), NWeight (NW), Connected Components (CC), and Triangle Counting (TC).

In the experiments, the two algorithms most commonly used in existing Spark GraphX optimization techniques, the decision tree algorithm (DT) and the support vector machine algorithm (SVM), were selected for a performance comparison with the random forest algorithm (RF) used by the present invention; in addition, the optimization method of the present invention was applied directly to Spark GraphX with different input data sets to evaluate its optimization effect.

Fig. 3 compares the modeling effect of the decision tree algorithm (DT) and support vector machine algorithm (SVM) commonly used in the prior art with that of the random forest algorithm (RF) used by the present invention on the two selected Spark GraphX programs. The experimental results in the figure clearly show that the modeling accuracy of the RF algorithm of the present invention (right-hand bars) is higher than that of the other algorithms for the different programs; on average, the modeling accuracy of the RF algorithm is 26.1% higher than that of the DT algorithm and 10.6% higher than that of the SVM algorithm. The modeling method used by the present invention is therefore superior.

Fig. 4 shows the optimization of the running speed of Spark GraphX programs. Because the optimization method of the present invention automatically configures reasonable parameters for different programs and different input data set sizes, compared with the official default configuration (left-hand bars), the optimization method of the invention (right-hand bars) significantly improves the running speed of Spark GraphX: 2.0 times faster on average and up to 2.8 times faster.

The experimental results show that the optimization method of the present invention realizes automatic parameter tuning of Spark GraphX with better optimization performance than the prior art, and it can find the corresponding optimal configuration for different input data set sizes. Under different program workloads, it significantly improves the data processing speed of Spark GraphX compared with the official default configuration.

In summary, the present invention proposes a method that perceives the input data set size and automatically tunes the configuration parameters of Spark GraphX, thereby realizing deep, high-performance automatic optimization of Spark GraphX. Compared with the prior art, the effect of the present invention is mainly reflected in the following aspects:

1) The existing Spark GraphX automatic optimization methods only target graph properties and the graph system and are implemented by optimizing internal code and graph query algorithms; they do not go as deep as optimizing the runtime configuration parameters, even though the configuration parameters directly affect Spark GraphX performance to a greater extent, so the existing optimization methods are not deep enough. The present invention performs automatic parameter tuning based on the configuration parameters of Spark GraphX and thus realizes deep parameter tuning of Spark GraphX.

2) The existing Spark GraphX optimization methods do not consider the impact of the input data set size on performance; however, since Spark GraphX uses Spark as its underlying computing framework, it is very sensitive to the input data set size, so this factor cannot be ignored. The optimization method proposed by the invention automatically perceives the input data set size and, by combining the random forest algorithm with a genetic algorithm, realizes high-performance automatic parameter tuning of Spark GraphX.

3) The machine learning algorithms used by the existing Spark GraphX optimization methods perform poorly and do not meet the requirements of automatic Spark GraphX parameter tuning. The invention combines the random forest algorithm with a genetic algorithm and proposes a method better suited to automatic parameter tuning of Spark GraphX.
本发明可以是***、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于使处理器实现本发明的各个方面的计算机可读程序指令。The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions loaded thereon for causing a processor to implement various aspects of the present invention.
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是――但不限于――电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(CD-ROM)、数字多功能盘(DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计 算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。A computer-readable storage medium may be a tangible device that can hold and store instructions for use by the instruction execution device. The computer-readable storage medium may be, for example, but not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (non-exhaustive list) of computer readable storage media include: portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM) or flash memory), static random access memory (SRAM), portable compact disk read only memory (CD-ROM), digital versatile disk (DVD), memory sticks, floppy disks, mechanically coded devices, such as printers with instructions stored thereon Hole cards or raised structures in grooves, and any suitable combination of the above. Computer-readable storage media, as used herein, are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (eg, light pulses through fiber optic cables), or through electrical wires transmitted electrical signals.
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。The computer readable program instructions described herein may be downloaded to various computing/processing devices from a computer readable storage medium, or to an external computer or external storage device over a network such as the Internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from a network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
用于执行本发明操作的计算机程序指令可以是汇编指令、指令集架构(ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(LAN)或广域网(WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(PLA),该电子电路可以执行计算机可读程序指令,从而实现本发明的各个方面。The computer program instructions for carrying out the operations of the present invention may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state setting data, or instructions in one or more programming languages. Source or object code, written in any combination, including object-oriented programming languages, such as Smalltalk, C++, etc., and conventional procedural programming languages, such as the "C" language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server implement. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (eg, using an Internet service provider through the Internet connect). In some embodiments, custom electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGAs), or programmable logic arrays (PLAs), can be personalized by utilizing state information of computer readable program instructions. Computer readable program instructions are executed to implement various aspects of the present invention.
Aspects of the present invention are described herein with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It should be understood that each block of the flowcharts and/or block diagrams, and combinations of blocks in the flowcharts and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams. These computer-readable program instructions may also be stored in a computer-readable storage medium; the instructions cause a computer, a programmable data processing apparatus, and/or other devices to operate in a specific manner, so that the computer-readable medium storing the instructions comprises an article of manufacture that includes instructions implementing various aspects of the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The computer-readable program instructions may also be loaded onto a computer, another programmable data processing apparatus, or another device, so that a series of operational steps are performed on the computer, the other programmable data processing apparatus, or the other device to produce a computer-implemented process, such that the instructions executed on the computer, the other programmable data processing apparatus, or the other device implement the functions/acts specified in one or more blocks of the flowcharts and/or block diagrams.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of instructions, which contains one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two successive blocks may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending on the functionality involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or acts, or by a combination of dedicated hardware and computer instructions. It is well known to those skilled in the art that implementation in hardware, implementation in software, and implementation by a combination of software and hardware are all equivalent.
The embodiments of the present invention have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or the technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the present invention is defined by the appended claims.

Claims (8)

  1. A random-forest-based automatic tuning method for a graph data processing framework, comprising the following steps:
    constructing a training data set, wherein each piece of sample data of the training data set represents the correspondence among a configuration-parameter combination of the graph data processing framework, an input data set size, and a program running time;
    training a random forest model comprising multiple decision trees on the training data set, wherein the training set of each decision tree is generated from the training data set by bootstrap aggregating (bagging), and the trained random forest model serves as a performance prediction model for predicting the program running time corresponding to different configuration-parameter combinations in combination with the input data set size;
    in the search space of the configuration parameters, using the performance prediction model to predict, for different input data set sizes, the performance of different configuration parameters generated by a genetic algorithm, so as to obtain optimal configuration parameters.
  2. The method according to claim 1, wherein training a random forest model comprising multiple decision trees on the training data set comprises:
    drawing m samples at a time from the training data set using the bootstrap aggregating algorithm, repeating the sampling n_tree times in total, and generating n_tree training sets from these samples, where n_tree corresponds to the number of decision trees contained in the random forest model;
    training n_tree decision trees with the training sets, wherein for a single decision tree, a subset containing k graph data processing framework attributes is first randomly selected from the attribute set of the node, the optimal attribute is then selected from the subset according to the information gain or the Gini index at each split, and the multiple decision trees thus generated form the random forest model.
  3. The method according to claim 1, wherein the output of the performance prediction model is determined by the classification votes of the multiple decision trees or by the mean of the values predicted by the multiple decision trees.
  4. The method according to claim 1, wherein, in the search space of the configuration parameters, using the performance prediction model to predict, for different input data set sizes, the performance of different configuration parameters generated by a genetic algorithm, so as to obtain optimal configuration parameters, comprises:
    randomly inputting a set of configuration parameters from the search space of the configuration parameters and computing an initialized individual-fitness standard A with the performance prediction model, the individual fitness being the predicted program execution time;
    randomly selecting n sets of configuration parameters from the search space of the configuration parameters as an initial population P, and performing random crossover and mutation operations on each individual in P;
    computing the fitness of the population P and its offspring with the performance prediction model, selecting the individuals whose fitness is higher than A to form a new population P', taking the fitness A' of the individual with the highest fitness as the new fitness standard A', and finding the individual with the highest fitness through iterative operations, that individual corresponding to the optimal configuration parameters.
  5. The method according to claim 1, wherein constructing the training data set comprises:
    automatically generating configuration parameters for each run of the program to be optimized on the graph data processing framework, automatically collecting the program running time after each run, and combining it with the configuration parameters used and the input data set size to form one piece of sample data.
  6. The method according to claim 1, wherein the graph data processing framework comprises Spark GraphX, PowerGraph, or TinkerPop.
  7. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 6.
  8. A computer device comprising a memory and a processor, the memory storing a computer program executable on the processor, wherein the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 6.
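
By way of illustration only, the sample-collection step of claims 1 and 5 can be pictured with a short Python sketch. It assumes a spark-submit based workload; the four Spark parameters, their value ranges, the application class "PageRankApp", and the jar name "app.jar" are hypothetical placeholders, since the claims do not fix any particular parameters or workload. Each run yields one row of (configuration parameters, input data set size, running time).

```python
# Minimal sketch of the training-data collection of claim 5 (assumptions noted above).
import csv
import random
import subprocess
import time

# Hypothetical search space over a few numeric Spark configuration parameters.
SEARCH_SPACE = {
    "spark.executor.cores": [1, 2, 4, 8],
    "spark.executor.memory": [2, 4, 8],            # in GB
    "spark.default.parallelism": [50, 100, 200, 400],
    "spark.memory.fraction": [0.4, 0.5, 0.6, 0.75],
}

def random_config():
    """Draw one configuration-parameter combination at random."""
    return {key: random.choice(values) for key, values in SEARCH_SPACE.items()}

def to_spark_value(key, value):
    """Format a numeric parameter value the way spark-submit expects it."""
    return f"{value}g" if key == "spark.executor.memory" else str(value)

def run_workload(config, dataset_path):
    """Run the graph program once under `config` and return its wall-clock time."""
    conf_args = []
    for key, value in config.items():
        conf_args += ["--conf", f"{key}={to_spark_value(key, value)}"]
    cmd = ["spark-submit"] + conf_args + ["--class", "PageRankApp", "app.jar", dataset_path]
    start = time.time()
    subprocess.run(cmd, check=True)
    return time.time() - start

def collect_samples(dataset_path, dataset_size, n_runs, out_csv="samples.csv"):
    """Append one sample row per run: configuration values, data set size, running time."""
    with open(out_csv, "a", newline="") as f:
        writer = csv.writer(f)
        for _ in range(n_runs):
            cfg = random_config()
            runtime = run_workload(cfg, dataset_path)
            writer.writerow(list(cfg.values()) + [dataset_size, runtime])
```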
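For the performance prediction model of claims 1 to 3, a minimal sketch using scikit-learn's RandomForestRegressor is given below under the assumption that the samples were collected as in the previous sketch. The library's bootstrap sampling and per-split random feature subsets play the role of the bagging and attribute-subset selection in claim 2, and its prediction is the mean of the per-tree values, i.e. the regression option in claim 3; the hyper-parameter values shown are assumptions, not taken from the claims.

```python
# Minimal sketch of training the random-forest performance model (claims 1-3).
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def train_performance_model(samples_csv="samples.csv", n_tree=100, k_features="sqrt"):
    data = pd.read_csv(samples_csv, header=None)
    X = data.iloc[:, :-1].to_numpy()   # configuration parameters + input data set size
    y = data.iloc[:, -1].to_numpy()    # measured program running time
    model = RandomForestRegressor(
        n_estimators=n_tree,           # number of decision trees (n_tree in claim 2)
        max_features=k_features,       # size k of the random attribute subset per split
        bootstrap=True,                # bootstrap-aggregated per-tree training sets
    )
    model.fit(X, y)
    return model

# Predicted running time for one configuration plus input size (mean over all trees):
# model = train_performance_model()
# model.predict([[4, 8, 200, 0.6, 2_000_000]])
```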
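The genetic-algorithm search of claim 4, with the trained model as the fitness function, might look as sketched below; lower predicted running time is treated as higher fitness. The crossover and mutation operators, their rates, the population size, and the parameter ranges (mirroring the hypothetical search space above) are illustrative assumptions only.

```python
# Minimal sketch of the GA search of claim 4 over the illustrative search space.
import random
import numpy as np

PARAM_VALUES = [
    [1, 2, 4, 8],              # spark.executor.cores
    [2, 4, 8],                 # spark.executor.memory (GB)
    [50, 100, 200, 400],       # spark.default.parallelism
    [0.4, 0.5, 0.6, 0.75],     # spark.memory.fraction
]

def predicted_time(model, individual, dataset_size):
    """Fitness of one configuration: the model's predicted running time."""
    return float(model.predict(np.array([individual + [dataset_size]]))[0])

def random_individual():
    return [random.choice(values) for values in PARAM_VALUES]

def crossover(a, b):
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(individual, rate=0.2):
    return [random.choice(PARAM_VALUES[i]) if random.random() < rate else value
            for i, value in enumerate(individual)]

def ga_search(model, dataset_size, pop_size=20, generations=30):
    # Initial fitness standard A: predicted time of one random configuration.
    best = random_individual()
    best_time = predicted_time(model, best, dataset_size)
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(crossover(*random.sample(population, 2)))
                     for _ in range(pop_size)]
        scored = sorted(
            ((predicted_time(model, ind, dataset_size), ind)
             for ind in population + offspring),
            key=lambda pair: pair[0],
        )
        # Individuals beating the current standard A form the next population;
        # the best of them supplies the new standard A'.
        improved = [pair for pair in scored if pair[0] < best_time]
        if improved:
            best_time, best = improved[0]
        survivors = improved if len(improved) >= 2 else scored
        population = [ind for _, ind in survivors[:pop_size]]
    return best, best_time

# Usage: best_cfg, best_pred = ga_search(model, dataset_size=2_000_000)
```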
PCT/CN2021/124378 2020-11-27 2021-10-18 Random-forest-based automatic optimization method for graphic data processing framework WO2022111125A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011358762.0A CN114565001A (en) 2020-11-27 2020-11-27 Automatic tuning method for graph data processing framework based on random forest
CN202011358762.0 2020-11-27

Publications (1)

Publication Number Publication Date
WO2022111125A1 true WO2022111125A1 (en) 2022-06-02

Family

ID=81711916

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124378 WO2022111125A1 (en) 2020-11-27 2021-10-18 Random-forest-based automatic optimization method for graphic data processing framework

Country Status (2)

Country Link
CN (1) CN114565001A (en)
WO (1) WO2022111125A1 (en)


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11609354B2 (en) * 2016-06-02 2023-03-21 Shell Usa, Inc. Method of processing a geospatial dataset
CN111126668B (en) * 2019-11-28 2022-06-21 中国人民解放军国防科技大学 Spark operation time prediction method and device based on graph convolution network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
US20180314533A1 (en) * 2017-04-28 2018-11-01 International Business Machines Corporation Adaptive hardware configuration for data analytics
CN108491226A (en) * 2018-02-05 2018-09-04 西安电子科技大学 Spark based on cluster scaling configures parameter automated tuning method
CN111461286A (en) * 2020-01-15 2020-07-28 华中科技大学 Spark parameter automatic optimization system and method based on evolutionary neural network

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116224091A (en) * 2022-12-01 2023-06-06 伏瓦科技(苏州)有限公司 Battery cell fault detection method and device, electronic equipment and storage medium
CN116224091B (en) * 2022-12-01 2024-02-02 伏瓦科技(苏州)有限公司 Battery cell fault detection method and device, electronic equipment and storage medium
CN116451812A (en) * 2023-04-12 2023-07-18 北京科技大学 Wettability prediction method and device based on multi-granularity cascade forest and super-parameter optimization
CN116451812B (en) * 2023-04-12 2024-02-09 北京科技大学 Wettability prediction method and device based on multi-granularity cascade forest and super-parameter optimization
CN117455066A (en) * 2023-11-13 2024-01-26 哈尔滨航天恒星数据***科技有限公司 Corn planting accurate fertilizer distribution method based on multi-strategy optimization random forest, electronic equipment and storage medium
CN117909886A (en) * 2024-03-18 2024-04-19 南京海关工业产品检测中心 Sawtooth cotton grade classification method and system based on optimized random forest model
CN117909886B (en) * 2024-03-18 2024-05-24 南京海关工业产品检测中心 Sawtooth cotton grade classification method and system based on optimized random forest model
CN118070982A (en) * 2024-04-19 2024-05-24 青岛理工大学 Prediction method and device for selection of station passenger stairs in flood scene

Also Published As

Publication number Publication date
CN114565001A (en) 2022-05-31

Similar Documents

Publication Publication Date Title
WO2022111125A1 (en) Random-forest-based automatic optimization method for graphic data processing framework
Dinelli et al. An FPGA‐Based Hardware Accelerator for CNNs Using On‐Chip Memories Only: Design and Benchmarking with Intel Movidius Neural Compute Stick
US20230139783A1 (en) Schema-adaptable data enrichment and retrieval
WO2023065859A1 (en) Item recommendation method and apparatus, and storage medium
US20220043978A1 (en) Automatic formulation of data science problem statements
US20220245465A1 (en) Picture searching method and apparatus, electronic device and computer readable storage medium
US9582586B2 (en) Massive rule-based classification engine
Zhu et al. A classification algorithm of CART decision tree based on MapReduce attribute weights
US20180336459A1 (en) Unstructured key definitions for optimal performance
CN114902246A (en) System for fast interactive exploration of big data
CN116057518A (en) Automatic query predicate selective prediction using machine learning model
JP2021111342A (en) Method and apparatus for performing word segmentation on text, device, and medium
EP4252151A1 (en) Data source correlation techniques for machine learning and convolutional neural models
Xuan et al. High performance FPGA embedded system for machine learning based tracking and trigger in sPhenix and EIC
WO2022100370A1 (en) Automatic adjustment and optimization method for svm-based streaming
WO2023174189A1 (en) Method and apparatus for classifying nodes of graph network model, and device and storage medium
US10169508B2 (en) Efficient deployment of table lookup (TLU) in an enterprise-level scalable circuit simulation architecture
Oo et al. Hyperparameters optimization in scalable random forest for big data analytics
US11782918B2 (en) Selecting access flow path in complex queries
KR102269737B1 (en) Information Classification Method Based on Deep-Learning And Apparatus Thereof
Tang et al. Design of a data processing method for the farmland environmental monitoring based on improved Spark components
CN114357180A (en) Knowledge graph updating method and electronic equipment
Malarvizhi et al. Enhanced reconfigurable weighted association rule mining for frequent patterns of web logs
US11327825B2 (en) Predictive analytics for failure detection
CN113569027B (en) Document title processing method and device and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896627

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896627

Country of ref document: EP

Kind code of ref document: A1