CN102708404A - Machine learning based method for predicating parameters during MPI (message passing interface) optimal operation in multi-core environments - Google Patents

Machine learning based method for predicating parameters during MPI (message passing interface) optimal operation in multi-core environments

Info

Publication number
CN102708404A
CN102708404A CN2012100420437A CN201210042043A
Authority
CN
China
Prior art keywords
mpi
training
model
program
parameter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012100420437A
Other languages
Chinese (zh)
Other versions
CN102708404B (en
Inventor
曾宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEJING COMPUTING CENTER
Original Assignee
BEJING COMPUTING CENTER
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEJING COMPUTING CENTER filed Critical BEJING COMPUTING CENTER
Priority to CN201210042043.7A priority Critical patent/CN102708404B/en
Publication of CN102708404A publication Critical patent/CN102708404A/en
Application granted granted Critical
Publication of CN102708404B publication Critical patent/CN102708404B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a novel method for optimizing MPI (message passing interface) applications in multi-core environments, and particularly relates to a machine learning based method for predicting optimal MPI runtime parameters on multi-core clusters. According to the method, training benchmarks with different ratios of point-to-point communication data to collective communication data are designed to generate training data on the specific multi-core cluster; runtime parameter optimization models are constructed with the decision tree REPTree, which outputs results quickly, and an ANN (artificial neural network), which can produce multiple outputs and has good noise immunity; the optimization models are trained with the training data generated by the training benchmarks, and the trained models are used to predict the optimal runtime parameters of unknown input MPI applications. Experiments show that the speedups produced by the optimal runtime parameters obtained from the REPTree-based and ANN-based prediction models on average exceed 90% of the actual maximum speedup.

Description

A machine learning based method for predicting optimal MPI runtime parameters in multi-core environments
Technical field
The present invention relates to MPI optimization in multi-core environments, and specifically to a machine learning based method for predicting optimal MPI runtime parameters in multi-core environments.
Background art
As multi-core technology becomes widely used in clusters, performance optimization of MPI applications on multi-core clusters has become a research focus. The mainstream MPI library implementations (Open MPI, MPICH, etc.) all provide tunable runtime parameter mechanisms, allowing users to tune the runtime parameters according to specific application demands, hardware and operating system to improve the performance of MPI applications.
We have designed and implemented a machine learning based MPI runtime parameter optimization model for general multi-core environments, which can automatically predict near-optimal runtime parameter combinations for MPI programs on a multi-core cluster with a given software and hardware configuration. The prediction model we propose, based on decision trees and artificial neural networks from machine learning, can automatically predict near-optimal runtime parameters for unknown MPI programs after offline training and online learning. The MPI program to be predicted is described jointly by dynamic features obtained from one run of the source code and static features such as the communicator size. The proposed machine learning based optimal MPI runtime parameter prediction method is validated on an InfiniBand-based multi-core SMP cluster, using the mainstream Open MPI library as the environment for predicting optimal MPI runtime parameters. Experiments with the IS and LU benchmarks from the NAS Parallel Benchmarks suite 2.4 show that, compared with the Open MPI default configuration, the optimized runtime parameter combinations obtained from the machine learning based prediction model can bring up to about 20% performance improvement to MPI applications on a multi-core cluster.
Multi-core technology refers to integrating two or more processing cores into a single processor chip, accelerating applications by distributing load across the cores. Clusters based on multi-core technology have become the mainstream platform in high-performance computing, and more and more clusters adopt multi-core processors as core components [CHAI08]. The Message Passing Interface (MPI) is the most widely used parallel programming model on clusters, applied in both distributed and shared memory systems.
The new features of multi-core processors make the memory hierarchy of a multi-core cluster more complex, while also bringing new optimization opportunities for MPI programs. Although factors such as the data locality of the algorithm and load balancing affect MPI application performance, they are tied to the characteristics of specific applications; directly porting existing MPI programs to multi-core cluster platforms does not greatly improve application performance and scalability [SW 09]. Current research on MPI optimization for multi-core systems mainly focuses on hybrid MPI/OpenMP, tuning MPI runtime parameters, optimizing MPI process topology, and optimizing MPI collective communication. Among these, tunable runtime parameters have an important influence on the performance of MPI applications in multi-core environments, but the optimal runtime parameters depend on the underlying architecture of the multi-core node or multi-core cluster and on the characteristics of the MPI program itself.
The mainstream MPI library implementations all provide tunable runtime parameter mechanisms that allow users to obtain higher performance by adjusting the runtime parameters. For example, the protocol used by point-to-point communication can be changed, i.e., the threshold parameter in the MPI library that switches from the immediate (Eager) protocol to the rendezvous (Rendezvous) protocol according to message size can be modified. Tunable runtime parameters have an important influence on the performance of MPI applications on a multi-core cluster, but the optimal runtime parameters depend to a large extent on factors such as the memory hierarchy of the multi-core cluster (including the sharing of second- or third-level caches within a node), the network interconnect of the cluster (including InfiniBand, Gigabit Ethernet, Myrinet, etc.), the communication performance of the cluster (including memory and network communication latency and bandwidth), and the communication levels of the MPI application within the cluster (including intra-chip, inter-chip and intra-node communication).
Fig. 1 shows the performance impact of different configuration combinations of five runtime parameters on the IS benchmark (Class B) from the NAS Parallel Benchmarks suite, on a 10-node multi-core cluster with 8 cores per node. On a cluster of 10 InfiniBand-interconnected dual-core AMD nodes, the best runtime parameter configuration brings at most about 20% performance improvement compared with the Open MPI library default settings, while a wrong configuration causes about 30% performance loss compared with the default configuration.
Fig. 2 shows the influence of runtime parameters on the Jacobi benchmark. Experiments on a 32-core AMD node with a matrix size of 4096*6096 show that, for the Jacobi benchmark, the optimal parameter configuration combination that obtains the maximum speedup (relative to the default configuration) with 8 MPI processes differs from that with 16 MPI processes. The experimental results also show that with 8 MPI processes, the optimal MPI runtime parameters can bring about 70% performance improvement to the Jacobi benchmark.
Figs. 1 and 2 show that tunable runtime parameters can bring considerable performance improvement to MPI applications, but at the same time the configuration space of runtime parameters and the corresponding optimization space are so large that manual tuning is impractical. Take the mainstream Open MPI, built on a modular component architecture, as an example: suppose one tunable numeric parameter and one boolean parameter are selected from each of the commonly used btl (point-to-point) and coll (collective operation) components, with 20 values tested for each numeric parameter and 2 values for each boolean parameter; an automated iterative technique would then need to test 1600 runtime parameter configuration combinations for these four parameters. At an average MPI program execution time of 5 minutes per configuration, finding the best runtime parameter combination would take more than five days in total. A fast automatic parameter optimization method is therefore urgently needed to improve the performance of MPI applications on multi-core clusters.
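The search-space arithmetic above can be reproduced in a few lines (a sketch; the abstract value ranges stand in for actual Open MPI MCA parameters, which are not named in this passage):

```python
from itertools import product

# Hypothetical sweep: one numeric and one boolean parameter from each of
# the btl (point-to-point) and coll (collective) component families.
numeric_values = range(20)          # 20 tested values per numeric parameter
boolean_values = (False, True)      # 2 values per boolean parameter

configs = list(product(numeric_values, boolean_values,
                       numeric_values, boolean_values))
print(len(configs))                 # 20 * 2 * 20 * 2 = 1600 combinations

minutes_per_run = 5                 # average MPI program execution time
total_days = len(configs) * minutes_per_run / (60 * 24)
print(round(total_days, 1))         # ~5.6 days of exhaustive search
```

The exhaustive grid grows multiplicatively with every added parameter, which is why the patent replaces it with a learned model.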
Summary of the invention
To achieve the above object, the invention provides a machine learning based method for predicting optimal MPI runtime parameters in multi-core environments.
A machine learning based method for predicting optimal MPI runtime parameters in multi-core environments:
optimization models are built with two standard techniques, decision trees and artificial neural networks;
training data are generated by running a constructed training benchmark on the target multi-core cluster with many groups of runtime parameter combinations, and the constructed models are trained offline;
the trained models are used to predict the optimal runtime parameter configuration for new MPI programs;
the predicted results are compared with the actual optimal runtime parameter vector to evaluate the accuracy of the prediction models.
Preferably, the decision-tree model takes the program features of the training benchmark and the runtime parameter configuration combinations as the input of the decision-tree model, the training data being {F_i, C_i}, where F_i is the program feature of the training benchmark and C_i is the runtime parameter combination under the current program features; the actually obtained speedup serves as the output of the decision tree.
Preferably, the artificial neural network model selects the data that produce the highest speedups in the training benchmark to train the parameter prediction model, the training data being {F_i, C_i_best}, where F_i = &lt;f_1, f_2, ..., f_m&gt; is the program feature of the training benchmark and C_i_best = &lt;c_1, c_2, ..., c_n&gt; is the optimal runtime parameter combination under the current program features.
Preferably, in the model training stage of the decision-tree model, different speedup results are produced by varying the vectors F and C; during performance model prediction, if F_p represents the program feature vector of the input MPI program, then the runtime parameter configuration C_best that obtains the maximum speedup S_max is the optimal runtime parameter combination vector of this MPI program, i.e., S_max = M_REPTree(F_p, C_best).
Preferably, in the model training stage of the artificial neural network model, different speedup results are produced by varying the vectors F and C; during performance model prediction, if M_ANN is the trained artificial neural network model, then C_best = M_ANN(F_p), where F_p represents the program feature vector of the input MPI program and C_best is the optimal runtime parameter combination vector of this MPI program.
Preferably, the training benchmark comprises two MPI communication modes: synchronous MPI point-to-point communication and MPI collective operations; the training benchmark accepts 5 parameters, which respectively control the proportion of point-to-point communication in the training benchmark, the proportion of collective communication, the size of messages in synchronous point-to-point communication between two MPI processes, the size of messages exchanged in collective communication, and the size of the communicator.
Preferably, the offline training varies the 5 input parameters of the training benchmark so that the ratio of point-to-point to collective communication is respectively: 100% point-to-point communication, 100% collective communication, and 50% point-to-point with 50% collective communication; under the three communication ratios, the message sizes in point-to-point and collective communication and the size of the MPI communicator are varied, as are the runtime parameter configuration combinations; in total 3000 training data records are produced and used to train the neural network optimization model.
Preferably, once the prediction model has been established and trained with a large amount of learning data, prediction tasks are carried out according to actual demand.
Preferably, before prediction, the MPI program to be predicted needs one instrumented run on the target multi-core cluster to obtain the feature vector F_p of the input MPI program; with F_p as the model input, the optimal runtime parameter combination of the input MPI program is obtained; when the target multi-core cluster changes, the above process needs to be repeated.
The present invention proposes a new method of optimizing MPI applications in multi-core environments: using machine learning to predict the optimal runtime parameters of MPI applications on a multi-core cluster. We designed training benchmarks with different ratios of point-to-point to collective communication data to generate training data on a specific multi-core cluster; we build the runtime parameter optimization models with the decision tree REPTree, which can output results quickly, and the artificial neural network ANN, which can produce multiple outputs and has good noise immunity; the optimization models are trained with the training data produced by the training benchmarks, and the trained models are used to predict the optimal runtime parameters of unknown input MPI programs. Experiments show that the speedups produced by the optimized runtime parameters obtained from the REPTree-based and ANN-based prediction models on average exceed 90% of the actual maximum speedup.
Description of drawings
Fig. 1 Performance impact of runtime parameters on the IS benchmark (Class B)
Fig. 2 Performance impact of runtime parameters on the Jacobi benchmark (4096*6096)
Fig. 3 The machine learning based prediction model
Fig. 4 The decision tree prediction model
Fig. 5 The neural network prediction model
Embodiment
The invention is further described below with reference to the accompanying drawings and specific embodiments.
Tunable runtime parameters have an important influence on the performance of MPI applications on a multi-core cluster, but the optimal runtime parameters depend on the underlying architecture of the multi-core cluster and the characteristics of the MPI program itself. This section introduces the method and steps of using machine learning techniques to predict optimal MPI runtime parameters in multi-core environments.
Our method comprises four stages: model construction, model training, parameter prediction using the trained models, and evaluation of prediction accuracy. In the first stage we adopt two standard machine learning techniques, decision trees and artificial neural networks, to build the optimization models. In the model training stage we run the constructed training benchmark on the target multi-core cluster with many groups of runtime parameter combinations to generate training data, and train the constructed models offline. The trained models can then be used to predict the optimal runtime parameter configuration for new, unknown MPI programs. Comparing the predicted results with the actual optimal runtime parameter vector evaluates the accuracy of the prediction models.
The essence of machine learning is to solve practical problems with computer learning systems. A machine learning based prediction model can be regarded as a mapping or function y = F(X), where X is the input and the output y is a continuous or ordered value. The goal of learning is to obtain a mapping or function F that models the relationship between X and y. The accuracy of a predictor is assessed by computing, for each test tuple X, the difference between the predicted value of y and the actual given value [HAN 07].
Because MPI runtime parameters are difficult to tune manually on multi-core systems, we adopt machine learning methods to establish a prediction model for optimal parameters; this model can predict the optimal runtime parameters of any unknown MPI workload on a given multi-core cluster platform.
Fig. 3 depicts the workflow of the prediction model. First, the training benchmark is run on the target multi-core cluster with different runtime parameter configurations to produce training data; the constructed prediction model is trained offline with the generated training data; then the program features extracted from a given MPI program serve as the input of the prediction model; finally, the model outputs near-optimal runtime parameter predictions, so as to obtain a near-maximum speedup. Formally, the machine learning based prediction of optimal MPI runtime parameters can be expressed as follows: let M be the trained prediction model and F = &lt;f_1, f_2, ..., f_m&gt; the extracted program features of the input MPI program; then C = M(F), where the resulting vector C = &lt;c_1, c_2, ..., c_n&gt; is the optimal runtime parameter combination of this program.
Decision-tree model
A decision tree is a tree-shaped prediction model; the root node of the tree is the entire data set space; each internal node corresponds to a splitting problem, i.e., a test on a single variable that splits the data set space into two or more blocks; and each leaf node is a segment of the data carrying a classification result. We chose decision trees for building the prediction model because a decision tree does not need much background knowledge during learning; a decision tree can be produced from the information provided by the sample data set alone. Through the branching decisions at tree nodes, a given class in a classification problem depends only on the attribute values of the variables at the relevant tree nodes; that is, not all variable values are needed to determine the corresponding class or to perform a prediction.
We adopt REPTree, a fast decision tree learner, to build our decision-tree model. REPTree uses a reduced-error pruning strategy and can create regression trees, so it can effectively handle continuous attributes and missing attribute values [PIER 02].
Fig. 4 depicts our decision tree prediction model. During model training we take the program features of the training benchmark and the runtime parameter configuration combinations as the input of the decision-tree model; that is, the training data of the REPTree model are {F_i, C_i}, where F_i is the program feature of the training benchmark and C_i is the runtime parameter combination under the current program features, and the actually obtained speedup serves as the output of the decision tree. In other words, the program features of the training benchmark, the different runtime parameter combinations, and the actually obtained speedups serve as the input and output for modeling REPTree, generating if-then decision rules. The decision tree learned from the sample data set can then predict near-optimal runtime parameter combinations for new, unknown MPI programs whose program features have been extracted.
The model can be formalized as follows: let M_REPTree be the decision tree prediction model; then the relation among the model, the training data and the output data is defined as S = M_REPTree(f_1, f_2, ..., f_m, c_1, c_2, ..., c_n), where F = &lt;f_1, f_2, ..., f_m&gt; is the program feature vector, C = &lt;c_1, c_2, ..., c_n&gt; is the runtime parameter combination vector, and S is the actual speedup produced when the inputs are F and C. In the model training stage, we produce different speedup results by varying the vectors F and C. During prediction, if F_p represents the program feature vector of the input MPI program, then the runtime parameter configuration C_best that obtains the maximum speedup S_max is the optimal runtime parameter combination vector of this MPI program, i.e., S_max = M_REPTree(F_p, C_best).
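The patent's REPTree model is a Weka learner; as an illustrative stand-in only, the same scheme — learn S = f(F, C) from measurements, then pick the parameter combination C with the highest predicted speedup — can be sketched with scikit-learn's DecisionTreeRegressor on synthetic data (all feature and parameter values here are hypothetical, not the patent's):

```python
import numpy as np
from itertools import product
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic training records: rows are [program features F | parameter combo C],
# target is the measured speedup S, mimicking {F_i, C_i} -> S.
F_train = rng.uniform(0, 1, size=(500, 3))     # e.g. p2p ratio, msg size, comm size
C_train = rng.integers(0, 4, size=(500, 2))    # two tunable runtime parameters
X_train = np.hstack([F_train, C_train])
# Toy ground truth: speedup peaks when both parameters equal 2.
S_train = 2.0 - 0.3 * np.abs(C_train - 2).sum(axis=1) + rng.normal(0, 0.01, 500)

model = DecisionTreeRegressor(max_depth=8, random_state=0).fit(X_train, S_train)

def predict_best_params(f_p, candidate_values=range(4)):
    """Return (C_best, S_max): the combo with the highest predicted speedup."""
    candidates = list(product(candidate_values, repeat=2))
    X = np.array([list(f_p) + list(c) for c in candidates])
    speedups = model.predict(X)
    i = int(np.argmax(speedups))
    return candidates[i], float(speedups[i])

c_best, s_max = predict_best_params([0.5, 0.5, 0.5])
print(c_best, s_max)   # expected near (2, 2), the toy optimum
```

Note the tree predicts the speedup of a (feature, parameter) pair, so finding C_best requires scoring candidate combinations, exactly as in S_max = M_REPTree(F_p, C_best).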
Neural network model
The artificial neural network (ANN) is a class of machine learning models that can map a set of input parameters to a set of target outputs. We adopt the ANN because it applies well to both linear and nonlinear regression problems and has good noise immunity [ZHANG 08].
A three-layer feed-forward error back-propagation neural network is used to build the prediction model. Experiments verified the ANN design that performs best for our prediction problem: the transfer function of the hidden layer is the tangent sigmoid function f(x) = (e^x - e^(-x)) / (e^x + e^(-x)), and the transfer function of the output layer is the logarithmic sigmoid function f(x) = 1 / (1 + e^(-x)); the hidden layer has 10 neurons, and its training function adopts the Levenberg-Marquardt algorithm, because it well combines the speed of Newton's method with the stability of gradient descent [BATR 92].
Fig. 5 depicts our neural network prediction model. During model training we select the data that produce the highest speedups in the training benchmark to train the ANN-based parameter prediction model; that is, the training data of the ANN model are {F_i, C_i_best}, where F_i = &lt;f_1, f_2, ..., f_m&gt; is the program feature of the training benchmark and C_i_best = &lt;c_1, c_2, ..., c_n&gt; is the optimal runtime parameter combination under the current program features. In the formal notation introduced above, let M_ANN be the trained ANN model; then C_best = M_ANN(F_p), where F_p represents the program feature vector of the input MPI program and C_best is the optimal runtime parameter combination vector of this MPI program.
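The patent's network is a Levenberg-Marquardt-trained three-layer model; as a rough stand-in, the direct mapping C_best = M_ANN(F_p) can be sketched with scikit-learn's MLPRegressor, which likewise supports one hidden layer of 10 neurons and multiple outputs (synthetic data and hypothetical names; note that MLPRegressor's solvers are L-BFGS/SGD/Adam, not Levenberg-Marquardt):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Synthetic training records {F_i, C_i_best}: only the best-speedup rows.
# F: m=3 program features; C_best: n=2 optimal runtime parameter values.
F = rng.uniform(0, 1, size=(400, 3))
# Toy ground truth: optimal parameters are a smooth function of the features.
C_best_train = np.column_stack([2 * F[:, 0] + F[:, 1], 3 * F[:, 2]])

# One hidden layer with 10 neurons, mirroring the patent's topology.
ann = MLPRegressor(hidden_layer_sizes=(10,), solver="lbfgs",
                   max_iter=5000, random_state=0).fit(F, C_best_train)

f_p = np.array([[0.5, 0.5, 0.5]])   # feature vector of an unseen MPI program
c_pred = ann.predict(f_p)[0]        # predicted near-optimal parameter vector
print(c_pred)                       # expected near [1.5, 1.5] for this toy target
```

Unlike the decision-tree scheme, the ANN outputs the parameter vector directly in one forward pass, with no search over candidate combinations.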
MPI program feature extraction
Our method uses the offline-trained optimization models to predict the optimal runtime parameters of unknown MPI applications, so suitable program features must be extracted from the unknown MPI program to serve as optimization model input and obtain accurate predictions. Because runtime parameters mainly influence the communication performance between MPI processes, during feature extraction we mainly consider the communication pattern of the MPI program, the volume of data exchanged in communication, and the size of the communicator. Table 2 describes the MPI program features; these required program features can be obtained through one instrumented run of the MPI program to be predicted.
Table 2 MPI program features and descriptions
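The body of Table 2 is not reproduced in this text. Based on the features named in the surrounding passages (communication pattern, exchanged data volume, communicator size, and the five benchmark parameters), one plausible container for the feature vector F_p might look like this — the field names are our assumption, not the patent's:

```python
from dataclasses import dataclass, astuple

@dataclass
class MPIProgramFeatures:
    """Candidate feature vector F_p gathered from one instrumented run."""
    p2p_ratio: float           # share of point-to-point communication
    collective_ratio: float    # share of collective communication
    p2p_msg_bytes: int         # message size in synchronous p2p exchanges
    collective_msg_bytes: int  # message size exchanged in collectives
    communicator_size: int     # number of processes in the communicator

    def as_vector(self):
        # Flatten into the ordered list form <f_1, ..., f_m> used by the models.
        return list(astuple(self))

f_p = MPIProgramFeatures(0.5, 0.5, 4096, 8192, 16)
print(f_p.as_vector())   # [0.5, 0.5, 4096, 8192, 16]
```

The same ordered vector would be fed to either M_REPTree (concatenated with a candidate C) or M_ANN (alone).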
Training benchmark configuration and training data generation
To produce the data for training the prediction models, we designed a training benchmark program. Running the training benchmark on a multi-core cluster of the target architecture with many different combinations of tunable runtime parameters produces the training data. The training benchmark also accepts several input parameters that control the amount of data transferred in point-to-point and collective operations and the communicator size in the training benchmark.
We design the training benchmark according to the MPI program features defined in Table 2. The benchmark mainly comprises two MPI communication modes: synchronous MPI point-to-point communication and MPI collective operations. The training benchmark accepts 5 parameters, which respectively control the proportion of point-to-point communication in the training benchmark, the proportion of collective communication, the size of messages in synchronous point-to-point communication between two MPI processes, the size of messages exchanged in collective communication, and the size of the communicator.
By varying the 5 input parameters of the training benchmark, the ratio of point-to-point to collective communication is controlled to be respectively: 100% point-to-point communication, 100% collective communication, and 50% point-to-point with 50% collective communication. Under the three communication ratios, the message sizes in point-to-point and collective communication and the size of the MPI communicator are varied, as are the runtime parameter configuration combinations; in total 3000 training data records are produced and used to train the neural network optimization model.
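As a sketch of how the 3000 training records could be enumerated — the specific message sizes, communicator sizes and number of parameter configurations below are our assumptions, chosen only so the grid lands at 3000; the patent fixes only the three communication ratios and the total:

```python
from itertools import product

# The three communication mixes fixed by the method: (p2p ratio, collective ratio).
ratios = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]

# Assumed sweep values (hypothetical).
msg_sizes = [1 << k for k in (10, 14, 18, 22)]   # 1 KiB .. 4 MiB, 4 sizes
comm_sizes = [2, 4, 8, 16, 32]                   # 5 communicator sizes
param_configs = range(50)                        # 50 runtime parameter combos

grid = [(r, m, c, p) for r, m, c, p
        in product(ratios, msg_sizes, comm_sizes, param_configs)]
print(len(grid))   # 3 * 4 * 5 * 50 = 3000 training runs
```

Each grid point corresponds to one benchmark run whose measured speedup becomes one training record for the models.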
Performing prediction
Once the prediction models have been established and trained with a large amount of learning data, prediction tasks are carried out according to actual demand. Because both our decision tree prediction model S_max = M_REPTree(F_p, C_best) and our neural network prediction model C_best = M_ANN(F_p) take the program feature vector F_p of the program to be predicted as input, before prediction we need one instrumented run of the MPI program to be predicted on the target multi-core cluster to obtain the feature vector F_p of the input MPI program; with F_p as the model input, the optimal runtime parameter combination of the input MPI program is obtained. However, when the target multi-core cluster changes, the above process needs to be repeated.

Claims (9)

1. A machine learning based method for predicting optimal MPI runtime parameters in multi-core environments, characterized in that:
optimization models are built with two standard techniques, decision trees and artificial neural networks;
training data are generated by running a constructed training benchmark on the target multi-core cluster with many groups of runtime parameter combinations, and the constructed models are trained offline;
the trained models are used to predict the optimal runtime parameter configuration for new MPI programs;
the predicted results are compared with the actual optimal runtime parameter vector to evaluate the accuracy of the prediction models.
2. The method of claim 1, characterized in that: the decision-tree model takes the program features of the training benchmark and the runtime parameter configuration combinations as the input of the decision-tree model, the training data being {F_i, C_i}, where F_i is the program feature of the training benchmark and C_i is the runtime parameter combination under the current program features; the actually obtained speedup serves as the output of the decision tree.
3. The method of claim 1, characterized in that: the artificial neural network model selects the data that produce the highest speedups in the training benchmark to train the parameter prediction model, the training data being {F_i, C_i_best}, where F_i = &lt;f_1, f_2, ..., f_m&gt; is the program feature of the training benchmark and C_i_best = &lt;c_1, c_2, ..., c_n&gt; is the optimal runtime parameter combination under the current program features.
4. The method of claim 2, characterized in that: in the model training stage of the decision-tree model, different speedup results are produced by varying the vectors F and C; during performance model prediction, if F_p represents the program feature vector of the input MPI program, then the runtime parameter configuration C_best that obtains the maximum speedup S_max is the optimal runtime parameter combination vector of this MPI program, i.e., S_max = M_REPTree(F_p, C_best).
5. The method of claim 3, characterized in that: in the model training stage of the artificial neural network model, different speedup results are produced by varying the vectors F and C; during performance model prediction, if M_ANN is the trained artificial neural network model, then C_best = M_ANN(F_p), where F_p represents the program feature vector of the input MPI program and C_best is the optimal runtime parameter combination vector of this MPI program.
6. The method of claim 1, characterized in that: the training benchmark comprises two MPI communication modes: synchronous MPI point-to-point communication and MPI collective operations; the training benchmark accepts 5 parameters, which respectively control the proportion of point-to-point communication in the training benchmark, the proportion of collective communication, the size of messages in synchronous point-to-point communication between two MPI processes, the size of messages exchanged in collective communication, and the size of the communicator.
7. the method for claim 1; It is characterized in that: said off-line training is through 5 input parameters of conversion training benchmark; The ratio of control point to point link and collective communication is respectively: 100% point to point link, 100% collective communication, 50% point to point link and 50% collective communication; Under three kinds of different communication ratios, the size of message size and MPI communicator in conversion point-to-point and the collective communication, and the configuration of conversion runtime parameter combination respectively; Common property is given birth to 3000 of training datas, is used for the neural network training Optimization Model.
8. the method for claim 1 is characterized in that: when forecast model foundation and after with a large amount of learning data trained, will carry out the prediction task based on actual demand.
9. the method for claim 1 is characterized in that: before carrying out prediction, need be under a target multinuclear group of planes to will predicting the instrument operation of carrying out of MPI program, with the proper vector F of the MPI program that obtains importing pWith F pParameter combinations in the time of can obtaining importing the optimum operation of MPI program as the input of model; When a target multinuclear group of planes changed, above process need repeated.
CN201210042043.7A 2012-02-23 2012-02-23 A kind of parameter prediction method during MPI optimized operation under multinuclear based on machine learning Expired - Fee Related CN102708404B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210042043.7A CN102708404B (en) 2012-02-23 2012-02-23 A kind of parameter prediction method during MPI optimized operation under multinuclear based on machine learning

Publications (2)

Publication Number Publication Date
CN102708404A true CN102708404A (en) 2012-10-03
CN102708404B CN102708404B (en) 2016-08-03

Family

ID=46901144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210042043.7A Expired - Fee Related CN102708404B (en) 2012-02-23 2012-02-23 A kind of parameter prediction method during MPI optimized operation under multinuclear based on machine learning

Country Status (1)

Country Link
CN (1) CN102708404B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020165838A1 (en) * 2001-05-01 2002-11-07 The Regents Of The University Of California Performance analysis of distributed applications using automatic classification of communication inefficiencies
CN101520748A (en) * 2009-01-12 2009-09-02 浪潮电子信息产业股份有限公司 Method for testing speed-up ratio of Intel multicore CPU

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
JELENA PJESIVAC-GRBOVIC ET AL: "Decision trees and MPI collective algorithm selection problem", EURO-PAR 2007 PARALLEL PROCESSING *
WANG Jie et al.: "Research on MPI program optimization techniques on multi-core clusters", COMPUTER SCIENCE *
WANG Jie et al.: "Neural-network-based MPI runtime parameter optimization on multi-core clusters", COMPUTER SCIENCE *

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104951425B (en) * 2015-07-20 2018-03-13 东北大学 A kind of cloud service performance self-adapting type of action system of selection based on deep learning
CN104951425A (en) * 2015-07-20 2015-09-30 东北大学 Cloud service performance adaptive action type selection method based on deep learning
CN106250686A (en) * 2016-07-27 2016-12-21 哈尔滨工业大学 A kind of collective communication function modelling method of concurrent program
CN111860826A (en) * 2016-11-17 2020-10-30 北京图森智途科技有限公司 Image data processing method and device of low-computing-capacity processing equipment
CN106909452B (en) * 2017-03-06 2020-08-25 中国科学技术大学 Parallel program runtime parameter optimization method
CN106909452A (en) * 2017-03-06 2017-06-30 中国科学技术大学 Concurrent program runtime parameter optimization method
US11373266B2 (en) 2017-05-05 2022-06-28 Intel Corporation Data parallelism and halo exchange for distributed machine learning
CN109146081A (en) * 2017-06-27 2019-01-04 阿里巴巴集团控股有限公司 It is a kind of for quickly creating the method and device of model item in machine learning platform
CN109146081B (en) * 2017-06-27 2022-04-29 阿里巴巴集团控股有限公司 Method and device for creating model project in machine learning platform
US11010681B2 (en) 2017-08-31 2021-05-18 Huawei Technologies Co., Ltd. Distributed computing system, and data transmission method and apparatus in distributed computing system
EP3506095A3 (en) * 2017-12-29 2019-09-25 INTEL Corporation Communication optimizations for distributed machine learning
CN107992295B (en) * 2017-12-29 2021-01-19 西安交通大学 Particle-oriented dynamic algorithm selection method
US11270201B2 (en) 2017-12-29 2022-03-08 Intel Corporation Communication optimizations for distributed machine learning
CN107992295A (en) * 2017-12-29 2018-05-04 西安交通大学 A kind of dynamic algorithm system of selection towards grain
US11704565B2 (en) 2017-12-29 2023-07-18 Intel Corporation Communication optimizations for distributed machine learning
CN109710330A (en) * 2018-12-20 2019-05-03 Oppo广东移动通信有限公司 Method for determining operation parameters, device, terminal and the storage medium of application program
CN111324532A (en) * 2020-02-13 2020-06-23 苏州浪潮智能科技有限公司 MPI parameter determination method, device and equipment of parallel computing software
CN113760766A (en) * 2021-09-10 2021-12-07 曙光信息产业(北京)有限公司 MPI parameter tuning method and device, storage medium and electronic equipment
US20230116246A1 (en) * 2021-09-27 2023-04-13 Indian Institute Of Technology Delhi System and method for optimizing data transmission in a communication network
US12021928B2 (en) * 2021-09-27 2024-06-25 Indian Institute Of Technology Delhi System and method for optimizing data transmission in a communication network

Also Published As

Publication number Publication date
CN102708404B (en) 2016-08-03

Similar Documents

Publication Publication Date Title
CN102708404A (en) Machine learning based method for predicating parameters during MPI (message passing interface) optimal operation in multi-core environments
Zou et al. Novel global harmony search algorithm for unconstrained problems
Lin et al. A hybrid evolutionary immune algorithm for multiobjective optimization problems
WO2018161468A1 (en) Global optimization, searching and machine learning method based on lamarck acquired genetic principle
CN102629106B (en) Water supply control method and water supply control system
US20170330078A1 (en) Method and system for automated model building
CN109445935A (en) A kind of high-performance big data analysis system self-adaption configuration method under cloud computing environment
Akopov et al. A multi-agent genetic algorithm for multi-objective optimization
Li et al. An improved multiobjective estimation of distribution algorithm for environmental economic dispatch of hydrothermal power systems
Lin et al. A K-means clustering with optimized initial center based on Hadoop platform
Amruthnath et al. Modified rank order clustering algorithm approach by including manufacturing data
CN109145342A (en) Automatic wiring system and method
Moradi et al. The application of water cycle algorithm to portfolio selection
Muhsen et al. Enhancing NoC-based MPSoC performance: a predictive approach with ANN and guaranteed convergence arithmetic optimization algorithm
CN114065646B (en) Energy consumption prediction method based on hybrid optimization algorithm, cloud computing platform and system
Gong et al. Evolutionary computation in China: A literature survey
CN104392317A (en) Project scheduling method based on genetic culture gene algorithm
De Moraes et al. A random forest-assisted decomposition-based evolutionary algorithm for multi-objective combinatorial optimization problems
Wen et al. MapReduce-based BP neural network classification of aquaculture water quality
Weihong et al. Optimization of BP neural network classifier using genetic algorithm
Aziz et al. Assessment of evolutionary programming models for single-objective optimization
Malhotra et al. Application of evolutionary algorithms for software maintainability prediction using object-oriented metrics
Ouyang et al. Amended harmony search algorithm with perturbation strategy for large-scale system reliability problems
Han et al. A Kriging Model‐Based Expensive Multiobjective Optimization Algorithm Using R2 Indicator of Expectation Improvement
CN115879824A (en) Method, device, equipment and medium for assisting expert decision based on ensemble learning

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20160803