CN110377525A - Parallel program performance prediction system based on runtime characteristics and machine learning - Google Patents

Parallel program performance prediction system based on runtime characteristics and machine learning

Info

Publication number
CN110377525A
Authority
CN
China
Prior art keywords
program
basic block
instrumentation
feature
machine learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910680598.6A
Other languages
Chinese (zh)
Other versions
CN110377525B (en)
Inventor
张伟哲 (Zhang Weizhe)
何慧 (He Hui)
王一名 (Wang Yiming)
郝萌 (Hao Meng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910680598.6A priority Critical patent/CN110377525B/en
Publication of CN110377525A publication Critical patent/CN110377525A/en
Application granted granted Critical
Publication of CN110377525B publication Critical patent/CN110377525B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/36 Preventing errors by testing or debugging software
    • G06F11/3604 Software analysis for verifying properties of programs
    • G06F11/3608 Software analysis for verifying properties of programs using formal methods, e.g. model checking, abstract interpretation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Debugging And Monitoring (AREA)

Abstract

A parallel program performance prediction system based on runtime features and machine learning, belonging to the technical field of parallel program performance prediction. The present invention addresses the problems that existing machine-learning-based parallel program performance prediction systems incur high overhead, long prediction times, and low accuracy. Mixed instrumentation is applied to the original program to reduce the number of basic block counters, and the program is then pruned into a serial program that produces no output, which reduces the program's running time while preserving its execution flow, so the basic block frequencies are obtained quickly and accurately. These data are preprocessed and fed into the prediction model, which finally outputs the execution time of the large-scale parallel program. The model generated by the present invention has strong generalization ability, accurately predicts the execution time of large-scale parallel programs, and incurs very low prediction overhead.

Description

Parallel program performance prediction system based on runtime characteristics and machine learning
Technical field
The present invention relates to a parallel program performance prediction system based on runtime features and machine learning, and belongs to the technical field of parallel program performance prediction.
Background technique
With the rapid growth in the scale and complexity of high-performance computing systems (e.g., node counts and storage), the cost of executing parallel applications on them has increased accordingly. Many parallel programs run on high-performance computing systems with low efficiency and waste system resources, so the efficiency and scalability problems of high-performance systems and their applications have become increasingly prominent. Therefore, before executing a parallel program at large scale on a high-performance computing system, it is very important to predict the performance of the large-scale parallel program on the target system by running small-scale parallel programs. Furthermore, optimizing the parallel program according to the prediction results can effectively reduce execution cost and avoid wasting resources.
The prior art document CN101650687B discloses a large-scale parallel program performance prediction method comprising: collecting the communication sequences and sequential computation vectors of a parallel program; analyzing the similarity of the computation performed by each process and selecting representative processes; recording the communication content of the representative processes; replaying the representative processes on compute nodes of the target platform to obtain their sequential computation times, and substituting these for the computation times of the other processes; obtaining the communication records of the parallel program; and automatically predicting the final program performance with a network simulator. With this method, accurate parallel program performance estimates can be obtained using very few hardware resources.
Existing machine-learning-based parallel program performance prediction systems suffer from high overhead, long prediction times, and low accuracy, and the prior art contains no parallel program performance prediction system that reaches an optimal trade-off among overhead, prediction time, and accuracy.
Summary of the invention
The technical problem to be solved by the present invention is:
The present invention aims to solve the problems that machine-learning-based parallel program performance prediction systems incur high overhead, long prediction times, and low accuracy.
The technical solution adopted by the present invention to solve the above technical problem is as follows:
A parallel program performance prediction system based on runtime features and machine learning, the system comprising a feature acquisition module, a performance modeling module, and a performance prediction module, wherein
the feature acquisition module converts the parallel program under test to LLVM IR form and then applies edge-profiling instrumentation to it, generating an instrumented parallel program (an executable program); the instrumented parallel program is executed with different input sizes and process counts, generating the total running time, the process count, and the basic block frequencies, and these three kinds of parameters are preprocessed;
the performance modeling module takes the preprocessed process counts and basic block frequencies as input and the preprocessed execution times as output, performs machine learning, and obtains a performance prediction model;
the performance prediction module converts the above parallel program under test to LLVM IR form, then applies mixed basic-block instrumentation to it, and afterwards prunes the instrumented program to obtain an executable serial program; the serial program is executed with input sizes and process counts larger than those used in the feature acquisition module, generating process counts and basic block frequencies, which are then preprocessed; the preprocessed process counts and basic block frequencies serve as the input of the performance prediction model, and the predicted execution time of the parallel program is obtained as its output.
Further, the edge-profiling instrumentation algorithm proceeds as follows.
Input: the LLVM IR of the parallel program.
Output: the IR after edge-profiling instrumentation.
1) Create a counter array C in the parallel program under test and initialize it to zero;
2) for each edge in the control-flow graph corresponding to the LLVM IR of the parallel program, judge whether it is a critical edge; if so, insert a new basic block newbb between the source basic block and the target basic block of the critical edge e, and add the code { C[index]++ } before the terminator instruction of newbb; otherwise, add the code { C[index]++ } before the terminator instruction of the source basic block or target basic block of edge e. This completes the instrumentation.
Further, the mixed instrumentation algorithm proceeds as follows.
Input: the LLVM IR of the parallel program.
Output: the IR after mixed instrumentation.
1) Obtain the basic block set selected during processing in the feature acquisition module;
2) create a counter array C in the target program and initialize it to zero;
3) for each loop l in the parallel program under test that contains a basic block selected in step 1), judge whether l is a natural loop and whether the header block h of the loop's back edge is dominated by that basic block; if so, create a preheader block p before the head node header, and then execute the following steps:
obtain the values relevant to the loop trip count (LTC): %start, %end, %stride;
add code before the terminator instruction of p that computes the LTC Γ of l;
add the code { C[index] += Γ } before the terminator instruction of p, so that the counter is updated once each time p executes;
otherwise, add the code { C[index]++ } in the selected basic block itself.
Further, the program pruning algorithm proceeds as follows.
Input: the IR of the parallel program after mixed instrumentation.
Output: the pruned IR.
1) First delete the output-related code of the parallel program from the IR produced by mixed instrumentation;
2) then delete the MPI function calls in the parallel program;
3) finally eliminate dead code.
The present invention has the following advantageous effects:
Accurately predicting the performance of large-scale parallel programs not only lets users analyze program performance so that applications run efficiently on high-performance computing systems, but also helps users manage and schedule jobs, allocate scheduling strategies sensibly, and reduce job waiting time; it further enables resource assessment and guides users in requesting resources. The present invention therefore proposes a parallel program performance prediction system whose generated model has strong generalization ability, accurately predicts the execution time of large-scale parallel programs, and incurs very low prediction overhead, giving it strong practical value.
In the parallel program performance prediction system based on runtime features and machine learning according to the present invention, the runtime features are the basic block frequencies, and parallel program performance refers to the execution time of the program.
Brief description of the drawings
The accompanying drawings are provided for further understanding of the present invention and constitute a part of the specification; together with the specific embodiments of the invention they serve to explain the invention, and they do not limit the invention. In the drawings:
Fig. 1 is a structural diagram of the parallel program performance prediction framework of the present invention;
Fig. 2 compares the predicted and actual execution times of six parallel programs using basic blocks as features, where a) Sweep3D, b) LULESH, c) NPB SP, d) NPB BT, e) NPB LU, and f) NPB EP denote well-known parallel program names; the ordinate indicates the execution time and the abscissa indicates the sample number;
Fig. 3 is a box plot of the MAPE of six parallel programs using basic block frequency as the feature, where SVR, RF, and Ridge denote the three machine learning methods;
Fig. 4 compares the errors of the three methods;
Fig. 5 compares the prediction overhead of the six parallel programs with their original overhead.
Specific embodiments
With reference to Figs. 1 to 5, the implementation of a parallel program performance prediction system based on runtime features and machine learning according to the present invention is described as follows:
1 Parallel program performance prediction system
As shown in Fig. 1, the parallel program performance prediction system is divided into three parts: feature acquisition, performance modeling, and performance prediction. The first part is feature acquisition, which obtains the training data features mainly by applying edge-profiling instrumentation to small-scale parallel programs; all program instrumentation in the present invention is based on the LLVM compiler framework. The instrumented program is executed several times and the results are averaged to obtain the process count and basic block frequencies, which serve as the features of the training data, while the total running time of the program serves as the parallel program performance metric of the present invention. Feature preprocessing is then carried out to improve the generalization ability of the model. The second part is performance modeling: supervised machine learning regression algorithms are used to build the performance model, with hyperparameters tuned continually to arrive at the optimal performance prediction model. The third part performs large-scale parallel program performance prediction with this model, which requires quickly obtaining the runtime basic block frequencies of the large-scale program as the model input. To this end, mixed instrumentation is applied to the original program to reduce the number of basic block counters, and the program is then pruned into a serial program that produces no output, which reduces the running time of the program while preserving its execution flow, so the basic block frequencies are obtained quickly and accurately. These data are preprocessed and fed into the prediction model, which finally outputs the execution time of the large-scale parallel program.
2 Acquisition of performance model features
The small-scale parallel program is first translated into LLVM intermediate representation using the LLVM compiler front end; an LLVM Pass implementing edge-profiling instrumentation is then written and executed, instrumenting the program automatically. The instrumented program is then run, generating a file containing the basic block frequencies. Finally, the file is read and the data are assembled into a data set comprising process counts and basic block frequencies. The edge-profiling instrumentation algorithm is as given above.
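As an illustration of the edge-profiling placement rule, the following is a minimal Python sketch that models the control-flow graph as plain dictionaries. It is a sketch only: the actual pass operates on LLVM IR in C++, and every name here (instrument_edges, is_critical, the toy CFG) is an illustrative assumption, not the patent's code.

```python
# Sketch of edge-profiling counter placement over a toy CFG {block: [succs]}.
def instrument_edges(cfg):
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)

    def is_critical(src, dst):
        # Critical edge: the source has several successors AND the target has
        # several predecessors, so neither endpoint can safely host the counter.
        return len(cfg[src]) > 1 and len(preds[dst]) > 1

    counters, placement = {}, {}
    for src in list(cfg):
        for dst in list(cfg[src]):
            index = len(counters)              # slot in the counter array C
            counters[(src, dst)] = index
            if is_critical(src, dst):
                newbb = f"newbb_{src}_{dst}"   # split the critical edge
                cfg[src] = [newbb if s == dst else s for s in cfg[src]]
                cfg[newbb] = [dst]
                placement[(src, dst)] = newbb  # C[index]++ before terminator
            elif len(cfg[src]) == 1:
                placement[(src, dst)] = src
            else:
                placement[(src, dst)] = dst
    return counters, placement

# entry -> exit is critical (entry has 2 successors, exit has 2 predecessors).
cfg = {"entry": ["a", "exit"], "a": ["exit"], "exit": []}
print(instrument_edges(cfg)[1])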
3 Performance modeling based on machine learning
Feature preprocessing is carried out first: the data are normalized nonlinearly, and suitable features are selected by removing duplicated features and applying variance-based selection and the Pearson correlation coefficient. Performance modeling is then performed with three machine learning algorithms, SVR, Ridge regression, and RF. The data are divided into a training set, a test set, and a validation set: the training set is used to fit the model, the test set to tune parameters, and the validation set to assess the model. Grid search is combined with k-fold cross-validation, so hyperparameters are tuned continually while the model is evaluated and the optimal configuration parameters are selected automatically. The mean absolute percentage error (MAPE) is used to assess the generalization ability of the model.
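As a concrete illustration of this modeling step, here is a hedged scikit-learn sketch on synthetic data. The hyperparameter grids, the log-based normalization, the variance threshold, and the Pearson cutoff are illustrative assumptions, not values disclosed in the patent; for brevity the sketch also applies feature selection before the train/test split, which a real pipeline would fit on the training set only.

```python
# Hedged sketch of performance modeling with SVR, Ridge, and RF on synthetic
# data; grids and preprocessing constants are illustrative assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(1, 1e6, size=(200, 7))     # process count + basic block freqs
y = 10 + 1e-4 * X[:, 1] + 5e-5 * X[:, 3] + rng.normal(0, 2, 200)  # exec time

X = np.log1p(X)                                # nonlinear normalization
X = VarianceThreshold(1e-3).fit_transform(X)   # drop near-constant features
corr = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
X = X[:, corr > 0.05]                          # Pearson-based feature selection

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "SVR":   (SVR(), {"C": [1, 10, 100], "gamma": ["scale", 0.1]}),
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "RF":    (RandomForestRegressor(random_state=0),
              {"n_estimators": [100, 300]}),
}
cv = KFold(n_splits=5, shuffle=True, random_state=0)   # k-fold CV
for name, (est, grid) in models.items():
    search = GridSearchCV(est, grid, cv=cv,            # grid search + CV
                          scoring="neg_mean_absolute_percentage_error")
    search.fit(X_train, y_train)
    mape = mean_absolute_percentage_error(y_test, search.predict(X_test))
    print(f"{name}: best={search.best_params_}, test MAPE={mape:.2%}")
```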
4 Performance prediction for large-scale parallel programs
To predict large-scale parallel program performance, the runtime features of the large-scale program must be obtained as the input of the prediction model. Although the overhead of obtaining small-scale program performance with edge-profiling instrumentation is very small, using it to obtain large-scale program performance is very expensive, so the overhead of the instrumented large-scale program must be reduced. To reduce the overhead introduced by instrumentation, a mixed instrumentation algorithm is proposed; in addition, to reduce the execution overhead of the large-scale program itself, a program pruning algorithm is also proposed.
The mixed instrumentation algorithm combines dynamic and static instrumentation. A trip-count recognition method is used to estimate the number of times a loop executes: the trip count can be obtained directly at run time, without inserting a counter and accumulating increments. Suppose the loop induction variable is initialized to %start, the loop-exit condition compares against %end, and the loop stride is %stride; the trip count Γ is then computed as Γ = (%end − %start) / %stride.
A new basic block called the preheader is added before the header of the loop, the basic block counter in the header is moved into the preheader, and the formula computing the trip count is inserted into the preheader, so no per-iteration counter needs to be inserted. This method further reduces the number of counter accesses and updates. However, not every basic block counter in a natural loop can be moved into the preheader. Next, for natural loops containing branches, a criterion is given for deciding whether the basic block frequency inside such a loop can be moved into the preheader basic block; the following definition is used to judge whether a block's counter can be moved to the preheader node.
Definition 1. In a control-flow graph with entry node b0, if every path from b0 to bj must pass through bi, then node bi is said to dominate node bj, written bi >> bj. By this definition, every node dominates itself; for example, bi >> bi.
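Definition 1 can be made concrete with the standard iterative dominator computation. The sketch below runs on a toy CFG and is an illustration of the textbook algorithm, not code from the patent.

```python
# Iterative dominator sets over a toy CFG: bi >> bj iff every path from the
# entry b0 to bj passes through bi (Definition 1).
def dominators(cfg, entry):
    preds = {b: [] for b in cfg}
    for b, succs in cfg.items():
        for s in succs:
            preds[s].append(b)
    dom = {b: set(cfg) for b in cfg}   # start from "dominated by everything"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            if b == entry:
                continue
            new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            if new != dom[b]:
                dom[b], changed = new, True
    return dom

# b0 -> b1 -> b2 -> b1 is the back edge; b1 -> b3 exits the loop.
cfg = {"b0": ["b1"], "b1": ["b2", "b3"], "b2": ["b1"], "b3": []}
print(dominators(cfg, "b0")["b2"])   # {'b0', 'b1', 'b2'}: b1 >> b2
```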
The mixed instrumentation algorithm is as given above.
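The core decision of the mixed instrumentation algorithm can be illustrated as follows. This is a sketch over toy loop descriptors standing in for LLVM's loop analysis; the Loop fields and the instrument function are illustrative assumptions, not the patent's code.

```python
# For a natural loop whose selected block dominates the back edge's header,
# the counter is hoisted: the preheader adds Γ = (end - start) / stride once.
# Otherwise the block keeps a plain per-execution increment C[index]++.
from dataclasses import dataclass

@dataclass
class Loop:
    block: str             # selected basic block inside the loop
    natural: bool          # is the loop a natural loop?
    dominates_h: bool      # does the block dominate the back edge's header h?
    start: int             # %start: induction variable initialization
    end: int               # %end: loop bound
    stride: int            # %stride: induction variable step

def instrument(loops):
    plan = []
    for i, l in enumerate(loops):
        if l.natural and l.dominates_h:
            trip = (l.end - l.start) // l.stride   # Γ, computed once in p
            plan.append(f"preheader p_{i}: C[{i}] += {trip}")
        else:
            plan.append(f"block {l.block}: C[{i}]++ on every execution")
    return plan

loops = [Loop("bb4", True, True, 0, 1000, 2),   # hoisted: one add of Γ = 500
         Loop("bb9", True, False, 0, 10, 1)]    # branchy path: count per pass
print("\n".join(instrument(loops)))
```

In the hoisted case the static trip count shown here would, in the real pass, be replaced by code that computes Γ from %start, %end, and %stride at run time.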
Because the calculation results are not needed to obtain the selected basic block frequencies, the program pruning algorithm first retains the original code and the instrumentation-related code, guaranteeing that the pruned program still runs normally and records the basic block frequencies accurately, and then deletes the useless and output-related code from the IR. In addition, to produce a serial program, the parallel function-call parts of the program must also be deleted. After the output-related code and the MPI function-call code are deleted, much dead code appears; since this code contributes to no other computation, running dead code elimination removes it from the IR. The IR thus shrinks, yielding a smaller executable program that runs faster.
The program pruning algorithm is as given above.
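The three pruning steps can be illustrated on a toy pseudo-IR; the instruction format and the prune function are invented for this sketch and do not come from the patent.

```python
# Pruning sketch: 1) drop output code, 2) drop MPI calls, 3) dead code
# elimination via a backward liveness pass that keeps only what feeds the
# instrumentation counters.
def prune(instrs):
    kept = [i for i in instrs
            if not i["op"].startswith(("print", "write", "MPI_"))]
    live = set()
    for i in reversed(kept):               # backward pass: mark live values
        if i["op"] == "count" or i.get("dst") in live:
            live.update(i.get("srcs", []))
    return [i for i in kept
            if i["op"] == "count" or i.get("dst") in live]

instrs = [
    {"op": "load",  "dst": "x", "srcs": []},
    {"op": "add",   "dst": "y", "srcs": ["x"]},
    {"op": "count", "dst": None, "srcs": []},      # C[index]++ must survive
    {"op": "MPI_Send", "dst": None, "srcs": ["y"]},
    {"op": "printf",   "dst": None, "srcs": ["y"]},
]
print(prune(instrs))   # only the counter update remains
```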
The technical effects of the invention are described below:
1 Prediction results
Table 1 shows two feature sets. The first is the common method (INPUT), whose selected features are the input parameters and the process count; the second is the method proposed by the present invention (RUNTIME), whose selected features are the basic block frequencies and the process count. Table 1 shows that the method using basic block frequency as the feature is clearly better than the method using the input parameters as features. The MAPE of the present method is below 20%, and the average MAPE over the six parallel applications is 8.41%.
Table 1: feature sets and MAPE of the parallel programs
Table 2 gives the standard deviation of the prediction errors of the six parallel programs, from which the dispersion of the prediction error, and hence the stability of the model, can be seen clearly. With input parameters as features, RF is the most stable; with basic block frequency as the feature, SVR is better than RF. Overall, the SVR model using basic block frequency as the feature is the most stable.
These results indicate that, compared with conventional machine learning methods that use only input-parameter features, automatic performance modeling based on runtime features builds better performance models, significantly improving prediction accuracy and stability.
Table 2: standard deviation of the parallel program errors
Fig. 2 compares, for the Sweep3D, LULESH, and NPB parallel applications, the times predicted by the SVR, RF, and Ridge regression algorithms with basic blocks as features against the programs' true running times. In these figures the test set samples are sorted in increasing order of actual running time; the darkest points are the true program execution times, and the lighter points are the times predicted by the machine learning models.
Fig. 3 is a box plot of the MAPE of the six parallel applications with basic block frequency as the feature; box plots avoid the influence of outliers and accurately show the discrete distribution of the data. These figures show clearly that SVR has the smallest prediction error.
2 Comparative experiments
The method proposed by the present invention is compared with two other classical performance prediction models based on input parameters: the method of Barnes and the method of Hoefler. The errors of the three methods are compared in Fig. 4.
Table 3: MAPE of the three methods
3 Performance prediction overhead
When predicting the performance of a parallel application, only the corresponding pruned serial program needs to be executed to collect the basic block frequencies; the original parallel application need not be run. The generated data comprise only the frequencies of a few basic blocks (six in the present invention), so the storage overhead is negligible. The assessment of prediction overhead therefore focuses on the execution overhead of the pruned serial program. Computing resources on a supercomputer are billed in core-hours, so in this experiment the prediction overhead is also expressed in core-hours.
Table 4 compares, for the six selected applications, the core-hours consumed by the method of the invention with the core-hours consumed by executing the original parallel application. The table shows that, for all six applications, the total overhead of executing the present method is far below the execution overhead of the original application: the average prediction overhead is only 0.1219% of the original execution cost. This means the method can help HPC users predict the performance of parallel applications effectively. The reason is that the pruned program is a standalone serial program that can execute on a single node, or even a single core; moreover, the serial program is further optimized by reducing the number of inserted counters and eliminating much dead code, improving its performance further.
Table 4: average overhead of the method versus original execution
Fig. 5 compares the prediction overhead of the six parallel programs with their original overhead. In these figures the test set samples are sorted in increasing order of actual running time, and the y-axis is core-hours; the line close to the x-axis is the prediction overhead, and the line far from the x-axis is the original overhead. These figures show clearly that the prediction overhead is far smaller than the overhead of executing the original program.

Claims (4)

1. A parallel program performance prediction system based on runtime features and machine learning, characterized in that the system comprises a feature acquisition module, a performance modeling module, and a performance prediction module, wherein
the feature acquisition module converts the parallel program under test to LLVM IR form and then applies edge-profiling instrumentation to it, generating an instrumented parallel program; the instrumented parallel program is executed with different input sizes and process counts, generating the total running time, the process count, and the basic block frequencies, and these three kinds of parameters are preprocessed;
the performance modeling module takes the preprocessed process counts and basic block frequencies as input and the preprocessed execution times as output, performs machine learning, and obtains a performance prediction model;
the performance prediction module converts the above parallel program under test to LLVM IR form, then applies mixed basic-block instrumentation to it, and afterwards prunes the instrumented program to obtain an executable serial program; the serial program is executed with input sizes and process counts larger than those used in the feature acquisition module, generating process counts and basic block frequencies, which are then preprocessed; the preprocessed process counts and basic block frequencies serve as the input of the performance prediction model, and the predicted execution time of the parallel program is obtained as its output.
2. The parallel program performance prediction system based on runtime features and machine learning according to claim 1, characterized in that the edge-profiling instrumentation algorithm proceeds as follows.
Input: the LLVM IR of the parallel program.
Output: the IR after edge-profiling instrumentation.
1) Create a counter array C in the parallel program under test and initialize it to zero;
2) for each edge in the control-flow graph corresponding to the LLVM IR of the parallel program, judge whether it is a critical edge; if so, insert a new basic block newbb between the source basic block and the target basic block of the critical edge e, and add the code { C[index]++ } before the terminator instruction of newbb; otherwise, add the code { C[index]++ } before the terminator instruction of the source basic block or target basic block of edge e. This completes the instrumentation.
3. The parallel program performance prediction system based on runtime features and machine learning according to claim 1 or 2, characterized in that the mixed instrumentation algorithm proceeds as follows.
Input: the LLVM IR of the parallel program.
Output: the IR after mixed instrumentation.
1) Obtain the basic block set selected during processing in the feature acquisition module;
2) create a counter array C in the target program and initialize it to zero;
3) for each loop l in the parallel program under test that contains a basic block selected in step 1), judge whether l is a natural loop and whether the header block h of the loop's back edge is dominated by that basic block; if so, create a preheader block p before the head node header, and then execute the following steps:
obtain the values relevant to the loop trip count (LTC): %start, %end, %stride;
add code before the terminator instruction of p that computes the LTC Γ of l;
add the code { C[index] += Γ } before the terminator instruction of p, so that the counter is updated once each time p executes;
otherwise, add the code { C[index]++ } in the selected basic block itself.
4. The parallel program performance prediction system based on runtime features and machine learning according to claim 3, characterized in that the program pruning algorithm proceeds as follows.
Input: the IR of the parallel program after mixed instrumentation.
Output: the pruned IR.
1) First delete the output-related code of the parallel program from the IR produced by mixed instrumentation;
2) then delete the MPI function calls in the parallel program;
3) finally eliminate dead code.
CN201910680598.6A 2019-07-25 2019-07-25 Parallel program performance prediction system based on runtime characteristics and machine learning Active CN110377525B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910680598.6A CN110377525B (en) 2019-07-25 2019-07-25 Parallel program performance prediction system based on runtime characteristics and machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910680598.6A CN110377525B (en) 2019-07-25 2019-07-25 Parallel program performance prediction system based on runtime characteristics and machine learning

Publications (2)

Publication Number Publication Date
CN110377525A true CN110377525A (en) 2019-10-25
CN110377525B CN110377525B (en) 2022-11-15

Family

ID=68256290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910680598.6A Active CN110377525B (en) 2019-07-25 2019-07-25 Parallel program performance prediction system based on runtime characteristics and machine learning

Country Status (1)

Country Link
CN (1) CN110377525B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522644A (en) * 2020-04-22 2020-08-11 中国科学技术大学 Method for predicting running time of parallel program based on historical running data
CN113553266A (en) * 2021-07-23 2021-10-26 湖南大学 Parallelism detection method, system, terminal and readable storage medium of serial program based on parallelism detection model

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063373A (en) * 2011-01-06 2011-05-18 北京航空航天大学 Method for positioning performance problems of large-scale parallel program
US20150032971A1 (en) * 2013-07-26 2015-01-29 Futurewei Technologies, Inc. System and Method for Predicting False Sharing
CN105183650A (en) * 2015-09-11 2015-12-23 哈尔滨工业大学 LLVM-based automatic performance prediction method for scientific calculation program
CN105183651A (en) * 2015-09-11 2015-12-23 哈尔滨工业大学 Viewpoint increase method for automatic performance prediction of program
CN105224452A (en) * 2015-09-11 2016-01-06 哈尔滨工业大学 A kind of prediction cost optimization method for scientific program static analysis performance
US20190213706A1 (en) * 2018-12-28 2019-07-11 Intel Corporation Techniques for graphics processing unit profiling using binary instrumentation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102063373A (en) * 2011-01-06 2011-05-18 北京航空航天大学 Method for positioning performance problems of large-scale parallel program
US20150032971A1 (en) * 2013-07-26 2015-01-29 Futurewei Technologies, Inc. System and Method for Predicting False Sharing
CN105183650A (en) * 2015-09-11 2015-12-23 哈尔滨工业大学 LLVM-based automatic performance prediction method for scientific calculation program
CN105183651A (en) * 2015-09-11 2015-12-23 哈尔滨工业大学 Viewpoint increase method for automatic performance prediction of program
CN105224452A (en) * 2015-09-11 2016-01-06 哈尔滨工业大学 A kind of prediction cost optimization method for scientific program static analysis performance
US20190213706A1 (en) * 2018-12-28 2019-07-11 Intel Corporation Techniques for graphics processing unit profiling using binary instrumentation

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
GIOVANNI MARIANI et al., "Predicting cloud performance for HPC applications before deployment", Future Generation Computer Systems *
MARTIN SCHOEBERL et al., "T-CREST: Time-predictable multi-core architecture for embedded systems", Journal of Systems Architecture *
WEIZHE ZHANG et al., "Predicting HPC parallel program performance based on LLVM compiler", Cluster Computing *
NIU Xiaoxia et al., "Loop runtime information analysis method based on edge profiling", Computer Engineering and Applications *
XIE Hucheng, "Research on LLVM-based automatic performance prediction of scientific computing programs", China Master's Theses Full-text Database, Information Science and Technology *
CHEN Li, "Parallel code optimization techniques on SMP clusters", China Doctoral and Master's Dissertations Full-text Database (Doctoral), Information Science and Technology *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111522644A (en) * 2020-04-22 2020-08-11 中国科学技术大学 Method for predicting running time of parallel program based on historical running data
CN111522644B (en) * 2020-04-22 2023-04-07 中国科学技术大学 Method for predicting running time of parallel program based on historical running data
CN113553266A (en) * 2021-07-23 2021-10-26 湖南大学 Parallelism detection method, system, terminal and readable storage medium of serial program based on parallelism detection model

Also Published As

Publication number Publication date
CN110377525B (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN103235974B (en) A kind of method improving massive spatial data treatment effeciency
US20170330078A1 (en) Method and system for automated model building
CN104750780B (en) A kind of Hadoop configuration parameter optimization methods based on statistical analysis
Nunamaker Jr A methodology for the design and optimization of information processing systems
Kamthe et al. A stochastic approach to estimating earliest start times of nodes for scheduling DAGs on heterogeneous distributed computing systems
CN110377525A (en) A kind of parallel program property-predication system based on feature and machine learning when running
CN105607952A (en) Virtual resource scheduling method and apparatus
CN113822173A (en) Pedestrian attribute recognition training acceleration method based on node merging and path prediction
CN112948123A (en) Spark-based grid hydrological model distributed computing method
CN110516884A (en) A kind of short-term load forecasting method based on big data platform
CN112148942A (en) Business index data classification method and device based on data clustering
CN111444635A (en) XM L language-based system dynamics simulation modeling method and engine
CN113762514A (en) Data processing method, device, equipment and computer readable storage medium
CN108647135B (en) Hadoop parameter automatic tuning method based on micro-operation
Wang et al. FineQuery: Fine-grained query processing on CPU-GPU integrated architectures
CN114401496A (en) Video information rapid processing method based on 5G edge calculation
CN110928705B (en) Communication characteristic analysis method and system for high-performance computing application
CN113610225A (en) Quality evaluation model training method and device, electronic equipment and storage medium
CN109190160B (en) Matrixing simulation method of distributed hydrological model
CN108280574B (en) Evaluation method and device for structural maturity of power distribution network
CN110262891A (en) Across virtual platform automatic multifunctional resource cyclic utilization system
Chen et al. A deep learning-based approach with PSO for workload prediction of containers in the cloud
Jie A performance modeling-based HADOOP configuration tuning strategy
CN105512401A (en) Make-to-order based worker shift arrangement simulation method
Pham et al. Machine learning approach to generate pareto front for list-scheduling algorithms

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant