CN109710419B

CN109710419B - MPI code communication process analysis method based on text analysis

Info

Publication number: CN109710419B
Application number: CN201811345110.6A
Authority: CN
Inventors: 肖利民; 张锐; 闫柏成; 王志昊; 刘成春; 周易
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2018-11-13
Filing date: 2018-11-13
Publication date: 2022-04-08
Anticipated expiration: 2038-11-13
Also published as: CN109710419A

Abstract

The invention provides an MPI code communication process analysis method based on text analysis, applied to parallel application programs developed with MPI. The method comprises the following steps: (a) analyzing the MPI source code line by line and capturing the key communication variables; (b) analyzing the source code in a loop as many times as the number of processes the user expects to start, each loop iteration simulating the execution of one MPI task; (c) during each iteration, executing the analysis procedure corresponding to each keyword statement encountered and checking whether it contains MPI communication keywords; (d) converting the communication process into point-to-point communication information according to the MPI communication keywords found; (e) integrating the point-to-point communication information from the multiple processes.

Description

MPI code communication process analysis method based on text analysis

Technical field:

The invention relates to a code analysis method, in particular to an MPI code communication process analysis method based on text analysis.

Background art:

In computer science, High Performance Computing (HPC) is an important branch that studies and develops high-performance computers from multiple aspects, such as system architecture, parallel algorithms and software development. High-performance computing has gradually become a new pillar of scientific research alongside traditional theoretical study and laboratory experiment.

When high-performance computing is used, some applications cannot make effective use of the computing resources they occupy on the platform, and a large amount of computing resources is wasted. To use high-performance computing platforms more efficiently, the platform must be matched to the actual application, which can be achieved through program prediction techniques. Most programs on such platforms are executed in parallel, so research on performance prediction and optimization of parallel programs is important for improving the performance of high-performance computing. Performance prediction and optimization of a parallel application first requires analyzing its communication performance, because communication performance is one of the key factors affecting the overall performance of the application; to reflect it accurately, the key communication information must be captured from the parallel application code and converted into point-to-point communication information. Because the Message Passing Interface (MPI) has many advantages, such as good portability, it is popular with developers and widely used to develop parallel applications. It is therefore important, for parallel applications developed with MPI, to capture the key communication information in the code and convert it into point-to-point communication information.

At present, there are two main ways of analyzing the communication process of MPI parallel application code: manual analysis and instrumentation-based analysis. Manual analysis means reading the source code of the MPI application by hand and extracting the point-to-point communication information it contains. Misunderstanding a section of code easily makes the extracted point-to-point communication information inaccurate, and the process consumes a great deal of time and human resources. Instrumentation-based analysis inserts information-capturing procedures at different stages, such as the source code or the object code, and then intercepts and stores the communication process while the MPI parallel application runs in a real environment. For large-scale applications, this approach consumes considerable extra computing and time resources because the number of processes is large, the computation time is long and the communication process is complex.

Summary of the invention:

The invention aims to provide an MPI code communication process analysis method based on text analysis that reduces the resource overhead of analyzing the communication process of MPI parallel application code.

The technical scheme of the invention is as follows:

The MPI code communication process analysis method based on text analysis is characterized by comprising the following steps:

1) analyzing the MPI source code line by line, capturing the key communication variables and storing them in a list;

2) analyzing the source code n times in a loop, where n is the actual number of processes the user expects to start; each loop iteration simulates the execution of one MPI task;

3) during each loop iteration, executing the analysis procedure corresponding to each keyword encountered, and checking whether MPI communication keywords are contained;

4) converting the communication process into point-to-point communication information according to the MPI communication keywords found;

5) finally, integrating the point-to-point communication information from the multiple loop analysis processes.

Step 1) comprises the following steps of initializing and obtaining the key communication variables:

step (1.1) obtaining the source code file and the actual number of processes n that the user expects to start;

step (1.2) scanning the MPI application source code and performing lexical analysis line by line;

step (1.3) when the key function MPI_Comm_rank is encountered, storing the variable name of its second parameter in a variable; this variable name represents the number of the currently running process;

step (1.4) when the key function MPI_Comm_size is encountered, storing the variable name of its second parameter in a variable; this variable name represents the actual number of processes currently started;

step (1.5) when a while, for or if keyword statement is encountered, checking whether the code segment executed by that statement contains MPI communication keywords, and if so, saving the variables involved in the statement's condition expression into a global list.
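Steps (1.3) and (1.4) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the regular expressions and the names `RANK_CALL`, `SIZE_CALL`, `capture_rank_and_size` and `example` are assumptions made for illustration, and the sketch assumes each MPI call sits on a single source line.

```python
import re

# Hypothetical patterns: capture the second argument of the call,
# dropping the leading '&' that C code normally passes.
RANK_CALL = re.compile(r"MPI_Comm_rank\s*\(\s*[^,]+,\s*&?\s*(\w+)\s*\)")
SIZE_CALL = re.compile(r"MPI_Comm_size\s*\(\s*[^,]+,\s*&?\s*(\w+)\s*\)")


def capture_rank_and_size(source_lines):
    """Return the variable names passed as the second argument of
    MPI_Comm_rank and MPI_Comm_size (None when a call is not found)."""
    rank_var = size_var = None
    for line in source_lines:
        match = RANK_CALL.search(line)
        if match:
            rank_var = match.group(1)
        match = SIZE_CALL.search(line)
        if match:
            size_var = match.group(1)
    return rank_var, size_var


# Illustrative input only; not taken from the patent.
example = [
    "MPI_Init(&argc, &argv);",
    "MPI_Comm_rank(MPI_COMM_WORLD, &myid);",
    "MPI_Comm_size(MPI_COMM_WORLD, &numprocs);",
]
print(capture_rank_and_size(example))
```

The two names returned here play the role of the stored variable names that later steps use as dictionary keys.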

Step 2) comprises the following steps:

step (2.1) distributing the loop iterations evenly, according to the process numbers of the simulated MPI tasks, over several execution processes for analysis; the number of execution processes can be configured by the user;

step (2.2) starting a process pool, in which the number of running processes can be configured by the user, and then running the execution processes, that is, analyzing the loop iterations simulating MPI task execution that each execution process contains;

step (2.3) in each process, at the start of each loop iteration, emptying the dictionary that stores variables and their values; then obtaining the variable names defined by #define and their values and storing them in the dictionary; then storing the captured variable name representing the number of started processes, together with the number of started processes, in the dictionary as a key-value pair; and then storing the variable name representing the number of the currently running process, together with the current loop index, in the dictionary as a key-value pair.
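The dictionary initialization of step (2.3) might look like the following Python sketch. The function name `init_symbols`, the `DEFINE` pattern and the restriction to integer-valued `#define` constants are assumptions made for illustration; the source does not say which `#define` forms are handled.

```python
import re

# Assumption: only simple integer constants such as "#define N 100" are handled.
DEFINE = re.compile(r"^\s*#\s*define\s+(\w+)\s+(-?\d+)\s*$")


def init_symbols(source_lines, rank_var, size_var, rank, n):
    """Build a fresh variable dictionary for one simulated MPI task."""
    symbols = {}                       # a new dictionary replaces the old one
    for line in source_lines:
        match = DEFINE.match(line)
        if match:
            symbols[match.group(1)] = int(match.group(2))
    symbols[size_var] = n              # number of started processes
    symbols[rank_var] = rank           # current loop index = simulated rank
    return symbols


src = ["#define TAG 7", "#define N 100", "int main() {"]
print(init_symbols(src, "myid", "numprocs", 2, 4))
```

Building a new dictionary per iteration keeps one simulated task's variable values from leaking into the next.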

In step 3), each loop iteration is processed according to the following steps:

step (3.1) starting from the first line of the main function, performing lexical analysis line by line;

step (3.2) when an assignment statement is recognized by its characteristics, first taking out the assigned variable A; if A belongs to _keys, evaluating the right-hand side of the assignment according to its type and storing the name of the assigned variable and the result in the dictionary as a key-value pair; otherwise, leaving the assignment statement unprocessed and continuing the analysis from the next line;

step (3.3) when a while keyword statement is encountered, first extracting the variables in the loop condition following the while keyword to form a set B. If [formula not reproduced], extracting the loop condition expression of the while statement and calculating its value with the logical expression evaluation method; if the value is false, skipping the analysis of the loop body of the current while statement and returning the end line number of the loop body in the source code file; if the value is true, analyzing the code in the loop body line by line according to step 4), recalculating the value of the loop condition expression after each pass until the value is false, and then returning the end line number of the loop body in the source code file. If not, finding the end line number of the loop body of the current while statement in the source code file with the statement segment search algorithm, and resuming the analysis from the line after that end line number;

step (3.4) when a for keyword statement is encountered, first extracting the variables involved in expressions 1, 2 and 3 following the for keyword to form a set C. If [formula not reproduced], extracting expression 1 of the for statement, which generally takes the form of an assignment statement, evaluating the right-hand side of the assignment according to its type, and storing the name of the assigned variable and the result in the dictionary as a key-value pair; then extracting expression 2 and calculating its value with the logical expression evaluation method; if the value is false, skipping the analysis of the loop body of the current for statement and returning the end line number of the loop body in the source code file; if the value is true, analyzing the code in the loop body line by line according to step 4), then extracting and evaluating expression 3 and recalculating the value of expression 2, repeating until the value is false, and then returning the end line number of the loop body in the source code file. If not, finding the end line number of the loop body of the current for statement in the source code file with the statement segment search algorithm, and resuming the analysis from the line after that end line number;

step (3.5) when an if keyword statement is encountered, first extracting the variables in the condition expression following the if keyword to form a set D. If [formula not reproduced], extracting the condition expression following the if keyword and calculating its value with the logical expression evaluation method; if the value is true, analyzing the code in the current if segment line by line according to step 4) and, when the analysis is finished, returning the end line number of the whole if statement code segment in the source code file; if the value is false, skipping the current if segment and checking whether an else if keyword or an else keyword appears on the next line; if neither appears, analyzing the next line directly; if an else keyword is found, analyzing the code in the current else segment line by line according to step 4) and, when finished, returning the end line number of the whole if statement code segment in the source code file; if an else if keyword is found, first evaluating the condition expression following the else if keyword; if the value is true, analyzing the code in the current else if segment line by line and, when finished, returning the end line number of the whole if statement code segment in the source code file; if the value is false, skipping the current else if segment, checking again whether an else if keyword or an else keyword appears on the next line, and repeating the above steps. If not, finding the end line number of the whole current if statement code segment in the source code file with the statement segment search algorithm, and resuming the analysis from the line after that end line number;

step (3.6) when an MPI point-to-point communication function keyword is encountered, executing step 4).
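The source does not describe its "logical expression evaluation method" in detail. One possible reading, sketched below in Python, rewrites the C logical operators into Python syntax and evaluates the result against the variable dictionary. The function name `eval_condition` is hypothetical, and the use of `eval` (with builtins disabled) is a shortcut chosen for brevity; it does not reproduce C semantics such as integer division.

```python
import re


def eval_condition(expr, symbols):
    """Evaluate a simple C condition such as 'myid == 0 && n > 1'
    using the values stored in the variable dictionary."""
    py = expr.replace("&&", " and ").replace("||", " or ")
    py = re.sub(r"!(?!=)", " not ", py).strip()   # logical NOT, keep '!='
    # Builtins are disabled; only names from the dictionary are visible.
    return bool(eval(py, {"__builtins__": {}}, dict(symbols)))


values = {"myid": 0, "numprocs": 4}
print(eval_condition("myid == 0", values))
print(eval_condition("myid != 0 && numprocs > 1", values))
```

With a helper of this kind, the while, for and if branches of steps (3.3) to (3.5) reduce to deciding whether to walk a code segment or skip to its end line.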

Step 4) comprises the following step of mining point-to-point communication information:

when an MPI point-to-point communication function keyword is identified, first taking the process number of the currently simulated MPI task out of the dictionary using the variable name that represents the current process number; then taking the destination process number of the message out of the dictionary using the variable name of the 4th parameter of the MPI point-to-point communication function; then checking whether the process executing the loop analysis already stores a send count from the source process number to the destination process number; if not, creating the count and setting it to 1; if so, adding 1 to the count.
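As a hedged illustration of this counting step, the Python sketch below handles only `MPI_Send`, whose 4th parameter is the destination rank. The names `SEND_CALL` and `record_send` are hypothetical, the arguments are split naively on commas, and a literal integer destination is accepted as a convenience that the source does not mention.

```python
import re
from collections import Counter

# Assumption: the whole MPI_Send call sits on one line, with no nested commas.
SEND_CALL = re.compile(r"MPI_Send\s*\((.*)\)")


def record_send(line, symbols, rank_var, counts):
    """Increment counts[(source, destination)] if the line is an MPI_Send call."""
    match = SEND_CALL.search(line)
    if not match:
        return False
    args = [arg.strip() for arg in match.group(1).split(",")]
    dest_expr = args[3]                          # 4th parameter: destination rank
    src = symbols[rank_var]                      # rank of the simulated task
    dst = symbols[dest_expr] if dest_expr in symbols else int(dest_expr)
    counts[(src, dst)] += 1                      # a Counter starts missing keys at 0
    return True


counts = Counter()
symbols = {"myid": 1, "dest": 0}
record_send("MPI_Send(&x, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);", symbols, "myid", counts)
print(counts)
```

Using a `Counter` folds the "create and set to 1, otherwise add 1" rule into a single increment.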

In step 5), integrating the point-to-point communication information from multiple processes means that, after the multiple loop analysis processes have finished, the point-to-point communication information stored in each process is gathered and summarized by sending process number, destination process number and number of sends.

The beneficial effects of the invention are as follows: the method is applied to parallel application programs developed with MPI; by analyzing the source code of the MPI parallel application and simulating the execution of the MPI tasks, it extracts the point-to-point communication information quickly and accurately, which reduces the resource overhead of the analysis. The analysis can be completed quickly and accurately on an ordinary personal computer, so the resources required are significantly reduced. The method avoids the inaccurate results caused by misunderstandings in manual analysis, and reduces the extra computing and time resources consumed in large-scale applications because of large process counts and similar problems.
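The aggregation can be sketched in Python as a merge of per-process counters. This is a minimal sketch: the function name `merge_counts` and the use of `collections.Counter` keyed by (source, destination) pairs are assumptions, not details given in the source.

```python
from collections import Counter


def merge_counts(partials):
    """Sum per-process (source, destination) counters and return
    sorted (source, destination, times) records."""
    total = Counter()
    for part in partials:
        total.update(part)            # Counter.update adds counts together
    return sorted((src, dst, times) for (src, dst), times in total.items())


first = Counter({(0, 1): 2, (1, 0): 1})
second = Counter({(0, 1): 3, (2, 0): 1})
print(merge_counts([first, second]))
```

Each analysis process only needs to return its own counter; the merge is independent of how the simulated ranks were grouped.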

Description of the drawings:

Fig. 1 is a flowchart of the MPI code communication process analysis method based on text analysis according to the invention.

Fig. 2 is a flowchart of the multi-process acceleration of the MPI code analysis process according to the invention.

Detailed description of the embodiments:

The invention is described in further detail below with reference to the accompanying drawings.

Fig. 1 is a flowchart of the MPI code communication process analysis method based on text analysis according to the invention.

The MPI code communication process analysis method based on text analysis comprises the following steps:

Step 1) initializing and obtaining the key communication variables

The actual number of processes n that the user expects to start is obtained and the MPI source code is read into memory. As can be seen from the figure, the key communication variables are obtained by lexical analysis of the MPI application code line by line, according to the following steps:

step (1.1) scanning the MPI application source code and performing lexical analysis line by line;

step (1.2) when the key function MPI_Comm_rank is encountered, storing the variable name of its second parameter in a variable rank; this variable name represents the number of the currently running process;

step (1.3) when the key function MPI_Comm_size is encountered, storing the variable name of its second parameter in a variable p; this variable name represents the actual number of processes currently started;

step (1.4) when a while, for or if keyword statement is encountered, checking whether the code segment executed by that statement contains MPI communication keywords, and if so, storing the variables involved in the statement's condition expression in a global list _keys;

Step 2) analyzing the source code n times in a loop, where n is the actual number of processes the user expects to start

Fig. 2 is a flowchart of the multi-process acceleration of the MPI code analysis process according to the invention. As can be seen from Fig. 2, assuming the number of processes the user expects to start is n, a suitable number of groups m is chosen according to the value of n; m is the number of analysis processes over which the n loop iterations simulating MPI task execution are distributed. The grouping is based on the process numbers of the simulated MPI tasks. A process pool is then started, with a suitable maximum number of running processes a chosen according to requirements and conditions. The grouped loop-analysis processes are then put into the process pool to run. Each group's loop-analysis process handles a certain number of simulated MPI task process numbers. At the start of each loop iteration, the values stored in the dictionary are cleared; then the captured variable name representing the number of started processes and the number of started processes are stored in the dictionary dit as a key-value pair; then the variable names defined by #define and their values are obtained and stored; and then the variable name representing the number of the currently running process and the current loop index are stored in the dictionary dit as a key-value pair;
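The grouping of the n simulated process numbers into m groups can be sketched in Python as follows. The function name `split_ranks` and the contiguous-block layout are assumptions; the source only says the distribution is even and based on process number.

```python
def split_ranks(n, m):
    """Distribute simulated ranks 0..n-1 over m groups as evenly as possible,
    giving the first (n mod m) groups one extra rank each."""
    base, extra = divmod(n, m)
    groups, start = [], 0
    for g in range(m):
        size = base + (1 if g < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


print(split_ranks(10, 3))
```

Each group could then be handed to one worker of a process pool, for example through the `map` method of Python's `multiprocessing.Pool` created with `processes=a`.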

Step 3) executing the loop of step 2), each loop iteration simulating the execution of one MPI task

Each loop iteration is processed according to the following steps:

step (3.1) starting from the first line of the main function, performing lexical analysis line by line;

step (3.2) when an assignment statement is recognized by its characteristics, first taking out the assigned variable A; if A belongs to _keys, evaluating the right-hand side of the assignment according to its type and storing the name of the assigned variable and the result in the dictionary dit as a key-value pair; otherwise, leaving the assignment statement unprocessed and continuing the analysis from the next line;

step (3.3) when a while keyword statement is encountered, first extracting the variables in the loop condition following the while keyword to form a set B. If [formula not reproduced], extracting the loop condition expression of the while statement and calculating its value with the logical expression evaluation method; if the value is false, skipping the analysis of the loop body of the current while statement and returning the end line number of the loop body in the source code file; if the value is true, analyzing the code in the loop body line by line according to step 4), recalculating the value of the loop condition expression after each pass until the value is false, and then returning the end line number of the loop body in the source code file. If not, finding the end line number of the loop body of the current while statement in the source code file with the statement segment search algorithm, and resuming the analysis from the line after that end line number;

step (3.4) when a for keyword statement is encountered, first extracting the variables involved in expressions 1, 2 and 3 following the for keyword to form a set C. If [formula not reproduced], extracting expression 1 of the for statement, which generally takes the form of an assignment statement, evaluating the right-hand side of the assignment according to its type, and storing the name of the assigned variable and the result in the dictionary dit as a key-value pair; then extracting expression 2 and calculating its value with the logical expression evaluation method; if the value is false, skipping the analysis of the loop body of the current for statement and returning the end line number of the loop body in the source code file; if the value is true, analyzing the code in the loop body line by line according to step 4), then extracting and evaluating expression 3 and recalculating the value of expression 2, repeating until the value is false, and then returning the end line number of the loop body in the source code file. If not, finding the end line number of the loop body of the current for statement in the source code file with the statement segment search algorithm, and resuming the analysis from the line after that end line number;
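The source names a "statement segment search algorithm" without describing it. One plausible sketch, in Python, finds the end line of a loop body by matching braces. The function name `find_block_end` is hypothetical, and the sketch assumes the body is brace-delimited and that no braces appear inside strings or comments.

```python
def find_block_end(lines, start):
    """Return the index of the line containing the brace that closes the
    first block opened at or after lines[start]."""
    depth = 0
    seen_open = False
    for index in range(start, len(lines)):
        for ch in lines[index]:
            if ch == "{":
                depth += 1
                seen_open = True
            elif ch == "}":
                depth -= 1
                if seen_open and depth == 0:
                    return index
    raise ValueError("no closing brace found")


code = [
    "while (i < n) {",
    "    if (i == 0) {",
    "        x = 1;",
    "    }",
    "    i = i + 1;",
    "}",
    "return 0;",
]
print(find_block_end(code, 0))
```

The analysis would then resume at the line after the returned index, which corresponds to "the next line of the ending line number" in the steps above.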

Step 4) analyzing the code of for, while and if statement segments

When an assignment statement is encountered, step (3.2) is executed; when a while keyword statement is encountered, step (3.3) is executed; when a for keyword statement is encountered, step (3.4) is executed; when an if keyword statement is encountered, step (3.5) is executed; when an MPI point-to-point communication function keyword is encountered, step 5) is executed;

Step 5) mining point-to-point communication information

When an MPI point-to-point communication function keyword is identified, the process number of the currently simulated MPI task is first taken out of the dictionary dit using the variable name stored in the variable rank; the destination process number of the message is then taken out of the dictionary dit using the variable name of the 4th parameter of the MPI point-to-point communication function; it is then checked whether the process executing the current loop analysis already stores a send count from the source process number to the destination process number; if not, the count is created and set to 1; if so, 1 is added to the count;

Step 6) summarizing the point-to-point communication information from multiple processes

As can be seen from Fig. 2, after the multiple grouped loop-analysis processes have finished running, each holds part of the point-to-point communication information; the information stored in each process is gathered and summarized by sending process number, destination process number and number of sends.

Claims

1. The MPI code communication process analysis method based on text analysis is characterized by comprising the following steps of:

5) finally, point-to-point communication information in a plurality of execution cycle analysis processes is integrated together;

said step 1) comprises the steps of initializing and obtaining critical communication variables:

2. The method according to claim 1, wherein step 2) comprises the following steps:

step (2.1) distributing the loop iterations evenly, according to the process numbers of the simulated MPI tasks, over several execution processes for analysis, the number of execution processes being configured by the user;

step (2.2) starting a process pool, the number of processes running in the pool being configured by the user, and then running the plurality of processes of step (3.1), that is, analyzing the loop iterations simulating MPI task execution that each execution process contains;

3. The method according to claim 1, wherein in step 3) each loop iteration is processed by the following steps:

Extracting the loop condition expression of the while statement and calculating its value with the logical expression evaluation method; if the value is false, skipping the analysis of the loop body of the current while statement and returning the end line number of the loop body in the source code file; if the value is true, analyzing the code in the loop body line by line according to step 4), recalculating the value of the loop condition expression after each pass until the value is false, and then returning the end line number of the loop body in the source code file; if it is not

Extracting the condition expression following the if keyword and calculating its value with the logical expression evaluation method; if the value is true, analyzing the code in the current if segment line by line according to step 4) and, when the analysis is finished, returning the end line number of the whole if statement code segment in the source code file; if the value is false, skipping the current if segment and checking whether an else if keyword or an else keyword appears on the next line; if neither appears, analyzing the next line directly; if an else keyword is found, analyzing the code in the current else segment line by line according to step 4) and, when finished, returning the end line number of the whole if statement code segment in the source code file; if an else if keyword is found, first evaluating the condition expression following the else if keyword; if the value is true, analyzing the code in the current else if segment line by line and, when finished, returning the end line number of the whole if statement code segment in the source code file; if the value is false, skipping the current else if segment, checking again whether an else if keyword or an else keyword appears on the next line, and repeating the above steps; if it is not

4. The method according to claim 1, wherein step 4) comprises the step of mining point-to-point communication information:

5. The method according to claim 1, wherein integrating the point-to-point communication information from multiple processes in step 5) means that, after the multiple loop analysis processes have finished, the point-to-point communication information stored in each process is gathered and summarized by sending process number, destination process number and number of sends.