CN117764203A - Large-scale machine learning performance optimization guiding device, method, equipment and medium - Google Patents

Large-scale machine learning performance optimization guiding device, method, equipment and medium

Info

Publication number
CN117764203A
Authority
CN
China
Prior art keywords: data, performance, level, database, basic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410046348.8A
Other languages
Chinese (zh)
Inventor
(Name withheld at the inventor's request)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Bi Ren Technology Co ltd
Original Assignee
Shanghai Bi Ren Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Bi Ren Technology Co ltd filed Critical Shanghai Bi Ren Technology Co ltd
Priority to CN202410046348.8A priority Critical patent/CN117764203A/en
Publication of CN117764203A publication Critical patent/CN117764203A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a large-scale machine learning performance optimization guiding device, method, equipment and medium, relating to the technical field of machine learning, wherein the device comprises: the basic task layer is used for carrying out data processing on the original performance data based on a plurality of pre-selected basic task components to obtain basic data; the high-level task layer is used for carrying out statistical concentration processing on the basic data to obtain first multi-level data; the business function layer is used for carrying out cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and for determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data; and the network service layer is used for responding to user operations, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result. The invention provides performance optimization guidance through multi-dimensional and multi-level integrated analysis, helping users analyze performance at finer granularity.

Description

Large-scale machine learning performance optimization guiding device, method, equipment and medium
Technical Field
The invention relates to the technical field of machine learning, in particular to a large-scale machine learning performance optimization guiding device, method, equipment and medium.
Background
In the technical field of machine learning, performance analysis tools are mainly used for analyzing various indexes during software running. PyTorch kineto is a common performance analysis tool, as are tools built on PyTorch kineto that further refine tensor usage information and data bandwidth information. Although PyTorch kineto can exhibit various views such as a performance overview view, an operator view, a timeline view, a memory view, and a communication performance view, it presents too few granularity levels, and the fine-grained information is too scattered. The performance analysis tools built on PyTorch kineto, although they refine tensor usage information and data bandwidth information, merely provide more analysis data and perform no multi-dimensional, multi-level integrated analysis, which does not help users analyze performance in a finer-granularity direction.
Disclosure of Invention
The invention provides a large-scale machine learning performance optimization guiding device, method, equipment and medium, which provide performance optimization guidance through multi-dimensional and multi-level integrated analysis and help users analyze performance in a finer-granularity direction.
In a first aspect, the present invention provides a large-scale machine learning performance optimization guidance device, including:
the basic task layer is used for providing a basic task component library, carrying out data processing on the original performance data in the original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
the high-level task layer is used for carrying out statistical concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
the business function layer is used for carrying out cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into the multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
the network service layer is used for responding to user operation, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result; the operation result comprises a query result and/or a custom analysis result.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the basic task layer is specifically used for:
providing a basic task component library, wherein the basic task component library comprises a plurality of fine-grained basic task components;
pre-selecting a plurality of basic task components from the basic task component library based on a performance optimization target of the high-level task layer;
constructing at least one basic task graph based on a plurality of basic task components and the dependency relationship among the plurality of basic task components;
inputting the original performance data in the original database into the at least one basic task graph for data processing to obtain the basic data;
and storing the basic data into the basic database.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the high-level task layer comprises:
the performance statistics module is used for carrying out concentration processing of space dimension and time dimension on the basic data to obtain first concentrated data; performing performance statistics on the first concentrated data by adopting at least one statistical index to obtain performance statistical data, and storing the performance statistical data into the multi-level database;
the operator information statistics module is used for carrying out concentration processing of space dimension and time dimension on the basic data to obtain second concentrated data; performing operator information statistics on the second concentrated data by adopting at least one statistical index to obtain operator information statistical data, and storing the operator information statistical data into the multi-level database;
the video memory information statistics module is used for carrying out concentration processing of space dimension and time dimension on the basic data to obtain third concentrated data; performing video memory information statistics on the third concentrated data by adopting at least one statistical index to obtain video memory information statistical data, and storing the video memory information statistical data into the multi-level database;
the first multi-level data comprises the performance statistical data, the operator information statistical data and the video memory information statistical data.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the business function layer comprises:
the cross integration module is used for carrying out cross integration on the first multi-level data to obtain the associated information among the first multi-level data, and storing the associated information into the multi-level database;
the performance index calculation module is used for performing performance index calculation on the first multi-level data and the associated information in the multi-level database to obtain target index data, and storing the target index data into the multi-level database; the target index data comprises a plurality of index data of each card, comprehensive index data of each card and performance comparison results among the cards;
the performance abnormality detection module is used for performing performance abnormality detection on the first multi-level data and the associated information in the multi-level database to obtain abnormal index data, and storing the abnormal index data into the multi-level database; the second multi-level data comprises the association information, the target index data and the abnormal index data;
the tuning optimization chart analysis module is used for performing tuning optimization chart analysis on the second multi-level data to obtain the performance analysis chain and the optimization suggestion, and storing the performance analysis chain and the optimization suggestion into the service database;
and the performance report generating module is used for generating the performance report based on the second multi-level data and storing the performance report into the service database.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the business function layer further comprises:
the performance simulator is used for taking the original performance data as input and simulating a part of the training process of large-scale machine learning to obtain a supplementary performance report; the supplementary performance report is complementary to the performance report, and the supplementary performance report is stored in the service database.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the network service layer comprises:
the data visualization platform is used for inquiring from the multi-level database and the service database based on the inquiring condition input by the user to obtain the inquiring result; displaying the query result on a visual network interface based on the visual method selected by the user;
and the custom analysis platform is used for carrying out performance analysis on the multi-level database and the service database based on the custom analysis method configured by the user to obtain the custom analysis result.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the network service layer further comprises:
and the data monitoring platform is used for monitoring the multi-level database in real time to obtain a real-time monitoring result.
According to the large-scale machine learning performance optimization guiding device provided by the invention, the network service layer further comprises:
and the performance analysis robot is used for guiding the user through multiple rounds of dialogue using natural language processing technology, thereby completing performance analysis of blind spots not covered by the other analyses.
In a second aspect, the present invention also provides a method for optimizing and guiding large-scale machine learning performance, including:
providing a basic task component library, carrying out data processing on original performance data in an original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
carrying out statistical concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into the multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
responding to user operation, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result; the operation result comprises a query result and/or a custom analysis result.
In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the method for optimizing and guiding performance of large-scale machine learning according to the second aspect.
In a fourth aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a large-scale machine learning performance optimization guidance method as described in the above second aspect.
In a fifth aspect, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the large-scale machine learning performance optimization guidance method of the second aspect described above.
The invention provides a large-scale machine learning performance optimization guiding device, method, equipment and medium, wherein the device comprises: a basic task layer, an advanced task layer, a business function layer and a network service layer. The basic task layer is used for providing a basic task component library; based on a plurality of basic task components preselected from the basic task component library, it can perform preliminary concentration processing on massive original performance data in the original database to obtain basic data, and the basic data is stored in the basic database. The high-level task layer is used for carrying out further statistical concentration processing on massive basic data in the basic database to obtain first multi-level data, and storing the first multi-level data into the multi-level database. The business function layer is used for carrying out cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into the multi-level database, so that multi-dimensional, multi-level mining and integration can be carried out on the first multi-level data and the granularity levels of the data are greatly increased; it extracts a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and stores them in the service database. The network service layer is used for responding to user operations, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result; the operation results comprise query results and/or custom analysis results. Starting from the performance report and optimization suggestions, the user can be guided by the performance analysis chain to analyze performance layer by layer in a finer-granularity direction. Therefore, the invention provides performance optimization guidance through multi-dimensional and multi-level integrated analysis and helps users analyze performance at finer granularity.
Drawings
In order to more clearly illustrate the invention or the technical solutions of the prior art, the following description will briefly explain the drawings used in the embodiments or the description of the prior art, and it is obvious that the drawings in the following description are some embodiments of the invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a large-scale machine learning performance optimization guidance device provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of data layering of a large-scale machine learning performance optimization guidance device provided by an embodiment of the invention;
FIG. 3 is a schematic diagram of a basic task graph provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a tuning map analysis process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an inter-card performance comparative analysis process provided by an embodiment of the present invention;
FIG. 6 is a schematic diagram of a performance optimization closed loop of a large-scale machine learning performance optimization guidance device provided by an embodiment of the present invention;
FIG. 7 is a flow chart of a large-scale machine learning performance optimization guidance method provided by an embodiment of the invention;
Fig. 8 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the technical field of machine learning, performance analysis tools are mainly used for analyzing various indexes during software running. PyTorch kineto is a common performance analysis tool, and PyTorch kineto can present various views such as a performance overview view, an operator view, a timeline view, a memory view, and a communication performance view. Wherein:
The performance overview view may show the overall performance situation, such as: graphics processor (Graphics Processing Unit, GPU) hardware information, end-to-end (one training cycle) time consumption, and the time consumption and proportion of different kernel (Kernel) types, etc.
The operator view may expose CPU and GPU operator performance details, such as: aggregate time consumption, time-consumption percentage, number of calls, etc.
The timeline view may show all recorded information, such as: the time position and duration of each record, the recorded function call stack, etc., and may be used to analyze performance details.
The memory view may show a change curve of the memory state over time; the memory state may include the allocated GPU memory in use, the total GPU memory managed by PyTorch, the allocated CPU memory in use, and the total CPU memory managed by PyTorch. In addition, the memory view also provides the memory usage of operators, such as the memory size used by each operator during execution.
The communication performance view is a view specific to the distributed scenario; it can show the hardware resource information of one node (comprising a plurality of GPUs) and the overlapping state between communication operators and other operators, since communication time consumption is likely to be a performance bottleneck in distributed scenarios, and overlapping it with other operators is desirable to reduce its impact as much as possible.
However, PyTorch kineto presents too few granularity levels, and the fine-grained information is too scattered. Its statistical rules are too simple and do not match reality at coarser granularity; for example, the communication performance view cannot reflect the overlap between communication operations, or between computation and memory access. The classification of operators in the performance overview view is also not accurate enough, since it depends on the operator definitions of the specific model.
Performance analysis tools built on PyTorch kineto, although they refine tensor usage information and data bandwidth information, merely provide more analysis data and perform no multi-dimensional, multi-level integrated analysis. On the one hand, this sets a very high threshold for the user, who must be deeply familiar with the distributed strategy, the model structure and the operator details, and must, on the basis of this prior knowledge combined with rich optimization experience, decide how to further filter the subdivided data in order to determine the state of the model's performance. On the other hand, in large-scale machine learning scenarios the volume of performance analysis (profiling) data is very large; the total may reach hundreds of GB, which is almost impossible to analyze manually, so in practice only part of the data receives attention during analysis and many optimization opportunities are missed. Neither situation helps the user analyze performance in a finer-granularity direction.
Based on the above, the embodiment of the invention provides a large-scale machine learning performance optimization guiding device, a method, equipment and a medium, and the method, the equipment and the medium are specifically described below.
The large-scale machine learning performance optimization guiding device of the present invention is described below in conjunction with fig. 1 to 6.
Referring to fig. 1, fig. 1 is a schematic diagram of a large-scale machine learning performance optimization guiding device according to an embodiment of the present invention. As shown in fig. 1, the apparatus may include: a basic task layer 1, an advanced task layer 2, a business function layer 3 and a network service layer 4. The basic task layer 1, the advanced task layer 2 and the business function layer 3 are deployed at a server side, the network service layer 4 is deployed at a client side, and a user interacts with the network service layer 4 through the client side. Wherein:
the basic task layer 1 is used for providing a basic task component library, carrying out data processing on the original performance data in the original database based on a plurality of basic task components selected in advance from the basic task component library to obtain basic data, and storing the basic data into the basic database.
Specifically, the basic task layer 1 provides a basic task component library comprising a plurality of fine-grained basic task components, where a basic task component is a program that realizes a basic function. Massive original performance data in the original database are subjected to preliminary concentration processing by a plurality of basic task components preselected from the basic task component library to obtain basic data, and the basic data are stored in the basic database. The massive original performance data in the original database form the original data layer shown in fig. 2, and the basic data in the basic database form the enhanced data layer shown in fig. 2.
In one embodiment, the base task layer 1 is specifically configured to: providing a basic task component library, wherein the basic task component library comprises a plurality of fine-grained basic task components; pre-selecting a plurality of basic task components from a basic task component library based on a performance optimization target of an advanced task layer; constructing at least one basic task graph based on the plurality of basic task components and the dependency relationship between the plurality of basic task components; inputting the original performance data in the original database into at least one basic task graph for data processing to obtain basic data; the base data is stored in a base database.
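As a rough illustration of how such a basic task graph could be executed, the sketch below runs preselected components in dependency order via a topological sort. It is a minimal sketch under stated assumptions: the component names, the dict-based interface, and the use of `graphlib` are illustrative choices, not the invention's disclosed implementation.

```python
from graphlib import TopologicalSorter

def run_task_graph(components, deps, raw_data):
    """Execute preselected basic task components in dependency order.

    components: name -> callable(inputs_dict) -> output  (hypothetical interface)
    deps:       name -> set of upstream component names
    raw_data:   the original performance data fed into the graph
    """
    results = {"raw": raw_data}
    # static_order() yields each node only after all of its predecessors.
    for name in TopologicalSorter(deps).static_order():
        inputs = {d: results[d] for d in deps.get(name, ())}
        inputs["raw"] = raw_data  # every component may also see the raw data
        results[name] = components[name](inputs)
    return results
```

For example, a two-component chain where component "b" depends on "a" would run "a" first and feed its output into "b".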
Illustratively, as shown in fig. 3, the basic processing component, the information fusion component, the automatic tag component, the time stamp calibration component, the communication time calculation component, and the memory analysis component are preselected from the basic task component library based on the performance optimization objective of the advanced task layer, and the embodiment is not limited thereto. Wherein:
the basic processing component is used for preprocessing such as data enhancement and restoration.
The information fusion component is used for fusing data from multiple recording modes and synthesizing them into a single accurate and complete data set, so that both performance accuracy and information integrity can be taken into account.
The automatic tag component is used for automatically deriving tag values of the performance data based on the own rules of the performance data.
Optionally, the automatic tag component includes a single tag derivation function and a range tag derivation function, and the present embodiment is not limited thereto.
1) Single tag derivation
Fuzzy matching between the names of the performance data and the model information and stack information of the deep neural network (Deep Neural Networks, DNN) can automatically derive the single tag value of the performance data.
2) Range tag derivation
The labels in the performance data are processed by a longest-repeated-subsequence extraction algorithm, and a new label is generated for each subsequence. The new label captures program structure and can provide more classification dimensions for performance analysis. For example, a Generative Pre-trained Transformer (GPT) model contains several Transformer operators, and the original information of each Transformer operator is fragmented, so boundaries between Transformer operators cannot be distinguished. Automatic labels can mark the range of each Transformer operator, so the range label of the performance data can be derived automatically.
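The patent does not disclose the extraction algorithm itself. The following minimal sketch illustrates the idea under two simplifying assumptions: a "subsequence" is taken to be a contiguous run of operator labels, and the longest repeated run is taken to delimit one block (e.g., one Transformer operator); the function name and output format are hypothetical.

```python
def derive_range_tags(labels):
    """Find the longest contiguous run of operator labels that repeats
    without overlap, and emit a (start, end) range tag per occurrence."""
    n = len(labels)
    for length in range(n // 2, 0, -1):          # try longest runs first
        seen = {}
        for i in range(n - length + 1):
            key = tuple(labels[i:i + length])
            if key in seen and i >= seen[key] + length:  # non-overlapping repeat
                spans, j = [], 0
                while j <= n - length:            # collect all occurrences
                    if tuple(labels[j:j + length]) == key:
                        spans.append((j, j + length))
                        j += length
                    else:
                        j += 1
                return [{"tag": "block", "span": s} for s in spans]
            seen.setdefault(key, i)
    return []
```

On a label sequence such as `["ln", "attn", "mlp", "ln", "attn", "mlp"]` this would mark two block ranges, one per repeated (hypothetical) Transformer body.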
The time stamp calibration component is used for automatically deducing the timing offset by utilizing the timing characteristics of inter-machine communication operators, thereby achieving time stamp calibration and avoiding the problem that, in a large-scale scenario, offsets between the time stamps of different machines seriously affect the accuracy of inter-card timing analysis. Inter-card timing refers to the timing relationships between different cards; the timing offset is deduced based on the rules of distributed communication to obtain the inter-card timing. The "card" herein may be a GPU card.
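A minimal sketch of such timestamp calibration, assuming that matched collective-communication operators start nearly simultaneously on both machines; the function names and the median-based estimator are illustrative assumptions, not the disclosed method.

```python
from statistics import median

def estimate_clock_offset(local_starts, remote_starts):
    """Estimate the clock offset between two machines from the start
    timestamps of matched communication operators, assuming matched
    operators begin (nearly) simultaneously on both sides."""
    diffs = [r - l for l, r in zip(local_starts, remote_starts)]
    return median(diffs)          # median is robust to straggler outliers

def calibrate(timestamps, offset):
    """Shift remote timestamps into the local clock domain."""
    return [t - offset for t in timestamps]
```

With the offset removed, events from different machines can be placed on one consistent inter-card timeline.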
The communication time calculation component is used for distinguishing communication waiting time from real communication time, so that communication bandwidth and bubbles can be analyzed more accurately. A bubble refers to idle time on the GPU or CPU; bubble size and location are related to the distributed training strategy. For example, bubbles necessarily exist in a pipeline-parallel strategy, so the difference between actual and theoretical bubbles needs to be analyzed to find unreasonable bubbles.
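One way to separate waiting time from real communication time for a collective operation, sketched under the assumption that the transfer phase only begins once the slowest rank has arrived; the function name and data shapes are hypothetical.

```python
def split_comm_time(arrivals, finish):
    """For one collective operation, split each rank's elapsed time into
    wait time (blocked until the slowest rank arrives) and real
    communication time (the shared transfer phase).

    arrivals: per-rank timestamps at which the collective was entered
    finish:   timestamp at which the collective completed
    """
    start = max(arrivals)         # transfer starts when the last rank arrives
    real = finish - start         # actual data-movement time
    return [{"rank": r, "wait": start - a, "comm": real}
            for r, a in enumerate(arrivals)]
```

Bandwidth computed from `comm` rather than total elapsed time then excludes synchronization stalls.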
The video memory analysis component is used for deeply mining video memory bottleneck information (such as peak components of the video memory) in the original video memory data. Wherein, the use state of the video memory of each time stamp is recorded in the original video memory data. The memory bottleneck information is a key factor influencing the memory upper limit required by model training, and the memory upper limit required by model training can be reduced based on the mined memory bottleneck information, so that the memory optimization is completed.
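As an illustrative sketch of mining peak components from such a video memory log: the event format below (timestamp, name, signed byte delta) is an assumption for illustration; real records would carry richer metadata.

```python
def peak_memory_breakdown(events):
    """Replay an allocation log to find the peak memory usage and which
    allocations compose it.

    events: list of (timestamp, name, delta_bytes); delta > 0 is an
            allocation, delta < 0 a free (hypothetical format).
    """
    live, total, peak, peak_live = {}, 0, 0, {}
    for _, name, delta in sorted(events):
        live[name] = live.get(name, 0) + delta
        if live[name] <= 0:
            live.pop(name)                 # fully freed
        total += delta
        if total > peak:                   # record the composition at the peak
            peak, peak_live = total, dict(live)
    return peak, peak_live
```

The allocations dominating `peak_live` are the bottleneck candidates whose reduction would lower the memory ceiling required for training.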
The first basic processing component and the second basic processing component each have a predecessor-successor dependency with the information fusion component; the information fusion component has predecessor-successor dependencies with the automatic tag component, the time stamp calibration component and the video memory analysis component; and the automatic tag component and the time stamp calibration component have predecessor-successor dependencies with the communication time calculation component. A basic task graph is constructed based on the preselected basic task components and the dependencies between them.
The original performance data "trace data1" in the original database is input into the first basic processing component in the basic task graph for preprocessing such as data enhancement and restoration, yielding first enhancement data. The original performance data "trace data2" in the original database is input into the second basic processing component in the basic task graph for the same preprocessing, yielding second enhancement data.
The original performance data "trace data1" has a small information amount but high accuracy, while the original performance data "trace data2" has a large information amount but low accuracy. Using both kinds of original performance data simultaneously achieves both accuracy and completeness.
It should be noted that the above two types of raw performance data are merely used to teach a person skilled in the art how to implement the present invention, and the present invention is not limited thereto, but may be various other types of raw performance data.
The first enhancement data and the second enhancement data are input into the information fusion component for information fusion to obtain fusion data; the scattered first and second enhancement data can thus be synthesized into a single accurate and complete piece of data, taking into account both performance accuracy and information integrity.
And processing the fusion data based on the automatic tag assembly, the timestamp calibration assembly, the communication time calculation assembly and the video memory analysis assembly to obtain basic data, and storing the basic data into a basic database.
The high-level task layer 2 is used for carrying out statistical concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database.
Specifically, the data volume of the basic database is very large, possibly hundreds of GB or more. The advanced task layer 2 performs further statistical concentration processing on the massive basic data in the basic database, reducing the data volume by several orders of magnitude while retaining key information as much as possible: similar data are merged, and statistical indicators (average value, maximum value, minimum value, variance, etc.) are introduced to measure the variation within the merged data. The resulting first multi-level data can be used directly for analysis and is stored in the multi-level database.
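A minimal sketch of this kind of statistical concentration, assuming records are grouped by card and operator name and each group is summarized with the statistical indicators named above; the field names (`card`, `op`, `duration_us`) are hypothetical.

```python
from collections import defaultdict
from statistics import mean, variance

def concentrate(records, key=lambda r: (r["card"], r["op"])):
    """Merge similar performance records and summarize each group with
    statistical indicators (count, mean, max, min, variance), reducing
    data volume while keeping key information."""
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec["duration_us"])
    return {
        k: {"count": len(v), "mean": mean(v), "max": max(v),
            "min": min(v), "var": variance(v) if len(v) > 1 else 0.0}
        for k, v in groups.items()
    }
```

Many raw per-event records thus collapse into one summary row per (card, operator) group, which is the form suitable for storage in a statistics table.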
In one embodiment, the advanced task layer 2 includes: the system comprises a performance statistics module, an operator information statistics module and a video memory information statistics module; wherein:
the performance statistics module is used for carrying out concentration processing of space dimension and time dimension on the basic data to obtain first concentrated data; and carrying out performance statistics on the first concentrated data by adopting at least one statistical index to obtain performance statistical data, and storing the performance statistical data into a multi-level database.
Specifically, the amount of data in the basic database is very large, possibly hundreds of GB. Basic data carrying time information are concentrated in both the spatial dimension and the time dimension, while basic data without time information are concentrated in the spatial dimension only. Performance statistics are then computed on the first concentrated data obtained by the concentration processing, using statistical indexes such as the average value, maximum value, minimum value and variance, to obtain performance statistical data. The performance statistical data are stored in the multi-level database in the form of a performance statistics table.
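The concentration processing over the space (per-card) and time (per-window) dimensions, together with the statistical indexes, can be sketched as follows. The field names `card_id`, `ts_us` and `duration_us` and the window size are illustrative assumptions, not the schema of the embodiment.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple

def concentrate(events: List[dict],
                window_us: int = 1_000_000) -> Dict[Tuple[int, int], dict]:
    """Concentrate raw per-event records along the space (card) and time
    (window) dimensions, keeping only summary statistics per bucket."""
    buckets: Dict[Tuple[int, int], List[float]] = defaultdict(list)
    for e in events:
        # (space, time) bucket key: which card, which time window
        key = (e["card_id"], e["ts_us"] // window_us)
        buckets[key].append(e["duration_us"])
    stats = {}
    for key, vals in buckets.items():
        stats[key] = {
            "mean": statistics.mean(vals),
            "max": max(vals),
            "min": min(vals),
            "var": statistics.pvariance(vals),  # population variance
            "count": len(vals),
        }
    return stats
```

Each bucket replaces an arbitrary number of raw events with a fixed handful of scalars, which is how the data volume drops by orders of magnitude while differences inside the merged data remain measurable.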
The operator information statistics module is used for carrying out concentration processing of space dimension and time dimension on the basic data to obtain second concentrated data; and carrying out operator information statistics on the second concentrated data by adopting at least one statistical index to obtain operator information statistical data, and storing the operator information statistical data into a multi-level database.
Specifically, the amount of data in the basic database is very large, possibly hundreds of GB. Basic data carrying time information are concentrated in both the spatial dimension and the time dimension, while basic data without time information are concentrated in the spatial dimension only. Operator information statistics are then computed on the second concentrated data obtained by the concentration processing, using statistical indexes such as the average value, maximum value, minimum value and variance, to obtain operator information statistical data. The operator information statistical data are stored in the multi-level database in the form of an operator information statistics table.
The video memory information statistics module is used for carrying out concentration processing of space dimension and time dimension on the basic data to obtain third concentrated data; and carrying out video memory information statistics on the third concentrated data by adopting at least one statistical index to obtain video memory information statistical data, and storing the video memory information statistical data into a multi-level database.
Specifically, the amount of data in the basic database is very large, possibly hundreds of GB. Basic data carrying time information are concentrated in both the spatial dimension and the time dimension, while basic data without time information are concentrated in the spatial dimension only. Video memory information statistics are then computed on the third concentrated data obtained by the concentration processing, using statistical indexes such as the average value, maximum value, minimum value and variance, to obtain video memory information statistical data. The video memory information statistical data are stored in the multi-level database in the form of a video memory information statistics table.
The first multi-level data comprises performance statistical data, operator information statistical data and video memory information statistical data. In this way, the first multi-level data of three levels of performance, operators and video memory can be obtained by carrying out statistic concentration processing on massive basic data in the basic database. The first multi-level data forms the underlying statistical data layer shown in fig. 2.
The service function layer 3 is used for carrying out cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into a multi-level database; and determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a business database.
Specifically, since the first multi-level data obtained by the advanced task layer 2 are not yet rich enough in hierarchy and are still huge in volume, the first multi-level data must undergo multiple kinds of processing, such as cross integration, performance index calculation and performance abnormality detection, to obtain second multi-level data, which are stored in the multi-level database. In this way, the complex multi-dimensional first multi-level data can be concentrated into simple single-dimension data, which speeds up the user's localization of performance bottlenecks and improves the coverage of data analysis.
And generating a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database. The performance report is obtained by continuously concentrating the data on the basis of the second multi-level data, the second multi-level data can be concentrated to scalar performance index values as much as possible, and the performance index values can directly reflect the performance quality of the model. The bottleneck problem of the performance can be automatically found out through the performance analysis chain, and a plurality of feasible optimization suggestions are given.
In one embodiment, the service function layer 3 includes: the system comprises a cross integration module, a performance index calculation module, a performance abnormality detection module, a tuning optimization chart analysis module and a performance report generation module; wherein:
and the cross integration module is used for carrying out cross integration on the first multi-level data to obtain the associated information among the first multi-level data, and storing the associated information into the multi-level database.
Specifically, the first multi-level data are cross integrated, association information among the first multi-level data is mined, and the association information is stored in a multi-level database. The association information (multi-dimensional cross information table) between the first multi-level data forms the cross data layer shown in fig. 2.
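Cross integration can be sketched as a key-based join of two first-level statistics tables. The shared key `(card_id, op_name)` and the row fields are assumptions for illustration; the embodiment does not fix a schema.

```python
from typing import List

def cross_integrate(perf_rows: List[dict], op_rows: List[dict]) -> List[dict]:
    """Cross-integrate a performance statistics table with an operator
    information statistics table by their shared (card_id, op_name) key,
    yielding rows of a multi-dimensional cross information table."""
    op_index = {(r["card_id"], r["op_name"]): r for r in op_rows}
    cross = []
    for p in perf_rows:
        key = (p["card_id"], p["op_name"])
        if key in op_index:
            # merged row carries fields from both source tables
            cross.append({**p, **op_index[key]})
    return cross
```

The merged rows expose association information (e.g. which operator's call count explains which card's timing statistics) that neither source table shows alone.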
The performance index calculation module is used for calculating the performance index of the first multi-level data and the associated information in the multi-level database to obtain target index data, and storing the target index data into the multi-level database; the target index data includes a plurality of index data of each card, comprehensive index data of each card, and performance comparison results between the cards.
Specifically, under a large-scale scene, the data volume of the first multi-level data is still huge, each card has respective data, and direct analysis is quite inefficient. The performance index calculation is carried out on the first multi-level data and the associated information in the multi-level database, so that a plurality of index data of each card, comprehensive index data of each card, performance comparison results among the cards and other target index data can be extracted, and the method is particularly suitable for large-scale machine learning cluster scenes. The target index data forms the index data layer shown in fig. 2.
Since the performance benefit of distributed training is, by design, bounded by the performance differences among the cards, the larger the scale of the distributed training, the greater the impact of inter-card performance differences. The inter-card performance differences therefore need to be analyzed, a problem that the related art does not consider at all.
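The inter-card performance comparison can be sketched as follows: per-card times are reduced to ratios against the population mean, and a straggler overhead scalar captures how much the slowest card penalizes the whole step. The specific index names are illustrative, not the ones defined by the embodiment.

```python
import statistics
from typing import Dict

def compare_cards(per_card_time: Dict[int, float]) -> dict:
    """Reduce per-card step times to per-card indices and an inter-card
    comparison: in synchronous training the slowest card bounds the step."""
    mean_t = statistics.mean(per_card_time.values())
    slowest = max(per_card_time, key=per_card_time.get)
    return {
        # per-card index: ratio of this card's time to the mean
        "per_card_ratio": {c: t / mean_t for c, t in per_card_time.items()},
        "slowest_card": slowest,
        # fraction of step time lost waiting for the slowest card
        "straggler_overhead": max(per_card_time.values()) / mean_t - 1.0,
    }
```

Even with thousands of cards this reduces the comparison to one scalar per card plus one cluster-level scalar, which is what makes direct analysis feasible at large scale.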
The performance abnormality detection module is used for performing performance abnormality detection on the first multi-level data and the associated information in the multi-level database to obtain abnormal index data, and storing the abnormal index data into the multi-level database; the second multi-level data includes association information, target index data, and abnormal index data.
Specifically, performance anomaly detection is performed on first multi-level data and associated information in a multi-level database to obtain anomaly index data. The anomaly index data may reflect cards, operators, time regions, and anomaly types where performance anomalies occur. The abnormality index data forms an abnormality analysis layer shown in fig. 2.
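One simple way to flag which cards behave abnormally is a z-score test against the population of cards. This is a minimal sketch under the assumption of a scalar metric per card; the embodiment does not specify the detection algorithm.

```python
import statistics
from typing import Dict, List

def detect_anomalies(samples: Dict[int, float],
                     z_threshold: float = 3.0) -> List[int]:
    """Flag cards whose metric deviates from the population mean by
    more than z_threshold standard deviations."""
    vals = list(samples.values())
    mu = statistics.mean(vals)
    sigma = statistics.pstdev(vals)  # population standard deviation
    if sigma == 0:
        return []  # all cards identical: nothing anomalous
    return [card for card, v in samples.items()
            if abs(v - mu) / sigma > z_threshold]
```

The same test can be run per operator or per time region, which yields exactly the kind of anomaly index data (card, operator, time region, anomaly type) described above.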
And the tuning optimization graph analysis module is used for performing tuning optimization graph analysis on the second multi-level data to obtain a performance analysis chain and optimization suggestions, and storing the performance analysis chain and the optimization suggestions into the service database.
Illustratively, as shown in fig. 4, a designed tuning graph is given. Starting from a specific index, for example on finding that the cpu_occupy ratio is too low, one checks along the tuning graph whether the communication occupancy (occupy) ratio and the input/output (io) occupy ratio are reasonable. All of these ratios have theoretical support, for example theoretical values calculated from the communication bandwidth and data volume of the hardware. Then, when examining the communication occupy ratio, if the communication bubble ratio is found to be too large, the bubble condition of the different communication groups is analyzed further. Fig. 4 lists three communication groups: tensor parallel (Tensor Parallel, TP), data parallel (Data Parallel, DP) and pipeline parallel (Pipeline Parallel, PP). The communication group with abnormal time consumption is found, and the cause of the abnormal communication time consumption is then analyzed further.
Communication time-consuming anomalies are typically due to inter-card timing differences, from which the source of the performance bottleneck can be further deduced through the inter-card performance comparison of fig. 5.
The process of the tuning optimization graph analysis is saved as a performance analysis chain. Performance bottlenecks can be located very quickly along this analysis chain, and several feasible optimization suggestions can be given. The performance analysis chain and the optimization suggestions are stored in the service database.
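The walk along the tuning graph, recording each diagnostic step as a performance analysis chain, can be sketched as follows. The thresholds and metric names are illustrative placeholders, not values taken from fig. 4.

```python
from typing import Dict, List

def walk_tuning_graph(metrics: Dict[str, float]) -> List[str]:
    """Walk a tiny version of the tuning graph, appending each
    diagnostic conclusion to a performance analysis chain."""
    chain = []
    if metrics.get("cpu_occupy", 1.0) < 0.6:          # illustrative threshold
        chain.append("cpu_occupy ratio too low")
        if metrics.get("comm_occupy", 0.0) > 0.3:      # communication branch
            chain.append("communication occupancy too high")
            if metrics.get("comm_bubble", 0.0) > 0.2:
                chain.append("communication bubble too large")
                # find the most time-consuming communication group
                worst = max(("tp", "dp", "pp"),
                            key=lambda g: metrics.get(f"{g}_time", 0.0))
                chain.append(f"abnormal communication group: {worst}")
        elif metrics.get("io_occupy", 0.0) > 0.3:      # io branch
            chain.append("io occupancy too high")
    return chain
```

The returned list is exactly the saved analysis chain: reading it top to bottom retraces the bottleneck localization, and each entry can be mapped to optimization suggestions.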
And the performance report generating module is used for generating a performance report based on the second multi-level data and storing the performance report into the service database.
Specifically, the second multi-level data is concentrated to scalar performance index values as much as possible, the performance index values can directly reflect the performance of the model, a performance report is generated, and the performance report is stored in a service database. The performance analysis chain, optimization suggestions, and performance reports form the performance reporting layer shown in fig. 2.
Optionally, the service function layer 3 further includes: a performance simulator for taking the original performance data as input and simulating a part of the training process of large-scale machine learning to obtain a supplementary performance report; the supplementary performance report is complementary to the performance report and is stored in the service database.
Specifically, the performance simulator can take the accurate raw performance data as input, so that performance data can be obtained without GPU hardware and the execution of the lower-level hardware can be reconstructed. Simulating a part of the training process of large-scale machine learning, i.e. simulating certain sub-modules such as a video memory management sub-module and a video memory offload strategy sub-module, makes it possible to deduce information that is absent from the raw performance data and to obtain a supplementary performance report. The supplementary performance report and the performance report complement each other, and the supplementary performance report is likewise stored in the service database. The supplementary performance report also belongs to the performance reporting layer shown in fig. 2.
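One simulated sub-module, video memory management, can be sketched as a replay of alloc/free events from the trace, recovering peak usage without any GPU hardware. The `(kind, size_mb)` event format is an assumption for illustration.

```python
from typing import List, Tuple

def simulate_memory(events: List[Tuple[str, int]], capacity_mb: int) -> dict:
    """Replay alloc/free events from raw trace data to recover peak
    video-memory usage and count moments where an offload strategy
    would have to act -- all without GPU hardware."""
    used = peak = 0
    offload_candidates = 0
    for kind, size_mb in events:
        if kind == "alloc":
            if used + size_mb > capacity_mb:
                offload_candidates += 1  # an offload strategy would evict here
            used += size_mb
            peak = max(peak, used)
        elif kind == "free":
            used -= size_mb
    return {"peak_mb": peak, "offload_candidates": offload_candidates}
```

The peak and the offload-candidate count are information not directly present in the raw performance data, which is the role the supplementary performance report plays.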
The network service layer 4 is used for responding to the user operation, determining an operation result based on a multi-level database and a service database, and visually displaying the operation result; the operation results comprise query results and/or custom analysis results.
In one embodiment, the network service layer 4 includes: the data visualization platform and the custom analysis platform; wherein:
the data visualization platform is used for inquiring from the multi-level database and the service database based on the inquiry condition input by the user to obtain an inquiry result; and displaying the query result on a visual network interface based on the visual method selected by the user.
Specifically, a data visualization platform is built, and the data visualization platform provides a visual network interface. The user is free to enter query conditions and select visualization methods at the visualization network interface. The query condition can be used for querying from a multi-level database and a business database to obtain a query result. The query result can be displayed on the visual network interface through the visual method.
And the custom analysis platform is used for carrying out performance analysis on the multi-level database and the business database based on the custom analysis method configured by the user to obtain a custom analysis result.
Specifically, a custom analysis platform is built, for example a business intelligence (Business Intelligence, BI) analysis platform. The BI analysis platform can perform data analysis using modern data warehouse technology, online analytical processing technology, data mining and data presentation technology to realize commercial value. The user can freely configure the custom analysis method on the custom analysis platform, for example with table processing tools, structured query language (Structured Query Language, SQL), etc. The custom analysis method is then applied to the multi-level database and the service database for performance analysis, yielding a custom analysis result.
Optionally, the network service layer 4 further includes: and the data monitoring platform is used for monitoring the multi-level database in real time to obtain a real-time monitoring result.
It should be noted that, the business intelligent dashboard (business intelligence dashboard, BI dashboard) refers to a platform including data visualization and data monitoring functions. Thus, the data visualization platform and the data monitoring platform may be replaced with a business intelligence dashboard.
Optionally, the network service layer 4 further includes: and the performance analysis robot is used for guiding a user to complete multiple rounds of conversations by using a natural language processing technology, so that performance analysis on uncovered analysis dead angles is completed.
Specifically, performance analysis is a constantly exploring process, with model and framework changes, there must be uncovered analysis dead angles. The performance analysis robot can automatically generate answers and guiding prompts by using natural language processing technology, and the performance analysis of uncovered analysis dead angles is completed step by step through multiple rounds of conversations.
In a specific implementation, as shown in fig. 6, the whole performance analysis flow is centered on the "large-scale machine learning performance optimization guiding device". Firstly, the PyTorch profiler outputs the original performance data, which, after being processed by the large-scale machine learning performance optimization guiding device, yields simulator tuning results, optimization suggestions, a performance analysis chain and multi-level performance data. Analysis can then proceed along the performance analysis chain and the simulator tuning results; the performance analysis chain automatically locates specific data positions, and the next tuning scheme is guided in combination with the optimization suggestions. The user adjusts the model and framework policies according to the tuning scheme and then verifies again through the PyTorch profiler.
The large-scale machine learning performance optimization guiding device provided by the embodiment of the invention comprises: a basic task layer, an advanced task layer, a business function layer and a network service layer. The basic task layer is used for providing a basic task component library; based on a plurality of basic task components preselected from the basic task component library, it performs preliminary concentration processing on the massive original performance data in the original database to obtain basic data, and stores the basic data in the basic database. The advanced task layer is used for performing further statistical concentration processing on the massive basic data in the basic database to obtain first multi-level data, and storing the first multi-level data in the multi-level database. The business function layer is used for performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data in the multi-level database, so that multi-dimensional, multi-level mining and integration can be performed on the first multi-level data and the granularity level of the data is greatly increased; a performance analysis chain, optimization suggestions and performance reports are extracted based on the second multi-level data and stored in the service database. The network service layer is used for responding to user operations, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result; the operation result comprises a query result and/or a custom analysis result. Starting from the performance reports and optimization suggestions, the user can be guided through the performance analysis chain to analyze performance layer by layer in a finer-grained direction. Therefore, the embodiment of the invention provides performance optimization guidance through multi-dimensional, multi-level integrated analysis, helping users analyze performance in a finer-grained direction.
The method for optimizing and guiding the large-scale machine learning performance provided by the invention is described below, and the method for optimizing and guiding the large-scale machine learning performance described below and the device for optimizing and guiding the large-scale machine learning performance described above can be correspondingly referred to each other.
Referring to fig. 7, fig. 7 is a flow chart of a large-scale machine learning performance optimization guiding method according to an embodiment of the invention. As shown in fig. 7, the method may include the steps of:
step 701, providing a basic task component library, performing data processing on original performance data in an original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
step 702, performing statistic concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
step 703, performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data in a multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
Step 704, responding to user operation, determining an operation result based on a multi-level database and a service database, and visually displaying the operation result; the operation results comprise query results and/or custom analysis results.
In an example embodiment, step 701 may include:
providing a basic task component library, wherein the basic task component library comprises a plurality of fine-grained basic task components;
pre-selecting a plurality of basic task components from a basic task component library based on a performance optimization target of an advanced task layer;
constructing at least one basic task graph based on the plurality of basic task components and the dependency relationship between the plurality of basic task components;
inputting the original performance data in the original database into at least one basic task graph for data processing to obtain basic data;
the base data is stored in a base database.
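The construction and execution of a basic task graph from preselected components and their dependency relationships can be sketched with a topological sort. The component names and the pipeline-style data passing are illustrative assumptions.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

def run_task_graph(components: Dict[str, Callable],
                   deps: Dict[str, Set[str]], raw_data):
    """Execute preselected basic task components in dependency order.

    components: name -> callable(data) -> data
    deps: name -> set of prerequisite component names
    """
    # TopologicalSorter yields every node after all of its predecessors
    order = TopologicalSorter(deps).static_order()
    data = raw_data
    for name in order:
        data = components[name](data)
    return data
```

Because components are fine-grained and the graph is built from declared dependencies, swapping in a different preselection for a different performance optimization target changes only `components` and `deps`, not the execution logic.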
In an example embodiment, step 702 may include:
concentrating the basic data in space dimension and time dimension to obtain first concentrated data; performing performance statistics on the first concentrated data by adopting at least one statistical index to obtain performance statistical data, and storing the performance statistical data into a multi-level database;
Concentrating the basic data in space dimension and time dimension to obtain second concentrated data; performing operator information statistics on the second concentrated data by adopting at least one statistical index to obtain operator information statistical data, and storing the operator information statistical data into a multi-level database;
concentrating the basic data in space dimension and time dimension to obtain third concentrated data; performing video memory information statistics on the third concentrated data by adopting at least one statistical index to obtain video memory information statistical data, and storing the video memory information statistical data into a multi-level database;
the first multi-level data comprises performance statistical data, operator information statistical data and video memory information statistical data.
In an example embodiment, step 703 may include:
cross integration is carried out on the first multi-level data to obtain the associated information among the first multi-level data, and the associated information is stored in a multi-level database;
performing performance index calculation on the first multi-level data and the associated information in the multi-level database to obtain target index data, and storing the target index data into the multi-level database; the target index data comprises a plurality of index data of each card, comprehensive index data of each card and performance comparison results among the cards;
Performing performance anomaly detection on first multi-level data and associated information in the multi-level database to obtain anomaly index data, and storing the anomaly index data into the multi-level database; the second multi-level data comprises associated information, target index data and abnormal index data;
performing optimization graph analysis on the second multi-level data to obtain a performance analysis chain and optimization suggestions, and storing the performance analysis chain and the optimization suggestions into a service database;
and generating a performance report based on the second multi-level data and storing the performance report in a service database.
In an example embodiment, step 703 may further include:
taking the original performance data as input, and simulating a part of the training process of large-scale machine learning to obtain a supplementary performance report; the supplementary performance report is complementary to the performance report and is stored in the service database.
In an example embodiment, step 704 may include:
inquiring from the multi-level database and the business database based on the inquiry condition input by the user to obtain an inquiry result; displaying a query result on a visual network interface based on a visual method selected by a user;
and performing performance analysis on the multi-level database and the service database based on the user-configured custom analysis method to obtain a custom analysis result.
In an example embodiment, step 704 may further include: and monitoring the multi-level database in real time to obtain a real-time monitoring result.
In an example embodiment, step 704 may further include: and guiding the user to complete multiple rounds of conversations by using a natural language processing technology, thereby completing performance analysis aiming at uncovered analysis dead angles.
Fig. 8 illustrates a physical structure diagram of an electronic device, as shown in fig. 8, which may include: processor 810, communication interface (Communications Interface) 820, memory 830, and communication bus 840, wherein processor 810, communication interface 820, memory 830 accomplish communication with each other through communication bus 840. The processor 810 may invoke logic instructions in the memory 830 to perform a large-scale machine learning performance optimization guidance method comprising:
providing a basic task component library, carrying out data processing on original performance data in an original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
carrying out statistic concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
Performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into a multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
responding to user operation, determining an operation result based on a multi-level database and a service database, and visually displaying the operation result; the operation results comprise query results and/or custom analysis results.
Further, the logic instructions in the memory 830 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a stand-alone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product; the computer software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product, the computer program product comprising a computer program, the computer program being storable on a non-transitory computer readable storage medium, the computer program, when executed by a processor, being capable of executing the large-scale machine learning performance optimization guidance method provided by the above methods, the method comprising:
providing a basic task component library, carrying out data processing on original performance data in an original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
carrying out statistic concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into a multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
Responding to user operation, determining an operation result based on a multi-level database and a service database, and visually displaying the operation result; the operation results comprise query results and/or custom analysis results.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform a method of large-scale machine learning performance optimization guidance provided by the above methods, the method comprising:
providing a basic task component library, carrying out data processing on original performance data in an original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
carrying out statistic concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into a multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
Responding to user operation, determining an operation result based on a multi-level database and a service database, and visually displaying the operation result; the operation results comprise query results and/or custom analysis results.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the invention without undue effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or, of course, by means of hardware. Based on this understanding, the foregoing technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a software product, which may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method described in each embodiment or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them; although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A large-scale machine learning performance optimization guidance device, comprising:
the basic task layer is used for providing a basic task component library, carrying out data processing on the original performance data in the original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
the high-level task layer is used for carrying out statistic concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
the business function layer is used for carrying out cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into the multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
The network service layer is used for responding to user operation, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result; the operation result comprises a query result and/or a custom analysis result.
2. The large-scale machine learning performance optimization guidance device as set forth in claim 1, wherein the basic task layer is specifically configured to:
providing a basic task component library, wherein the basic task component library comprises a plurality of fine-grained basic task components;
pre-selecting a plurality of basic task components from the basic task component library based on a performance optimization target of the high-level task layer;
constructing at least one basic task graph based on a plurality of basic task components and the dependency relationship among the plurality of basic task components;
inputting the original performance data in the original database into the at least one basic task graph for data processing to obtain the basic data;
and storing the basic data into the basic database.
3. The large-scale machine learning performance optimization guidance device as claimed in claim 1, wherein the high-level task layer comprises:
the performance statistics module, which is used for performing spatial-dimension and temporal-dimension concentration processing on the basic data to obtain first concentrated data; performing performance statistics on the first concentrated data by adopting at least one statistical index to obtain performance statistical data, and storing the performance statistical data into the multi-level database;
the operator information statistics module, which is used for performing spatial-dimension and temporal-dimension concentration processing on the basic data to obtain second concentrated data; performing operator information statistics on the second concentrated data by adopting at least one statistical index to obtain operator information statistical data, and storing the operator information statistical data into the multi-level database;
the video memory information statistics module, which is used for performing spatial-dimension and temporal-dimension concentration processing on the basic data to obtain third concentrated data; performing video memory information statistics on the third concentrated data by adopting at least one statistical index to obtain video memory information statistical data, and storing the video memory information statistical data into the multi-level database;
the first multi-level data comprises the performance statistical data, the operator information statistical data and the video memory information statistical data.
4. The large-scale machine learning performance optimization guidance device as claimed in claim 1, wherein the business function layer comprises:
the cross integration module is used for carrying out cross integration on the first multi-level data to obtain the associated information among the first multi-level data, and storing the associated information into the multi-level database;
the performance index calculation module is used for performing performance index calculation on the first multi-level data and the associated information in the multi-level database to obtain target index data, and storing the target index data into the multi-level database; the target index data comprises a plurality of index data of each card, comprehensive index data of each card and performance comparison results among the cards;
the performance abnormality detection module is used for performing performance abnormality detection on the first multi-level data and the associated information in the multi-level database to obtain abnormal index data, and storing the abnormal index data into the multi-level database; the second multi-level data comprises the associated information, the target index data and the abnormal index data;
the tuning graph analysis module is used for performing tuning graph analysis on the second multi-level data to obtain the performance analysis chain and the optimization suggestions, and storing the performance analysis chain and the optimization suggestions into the service database;
and the performance report generating module is used for generating the performance report based on the second multi-level data and storing the performance report into the service database.
5. The large-scale machine learning performance optimization guidance device of claim 4, wherein the business function layer further comprises:
the performance simulator is used for taking the original performance data as input and simulating part of the training process of large-scale machine learning to obtain a supplementary performance report; the supplementary performance report is complementary to the performance report and is stored in the service database.
6. The large-scale machine learning performance optimization guidance device as claimed in claim 1, wherein the network service layer comprises:
the data visualization platform is used for querying the multi-level database and the service database based on a query condition input by the user to obtain the query result, and displaying the query result on a visual network interface based on a visualization method selected by the user;
and the custom analysis platform is used for performing performance analysis on the multi-level database and the service database based on a custom analysis method configured by the user to obtain the custom analysis result.
7. The large-scale machine learning performance optimization guidance device of claim 6, wherein the network service layer further comprises:
and the data monitoring platform is used for monitoring the multi-level database in real time to obtain a real-time monitoring result.
8. The large-scale machine learning performance optimization guidance device of claim 6, wherein the network service layer further comprises:
and the performance analysis robot is used for guiding the user through multiple rounds of dialogue using natural language processing technology, thereby completing performance analysis of blind spots not covered by the foregoing analysis.
9. A method for optimizing and guiding large-scale machine learning performance, comprising the steps of:
providing a basic task component library, carrying out data processing on original performance data in an original database based on a plurality of basic task components preselected from the basic task component library to obtain basic data, and storing the basic data into the basic database;
performing statistical concentration processing on the basic data to obtain first multi-level data, and storing the first multi-level data into a multi-level database;
performing cross integration, performance index calculation and performance abnormality detection on the first multi-level data to obtain second multi-level data, and storing the second multi-level data into the multi-level database; determining a performance analysis chain, optimization suggestions and performance reports based on the second multi-level data, and storing the performance analysis chain, the optimization suggestions and the performance reports in a service database;
responding to user operation, determining an operation result based on the multi-level database and the service database, and visually displaying the operation result; the operation result comprises a query result and/or a custom analysis result.
10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the large-scale machine learning performance optimization guidance method of claim 9 when executing the program.
11. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the large-scale machine learning performance optimization guidance method of claim 9.
CN202410046348.8A 2024-01-11 2024-01-11 Large-scale machine learning performance optimization guiding device, method, equipment and medium Pending CN117764203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410046348.8A CN117764203A (en) 2024-01-11 2024-01-11 Large-scale machine learning performance optimization guiding device, method, equipment and medium


Publications (1)

Publication Number Publication Date
CN117764203A true CN117764203A (en) 2024-03-26

Family

ID=90325987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410046348.8A Pending CN117764203A (en) 2024-01-11 2024-01-11 Large-scale machine learning performance optimization guiding device, method, equipment and medium

Country Status (1)

Country Link
CN (1) CN117764203A (en)

Similar Documents

Publication Publication Date Title
CN110825644B (en) Cross-project software defect prediction method and system
US10120912B2 (en) System and method for combination-based data analysis
US20230385034A1 (en) Automated decision making using staged machine learning
US20100179951A1 (en) Systems and methods for mapping enterprise data
WO2007078814A2 (en) Apparatus and method for strategy map validation and visualization
CN112558931A (en) Intelligent model construction and operation method for user workflow mode
CN108491991B (en) Constraint condition analysis system and method based on industrial big data product construction period
CN114757468B (en) Root cause analysis method for process execution abnormality in process mining
US8326723B2 (en) Risk and reward assessment mechanism
CN114048055A (en) Time series data abnormal root cause analysis method and system
CN108446318A (en) A kind of mass data intelligent decision analysis system
CN113947468B (en) Data management method and platform
CN115657890A (en) PRA robot customizable method
CN114546365A (en) Flow visualization modeling method, server, computer system and medium
CN115982655A (en) Missing data flow abnormity prediction method based on decision tree
CN117764203A (en) Large-scale machine learning performance optimization guiding device, method, equipment and medium
Pham et al. Discovering redo-activities and performers' involvements from XES-formatted workflow process enactment event logs
Narayan et al. Mining time for timed regular specifications
CN115470854A (en) Information system fault classification method and classification system
CN113610225A (en) Quality evaluation model training method and device, electronic equipment and storage medium
CN111612302A (en) Group-level data management method and equipment
KR20210032685A (en) Process Mining System and Method based on the Structured Information Control Nets
CN117973904B (en) Intelligent manufacturing capacity analysis method and system
CN113419934B (en) KPI index multivariate anomaly monitoring method based on regression prediction
Kumar et al. Requirements Engineering Process Model Add-On For Software Development

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination