CN114880108A - Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium


Info

Publication number
CN114880108A
CN114880108A (application CN202111535943.0A)
Authority
CN
China
Prior art keywords
performance analysis, GPU, CPU, zero-knowledge proof
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111535943.0A
Other languages
Chinese (zh)
Other versions
CN114880108B (en)
Inventor
鲁真妍
杨永魁
喻之斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202111535943.0A priority Critical patent/CN114880108B/en
Priority to PCT/CN2021/141306 priority patent/WO2023108800A1/en
Publication of CN114880108A publication Critical patent/CN114880108A/en
Application granted granted Critical
Publication of CN114880108B publication Critical patent/CN114880108B/en
Legal status: Active

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/5044Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering hardware capabilities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00General purpose image data processing
    • G06T1/20Processor architectures; Processor configuration, e.g. pipelining
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a performance analysis method, a device, and a storage medium based on a CPU-GPU heterogeneous architecture. The performance analysis method based on the CPU-GPU heterogeneous architecture comprises the following steps: acquiring a preset zero-knowledge proof performance analysis model and calibrating the coefficient values in the model; inputting a first data volume together with multiple parameter combinations of window size and actual thread number into the zero-knowledge proof performance analysis model; selecting, from the multiple total execution times output by the model, the shortest total execution time and the corresponding optimal parameter combination; and setting the parameters of the CPU-GPU heterogeneous architecture based on the optimal parameter combination, wherein the number of parameter combinations equals the number of total execution times. In this way, the performance analysis method provided by the application predicts the performance of the zero-knowledge proof algorithm through the zero-knowledge proof performance analysis model, thereby screening out the optimal parameter combination affecting the performance of the CPU-GPU heterogeneous architecture.

Description

Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium
Technical Field
The present application relates to the field of zero-knowledge proof technologies, and in particular, to a performance analysis method, device, and storage medium based on a CPU-GPU heterogeneous architecture.
Background
A zero-knowledge proof allows a prover to convince a verifier that a statement is true without revealing any useful information to the verifier, which helps address problems such as data security and privacy leakage.
In practice, however, the huge data volumes involved, combined with the complexity of the zero-knowledge proof algorithm, put great performance pressure on a heterogeneous system in terms of data access, computation, and so on. The resulting performance problems hinder the deployment of zero-knowledge proof technology in real application scenarios.
Disclosure of Invention
The application provides a performance analysis method and equipment based on a CPU-GPU heterogeneous architecture and a storage medium.
The application provides a performance analysis method based on a CPU-GPU heterogeneous architecture, which comprises the following steps:
acquiring a preset zero knowledge proof performance analysis model, and calibrating coefficient values in the zero knowledge proof performance analysis model;
inputting a first data volume and multiple parameter combinations of window size and actual thread number into the zero-knowledge proof performance analysis model;
selecting the total execution time with the shortest time and the corresponding optimal parameter combination from a plurality of total execution times output by the zero-knowledge proof performance analysis model;
setting parameters of the CPU-GPU heterogeneous architecture based on the optimal parameter combination;
wherein the number of parameter combinations is the same as the number of total execution times.
Wherein the calibrating the coefficient values in the zero-knowledge proof performance analysis model comprises:
setting a fixed window size and a fixed actual thread number;
inputting the fixed window size, the fixed actual thread number and a plurality of second data volumes within a preset range into the zero-knowledge proof performance analysis model;
calibrating the coefficient value according to the relation between the total execution time output by the zero-knowledge proof performance analysis model and the plurality of second data volumes;
wherein the second amount of data is less than the first amount of data.
The zero-knowledge proof performance analysis model comprises a CPU preprocessing time model, a GPU transmission time model and a GPU execution time model.
The CPU preprocessing time model outputs a preprocessing time that is linear in the input data volume, and the GPU transmission time model outputs a transmission time that is linear in the input data volume.
A fitting formula of the GPU execution time model is:

t_exe = a × window_size + b / window_size + c × 2^window_size

where t_exe is the single-iteration execution time, window_size is the window size, and a, b, c are the calibrated coefficients.
The total GPU execution time consists of the GPU execution startup time and several iteration execution times:

t_gpu = t_gpu0 + i × t_exe

where t_gpu0 is the GPU execution startup time, t_exe is the single-iteration execution time, and i is the number of iterative computations.
The number of iterative computations is determined by the ratio of the actual thread number to the maximum number of parallel threads of the GPU.
The CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing intensive and parallelizable calculation.
The application also provides a terminal device comprising a memory and a processor, wherein the memory is coupled to the processor;
wherein the memory is used for storing program data, and the processor is used for executing the program data to realize the performance analysis method.
The present application also provides a computer storage medium for storing program data which, when executed by a processor, is used to implement the performance analysis method described above.
The beneficial effects of this application are as follows: the terminal device acquires a preset zero-knowledge proof performance analysis model and calibrates the coefficient values in the model; inputs a first data volume and multiple parameter combinations of window size and actual thread number into the model; selects, from the multiple total execution times output by the model, the shortest total execution time and the corresponding optimal parameter combination; and sets the parameters of the CPU-GPU heterogeneous architecture based on the optimal parameter combination, wherein the number of parameter combinations equals the number of total execution times. In this way, the performance analysis method provided by the application predicts the performance of the zero-knowledge proof algorithm through the zero-knowledge proof performance analysis model, thereby screening out the optimal parameter combination affecting the performance of the CPU-GPU heterogeneous architecture.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed to be used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts. Wherein:
FIG. 1 is a schematic flowchart of an embodiment of a performance analysis method based on a CPU-GPU heterogeneous architecture provided in the present application;
FIG. 2 is a schematic diagram of a CPU-GPU heterogeneous architecture-based zero-knowledge proof computation data flow provided herein;
FIG. 3 is a data flow diagram of the FFT stage provided herein;
FIG. 4 is a data flow diagram of the MULTIEXP A phase provided herein;
FIG. 5 is a schematic diagram illustrating the execution time distribution of the SYNTHESIZE stage, FFT stage, and MULTIEXP stage provided in the present application;
FIG. 6 is a schematic diagram illustrating the time ratio of each step of the FFT stage provided in the present application;
FIG. 7 is a time ratio diagram of the various steps of the MULTIEXP phase as provided herein;
fig. 8 is a data flow diagram of the MULTIEXP algorithm provided herein;
FIG. 9 is a detailed flowchart of step S11 of the performance analysis method shown in FIG. 1;
FIG. 10 is a graph showing the fitting results of the CPU pre-processing time model provided herein;
FIG. 11 is a diagram illustrating a fitting result of a GPU transmission time model provided herein;
FIG. 12 is a diagram illustrating the fitting result of the GPU execution time model provided in the present application;
fig. 13 is a schematic structural diagram of an embodiment of a terminal device provided in the present application;
FIG. 14 is a schematic structural diagram of an embodiment of a computer storage medium provided in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Due to the algorithmic complexity, huge data and computation volumes of zero-knowledge proof, and the system complexity introduced by the CPU-GPU heterogeneous architecture, the utilization rate of the CPU-GPU heterogeneous architecture is low. To solve this problem, in the implementation based on the CPU-GPU heterogeneous architecture, the CPU is made responsible for logic control and data preprocessing while the GPU is responsible for intensive and parallelizable computation. On this basis a zero-knowledge proof performance optimization method is provided, which removes the application obstacles caused by performance problems and accelerates the deployment of zero-knowledge proof technology in application scenarios.
Referring to fig. 1 and fig. 2 in detail, fig. 1 is a schematic flowchart of an embodiment of a performance analysis method based on a CPU-GPU heterogeneous architecture provided in the present application, and fig. 2 is a schematic diagram of a zero-knowledge proof computation data flow based on the CPU-GPU heterogeneous architecture provided in the present application.
In the embodiment of the present application, as shown in fig. 2, a parallel execution scheme for parallelization of computation is provided based on a CPU-GPU heterogeneous architecture.
Specifically, the present application divides the computation of a zero-knowledge proof into three stages: a SYNTHESIZE (i.e., circuit generation) stage, an FFT (i.e., fast Fourier transform) stage, and a MULTIEXP (i.e., large-number multiply-add) stage. According to the difference of input data, the computation of the MULTIEXP stage can be further divided into three parts A, B, and C, i.e., a MULTIEXP A stage, a MULTIEXP B stage, and a MULTIEXP C stage. As shown in fig. 2, the data output from the SYNTHESIZE stage is divided into three parts: one part is used as the input of the FFT stage, and the other two parts are used as the inputs of the MULTIEXP B stage and the MULTIEXP C stage, respectively. The output of the FFT stage is the input of the MULTIEXP A stage. The final outputs of the MULTIEXP A, MULTIEXP B, and MULTIEXP C stages generate the PROOF.
As can be seen from the computation data flow shown in fig. 2, the operation of the optimized CPU-GPU heterogeneous architecture is mainly divided into two parts: pipelining of the FFT and MULTIEXP A stages, and parallelization of the MULTIEXP B and MULTIEXP C stages.
In the FFT stage, as shown in fig. 3, the data is divided into N parts, where 1 ≤ N ≤ 10, and each part of data needs 3 preprocessing passes and 7 FFT operations.
In the MULTIEXP stage, as shown in fig. 4, taking the MULTIEXP A stage as an example, the data also needs to be divided into N parts; each of the N parts is preprocessed and then further divided into I parts according to the size of the GPU memory and the number of computing units, so as to perform the MULTIEXP calculation.
It should be noted that the data preprocessing of the two stages is performed by the CPU, while the FFT operation and the MULTIEXP operation require the data to be transmitted to the GPU; the GPU performs the parallel computation, and the result is then transmitted back to the CPU.
In summary, the present application divides the proof-generation process into 3 stages, each of which is divided into several steps; we need to pay attention to the execution time ratio of each stage and step and to their hardware resource usage. Log data is used to analyze the time ratio of each stage and step, and tools such as top and nvidia-smi are used to monitor the hardware resource usage in real time.
Combining the log data and the hardware resource monitoring data, we can obtain the computation time of each part as shown in fig. 5. The SYNTHESIZE stage takes the least time, 13%, and CPU multi-core parallel computing can be used directly in this stage. The FFT stage takes 22% of the time. The MULTIEXP stage takes the most time, 63%, of which the MULTIEXP A, MULTIEXP B, and MULTIEXP C stages account for 18%, 16%, and 29%, respectively.
The time proportion of each step of the FFT stage is shown in fig. 6, and that of the MULTIEXP stage in fig. 7. It should be noted that the experimental environment is configured with 128 GB of memory plus a 128 GB disk swap area; the data transfer between memory and the swap area, and the speed limitations of the swap area itself, make the preprocessing step take longer than it would under a memory-only configuration.
In terms of hardware resource utilization, the maximum video memory utilization of the FFT stage is 77% and the maximum utilization of the GPU computing units is 100%; for the MULTIEXP stage, the maximum video memory utilization is 36% and the maximum utilization of the GPU computing units is 100%.
To find the bottleneck of the zero-knowledge proof algorithm from the above data and process, the flow of the algorithm needs to be analyzed in detail, the data flow clarified, and the serial and parallel division of the computation tasks determined. The computation time and resource usage of each computation task in the algorithm are then counted by combining the log data and the hardware resource utilization data, so as to find the bottleneck and the optimization space.
Further, in the implementation of the zero-knowledge proof algorithm, a large amount of data is divided into multiple parts, each of which undergoes the same computation task after certain processing. Each such task is performed many times and consists of three parts: data reading, data computation, and data write-back. We refer to such computation tasks, performed repeatedly on different data, as key tasks. Combining the analysis results obtained above, the performance analysis model is established for these key tasks. Based on the CPU-GPU heterogeneous implementation in which the CPU is responsible for logic control and data preprocessing and the GPU is responsible for intensive and parallelizable computation, the performance analysis model provides guidance for the performance optimization of zero-knowledge proof, removes the application obstacles caused by performance problems, and accelerates the deployment of zero-knowledge proof technology in application scenarios.
Specifically, the bottleneck analysis above shows that the MULTIEXP stage accounts for 63% of the total time; therefore, a zero-knowledge proof key performance analysis model and a corresponding performance analysis method need to be provided for this stage.
The theoretical basis of the zero-knowledge proof key performance analysis model is first introduced as follows:
referring to fig. 4, taking the multi-stage as an example, the multi-stage first divides the data into N parts, wherein each part of the data is pre-processed by the CPU. And continuously dividing one part of data into I parts, transmitting each part of data serving as input data of one GPU task from the CPU memory to the GPU memory, and transmitting the data from the GPU memory to the CPU memory after GPU calculation is finished.
Let t_total be the total execution time of the MULTIEXP A stage and N the number of parts in the first data division. Assuming that each part has the same data volume D and the same processing time t, the total execution time of the MULTIEXP A stage can be expressed by formula (1):

t_total = N × t (1)
let t cpu The time for the CPU to preprocess the data of data amount D. And after the preprocessing is finished, dividing the data with the data volume D into I parts again, and calling a GPU (graphics processing unit) for calculation once for each part of data. Assuming that the data volume of each part of data is still the same after the data division, D is the same, wherein the relation between D and D is as formula (2):
Figure BDA0003412561370000071
assuming that the data processing time after the second division is the same, in the primary GPU task, t trans For the data transfer time between CPU and GPU, including the time from CPU memory to GPU memory and the time from GPU memory to CPU memory, t gpu Time is calculated for the GPU. Thus, the processing time of each piece of data after the first division can be expressed as equation (3):
t=t cpu +I×(t trans +t gpu ) (3)
The CPU data preprocessing time consists of the preprocessing startup time t_cpu0 and the execution time. Setting the time to process one unit of data to a constant v_cpu, the execution time is linear in the data volume. With a data volume D, the CPU data preprocessing time can be expressed by formula (4):

t_cpu = t_cpu0 + D × v_cpu (4)
the transmission time between the CPU and the GPU is changed from transmission starting time t trans0 With respect to the execution time, the speed at which a data is transmitted is set to be constant v trans I.e. the transmission time is linear with the amount of data. If the data amount at this time is d, the transmission time between the CPU and the GPU can be expressed by equation (5):
t trans =t trans0 +d×v trans (5)
when the GPU is executed, the ratio i of the actual thread number to the maximum parallel thread number of the GPU isAnd (5) performing i times of iterative calculation. The total execution time of the GPU is executed by the GPU for the starting time t gpu0 And i iteration execution times t exe And (4) forming. Assuming that the time for each iteration is equal, the GPU execution time can be expressed as equation (6):
t gpu =t gpu0 +i×t exe (6)
further, in the multi iexp stage, a large number multiplication and addition is required, specifically, referring to equation (7), the large number refers to data with bits exceeding the processing bit width of the processor, and the multiplication and addition refers to the addition of results after d groups of large numbers are multiplied by the large numbers. Used in the experiment was a 384-bit 2-dimensional vector p i And 256-bit scalar k i Where i denotes the ith number.
Specifically, the algorithm of the large-number multiply-add is as shown in fig. 8. First, the scalar is divided into a plurality of windows, and each window is multiplied and added with the corresponding vector. The final results W_i of the window multiply-adds are combined by the operation of formula (8): each W_i is multiplied by 2^(i×window_size) and the products are summed to obtain the final result, where i denotes the i-th window and m denotes the number of binary bits of the scalar k_i. The computation between windows is completed by the CPU, while the GPU is responsible for completing the computation within the windows in parallel.

result = ∑_{i=1}^{d} k_i × p_i (7)

result = ∑_{i=0}^{m/window_size−1} 2^(i×window_size) × W_i (8)
The computation process within each window is described as follows: with continued reference to fig. 8, the size of each window in fig. 8 is 4, so the number in each window is between 0 and 15, and each vector is multiplied by a decimal number between 0 and 15. We set up 16 buckets, each corresponding to a decimal number between 0 and 15, and put each vector into the bucket corresponding to its multiplier. The multiply-add result in the window is then given by formula (9):

W_0 = ∑ i × B_i = B_1 + 2B_2 + … + 15B_15 (9)
wherein, B i Representing a bucket with a corresponding decimal number i.
Assuming that multiple vectors fall into the same bucket, there is formula (10):

B_5 = P_3 + P_4 (10)

Assuming that there is only one vector in each bucket, as in fig. 8, there are formulas (11) and (12):

B_1 = P_3 (11)

B_5 = P_4 (12)
that is, each time we put a vector into a bucket, if there is already a value in it, it is added to the value already in it, and a vector addition operation is performed.
Furthermore, in the GPU, each window is further divided vertically into a plurality of groups, which improves the parallelism of the computation; each thread only needs to be responsible for the multiply-add within one group. The multiply-add result of one window is then given by formula (13):

W_i = g_0 + g_1 + … + g_n (13)

where g_i represents the multiply-add result of the i-th group in the window.
Therefore, among the parameters that determine the execution time of the MULTIEXP stage, there are two adjustable key parameters: the window size and the number of groups. Their effect on the other parameters that determine the execution time of the MULTIEXP stage is as follows:

num_window = m / window_size (14)

bucket_len = 2^window_size (15)

num_thread = num_window × num_group (16)

where m is the number of bits of the scalar k_i in the computation, num_window is the number of windows, and window_size is the size of the window; bucket_len is the number of buckets in one thread and is positively correlated with the video memory used by each thread; num_thread is the number of threads.

It can be seen from the above formulas that, when the total number of threads is unchanged, i.e., the amount of data computed in each thread is unchanged, a larger window means fewer windows, more groups, and smaller groups. In this case, the number of vector addition operations in a single thread is smaller, but more window bits are processed per computation and more video memory is used. Using more video memory and computing more bits each time increases the execution time, while reducing the number of point addition operations decreases it. The optimal window size therefore lies at an intermediate value.
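The parameter relations (14)-(16), as reconstructed here, can be written down directly; num_group is treated as the free parameter alongside window_size, and the concrete values below are illustrative only.

```python
def derived_parameters(m, window_size, num_group):
    """Formulas (14)-(16): windows, buckets per thread, total threads."""
    num_window = m // window_size        # formula (14)
    bucket_len = 1 << window_size        # formula (15): buckets per thread
    num_thread = num_window * num_group  # formula (16)
    return num_window, bucket_len, num_thread
```

For a 256-bit scalar, doubling the window size from 4 to 8 halves the window count but squares the per-thread bucket count, which is exactly the memory/computation trade-off discussed above.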
Therefore, we express t_exe in formula (6) as the sum of the computation time t_compute and the memory access time t_mem, as in formula (17):

t_exe = t_compute + t_mem (17)

As analyzed in the previous paragraph, window_size has both a positive and a negative correlation effect on the computation time t_compute. Setting the positive correlation coefficient to a and the negative correlation coefficient to b, we obtain formula (18):

t_compute = a × window_size + b / window_size (18)

The memory access time t_mem is positively correlated with bucket_len, i.e., with 2^window_size. Setting the positive correlation coefficient to c yields formula (19):

t_mem = c × 2^window_size (19)
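Putting (17)-(19) together gives t_exe = a·window_size + b/window_size + c·2^window_size, whose competing terms produce an interior optimum. The sketch below evaluates the model with purely illustrative coefficients to show that the best window size is neither the smallest nor the largest:

```python
def t_exe(window_size, a, b, c):
    t_compute = a * window_size + b / window_size  # formula (18)
    t_mem = c * 2 ** window_size                   # formula (19)
    return t_compute + t_mem                       # formula (17)

# illustrative coefficients only; real values come from calibration
a, b, c = 1.0, 64.0, 0.01
times = {w: t_exe(w, a, b, c) for w in range(1, 17)}
best_window = min(times, key=times.get)
```

With these coefficients the minimum falls at an intermediate window size, matching the observation above that an in-between value is optimal.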
according to the theoretical analysis, the zero knowledge established by the application proves that the key performance analysis model has three input parameters: (1) the amount of data; (2) window size, actual thread number, etc. values that can be manually specified in the code; (3) parameters related to the current hardware resources, the operating environment, such as the speed at which the CPU performs some fixed computation on data of 1KB in size.
Based on the three input parameters, the optimal parameter combination for the CPU-GPU heterogeneous architecture is analyzed through a zero-knowledge proof key performance analysis model, and therefore a zero-knowledge proof algorithm is optimized.
As shown in fig. 1, the performance analysis method based on the CPU-GPU heterogeneous architecture according to the embodiment of the present application specifically includes the following steps:
step S11: and acquiring a preset zero knowledge proof performance analysis model, and calibrating coefficient values in the zero knowledge proof performance analysis model.
In the embodiment of the present application, after obtaining the zero knowledge proof performance analysis model, the coefficient values in the zero knowledge proof performance analysis model, that is, the coefficient a, the coefficient b, and the coefficient c of the above equations (18) and (19), may be calibrated by using small-scale data.
Specifically, assuming that the values of these coefficients are constant under the same hardware resources and operating environment when the data volume is within a certain range, these constant parameter values can be measured through multiple short experiments with small-scale data. For the process of calibrating the coefficients in the zero-knowledge proof performance analysis model, please refer to fig. 9, which is a detailed flowchart of step S11 in the performance analysis method shown in fig. 1.
As shown in fig. 9, step S11 further includes:
step S111: setting the size of a fixed window and fixing the actual thread number.
In the embodiment of the application, the terminal device may specify a fixed window size and a fixed actual thread number for the zero-knowledge proof performance analysis model by modifying the code.
Step S112: and inputting the fixed window size, the fixed actual thread number and a plurality of second data volumes in a preset range into a zero-knowledge proof performance analysis model.
In the embodiment of the present application, when the data amount is within a certain range, the values of the coefficients are constant in the same hardware resource and operation environment. The data used in step S112 are all small-scale data, which can effectively accelerate the analysis speed of the zero-knowledge proof performance analysis model.
Step S113: calibrating the coefficient values according to the relationship between the total execution times output by the zero-knowledge proof performance analysis model and the plurality of second data amounts.
In the embodiment of the application, these constant parameter values can be measured through several short experiments on small-scale data, and the calibrated coefficient values are then fixed and used as the coefficient values for analyzing large-scale data.
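The calibration of steps S111 to S113 can be sketched as an ordinary least-squares fit. The timing values, the assumed model form t = a + b·n + c·n², and all constants below are illustrative stand-ins, not the patent's equations (18) and (19):

```python
import numpy as np

# Synthetic small-scale measurements: data amounts n (MB) and total
# execution times (s) at a fixed window size and fixed actual thread
# number. The generating formula is an illustrative assumption.
n = np.array([64.0, 128.0, 256.0, 512.0, 1024.0])
t_total = 0.5 + 0.019 * n + 1e-6 * n**2   # stands in for measured timings

# Assume (for illustration) t = a + b*n + c*n^2 and recover (a, b, c)
# by ordinary least squares, mirroring steps S111-S113.
X = np.column_stack([np.ones_like(n), n, n**2])
(a, b, c), *_ = np.linalg.lstsq(X, t_total, rcond=None)

# The calibrated coefficients are now frozen and reused when analysing
# large-scale data (step S12).
```

Because each small-scale run is short, the whole calibration phase costs far less than a single large-scale experiment.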
Step S12: inputting the first data amount and multiple sets of parameter combinations of window size and actual thread number into the zero-knowledge proof performance analysis model.
In the embodiment of the application, after the coefficients of the zero-knowledge proof performance analysis model have been calibrated with small-scale data, the relevant coefficients of the model, namely the coefficients a, b and c, are fixed.
The zero-knowledge proof performance analysis model takes the data amount, the window size and the actual thread number as independent variables, the coefficients a, b and c as fixed coefficients, and the total execution time as the dependent variable. That is, the input of the model is the first data amount together with multiple sets of parameter combinations of window size and actual thread number; given the data amount and the values of the other independent variables, the model predicts the corresponding total execution time as its output.
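The role of the model described above, with fixed coefficients and the data amount, window size and actual thread number as independent variables, can be sketched as a plain function. The sub-model formulas, the constants, and the names predict_total_time and max_parallel are illustrative assumptions, not the patent's equations (17) to (19):

```python
import math

# Frozen coefficients from the calibration phase; these particular values
# are taken from the fitting result reported for equation (20), but the
# formula they are plugged into below is an assumption for illustration.
A, B, C = 18.955, 1787.008, 0.0033

def predict_total_time(data_volume_mb, window_size, threads, max_parallel=8192):
    """Predict the total execution time (illustrative units) for one
    parameter combination, mirroring the structure described in the text:
    CPU preprocessing time linear in data amount, GPU transfer time linear
    in data amount, and GPU time of t_gpu0 + i * t_exe."""
    t_cpu = 0.002 * data_volume_mb                 # linear in data amount
    t_xfer = 0.001 * data_volume_mb                # linear in data amount
    t_exe = A + B / window_size + C * window_size  # assumed per-iteration form
    i = math.ceil(threads / max_parallel)          # iteration count from thread ratio
    t_gpu0 = 5.0                                   # assumed GPU start-up time
    return t_cpu + t_xfer + t_gpu0 + i * t_exe
```

Evaluating such a function costs microseconds per configuration, which is what makes the search in step S13 feasible.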
Step S13: selecting the shortest total execution time, and the corresponding optimal parameter combination, from the plurality of total execution times output by the zero-knowledge proof performance analysis model.
In a practical application scenario, the task usually has a huge input data amount and a long total execution time, and the independent variable parameters can take many combinations of values. For example, when the input data is 32GB, a single proof takes about 51 minutes, and there are hundreds of combinations of independent variable parameter values. Exhaustively testing the effect of every combination through real experiments to find the optimum would be very costly in time.
The model takes the data amount and the values of the other independent variables as input and calculates the corresponding execution time. By inputting all combinations of independent variable parameter values, calculating the corresponding execution times, and selecting the parameter combination with the shortest execution time, a set of optimal parameters is obtained. A result close to the real experimental effect is thus obtained through a simple calculation process.
Specifically, the terminal device selects the shortest total execution time from the multiple total execution times, and the input parameter combination corresponding to that shortest total execution time is the optimal parameter combination.
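The selection in step S13 then amounts to evaluating the calibrated model over every candidate parameter combination and taking the minimum. In the sketch below, predict is an illustrative stand-in for the model at a fixed large data amount; its formula and constants are assumptions, not the patent's equation (20):

```python
import itertools

def predict(window_size, threads):
    # Stand-in for the calibrated zero-knowledge proof performance
    # analysis model at a fixed (large) data amount; illustrative only.
    return 18.955 + 1787.008 / window_size + 0.0033 * window_size + 0.01 * threads

window_sizes = [32, 64, 128, 256, 512, 1024]
thread_counts = [1024, 2048, 4096, 8192]

# Evaluate every combination and keep the one with the shortest predicted
# total execution time (step S13). This is a cheap computation compared
# with running one real, roughly 51-minute proof per combination.
best_combo = min(itertools.product(window_sizes, thread_counts),
                 key=lambda p: predict(*p))
best_time = predict(*best_combo)
```

The resulting best_combo is the optimal parameter combination applied to the CPU-GPU heterogeneous architecture in step S14.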
Step S14: setting the parameters of the CPU-GPU heterogeneous architecture based on the optimal parameter combination.
Through experimental verification, the terminal device can validate the fitting of the CPU preprocessing time model, the GPU transmission time model and the GPU execution time model in the zero-knowledge proof performance analysis model.
The CPU preprocessing time model corresponds to equation (3) above, and its fitting result, shown in fig. 10, substantially conforms to a linear relationship with the data amount. The GPU transmission time model corresponds to equation (4) above, and its fitting result, shown in fig. 11, likewise substantially conforms to a linear relationship with the data amount.
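The claimed linearity of the preprocessing and transmission time models can be checked with a degree-1 polynomial fit. The timing values below are synthetic stand-ins for the measurements of fig. 10, generated to be exactly linear for illustration:

```python
import numpy as np

# Synthetic stand-ins for the measured CPU preprocessing times of fig. 10;
# the figure's actual data points are not reproduced in the text, so these
# numbers are illustrative and linear by construction.
volumes = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])   # data amount, GB
prep_times = 0.3 + 1.5 * volumes                        # seconds

# Degree-1 polynomial fit; near-zero residuals support the claimed
# linear relationship between preprocessing time and data amount.
slope, intercept = np.polyfit(volumes, prep_times, 1)
residuals = prep_times - (slope * volumes + intercept)
max_resid = float(np.max(np.abs(residuals)))
```

The same check applies to the GPU transmission times of fig. 11.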
The fitting result of the GPU execution time model, which corresponds to equations (17), (18) and (19) above, is shown in fig. 12, and its fitting expression is given as equation (20):
Figure BDA0003412561370000121
wherein the fitted parameter values are a = 18.955, b = 1787.008 and c = 0.0033.
In the embodiment of the application, the terminal device acquires a preset zero-knowledge proof performance analysis model and calibrates the coefficient values in the model; inputs the first data amount and multiple sets of parameter combinations of window size and actual thread number into the model; selects the shortest total execution time and the corresponding optimal parameter combination from the plurality of total execution times output by the model; and sets the parameters of the CPU-GPU heterogeneous architecture based on the optimal parameter combination, wherein the number of parameter combinations is the same as the number of total execution times. In this way, the performance analysis method provided by the present application uses the zero-knowledge proof performance analysis model to predict the performance of the zero-knowledge proof algorithm, and thereby screens out the optimal parameter combination affecting the performance of the CPU-GPU heterogeneous architecture.
The present application discloses a CPU-GPU-based zero-knowledge proof performance analysis model that can be used to predict the performance of the zero-knowledge proof algorithm. By observing the difference between the predicted and measured values of the model, the cause of the difference can be found and a targeted optimization scheme formulated. Furthermore, the analysis model includes the adjustable parameters that affect performance, so that the optimal combination of these parameters can be selected through the model.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
To implement the performance analysis method based on the CPU-GPU heterogeneous architecture of the foregoing embodiment, the present application further provides a terminal device, and please refer to fig. 13 specifically, where fig. 13 is a schematic structural diagram of an embodiment of the terminal device provided in the present application.
The terminal device 500 of the embodiment of the present application includes a memory 51 and a processor 52, wherein the memory 51 and the processor 52 are coupled.
The memory 51 is used for storing program data, and the processor 52 is used for executing the program data to implement the performance analysis method based on the CPU-GPU heterogeneous architecture according to the above embodiment.
In the present embodiment, the processor 52 may also be referred to as a CPU (Central Processing Unit). The processor 52 may be an integrated circuit chip having signal processing capabilities. The processor 52 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. A general-purpose processor may be a microprocessor, or the processor 52 may be any conventional processor or the like.
The present application further provides a computer storage medium, as shown in fig. 14, the computer storage medium 600 is used to store program data 61, and when the program data 61 is executed by a processor, the method for performing performance analysis based on the CPU-GPU heterogeneous architecture is implemented as described in the foregoing embodiments.
The present application further provides a computer program product, where the computer program product includes a computer program operable to enable a computer to execute the performance analysis method based on the CPU-GPU heterogeneous architecture according to the embodiment of the present application. The computer program product may be a software installation package.
The performance analysis method based on the CPU-GPU heterogeneous architecture according to the above embodiments of the present application may be stored in a device, for example, a computer readable storage medium, when the method is implemented in the form of a software functional unit and sold or used as an independent product. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above description is only for the purpose of illustrating embodiments of the present application and is not intended to limit the scope of the present application, and all modifications of equivalent structures and equivalent processes, which are made by the contents of the specification and the drawings of the present application or are directly or indirectly applied to other related technical fields, are also included in the scope of the present application.

Claims (10)

1. A performance analysis method based on a CPU-GPU heterogeneous architecture is characterized by comprising the following steps:
acquiring a preset zero knowledge proof performance analysis model, and calibrating coefficient values in the zero knowledge proof performance analysis model;
inputting a first data amount and multiple sets of parameter combinations of window size and actual thread number into the zero-knowledge proof performance analysis model;
selecting the shortest total execution time, and the corresponding optimal parameter combination, from a plurality of total execution times output by the zero-knowledge proof performance analysis model;
setting parameters of the CPU-GPU heterogeneous architecture based on the optimal parameter combination;
wherein the number of parameter combinations is the same as the number of total execution times.
2. The performance analysis method according to claim 1,
the calibrating the coefficient value in the zero-knowledge proof performance analysis model comprises the following steps:
setting a fixed window size and a fixed actual thread number;
inputting the fixed window size, the fixed actual thread number, and a plurality of second data amounts within a preset range into the zero-knowledge proof performance analysis model;
calibrating the coefficient values according to the relationship between the total execution times output by the zero-knowledge proof performance analysis model and the plurality of second data amounts;
wherein each second data amount is less than the first data amount.
3. The performance analysis method according to claim 1,
the zero-knowledge proof performance analysis model comprises a CPU preprocessing time model, a GPU transmission time model and a GPU execution time model.
4. The performance analysis method according to claim 3,
the preprocessing time output by the CPU preprocessing time model is in a linear relation with the input data volume, and the transmission time output by the GPU transmission time model is in a linear relation with the input data volume.
5. The performance analysis method according to claim 3,
the fitting formula of the GPU execution time model is as follows:
Figure FDA0003412561360000021
wherein t_exe is the execution time of a single iteration, window_size is the window size, and a, b and c are the calibrated coefficients.
6. The performance analysis method according to claim 5,
the total GPU execution time comprises a GPU execution startup time and a plurality of iteration execution times, wherein the total GPU execution time is:
t_gpu = t_gpu0 + i × t_exe
wherein t_gpu0 is the GPU execution startup time, t_exe is the execution time of a single iteration, and i is the number of iterative computations.
7. The performance analysis method of claim 6, wherein the number of iterative computations is determined by a ratio of an actual thread number to a maximum parallel thread number of the GPU.
8. The performance analysis method according to claim 1,
the CPU in the CPU-GPU heterogeneous architecture is responsible for logic control and data preprocessing, and the GPU is responsible for processing intensive and parallelizable computation.
9. A terminal device, comprising a memory and a processor, wherein the memory is coupled to the processor;
wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the performance analysis method of any one of claims 1-8.
10. A computer storage medium for storing program data which, when executed by a processor, is adapted to implement the performance analysis method of any one of claims 1 to 8.
CN202111535943.0A 2021-12-15 2021-12-15 Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium Active CN114880108B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111535943.0A CN114880108B (en) 2021-12-15 2021-12-15 Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium
PCT/CN2021/141306 WO2023108800A1 (en) 2021-12-15 2021-12-24 Performance analysis method based on cpu-gpu heterogeneous architecture, and device and storage medium


Publications (2)

Publication Number Publication Date
CN114880108A true CN114880108A (en) 2022-08-09
CN114880108B CN114880108B (en) 2023-01-03

Family

ID=82667713

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111535943.0A Active CN114880108B (en) 2021-12-15 2021-12-15 Performance analysis method and equipment based on CPU-GPU heterogeneous architecture and storage medium

Country Status (2)

Country Link
CN (1) CN114880108B (en)
WO (1) WO2023108800A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116934572B (en) * 2023-09-18 2024-03-01 荣耀终端有限公司 Image processing method and apparatus

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246779A1 (en) * 2008-12-11 2011-10-06 Isamu Teranishi Zero-knowledge proof system, zero-knowledge proof device, zero-knowledge verification device, zero-knowledge proof method and program therefor
CN104657219A (en) * 2015-02-27 2015-05-27 西安交通大学 Application program thread count dynamic regulating method used under isomerous many-core system
CN106648654A (en) * 2016-12-20 2017-05-10 深圳先进技术研究院 Data sensing-based Spark configuration parameter automatic optimization method
CN107861606A (en) * 2017-11-21 2018-03-30 北京工业大学 A kind of heterogeneous polynuclear power cap method by coordinating DVFS and duty mapping
CN111025275A (en) * 2019-11-21 2020-04-17 南京航空航天大学 Multi-base radar radiation parameter multi-target joint optimization method based on radio frequency stealth
CN112017440A (en) * 2020-10-26 2020-12-01 长沙理工大学 Iterative algorithm for intersection traffic control in automatic driving environment
CN112256623A (en) * 2020-10-26 2021-01-22 曙光信息产业(北京)有限公司 Heterogeneous system-based processing performance optimization method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109379195B (en) * 2018-12-18 2021-04-30 深圳前海微众银行股份有限公司 Zero-knowledge proof circuit optimization method, device, equipment and readable storage medium
CN111585770B (en) * 2020-01-21 2023-04-07 上海致居信息科技有限公司 Method, device, medium and system for distributed acquisition of zero-knowledge proof

Also Published As

Publication number Publication date
WO2023108800A1 (en) 2023-06-22
CN114880108B (en) 2023-01-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant