CN113485798A - Kernel function generation method, apparatus, device and storage medium - Google Patents


Info

Publication number
CN113485798A
CN113485798A (application CN202110665158.0A)
Authority
CN
China
Prior art keywords
kernel function
matrix dimension
matrix
subintervals
computing node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110665158.0A
Other languages
Chinese (zh)
Other versions
CN113485798B (en)
Inventor
肖熠
霍志坤
李志功
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shuguang Information Industry (Henan) Co.,Ltd.
Original Assignee
Dawning Information Industry Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dawning Information Industry Beijing Co Ltd filed Critical Dawning Information Industry Beijing Co Ltd
Priority to CN202110665158.0A priority Critical patent/CN113485798B/en
Publication of CN113485798A publication Critical patent/CN113485798A/en
Application granted granted Critical
Publication of CN113485798B publication Critical patent/CN113485798B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 - Arrangements for program control, e.g. control units
    • G06F 9/06 - Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 - Multiprogramming arrangements
    • G06F 9/48 - Program initiating; program switching, e.g. by interrupt
    • G06F 9/4806 - Task transfer initiation or dispatching
    • G06F 17/00 - Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 - Complex mathematical operations
    • G06F 17/16 - Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT]
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application discloses a kernel function generation method, apparatus, device, and storage medium. After kernel function configuration information is obtained, the M matrix dimension intervals it contains are each split according to the number N of target computing nodes, yielding a plurality of matrix dimension subintervals. N kernel function generation tasks are then constructed from these subintervals and the parameter space that the configuration information associates with each matrix dimension interval, and the N tasks, together with the kernel function operation files, are distributed to the N target computing nodes. Each target computing node generates a kernel function corresponding to at least one matrix dimension from its task and the operation files. Finally, the kernel functions returned by the target computing nodes are merged, producing the plurality of matrix dimensions and the kernel function corresponding to each matrix dimension. In this way, both the time and the computational complexity of generating the kernel functions can be reduced.

Description

Kernel function generation method, apparatus, device and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a kernel function generation method, apparatus, device, and storage medium.
Background
In modern high-performance computing, architectures that use heterogeneous accelerators as the main computing units have become dominant: heterogeneous accelerators deliver high floating-point performance at comparatively low power consumption. The Linear system package (Linpack) is the principal standard for evaluating the peak floating-point performance of a high-performance computer, and High Performance Linpack (HPL) is the benchmark most widely used worldwide at present, the current test standard for large-scale and ultra-large-scale clusters. On a high-performance computer, HPL solves a dense system of N linear equations by Gaussian elimination or an iterative method to evaluate floating-point performance. Part of HPL runs on the CPU and part on the heterogeneous accelerator; within the accelerator portion, the most computationally intensive programs are those implementing double-precision general matrix multiplication (GEMM) and double-precision triangular solve (TRSM). Each of these corresponds to a kernel function, and every kernel function must be written explicitly or implicitly. Because accelerator microarchitectures differ, the efficiency of the same algorithm often varies greatly across accelerators, so kernel functions are typically produced with automatic code generation techniques.
Tensile is a conventional kernel function generator capable of producing GEMM and TRSM kernel functions. In the related art, when a kernel function is generated with Tensile, test files are first constructed from a preset matrix dimension interval (the value ranges of the rows and columns of the matrix) and a preset parameter space, where the parameter space comprises a plurality of parameters and a value range for each parameter. Concretely, a number of matrix dimensions are determined from the matrix dimension interval; for each matrix dimension, test files are built from that dimension and the parameter space, each test file containing the matrix dimension, every parameter in the parameter space, and a value for each parameter. Multiple kernel functions are then generated from the test files for that dimension, a Benchmark instance is run to measure the performance of each kernel function, and the best-performing one is selected as the kernel function corresponding to that matrix dimension. In this way, one kernel function is ultimately produced for every determined matrix dimension.
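The per-dimension search described above can be sketched in a few lines of Python. This is a minimal illustration only: `tune_kernels`, `fake_benchmark`, and the two sample parameters are hypothetical stand-ins, and actually compiling and timing a kernel is reduced to a scoring function.

```python
import itertools

def tune_kernels(dim_interval, param_space, benchmark):
    """For each matrix dimension in the interval, enumerate every parameter
    combination, benchmark a kernel built from it, and keep the fastest.
    benchmark(dim, params) returns a score (higher is better)."""
    names = list(param_space)
    best = {}
    for dim in range(dim_interval[0], dim_interval[1] + 1):
        candidates = [dict(zip(names, values))
                      for values in itertools.product(*param_space.values())]
        best[dim] = max(candidates, key=lambda p: benchmark(dim, p))
    return best

# Toy stand-in for running a Benchmark instance on a generated kernel:
# larger vector widths score higher only when the workgroup divides the dim.
def fake_benchmark(dim, params):
    return params["vector_width"] if dim % params["workgroup"] == 0 else 0

table = tune_kernels((64, 66),
                     {"vector_width": [1, 2, 4], "workgroup": [16, 32]},
                     fake_benchmark)
```

Even this toy space benchmarks six candidates per dimension; the paragraph below explains why a realistic space of ten or more parameters makes the single-machine version slow.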
In this generation process, the parameter space contains at least ten different parameters, each with its own value range, so constructing the test files for a single matrix dimension takes considerable time; when the matrix dimension interval is large, the total time grows further with the number of determined matrix dimensions. Moreover, generating the kernel functions from the test files and measuring the performance of each one is computationally expensive because of the sheer number of test files. In short, generating kernel functions from a preset matrix dimension interval and a preset parameter space takes a long time and has high computational complexity.
Disclosure of Invention
The application provides a kernel function generation method, apparatus, device, and storage medium, to solve the problems of long kernel function generation time and high computational complexity.
In a first aspect, the present application provides a kernel function generating method, including:
after kernel function configuration information is obtained, splitting M matrix dimension intervals according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals, wherein the kernel function configuration information comprises the M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval, and M and N are positive integers greater than or equal to 1;
constructing the N kernel function generation tasks according to the matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval, wherein each kernel function generation task comprises at least one matrix dimension subinterval and the parameter space corresponding to each matrix dimension subinterval;
distributing the N kernel function generation tasks and the kernel function operation files to the N target computing nodes, wherein each target computing node is used for generating a kernel function corresponding to at least one matrix dimension according to the kernel function generation tasks and the kernel function operation files;
and combining the kernel functions corresponding to the at least one matrix dimension sent by each target computing node to obtain kernel function generation information, wherein the kernel function generation information comprises a plurality of matrix dimensions and the kernel function corresponding to each matrix dimension.
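The four steps above can be sketched end to end in Python. This is an illustrative reduction only: `generate_kernels`, `run_task`, and the stub node are hypothetical names, and distributing a task with its operation files over the network is collapsed into a function call.

```python
def generate_kernels(config, nodes, run_task):
    """Master-node pipeline from the four claimed steps: split each matrix
    dimension interval across the N nodes, build N tasks, dispatch them,
    and merge the per-dimension kernels the nodes send back.
    config maps an interval (lo, hi) to its parameter space; run_task
    stands in for sending one task (plus the operation files) to one node."""
    n = len(nodes)
    tasks = [[] for _ in range(n)]                      # one task per node
    for (lo, hi), space in config.items():              # step 1: split
        size, extra = divmod(hi - lo + 1, n)
        start = lo
        for i in range(n):
            end = start + size - 1 + (1 if i < extra else 0)
            if start <= end:                            # step 2: build task
                tasks[i].append(((start, end), space))
            start = end + 1
    merged = {}                                         # steps 3-4: run, merge
    for node, task in zip(nodes, tasks):
        merged.update(run_task(node, task))
    return merged

# Stub node: "generates" one kernel name per dimension in its subintervals.
def stub_run(node, task):
    return {d: f"{node}:kernel-{d}" for (lo, hi), _ in task
            for d in range(lo, hi + 1)}

out = generate_kernels({(1, 6): {"vector_width": [1, 2]}}, ["n0", "n1"], stub_run)
```

With two nodes, dimensions 1-3 come back from the first node and 4-6 from the second, and the merge step leaves one kernel per dimension.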
Optionally, the splitting the M matrix dimension intervals according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals includes:
splitting, in descending order of the first end values of the row or column value ranges of the M matrix dimension intervals, each matrix dimension interval in turn into a plurality of matrix dimension subintervals according to the size of that matrix dimension interval and the number N of the target computing nodes, wherein the difference in the number of matrix dimensions covered by different matrix dimension subintervals is less than or equal to a first preset threshold; or, alternatively,
splitting, in descending order of the first end values of the row or column value ranges of the M matrix dimension intervals, each matrix dimension interval in turn into the N matrix dimension subintervals.
Another embodiment in the above application has the following advantages or benefits: splitting each matrix dimension interval in descending order of the first end values of the row or column value ranges makes it convenient to construct the N kernel function generation tasks. Because the difference in the number of matrix dimensions covered by different subintervals is at most the first preset threshold, every computing node spends roughly the same time generating kernel functions from its subinterval and parameter space, so the parallel running times match. Alternatively, splitting each of the M matrix dimension intervals into N matrix dimension subintervals achieves the same balance of per-node running time.
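As a rough illustration of the first splitting strategy (names hypothetical, and assuming each interval holds at least n dimensions), the intervals can be ordered by their first end value and each cut into balanced pieces whose lengths differ by at most one:

```python
def split_all(intervals, n):
    """Order the matrix dimension intervals by the first end value of their
    value range, largest first (as in the claim), then cut each into n
    contiguous subintervals whose lengths differ by at most one."""
    out = []
    for lo, hi in sorted(intervals, key=lambda iv: iv[0], reverse=True):
        size, extra = divmod(hi - lo + 1, n)
        start = lo
        for i in range(n):
            end = start + size - 1 + (1 if i < extra else 0)
            out.append((start, end))
            start = end + 1
    return out

# Two intervals split across two nodes: 50 dimensions become 25 + 25,
# 8 dimensions become 4 + 4.
subs = split_all([(1, 8), (100, 149)], 2)
```

A length difference of at most one corresponds to a first preset threshold of 1 in the claim's terms.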
Optionally, the constructing the N kernel function generation tasks according to the multiple matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval includes:
determining a parameter space corresponding to each matrix dimension subinterval according to the multiple matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval;
and constructing the N kernel function generation tasks according to the N, the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension subinterval.
Another embodiment in the above application has the following advantages or benefits: constructing the N kernel function generation tasks from the N matrix dimension subintervals and the parameter space corresponding to each subinterval divides a computationally heavy kernel function generation process into lighter tasks that can be distributed to different target computing nodes and executed in parallel. This shortens the kernel function generation time, reduces the computation on each node and hence the computational complexity, and makes full use of cluster hardware resources.
Optionally, the constructing N kernel function generation tasks according to the N, the multiple matrix dimension subintervals, and the parameter space corresponding to each matrix dimension subinterval includes:
obtaining N sets from the plurality of matrix dimension subintervals, wherein the difference in the number of matrix dimension subintervals between sets is less than or equal to a second preset threshold, and the indices of the matrix dimension intervals corresponding to the subintervals in each set are all odd or all even, or are consecutive;
and adding a parameter space corresponding to the matrix dimension subintervals in the set to each set of the N sets to obtain the N kernel function generation tasks.
Another embodiment in the above application has the following advantages or benefits: obtaining N sets from the plurality of matrix dimension subintervals and adding, to each of the N sets, the parameter space corresponding to its subintervals yields the N kernel function generation tasks. A computationally heavy kernel function generation process is thus divided into lighter tasks that are distributed to different target computing nodes and executed in parallel, which shortens the generation time, reduces the computation on each node and hence the computational complexity, and makes full use of cluster hardware resources.
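One simple way to realize the set construction, sketched below under the assumption that every interval was pre-split into exactly n pieces (the function name and data layout are illustrative), is to let set k collect the k-th piece of every interval:

```python
def group_into_sets(per_interval, n):
    """Each matrix dimension interval has been split into n subintervals
    (per_interval[i] lists the pieces of interval i). Set k collects the
    k-th piece of every interval, so all n sets hold the same number of
    subintervals (count difference 0, within any threshold), and each set
    mixes pieces drawn from every interval."""
    return [[pieces[k] for pieces in per_interval] for k in range(n)]

# Two intervals, each pre-split into two subintervals, grouped for 2 nodes:
sets = group_into_sets([[(1, 4), (5, 8)], [(10, 14), (15, 19)]], 2)
# Attaching each subinterval's parameter space to its set would then yield
# the N kernel function generation tasks.
```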
Optionally, the information of each computing node includes an IP address of the computing node, and the distributing the N kernel function generation tasks and the kernel function execution files to the N target computing nodes includes:
and distributing the N kernel function generation tasks and the kernel function running files to the N target computing nodes according to the IP addresses of the N target computing nodes.
Another embodiment in the above application has the following advantages or benefits: the distribution of the kernel function generation task is carried out according to the IP address of the target computing node, so that different target computing nodes can execute the kernel function generation task in parallel, and the computing complexity and the total kernel function generation time are reduced.
Optionally, the method further includes:
acquiring computing node list information, wherein the computing node list information comprises a plurality of computing nodes and an IP address of each computing node;
sequentially sending a request for establishing communication connection to each computing node in the computing node list information;
and if the communication connection with the computing node is successfully established, determining that the computing node is a target computing node, and allocating an identifier for the target computing node.
Another embodiment in the above application has the following advantages or benefits: first determining which computing nodes in the computing node list information can establish a communication connection facilitates the subsequent construction and distribution of the kernel function generation tasks according to the number of reachable nodes, ensures that every node executing a task is a valid node, and guarantees that the tasks are executed in parallel.
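A minimal sketch of this discovery step follows. The node-list layout and the use of a TCP connection to the node's SSH port as "establishing a communication connection" are assumptions for illustration; `probe` can be injected so the logic is testable without a network.

```python
import socket

def discover_targets(node_list, timeout=1.0, probe=None):
    """Probe each listed node; reachable nodes become target computing
    nodes and are assigned sequential identifiers, as in the claim."""
    def tcp_probe(ip, port):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    probe = probe or tcp_probe
    targets = {}
    for node in node_list:
        if probe(node["ip"], node.get("ssh_port", 22)):
            targets[len(targets)] = node        # identifier -> node info
    return targets

nodes = [{"ip": "10.0.0.1"}, {"ip": "10.0.0.2"}, {"ip": "10.0.0.3"}]
# Simulated probe: pretend 10.0.0.2 is down.
alive = discover_targets(nodes, probe=lambda ip, port: ip != "10.0.0.2")
```

Identifiers stay dense (0, 1, ...) even when a listed node is unreachable, so the number N used for splitting always matches the count of valid nodes.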
Optionally, the method further includes:
and if the periodic heartbeat information sent by the first target computing node is not received within a preset time period, sending the kernel function generation task and the kernel function operation files previously sent to the first target computing node to other target computing nodes.
Another embodiment in the above application has the following advantages or benefits: if the periodic heartbeat information from a target computing node is not received within the preset time period, the execution of its kernel function generation task can be deemed to have failed, and the task and the kernel function operation files previously sent to that node are sent to other target computing nodes, thereby realizing anomaly detection and anomaly handling.
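The failover rule can be sketched as a pure function over heartbeat timestamps; all names and the least-loaded reassignment policy are illustrative assumptions, since the claim only requires moving the failed node's task to other target computing nodes.

```python
def reassign_stale_tasks(last_beat, assignments, now, timeout=30.0):
    """If a node's last heartbeat is older than timeout, treat its task as
    failed and move it to the least-loaded surviving node. Returns a map
    of {failed_node: node_it_was_moved_to}."""
    alive = [n for n, t in last_beat.items() if now - t <= timeout]
    moved = {}
    for node, task in list(assignments.items()):
        if node not in alive and alive:
            target = min(alive, key=lambda n: len(assignments.get(n, [])))
            assignments.setdefault(target, []).extend(task)
            del assignments[node]
            moved[node] = target
    return moved

beats = {"n0": 100.0, "n1": 50.0}            # n1 has gone silent
jobs = {"n0": ["task-a"], "n1": ["task-b"]}
moved = reassign_stale_tasks(beats, jobs, now=100.0)
```

With a 30-second timeout, n1's heartbeat at t=50 is stale at t=100, so its task migrates to n0 and n1 is dropped from the assignment table.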
In a second aspect, the present application provides a kernel function generating apparatus, including:
the acquisition module is used for acquiring the kernel function configuration information;
the splitting module is used for splitting M matrix dimension intervals according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals, wherein the kernel function configuration information comprises the M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval, and M and N are positive integers greater than or equal to 1;
a building module, configured to build the N kernel function generation tasks according to the multiple matrix dimension subintervals and a parameter space corresponding to each matrix dimension interval, where each kernel function generation task includes at least one matrix dimension subinterval and a parameter space corresponding to each matrix dimension subinterval;
a sending module, configured to distribute the N kernel function generation tasks and the kernel function execution files to the N target computing nodes, where each target computing node generates a kernel function corresponding to at least one matrix dimension according to the kernel function generation tasks and the kernel function execution files;
and the processing module is used for merging the kernel functions corresponding to the at least one matrix dimension sent by each target computing node to obtain kernel function generation information, wherein the kernel function generation information comprises a plurality of matrix dimensions and the kernel function corresponding to each matrix dimension.
In a third aspect, the present application provides a kernel function generating device, including:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the kernel function generation method of the first aspect or any of the possible implementations of the first aspect via execution of the executable instructions.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the kernel function generating method of the first aspect or any of the possible implementation manners of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product, which includes a computer program, and when the computer program is executed by a processor, the computer program implements the kernel function generating method described in the first aspect or any of the possible implementation manners of the first aspect.
According to the kernel function generation method, apparatus, device, and storage medium provided by the application, after the master node acquires the kernel function configuration information, it splits the M matrix dimension intervals contained therein according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals. It then constructs N kernel function generation tasks from those subintervals and the parameter space corresponding to each matrix dimension interval, and distributes the N tasks and the kernel function operation files to the N target computing nodes. Each target computing node executes at least one task, so the N nodes execute the N tasks in parallel; finally, the master node merges the kernel functions, each corresponding to at least one matrix dimension, returned by the nodes, obtaining the plurality of matrix dimensions and the kernel function corresponding to each matrix dimension. Because the target computing nodes execute the generation tasks in parallel, the generation time is shortened and the computation on each node is reduced, which lowers the computational complexity while making full use of cluster hardware resources.
Drawings
Fig. 1 is a flowchart of a kernel function generation method according to an embodiment of the present disclosure;
fig. 2 is a flowchart of an embodiment of a kernel function generating method according to the present application;
fig. 3 is an interaction flowchart of a kernel function generation method according to an embodiment of the present application;
fig. 4 is a schematic processing flow diagram of a main node and a compute node in the kernel function generation method according to the embodiment of the present application;
fig. 5 is a schematic structural diagram of a kernel function generating apparatus according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of a kernel function generating device according to an embodiment of the present application.
Detailed Description
Reference will now be made in detail to embodiments of the present application, examples of which are illustrated in the accompanying drawings. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The terms "first" and "second," and the like in the description, the claims, and the drawings of the embodiments of the present application are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some terms in the embodiments of the present application are explained below to facilitate understanding by those skilled in the art.
1. A matrix dimension interval is the range of matrix dimensions to be solved by a kernel function. For example, the GEMM kernel function multiplies two matrices, and its matrix dimension interval is M[A1, A2], N[B1, B2], K[C1, C2]: the value range [A1, A2] of the rows of the first matrix, the value range [B1, B2] of the columns of the second matrix, and the value range [C1, C2] of the columns of the first matrix, where the number of columns of the first matrix equals the number of rows of the second matrix. Similarly, the TRSM kernel function performs a triangular solve on a matrix, and its matrix dimension intervals are M[D1, D2] and N[E1, E2]: the value range [D1, D2] of the rows of the matrix and the value range [E1, E2] of its columns.
2. A parameter space (Kernel Parameters) comprises a plurality of parameters and a value range for each parameter; the parameters include, for example, global split, global split mapping, workgroup mapping, local split, vector width, workgroup, loop unrolling, and read vectors.
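A parameter space of this shape is naturally represented as a mapping from parameter name to candidate values. The concrete names and values below are hypothetical; the point is that the number of kernel candidates for a single matrix dimension is the product of the value-range sizes, which is why a space of ten or more parameters is expensive to search exhaustively.

```python
from itertools import product

# Hypothetical parameter space shaped like the one described above.
param_space = {
    "global_split": [1, 2, 4],
    "workgroup":    [(16, 16), (32, 8)],
    "vector_width": [2, 4],
    "loop_unroll":  [4, 8],
}

# Kernel candidates to generate and benchmark for ONE matrix dimension:
# the product of all value-range sizes.
n_candidates = 1
for values in param_space.values():
    n_candidates *= len(values)

first = next(iter(product(*param_space.values())))   # one concrete combination
```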
In the related art, generating a kernel function with Tensile from a preset matrix dimension interval and a preset parameter space takes a long time and has high computational complexity. To solve this problem, embodiments of the present application provide a kernel function generation method, apparatus, device, and storage medium. The master node splits each acquired matrix dimension interval according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals, constructs N kernel function generation tasks from the subintervals and the parameter spaces corresponding to the matrix dimension intervals, and distributes the N tasks and the kernel function operation files to the N target computing nodes. Each target computing node executes at least one kernel function generation task, and the N target computing nodes execute the N tasks in parallel. Finally, the master node merges the received kernel functions, each corresponding to at least one matrix dimension, sent by the target computing nodes, obtaining the matrix dimensions and the kernel functions corresponding to them. Because the target computing nodes execute the generation tasks in parallel, the time for generating the kernel functions is shortened, the computation on each target computing node is reduced, and the computational complexity is lowered.
The master node and the computing node in the embodiment of the present application may be servers.
The following describes the technical solutions of the present application and how to solve the above technical problems with specific examples. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a kernel function generating method according to an embodiment of the present disclosure, where the kernel function generating method may be executed by a kernel function generating apparatus, and the kernel function generating apparatus may be implemented by software and/or hardware. The kernel function generating means may be a server. As shown in fig. 1, the method of this embodiment may include:
s101, after kernel function configuration information is obtained, splitting M matrix dimension intervals according to the number N of target calculation nodes to obtain a plurality of matrix dimension sub-intervals, wherein the kernel function configuration information comprises the M matrix dimension intervals and parameter spaces corresponding to the matrix dimension intervals, and M | and N are positive integers larger than or equal to 1.
Specifically, in one implementation, the execution body of this embodiment may be a master node, which may be a server. The kernel function configuration information and the information about the target computing nodes (including the number N, IP addresses, and the like) may be preset and stored on the device or apparatus that needs kernel functions generated; that device sends both to the master node. The master node obtains the kernel function configuration information and the target computing node information, executes the kernel function generation method provided in this embodiment, and, after obtaining the kernel function generation information, sends it back to the requesting device or apparatus. The target computing node information can be stored by the master node after its first transmission, so that only the kernel function configuration information needs to be sent subsequently.
The kernel function configuration information includes M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval, taking M as 5 as an example, the kernel function configuration information includes 5 matrix dimension intervals and 5 parameter spaces, each matrix dimension interval corresponds to one parameter space, and different matrix dimension intervals correspond to different parameter spaces. For example, the matrix dimension interval corresponding to the GEMM kernel function includes a value interval of a row of a first matrix, a value interval of a column of the first matrix, and a value interval of a column of a second matrix, where the column of the first matrix is equal to the row of the second matrix. The matrix dimension interval corresponding to the TRSM kernel function includes a value interval of a row of a matrix and a value interval of a column of the matrix.
Each parameter space includes a plurality of parameters and a value interval of each parameter, and the explanation of the parameter space can refer to the above explanation.
The target computing node information identifies a plurality of target computing nodes and includes information about each of them. The information about each target computing node may include its IP address, and may further include its name, Secure Shell (SSH) port, account, password, path for storing computing process information, the number of heterogeneous accelerators it contains, and the like.
Specifically, the M matrix dimension intervals are each split to obtain a plurality of matrix dimension subintervals, where splitting means dividing a large interval into contiguous smaller intervals. Each matrix dimension interval may be split randomly according to its size and the number N of target computing nodes, while ensuring that the difference between the numbers of matrix dimensions covered by different subintervals is less than or equal to a first preset threshold. For example, if the first preset threshold is 0, the different subintervals cover exactly the same number of matrix dimensions; the computing nodes then take the same running time when generating kernel functions from each subinterval and its parameter space, so the parallel running times are equal and load balance is ensured.
It should be noted that the size of a matrix dimension interval may be the size of the value interval of the matrix's rows or of its columns. For example, if the value interval of the rows of a matrix is [50, 100], the size of that value interval is 50, and the size of the matrix dimension interval is 50. For the matrix dimension interval corresponding to a GEMM kernel function, the size may be the size of the largest value interval among the value interval of the rows of the first matrix, the value interval of the columns of the first matrix, and the value interval of the columns of the second matrix. For the matrix dimension interval corresponding to a TRSM kernel function, the size may be the size of the larger of the value intervals of the matrix's rows and columns.
Optionally, the M matrix dimension intervals are respectively split according to the number N of the target computing nodes to obtain a plurality of matrix dimension subintervals, and the following two implementable modes are provided:
In the first mode, the matrix dimension intervals are processed in descending order of the first end values of the value intervals of their rows or columns; each matrix dimension interval is split into a plurality of matrix dimension subintervals according to its size and the number N of target computing nodes, with the difference between the numbers of matrix dimensions covered by different subintervals kept within a preset range.
For example, suppose there are 4 matrix dimension intervals: matrix dimension interval 1, interval 2, interval 3, and interval 4. Sorting them in descending order of the first end values of the value intervals of their rows or columns might yield the order interval 1, interval 3, interval 2, interval 4. Each interval is then split in turn into a plurality of matrix dimension subintervals according to its size and the number N of target computing nodes, with the difference between the numbers of matrix dimensions covered by different subintervals less than or equal to the first preset threshold, so that the computing nodes take the same running time when generating kernel functions from each subinterval and parameter space and the parallel running times are equal. Splitting the intervals in this descending order also makes it convenient to construct the N kernel function generation tasks afterwards.
In the second mode, each matrix dimension interval is split in turn into N matrix dimension subintervals, again in descending order of the first end values of the value intervals of the rows or columns of the M matrix dimension intervals.
Specifically, each of the M matrix dimension intervals is split into N matrix dimension subintervals. This mode suits scenarios where the numbers of matrix dimensions covered by the matrix dimension intervals are approximately the same; after splitting into N subintervals, the numbers of matrix dimensions covered by the subintervals are also approximately the same, so the computing nodes take the same running time when generating kernel functions from each subinterval and parameter space and the parallel running times are equal. Splitting the intervals in descending order of the first end values of their value intervals again makes it convenient to construct the N kernel function generation tasks.
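The balanced splitting described by both modes can be sketched as follows. This is a hypothetical helper, not the patent's actual implementation; it assumes integer-valued inclusive intervals and takes the first preset threshold as 1.

```python
def split_interval(lo, hi, n):
    """Split the inclusive dimension interval [lo, hi] into n contiguous
    subintervals whose covered dimension counts differ by at most 1."""
    total = hi - lo + 1              # number of matrix dimensions in the interval
    base, extra = divmod(total, n)
    subintervals, start = [], lo
    for i in range(n):
        size = base + (1 if i < extra else 0)
        subintervals.append((start, start + size - 1))
        start += size
    return subintervals

def split_all(intervals, n):
    """Mode two: split every interval into n subintervals, processing the
    intervals in descending order of their first end values."""
    ordered = sorted(intervals, key=lambda iv: iv[0], reverse=True)
    return [split_interval(lo, hi, n) for lo, hi in ordered]

print(split_interval(50, 100, 3))   # three subintervals of 17 dimensions each
```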
In this embodiment, a target computing node may be any computing node in the computing node list information with which a communication connection can be established. In an implementable manner, before S101, the method may further include:
acquiring computing node list information, wherein the computing node list information comprises a plurality of computing nodes and an IP address of each computing node;
and sequentially sending a communication connection establishment request to each computing node in the computing node list information; if a communication connection is successfully established with a computing node, that computing node is determined to be a target computing node and is allocated an identifier.
Determining first which computing nodes in the list can establish a communication connection facilitates generating and distributing the kernel function generation tasks according to the number of such nodes, ensures that all nodes executing the tasks are effective nodes, and ensures that the tasks execute in parallel.
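The connection-probing step can be sketched roughly as follows. The plain TCP probe below stands in for a real SSH handshake, and the dictionary keys (`ip`, `ssh_port`) are illustrative assumptions rather than fields defined by the text.

```python
import socket

def discover_target_nodes(node_list, probe=None, timeout=2.0):
    """Walk the compute-node list in order, keep every node that answers a
    connection attempt, and allocate each surviving node a sequential
    identifier. `probe` defaults to a plain TCP connect to the node's SSH
    port; a real deployment would authenticate over SSH instead."""
    def tcp_probe(node):
        try:
            with socket.create_connection((node["ip"], node.get("ssh_port", 22)),
                                          timeout=timeout):
                return True
        except OSError:
            return False

    probe = probe or tcp_probe
    targets = []
    for node in node_list:
        if probe(node):
            targets.append(dict(node, node_id=len(targets)))  # assign identifier
    return targets
```

Injecting a custom `probe` keeps the discovery logic testable without live hosts.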
S102, constructing N kernel function generation tasks according to the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval, where each kernel function generation task includes at least one matrix dimension subinterval and the parameter space corresponding to each of its subintervals.
S103, distributing the N kernel function generation tasks and the kernel function running files to the N target computing nodes, so that each target computing node generates a kernel function corresponding to at least one matrix dimension according to its kernel function generation task and kernel function running file.
Specifically, the information of each computing node includes its IP address, and S103 may specifically distribute the N kernel function generation tasks and kernel function running files to the N target computing nodes according to their IP addresses, with one kernel function generation task and one kernel function running file sent to each target computing node.
After receiving a kernel function generation task and a kernel function running file, each target computing node generates kernel functions. The kernel function running file may include Tensile and the files for Tensile's running environment, and the process by which a computing node generates a kernel function may specifically be as follows: the computing node generates a plurality of config.yaml files from the kernel function generation task, each config.yaml file containing one matrix dimension subinterval and the parameter space corresponding to that subinterval; the computing node runs Tensile on each config.yaml file in turn, using Tensile to generate a plurality of kernel functions and a Benchmark case; it then runs the Benchmark case to test the performance of each kernel function and determines the kernel function with the optimal performance as the kernel function corresponding to one matrix dimension, where optimal performance may mean the shortest kernel solving time.
Generating a plurality of kernel functions with Tensile and running the Benchmark case follow the existing generation process; taking one config.yaml file as an example, the specific procedure is as follows. First, test files are built from the matrix dimension subinterval and its corresponding parameter space, where the parameter space includes a plurality of parameters and a value interval for each parameter. To build the test files, a plurality of matrix dimensions is determined within the matrix dimension subinterval; for each matrix dimension, a test file is built from that dimension and the parameter space, each test file containing the matrix dimension, every parameter in the parameter space, and a value of each parameter. Then a plurality of kernel functions is generated from the test files for that matrix dimension, the performance of each kernel function is tested by running the Benchmark case, and the kernel function with optimal performance is determined as the kernel function for that matrix dimension. Finally, one kernel function is generated for each of the determined matrix dimensions.
For example, target computing node 1 generates 2 config.yaml files (say config.yaml file 1 and config.yaml file 2) from its kernel function generation task, and finally uses Tensile to generate, from config.yaml file 1, kernel functions corresponding to m matrix dimensions and, from config.yaml file 2, kernel functions corresponding to n matrix dimensions, where m and n are positive integers.
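The per-subinterval configuration files might be generated along these lines. The key names in the rendered text (`ProblemSizes`, `Parameters`) are illustrative placeholders only; the actual Tensile config.yaml schema is considerably richer.

```python
import os

def make_config_text(subinterval, parameter_space):
    """Render one per-subinterval configuration file in the spirit of a
    Tensile config.yaml. `subinterval` is an (M, N, K) triple of (lo, hi)
    value intervals; `parameter_space` maps parameter name -> candidate
    values. All key names here are assumptions for illustration."""
    m, n, k = subinterval
    lines = ["ProblemSizes:",
             f"  M: [{m[0]}, {m[1]}]",
             f"  N: [{n[0]}, {n[1]}]",
             f"  K: [{k[0]}, {k[1]}]",
             "Parameters:"]
    for name, values in sorted(parameter_space.items()):
        lines.append(f"  {name}: {list(values)}")
    return "\n".join(lines) + "\n"

def write_configs(task, out_dir="."):
    """Write one config file per matrix dimension subinterval in the task,
    where the task is a list of (subinterval, parameter_space) pairs."""
    paths = []
    for i, (sub, space) in enumerate(task, start=1):
        path = os.path.join(out_dir, f"config{i}.yaml")
        with open(path, "w") as f:
            f.write(make_config_text(sub, space))
        paths.append(path)
    return paths
```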
S104, merging the kernel functions corresponding to at least one matrix dimension sent by each target computing node to obtain kernel function generation information, wherein the kernel function generation information comprises a plurality of matrix dimensions and the kernel function corresponding to each matrix dimension.
Specifically, the master node receives from each target computing node the kernel function corresponding to at least one matrix dimension. For example, with 3 target computing nodes in total, the master node may receive 10 matrix dimensions and their kernel functions from target computing node 1, 20 matrix dimensions and their kernel functions from target computing node 2, and 30 matrix dimensions and their kernel functions from target computing node 3; after merging, the kernel function generation information contains 50 matrix dimensions and the kernel function corresponding to each of them.
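The merging step amounts to combining disjoint per-node mappings. A minimal sketch, assuming each node reports a {matrix dimension: kernel} dictionary:

```python
def merge_kernel_results(per_node_results):
    """Merge the {matrix_dimension: kernel} mappings returned by each target
    computing node into one kernel-generation-information mapping. Because
    the subintervals are disjoint, no dimension should arrive twice; a
    clash is treated as an error rather than silently overwritten."""
    merged = {}
    for node_id, kernels in per_node_results.items():
        for dim, kernel in kernels.items():
            if dim in merged:
                raise ValueError(f"dimension {dim} reported by two nodes")
            merged[dim] = kernel
    return merged
```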
In the kernel function generating method provided in this embodiment, after the master node obtains the kernel function configuration information, it splits the M matrix dimension intervals contained therein according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals. It then constructs N kernel function generation tasks from the subintervals and the parameter space corresponding to each matrix dimension interval, and distributes the N tasks together with the kernel function running files to the N target computing nodes. Each target computing node executes at least one kernel function generation task, so the N tasks execute in parallel across the N computing nodes. Finally, the master node merges the kernel functions corresponding to at least one matrix dimension returned by each target computing node to obtain the plurality of matrix dimensions and the kernel function corresponding to each. Because multiple target computing nodes execute the kernel function generation tasks in parallel, the time to generate the kernel functions is shortened and the computation load of each node is reduced, thereby lowering computational complexity while making full use of cluster hardware resources.
Fig. 2 is a flowchart of an embodiment of a kernel function generating method provided in an embodiment of the present application. As shown in Fig. 2, the method of this embodiment builds on the method shown in Fig. 1; optionally, S102 may be implemented by the following steps:
and S1021, determining a parameter space corresponding to each matrix dimension subinterval according to the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval.
Specifically, each matrix dimension subinterval is obtained by splitting a matrix dimension interval. For example, if the parameter space corresponding to matrix dimension interval 1 is parameter space 1, and interval 1 is split into matrix dimension subintervals 1, 2, 3, and 4, then all four of these subintervals correspond to parameter space 1.
S1022, constructing N kernel function generation tasks according to N, the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension subinterval.
Specifically, each kernel function generation task includes at least one matrix dimension subinterval and the parameter space corresponding to each of its subintervals. When constructing the N tasks after splitting in mode one, for example, a task may be formed by selecting several subintervals split from a single matrix dimension interval; or subintervals may be selected crosswise, e.g., with 4 matrix dimension intervals, several subintervals split from interval 1 and from interval 3 may be combined into one task; or subintervals may be selected from consecutive matrix dimension intervals. The selection mode is not limited in this embodiment, provided the difference between the numbers of matrix dimension subintervals in the kernel function generation tasks is less than or equal to a second preset threshold, i.e., provided load balance is ensured.
When the matrix dimension intervals are split in mode two, each interval is split into N subintervals; one subinterval is then selected at random from the subintervals of each interval, and the N selected subintervals in total form one kernel function generation task. The subintervals included in the kernel function generation tasks are then the same size, which guarantees that all target computing nodes have the same parallel running time.
As an implementation manner, S1022 may specifically be:
obtaining N sets from the plurality of matrix dimension subintervals, where the difference between the numbers of subintervals in the sets is less than or equal to a second preset threshold, and the numbers of the matrix dimension subintervals in each set are all odd or all even, or are consecutive; then, for each of the N sets, adding the parameter space corresponding to its matrix dimension subintervals, yielding the N kernel function generation tasks.
Specifically, after each matrix dimension interval is split, the matrix dimension subintervals may be numbered consecutively. When the N sets are formed from the subintervals, the difference between the numbers of subintervals in the sets must be less than or equal to the second preset threshold; for example, with a threshold of 0 or 1, the sets contain essentially equal numbers of elements. In the specific selection, subintervals may be chosen by number, e.g., the odd-numbered or even-numbered subintervals, or subintervals with consecutive numbers.
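Set construction in the mode-two style, where each set takes one subinterval from every matrix dimension interval, can be sketched as follows (a simplified round-robin assignment rather than the random selection the text also allows):

```python
def build_tasks(subintervals_by_interval, spaces_by_interval, n):
    """Build n kernel-generation tasks. `subintervals_by_interval[i]` holds
    the subintervals split from interval i (n of them in mode two), and
    `spaces_by_interval[i]` is that interval's parameter space. Each task
    receives one subinterval per interval, so the set sizes are equal
    (second preset threshold of 0)."""
    tasks = [[] for _ in range(n)]
    for iv_idx, subs in enumerate(subintervals_by_interval):
        for set_idx, sub in enumerate(subs):
            tasks[set_idx % n].append((sub, spaces_by_interval[iv_idx]))
    return tasks
```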
In the embodiment shown in fig. 1 and fig. 2, optionally, the method may further include:
and if the timing heartbeat information sent by a first target computing node is not received within a preset time period, sending the kernel function generation task and kernel function running file that were sent to the first target computing node to another target computing node.
Specifically, if the timing heartbeat information from a certain target computing node is not received within the preset time period, it can be determined that that node's kernel function generation task has failed; the task and the kernel function running file that were sent to it are then sent to another target computing node, thereby realizing exception detection and exception handling.
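The heartbeat-based failure detection can be sketched as follows, with an injectable clock so the timeout logic is testable; the 30-second default is an assumption, not a value from the text, and the real master would also resend the kernel function running file to the replacement node.

```python
import time

class HeartbeatMonitor:
    """Track the last heartbeat time of each target node; any node silent
    for longer than `timeout` seconds is reported as failed so its
    kernel-generation task can be redistributed."""
    def __init__(self, node_ids, timeout=30.0, clock=time.monotonic):
        self.timeout, self.clock = timeout, clock
        self.last_seen = {nid: clock() for nid in node_ids}

    def beat(self, node_id):
        """Record a timing heartbeat from a node."""
        self.last_seen[node_id] = self.clock()

    def failed_nodes(self):
        """Nodes whose heartbeat has not arrived within the timeout."""
        now = self.clock()
        return [nid for nid, t in self.last_seen.items()
                if now - t > self.timeout]
```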
The following describes a detailed process of the kernel function generation method provided in the present application with reference to a specific embodiment.
Fig. 3 is an interaction flowchart of a kernel function generation method provided in an embodiment of the present application, and as shown in fig. 3, the method of the present embodiment may include:
S201, the master node acquires the kernel function configuration information and the computing node list information.
The kernel function configuration information includes M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval; each parameter space includes a plurality of parameters and a value interval for each parameter. The computing node list information includes a plurality of computing nodes and information about each computing node, and M is a positive integer greater than or equal to 1. Fig. 4 is a schematic diagram of the processing flow of the master node and the computing nodes in a kernel function generation method provided in an embodiment of the present application. As shown in Fig. 4, this embodiment takes a GEMM kernel function as an example: the kernel function configuration information includes 4 matrix dimension intervals and 4 parameter spaces, where matrix dimension interval 1 is M[A1, A2], N[B1, B2], K[C1, C2]; matrix dimension interval 2 is M[A3, A4], N[A3, A4], K[A3, A4]; matrix dimension interval 3 is M[A5, A6], N[A5, A6], K[A5, A6]; and matrix dimension interval 4 is M[A7, A8], N[A7, A8], K[A7, A8]. Here M, N, and K denote the rows of the first matrix, the columns of the second matrix, and the columns of the first matrix (equal to the rows of the second matrix), respectively.
Specifically, the master node information and the computing node list information may be preset in the device or apparatus that needs to generate a kernel function. The master node information may include the master node's name, IP address, SSH port, account, password, path for storing computing process information, and the like. The computing node list information may include a plurality of computing nodes and, for each, information such as its IP address, SSH port, account, password, path for storing computing process information, and the number of heterogeneous accelerators it contains.
The device or equipment needing to generate the kernel function can send kernel function configuration information and computing node list information to the main node according to the IP address of the main node.
Before the master node obtains the kernel function configuration information and the computing node list information, it may first detect whether its system state is normal at startup. For example, the device or apparatus that needs to generate a kernel function may send notification information before sending the configuration and list information; on receiving the notification, the master node detects whether its system state is normal, e.g., whether at least one of the network state, the specified communication port state, the node storage space, the memory space, and the versions of dependent base software is normal. After the system check completes, it acquires the kernel function configuration information and the computing node list information.
S202, the master node sequentially sends a communication connection establishment request to each computing node in the computing node list information; if a communication connection with a computing node is successfully established, that node is determined to be a target computing node and is allocated an identifier.
Specifically, the master node sequentially sends a communication connection establishment request to each computing node in the list. After receiving the request, a computing node may first detect whether its own system state is normal, for example whether at least one of the network state, the specified communication port state, the node storage space, the memory space, and the versions of dependent base software is normal.
If the master node receives the computing node's confirmation information, it receives the node's state information; if the computing node's system state is normal, the master node determines that a communication connection has been successfully established, adds the node as a target computing node, and allocates an identifier to it.
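A node self-check might look roughly like this. Only the storage-space and communication-port checks from the list above are sketched, and the thresholds and interface are assumptions; network, memory, and dependent-software checks are omitted for brevity.

```python
import shutil
import socket

def system_state_ok(path="/", min_free_bytes=1 << 30, port=None):
    """Simplified stand-in for the node startup self-check: verify free
    storage under `path` and, optionally, that a required local
    communication port can still be bound."""
    if shutil.disk_usage(path).free < min_free_bytes:
        return False                      # node storage space abnormal
    if port is not None:
        try:
            with socket.socket() as s:
                s.bind(("127.0.0.1", port))   # port free -> state normal
        except OSError:
            return False                  # specified communication port in use
    return True
```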
S203, the main node splits the M matrix dimension intervals according to the number N of the target calculation nodes to obtain a plurality of matrix dimension subintervals, wherein N is a positive integer greater than or equal to 1.
The specific splitting may adopt mode one or mode two of the embodiment shown in Fig. 1, or other manners.
In the example shown in Fig. 4, N is 3 and there are 3 target computing nodes. The master node splits the 4 matrix dimension intervals across the 3 target computing nodes to obtain a plurality of matrix dimension subintervals. In the embodiment shown in Fig. 4, the value interval of the rows M of the first matrix is the largest value interval, so only the value interval of M is split and the value intervals corresponding to N and K are left unsplit. For example, the interval M[A1, A2] is split into 3 subintervals: M[A1.1, A1.2], M[A1.3, A1.4], and M[A1.5, A2]. Matrix dimension interval 1 thus yields 3 matrix dimension subintervals: the first is M[A1.1, A1.2], N[B1, B2], K[C1, C2]; the second is M[A1.3, A1.4], N[B1, B2], K[C1, C2]; and the third is M[A1.5, A2], N[B1, B2], K[C1, C2]. Matrix dimension intervals 2, 3, and 4 are split similarly, as can be seen in Fig. 4, and are not described again here.
It should be noted that Fig. 4 is only an example; in another embodiment, M, N, and K may each be split and their subintervals then arranged and combined, yielding a plurality of matrix dimension subintervals over M, N, and K jointly.
S204, the master node constructs N kernel function generation tasks according to the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval, where each kernel function generation task includes at least one matrix dimension subinterval and the parameter space corresponding to each of its subintervals.
Specifically, S204 may be: determining the parameter space corresponding to each matrix dimension subinterval according to the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval, and then constructing the N kernel function generation tasks according to N, the plurality of matrix dimension subintervals, and the parameter space corresponding to each subinterval.
Taking Fig. 4 as an example, the splitting in S203 yields 3 × 4 = 12 matrix dimension subintervals, and 3 kernel function generation tasks are constructed from the number of target computing nodes (3), the 12 subintervals, and the parameter space corresponding to each subinterval. In the specific construction, the subintervals may, for example, be divided equally, each task containing 4 subintervals and their corresponding parameter spaces; in dividing them, the subintervals may be combined at random or combined according to their sizes, so as to ensure load balance.
S205, the main node distributes the N kernel function generation tasks and the kernel function running files to the N target computing nodes according to the IP addresses of the N target computing nodes.
Taking Fig. 4 as an example, the 3 kernel function generation tasks constructed in S204 are distributed to the 3 target computing nodes, one task per node. Optionally, after sending a node its kernel function generation task, the master node may record that node's transmission state as "transmitted".
S206, the target computing node generates a kernel function corresponding to at least one matrix dimension according to the received kernel function generation task and kernel function running file.
Specifically, the kernel function running file may include Tensile and the files for Tensile's running environment, and S206 may specifically be: the target computing node generates a plurality of config.yaml files from the kernel function generation task, each config.yaml file containing one matrix dimension subinterval and its corresponding parameter space; the node runs Tensile on each config.yaml file in turn, using Tensile to generate a plurality of kernel functions and a Benchmark case; it then runs the Benchmark case to test the performance of each kernel function and determines the kernel function with the optimal performance as the kernel function corresponding to one matrix dimension, where optimal performance may mean the shortest kernel solving time. Generating kernel functions with Tensile and running the Benchmark case follow the existing generation process and are not described again here.
Taking Fig. 4 as an example, suppose the kernel function generation task received by target computing node 1 contains 4 matrix dimension subintervals and their corresponding parameter spaces. Target computing node 1 generates 4 config.yaml files from the task (config.yaml files 1, 2, 3, and 4), runs Tensile on each config.yaml file in turn, and finally uses Tensile to generate kernel functions for m matrix dimensions from file 1, n matrix dimensions from file 2, p matrix dimensions from file 3, and q matrix dimensions from file 4, so that target computing node 1 ultimately generates kernel functions for m + n + p + q matrix dimensions. The generation process on the other target computing nodes is similar and is not described again here.
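Selecting the optimal kernel from the Benchmark timings (shortest solving time wins, per the text) reduces to a minimum search; a minimal sketch, assuming the Benchmark run produces a {kernel name: solve time} mapping per matrix dimension:

```python
def pick_best_kernel(benchmark_results):
    """Given {kernel_name: solve_time} timings for one matrix dimension,
    return the kernel with optimal performance, i.e. shortest solve time."""
    if not benchmark_results:
        raise ValueError("no benchmark results")
    return min(benchmark_results, key=benchmark_results.get)

def best_kernels_per_dimension(results):
    """Map every benchmarked matrix dimension to its best kernel, producing
    one kernel per determined matrix dimension as in step S206."""
    return {dim: pick_best_kernel(times) for dim, times in results.items()}
```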
S207, while running the Benchmark case to test the performance of each kernel function, the target computing node sends timing heartbeat information to the master node at a preset period to confirm normal communication.
S208, the target computing node sends the kernel function corresponding to at least one matrix dimension to the master node.
S209, the master node monitors the timing heartbeat information of all target computing nodes; if the timing heartbeat information sent by a first target computing node is not received within the preset time period, the kernel function generation task and kernel function running file sent to the first target computing node are sent to another target computing node.
Specifically, if the timing heartbeat information from a certain target computing node is not received within the preset time period, it can be determined that that node's kernel function generation task has failed; the task and the kernel function running file that were sent to it are then sent to another target computing node, thereby realizing exception detection and exception handling.
S210, the master node merges the kernel functions corresponding to at least one matrix dimension sent by each target computing node to obtain kernel function generation information, and the kernel function generation information comprises a plurality of matrix dimensions and the kernel function corresponding to each matrix dimension.
The kernel function generation method provided in this embodiment shortens the time for generating kernel functions: the computation amount of each computing node is reduced, so the computational complexity decreases; meanwhile, the cluster hardware resources are fully utilized, and the total time for generating kernel functions shortens further as the number of computing nodes increases.
The following are embodiments of the apparatus of the present application that may be used to perform the above-described embodiments of the method of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method described above in the present application.
Fig. 5 is a schematic structural diagram of a kernel function generating device according to an embodiment of the present application, and as shown in fig. 5, the device according to the embodiment may include: an acquisition module 11, a splitting module 12, a construction module 13, a sending module 14 and a processing module 15, wherein,
the obtaining module 11 is configured to obtain kernel function configuration information.
The splitting module 12 is configured to split the M matrix dimension intervals according to the number N of the target computing nodes, to obtain a plurality of matrix dimension subintervals, where the kernel function configuration information includes the M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval, and M and N are positive integers greater than or equal to 1.
The building module 13 is configured to build N kernel function generation tasks according to the multiple matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval, where each kernel function generation task includes at least one matrix dimension subinterval and the parameter space corresponding to each matrix dimension subinterval;
the sending module 14 is configured to distribute the N kernel function generation tasks and the kernel function execution files to the N target computing nodes, and is configured to generate, by each target computing node, a kernel function corresponding to at least one matrix dimension according to the kernel function generation tasks and the kernel function execution files;
the processing module 15 is configured to merge the kernel functions corresponding to at least one matrix dimension sent by each target computing node to obtain kernel function generation information, where the kernel function generation information includes multiple matrix dimensions and a kernel function corresponding to each matrix dimension.
Optionally, the splitting module 12 is configured to: sequentially split each matrix dimension interval into a plurality of matrix dimension subintervals in descending order of the first end values of the value intervals of the rows or columns in the M matrix dimension intervals, where the difference between the numbers of matrix dimensions corresponding to different matrix dimension subintervals is smaller than or equal to a first preset threshold; alternatively,
sequentially split each matrix dimension interval into N matrix dimension subintervals in descending order of the first end values of the value intervals of the rows or columns in the M matrix dimension intervals.
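The splitting performed by module 12 can be sketched as follows. This is one plausible reading of the rule, under stated assumptions: each interval is a closed range `(lo, hi)` of matrix dimensions, "first end value" is taken to mean the lower bound `lo`, and "difference smaller than or equal to a first preset threshold" is satisfied by making subinterval sizes differ by at most one.

```python
def split_intervals(intervals, n):
    """Split each matrix dimension interval (lo, hi), inclusive, into up
    to n contiguous subintervals whose sizes differ by at most one,
    processing intervals in descending order of their first end value."""
    ordered = sorted(intervals, key=lambda iv: iv[0], reverse=True)
    result = []
    for lo, hi in ordered:
        size = hi - lo + 1
        base, extra = divmod(size, n)
        start = lo
        subs = []
        for i in range(n):
            count = base + (1 if i < extra else 0)
            if count == 0:      # more nodes than dimensions in this interval
                continue
            subs.append((start, start + count - 1))
            start += count
        result.append(subs)
    return result
```

For example, splitting the intervals (1, 10) and (5, 12) for N = 3 nodes processes (5, 12) first and yields subintervals of sizes 3/3/2 and 4/3/3 respectively.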
Optionally, the constructing module 13 is configured to determine a parameter space corresponding to each matrix dimension subinterval according to the multiple matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval;
and constructing N kernel function generation tasks according to the N matrix dimension subintervals and the parameter space corresponding to each matrix dimension subinterval.
Optionally, the constructing module 13 is specifically configured to obtain N sets according to the plurality of matrix dimension subintervals, where the difference between the numbers of matrix dimension subintervals in each set is less than or equal to a second preset threshold, and the numbers of the matrix dimension intervals corresponding to the matrix dimension subintervals in each set are all odd or all even, or are consecutive;
and adding a parameter space corresponding to the matrix dimension subintervals in the sets for each set of the N sets to obtain N kernel function generation tasks.
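The grouping performed by module 13 can be sketched with a simple round-robin over the N sets, which is one assumed strategy consistent with the stated constraints: set sizes differ by at most one, and for N = 2 the interval numbers in each set come out all even or all odd.

```python
def group_into_tasks(subintervals, n):
    """subintervals: list of (interval_number, subinterval) pairs in
    order. Distributes them round-robin over n sets; attaching each
    subinterval's parameter space to its set then yields the n kernel
    function generation tasks."""
    sets = [[] for _ in range(n)]
    for i, item in enumerate(subintervals):
        sets[i % n].append(item)
    return sets
```

With 5 subintervals and N = 2 this gives sets of sizes 3 and 2 (difference 1), the first holding the even-numbered intervals and the second the odd-numbered ones.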
Optionally, the information of each computing node includes an IP address of the computing node, and the sending module 14 is configured to distribute the N kernel function generation tasks and the kernel function execution files to the N target computing nodes according to the IP addresses of the N target computing nodes.
Optionally, the obtaining module 11 is further configured to obtain computing node list information, where the computing node list information includes a plurality of computing nodes and an IP address of each computing node;
the sending module 14 is further configured to send a request for establishing a communication connection to each computing node in the computing node list information in sequence;
the splitting module 12 is also configured to: if a communication connection with a computing node is successfully established, determine that computing node to be a target computing node and allocate an identifier to it.
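The node-selection logic just described can be sketched as below. The connection attempt is injected as a callable so the sketch stays self-contained; in practice it would be a real connection attempt (e.g. a TCP connect to the node's IP), and the sequential integer identifier is an assumption about how identifiers are allocated.

```python
def select_target_nodes(node_list, try_connect):
    """node_list: [(name, ip), ...] from the computing node list
    information. try_connect(ip) -> bool stands in for the real
    connection attempt; each reachable node becomes a target computing
    node and receives a sequential identifier."""
    targets = {}
    next_id = 0
    for name, ip in node_list:
        if try_connect(ip):
            targets[name] = {"ip": ip, "id": next_id}
            next_id += 1
    return targets
```

Unreachable nodes are simply skipped, so the number N of target computing nodes used for splitting is the number of nodes that accepted the connection.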
Optionally, the sending module 14 is further configured to: if the timed heartbeat information sent by the first target computing node is not received within the preset time period, send the kernel function generation task and the kernel function running file sent to the first target computing node to other target computing nodes.
The apparatus provided in the embodiment of the present application may implement the method embodiment, and specific implementation principles and technical effects thereof may be referred to the method embodiment, which is not described herein again.
It should be noted that the division of the modules of the above apparatus is only a logical division, and the actual implementation may be wholly or partially integrated into one physical entity, or may be physically separated. And these modules can be realized in the form of software called by processing element; or may be implemented entirely in hardware; and part of the modules can be realized in the form of calling software by the processing element, and part of the modules can be realized in the form of hardware. For example, the processing module may be a processing element separately set up, or may be implemented by being integrated in a chip of the apparatus, or may be stored in a memory of the apparatus in the form of program code, and a function of the processing module may be called and executed by a processing element of the apparatus. Other modules are implemented similarly. In addition, all or part of the modules can be integrated together or can be independently realized. The processing element here may be an integrated circuit with signal processing capabilities. In implementation, each step of the above method or each module above may be implemented by an integrated logic circuit of hardware in a processor element or an instruction in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above methods, such as one or more Application Specific Integrated Circuits (ASICs), one or more Digital Signal Processors (DSPs), or one or more Field Programmable Gate Arrays (FPGAs), among others. For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general-purpose processor, such as a Central Processing Unit (CPU) or another processor that can call program code. As another example, these modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the present application are all or partially generated when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
Fig. 6 is a schematic structural diagram of a kernel function generating device according to an embodiment of the present application, and as shown in fig. 6, the kernel function generating device according to the present embodiment may include a processor 21 and a memory 22,
the memory 22 is used for storing executable instructions of the processor 21.
The processor 21 is configured to perform the kernel function generation method in the above-described method embodiments via execution of executable instructions.
Alternatively, the memory 22 may be separate or integrated with the processor 21.
When the memory 22 is a device independent of the processor 21, the kernel function generating apparatus of the present embodiment may further include:
a bus 23 for connecting the memory 22 and the processor 21.
Optionally, the kernel function generating device of this embodiment may further include: a communication interface 24, the communication interface 24 being connectable to the processor 21 via a bus 23.
The present application also provides a computer-readable storage medium, in which computer-executable instructions are stored, which, when executed on a computer, cause the computer to perform the kernel function generation method according to the above embodiment.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the method for generating a kernel function as in the above embodiments is implemented.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (10)

1. A kernel function generation method, comprising:
after kernel function configuration information is obtained, splitting M matrix dimension intervals according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals, wherein the kernel function configuration information comprises the M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval, and M and N are positive integers greater than or equal to 1;
constructing the N kernel function generation tasks according to the matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval, wherein each kernel function generation task comprises at least one matrix dimension subinterval and the parameter space corresponding to each matrix dimension subinterval;
distributing the N kernel function generation tasks and the kernel function operation files to the N target computing nodes, wherein each target computing node is used for generating a kernel function corresponding to at least one matrix dimension according to the kernel function generation tasks and the kernel function operation files;
and combining the kernel functions corresponding to the at least one matrix dimension sent by each target computing node to obtain kernel function generation information, wherein the kernel function generation information comprises a plurality of matrix dimensions and the kernel function corresponding to each matrix dimension.
2. The method of claim 1, wherein the splitting the M matrix dimension intervals according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals comprises:
according to the sequence from large to small of the first end values of the value intervals of the rows or the columns in the M matrix dimension intervals, sequentially dividing each matrix dimension interval into a plurality of matrix dimension subintervals according to the size of the matrix dimension interval and the number N of the target computing nodes, wherein the difference value of the number of the matrix dimensions corresponding to different matrix dimension subintervals is smaller than or equal to a first preset threshold value; alternatively,
and sequentially splitting each matrix dimension interval into the N matrix dimension subintervals according to the sequence from large to small of the first end values of the value intervals of the rows or the columns in the M matrix dimension intervals.
3. The method of claim 2, wherein constructing the N kernel function generation tasks according to the plurality of matrix dimension subintervals and the parameter space corresponding to each of the matrix dimension intervals comprises:
determining a parameter space corresponding to each matrix dimension subinterval according to the multiple matrix dimension subintervals and the parameter space corresponding to each matrix dimension interval;
and constructing the N kernel function generation tasks according to the N, the plurality of matrix dimension subintervals and the parameter space corresponding to each matrix dimension subinterval.
4. The method of claim 3, wherein constructing the N kernel function generation tasks according to the N, the plurality of matrix dimension subintervals, and the parameter space corresponding to each of the matrix dimension subintervals comprises:
obtaining N sets according to the matrix dimension subintervals, wherein the difference value of the number of the matrix dimension subintervals in each set is smaller than or equal to a second preset threshold value, and the number of the matrix dimension interval corresponding to the matrix dimension subinterval in each set is an odd number or an even number, or the number of the matrix dimension interval corresponding to the matrix dimension subinterval in each set is continuous;
and adding a parameter space corresponding to the matrix dimension subintervals in the set to each set of the N sets to obtain the N kernel function generation tasks.
5. The method of claim 1, wherein distributing the N kernel function generation tasks and kernel function execution files to the N target compute nodes comprises:
and distributing the N kernel function generation tasks and the kernel function running files to the N target computing nodes according to the IP addresses of the N target computing nodes.
6. The method according to any one of claims 1-5, further comprising:
acquiring computing node list information, wherein the computing node list information comprises a plurality of computing nodes and an IP address of each computing node;
sequentially sending a request for establishing communication connection to each computing node in the computing node list information;
and if the communication connection with the computing node is successfully established, determining that the computing node is a target computing node, and allocating an identifier for the target computing node.
7. The method according to any one of claims 1-5, further comprising:
and if the timing heartbeat information sent by the first target computing node is not received within a preset time period, sending the kernel function generating task and the kernel function operating file sent to the first target computing node to other target computing nodes.
8. A kernel function generation apparatus, comprising:
the acquisition module is used for acquiring the kernel function configuration information;
the splitting module is used for splitting M matrix dimension intervals according to the number N of target computing nodes to obtain a plurality of matrix dimension subintervals, wherein the kernel function configuration information comprises the M matrix dimension intervals and a parameter space corresponding to each matrix dimension interval, and M and N are positive integers greater than or equal to 1;
a building module, configured to build the N kernel function generation tasks according to the multiple matrix dimension subintervals and a parameter space corresponding to each matrix dimension interval, where each kernel function generation task includes at least one matrix dimension subinterval and a parameter space corresponding to each matrix dimension subinterval;
a sending module, configured to distribute the N kernel function generation tasks and the kernel function execution files to the N target computing nodes, where each target computing node generates a kernel function corresponding to at least one matrix dimension according to the kernel function generation tasks and the kernel function execution files;
and the processing module is used for merging the kernel functions corresponding to the at least one matrix dimension sent by each target computing node to obtain kernel function generation information, wherein the kernel function generation information comprises a plurality of matrix dimensions and the kernel function corresponding to each matrix dimension.
9. A kernel function generation apparatus, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the kernel function generation method of any one of claims 1-7 via execution of the executable instructions.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the kernel function generation method of any one of claims 1 to 7.
CN202110665158.0A 2021-06-16 2021-06-16 Kernel function generation method, device, equipment and storage medium Active CN113485798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110665158.0A CN113485798B (en) 2021-06-16 2021-06-16 Kernel function generation method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110665158.0A CN113485798B (en) 2021-06-16 2021-06-16 Kernel function generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113485798A true CN113485798A (en) 2021-10-08
CN113485798B CN113485798B (en) 2023-10-31

Family

ID=77934987

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110665158.0A Active CN113485798B (en) 2021-06-16 2021-06-16 Nuclear function generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113485798B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591776A (en) * 2024-01-18 2024-02-23 北京壁仞科技开发有限公司 Method, computing device, medium and program product for computing

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140139465A (en) * 2014-11-11 2014-12-05 주기홍 Method for converting program using pseudo code based comment and computer-readable recording media storing the program performing the said mehtod
CN104408206A (en) * 2014-12-23 2015-03-11 许昌学院 Distributed support vector clustering method and system
CN109409416A (en) * 2018-09-29 2019-03-01 上海联影智能医疗科技有限公司 Feature vector dimension reduction method and medical image recognition method, apparatus and storage medium
CN112328962A (en) * 2020-11-27 2021-02-05 深圳致星科技有限公司 Matrix operation optimization method, device and equipment and readable storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140139465A (en) * 2014-11-11 2014-12-05 주기홍 Method for converting program using pseudo code based comment and computer-readable recording media storing the program performing the said mehtod
CN104408206A (en) * 2014-12-23 2015-03-11 许昌学院 Distributed support vector clustering method and system
CN109409416A (en) * 2018-09-29 2019-03-01 上海联影智能医疗科技有限公司 Feature vector dimension reduction method and medical image recognition method, apparatus and storage medium
CN112328962A (en) * 2020-11-27 2021-02-05 深圳致星科技有限公司 Matrix operation optimization method, device and equipment and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Zhong; TIAN Xi: "Vectorization Method of Matrix Multiplication for Multi-core Vector Processors", Chinese Journal of Computers, no. 10 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117591776A (en) * 2024-01-18 2024-02-23 北京壁仞科技开发有限公司 Method, computing device, medium and program product for computing
CN117591776B (en) * 2024-01-18 2024-05-03 北京壁仞科技开发有限公司 Method, computing device, medium and program product for computing

Also Published As

Publication number Publication date
CN113485798B (en) 2023-10-31

Similar Documents

Publication Publication Date Title
US11477255B2 (en) Hybrid network system, communication method and network node
CN109218355B (en) Load balancing engine, client, distributed computing system and load balancing method
US8250164B2 (en) Query performance data on parallel computer system having compute nodes
CN108647104B (en) Request processing method, server and computer readable storage medium
US8706847B2 (en) Initiating a collective operation in a parallel computer
US20110258627A1 (en) Runtime Optimization Of An Application Executing On A Parallel Computer
CN110389826B (en) Method, apparatus and computer program product for processing a computing task
US9208052B2 (en) Algorithm selection for collective operations in a parallel computer
US8447954B2 (en) Parallel pipelined vector reduction in a data processing system
US10979317B2 (en) Service registration method and usage method, and related apparatus
US9471383B2 (en) Task allocation in a computing environment
US9495205B2 (en) Constructing a logical tree topology in a parallel computer
CN113364603B (en) Fault recovery method of ring network and physical node
WO2021036729A1 (en) Matrix computation method, computation device, and processor
CN113485798B (en) Nuclear function generation method, device, equipment and storage medium
CN111475250A (en) Network optimization method and device in cloud environment
US8769074B2 (en) Constructing a logical, regular axis topology from an irregular topology
CN104281636A (en) Concurrent distributed processing method for mass report data
CN111885158B (en) Cluster task processing method and device, electronic equipment and storage medium
CN113746763B (en) Data processing method, device and equipment
US20200412603A1 (en) Method and system for managing transmission of probe messages for detection of failure
US9372816B2 (en) Advanced programmable interrupt controller identifier (APIC ID) assignment for a multi-core processing unit
JP6036848B2 (en) Information processing system
CN113641493A (en) Task pushing method, device and equipment and computer readable storage medium
CN115061825A (en) Heterogeneous computing system and method for private computing, private data and federal learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20240428

Address after: Room 168, 1st Floor, Building 10, No. 1 Courtyard, Longxing Jiayuan, southeast corner of the intersection of Longhu Inner Ring Road and Longzhiyuan East Ninth Street, Zhengdong New District, Zhengzhou City, Henan Province, 450018

Patentee after: Shuguang Information Industry (Henan) Co.,Ltd.

Country or region after: China

Address before: 100193 No. 36 Building, No. 8 Hospital, Wangxi Road, Haidian District, Beijing

Patentee before: Dawning Information Industry (Beijing) Co.,Ltd.

Country or region before: China
