CN111078415A - Data processing method, device, server and computer readable storage medium

Info

Publication number: CN111078415A
Application number: CN201911319986.8A
Authority: CN (China)
Other languages: Chinese (zh)
Prior art keywords: model, graph, resource amount, GPUs, GPU
Inventor: 张吉
Current and original assignee: Beijing QIYI Century Science and Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201911319986.8A
Publication of CN111078415A
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00: Arrangements for program control, e.g. control units
    • G06F9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46: Multiprogramming arrangements
    • G06F9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005: Allocation of resources to service a request
    • G06F9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5061: Partitioning or combining of resources
    • G06F9/5066: Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The embodiment of the invention provides a data processing method, a data processing apparatus, a server and a computer readable storage medium. The method comprises: partitioning a graph computation model according to the available resource amount of each of multiple GPUs to be allocated, generating multiple model blocks and a first dependency relationship among them; allocating GPUs to the model blocks according to the first resource amount required by each model block and the available resource amount of each GPU, generating a correspondence between model blocks and GPUs; loading the model blocks onto their corresponding GPUs according to the correspondence; receiving a graph computation request for the graph computation model; and, in response to the request, processing it according to the first dependency relationship through the model blocks loaded on the GPUs, generating and outputting a graph computation result. The invention thus realizes inference computation over large graph computation models.

Description

Data processing method, device, server and computer readable storage medium
Technical Field
The present invention relates to the field of cloud computing technologies, and in particular, to a data processing method, an apparatus, a server, and a computer-readable storage medium.
Background
At present, artificial intelligence models are growing in size: an inference computation model (typically a graph computation model) requires a large amount of computational resources, while a single GPU (Graphics Processing Unit, commonly packaged as a graphics card) provides comparatively few.
When a GPU performs inference with a graph computation model, the entire model must first be loaded into video memory before inference can run. When the graph computation model is large, a single GPU device cannot finish loading it, and the graph computation therefore cannot be completed.
Therefore, a technical problem urgently needing to be solved by those skilled in the art is: how to perform inference computation on a large graph computation model.
Disclosure of Invention
The embodiment of the invention provides a data processing method, a data processing device, a server and a computer readable storage medium, to solve the problem in the related art that inference computation is difficult to complete on a large graph computation model.
In order to solve the above problem, according to an aspect of an embodiment of the present invention, the present invention discloses a data processing method, including:
partitioning a graph calculation model according to the available resource amount of each GPU in a plurality of GPUs to be distributed to generate a plurality of model blocks and a first dependency relationship among the model blocks;
respectively allocating GPUs to the plurality of model blocks according to the first resource amount required by each model block and the available resource amount of each GPU, and generating a corresponding relation between the model blocks and the GPUs, wherein the first resource amount required by the model block in the corresponding relation is less than or equal to the available resource amount of the GPU corresponding to the model block;
loading the plurality of model blocks to corresponding GPUs according to the corresponding relations;
receiving a graph computation request for the graph computation model;
and responding to the graph calculation request, processing the graph calculation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs, and generating and outputting a graph calculation result.
According to another aspect of the embodiments of the present invention, the present invention further discloses a data processing apparatus, including:
the partitioning module is used for partitioning the graph calculation model according to the available resource amount of each GPU in the multiple GPUs to be distributed to generate multiple model blocks and a first dependency relationship among the multiple model blocks;
the distribution module is used for distributing the GPUs to the model blocks according to the first resource quantity required by each model block and the available resource quantity of each GPU to generate a corresponding relation between the model blocks and the GPUs, wherein the first resource quantity required by the model blocks in the corresponding relation is less than or equal to the available resource quantity of the GPUs corresponding to the model blocks;
the loading module is used for respectively loading the plurality of model blocks to corresponding GPUs according to the corresponding relation;
a receiving module for receiving a graph computation request for the graph computation model;
and the processing module is used for responding to the graph calculation request, processing the graph calculation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs, and generating and outputting a graph calculation result.
According to another aspect of the embodiments of the present invention, the present invention also discloses a server, including: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus; a memory for storing a computer program; a processor for implementing the steps of the data processing method described in any one of the above when executing the program stored in the memory.
According to another aspect of the embodiments of the present invention, the present invention also discloses a computer-readable storage medium, which stores instructions that, when executed on a computer, cause the computer to execute the steps in the data processing method described in any one of the above.
According to another aspect of the embodiments of the present invention, the present invention also discloses a computer program product containing instructions, which when run on a computer, causes the computer to execute any of the data processing methods described above.
In the embodiment of the invention, the graph computation model is partitioned according to the available resource amount of each of the multiple GPUs to be allocated, generating multiple model blocks and a first dependency relationship among them. GPUs are then allocated to the model blocks according to the first resource amount required by each model block and the available resource amount of each GPU, generating a correspondence between model blocks and GPUs. A large graph computation model is thus divided into multiple model blocks that are loaded onto different GPUs. Because the resource amount required by the model block loaded on a GPU does not exceed that GPU's available resource amount, each GPU holds only a partial model, so multiple GPUs can jointly load a large graph computation model. After a graph computation request for the graph computation model is received, the model blocks loaded on the GPUs compute according to the first dependency relationship, so the generated graph computation result is identical to the result that would be produced by running the undivided graph computation model directly. Inference computation over large graph computation models is thereby achieved, breaking through the resource limitation of a single GPU.
Drawings
FIG. 1 is a block diagram of a data processing system according to an embodiment of the present invention;
FIG. 2 is a flow chart of the steps of one data processing method embodiment of the present invention;
FIG. 3 is a schematic diagram of a directed acyclic graph according to an embodiment of the present invention;
FIG. 4 is a block diagram of an embodiment of a data processing apparatus of the present invention;
fig. 5 is a block diagram of a server embodiment of the present invention.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Referring to FIG. 1, a block diagram of one embodiment of a data processing system of the present invention is shown; the system may be implemented in a server.
The system may include a graph partitioning module, a GPU computation module (specifically, each GPU in fig. 1, e.g., GPU0, GPU1, GPU2, GPU n), and a GPU communication module.
Because the system includes multiple GPUs, each loading a different model block of the same large graph computation model, a GPU communication module is required to implement data communication among the GPUs and thereby ensure the accuracy of the graph computation.
The GPU communication module is responsible for communication among different GPUs when multiple cards (i.e., multiple graphics cards, i.e., multiple GPUs) compute jointly. Two deployments are distinguished:
First: the multiple GPUs are deployed on the same server, i.e., in single-machine multi-card mode. The GPU communication module may use PCIE (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard), as shown in fig. 1, or NVLink (a bus and communication protocol developed by NVIDIA) to implement communication among the cards, completing the joint operation of multiple graphics cards.
Second: the multiple GPUs are deployed on different servers and must communicate over a network. The GPU communication module may use RDMA (Remote Direct Memory Access), as shown in fig. 1, to implement network communication among the cards, completing the joint operation of graphics cards on different servers.
The graph partitioning module partitions the received model file (i.e., the file of a graph computation model) using the resource description information of the GPUs, generating multiple model blocks and allocating a GPU to each model block.
the GPU calculation modules, here, 4 GPUs schematically shown in fig. 1, are respectively configured to load the allocated model blocks, and when the system receives a graph calculation request from the caller, the 4 GPUs perform joint operation using the respective loaded model blocks, and return an operation result to the caller.
With regard to the specific implementation steps of the modules in the system, reference may be made to the following description of embodiments of the data processing methods.
The data processing method according to the various embodiments of the present invention will be described in detail below based on the data processing system shown in fig. 1.
Referring to fig. 2, a flowchart illustrating steps of an embodiment of a data processing method according to the present invention is shown, where the method may be applied to a server, and the method may specifically include the following steps:
step 101, partitioning a graph computation model according to the available resource amount of each GPU in a plurality of GPUs to be distributed to generate a plurality of model blocks and a first dependency relationship among the model blocks;
in one example, as shown in FIG. 1, a graph partitioning module may receive a file of a graph computation model and GPU resource description information (here including an amount of available resources for each of a plurality of GPUs to be allocated).
The available resource amounts of the GPUs may be the same or different, and the maximum video memories supported by the GPUs may likewise be the same or different.
In addition, since the large graph computation model is limited by the video memory of one GPU when being loaded, the resource parameter corresponding to the available resource amount of any GPU may be video memory information. That is, the GPU resource description information describes the available video memory for each GPU. For convenience of understanding, the following embodiments are described by taking the available resource amount as an example of the available video memory.
In addition, since each GPU may be loaded with other models before loading the model block of the present invention, or a part of the video memory is occupied for other reasons, the available video memory of each GPU is less than or equal to the maximum video memory supported by the corresponding GPU.
Because the graph computation model is an inference model, and inference does not back-propagate gradients, the computation logic of layer N depends on the output of layer N-1, never the reverse; a preceding layer never needs data from a following layer during computation. The graph partitioning module can therefore partition the graph computation model with reference to the available video memory of each GPU, splitting the single model into a set of small models (i.e., multiple model blocks) such that the memory required by each model block does not exceed a GPU's available memory. The partitioning also yields the dependencies among the model blocks (e.g., the input of model block 4 depends on the output of model block 3, the input of model block 3 on the output of model block 2, and the input of model block 2 on the output of model block 1).
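As a concrete illustration of this splitting, the following Python sketch (our own minimal reading, not the patent's code; the Layer type and mem_bytes field are hypothetical, and it assumes no single layer exceeds the budget) greedily groups consecutive layers of an inference-only model into blocks that each fit a memory budget, yielding a simple chain dependency:

    from dataclasses import dataclass

    @dataclass
    class Layer:
        name: str
        mem_bytes: int  # estimated video memory needed by this layer

    def partition_layers(layers, budget_bytes):
        # layers are already in forward (inference) order
        blocks, current, used = [], [], 0
        for layer in layers:
            if current and used + layer.mem_bytes > budget_bytes:
                blocks.append(current)   # close the block before it overflows
                current, used = [], 0
            current.append(layer)
            used += layer.mem_bytes
        if current:
            blocks.append(current)
        # inference has no back-propagation, so block i+1 depends only on block i
        dependencies = [(i, i + 1) for i in range(len(blocks) - 1)]
        return blocks, dependencies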
The video memories required by different model blocks can be the same or different.
In one example, when the available video memories of multiple GPUs are the same, the video memory required by each model block does not exceed the available video memory of the GPU;
in another example, when the available video memories of the GPUs are not identical, the video memory required by each model block does not exceed the average value of the available video memories of the GPUs, or the video memory required by each model block does not exceed the minimum value of the available video memories of the GPUs.
102, respectively allocating GPUs to the plurality of model blocks according to a first resource amount required by each model block and an available resource amount of each GPU, and generating corresponding relations between the model blocks and the GPUs;
wherein, the first resource quantity needed by the model block in the corresponding relation is less than or equal to the available resource quantity of the GPU corresponding to the model block;
in the example of fig. 1, the graph partitioning module may obtain a video memory required by each model block, and when obtaining the video memory information, may determine the video memory required by each layer according to parameter description information of each layer of computation logic in the graph computation model, where each model block includes at least one layer of computation logic, so that the video memory required by each computation logic included in one model block may be accumulated to obtain the video memory required by one model block.
The graph partitioning module may allocate GPUs to the plurality of model blocks according to the required video memory for each model block and the available video memory for each GPU.
It should be noted that one model block is allocated to exactly one GPU, while different model blocks may be allocated to the same or to different GPUs; i.e., one GPU may load a single model block or several different model blocks, depending on the GPU's available video memory and the memory required by the blocks it is to load.
After the model blocks are allocated, a correspondence between the model blocks and the GPU may be generated, where the video memory required by any one model block is less than or equal to the available video memory of the GPU to which the model block is allocated (i.e., the GPU corresponding to the correspondence).
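One allocation satisfying this constraint is a simple first-fit pass, sketched below under our own assumptions (the patent does not prescribe first-fit; any assignment with required less than or equal to available is admissible):

    def assign_blocks(block_mems, gpu_free_mems):
        # returns {block_index: gpu_index}; one GPU may receive several blocks
        mapping, free = {}, list(gpu_free_mems)
        for b, need in enumerate(block_mems):
            for g, avail in enumerate(free):
                if need <= avail:
                    mapping[b] = g
                    free[g] -= need
                    break
            else:
                raise RuntimeError("block %d (%d bytes) fits on no GPU" % (b, need))
        return mapping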
103, respectively loading the plurality of model blocks to corresponding GPUs according to the corresponding relations;
in the example of fig. 1, the graph partitioning module may deploy each model block in the correspondence relationship to each corresponding GPU, respectively. Then, the GPU computing modules, i.e. 4 GPUs in fig. 1, can load the deployed model blocks respectively, so as to implement the loading of a large graph computing model by multiple GPUs, thereby completing the inference computation of the large model.
Step 104, receiving a graph computation request for the graph computation model;
in the example of fig. 1, the graph partitioning module may receive a graph computation request from a caller, such as a client, where the graph computation request may include source data to be computed and identification information of a graph computation model that computes the source data. The graph calculation model here is the graph calculation model partitioned in steps 101 to 103.
Step 105, responding to the graph computation request, processing the graph computation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs, generating a graph computation result, and outputting the graph computation result.
The graph partitioning module may, in response to the graph computation request, process the graph computation request according to the first dependency relationship by using, for example, 4 model blocks loaded by 4 GPUs in fig. 1, generate a graph computation result, and output the graph computation result.
For example, according to the first dependency relationship, model block 1 loaded on GPU0 computes the source data to generate intermediate result 1; GPU0 outputs intermediate result 1 to GPU1, and model block 2 loaded on GPU1 computes it to generate intermediate result 2; GPU1 outputs intermediate result 2 to GPU2, and model block 3 loaded on GPU2 computes it to generate intermediate result 3; GPU2 outputs intermediate result 3 to GPUn, and model block 4 loaded on GPUn computes it to generate the graph computation result.
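The chained execution above can be sketched as follows (illustrative only; PyTorch-style device placement is our assumption, as the patent is framework-agnostic):

    import torch

    @torch.no_grad()  # inference only: no gradients, as the patent notes
    def run_pipeline(blocks, devices, source):
        # blocks[i] is a module already loaded on devices[i], in dependency order
        x = source
        for block, device in zip(blocks, devices):
            x = x.to(device)   # ship the intermediate result to the next GPU
            x = block(x)
        return x.cpu()         # the graph computation result, back to the caller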
The graph computation results obtained by computing the graph computation request after the operation of the plurality of model blocks are the same as the graph computation results obtained by computing the graph computation request by the original graph computation model before the block division.
In addition, when the multiple GPUs compute jointly in response to the graph computation request, either the graph partitioning module controls the transmission of the intermediate results among the GPUs, or, after step 103, the graph partitioning module issues the first dependency relationship among the model blocks to the 4 GPUs so that each GPU controls the transmission of intermediate results to the other GPUs itself.
In the conventional technology, the computational resources required by a single graph computation model can exceed those provided by a single GPU; this limitation stems from the GPU's operating mechanism and its physical hardware, so a single GPU device cannot break through the video memory bottleneck.
To solve the above problem, in the embodiment of the present invention the graph computation model is partitioned according to the available resource amount of each of the multiple GPUs to be allocated, generating multiple model blocks and a first dependency relationship among them; GPUs are allocated to the model blocks according to the first resource amount required by each block and the available resource amount of each GPU, generating a correspondence between model blocks and GPUs. The large graph computation model is thereby divided into multiple model blocks loaded onto different GPUs. Since the resource amount required by the block loaded on a GPU is no larger than that GPU's available resources, a single GPU loads only a partial model, and the multiple GPUs jointly load the large graph computation model. After a graph computation request for the graph computation model is received, the blocks loaded on the GPUs compute according to the first dependency relationship, so the generated graph computation result is consistent with the result of computing with the undivided model; inference computation of large graph computation models is realized, and the resource limitation of a single GPU is broken through.
Alternatively, when step 101 is executed, it may be realized by S201 to S203:
s201, acquiring a second dependency relationship among a plurality of first nodes in the graph calculation model;
s202, acquiring a second resource amount required by the graph calculation model;
s203, according to the second resource amount, the number of the GPUs and the available resource amount of each GPU, partitioning the graph calculation model according to the second dependency relationship, and generating a plurality of model blocks and a first dependency relationship among the plurality of model blocks.
Specifically, in S201, each computation logic in the graph computation model may be regarded as one node, and a loop computation logic may be regarded as a single loop node (for example, 10 computation logics inside a for loop are treated as one node). Since the graph computation model has no back propagation, a topological sorting algorithm may be used to obtain the second dependency relationship among the first nodes of the graph computation model.
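Kahn's algorithm is one concrete reading of this topological sorting (an illustrative sketch; node ids and the edge format are ours, and folding each loop into a single node keeps the graph acyclic):

    from collections import deque

    def topo_order(nodes, edges):
        # edges: (u, v) pairs meaning v depends on u's output
        indegree = {n: 0 for n in nodes}
        successors = {n: [] for n in nodes}
        for u, v in edges:
            successors[u].append(v)
            indegree[v] += 1
        queue = deque(n for n in nodes if indegree[n] == 0)  # e.g. the input node
        order = []
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in successors[u]:
                indegree[v] -= 1
                if indegree[v] == 0:
                    queue.append(v)
        return order  # a visit order respecting the second dependency relationship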
In S202, the specific implementation method of this step may refer to the above, and is not described here again. The second resource amount is the accumulated sum of the resource amounts required by each computation logic estimated according to the parameter description information of each computation logic in the graph computation model.
In S203, the graph computation model may be partitioned sequentially according to the second dependency relationship among the first nodes, each model block including at least one first node. Each division groups several first nodes, in order, into one model block; how many nodes form a block depends on the second resource amount, the number of GPUs, and the available resource amount of each GPU.
Since there is a second dependency between the first nodes in the graph computation model, the first dependency between the generated model blocks follows the second dependency after the model is partitioned according to the second dependency.
The specific partitioning manner in S203 is not limited. Partitioning may be approximately even (i.e., different model blocks require substantially the same resource amount, smaller than the minimum available resource amount of any single GPU among the multiple GPUs), or flexible, i.e., the resource amounts required by different model blocks may be chosen flexibly with reference to each GPU's available resource amount.
In the embodiment of the present invention, since dependency relationships exist among the nodes of the graph computation model, the model is partitioned along those dependencies in order to guarantee that the computation results of the divided model blocks on a graph computation request are consistent with those of the original graph computation model. When deciding which nodes to group into one model block, the number of GPUs to be allocated and the available resource amount of each GPU are consulted, so the graph computation model is partitioned reasonably into multiple model blocks with a first dependency relationship among them.
Alternatively, when S203 is executed, it may be realized by S301 to S304:
s301, calculating a ratio of the second resource amount to the number of the GPUs to generate a third resource amount, wherein the third resource amount is less than or equal to the available resource amount of each GPU;
s302, converting a plurality of first nodes in the graph calculation model into a plurality of second nodes in a directed acyclic graph;
s303, according to the third resource amount and a fourth resource amount corresponding to each second node, segmenting the plurality of second nodes according to the second dependency relationship to generate a plurality of subgraphs and a third dependency relationship among the plurality of subgraphs, wherein a fifth resource amount corresponding to each subgraph is less than or equal to the third resource amount;
s304, segmenting the plurality of first nodes according to the conversion relation between the plurality of first nodes and the plurality of second nodes, and generating a plurality of model blocks corresponding to the plurality of sub-graphs and a first dependency relation among the plurality of model blocks, wherein the fifth resource amount corresponding to any sub-graph is a first resource amount required by the model block corresponding to the sub-graph.
Specifically, in S301, the amount of available resources of different GPUs may be the same or different.
The ratio of the total resource amount (i.e., the second resource amount) m required by the graph computation model to the GPU number n may be calculated to generate a third resource amount p, which is the resource amount that each GPU needs to provide on average. The third amount of resources p is less than or equal to the amount of available resources for any of the GPUs.
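A one-line rendering of this averaging, plus the feasibility check implied by the constraint above (a sketch under our reading):

    def third_resource_amount(total_mem, gpu_free_mems):
        p = total_mem / len(gpu_free_mems)        # p = m / n
        assert all(p <= free for free in gpu_free_mems), "p exceeds some GPU"
        return p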
In S302, the first nodes of the graph computation model may be abstracted into a directed acyclic graph: a first node containing a loop (corresponding to loop computation logic in the model) is abstracted into a single second node. Because graph inference requires the second dependency relationship among the first nodes, the first nodes are converted into second nodes of the directed acyclic graph, where each first node (a piece of computation logic) maps to exactly one second node (a node number); the two groups of nodes before and after conversion are in one-to-one correspondence.
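A minimal sketch of this conversion under our assumptions (first nodes are treated as opaque, hashable logic objects, with any loop body already folded into one object):

    def convert_to_dag(first_nodes, logic_edges):
        # number each first node 1..N; keep both directions of the 1:1 mapping
        number_of = {node: i + 1 for i, node in enumerate(first_nodes)}
        logic_of = {i: node for node, i in number_of.items()}
        dag_edges = [(number_of[u], number_of[v]) for u, v in logic_edges]
        return number_of, logic_of, dag_edges  # mapping reused in S304 to project back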
Any method in the conventional art can be adopted as the method for converting the graph computation model into the directed acyclic graph, and details are not repeated here.
In one example, fig. 3 shows a schematic diagram of a directed acyclic graph into which a graph computation model of an embodiment of the present invention is converted, the directed acyclic graph including 10 second nodes. The direction of the arrows between the 10 nodes follows a second dependency between the plurality of nodes of the graph computation model. It will be appreciated that the direction of the arrows indicate the direction of data transfer, or the order of computation of the data.
In S303, the fourth resource amount corresponding to a second node is the resource amount required by the first node from which it was converted (for example, the resource amount required by one computation logic); it is obtained in a manner similar to that described above and is not repeated here. In this step, a single node or several nodes of the directed acyclic graph are abstracted into one computation unit, i.e., one model block is represented in the directed acyclic graph as one computation unit.
Therefore, 10 nodes in the directed acyclic graph shown in fig. 3 need to be divided, and the division needs to be performed sequentially in the direction of the arrow (i.e., the second dependency relationship).
During division, the 10 nodes may be divided via topological sorting, observing that the in-degree of the input node (node 1) is zero and the out-degree of the output node (node 10) is zero (in-degree is the number of input nodes of a node; out-degree is the number of its output nodes). The resource amounts required by individual nodes (the fourth resource amounts of nodes 1 to 10 are m1, m2, m3, m4, m5, m6, m7, m8, m9, m10, and the total resource amount required by the graph computation model equals their accumulated sum) are accumulated during the topological sort; whenever the accumulated amount approaches the third resource amount p, the second nodes accumulated in that segment are grouped to generate one sub-graph. This continues until the topological sort finishes, generating multiple sub-graphs and a third dependency relationship among them. The third dependency relationship follows the second dependency relationship among the first nodes and is substantially the same as the first dependency relationship, except that the objects of the third dependency relationship are sub-graphs while those of the first are model blocks; a sub-graph is the directed-acyclic-graph representation of a model block.
The resource amount corresponding to a sub-graph is the amount accumulated before its division. For example, fig. 3 generates 4 sub-graphs: sub-graph 1, sub-graph 2, sub-graph 3 and sub-graph n, where q1 = m1 + m2 + m3 for sub-graph 1, q2 = m4 + m5 + m6 + m7 for sub-graph 2, q3 = m8 + m9 for sub-graph 3, and qn = m10 for sub-graph n. Since each division occurs when the accumulated resource amount approaches the third resource amount p, q1, q2, q3 and qn are all less than or equal to p. The dependency relationships among the 4 sub-graphs are: the input of sub-graph n depends on the output of sub-graph 3, the input of sub-graph 3 on the output of sub-graph 2, and the input of sub-graph 2 on the output of sub-graph 1.
Each subgraph is the above-mentioned each computing unit.
Wherein each subgraph comprises at least one second node.
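Under our reading, S303 amounts to the following greedy walk (an illustrative sketch; the cut criterion "close to p" is implemented here as "would exceed p", and the Fig. 3 outcome in the comment is one possibility, not a guarantee):

    def split_dag(order, node_mem, p):
        # order: node numbers in topological order
        # node_mem: node number -> fourth resource amount
        subgraphs, current, acc = [], [], 0
        for n in order:
            if current and acc + node_mem[n] > p:
                subgraphs.append(current)   # q_i = acc <= p by construction
                current, acc = [], 0
            current.append(n)
            acc += node_mem[n]
        if current:
            subgraphs.append(current)
        return subgraphs  # third dependency: each sub-graph feeds the next

    # with p = (m1 + ... + m10) / 4, one possible outcome matching Fig. 3 is
    # [[1, 2, 3], [4, 5, 6, 7], [8, 9], [10]], i.e. sub-graphs 1, 2, 3 and n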
In S304, the plurality of first nodes of the graph calculation model may be divided into a plurality of model blocks, and model block 1, model block 2, model block 3, and model block 4 may correspond to sub-graph 1, sub-graph 2, sub-graph 3, and sub-graph n, respectively.
In the embodiment of the invention, the total resource amount required by the graph computation model is divided evenly over the number of GPUs to obtain the third resource amount, and the partitioning of the graph computation model is realized by converting it into a directed acyclic graph. The resource amount required by each model block produced by a division is thus close to, and no larger than, the third resource amount, so the model blocks are partitioned reasonably while respecting the dependency relationships among the nodes of the graph computation model.
Then, when executing step 102 and allocating a GPU to any target model block among the multiple model blocks, the target model block may be allocated to a target GPU among the multiple GPUs whose available resource amount is greater than or equal to the fifth resource amount required by the target model block.
In the embodiment of the invention, when a GPU is allocated to a model block, the GPU's available resource amount and the block's required resource amount are both consulted, and the block is allocated to a GPU whose available resources exceed the block's requirement. The resources of the allocated GPU therefore suffice to load the block, every model block of the partitioned graph computation model can be loaded by its corresponding GPU without resource shortage, and the loading and use of the graph computation model are guaranteed.
Optionally, in one embodiment, to improve the loading efficiency of the graph computation model, step 103 may load each target model block onto the target GPU corresponding to it, with each GPU receiving only one model block; the multiple model blocks can then be loaded in parallel, improving the loading efficiency of the graph computation model.
Optionally, in another embodiment, to improve GPU utilization, step 103 may load any target model block onto its corresponding target GPU and, if the available resources remaining on the target GPU after loading are greater than or equal to the first resource amount required by that block, load the same target model block onto the target GPU again.
For example, after model block 1 is loaded onto GPU0, if GPU0 still has free resources and the remaining amount is greater than m1 (the resource amount required by model block 1), model block 1 may continue to be loaded onto GPU0 until GPU0's remaining resources fall below m1. A single GPU can thus load multiple copies of the same model block (e.g., two copies of model block 1), allowing GPU0 to serve two computation requests for the logic of model block 1 simultaneously and improving GPU resource utilization.
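A hedged sketch of this replica loading (load_fn is a hypothetical routine performing one load of the block onto the GPU):

    def load_with_replicas(gpu_free_bytes, block_bytes, load_fn):
        # keep loading copies while enough free memory remains; return replica count
        replicas = 0
        while gpu_free_bytes >= block_bytes:
            load_fn()                     # e.g. one more copy of model block 1 on GPU0
            gpu_free_bytes -= block_bytes
            replicas += 1
        return replicas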
In the embodiment of the present invention, after a GPU loads its allocated model block, if its remaining available resources exceed the block's required resources, the block may be loaded again, i.e., the same GPU holds at least two identical model blocks. Since one run of a model block completes one graph computation of that block's logic, loading at least two identical blocks improves both the GPU's resource utilization and the response efficiency for graph computation requests.
In addition, the GPU computation module of the embodiment uses Nvidia MPS (Multi-Process Service) or a pipeline (linear communication) model to serve the model blocks loaded on a GPU. When one GPU is loaded with several identical model blocks, the GPU computation module starts a corresponding number of services: for example, if GPU0 loads model block 1 twice, 2 services are started. The number of services equals the number of loaded model blocks, satisfying the requirements of MPS and pipelined computation and improving both GPU resource utilization and the computation efficiency of the small-model set.
Optionally, step 105 may, in response to the graph computation request, run the model blocks loaded on the GPUs sequentially according to the first dependency relationship, feeding the intermediate result computed by each model block to the next for computation, until the last model block computes on its input intermediate result, generating and outputting the graph computation result.
When the multiple GPUs compute jointly in response to the graph computation request, either the graph partitioning module controls the transmission of intermediate results among the GPUs, or, after step 103, it issues the first dependency relationship among the model blocks to the 4 GPUs so that each GPU controls that transmission itself.
In the embodiment of the invention, the GPUs run their loaded model blocks in the order given by the first dependency relationship among the blocks into which the graph computation model was divided, so the running order conforms to that dependency. The intermediate result of each model block is passed to the next block for computation until the last block computes on its input; the result the last block generates is the graph computation result. The result generated by the divided model blocks for a graph computation request is therefore the same as that generated by the complete, undivided graph computation model, guaranteeing that operation results after divided loading equal those before.
Optionally, after step 103, once the graph partitioning module has divided the directed acyclic graph, it may issue the numbers of the cut nodes of each model block to the corresponding GPU that loads that block.
For example, suppose the numbering of the first nodes of the graph computation model matches the node numbering of the directed acyclic graph shown in fig. 3, i.e., the graph computation model contains the 10 first nodes of fig. 3. Then, after step 103, at system initialization, the graph partitioning module may issue the numbers of node 2 and node 3 (the termination nodes of model block 1) to GPU1, which loads model block 2 (model block 2 depends on model block 1), together with the dependency that node 4 depends on nodes 2 and 3 and node 5 depends on node 3. Similarly, the numbers of the termination nodes of model block 2, node 6 and node 7, are issued to GPU2, together with the dependency that node 8 depends on nodes 6 and 7; and the number of the termination node of model block 3, node 9, is issued to GPUn, together with the dependency that node 10 depends on node 9.
That is, after partitioning the graph computation model, the graph partitioning module may record the numbers of the nodes cut during partitioning (first nodes), record the GPU number assigned to each model block, and issue the dependency relationships among the cut nodes (part of the second dependency relationship) to the GPUs loading them.
On receiving the graph computation request, the graph partitioning module may determine the graph computation model and return to the client the address of the GPU loading the model's first block (here, the address information of GPU0 shown in fig. 1). The client then sends the graph computation request (containing the source data to be computed) to GPU0 according to that address, and model block 1 computes on the source data, producing result 2 output by node 2 and result 3 output by node 3. GPU1 then obtains results 2 and 3 from GPU0 according to the issued cut-node dependencies, feeds both results 2 and 3 to node 4 of model block 2 and result 3 to node 5 of model block 2, and model block 2 computes on them to generate result 6 output by node 6 and result 7 output by node 7. GPU2 then obtains results 6 and 7 from GPU1 according to the issued dependencies, and model block 3 computes on them to generate result 9 output by node 9. Finally, GPUn obtains result 9 from GPU2 according to the issued dependencies, model block 4 computes on result 9 to generate the graph computation result, and GPUn outputs it to the client.
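The cut-node routing in this walkthrough can be sketched as follows (names are ours, not the patent's; fetch is a hypothetical call that pulls a node's result from the upstream GPU):

    def run_block(block_fn, consumed_nodes, fetch):
        # consumed_nodes: e.g. {4: [2, 3], 5: [3]} for model block 2 on GPU1
        inputs = {dst: [fetch(src) for src in srcs]
                  for dst, srcs in consumed_nodes.items()}
        return block_fn(inputs)  # outputs of this block's termination nodes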
In the embodiment of the invention, after the graph computation model is divided into multiple model blocks, the graph partitioning module can record the cut-node information between blocks and the dependencies among cut nodes and issue them to the GPUs loading the blocks. When the GPUs respond to a graph computation request, the blocks can thus operate jointly, the computed graph result equals the result computed with the undivided graph computation model, and the accuracy of data transmission and data computation across different GPUs is ensured.
When data communication is performed among different GPUs, PCIE or NVLink may be used for GPUs on the same physical machine; for GPUs on different physical machines that require network connection, RDMA may be used to complete mutual access between the machines and thereby effective communication among the multiple GPUs.
It should be noted that, for simplicity of description, the method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the illustrated order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments of the present invention. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred and that no particular act is required to implement the invention.
Corresponding to the above data processing method provided by the above embodiment of the present invention, referring to fig. 4, a block diagram of a data processing apparatus according to an embodiment of the present invention is shown, and specifically, the data processing apparatus may include the following modules:
a blocking module 41, configured to block the graph computation model according to an available resource amount of each GPU in the multiple GPUs to be allocated, and generate multiple model blocks and a first dependency relationship between the multiple model blocks;
an allocating module 42, configured to allocate GPUs to the plurality of model blocks according to the first resource amount required by each model block and the available resource amount of each GPU, and generate a corresponding relationship between the model blocks and the GPUs, where the first resource amount required by a model block in the corresponding relationship is less than or equal to the available resource amount of the GPU corresponding to the model block;
a loading module 43, configured to load the plurality of model blocks to corresponding GPUs according to the correspondence;
a receiving module 44, configured to receive a graph computation request for the graph computation model;
and a processing module 45, configured to, in response to the graph computation request, process the graph computation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs, and generate and output a graph computation result.
Optionally, the blocking module 41 includes:
the first obtaining submodule is used for obtaining a second dependency relationship among a plurality of first nodes in the graph calculation model;
the second obtaining submodule is used for obtaining a second resource amount required by the graph calculation model;
and the partitioning submodule is used for partitioning the graph calculation model according to the second resource amount, the number of the GPUs and the available resource amount of each GPU and the second dependency relationship to generate a plurality of model blocks and a first dependency relationship among the plurality of model blocks.
Optionally, the partitioning sub-module includes:
a calculating unit, configured to calculate a ratio of the second resource amount to the number of the GPUs, and generate a third resource amount, where the third resource amount is less than or equal to an available resource amount of each GPU;
a conversion unit, configured to convert a plurality of first nodes in the graph computation model into a plurality of second nodes in a directed acyclic graph;
a first dividing unit, configured to divide the plurality of second nodes according to the third resource amount and a fourth resource amount corresponding to each second node and according to the second dependency relationship, and generate a plurality of subgraphs and a third dependency relationship among the plurality of subgraphs, where a fifth resource amount corresponding to each subgraph is less than or equal to the third resource amount;
and the second dividing unit is used for dividing the plurality of first nodes according to the conversion relation between the plurality of first nodes and the plurality of second nodes to generate a plurality of model blocks corresponding to the plurality of sub-graphs and a first dependency relation between the plurality of model blocks, wherein the fifth resource amount corresponding to any one sub-graph is the first resource amount required by the model block corresponding to the sub-graph.
Optionally, the loading module 43 includes:
a first loading sub-module, configured to load, for any one target model block of the plurality of model blocks, the target model block to a target GPU corresponding to the target model block;
a second loading sub-module, configured to, if an available amount of resources remaining after the target GPU loads the target model block is greater than or equal to the first amount of resources required by the target model block, load the target model block to the target GPU again.
Optionally, the processing module 45 includes:
and the calculation submodule is used for responding to the graph calculation request, sequentially operating the plurality of model blocks loaded by the GPUs according to the first dependency relationship, inputting the intermediate result of calculation of the previous model block in the plurality of model blocks to the next model block for calculation until the last model block calculates the input intermediate result, generating a graph calculation result and outputting the graph calculation result.
In the embodiment of the invention, the graph computation model is partitioned according to the available resource amount of each of the multiple GPUs to be allocated, generating multiple model blocks and a first dependency relationship among them, and GPUs are allocated to the model blocks according to the first resource amount required by each block and the available resource amount of each GPU, generating a correspondence between model blocks and GPUs. A large graph computation model is thus divided into multiple model blocks loaded onto different GPUs; since the resource amount required by the block loaded on a GPU does not exceed that GPU's available resources, each GPU loads only a partial model, and the multiple GPUs jointly load the large graph computation model. After a graph computation request for the graph computation model is received, the model blocks loaded on the GPUs compute according to the first dependency relationship, so the generated result is consistent with the result of computing directly with the undivided graph computation model; inference computation of large graph computation models is realized, and the resource limitation of a single GPU is broken through.
For the device embodiment, since it is basically similar to the method embodiment, the description is simple, and for the relevant points, refer to the partial description of the method embodiment.
According to another embodiment of the present invention, the present invention further provides a server, as shown in fig. 5, including a processor 501, a communication interface 502, a memory 503 and a communication bus 504, wherein the processor 501, the communication interface 502 and the memory 503 are communicated with each other through the communication bus 504;
a memory 503 for storing a computer program;
the processor 501, when executing the program stored in the memory 503, implements the following steps:
partitioning the graph calculation model according to the available resource amount of each GPU in the multiple GPUs to be distributed to generate multiple model blocks and a first dependency relationship among the multiple model blocks;
respectively allocating the GPUs to the plurality of model blocks according to the first resource amount required by each model block and the available resource amount of each GPU, and generating the corresponding relation between the model blocks and the GPUs;
loading the plurality of model blocks to corresponding GPUs according to the corresponding relation;
receiving a graph computation request for a graph computation model;
and responding to the graph calculation request, processing the graph calculation request according to the first dependency relationship through a plurality of model blocks loaded on a plurality of GPUs, and generating and outputting a graph calculation result.
The communication bus 504 mentioned above can be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 504 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 502 is used for communication between the above-described server and other devices.
The memory 503 may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 503 may also be at least one storage device located remotely from the aforementioned processor.
The processor 501 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
According to still another embodiment of the present invention, there is also provided a computer-readable storage medium having stored therein instructions, which, when executed on a computer, cause the computer to perform the steps of the data processing method according to any one of the above-mentioned embodiments.
In a further embodiment of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data processing method of any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) connection. The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center integrating one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The embodiments in this specification are described in an interrelated manner; identical or similar parts among the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, its description is relatively brief; for relevant details, reference may be made to the corresponding parts of the description of the method embodiment.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. A data processing method, comprising:
partitioning a graph calculation model according to the available resource amount of each GPU in a plurality of GPUs to be distributed to generate a plurality of model blocks and a first dependency relationship among the model blocks;
respectively allocating GPUs to the plurality of model blocks according to the first resource amount required by each model block and the available resource amount of each GPU, and generating a corresponding relation between the model blocks and the GPUs, wherein the first resource amount required by the model block in the corresponding relation is less than or equal to the available resource amount of the GPU corresponding to the model block;
loading the plurality of model blocks to corresponding GPUs according to the corresponding relations;
receiving a graph computation request for the graph computation model;
and responding to the graph calculation request, processing the graph calculation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs, and generating and outputting a graph calculation result.
2. The method of claim 1, wherein the partitioning the graph computation model according to the amount of available resources for each of the plurality of GPUs to be allocated to generate a plurality of model blocks and a first dependency between the plurality of model blocks comprises:
obtaining a second dependency relationship among a plurality of first nodes in the graph computation model;
acquiring a second resource amount required by the graph calculation model;
and partitioning the graph calculation model according to the second resource amount, the number of the GPUs and the available resource amount of each GPU and the second dependency relationship to generate a plurality of model blocks and a first dependency relationship among the plurality of model blocks.
3. The method of claim 2, wherein the partitioning the graph computation model according to the second dependency to generate a plurality of model blocks and a first dependency between the plurality of model blocks based on the second amount of resources, the number of the plurality of GPUs, and an amount of available resources per GPU comprises:
calculating a ratio of the second resource amount to the number of the GPUs to generate a third resource amount, wherein the third resource amount is less than or equal to the available resource amount of each GPU;
converting a plurality of first nodes in the graph computation model into a plurality of second nodes in a directed acyclic graph;
according to the third resource amount and a fourth resource amount corresponding to each second node, segmenting the plurality of second nodes according to the second dependency relationship to generate a plurality of subgraphs and a third dependency relationship among the plurality of subgraphs, wherein a fifth resource amount corresponding to each subgraph is less than or equal to the third resource amount;
and segmenting the plurality of first nodes according to the conversion relations between the plurality of first nodes and the plurality of second nodes to generate a plurality of model blocks corresponding to the plurality of sub-graphs and a first dependency relation between the plurality of model blocks, wherein the fifth resource amount corresponding to any one sub-graph is a first resource amount required by the model block corresponding to the sub-graph.
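A possible reading of claim 3 is sketched below in Python: the third resource amount is the second (total) amount divided by the GPU count, and the second nodes are cut into subgraphs in topological order so that each subgraph's total (the fifth resource amount) stays within that budget. The simple accumulate-and-cut policy, and the assumption that no single node exceeds the budget, are illustrative choices, not the only partitioning the claim covers.

def partition_dag(node_costs, topo_order, num_gpus):
    """node_costs -- dict: second node -> its fourth resource amount
    topo_order -- second nodes in a topological order of the directed
                  acyclic graph, so cuts respect the second dependency
    num_gpus   -- the number of GPUs to be allocated
    Returns a list of subgraphs, each a list of nodes."""
    budget = sum(node_costs.values()) / num_gpus  # third resource amount
    subgraphs, current, used = [], [], 0.0
    for node in topo_order:
        cost = node_costs[node]  # assumed <= budget for every node
        if current and used + cost > budget:
            # Close the current subgraph so its fifth resource amount
            # remains less than or equal to the third resource amount.
            subgraphs.append(current)
            current, used = [], 0.0
        current.append(node)
        used += cost
    if current:
        subgraphs.append(current)
    return subgraphs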
4. The method according to claim 1, wherein the loading the plurality of model blocks to corresponding GPUs according to the correspondence respectively comprises:
for any one target model block in the plurality of model blocks, loading the target model block to a target GPU corresponding to the target model block;
and if the available resource amount left by the target GPU after the target model block is loaded is larger than or equal to the first resource amount required by the target model block, loading the target model block to the target GPU again.
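The loading rule of claim 4 can be illustrated with the short sketch below; the bookkeeping structures and the load callable are hypothetical stand-ins for whatever mechanism actually places a model block on a device.

def load_blocks(mapping, block_cost, gpu_free, load):
    """mapping    -- dict: model block -> index of its target GPU
    block_cost -- dict: model block -> first resource amount it requires
    gpu_free   -- mutable list of per-GPU available resource amounts
    load       -- callable(block, gpu_index) placing one copy of the block"""
    for block, gpu in mapping.items():
        load(block, gpu)
        gpu_free[gpu] -= block_cost[block]
        # Claim 4: if the leftover resources still cover the block, load a
        # second copy onto the same GPU, so one GPU serves two replicas.
        if gpu_free[gpu] >= block_cost[block]:
            load(block, gpu)
            gpu_free[gpu] -= block_cost[block]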
5. The method according to claim 1, wherein the processing, in response to the graph computation request, the graph computation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs to generate and output a graph computation result, comprises:
and responding to the graph calculation request, sequentially operating the plurality of model blocks loaded respectively through the plurality of GPUs according to the first dependency relationship, inputting the intermediate result of calculation of the previous model block in the plurality of model blocks to the next model block for calculation until the last model block calculates the input intermediate result, generating a graph calculation result and outputting the graph calculation result.
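Claim 5's chained execution amounts to a simple pipeline, sketched below; run_block is a hypothetical stand-in for executing one model block on the GPU that holds it.

def run_pipeline(blocks_in_order, run_block, request_input):
    """blocks_in_order -- model blocks sorted by the first dependency relationship
    run_block       -- callable(block, data) executing the block on its GPU
    request_input   -- the data carried by the graph computation request"""
    intermediate = request_input
    for block in blocks_in_order:
        # Each block consumes the previous block's intermediate result;
        # the last block's output is the graph computation result.
        intermediate = run_block(block, intermediate)
    return intermediate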
6. A data processing apparatus, comprising:
the partitioning module is used for partitioning the graph calculation model according to the available resource amount of each GPU in the multiple GPUs to be distributed to generate multiple model blocks and a first dependency relationship among the multiple model blocks;
the distribution module is used for distributing the GPUs to the model blocks according to the first resource quantity required by each model block and the available resource quantity of each GPU to generate a corresponding relation between the model blocks and the GPUs, wherein the first resource quantity required by the model blocks in the corresponding relation is less than or equal to the available resource quantity of the GPUs corresponding to the model blocks;
the loading module is used for respectively loading the plurality of model blocks to corresponding GPUs according to the corresponding relation;
a receiving module for receiving a graph computation request for the graph computation model;
and the processing module is used for responding to the graph calculation request, processing the graph calculation request according to the first dependency relationship through the plurality of model blocks loaded on the plurality of GPUs, and generating and outputting a graph calculation result.
7. The apparatus of claim 6, wherein the blocking module comprises:
the first obtaining submodule is used for obtaining a second dependency relationship among a plurality of first nodes in the graph calculation model;
the second obtaining submodule is used for obtaining a second resource amount required by the graph calculation model;
and the partitioning submodule is used for partitioning the graph calculation model according to the second resource amount, the number of the GPUs and the available resource amount of each GPU and the second dependency relationship to generate a plurality of model blocks and a first dependency relationship among the plurality of model blocks.
8. The apparatus of claim 7, wherein the partitioning submodule comprises:
a calculating unit, configured to calculate a ratio of the second resource amount to the number of the GPUs, and generate a third resource amount, where the third resource amount is less than or equal to an available resource amount of each GPU;
a conversion unit, configured to convert a plurality of first nodes in the graph computation model into a plurality of second nodes in a directed acyclic graph;
a first dividing unit, configured to divide the plurality of second nodes according to the third resource amount and a fourth resource amount corresponding to each second node and according to the second dependency relationship, and generate a plurality of subgraphs and a third dependency relationship among the plurality of subgraphs, where a fifth resource amount corresponding to each subgraph is less than or equal to the third resource amount;
and the second dividing unit is used for dividing the plurality of first nodes according to the conversion relation between the plurality of first nodes and the plurality of second nodes to generate a plurality of model blocks corresponding to the plurality of sub-graphs and a first dependency relation between the plurality of model blocks, wherein the fifth resource amount corresponding to any one sub-graph is the first resource amount required by the model block corresponding to the sub-graph.
9. The apparatus of claim 6, wherein the loading module comprises:
a first loading sub-module, configured to load, for any one target model block of the plurality of model blocks, the target model block to a target GPU corresponding to the target model block;
a second loading sub-module, configured to, if an available amount of resources remaining after the target GPU loads the target model block is greater than or equal to the first amount of resources required by the target model block, load the target model block to the target GPU again.
10. The apparatus of claim 6, wherein the processing module comprises:
and the calculation submodule is used for responding to the graph calculation request, sequentially operating the plurality of model blocks loaded by the GPUs according to the first dependency relationship, inputting the intermediate result of calculation of the previous model block in the plurality of model blocks to the next model block for calculation until the last model block calculates the input intermediate result, generating a graph calculation result and outputting the graph calculation result.
11. A server, comprising: the system comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory complete mutual communication through the communication bus;
a memory for storing a computer program;
a processor for implementing the steps of the data processing method as claimed in any one of claims 1 to 5 when executing a program stored on a memory.
12. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the steps of the data processing method of any one of claims 1 to 5.
CN201911319986.8A 2019-12-19 2019-12-19 Data processing method, device, server and computer readable storage medium Pending CN111078415A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911319986.8A CN111078415A (en) 2019-12-19 2019-12-19 Data processing method, device, server and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911319986.8A CN111078415A (en) 2019-12-19 2019-12-19 Data processing method, device, server and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN111078415A true CN111078415A (en) 2020-04-28

Family

ID=70315907

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911319986.8A Pending CN111078415A (en) 2019-12-19 2019-12-19 Data processing method, device, server and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111078415A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110321051A1 (en) * 2010-06-25 2011-12-29 Ebay Inc. Task scheduling based on dependencies and resources
US20160321776A1 (en) * 2014-06-20 2016-11-03 Tencent Technology (Shenzhen) Company Limited Model Parallel Processing Method and Apparatus Based on Multiple Graphic Processing Units
US20160352821A1 (en) * 2015-05-26 2016-12-01 Alibaba Group Holding Limited Method and system for allocating resources for virtual hosts
US20180373540A1 (en) * 2017-06-21 2018-12-27 International Business Machines Corporation Cluster graphical processing unit (gpu) resource sharing efficiency by directed acyclic graph (dag) generation
CN107329828A (en) * 2017-06-26 2017-11-07 华中科技大学 A kind of data flow programmed method and system towards CPU/GPU isomeric groups
CN109086137A (en) * 2018-08-06 2018-12-25 清华四川能源互联网研究院 GPU concurrent computation resource configuration method and device

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023143595A1 (en) * 2022-01-30 2023-08-03 阿里巴巴(中国)有限公司 Method and system for communication between processors, storage medium, and processor
CN115511086A (en) * 2022-11-03 2022-12-23 上海人工智能创新中心 Distributed reasoning deployment system for super large model
CN115511086B (en) * 2022-11-03 2024-05-24 上海人工智能创新中心 Distributed reasoning deployment system for oversized model

Similar Documents

Publication Publication Date Title
WO2022037337A1 (en) Distributed training method and apparatus for machine learning model, and computer device
US11018979B2 (en) System and method for network slicing for service-oriented networks
US9805140B2 (en) Striping of directed graphs and nodes with improved functionality
CN111949395B (en) Shared computing power data processing method, system and storage medium based on block chain
CN103347055B (en) Task processing system in cloud computing platform, Apparatus and method for
JP2022511716A (en) Decentralized deep learning
CN106959894B (en) Resource allocation method and device
CN109117252B (en) Method and system for task processing based on container and container cluster management system
CN109189572B (en) Resource estimation method and system, electronic equipment and storage medium
CN111932257B (en) Block chain parallelization processing method and device
CN112463375A (en) Data processing method and device
CN111078415A (en) Data processing method, device, server and computer readable storage medium
CN112269661B (en) Partition migration method and device based on Kafka cluster
US8028291B2 (en) Method and computer program product for job selection and resource allocation of a massively parallel processor
KR102326586B1 (en) Method and apparatus for processing large-scale distributed matrix product
CN110178119B (en) Method, device and storage system for processing service request
CN115617511A (en) Resource data processing method and device, electronic equipment and storage medium
CN113347249B (en) Operation loading method, device and equipment
CN111935026B (en) Data transmission method, device, processing equipment and medium
US20160342899A1 (en) Collaborative filtering in directed graph
CN117707761A (en) Task construction method and device, electronic equipment and storage medium
CN115407990A (en) Method and device for business logic flow development and business logic flow realization
CN114237902A (en) Service deployment method and device, electronic equipment and computer readable medium
CN111836274B (en) Service processing method and device
CN111611243B (en) Data processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination