CN112799852B - Multi-dimensional SBP distributed signature decision system and method for logic node - Google Patents

Multi-dimensional SBP distributed signature decision system and method for logic node

Info

Publication number
CN112799852B
CN112799852B (application CN202110386634.5A)
Authority
CN
China
Prior art keywords
distributed
logic
tensor
node
logical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110386634.5A
Other languages
Chinese (zh)
Other versions
CN112799852A (en)
Inventor
李新奇 (Li Xinqi)
柳俊丞 (Liu Juncheng)
李一鹏 (Li Yipeng)
袁进辉 (Yuan Jinhui)
Current Assignee
Beijing Oneflow Technology Co Ltd
Original Assignee
Beijing Oneflow Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Oneflow Technology Co Ltd
Priority to CN202110386634.5A
Publication of CN112799852A
Application granted
Publication of CN112799852B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources to service a request
    • G06F9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G06F9/5061 Partitioning or combining of resources
    • G06F9/5066 Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs

Abstract

The invention discloses a multidimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system, where an SBP distributed signature may be one-dimensional or multidimensional. The system comprises: an initial logical node generation component, which generates an initial logical node topology graph in which each node carries a set of candidate SBP distributed signatures; a first-dimension SBP distributed signature selection component, which, based on the computed transmission cost, selects the one or more candidate SBP distributed signatures whose first-dimension distributed descriptors correspond to the minimum cost value as the candidate SBP distributed signature subset of the current logical node; and a second-dimension SBP distributed signature selection component, which selects from that subset a candidate signature containing a second-dimension distributed descriptor of the first logical tensor at the first input and/or second-dimension distributed descriptors of the logical tensors at the other inputs as the determined SBP distributed signature of the current logical node.

Description

Multi-dimensional SBP distributed signature decision system and method for logic node
Technical Field
The present disclosure relates to data processing technology. More particularly, it relates to a multidimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system, and a corresponding method, which together enable automatic parallel deployment.
Background
With the spread of distributed computing, a large job or a large logical tensor can be divided so that different parts of the data are deployed to different computing devices of a distributed data processing system, with intermediate parameters exchanged between the parts during computation. Thus, while a job is being processed, an intermediate result computed on one device may be required as input by a computational task on another device, which incurs data-transfer overhead between devices. When the job data is large, this inter-device transmission overhead places a significant burden on the distributed data processing system. For this reason, the inventors of the present application filed, with the Chinese Patent Office on 13 February 2020, application No. 202010090335.2 entitled "Logic node distributed signature decision system and method" (granted as CN110955734B). That invention proposes an SBP signature decision system that globally minimizes the amount of data exchanged between computing devices during data processing in a static distributed data processing system, thereby reducing the overhead of data interaction and effectively mitigating the adverse effect of data exchange on actual operation. That patent is incorporated by reference into this specification as if fully set forth herein.
However, as models grow larger and the data to be processed grows as well, single-device processing becomes infeasible. One option is to enlarge the memory of the data processing device (for example, a GPU card), but a single 16 GB GPU card is usually more expensive than two 8 GB GPU cards, so increasing the memory resources of a single device is not cost-effective. In some scenarios the communication overhead of data parallelism is too great, or the model exceeds the GPU memory capacity; in such cases the model must be partitioned, with each device completing only the computation corresponding to its part of the model. This is called model parallelism. People typically accommodate a large model by running two or more small-memory GPU cards in model-parallel fashion, that is, meeting the data processing requirement through model parallelism. Model parallelism does not require synchronizing the model between devices, but it does require synchronizing the data between devices. At present, most deep learning frameworks either do not support model parallelism or support it only weakly; efficient execution requires very delicate adjustment and therefore repeated manual tuning, and even then the results are often unsatisfactory. Model parallelism is an industry-recognized challenge. Beyond its intrinsic complexity, combining model-parallel modules with other parallel schemes is also very complex, requiring careful management of data transfer (routing) between upstream and downstream. Moreover, in many cases the communication and synchronization overhead of model parallelism exceeds that of data parallelism, so its speedup ratio is lower.
Nevertheless, for a large model that cannot fit in the memory of a single machine, model parallelism is a good choice. On the other hand, when the data to be processed is also large, data parallelism is needed as well. However, because many deep learning frameworks cannot automatically perform hybrid parallelism, combining model parallelism and data parallelism at the same time, users still resort to pursuing large-capacity GPU cards, and even with such cards they usually choose a pure data-parallel or pure model-parallel mode in order to reduce manual effort.
When both the data and the model are large, hybrid parallelism is even harder to adopt. Take two adjacent neural network layers as an example: if the first layer uses data parallelism and the second uses model parallelism, then during forward computation the result of the data-parallel part must be copied (Copy) and concatenated (Concat) via two routes onto the two model-parallel devices, and if the two layers run on different machines, cross-machine communication is also required. If users had to manage these complex data routes manually, the task would be both too complex (imagine the various combinations of data and model parallelism) and highly error-prone. Ideally, such complexity should be handled by the deep learning platform itself, but unfortunately none of the existing open-source deep learning platforms supports this functionality.
Therefore, it is desirable to obtain a technical scheme that enables large-scale model and data processing on distributed computing resources built from small-capacity GPU cards, so that model parallelism can be realized, the same data processing effect as performing data parallelism simultaneously under model parallelism can be achieved, and parallel deployment can be carried out automatically.
Disclosure of Invention
Therefore, the SBP signature decision system previously proposed by the inventors of this application makes it possible to solve this technical problem. The present application provides a multidimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system, where an SBP distributed signature may be one-dimensional or multidimensional. The system comprises: an initial logical node generation component, which receives task configuration data input by a user and generates an initial logical node topology graph for the distributed data processing system, in which each source logical node has a designated SBP distributed signature and every initial logical node carries, based on the task configuration data, a candidate SBP distributed signature set, each signature in the set designating a distributed descriptor for each input logical tensor and for each output logical tensor of the node it belongs to; a first-dimension SBP distributed signature selection component, which, for each candidate SBP distributed signature of the current logical node, and given the output-end distributed descriptors of every upstream logical node whose SBP distributed signature has already been determined, computes the cost of the data transmission required to transform the distributed descriptor of each upstream node's output logical tensor into the first-dimension distributed descriptor of the logical tensor at the corresponding input of the current logical node, based on the number of devices over which each upstream node is distributed in parallel, the number of devices over which the current node is distributed in parallel, and the size of the logical tensor each upstream node distributes on each device, and then selects the one or more candidate SBP distributed signatures whose first-dimension distributed descriptors correspond to the minimum cost value as the candidate SBP distributed signature subset of the current logical node, the first-dimension distributed descriptors describing the parallel manner of the logical tensors at the corresponding inputs; and a second-dimension SBP distributed signature selection component, which compares the actual computation resources of each computing device in the device set over which the current logical node is to be distributed in parallel against the computation resources required to process the input logical tensors and the resulting logical tensor determined by the first-dimension distributed descriptors, and, when the required resources exceed the actual resources, selects from the candidate subset a candidate SBP distributed signature containing a second-dimension distributed descriptor of the first logical tensor at the first input and/or second-dimension distributed descriptors of the other logical tensors at the other inputs as the determined SBP distributed signature of the current logical node. In the determined signature, the second-dimension distributed descriptor of the logical tensor at the first input is a split logical tensor descriptor specifying that, on top of the distribution described by the first-dimension distributed descriptor, the first logical tensor is to be divided into a predetermined number of first sliced logical tensors, while the second-dimension distributed descriptors of the other logical tensors are broadcast logical tensor descriptors that include a repetition number specifying how many times the other logical tensors are to be repeatedly broadcast, the repetition number being equal to the predetermined number; and the computation resources required for the current logical node to process each first sliced logical tensor, the logical tensors at the other inputs, and the resulting sliced tensor obtained thereby are smaller than the actual computation resources of each computing device.
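The two-stage selection above can be sketched in a few lines of Python. This is an illustrative toy, not the patent's exact algorithm: the signature representation, cost table, and function names are all assumptions made for clarity.

```python
# Hypothetical sketch of the two-stage SBP signature decision. Stage 1
# keeps the candidates with minimal transfer cost; stage 2 determines how
# many second-dimension slices are needed to fit each device's memory.

def first_dim_select(candidates, upstream_out, cost_of):
    """Stage 1: keep the candidate signatures whose first-dimension input
    descriptors minimize the total cost of transforming each upstream
    output descriptor into the required input descriptor."""
    def total(sig):
        return sum(cost_of[(u, d)] for u, d in zip(upstream_out, sig["inputs"]))
    best = min(total(s) for s in candidates)
    return [s for s in candidates if total(s) == best]

def pick_slice_count(tensor_bytes, device_bytes):
    """Stage 2 helper: smallest number of second-dimension slices that lets
    each slice of the first input fit on one device; the other inputs are
    then broadcast the same number of times."""
    n = 1
    while tensor_bytes / n > device_bytes:
        n += 1
    return n

# Hypothetical cost table: e.g. keeping S(0) as S(0) is free, while
# turning S(0) into B requires gathering the full tensor on every device.
cost_of = {("S0", "S0"): 0, ("S0", "B"): 4, ("B", "S0"): 1, ("B", "B"): 0}
candidates = [{"name": "sig1", "inputs": ["S0", "B"]},
              {"name": "sig2", "inputs": ["B", "S0"]}]
subset = first_dim_select(candidates, upstream_out=["S0", "B"], cost_of=cost_of)
```

Here `sig1` requires no descriptor transformation at either input, so it alone survives stage 1; stage 2 then only triggers when the tensor does not fit on a single device.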
A multi-dimensional SBP distributed signature decision system for a logical node of a multi-level distributed data processing system according to the present disclosure, wherein a first logic tensor is a data logic tensor, and one of the other logic tensors is a model logic tensor.
A multi-dimensional SBP distributed signature decision system for a logical node of a multi-level distributed data processing system according to the present disclosure, wherein a first logic tensor is a model logic tensor and other logic tensors are data logic tensors.
The multidimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system according to the present disclosure, wherein the logical tensors of the inputs are all data logical tensors.
A multi-dimensional SBP distributed signature decision system for a logical node of a multi-level distributed data processing system according to the present disclosure, wherein a first logic tensor requires a greater amount of computational resources than one of the other logic tensors of the other inputs.
The multidimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system according to the present disclosure, wherein the distributed data processing system further comprises a computation graph generation component for generating a task logic computation graph based on the logical node topology graph formed by the logical nodes whose SBP distributed signatures have been determined, in which a split computation node is inserted before the first input of the computation node corresponding to the current logical node, repeat-broadcast computation nodes are inserted before the other inputs, and a merge computation node is inserted after the output.
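The boundary-node insertion described above can be illustrated with a toy graph transformation. The dict-of-input-lists graph representation and the node-naming scheme are assumptions for illustration only.

```python
# Toy sketch of inserting boundary nodes around a computation node whose
# signature has a second dimension: a split node before the first input,
# broadcast nodes before the other inputs, and a merge node after the
# output. `graph` maps each node name to the list of its input nodes.

def insert_boundary_nodes(graph, node):
    first, *others = graph[node]
    split = f"split->{node}"
    graph[split] = [first]                    # slice the first input
    new_inputs = [split]
    for i, other in enumerate(others):
        bcast = f"broadcast{i}->{node}"
        graph[bcast] = [other]                # replicate the other inputs
        new_inputs.append(bcast)
    graph[node] = new_inputs
    graph[f"merge<-{node}"] = [node]          # gather the sliced outputs
    return graph

# Node C consumes a data tensor from A (to be split) and a model tensor
# from B (to be broadcast):
g = insert_boundary_nodes({"A": [], "B": [], "C": ["A", "B"]}, "C")
```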
According to another aspect of the present disclosure, there is provided a multidimensional SBP distributed signature decision method for logical nodes of a multi-level distributed data processing system, where an SBP distributed signature may be one-dimensional or multidimensional. The method comprises: an initial logical node generation step of receiving task configuration data input by a user and generating an initial logical node topology graph for the distributed data processing system, in which each source logical node has a designated SBP distributed signature and every initial logical node carries, based on the task configuration data, a candidate SBP distributed signature set, each signature in the set designating a distributed descriptor for each input logical tensor and for each output logical tensor of the node it belongs to; a first-dimension SBP distributed signature selection step of, for each candidate SBP distributed signature of the current logical node, and given the output-end distributed descriptors of every upstream logical node whose SBP distributed signature has already been determined, computing the cost of the data transmission required to transform the distributed descriptor of each upstream node's output logical tensor into the first-dimension distributed descriptor of the logical tensor at the corresponding input of the current logical node, based on the number of devices over which each upstream node is distributed in parallel, the number of devices over which the current node is distributed in parallel, and the size of the logical tensor each upstream node distributes on each device, and then selecting the one or more candidate SBP distributed signatures whose first-dimension distributed descriptors correspond to the minimum cost value as the candidate SBP distributed signature subset of the current logical node, the first-dimension distributed descriptors describing the parallel manner of the logical tensors at the corresponding inputs; and a second-dimension SBP distributed signature selection step of comparing the actual computation resources of each computing device in the device set over which the current logical node is to be distributed in parallel against the computation resources required to process the input logical tensors and the resulting logical tensor determined by the first-dimension distributed descriptors, and, when the required resources exceed the actual resources, selecting from the candidate subset a candidate SBP distributed signature containing a second-dimension distributed descriptor of the first logical tensor at the first input and/or second-dimension distributed descriptors of the other logical tensors at the other inputs as the determined SBP distributed signature of the current logical node. In the determined signature, the second-dimension distributed descriptor of the logical tensor at the first input is a split logical tensor descriptor specifying that, on top of the distribution described by the first-dimension distributed descriptor, the first logical tensor is to be divided into a predetermined number of first sliced logical tensors, while the second-dimension distributed descriptors of the other logical tensors are broadcast logical tensor descriptors that include a repetition number specifying how many times the other logical tensors are to be repeatedly broadcast, the repetition number being equal to the predetermined number; and the computation resources required for the current logical node to process each first sliced logical tensor, the logical tensors at the other inputs, and the resulting sliced tensor obtained thereby are smaller than the actual computation resources of each computing device.
The multi-dimensional SBP distributed signature decision method for a logic node of a multi-level distributed data processing system according to the present disclosure, wherein the first logic tensor is a data logic tensor, and one of the other logic tensors is a model logic tensor.
The multi-dimensional SBP distributed signature decision method for a logic node of a multi-level distributed data processing system according to the present disclosure, wherein a first logic tensor is a model logic tensor and other logic tensors are data logic tensors.
The multidimensional SBP distributed signature decision method for logical nodes of a multi-level distributed data processing system according to the present disclosure, wherein the logical tensors at the inputs are all data logical tensors.
The multi-dimensional SBP distributed signature decision method for a logical node of a multi-level distributed data processing system according to the present disclosure, wherein the amount of computational resources required for the first logical tensor is greater than the amount of computational resources required for one of the other logical tensors of the other inputs.
The multidimensional SBP distributed signature decision method for logical nodes of a multi-level distributed data processing system according to the present disclosure, wherein the distributed data processing system further includes a computation graph generation component for generating a task logic computation graph based on the logical node topology graph formed by the logical nodes whose SBP distributed signatures have been determined, in which a split computation node is inserted before the first input of the computation node corresponding to the current logical node, repeat-broadcast computation nodes are inserted before the other inputs, and a merge computation node is inserted after the output.
With the multidimensional SBP distributed signature decision system and method for logical nodes of a multi-level distributed data processing system, the amount of data the static distributed data processing system exchanges between different computing devices during data processing is minimized from a global perspective, reducing the overhead generated by data interaction and effectively mitigating the adverse effect of data exchange on actual operation. In addition, the per-card computation resources required of each computing device can be reduced even under large-scale model and large-scale data processing requirements, lowering the required hardware cost; and parallel deployment can be carried out automatically, in particular achieving automatically the same data processing effect as hybrid parallel deployment that would otherwise require manual intervention.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention.
Drawings
FIG. 1 illustrates a first partial schematic diagram of a logical node distributed signature decision system 100 for a distributed data processing system according to the present disclosure.
Illustrated in FIG. 2 is a second partial schematic diagram of a logical node distributed signature decision system 100 for a static distributed data processing system according to the present disclosure.
FIG. 3 is a schematic block diagram illustrating the logical node distributed signature decision system 100 deciding SBP signatures for a distributed data processing system according to the present disclosure.
Fig. 4 illustrates a first schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between the logic tensors of different distributed descriptors.
Fig. 5 illustrates a second schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between the logic tensors of different distributed descriptors.
Fig. 6 illustrates a third schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between the logic tensors of different distributed descriptors.
Fig. 7 illustrates a fourth schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between the logic tensors of different distributed descriptors.
Fig. 8 illustrates a fifth schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between the logic tensors of different distributed descriptors.
Fig. 9 illustrates a sixth schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between the logic tensors of different distributed descriptors.
Illustrated in FIG. 10 is one example of the transformation of a logical node topology graph into a computation graph by a static distributed data processing system using the logical node distributed signature decision system 100 of the present disclosure.
Detailed Description
The present invention will be described in further detail with reference to the following examples and the accompanying drawings so that those skilled in the art can practice the invention with reference to the description.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, without departing from the scope of the present disclosure, one of two possible logical distributed signatures may be referred to hereinafter as the first logical distributed signature or as the second logical distributed signature, and similarly the other may be referred to as the second or as the first. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
For a better understanding of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
Deep learning is essentially a type of feature learning; from this perspective, deep learning can be applied directly to extracting features from raw data, and the autoencoder is one of the important models for realizing feature extraction.
FIG. 1 illustrates a first partial schematic diagram of a logical node distributed signature decision system 100 for a static distributed data processing system according to the present disclosure. As shown in FIG. 1, the distributed signature decision system 100 includes an initial logical node generation component 110 and a first-dimension SBP distributed signature selection component 120. The initial logical node generation component 110 receives user-entered task configuration data and generates an initial logical node topology graph 101 for the static distributed data processing system. Once a job is input, the static distributed data processing system can automatically decompose it, based on the job description provided by the user, into many small job tasks composed of operational components; these components, acting as logical nodes, are connected one after another to form a preliminary logical tensor-processing neural network topology graph. Each such neural network includes many logical nodes, and two adjacent neural networks are connected to each other to guide the placement (PLACEMENT) of the executors that perform the actual job processing in the distributed data processing system. FIG. 1 only schematically shows a simple initial logical node topology graph 101 with nodes A, B, C, D, E, F, L and K; nodes not shown are indicated by ellipses. In actual data processing, the initial node topology graph 101 would be more complex. The initial logical node topology graph 101 contains the basic compute nodes that implement the computational tasks described by the user. The manner of generating the initial logical node topology graph 101 is conventional in the art and is therefore not described here.
Each initial logical node in the initial logical node topology graph 101 carries multiple SBP signatures. Initial logical nodes that are source logical nodes, whose SBP signatures were configured by the user or determined from the user's task description, such as initial logical nodes A, C, and E, have a single unique SBP signature: SBP-1 for initial logical node A, SBP-2 for initial logical node C, and SBP-3 for initial logical node E. The other initial logical nodes carry several inherent candidate SBP signatures; for example, initial logical node B in FIG. 1 has a plurality of candidate SBP signatures, e.g., three: SBP-1, SBP-2, and SBP-3. The remaining initial logical nodes likewise each have their own candidate SBP signatures, not listed here. Different initial logical nodes have different fixed candidate SBP signatures depending on the specific operations they perform.
An SBP signature according to the present disclosure is a signature used in a distributed data processing system. Because data parallelism, model parallelism, hybrid parallelism, pipeline (streaming) parallelism, and the like frequently coexist in such a system, the tasks of adjacent logical nodes are often deployed on different computing devices at the same time, so intermediate parameters are exchanged among devices during actual data processing, producing substantial transmission overhead. To reduce this data transmission overhead, further logical nodes need to be generated on the basis of the initial logical node topology graph 101 to complete the topology; in particular, to reduce the transmission overhead between upstream and downstream logical nodes, the change caused by the difference in data distribution between them must be minimized. To this end, the present disclosure assigns each logical node a logical distributed signature so that better downstream logical nodes can be obtained. The logical distributed signature signs a logical node with the distributed descriptors of its logical tensors; each descriptor describes how a logical tensor is distributed across the whole computing system, and the main kinds are the SPLIT logical tensor descriptor, the BROADCAST logical tensor descriptor, and the PARTIAL VALUE logical tensor descriptor.
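The three descriptor kinds and the notion of a signature mapping each input and output to a descriptor can be captured in a few lines. This is a minimal illustrative sketch; the descriptor names follow the disclosure, but the dataclass layout and the example signature are assumptions.

```python
# Minimal model of the three one-dimensional SBP descriptors: S(axis)
# slices a tensor along an axis, B replicates it on every device, and P
# marks per-device partial results that must still be reduced.
from dataclasses import dataclass

@dataclass(frozen=True)
class Split:         # S(axis): tensor sliced along the given axis
    axis: int

@dataclass(frozen=True)
class Broadcast:     # B: every device holds a full copy
    pass

@dataclass(frozen=True)
class PartialValue:  # P: every device holds a partial value to be reduced
    pass

# An SBP signature maps each input and output of a logical node to a
# descriptor. A hypothetical matmul node with data parallelism on the
# data input and a replicated model input might be signed as:
matmul_signature = {"in0": Split(0), "in1": Broadcast(), "out": Split(0)}
```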
Specifically, the SPLIT logical tensor descriptor describes a way of splitting a logical tensor: according to the user's description, a tensor is split along a specified dimension and the slices are distributed to different computing devices to perform the specified computation. If a two-dimensional tensor is cut along the 0th dimension, the distributed descriptor of the resulting logical tensor is S(0), and each input end that receives it obtains a logical tensor with distributed descriptor S(0). Similarly, if a two-dimensional tensor is cut along the 1st dimension, its distributed descriptor is S(1), and each input end that receives it obtains an S(1) logical tensor. If the task data to be processed has more dimensions, there are correspondingly more distributed descriptors, e.g., S(2), S(3), and so on. The data mentioned here may be the data to be processed or the model: slicing the data itself yields data parallelism on the distributed data processing system, while splitting the model yields model parallelism. If the input of a logical node carries a SPLIT logical tensor descriptor, then during actual data processing, when a logical tensor of data size T is distributed to four computing cards for parallel computation, each card receives one quarter of the data, and the four cards together hold T.
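As an illustration of the SPLIT descriptor above, the following sketch (hypothetical Python, not part of the disclosure; the function name is invented) computes the per-card slice shapes for S(0) and S(1) splits of a two-dimensional tensor:

```python
# Illustrative sketch only: how SPLIT descriptors S(0) and S(1) carve a
# 2-D logical tensor across computing cards.

def split_shard_shapes(shape, axis, num_cards):
    """Per-card slice shapes when a tensor is SPLIT on `axis`.

    Assumes the split dimension divides evenly, as in the text's
    T/4-per-card example."""
    rows, cols = shape
    if axis == 0:          # S(0): cut along the 0th dimension
        return [(rows // num_cards, cols)] * num_cards
    else:                  # S(1): cut along the 1st dimension
        return [(rows, cols // num_cards)] * num_cards

shapes = split_shard_shapes((8, 6), axis=0, num_cards=4)
# Each card holds one quarter of the data; the four cards together hold T.
per_card = shapes[0][0] * shapes[0][1]
total = per_card * 4    # equals the full tensor size T = 8 * 6
```

With four cards and an S(0) split of an 8x6 tensor, each card holds a 2x6 shard, one quarter of the data, matching the T/4-per-card description above.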
If a tensor is first cut along the 0th dimension and each resulting slice is then cut again along the 1st dimension, its distributed descriptor is the two-dimensional distributed descriptor (S(0), S(1)). If a tensor is first cut along the 0th dimension and each resulting slice is cut again along the 0th dimension, its distributed descriptor is the two-dimensional distributed descriptor (S(0), S(0)). And so on; distributed descriptors may also be three-dimensional or higher.
The BROADCAST logical tensor descriptor describes a logical tensor that is published in broadcast fashion in the distributed system. In data-parallel-only processing, model data is typically broadcast to each computing device, so broadcast logical tensor descriptors are employed to describe the broadcast data input to the logical nodes; in actual data processing, the broadcast tensor has the same size on every computing card. If a tensor is first broadcast and each broadcast copy is then cut along the 0th dimension, its distributed descriptor is the two-dimensional descriptor (B, S(0)). Similarly, if a tensor is first cut along the 0th dimension and each slice is then broadcast, its descriptor is (S(0), B). And so on.
The PARTIAL VALUE logical tensor descriptor indicates that an input or output logical tensor of a logical node is the partial value of a plurality of homogeneous logical tensors. Such partial values include partial sums, partial products, partial AND results, partial maximums, and partial minimums. Because the data is usually processed in parallel, the processing on each device covers only part of the data; for example, when some input logical tensors are S(0) or S(1), the result obtained on each computing device is only a partial logical tensor, and the final output is obtained by combining these homogeneous partial tensors across all devices.
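The partial-sum case can be made concrete with a small, purely illustrative Python example (not the disclosure's code): cutting a matrix product's first factor as S(1) and its second as S(0) leaves each device holding a homogeneous partial tensor, and element-wise addition recovers the full result.

```python
# Illustrative sketch of a PARTIAL VALUE (partial-sum) tensor.

def matmul(a, b):
    """Plain Python matrix multiplication."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

A = [[1, 2], [3, 4]]          # 2x2 input, cut as S(1) (by columns)
W = [[5, 6], [7, 8]]          # 2x2 input, cut as S(0) (by rows)

# Device 0 holds column 0 of A and row 0 of W; device 1 holds the rest.
partial0 = matmul([[1], [3]], [[5, 6]])   # partial sum on device 0
partial1 = matmul([[2], [4]], [[7, 8]])   # partial sum on device 1

# Element-wise addition of the homogeneous partial tensors gives A @ W.
combined = [[partial0[i][j] + partial1[i][j] for j in range(2)]
            for i in range(2)]
```

Each device's `partial*` tensor has the full output shape but only a partial value; this is exactly the situation the P descriptor denotes.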
The distributed descriptors of the various logical tensors represent how those tensors are distributed in the distributed computing system, and the respective descriptors of the tensors serving as a logical node's inputs and outputs likewise describe how that node's operation data is distributed. For convenience of description, this disclosure refers to such a distributed descriptor simply as an "SBP descriptor".
For this reason, as the initial logical node topology 101 is generated, the initial logical nodes, that is, the operation nodes, also carry distributed descriptors for their respective inputs and outputs, and these input and output descriptors together form the node's signature, that is, a signature of the operation logical node expressed with the distributed descriptors of its logical tensors. For convenience of expression, the English initials of the three descriptor types are used to refer to this signature as an "SBP signature" for short.
Depending on the user's description of the computing task and on the data-parallelism requirements of each distributed computing system, such descriptors comprise at least the three types S, B, and P. Each additional way of partitioning the data or the model adds a descriptor: if a tensor is divided along two different dimensions, sequentially or simultaneously, or distributed in two distribution modes, or divided twice along the same dimension, its distributed descriptor is two-dimensional as described above, and by analogy, distributed descriptors of three or more dimensions may be used. For each logical node, its signature contains various combinations of these descriptors. Thus, in a distribution system according to the present disclosure, there are at least three one-dimensional distributed descriptors, and typically four, e.g., the four SBP descriptors S(0), S(1), P, and B; depending on the number of logical tensor dimensions there may be more. With four SBP descriptor types, various SBP signatures can be formed by permutation and combination of inputs and outputs. Some one-dimensional SBP signatures, for example: (S(0), B) → S(0), (S(1), B) → S(1), P → P, B → B, (S(0), S(1)) → P, S(0) → P, S(0) → S(0), S(0) → S(1), P → B.
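One possible way to encode SBP descriptors and signatures as data structures is sketched below; the class and field names are invented for illustration and are not the disclosure's or OneFlow's API.

```python
# Hypothetical encoding of SBP descriptors and signatures.
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Sbp:
    kind: str            # "S", "B", or "P"
    axis: int = -1       # split axis, meaningful only when kind == "S"

    def __str__(self):
        return f"S({self.axis})" if self.kind == "S" else self.kind

@dataclass(frozen=True)
class SbpSignature:
    inputs: Tuple[Sbp, ...]    # one descriptor per input logical tensor
    output: Sbp                # descriptor of the output logical tensor

# The one-dimensional signature (S(0), B) -> S(0) from the text:
sig = SbpSignature(inputs=(Sbp("S", 0), Sbp("B")), output=Sbp("S", 0))
```

A two-dimensional descriptor such as (S(0), B) could then simply be a tuple of two `Sbp` values, mirroring the nesting described above.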
A two-dimensional SBP signature is composed of two-dimensional distributed descriptors built from one-dimensional ones, such as (S(0), S(0)), (S(1), S(1)), (S(0), B), (S(1), B), (B, B), (P, S(0)), and so on; a two-dimensional SBP signature is, for example, [(S(0), S(0)) (B, B) → (S(0), S(0))], [(S(1), S(1)) (B, B) → (S(1), S(1))], [(S(0), B) (S(1), S(1)) → (P, S(1))], [(S(0), B) (B, S(1)) → (S(0), S(1))], and the like. SBP signatures may also have more dimensions, e.g., three or four or more, as the case requires. All SBP signatures result from combinations of the SBP descriptors. For a matrix multiplication logical node, if its input logical tensor is cut along the first dimension, its output logical tensor is also cut along the first dimension. In summary, S, B, and P are descriptors of how tensors are distributed in a data processing system, and an SBP signature describes the task operation of a logical node with multiple SBP descriptors. Each tensor may have multiple SBP descriptors, and the operation represented by each logical node may admit several SBP signature situations. For example, SBP-1 shown in FIG. 1 may be the signature form (S(0), B) → S(0), and SBP-2 may be (S(1), B) → S(1). In practical applications, different signature forms may carry different numbers; the numbers given here are only for descriptive convenience, do not mean that every signature must be numbered, and the different forms of signatures can be distinguished from each other without any numbering at all. For example, SBP-1 may also be a two-dimensional SBP signature, such as [(S(0), B) (B, S(1)) → (S(0), S(1))].
Each initial logical node may be given SBP signatures as described above based on the user's task description. Common task logical nodes are arithmetic operation nodes that perform a specific arithmetic operation and therefore have specific candidate SBP signatures. It should be noted that the candidate SBP signatures of different task logical nodes are not the same: the input logical tensors of the SBP signature of a task logical node that performs multiplication cannot include a partial-value tensor, so its input descriptors do not include the distributed descriptor P, whereas the candidate SBP signatures of a task logical node performing addition may include any combination of the various SBP descriptors with each other or with themselves. For example, a task logical node performing matrix multiplication, in the data-parallel-only case, usually has candidate signatures such as (S(0), B) → S(0), (S(1), B) → S(1), (S(0), S(1)) → P, and the like, but not only these; as technology develops, signatures previously unsuited to matrix multiplication may also become applicable, and these are merely examples. Take the two-dimensional SBP signature [(S(0), B) (B, S(1)) → (S(0), S(1))]: a logical node with this signature has two input ends whose tensor descriptors are (S(0), B) and (B, S(1)), and an output end whose tensor descriptor, (S(0), S(1)), is likewise two-dimensional.
The descriptor (S(0), B) of the first tensor indicates that the first tensor is first cut along its 0th dimension (here referring to a dimension of the tensor itself, i.e., the S(0) of the first descriptor dimension) into a plurality of first sliced tensors, and the slices are then broadcast spatially or output continuously in time (the B of the second descriptor dimension). The descriptor (B, S(1)) of the second tensor indicates that the second tensor is first broadcast spatially and then cut along its 1st dimension (again a dimension of the tensor itself, i.e., the S(1) of the second descriptor dimension) into a plurality of second sliced tensors. Finally, the distributed descriptor of the result tensor produced by the logical node from the first and second tensors is (S(0), S(1)). Each initial logical node is thus accompanied, based on the task configuration data, by a set of candidate logical distributed signatures, each of which specifies the distributed descriptor of every input logical tensor and every output logical tensor of the initial logical node to which it belongs.
Although the initial logical node generation component 110 generates the initial logical node topology 101, it remains to be determined which SBP signature each logical node in the topology will use, that is, which distributed logical tensor each node will output and which distributed logical tensors it will take as input.
Thus, the first dimension SBP distributed signature selection component 120 of the logical node distributed signature decision system 100 according to the present disclosure starts from the source logical nodes of the initial logical node topology 101. When the SBP signatures of all upstream logical nodes (e.g., logical nodes A and E) of the current logical node (e.g., logical node B) have been determined, the transmission data amount estimation unit 121 computes, for each candidate logical distributed signature of logical node B and based on the distributed descriptors of the upstream nodes' output ends corresponding to B's input ends, the cost of the data transfer required to transform the distributed descriptor of each upstream node's output logical tensor into the distributed descriptor of the corresponding input logical tensor of that candidate signature. As shown in FIG. 1, logical node B has many candidate SBP signatures, such as SBP-1, SBP-2, and SBP-3. For example, SBP-1 may take the form (S(1), B) → S(1) or (S(1), P) → S(1), or a two-dimensional form such as [(S(0), B) (B, S(1)) → (S(0), S(1))], [(S(0), S(0)) (B, B) → (S(0), S(0))], or [(S(0), B) (S(1), S(1)) → (P, S(1))]; the signature SBP-5 of the initial logical node A may, for example, be of the form (S(0), B) → S(0), and the signature SBP-3 of the initial logical node E may, for example, be B → B or S(0) → P. In each signature form, the left side of the arrow gives the distributed descriptors of the input logical tensors and the right side those of the output logical tensors.
For convenience of description, a logical tensor whose distributed descriptor is S(0) is hereinafter referred to simply as an "S(0) logical tensor", one whose descriptor is B as a "B logical tensor", and one whose descriptor is P as a "P logical tensor". Likewise, a logical tensor whose distributed descriptor is (S(0), B) is referred to as an "(S(0), B) logical tensor", one with descriptor (B, S(1)) as a "(B, S(1)) logical tensor", one with (P, S(1)) as a "(P, S(1)) logical tensor", one with (S(0), B, S(1)) as an "(S(0), B, S(1)) logical tensor", one with (P, S(1), S(0)) as a "(P, S(1), S(0)) logical tensor", and so on.
As shown in fig. 1, if the signature SBP-3 of logical node E in the initial logical node topology 101 takes the form "S(0) → S(0)", its output distributed descriptor is S(0), and its output logical tensor is therefore an S(0) logical tensor. If SBP-3 takes the form "B → B" or "P → P", the distributed descriptor of its output logical tensor is B or P, and its output is a B logical tensor or a P logical tensor. If the candidate signature SBP-1 of logical node B (i.e., "(S(1), S(0)) → P") is selected as the determined signature, the distributed descriptor of the input logical tensor at the first input end, which corresponds to the output of node E, must be S(1), i.e., the first input end must obtain an S(1) logical tensor, and the distributed descriptor at the second input end, which corresponds to the output of node A, must be S(0), i.e., the second input end must obtain an S(0) logical tensor. Clearly, if node A outputs a P logical tensor, this does not match the descriptor S(0) required at the second input end of node B; therefore, for logical node B to perform a correct operation, the logical tensor output by node A with descriptor P must be converted into a logical tensor with descriptor S(0). Similarly, if the logical tensor output by node E has descriptor S(0), this does not match the descriptor S(1) required at the first input end of node B, so the S(0) logical tensor output by node E must be converted into a logical tensor with descriptor S(1).
In a distributed computing system, because the operation tasks of the logical nodes are divided and distributed to the various computing devices (e.g., computing cards such as CPUs, GPUs, or TPUs), the intermediate parameters must be synchronized continuously to obtain the correct final result, which involves exchanging intermediate parameters between different computing devices. When the SBP descriptor of the output logical tensor in the SBP signature of the upstream logical node is inconsistent with the SBP descriptor of the corresponding input logical tensor in the SBP signature of the current node, the output is usually converted during actual operation, and this conversion typically requires fetching part of the data residing on other computing devices so that, together with the locally available data, it forms the data required at the input end of the current logical node, in conformity with that input end's distributed descriptor. Fetching part of the data from another device incurs a relatively large data transmission overhead or transmission cost; therefore, selecting different signatures for the current logical node results in different data transmission overheads or costs. For this reason, the transmission data amount estimation unit 121 estimates, for each logical node whose signature is not yet determined, the data transmission overhead that each candidate signature would generate. For example, for logical node B, the data transmission cost that would arise under each of its three candidate SBP signatures is estimated; selecting any one of the candidate signatures would allow logical node B to accomplish its operation task.
However, the data transmission costs incurred under different SBP signatures differ. Therefore, to minimize the data transmission cost of the data processing process, the signature with the minimum data transmission amount must be selected from the candidate signatures of each logical node as the signature used during actual operation.
Between logical node A and logical node B in the initial logical node topology 101, which stand in an upstream-downstream relationship, logical node A may be a source node whose SBP signature was configured by the user, generated naturally from the user's description of the task, or already determined by decision selection according to the scheme of the present disclosure; suppose, for example, that the descriptor of the output logical tensor of node A's SBP signature is S(0). Logical node B, meanwhile, has many candidate SBP signatures, which may include (S(1), B) → S(1), B → P, S(1) → P, P → B, and so on. From logical node A to logical node B, since the distributed descriptor of node A's output logical tensor is S(0), the corresponding input descriptors that node B can select from are S(1), B, and P.
Therefore, once the signatures of some upstream logical nodes are determined, the SBP signatures of the logical nodes downstream of them are selectively determined based on the cost of data transfer between the logical distributed descriptor (SBP descriptor) of the upstream node's output logical tensor and the logical distributed descriptor (SBP descriptor) of the corresponding input logical tensor of each candidate logical distributed signature of the downstream logical node. In this way, once a candidate SBP signature of a logical node is selected, the SBP descriptors of the tensors at all of its input and output ends are fixed, so the total data transmission cost of the current logical node can be computed or estimated, and the candidate logical distributed signature with the smallest total cost is used as the logical distributed signature of the current logical node. It should be noted that if the input-end descriptor of some candidate signature coincides with the descriptor of the upstream node's output logical tensor, that candidate logical distributed signature may be preferentially selected, unless the descriptors of the logical tensors at its other input ends would make the final total cost larger.
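The selection rule described above can be sketched as a small greedy routine (hypothetical Python; the function names are invented, and `cost` here is only a placeholder for the transmission data amount estimation unit's Table-1/Table-2 estimates):

```python
# Illustrative sketch of minimum-cost candidate-signature selection.

def cost(src_desc, dst_desc, tensor_size):
    """Placeholder transfer-cost model: free if descriptors already
    match, otherwise a flat cost proportional to the tensor size."""
    if src_desc == dst_desc:
        return 0
    return tensor_size

def pick_signature(upstream_outputs, candidates, tensor_size):
    """upstream_outputs: one output descriptor per input end.
    candidates: list of (input_descriptors, output_descriptor) pairs.
    Returns the candidate with the smallest total conversion cost."""
    def total(sig):
        ins, _out = sig
        return sum(cost(u, i, tensor_size)
                   for u, i in zip(upstream_outputs, ins))
    return min(candidates, key=total)

# Upstream outputs S(0) (from E) and P (from A); two candidates for B.
best = pick_signature(
    ["S(0)", "P"],
    [(("S(1)", "S(0)"), "P"), (("S(0)", "B"), "S(0)")],
    tensor_size=1024,
)
```

Under this placeholder cost model the second candidate wins because its first input descriptor already matches the upstream output, echoing the preference for coinciding descriptors noted above.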
FIG. 3 is a schematic block diagram illustrating the logical node distributed signature decision system 100 deciding SBP signatures in the distributed data processing system of the present disclosure; it is an enlarged view of the relationship between nodes A, B, and E of fig. 1. As shown in fig. 3, assume that the determined SBP signature SBP-3 of logical node E has output distributed descriptor S(0), that the determined SBP signature SBP-5 of logical node A has output distributed descriptor P, and that one of the candidate SBP signatures SBP-2 of logical node B is (S(1), S(0)) → P. The SBP descriptor of the input logical tensor of node B corresponding to the S(0) output of node E is thus S(1), and the SBP descriptor of the input corresponding to the P output of node A is S(0). Therefore, to satisfy the input distribution required by this candidate signature of node B, the logical tensor at one input must be converted from node E's output descriptor S(0) to S(1), and the tensor at the other input from node A's output descriptor P to S(0). These transformations entail data exchange during actual data processing.
Fig. 4 illustrates a first schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between logical tensors of different distributed descriptors. Assume the candidate SBP signature SBP-2 of task node B shown in fig. 3 is (S(1), S(0)) → P. For ease of description, the tasks of the input source task nodes A and E and of the receiving sink node B are distributed on the same device set, here the two computing cards GPU 0 and GPU 1, as shown in fig. 4. Although only two computing cards are shown, in practice the source and sink task nodes may be distributed across more cards or across different device sets. Fig. 4 shows the data exchange that occurs when the S(0) logical tensor of task node E in fig. 3 is distributed over the two computing cards and the input end of task node B is to obtain an S(1) logical tensor.
The part of task node B on GPU 0 needs to obtain an S(1) slice; besides the half of task node E's S(0) logical tensor already resident on GPU 0 (the acquisition of this data portion is shown by a solid arrow), it must fetch the other half from the S(0) tensor on GPU 1 (shown by a dashed arrow). If the size of the logical tensor is T1, the amount of data transmitted from node E's tensor on GPU 1 to the part of node B on GPU 0 is T1/2. Meanwhile, the part of node B on GPU 1 likewise needs its S(1) slice and, besides the half of node E's S(0) tensor resident on GPU 1 (dashed arrow), must fetch the other half from GPU 0 (solid arrow), again transferring T1/2. Thus, transforming node E's S(0) logical tensor into the S(1) logical tensor required at the input of node B has a total data transmission cost of T1 = (T1/2 + T1/2), where T1 is the size of the distributed logical tensor on the source node; in fig. 4 this is the size of the hatched S(0) shard on each card, i.e., one half of the entire logical tensor. For device sets with 3, 4, or 5 cards, the transmission cost is likewise T1.
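The fig. 4 count can be written out as a tiny arithmetic sketch (illustrative only; the function name is invented):

```python
# Fig. 4's two-card S(0) -> S(1) count: each sink card already holds
# half of what it needs locally and fetches the other half (T1/2)
# from the peer card, so the total transfer is T1.

def s0_to_s1_transfer_two_cards(t1):
    fetched_by_gpu0 = t1 / 2   # fetched from GPU 1 (dashed arrow)
    fetched_by_gpu1 = t1 / 2   # fetched from GPU 0 (solid arrow)
    return fetched_by_gpu0 + fetched_by_gpu1

total = s0_to_s1_transfer_two_cards(100.0)   # equals T1
```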
Fig. 5 illustrates a second schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between logical tensors of different distributed descriptors. Again assume that the candidate SBP signature SBP-2 of task node B shown in fig. 3 is (S(1), S(0)) → P. For convenience of description, the tasks of the input source task nodes A and E and of the receiving sink node B are distributed on the same device set, here the three computing cards GPU 0, GPU 1, and GPU 2, as shown in fig. 5. Three computing cards are shown only by way of example; there may equally be two cards, as in fig. 4, and in fact the source and sink task nodes may be distributed over more cards or over different device sets. Fig. 5 shows the data exchange that occurs when the P logical tensor of task node A in fig. 3 is distributed over the three computing cards and the input end of task node B is to obtain an S(0) logical tensor.
The part of task node B on GPU 0 needs to obtain its S(0) slice; besides one third of task node A's P logical tensor resident on GPU 0 (the acquisition of this data portion is shown by a dotted arrow), it must fetch one third from node A's P tensor on GPU 1 and one third from that on GPU 2 (solid arrows). If the size of node A's logical tensor on each GPU card is T2, then for the part of node B on GPU 0 to obtain its S(0) logical tensor, T2/3 must be transferred from node A's tensor on GPU 1 and T2/3 from node A's tensor on GPU 2. Similarly, the part of node B on GPU 1 needs T2/3 transferred from node A's tensor on GPU 0 and T2/3 from node A's tensor on GPU 2, and the part of node B on GPU 2 needs T2/3 transferred from node A's tensor on GPU 1 and T2/3 from node A's tensor on GPU 0.
Therefore, the data transfer required in actual data processing to convert the P distributed logical tensor of fig. 5 into the S(0) distributed logical tensor is 2T2 = (T2/3 + T2/3 + T2/3 + T2/3 + T2/3 + T2/3). Alternatively, if the task node is distributed over 2 computing cards, the transmission amount is T2 = (T2/2 + T2/2). By analogy, when the source and sink nodes share the same device set with k cards, the transmission amount is (k-1)·T2.
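The (k-1)·T2 generalization above can be checked with a short sketch (illustrative Python; the function name is invented):

```python
# Fig. 5's P -> S(0) count: a partial value of size T2 lives on each of
# k cards; every card fetches its slice (T2/k) from each of the other
# k-1 cards, giving k * (k-1) * T2 / k = (k-1) * T2 in total.

def p_to_s_transfer(t2, k):
    per_card_fetch = (k - 1) * t2 / k    # fetched from the other cards
    return k * per_card_fetch

three_cards = p_to_s_transfer(90.0, 3)   # 2 * T2, as in fig. 5
two_cards = p_to_s_transfer(90.0, 2)     # T2, as in fig. 4's setting
```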
It is apparent from the above that, for logical node B to perform its operation, the data transmission cost of selecting signature SBP-2 (e.g., the signature (S(1), S(0)) → P) is the sum of the transmission costs of its two inputs. Combining fig. 4 and fig. 5 (with fig. 4 extended to the three cards of fig. 5, which as noted above still costs T1), the total data amount the task node needs to transmit under candidate signature SBP-2 is T1 + 2T2. The transmission cost estimated by the transmission data amount estimation unit 121 for a candidate signature of logical node B must therefore include the transmission costs of both inputs under that candidate signature.
A calculation table summarizing the amount of data exchange between the various SBP descriptors can be derived for the case where the device sets of the source task node and the sink task node are exactly identical, as shown in table 1 below:
Table 1 (the distribution device sets of the source task node and the sink task node are identical, with K cards)

Changing mode | Data volume of the source task node's distributed logic tensor | Amount of data exchange | Remarks
S(i) → S(j)   | T1 | 0            | i = j
S(i) → S(j)   | T1 | T1·(K-1)/K   | i ≠ j
S → B         | T2 | (K-1)·T2     |
S → P         | T3 | 0            |
B → S         | T4 | 0            |
B → B         | T5 | 0            |
B → P         | T6 | 0            |
P → S         | T7 | (K-1)·T7     |
P → B         | T8 | 2(K-1)·T8    |
P → P         | T9 | 0            |
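The rules of Table 1 can be collected into a single lookup function, sketched below (illustrative Python; the function name is invented, `t` stands for the table's tensor-size column, and `k` for the number of cards in the shared device set):

```python
# Table 1 as a function: transfer cost when source and sink task nodes
# share the same K-card device set.

def same_device_set_cost(src, dst, t, k):
    if src.startswith("S") and dst.startswith("S"):
        # S(i) -> S(j): free if i == j, otherwise t*(K-1)/K.
        return 0 if src == dst else t * (k - 1) / k
    table = {
        ("S", "B"): (k - 1) * t,
        ("S", "P"): 0,
        ("B", "S"): 0,
        ("B", "B"): 0,
        ("B", "P"): 0,
        ("P", "S"): (k - 1) * t,
        ("P", "B"): 2 * (k - 1) * t,
        ("P", "P"): 0,
    }
    return table[(src[0], dst[0])]
```

For example, with K = 4 cards, converting S(0) to S(1) for a tensor of size 12 would cost 12 · 3/4 = 9 under these rules.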
Fig. 6 illustrates a third schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between logical tensors of different distributed descriptors, for the case where the device set of the source node is completely disjoint from that of the sink node: source task node E is distributed over GPU 0 and GPU 1, while sink task node B is distributed over computing cards GPU 2 and GPU 3. If the size of the logical tensor distributed on each computing card is T3, the amount of data to be transmitted is 2T3.
Fig. 7 illustrates a fourth schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between logical tensors of different distributed descriptors, again for completely disjoint device sets: source task node A is distributed over GPU 0, GPU 1, and GPU 2, while sink task node B is distributed over computing cards GPU 3, GPU 4, and GPU 5. If the size of the logical tensor on each computing card of the source task node is T4, the amount of data to be transmitted is 9 × (1/3)T4, i.e., 3T4. If the source task node were distributed over 2 computing cards, the amount to transmit would be 2T4. In general, if the number of computing cards over which source task node A is distributed is Ks, the transmission amount is Ks·T4.
When the device sets of the source task node and the sink task node are completely different, the amount of data exchange between the various SBP descriptors can likewise be summarized in a calculation table, as shown in Table 2 below:
table 2 (the distribution device sets of the source task node, with Ks cards, and the sink task node, with Kd cards, are completely different)

| Changing mode | Data volume of distributed logic tensor of source task node | Amount of data exchange |
| --- | --- | --- |
| S → S | T1 | T1 |
| S → B | T2 | Kd·T2 |
| S → P | T3 | T3 |
| B → S | T4 | T4 |
| B → B | T5 | Kd·T5 |
| B → P | T6 | T6 |
| P → S | T7 | Ks·T7 |
| P → B | T8 | (Ks+Kd-1)·T8 |
| P → P | T9 | Ks·T9 |
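Table 2 is a pure lookup, so it can be sketched even more directly. Again, this is an illustrative helper with an assumed name, not part of the patent's system:

```python
def disjoint_device_set_cost(src_mode, dst_mode, tensor_size, ks, kd):
    """Data-exchange cost per Table 2: the source task node runs on Ks
    cards and the sink task node on Kd cards, with completely disjoint
    device sets. `tensor_size` is the size of the source task node's
    distributed logic tensor."""
    table = {
        ("S", "S"): tensor_size,
        ("S", "B"): kd * tensor_size,
        ("S", "P"): tensor_size,
        ("B", "S"): tensor_size,
        ("B", "B"): kd * tensor_size,
        ("B", "P"): tensor_size,
        ("P", "S"): ks * tensor_size,
        ("P", "B"): (ks + kd - 1) * tensor_size,
        ("P", "P"): ks * tensor_size,
    }
    return table[(src_mode, dst_mode)]
```

For example, P → B with Ks = Kd = 3 and T8 = 10 costs (3+3-1)·10 = 50 units.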
Fig. 8 illustrates a fifth schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between logic tensors with different distributed descriptors, where the device set of the source node is not identical to, but overlaps, the device set of the sink node. That is, source task node E is distributed across GPU 0 and GPU 1, and sink task node B is distributed across compute cards GPU 1 and GPU 2. If the size of the logic tensor distributed on each computing card of the source task node is T5, the data volume to be transmitted is (3/2)·T5 = 1/2·T5 + 1/2·T5 + 1/2·T5. In this case there is no fixed calculation rule; the cost must be computed from the specific configuration of the two device sets and the intersection between them.
Fig. 9 illustrates a sixth schematic diagram of the transmission data amount estimation unit 121 according to the present disclosure estimating the amount of data transmission generated between logic tensors with different distributed descriptors, where the device set of the source node is not identical to, but overlaps, the device set of the sink node. Namely, source task node A is distributed over GPU 0, GPU 1, and GPU 2, and sink task node B is distributed over compute cards GPU 1, GPU 2, and GPU 3. If the size of the logic tensor distributed on each computing card of the source task node is T6, the data volume to be transmitted is 7·(1/3)·T6, i.e., (7/3)·T6. In this case there is no fixed calculation rule; the cost must be computed from the specific configuration of the two device sets and the intersection between them.
As described above, the transmission data amount estimation unit 121 traverses all the candidate signatures SBP-1, SBP-2, and SBP-3 of logical node B in the above-described manner and obtains the transmission cost of each signature. The transmission data amount comparison unit 122 then compares the transmission costs under each candidate signature and finds the minimum transmission cost for the logical node to be decided, e.g., logical node B. Finally, the SBP signature determining unit 123 determines the candidate SBP signature corresponding to the minimum transmission cost as the final SBP signature of logical node B, or retains the subset of candidate SBP signatures that all satisfy the minimum transmission cost. For example, if the candidate SBP signature SBP-2 of logical node B, say (S(1), S(0)) → P, satisfies the minimum transmission cost, and another candidate SBP signature SBP-1 is [(S(1), S(1)), (S(0), B) → (P, S(1))], that candidate SBP signature SBP-1 may also satisfy the minimum transmission cost.
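The traverse-compare-select flow of units 121, 122, and 123 can be sketched as follows. The function and the `cost_of` callback are hypothetical names used only for illustration; `cost_of` stands in for whatever per-signature transmission-cost estimate the estimation unit produces:

```python
def select_min_cost_signatures(candidates, cost_of):
    """Traverse all candidate SBP signatures of a node (unit 121),
    compare their transmission costs (unit 122), and keep every
    candidate attaining the minimum cost (unit 123)."""
    costs = {sig: cost_of(sig) for sig in candidates}
    best = min(costs.values())
    # More than one signature may tie for the minimum, yielding a
    # candidate-signature subset rather than a single winner.
    return [sig for sig, c in costs.items() if c == best]
```

A node whose candidates SBP-2 and SBP-1 tie at the minimum cost would thus keep both in its candidate subset.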
Thus, the first logical node topology output component 130 outputs the first logical node topology map 131 based on the SBP signature determined for each logical node by the SBP signature determination unit 123. Each logical node constituting the first logical node topology map 131 may be attached with only one SBP signature, in which case each logical node explicitly specifies the distribution pattern or distribution descriptor of each of its input logic tensors and determines it uniquely. Alternatively, some logical nodes of the first logical node topology map 131 may have multiple candidate SBP signatures satisfying the minimum data exchange cost, which together form a candidate SBP signature subset for that logical node.
Illustrated in FIG. 2 is a second partial schematic diagram of the logical node distributed signature decision system 100 for a static distributed data processing system according to the present disclosure, which provides a scheme for further determining the SBP signature within a subset of candidate SBP signatures. For example, FIG. 10 shows an example of transforming a logical node topology graph into a computational graph using the static distributed data processing system with the logical node distributed signature decision system 100 of the present disclosure. If the computational resources of one of the computing devices allocated to logical node N3 are sufficient for the first logic tensor 1, the second logic tensor 2, and the resulting logic tensor 3, then a low-dimensional candidate SBP signature, e.g., the one-dimensional candidate SBP signature (S(1), S(0)) → P, may be employed directly, as shown on the left side of FIG. 10. If the computational resources of one of the computing devices allocated to logical node N3 do not satisfy the computational resources required for the first logic tensor 1, the second logic tensor 2, and the resulting logic tensor 3, the present disclosure provides another, multidimensional candidate SBP signature that reduces the computational resource requirement on the computing device allocated to logical node N3 by further partitioning one of the input logic tensors (e.g., the first logic tensor 1), so that the computing device's resources suffice to process the further-partitioned logic tensor. To this end, a decision system according to the present disclosure provides the second-dimension SBP distributed signature selection component 140 shown in FIG. 2.
The calculation resource comparison unit 141 of the SBP distributed signature selection component 140 may compare the actual computation resources of each computing device in the device set over which the current logical node is to be distributed in parallel against the computation resources required to process the logic tensor of the corresponding input determined according to the first-dimension distributed descriptor. For example, referring to FIG. 10, the computing resources, such as memory, required to process the first logic tensor 1, the second logic tensor 2, and the resulting logic tensor 3 of the current logical node N3 are obtained, as are the actual computing resources that can be provided by the computing device (e.g., a certain GPU, CPU, or server) on which the current logical node N3 is deployed. As previously described, if the deployed computing devices are sufficient to satisfy the computing resources required to process all the input and output logic tensors of the current logical node N3, the SBP signature determination unit 143 directly determines the lowest-dimensional SBP signature in the subset of candidate SBP signatures, e.g., (S(1), S(0)) → P, as the final SBP signature of the current logical node N3.
If the deployed computing device is insufficient to satisfy the computing resources required to process all of the input and output logic tensors (including the resultant logic tensor) of the current logical node N3, the distribution descriptor determination unit 142 selects a second-dimension SBP descriptor for the logic tensor of one of the inputs. For example, the first logic tensor 1 of the current logical node N3 in FIG. 10 is further divided so that the computation resources required to process each first sliced logic tensor 1, the second logic tensor 2, and the resultant sliced logic tensor 3 are smaller than the computation resources of the deployed computing device; the SBP descriptor of the first sliced logic tensor 1 is then (S(1), S(1)). Meanwhile, the distribution descriptor determining unit 142 determines that the SBP descriptor of the second logic tensor 2 is (S(0), B), and that the SBP descriptor of the resultant sliced logic tensor is (P, S(1)). Thus, the SBP signature of the current logical node N3 that satisfies the computing resources of the computing device is [(S(1), S(1)), (S(0), B) → (P, S(1))].
Each dimension of each SBP descriptor in such an SBP signature carries a predetermined number or a predetermined repeat count. For example, the SBP descriptor (S(1), S(1)) of the first sliced logic tensor 1 contains the number of computing devices deployed in parallel and the predetermined number of slices into which the tensor is divided. For the SBP descriptor (S(0), B) of the second logic tensor 2, the first-dimension descriptor S(0) contains the number of computing devices deployed in parallel, and the second-dimension descriptor B contains the predetermined number of times the second logic tensor 2 is to be repeatedly broadcast. The predetermined number attached to the second dimension S(1) of the descriptor (S(1), S(1)) of the first logic tensor 1 is equal to the predetermined repeat count contained in the second-dimension descriptor B of the descriptor (S(0), B) of the second logic tensor 2. Similarly, the SBP descriptor of the resultant sliced logic tensor 3 is (P, S(1)), and the predetermined number of its second-dimension descriptor S(1) is likewise equal to the repeat count of the descriptor B of the second logic tensor 2. Based on the descriptors of the respective input tensors determined by the distribution descriptor determination unit 142, the SBP signature determination unit 143 selects from the candidate SBP distributed signature subset a candidate SBP distributed signature containing the second-dimension distributed descriptor of the first logic tensor of the first input and/or the second-dimension distributed descriptors of the other logic tensors of the other inputs as the determined SBP distributed signature of the current logical node.
Note that, when selecting the second dimension SBP descriptor, the distribution descriptor determination unit 142 generally preferentially selects the split or parallel distribution descriptor S for the largest one of the input tensors and selects the broadcast descriptor B for the other input tensors. The input tensor selected for segmentation may be a data tensor or a model tensor.
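The preference just described — trigger a second-dimension descriptor only when required resources exceed the device's, split (S) the largest input tensor, broadcast (B) the rest — can be sketched as follows. All names here are hypothetical illustrations of the rule, not the disclosed units themselves:

```python
def needs_second_dim_split(required_mem, device_mem):
    """Unit 141's test: a second-dimension descriptor is needed only
    when the resources required by the node's tensors exceed the
    actual resources of the assigned computing device."""
    return required_mem > device_mem


def pick_second_dim_descriptors(input_sizes):
    """Unit 142's stated preference: assign the split descriptor S to
    the largest input tensor and the broadcast descriptor B to every
    other input tensor."""
    largest = max(range(len(input_sizes)), key=lambda i: input_sizes[i])
    return ["S" if i == largest else "B" for i in range(len(input_sizes))]
```

For instance, with inputs of sizes 100 and 10, the first (data or model) tensor is split and the second is broadcast.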
Finally, the second logical node topology output component 150 outputs the final second logical node topology graph 151 based on the SBP signature determined for each logical node by the SBP signature determination unit 143. Each logical node constituting the second logical node topology graph 151 is attached with exactly one SBP signature; that is, each logical node explicitly specifies the distribution pattern or distribution descriptor of each of its input tensors and determines it uniquely.
Returning to FIG. 2, after the second logical node topology output component 150 of the logical node distributed signature decision system 100 outputs the task topology, e.g., the final second logical node topology graph 151, the computation graph generating component 160 of the static distributed data processing system generates a computation graph based on the second logical node topology graph 151. Each task logical node spawns a number of computing nodes corresponding to the number of distributed or parallel computing devices. In addition, when the distribution descriptor of an input tensor of a computing node corresponding to the current logical node does not match the distribution descriptor of the output tensor of the upstream computing node, a transformation computing node must be inserted. For example, as shown in the right part of FIG. 10, computing node N4 is inserted between computing nodes N1 and N3 to slice the logic tensor 1 output by computing node N1 with distribution descriptor S(1) into the first sliced logic tensor 1 with SBP descriptor (S(1), S(1)); likewise, computing node N5 is inserted between computing nodes N2 and N3 of FIG. 10 to transform the tensor output by computing node N2 with distribution descriptor P into the second logic tensor 2 with SBP descriptor (P, B).
Specifically, computing node N4 is a slicing computing node: when processing the logic tensor 1 with distribution descriptor S(1) output by computing node N1, it continues to divide the tensor along dimension 1 (UNPACK), splitting logic tensor 1 into a predetermined number of first sliced logic tensors 1 that embody the distribution described by the SBP descriptor (S(1), S(1)), and then outputs this predetermined number of first sliced logic tensors one by one to computing node N3 (corresponding to logical node N3). Computing node N5 is a repeat-broadcast output node: when processing the second logic tensor 2, it performs repeated broadcast output (REPEAT) of the second logic tensor 2, with a repeat count equal to the predetermined number of first sliced logic tensors 1. Therefore, when computing node N3 performs processing, the tensors actually processed at each step are a first sliced logic tensor 1 and the second logic tensor 2, and the output obtained is the resultant sliced tensor 3 rather than the resultant logic tensor 3. As a result, the computation resources required by computing node N3 to process the first sliced logic tensor 1 and the second logic tensor 2 and obtain the output resultant sliced logic tensor 3 are much smaller than those required to process the full first and second logic tensors 1 and 2 and obtain the resultant logic tensor 3, so the computation resources of the computing device on which computing node N3 is deployed can satisfy the resources required for the actual computation, reducing the need for high-cost computing devices.
In addition, a computing node N6 needs to be inserted after computing node N3. Computing node N6 is an aggregation computing node: it performs aggregation processing (accumulation) on the resultant sliced logic tensors 3 output by computing node N3, accumulating them one by one into the resultant logic tensor 3.
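The UNPACK / REPEAT / accumulate pipeline of the inserted nodes N4, N5, and N6 can be modelled on toy list-valued tensors. This is a deliberately simplified sketch: real tensors are multidimensional and the split is along dimension 1, but the dataflow shape is the same:

```python
def unpack(tensor, n):
    """N4 (UNPACK): split a tensor (modelled as a flat list) into n
    sliced tensors, to be emitted one by one to the consumer node."""
    step = len(tensor) // n
    return [tensor[i * step:(i + 1) * step] for i in range(n)]


def repeat(tensor, n):
    """N5 (REPEAT): re-broadcast the same tensor n times, matching the
    number of slices produced by unpack."""
    return [tensor] * n


def accumulate(result_slices):
    """N6: gather the per-slice results back into one resultant
    logic tensor."""
    out = []
    for s in result_slices:
        out.extend(s)
    return out
```

At each step, N3 consumes one slice from `unpack` together with one copy from `repeat`; N6 then reassembles the per-slice outputs.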
Although the above describes the general case of how a final SBP signature is determined among candidate SBP signatures, in some specific cases — under a special configuration by the user, or under explicit user specification — certain logical nodes have only the user-specified SBP signature. The logical nodes downstream of them then make their SBP signature decisions based on these specifically designated upstream logical nodes.
With the SBP distributed signature decision system for logical nodes of a distributed data processing system according to the present disclosure, it is possible, on the one hand, to minimize from a global perspective the amount of data exchanged between different computing devices of the distributed data processing system during data processing, thereby reducing the overhead of data interaction, effectively limiting the adverse effect of data exchange on the actual operations, shortening operation waiting times, and accelerating data processing. More importantly, under the demands of large-scale models and large-scale data processing, it reduces the single-card computing resource requirements of the computing devices, and therefore the required hardware cost. On the other hand, parallel deployment can be carried out automatically; in particular, the same data processing effect can be achieved automatically in mixed-parallel deployments that would otherwise require manual intervention. Moreover, when a large tensor must be processed locally and the computing device deployed for it cannot satisfy the computing resources the large tensor requires, the SBP signature decision system eliminates the need to increase the device's computing resources on account of that local large tensor. If the computing resources of a computing device were increased merely for the processing of a local large tensor, the added resources would mostly sit idle, wasting computing resources.
The basic principles of the present disclosure have been described in connection with specific embodiments. It should be noted, however, that those skilled in the art will understand that all or any of the steps or components of the method and apparatus of the present disclosure may be implemented in hardware, firmware, software, or a combination thereof, in any computing device (including processors, storage media, etc.) or network of computing devices, which can be accomplished using basic programming skills after reading the description of the present disclosure.
Thus, the objects of the present disclosure may also be achieved by running a program or a set of programs on any computing device. The computing device may be a general purpose device as is well known. Thus, the object of the present disclosure can also be achieved merely by providing a program product containing program code for implementing the method or apparatus. That is, such a program product also constitutes the present disclosure, and a storage medium storing such a program product also constitutes the present disclosure. It is to be understood that the storage medium may be any known storage medium or any storage medium developed in the future.
It is also noted that in the apparatus and methods of the present disclosure, it is apparent that individual components or steps may be disassembled and/or re-assembled. These decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of executing the series of processes described above may naturally be executed chronologically in the order described, but need not necessarily be executed chronologically. Some steps may be performed in parallel or independently of each other.
The above detailed description should not be construed as limiting the scope of the disclosure. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (12)

1. A multi-dimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system, the SBP distributed signatures including one-dimensional SBP distributed signatures and multi-dimensional SBP distributed signatures, an SBP distributed signature being a signature of a logical node with distributed descriptors of logic tensors, the distributed descriptor categories of each logic tensor including a SPLIT (SPLIT) logic tensor descriptor, a BROADCAST (BROADCAST) logic tensor descriptor, and a PARTIAL VALUE (PARTIAL VALUE) logic tensor descriptor, the SPLIT (SPLIT) logic tensor descriptor describing the manner in which a logic tensor is partitioned, the BROADCAST (BROADCAST) logic tensor descriptor describing the manner in which a logic tensor is distributed in the distributed data processing system in a broadcast manner, and the PARTIAL VALUE (PARTIAL VALUE) logic tensor descriptor representing an input or output logic tensor of a logical node as the partial value of multiple homogeneous logic tensors, the distributed descriptors of the respective input and output logic tensors of the logical node describing the distributed description of the logical node on the operational data, thereby forming a signature for the logical node, the system comprising:
an initial logic node generation component that receives task configuration data input by a user and generates an initial logic node topology map for the distributed data processing system, wherein a source logic node has a designated SBP distributed signature and each initial logic node is attached with a candidate SBP distributed signature set based on the task configuration data, and each SBP distributed signature in the candidate SBP distributed signature set designates a distributed descriptor of each input logic tensor and a distributed descriptor of each output logic tensor of the initial logic node to which the initial logic node belongs; and
a first-dimension SBP distributed signature selection component, which, for each candidate SBP distributed signature of a current logical node, calculates the cost of transmission data required for transforming the distributed descriptor of the logic tensor at the output of each upstream logical node into the first-dimension distributed descriptor of the logic tensor at the corresponding input of the current logical node, based on the number of devices in the device set over which each upstream logical node is to be distributed in parallel, the number of devices in the device set over which the current logical node is to be distributed in parallel, and the size of the logic tensor distributed on each device by each upstream logical node, according to the distributed descriptor at the output of each upstream logical node for which the SBP distributed signature has been determined, and selects one or more candidate SBP distributed signatures containing the first-dimension distributed descriptor corresponding to the minimum value of the cost as the candidate SBP distributed signature subset of the current logical node, the first-dimension distributed descriptors describing a parallel manner of the logic tensors of the corresponding inputs; and
a second-dimension SBP distributed signature selection component that compares the actual computation resources of each computing device of the device set over which the current logical node is to be distributed in parallel with the computation resources required for processing the logic tensor of the corresponding input terminal and the resultant logic tensor determined according to the first-dimension distributed descriptor, and, when the required computation resources are larger than the actual computation resources, selects from the candidate SBP distributed signature subset a candidate SBP distributed signature containing a second-dimension distributed descriptor of the first logic tensor of the first input terminal and/or second-dimension distributed descriptors of other logic tensors of other input terminals as the determined SBP distributed signature of the current logical node, wherein the second-dimension distributed descriptor of the logic tensor of the first input terminal of the determined SBP distributed signature is a split logic tensor descriptor that includes a predetermined number of first sliced logic tensors into which the first logic tensor is to be divided on the basis of the distribution described by the first-dimension distributed descriptor, and the second-dimension distributed descriptor of the other logic tensors is a broadcast logic tensor distribution descriptor that includes a predetermined number of times that the other logic tensors are specified to be repeatedly broadcast, wherein the predetermined number is equal to the predetermined number of times, and the calculation resources required for the current logical node to process each first sliced logic tensor, the logic tensors of the other input terminals, and the resultant sliced tensor thus obtained are smaller than the actual calculation resources of each computing device.
2. The multi-dimensional SBP distributed signature decision system for a logic node of a multi-level distributed data processing system as recited in claim 1, wherein the first logic tensor is a data logic tensor and one of the other logic tensors is a model logic tensor.
3. The multi-dimensional SBP distributed signature decision system for a logic node of a multi-level distributed data processing system as recited in claim 1, wherein the first logic tensor is a model logic tensor and the other logic tensors are data logic tensors.
4. The multi-dimensional SBP distributed signature decision system for a logical node of a multi-level distributed data processing system as recited in claim 1, wherein the logic tensors of the inputs are all data logic tensors.
5. The multi-dimensional SBP distributed signature decision system for a logical node of a multi-level distributed data processing system as recited in claim 1, wherein the first logic tensor requires a greater amount of computational resources than one of the other logic tensors of the other inputs.
6. The multidimensional SBP distributed signature decision system for logical nodes of a multi-level distributed data processing system of claim 1, wherein the distributed data processing system further comprises a computation graph generation component for generating a task logic computation graph based on a logical node topology graph formed by logical nodes from which the determined SBP distributed signature is obtained, wherein a split compute node is inserted before a first input of a compute node corresponding to a current logical node, a rebroadcast compute node is inserted before other inputs, and a rendezvous compute node is inserted after an output.
7. A multi-dimensional SBP distributed signature decision method for logical nodes of a multi-level distributed data processing system, the SBP distributed signature including a one-dimensional SBP distributed signature or a multi-dimensional SBP distributed signature, the SBP distributed signature being a signature of the logical node with distributed descriptors of logic tensors, the distributed descriptor categories of each logic tensor including a SPLIT (SPLIT) logic tensor descriptor, a BROADCAST (BROADCAST) logic tensor descriptor, and a PARTIAL VALUE (PARTIAL VALUE) logic tensor descriptor, the SPLIT (SPLIT) logic tensor descriptor describing the manner in which a logic tensor is partitioned, the BROADCAST (BROADCAST) logic tensor descriptor describing the manner in which a logic tensor is distributed in the distributed data processing system in a broadcast manner, and the PARTIAL VALUE (PARTIAL VALUE) logic tensor descriptor representing an input or output logic tensor of a logical node as the partial value of multiple homogeneous logic tensors, the distributed descriptors of the respective input and output logic tensors of the logical node describing the distributed description of the logical node on the operational data, thereby forming a signature for the logical node, the method comprising:
an initial logic node generation step of receiving task configuration data input by a user and generating an initial logic node topology map for the distributed data processing system, wherein a source logic node has a designated SBP distributed signature and each initial logic node is attached with a candidate SBP distributed signature set based on the task configuration data, and each SBP distributed signature in the candidate SBP distributed signature set designates a distributed descriptor of each input logic tensor and a distributed descriptor of each output logic tensor of the initial logic node to which the initial logic node belongs; and
a first-dimension SBP distributed signature selection step of calculating, for each candidate SBP distributed signature of a current logical node, the cost of transmission data required for transforming the distributed descriptor of the logic tensor at each upstream logical node's output into the first-dimension distributed descriptor of the logic tensor at the corresponding input of the current logical node, based on the number of devices in the device set over which each upstream logical node is to be distributed in parallel, the number of devices in the device set over which the current logical node is to be distributed in parallel, and the size of the logic tensor distributed on each device by each upstream logical node, according to the distributed descriptor at the output of each upstream logical node for which the SBP distributed signature has been determined, and selecting one or more candidate SBP distributed signatures containing the first-dimension distributed descriptor corresponding to the minimum value of the cost as a candidate SBP distributed signature subset of the current logical node, the first-dimension distributed descriptors describing a parallel manner of the logic tensors of the corresponding inputs; and
a second-dimension SBP distributed signature selecting step of comparing the actual computation resources of each computing device of the device set over which the current logical node is to be distributed in parallel with the computation resources required for processing the logic tensor of the corresponding input terminal and the resultant logic tensor determined according to the first-dimension distributed descriptor, and, when the required computation resources are larger than the actual computation resources, selecting from the candidate SBP distributed signature subset a candidate SBP distributed signature containing a second-dimension distributed descriptor of the first logic tensor of the first input terminal and/or second-dimension distributed descriptors of other logic tensors of other input terminals as the determined SBP distributed signature of the current logical node, wherein the second-dimension distributed descriptor of the logic tensor of the first input terminal of the determined SBP distributed signature is a split logic tensor descriptor that includes a predetermined number of first sliced logic tensors into which the first logic tensor is to be divided on the basis of the distribution described by the first-dimension distributed descriptor, and the second-dimension distributed descriptor of the other logic tensors is a broadcast logic tensor distribution descriptor that includes a predetermined number of times that the other logic tensors are specified to be repeatedly broadcast, wherein the predetermined number is equal to the predetermined number of times, and the calculation resources required for the current logical node to process each first sliced logic tensor, the logic tensors of the other input terminals, and the resultant sliced tensor thus obtained are smaller than the actual calculation resources of each computing device.
8. The multi-dimensional SBP distributed signature decision method for a logical node of a multi-level distributed data processing system as recited in claim 7, wherein the first logical tensor is a data logical tensor and one of the other logical tensors is a model logical tensor.
9. The multi-dimensional SBP distributed signature decision method for a logical node of a multi-level distributed data processing system as recited in claim 7, wherein the first logical tensor is a model logical tensor and the other logical tensors are data logical tensors.
10. The multi-dimensional SBP distributed signature decision method for a logical node of a multi-level distributed data processing system as recited in claim 7, wherein the logical tensors of the inputs are all data logical tensors.
11. The multi-dimensional SBP distributed signature decision method for a logical node of a multi-level distributed data processing system as recited in claim 7, wherein the amount of computational resources required for the first logical tensor is greater than the amount of computational resources required for one of the other logical tensors for the other inputs.
12. The multi-dimensional SBP distributed signature decision method for a logical node of a multi-level distributed data processing system as recited in claim 7, wherein said distributed data processing system further comprises a computation graph generation component for generating a task logic computation graph based on a logical node topology graph formed by logical nodes for which a determined SBP distributed signature has been obtained, wherein a split computation node is inserted before the first input of the computation node corresponding to the current logical node, rebroadcast computation nodes are inserted before the other inputs, and a rendezvous computation node is inserted after the output.
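The graph rewrite described in claim 12 can be sketched as a simple edge-list transformation. Again, this is an illustrative assumption: `insert_transport_nodes`, the node dictionary shape, and the `split(...)`/`rebroadcast(...)`/`rendezvous(...)` naming are invented here and are not the patent's actual component API.

```python
# Hypothetical sketch of claim 12's graph rewrite: a split node is inserted
# before the first input, a rebroadcast node before each other input, and a
# rendezvous (result-gathering) node after the output of the compute node.
def insert_transport_nodes(node):
    """node: dict with 'name', 'inputs' (first input is the split one),
    and 'output'. Returns the rewritten edge list of the task graph."""
    edges = []
    first, *others = node["inputs"]

    # split node before the first input
    split = f"split({first})"
    edges.append((first, split))
    edges.append((split, node["name"]))

    # rebroadcast nodes before the other inputs
    for inp in others:
        rb = f"rebroadcast({inp})"
        edges.append((inp, rb))
        edges.append((rb, node["name"]))

    # rendezvous node after the output, gathering the sliced results
    gather = f"rendezvous({node['output']})"
    edges.append((node["name"], gather))
    return edges
```

Applied to a matmul node with inputs `x` (data, split) and `w` (model, rebroadcast), the rewrite routes `x` through a split node, `w` through a rebroadcast node, and the output `y` through a rendezvous node.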
CN202110386634.5A 2021-04-12 2021-04-12 Multi-dimensional SBP distributed signature decision system and method for logic node Active CN112799852B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110386634.5A CN112799852B (en) 2021-04-12 2021-04-12 Multi-dimensional SBP distributed signature decision system and method for logic node

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110386634.5A CN112799852B (en) 2021-04-12 2021-04-12 Multi-dimensional SBP distributed signature decision system and method for logic node

Publications (2)

Publication Number Publication Date
CN112799852A CN112799852A (en) 2021-05-14
CN112799852B true CN112799852B (en) 2021-07-30

Family

ID=75816767

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110386634.5A Active CN112799852B (en) 2021-04-12 2021-04-12 Multi-dimensional SBP distributed signature decision system and method for logic node

Country Status (1)

Country Link
CN (1) CN112799852B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114035810B (en) * 2022-01-10 2022-04-15 北京一流科技有限公司 Synchronous deployment system and method for multi-stream parallelism
CN114598631B (en) * 2022-04-28 2022-08-09 之江实验室 Neural network computing-oriented modeling method and device for distributed data routing

Citations (6)

Publication number Priority date Publication date Assignee Title
CN105260554A (en) * 2015-10-27 2016-01-20 武汉大学 GPU cluster-based multidimensional big data factorization method
CN110955734A (en) * 2020-02-13 2020-04-03 北京一流科技有限公司 Distributed signature decision system and method for logic node
US20200112572A1 (en) * 2018-10-04 2020-04-09 Research Foundation Of The City University Of New York Blockchain architecture for computer security applications
CN111274259A (en) * 2020-02-16 2020-06-12 西安奥卡云数据科技有限公司 Data updating method for storage nodes in distributed storage system
CN111819579A (en) * 2018-08-03 2020-10-23 谷歌有限责任公司 Distribution tensor calculation across computing devices
CN112347186A (en) * 2019-08-09 2021-02-09 安徽寒武纪信息科技有限公司 Data synchronization method and device and related product

Family Cites Families (2)

Publication number Priority date Publication date Assignee Title
US10120902B2 (en) * 2014-02-20 2018-11-06 Citus Data Bilgi Islemleri Ticaret A.S. Apparatus and method for processing distributed relational algebra operators in a distributed database
CN111930519B (en) * 2020-09-22 2020-12-15 北京一流科技有限公司 Parallel decision system and method for distributed data processing

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN105260554A (en) * 2015-10-27 2016-01-20 武汉大学 GPU cluster-based multidimensional big data factorization method
CN111819579A (en) * 2018-08-03 2020-10-23 谷歌有限责任公司 Distribution tensor calculation across computing devices
US20200112572A1 (en) * 2018-10-04 2020-04-09 Research Foundation Of The City University Of New York Blockchain architecture for computer security applications
CN112347186A (en) * 2019-08-09 2021-02-09 安徽寒武纪信息科技有限公司 Data synchronization method and device and related product
WO2021027972A1 (en) * 2019-08-09 2021-02-18 中科寒武纪科技股份有限公司 Data synchronization method and apparatus and related product
CN110955734A (en) * 2020-02-13 2020-04-03 北京一流科技有限公司 Distributed signature decision system and method for logic node
CN111274259A (en) * 2020-02-16 2020-06-12 西安奥卡云数据科技有限公司 Data updating method for storage nodes in distributed storage system

Non-Patent Citations (2)

Title
oneflow-documentation; MARD1NO; 《https://github.com/Oneflow-Inc/oneflow-documentation/blob/master/en/docs/basics_topics/essentials_of_oneflow.md》; 2020-08-09; full text *
Behavior Recognition Algorithm Based on Human Pose Estimation; Wei Feifan; 《China Master's Theses Full-text Database》; 2021-01-15; full text *

Also Published As

Publication number Publication date
CN112799852A (en) 2021-05-14

Similar Documents

Publication Publication Date Title
CN110955734B (en) Distributed signature decision system and method for logic node
CN109993299B (en) Data training method and device, storage medium and electronic device
CN112764940B (en) Multi-stage distributed data processing and deploying system and method thereof
CN112799852B (en) Multi-dimensional SBP distributed signature decision system and method for logic node
CN111930519B (en) Parallel decision system and method for distributed data processing
CN115186821B (en) Core particle-oriented neural network inference overhead estimation method and device and electronic equipment
CN110928697B (en) Topological graph conversion system and method
CN113994350A (en) Generating parallel computing schemes for neural networks
WO2022068663A1 (en) Memory allocation method, related device, and computer readable storage medium
CN108270805B (en) Resource allocation method and device for data processing
JP2014206979A (en) Apparatus and method of parallel processing execution
CN111597055B (en) Distributed data processing system, distributed computing task deployment system and method
CN112183668A (en) Method and device for training service models in parallel
CN110929884A (en) Classification method and device for distributed machine learning optimization based on column division
CN111047045B (en) Distribution system and method for machine learning operation
CN113469350A (en) Deep convolutional neural network acceleration method and system suitable for NPU
CN114610474A (en) Multi-strategy job scheduling method and system in heterogeneous supercomputing environment
CN111695701B (en) System for realizing data set construction processing based on federal learning and construction generation method thereof
CN107918676A (en) The method for optimizing resources and database inquiry system of structuralized query
KR102326586B1 (en) Method and apparatus for processing large-scale distributed matrix product
KR20160081231A (en) Method and system for extracting image feature based on map-reduce for searching image
CN115391170A (en) Parallel decision system and method for distributed data processing
CN109635328A (en) Integrated circuit layout method and distributed design approach
CN111858721B (en) Distributed computing method based on priority coding
Goldman et al. An efficient parallel algorithm for solving the knapsack problem on hypercubes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant