CN117114091B - Computational graph processing method based on federated learning, computer device, and storage medium - Google Patents


Info

Publication number
CN117114091B
Authority
CN
China
Prior art keywords
operator, calculation, graph, equivalent, sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311385248.XA
Other languages
Chinese (zh)
Other versions
CN117114091A
Inventor
罗除
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Kaihong Digital Industry Development Co Ltd
Original Assignee
Shenzhen Kaihong Digital Industry Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Kaihong Digital Industry Development Co Ltd
Priority to CN202311385248.XA
Publication of CN117114091A
Application granted
Publication of CN117114091B


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/01: Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N20/00: Machine learning
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present application relates to the field of artificial intelligence, and in particular to a computational graph processing method, computer device, and storage medium based on federated learning. The method includes: acquiring a federated learning task, the task comprising an initial computational graph and an equivalent operator lookup table; splitting the initial computational graph based on the node type corresponding to each computing power node to obtain a sub-computational graph for each computing power node; distributing the sub-computational graph corresponding to each computing power node, together with the equivalent operator lookup table, to that node, so that each computing power node performs operator replacement on its assigned sub-computational graph according to the lookup table to obtain a target sub-computational graph; and acquiring the target sub-computational graph returned by each computing power node and determining the target computational graph from all the target sub-computational graphs. The method fully accounts for the node types of the different computing power nodes, and realizes computational graph optimization in parallel with cross-node cooperation among multiple computing power nodes, thereby improving the efficiency of computational graph optimization.

Description

Computational graph processing method based on federated learning, computer device, and storage medium
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a computational graph processing method, a computer device, and a computer-readable storage medium based on federated learning.
Background
At present, artificial intelligence algorithms are widely used across industries. However, the hardware that provides computing power comes in a wide variety of types and many thousands of models, and efficient execution of an algorithm on a particular type of hardware requires appropriate optimization of the computational graph. Computational graph optimization tends to be computationally expensive and to occupy significant memory on a single hardware device. In the related art, computational graph optimization is typically performed on a single computing hardware device, which takes a long time, strains the storage capacity of that device, and involves substantial repeated computation, wasting further time.
Therefore, how to improve the processing efficiency of computational graph optimization is a problem to be solved.
Disclosure of Invention
The present application provides a computational graph processing method, computer device, and storage medium based on federated learning, which address the low processing efficiency of computational graph optimization in the related art.
In a first aspect, the present application provides a computational graph processing method based on federated learning, applied to nodes in a computer cluster, the computer cluster including at least one management node and at least one computing power node, the method comprising:
acquiring a federated learning task, the federated learning task including an initial computational graph to be processed and an equivalent operator lookup table; splitting the initial computational graph based on the node type corresponding to each computing power node to obtain a sub-computational graph corresponding to each computing power node; distributing the sub-computational graph corresponding to each computing power node, together with the equivalent operator lookup table, to that computing power node, so that each computing power node performs operator replacement on its assigned sub-computational graph according to the lookup table to obtain a target sub-computational graph, where each operator in the target sub-computational graph is the operator with the shortest execution time; and acquiring the target sub-computational graph returned by each computing power node and determining the target computational graph from all the target sub-computational graphs.
In the computational graph processing method based on federated learning described above, the initial computational graph is split based on the node type corresponding to each computing power node to obtain a sub-computational graph for each computing power node, and each sub-computational graph is distributed, together with the equivalent operator lookup table, to its corresponding computing power node so that the node performs operator replacement according to the table. This fully accounts for the speed at which computing power nodes of different node types execute operators, and realizes computational graph optimization in parallel with cross-node cooperation among multiple computing power nodes, thereby improving optimization efficiency.
In a second aspect, the present application also provides a computer device comprising a memory and a processor;
the memory is used for storing a computer program;
the processor is configured to execute the computer program and implement the computation graph processing method based on federal learning as described above when the computer program is executed.
In a third aspect, the present application also provides a computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement a federally learning-based computational graph processing method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; a person skilled in the art may derive other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a computer cluster provided in an embodiment of the present application;
FIG. 2 is a schematic block diagram of a computer device provided in an embodiment of the present application;
FIG. 3 is a schematic flowchart of a computational graph processing method based on federated learning provided in an embodiment of the present application;
FIG. 4 is a schematic flowchart of the sub-steps of splitting the computational graph provided in an embodiment of the present application;
FIG. 5 is a schematic flowchart of the sub-steps of determining the target computational graph provided in an embodiment of the present application;
FIG. 6 is a schematic flowchart of another computational graph processing method based on federated learning provided in an embodiment of the present application;
FIG. 7 is a schematic flowchart of the sub-steps of adding execution times to the equivalent operator lookup table provided in an embodiment of the present application;
FIG. 8 is a schematic flowchart of the sub-steps of marking an anomalous operator in the equivalent operator lookup table provided in an embodiment of the present application.
Detailed Description
The following describes the embodiments of the present application clearly and completely with reference to the accompanying drawings. The embodiments described are evidently some, but not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on this disclosure without inventive effort fall within the scope of protection of the present application.
The flow diagrams depicted in the figures are merely illustrative; not all of the elements and operations/steps are necessarily included, nor must they be performed in the order described. For example, some operations/steps may be further divided, combined, or partially merged, so the actual order of execution may change according to the situation.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Embodiments of the present application provide a computational graph processing method based on federated learning, a computer device, a computer cluster, and a storage medium. The method is applied to computer devices: the initial computational graph is split based on the node type corresponding to each computing power node to obtain a sub-computational graph for each computing power node, and each sub-computational graph is distributed, together with an equivalent operator lookup table, to its corresponding computing power node so that each node performs operator replacement on its assigned sub-computational graph according to the table. This fully accounts for the speed at which computing power nodes of different node types execute operators, and realizes computational graph optimization in parallel with cross-node cooperation among multiple computing power nodes, thereby improving optimization efficiency.
The computer device may be a server or a terminal, for example.
The server may be an independent server, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks (CDN), and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be a smartphone, tablet computer, notebook computer, desktop computer, or other device.
Referring to FIG. 1, FIG. 1 is a schematic diagram of a computer cluster 1000 provided in an embodiment of the present application. As shown in FIG. 1, the computer cluster 1000 may include at least one management node 100 and at least one computing power node 200. The nodes may establish wired or wireless communication connections with one another.
For example, a computer cluster C may be defined as C = (M_i, S_j), where M_i denotes the i-th management node in cluster C and S_j denotes the j-th computing power node in cluster C.
In the embodiments of the present application, the management node 100 and the computing power node 200 are computer devices that provide computing power.
The main computing power of the management node 100 and the computing power nodes 200 may be provided by a central processing unit (CPU), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a neural network processing unit (NPU), a tensor processing unit (TPU), and so on. The data on a computing power node 200 may be formatted as tensors, or as one-dimensional or higher-dimensional arrays.
The management node 100 is configured to determine the sub-computational graph corresponding to each computing power node 200 according to a federated learning task, and to distribute the sub-computational graph and the equivalent operator lookup table to each corresponding computing power node 200. Each computing power node 200 is configured to perform operator replacement on its assigned sub-computational graph according to the lookup table, obtain a target sub-computational graph, and return the target sub-computational graph to the management node 100, which then determines the optimized target computational graph from all the target sub-computational graphs.
It will be appreciated that the management node 100 may control the operator replacement performed by the other computing power nodes 200, and may itself also act as a computing power node, performing operator replacement on a sub-computational graph sent by another management node. That is, the same computer device may serve both as a management node 100 and as a computing power node 200.
For ease of explanation, the embodiments of the present application describe computational graph processing from the perspective of one node acting as the management node while the other nodes act as computing power nodes.
Referring to FIG. 2, FIG. 2 is a schematic block diagram of a computer device 10 provided in an embodiment of the present application. In FIG. 2, the computer device 10 comprises a processor 101 and a memory 102 connected by a bus, such as an Inter-Integrated Circuit (I2C) bus.
The memory 102 may include a storage medium and an internal memory. The storage medium may store an operating system and a computer program. The computer program comprises program instructions that, when executed, cause the processor to perform any of the federated learning-based computational graph processing methods described herein.
The processor 101 is used to provide computing and control capabilities to support the operation of the overall computer device 10.
The processor 101 may be a central processing unit, or another general-purpose processor, digital signal processor, application-specific integrated circuit, field-programmable gate array or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component. A general-purpose processor may be a microprocessor or any conventional processor.
The processor 101 is configured to execute a computer program stored in the memory 102 and, when executing the computer program, to implement the following steps:
acquiring a federated learning task, the federated learning task including an initial computational graph to be processed and an equivalent operator lookup table; splitting the initial computational graph based on the node type corresponding to each computing power node to obtain a sub-computational graph corresponding to each computing power node; distributing the sub-computational graph corresponding to each computing power node, together with the equivalent operator lookup table, to that computing power node, so that each computing power node performs operator replacement on its assigned sub-computational graph according to the lookup table to obtain a target sub-computational graph, where each operator in the target sub-computational graph is the operator with the shortest execution time; and acquiring the target sub-computational graph returned by each computing power node and determining the target computational graph from all the target sub-computational graphs.
In some embodiments, when splitting the initial computational graph based on the node type corresponding to each computing power node to obtain the sub-computational graph corresponding to each computing power node, the processor 101 is configured to implement:
classifying all the computing power nodes in the computer cluster to obtain at least one computing power node set, where all the computing power nodes in each set belong to the same node type; and splitting the initial computational graph according to the number of nodes in each computing power node set to obtain the sub-computational graph corresponding to each computing power node in each set.
In some embodiments, when determining the target computational graph from all the target sub-computational graphs, the processor 101 is configured to implement:
merging the target sub-computational graphs corresponding to the computing power nodes in each computing power node set to obtain a candidate computational graph for each set; determining the total execution time of each candidate computational graph according to the execution times of the target sub-computational graphs within it; and determining the target computational graph as the candidate computational graph with the minimum total execution time.
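As an illustrative sketch (not part of the patent text), the merge-and-select logic above can be expressed as follows; the function and data names (`pick_target_graph`, `candidates`, the operator labels) are hypothetical, and each candidate is simplified to a flat operator list paired with its measured time:

```python
def pick_target_graph(candidates):
    """candidates: {node_type: list of (target_sub_graph, execution_time)}.

    Merge each node type's target sub-graphs into one candidate graph,
    sum the per-sub-graph execution times, and return the candidate
    with the smallest total execution time.
    """
    best_type, best_graph, best_total = None, None, float("inf")
    for node_type, parts in candidates.items():
        merged = [op for sub_graph, _ in parts for op in sub_graph]  # candidate graph
        total = sum(t for _, t in parts)                             # total execution time
        if total < best_total:
            best_type, best_graph, best_total = node_type, merged, total
    return best_type, best_graph, best_total

candidates = {
    "ASIC": [(["op1a"], 1.5), (["op2a"], 2.0)],  # total 3.5
    "NPU":  [(["op1n"], 1.0), (["op2n"], 1.2)],  # total 2.2
}
winner = pick_target_graph(candidates)
```

Here the NPU candidate wins because its summed sub-graph times are the smallest.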
In some embodiments, the sub-computational graph includes at least one initial operator, and the processor 101 is further configured to implement:
upon receiving the sub-computational graph and the equivalent operator lookup table sent by the management node, executing the sub-computational graph and recording a first execution time for each initial operator in it; sequentially performing operator replacement, based on the lookup table, on the initial operators in the sub-computational graph that satisfy a preset replacement condition; after each replacement, determining a second execution time for the substituted equivalent operator, and if the second execution time is smaller than the first execution time, retaining the sub-computational graph with the replacement applied, whereas if the second execution time is greater than or equal to the first execution time, cancelling the replacement; after all replacements on the sub-computational graph are complete, determining the target sub-computational graph from the resulting sub-computational graph; and returning the target sub-computational graph to the management node.
In some embodiments, the replacement condition includes: the initial operator in the sub-computational graph has a corresponding equivalent operator in the lookup table, and the execution time of that equivalent operator recorded in the lookup table is smaller than the first execution time of the initial operator; or the initial operator has a corresponding equivalent operator in the lookup table, but the lookup table does not yet record an execution time for that equivalent operator.
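The keep-if-faster replacement loop described above can be sketched as follows. This is an illustrative simplification in which the first and second execution times are supplied as a dictionary rather than measured on hardware, and all names (`optimize_subgraph`, the operator labels) are hypothetical:

```python
def optimize_subgraph(sub_graph, lookup, timings):
    """Greedily replace each operator with its equivalent when that is faster.

    sub_graph: list of operator names (a flattened sub-computational graph).
    lookup:    {initial_op: equivalent_op} equivalence table.
    timings:   {op: execution time in seconds} (first/second execution times).
    """
    result = []
    for op in sub_graph:
        equiv = lookup.get(op)
        if equiv is not None and timings[equiv] < timings[op]:
            result.append(equiv)  # keep the replacement: the equivalent ran faster
        else:
            result.append(op)     # no equivalent, or it was not faster: cancel
    return result

timings = {"matmul_then_T": 3.0, "T_then_matmul": 2.0, "add": 1.0}
lookup = {"matmul_then_T": "T_then_matmul"}
optimized = optimize_subgraph(["matmul_then_T", "add"], lookup, timings)
```

The faster equivalent replaces the first operator, while `add`, which has no entry in the table, is left unchanged.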
In some embodiments, when determining the second execution time corresponding to the equivalent operator after each replacement, the processor 101 is configured to implement:
sequentially taking each substituted equivalent operator as the current equivalent operator; if the lookup table records an execution time for the current equivalent operator, taking that recorded time as the second execution time; and if the lookup table records no execution time for the current equivalent operator, executing the current equivalent operator and recording the measured time as its second execution time.
In some embodiments, after determining the second execution time corresponding to the equivalent operator after each replacement, the processor 101 is further configured to implement:
adding the first execution time of the initial operator before each replacement and the second execution time of the equivalent operator after the replacement into the lookup table to obtain a lookup table with the execution times added; and sending this updated lookup table to the management node so that the management node can synchronize it to the other computing power nodes.
In some embodiments, after sequentially performing operator replacement on the initial operators in the sub-computational graph that satisfy the preset replacement condition, the processor 101 is further configured to implement:
determining the storage space required by the equivalent operator after each replacement; if the storage space is larger than a preset storage space threshold, marking that equivalent operator as an anomalous operator in the lookup table and cancelling the replacement; and sending the marked lookup table to the management node so that the management node can synchronize it to the other computing power nodes.
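A minimal sketch of this storage-space check, under the assumption of a hypothetical per-operator byte budget (`STORAGE_THRESHOLD`) and a lookup table represented as a plain dictionary with an `"anomalous"` marker set:

```python
STORAGE_THRESHOLD = 1024  # bytes; hypothetical per-operator storage budget

def check_replacement(equiv_op, required_bytes, lookup_table):
    """After a replacement, mark the equivalent operator as anomalous and
    cancel the replacement if it needs more storage than the threshold."""
    if required_bytes > STORAGE_THRESHOLD:
        lookup_table.setdefault("anomalous", set()).add(equiv_op)  # mark in the table
        return False  # cancel this replacement
    return True       # keep it

table = {}
kept = check_replacement("huge_fused_op", 4096, table)
```

The marked table would then be sent back to the management node, so other computing power nodes can skip trying the same oversized operator.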
Some embodiments of the present application are described in detail below with reference to the accompanying drawings. The following embodiments and their features may be combined with each other where no conflict arises. Referring to FIG. 3, FIG. 3 is a schematic flowchart of a computational graph processing method based on federated learning provided in an embodiment of the present application. As shown in FIG. 3, the method may include steps S201 to S204.
Step S201, acquiring a federated learning task, the federated learning task including an initial computational graph to be processed and an equivalent operator lookup table.
It should be noted that, in the embodiments of the present application, a user may issue a federated learning task through the management node, so that the management node optimizes the computational graph according to the task and obtains a target computational graph.
For example, the management node may acquire a federated learning task that includes an initial computational graph to be processed and an equivalent operator lookup table. The initial computational graph comprises at least one initial operator, and the lookup table records one or more equivalent operators for at least one initial operator.
The initial computational graph to be processed is the computational graph that needs optimization. A computational graph is a graph describing the structure of a computation; its elements are nodes and edges, where nodes represent variables (scalars, vectors, tensors, and so on) and each edge represents an operation, i.e., a function.
Exemplarily, the lookup table may include multiple pairs of mathematically equivalent operators, referred to simply as equivalent operators. For example, the operator "multiply matrix A by matrix B, then transpose the result" and the operator "multiply the transpose of B by the transpose of A" form a pair of equivalent operators, since (AB)ᵀ = BᵀAᵀ; either may serve as the original operator with the other as its equivalent, and vice versa. On hardware of different node types, these two operators run at different speeds, so the operator that runs faster on the specific hardware needs to be identified.
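The identity behind this example, (AB)ᵀ = BᵀAᵀ, can be checked numerically. The sketch below uses small pure-Python helpers rather than any particular tensor library, so the helper names are illustrative only:

```python
def matmul(A, B):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    """Transpose a matrix given as nested lists."""
    return [list(row) for row in zip(*M)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

original = transpose(matmul(A, B))               # "multiply, then transpose" operator
equivalent = matmul(transpose(B), transpose(A))  # "Bᵀ times Aᵀ" operator

assert original == equivalent  # the two operators are mathematically equivalent
```

Although the two sides always produce the same result, their cost on real hardware differs (e.g. one side avoids materializing the intermediate product before transposing), which is exactly why the lookup table records per-hardware execution times.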
Step S202, splitting the initial computational graph based on the node type corresponding to each computing power node to obtain the sub-computational graph corresponding to each computing power node.
For example, after the initial computational graph and the equivalent operator lookup table are obtained, in order to measure the differing speeds at which computing power nodes of different node types execute operators, the initial computational graph needs to be split by node type to obtain the sub-computational graphs for the computing power nodes of each node type. The execution times of the optimized target sub-computational graphs produced by the computing power nodes of the same node type can then be aggregated into a total execution time per node type, from which the fastest node type can be determined.
In some embodiments, the initial computational graph may be split based on the node type corresponding to each computing power node to obtain the sub-computational graph corresponding to each computing power node.
By way of example, the node types and the number of computing power nodes in the computer cluster may be counted to obtain a node list recording the node types and node counts. The initial computational graph is then split according to the number of computing power nodes of the same node type in the list, yielding the sub-computational graphs for the computing power nodes of that node type.
The node types may include, but are not limited to, CPU, DSP, ASIC, FPGA, NPU, and TPU.
In the above embodiment, splitting the initial computational graph based on the node type corresponding to each computing power node yields a sub-computational graph for each computing power node of the same node type; this fully accounts for the speed at which computing power nodes of different node types execute operators and can effectively improve the subsequent optimization of the computational graph.
Referring to FIG. 4, FIG. 4 is a schematic flowchart of the sub-steps of splitting the computational graph provided in an embodiment of the present application; the splitting of the initial computational graph in step S202 may include the following steps S301 and S302.
Step S301, classifying all the computing power nodes in the computer cluster to obtain at least one computing power node set, where all the computing power nodes in each set belong to the same node type.
For example, each computing power node in the computer cluster may be classified by node type, with nodes of the same type placed into the same computing power node set, yielding at least one such set.
For example, computing power node set 1 may include 5 computing power nodes of the ASIC node type, set 2 may include 3 nodes of the NPU node type, set 3 may include 8 nodes of the TPU node type, and so on.
Step S302, splitting the initial computational graph according to the number of nodes in each computing power node set to obtain the sub-computational graph corresponding to each computing power node in each set.
For example, after classifying the computing power nodes in the computer cluster into at least one computing power node set, the initial computational graph may be split according to the number of nodes in each set, yielding a sub-computational graph for each computing power node in each set.
For example, since computing power node set 1 includes 5 computing power nodes of the ASIC node type, the initial computational graph may be split into 5 sub-computational graphs, one for each of those 5 nodes.
For another example, since computing power node set 2 includes 3 computing power nodes of the NPU node type, the initial computational graph may be split into 3 sub-computational graphs, one for each of those 3 nodes.
In the above embodiment, the computing power nodes in the computer cluster are classified into at least one computing power node set, and the initial computational graph is split according to the number of nodes in each set; the initial computational graph can thus be optimized by multiple computing power nodes of each node type, and the target computational graph with the shortest execution time can then be selected from the optimized results.
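As an illustrative sketch of steps S301 and S302 (function and node names are hypothetical, and the computational graph is simplified to a flat operator list split into near-equal chunks):

```python
from collections import defaultdict

def classify_nodes(nodes):
    """Step S301: group computing power nodes into sets by node type."""
    sets = defaultdict(list)
    for node_id, node_type in nodes:
        sets[node_type].append(node_id)
    return dict(sets)

def split_graph(operators, n_parts):
    """Step S302: split an operator list into n_parts near-equal sub-graphs."""
    k, r = divmod(len(operators), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        size = k + (1 if i < r else 0)  # first r parts take one extra operator
        parts.append(operators[start:start + size])
        start += size
    return parts

cluster = [("s1", "ASIC"), ("s2", "ASIC"), ("s3", "NPU"), ("s4", "NPU"), ("s5", "NPU")]
node_sets = classify_nodes(cluster)
graph = [f"op{i}" for i in range(10)]
sub_graphs = {t: split_graph(graph, len(members)) for t, members in node_sets.items()}
```

Each node type receives the whole graph split across its own nodes, so every type's set independently optimizes a full copy of the computation, as the embodiment describes.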
Step S203, distributing the sub-computational graph and the equivalent operator comparison table corresponding to each computing power node to the corresponding computing power node, so that each computing power node performs operator replacement on the distributed sub-computational graph according to the equivalent operator comparison table to obtain a target sub-computational graph, where the operators in the target sub-computational graph are the operators with the shortest execution time.
For example, after the initial computational graph is split based on the node type corresponding to each computing power node to obtain the sub-computational graph corresponding to each computing power node, the sub-computational graph and the equivalent operator comparison table may be distributed to each corresponding computing power node, so that each computing power node performs operator replacement on the distributed sub-computational graph according to the equivalent operator comparison table to obtain a target sub-computational graph, where the operators in the target sub-computational graph are the operators with the shortest execution time.
For example, for the 5 computing power nodes of the ASIC node type in computing power node set 1 and the 3 computing power nodes of the NPU node type in computing power node set 2, the sub-computational graphs and equivalent operator comparison tables corresponding to these 8 computing power nodes may be distributed to the corresponding 8 computing power nodes.
For example, after receiving its sub-computational graph and the equivalent operator comparison table, each computing power node may perform operator replacement on the distributed sub-computational graph according to the equivalent operator comparison table to obtain a target sub-computational graph, where the operators in the target sub-computational graph are the operators with the shortest execution time.
According to the above embodiment, the sub-computational graph and the equivalent operator comparison table corresponding to each computing power node are distributed to the corresponding computing power node, so that each computing power node performs operator replacement on its distributed sub-computational graph according to the equivalent operator comparison table. The optimization of the computational graph can thus be carried out in parallel and cooperatively across multiple computing power nodes, which greatly improves the efficiency of optimizing the computational graph.
Step S204, obtaining the target sub-computational graph returned by each computing power node, and determining the target computational graph according to all the target sub-computational graphs.
In some embodiments, after the sub-computational graph and the equivalent operator comparison table corresponding to each computing power node are distributed to the corresponding computing power node, the target sub-computational graph returned by each computing power node may be received, and the target computational graph may be determined from all the target sub-computational graphs.
For example, the target sub-computational graphs corresponding to computing power nodes of the same node type may be merged to obtain multiple candidate computational graphs, and the candidate computational graphs may then be screened according to their total execution times to obtain a complete target computational graph.
Referring to fig. 5, fig. 5 is a schematic flowchart of the sub-steps of determining the target computational graph provided in an embodiment of the present application. Determining the target computational graph in step S204 according to all the target sub-computational graphs may include the following steps S401 to S403.
Step S401, merging the target sub-computational graphs corresponding to the computing power nodes in each computing power node set, to obtain the candidate computational graph corresponding to each computing power node set.
For example, for computing power node set 1, the target sub-computational graphs corresponding to the 5 computing power nodes of the ASIC node type may be merged to obtain candidate computational graph a. For computing power node set 2, the target sub-computational graphs corresponding to the 3 computing power nodes of the NPU node type may be merged to obtain candidate computational graph b. By analogy, a candidate computational graph is obtained for each computing power node set.
Step S402, determining the total execution time of each candidate computational graph according to the execution times of the target sub-computational graphs in that candidate computational graph.
For example, the execution times of the target sub-computational graphs in each candidate computational graph may be added to obtain the total execution time of that candidate computational graph. For example, the execution times of the target sub-computational graphs in candidate computational graph a may be added to obtain its total execution time, denoted as T1; likewise, the execution times of the target sub-computational graphs in candidate computational graph b may be added to obtain its total execution time, denoted as T2.
Step S403, determining the target computational graph according to the candidate computational graph with the minimum total execution time.
For example, after the total execution time of each candidate computational graph is determined, the candidate computational graph with the minimum total execution time may be determined as the target computational graph. For example, if the total execution time of candidate computational graph a is the minimum, candidate computational graph a may be determined as the target computational graph.
According to the above embodiment, the total execution time of each candidate computational graph is determined from the execution times of its target sub-computational graphs, and the candidate computational graph with the minimum total execution time is determined as the target computational graph. This ensures that the optimized target computational graph has the shortest execution time, improving the optimization effect.
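A minimal sketch of the selection in steps S402 and S403 follows; it assumes, for illustration only, that each candidate computational graph is represented as a list of (target sub-computational graph, execution time) pairs.

```python
# Illustrative sketch of steps S402-S403; representation and names are assumptions.
def pick_target_graph(candidates):
    """Sum the sub-graph times per candidate and return the one with the minimum total."""
    totals = {name: sum(t for _, t in subs) for name, subs in candidates.items()}
    best = min(totals, key=totals.get)
    return best, totals[best]

candidates = {
    "a": [("sub1", 1.2), ("sub2", 0.8), ("sub3", 1.0)],  # ASIC set, total T1 = 3.0
    "b": [("sub1", 1.5), ("sub2", 1.9)],                 # NPU set,  total T2 = 3.4
}
best, total = pick_target_graph(candidates)  # candidate "a" has the smaller total
```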
Referring to fig. 6, fig. 6 is a schematic flowchart of another computing graph processing method based on federal learning according to an embodiment of the present application, which may include the following steps S501 to S505.
Step S501, executing the sub-computation graph when receiving the sub-computation graph and the equivalent operator comparison table sent by the management node, and recording a first execution time corresponding to each initial operator in the sub-computation graph.
It should be noted that, in the embodiment of the present application, the management node may be used to control other computing nodes to perform operator replacement, and the management node may also be used as a computing node to perform operator replacement according to the sub-computation graphs sent by other management nodes. How the computational graph processing is performed will be described below from the perspective of the computational power node.
In some embodiments, the computing power node executes the sub-computation graph when receiving the sub-computation graph and the equivalent operator comparison table sent by the management node, and records a first execution time corresponding to each initial operator in the sub-computation graph.
It should be noted that the computing power node may execute each initial operator in the sub-computational graph on its local data, and record the first execution time corresponding to each initial operator.
In the above embodiment, by executing the sub-computation graph and recording the first execution time corresponding to each initial operator in the sub-computation graph, it may be subsequently determined whether to execute the operator replacement according to the first execution time corresponding to each initial operator.
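The profiling in step S501 can be sketched as follows, under the illustrative assumption that each initial operator is a zero-argument callable executed on local data; none of these names come from the patent.

```python
# Illustrative sketch of step S501: record each initial operator's first execution time.
import time

def profile_subgraph(operators):
    """operators: {name: zero-argument callable}. Returns {name: seconds}."""
    first_times = {}
    for name, fn in operators.items():
        start = time.perf_counter()
        fn()                                          # execute on local data
        first_times[name] = time.perf_counter() - start
    return first_times

sub_graph = {"matmul": lambda: sum(i * i for i in range(10_000))}
first_times = profile_subgraph(sub_graph)
```

These recorded first execution times are the baseline against which replaced equivalent operators are later compared.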
Step S502, performing operator replacement in sequence, based on the equivalent operator comparison table, on the initial operators in the sub-computational graph that meet a preset replacement condition.
For example, operator replacement may be performed in sequence on the initial operators in the sub-computational graph that meet the preset replacement condition, based on the equivalent operator comparison table.
The replacement condition includes: the initial operator in the sub-computational graph has a corresponding equivalent operator in the equivalent operator comparison table, and the execution time of the equivalent operator recorded in the comparison table is less than the first execution time of the corresponding initial operator; or, the initial operator in the sub-computational graph has a corresponding equivalent operator in the comparison table, and the execution time of that equivalent operator is not recorded in the comparison table.
It should be noted that when the execution time of the equivalent operator recorded in the equivalent operator comparison table is greater than or equal to the first execution time of the corresponding initial operator, the operator replacement operation is not executed.
For example, consider the initial operator "multiply matrix A by matrix B, then transpose the result matrix" and its mathematically equivalent operator "multiply the transpose of matrix B by the transpose of matrix A", per the identity (AB)^T = B^T A^T. If this equivalent operator exists in the comparison table and its recorded execution time is less than the first execution time of the initial operator, the initial operator meets the replacement condition. Likewise, if the comparison table does not record the execution time of the equivalent operator, the initial operator also meets the replacement condition.
Conversely, if the execution time of the equivalent operator "multiply the transpose of matrix B by the transpose of matrix A" is greater than or equal to the first execution time of the initial operator "multiply matrix A by matrix B, then transpose the result matrix", the initial operator does not meet the replacement condition.
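The equivalence relied on in this example is the identity (AB)^T = B^T A^T, which can be checked with a small plain-Python sketch (the list-of-lists matrix representation is illustrative):

```python
# Verify the transpose identity (AB)^T == B^T A^T on small concrete matrices.
def matmul(A, B):
    """Plain-Python matrix multiply (illustrative; a real operator would be optimized)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(row) for row in zip(*M)]

A = [[1, 2], [3, 4], [5, 6]]       # 3x2
B = [[7, 8, 9], [10, 11, 12]]      # 2x3

initial = transpose(matmul(A, B))                 # multiply, then transpose the result
equivalent = matmul(transpose(B), transpose(A))   # B^T x A^T

assert initial == equivalent
```

Both forms produce the same matrix, so replacing one with the other never changes the computed result, only the execution time.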
Illustratively, performing operator replacement on the initial operators meeting the preset replacement condition in the sub-computation graph in sequence based on the equivalent operator comparison table may include: sequentially determining initial operators meeting replacement conditions in the sub-calculation graphs as current operators; obtaining a target equivalent operator corresponding to the current operator from an equivalent operator comparison table; replacing the current operator in the sub-computation graph according to the target equivalent operator.
According to the embodiment, the operator replacement is sequentially carried out on the initial operators meeting the preset replacement conditions in the sub-calculation graphs, so that the execution time of the sub-calculation graphs can be shortened, the optimization processing of the sub-calculation graphs is realized, and the optimal sub-calculation graphs are obtained.
Step S503, after each operator replacement operation, determining the second execution time corresponding to the replaced equivalent operator; if the second execution time is less than the first execution time, retaining the sub-computational graph resulting from this operator replacement operation, and if the second execution time is greater than or equal to the first execution time, canceling this operator replacement operation.
For example, after each operator replacement operation, the second execution time corresponding to the replaced equivalent operator may be determined, and whether to retain the sub-computational graph resulting from this operator replacement operation is decided according to the relation between the second execution time of the equivalent operator and the first execution time of the corresponding initial operator.
In some embodiments, if the second execution time is less than the first execution time, the sub-computational graph resulting from this operator replacement operation is retained.
For example, if the second execution time of the equivalent operator "multiply the transpose of matrix B by the transpose of matrix A" is less than the first execution time of the initial operator "multiply matrix A by matrix B, then transpose the result matrix", the sub-computational graph resulting from this replacement is retained; that is, the replacement of the initial operator by the equivalent operator in the sub-computational graph is confirmed.
In other embodiments, if the second execution time is greater than or equal to the first execution time, this operator replacement operation is canceled.
For example, if the second execution time of the equivalent operator "multiply the transpose of matrix B by the transpose of matrix A" is greater than or equal to the first execution time of the initial operator "multiply matrix A by matrix B, then transpose the result matrix", the sub-computational graph resulting from this replacement is not retained; that is, the replacement of the initial operator by the equivalent operator in the sub-computational graph is canceled.
In the above embodiment, the sub-computational graph resulting from an operator replacement operation is retained only when the second execution time is less than the first execution time, and the operation is canceled otherwise. This ensures that each operator replacement yields a sub-computational graph with a shorter execution time, so that a better sub-computational graph is obtained.
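The accept/reject rule of step S503 can be sketched as follows; the list-based sub-computational graph and the measure() callback are illustrative assumptions, not the patent's data structures.

```python
# Illustrative sketch of step S503: keep the replacement only if strictly faster.
def try_replace(sub_graph, index, equivalent_op, first_time, measure):
    """measure(op) returns the operator's execution time in seconds."""
    original = sub_graph[index]
    sub_graph[index] = equivalent_op
    second_time = measure(equivalent_op)
    if second_time >= first_time:      # not faster: cancel this replacement
        sub_graph[index] = original
        return False
    return True                        # faster: retain the replaced sub-graph

graph = ["(AxB)^T"]
kept = try_replace(graph, 0, "B^T x A^T", first_time=1.0, measure=lambda op: 0.4)
# kept is True and graph is now ["B^T x A^T"]
```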
In some embodiments, determining the second execution time corresponding to the replaced equivalent operator may include: sequentially determining each replaced equivalent operator as the current equivalent operator; if the equivalent operator comparison table records the execution time of the current equivalent operator, determining that recorded execution time as the second execution time of the current equivalent operator; and if the comparison table does not record the execution time of the current equivalent operator, executing the current equivalent operator and recording its second execution time.
It should be noted that, in the embodiment of the present application, when the execution time corresponding to the current equivalent operator is recorded in the equivalent operator comparison table, the execution time corresponding to the current equivalent operator recorded in the equivalent operator comparison table may be directly determined as the second execution time corresponding to the current equivalent operator without executing the current equivalent operator. And when the execution time corresponding to the current equivalent operator is not recorded in the equivalent operator comparison table, the current equivalent operator needs to be executed, the execution time corresponding to the current equivalent operator is recorded, and the second execution time corresponding to the current equivalent operator is obtained.
According to the above embodiment, when the execution time of the current equivalent operator is recorded in the equivalent operator comparison table, the recorded time is taken directly as the second execution time. This avoids having to execute the current equivalent operator merely to obtain its second execution time, improving processing efficiency.
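The cache behavior described above can be sketched as follows, with an illustrative dictionary standing in for the equivalent operator comparison table; the function and callback names are assumptions.

```python
# Illustrative sketch: reuse a recorded execution time, otherwise run once and record.
def second_execution_time(op, table, run):
    """table: {operator: recorded seconds}; run(op) executes op and returns seconds."""
    if op in table:
        return table[op]       # recorded: no need to execute the operator
    table[op] = run(op)        # not recorded: execute once and record
    return table[op]

calls = []
table = {"B^T x A^T": 0.4}
t_cached = second_execution_time("B^T x A^T", table, lambda op: calls.append(op) or 9.9)
t_new = second_execution_time("C+D", table, lambda op: calls.append(op) or 0.7)
# t_cached reuses the recorded 0.4 without executing; t_new executes and records 0.7
```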
Step S504, after the operator replacement operations on the sub-computational graph are completed, determining the target sub-computational graph according to the sub-computational graph after operator replacement.
For example, after operator replacement has been completed for every initial operator in the sub-computational graph that meets the replacement condition, the resulting sub-computational graph may be determined as the target sub-computational graph.
It should be noted that the sub-computational graph after operator replacement is the optimized sub-computational graph, which may therefore be determined as the target sub-computational graph.
Step S505, returning the target sub-computational graph to the management node.
For example, after the target sub-computational graph is determined, it may be returned to the management node, so that the management node can determine the target computational graph from the target sub-computational graphs returned by the computing power nodes. For the specific process of determining the target computational graph, reference may be made to the detailed description of the above embodiments, which is not repeated here.
According to the above embodiment, returning the target sub-computational graphs to the management node enables the management node to determine the target computational graph from the graphs returned by each computing power node. The optimization of the computational graph can thus be carried out in parallel and cooperatively across multiple computing power nodes, improving the optimization efficiency.
Referring to fig. 7, fig. 7 is a schematic flowchart of the sub-steps of adding execution times to the equivalent operator comparison table according to an embodiment of the present application, which may include the following steps S601 to S603.
Step S601, determining a second execution time corresponding to the equivalent operator after each replacement.
It is understood that the step S601 is the same as the step S503, and will not be described herein.
Step S602, adding the first execution time of the initial operator before each replacement and the second execution time of the equivalent operator after each replacement to the equivalent operator comparison table, to obtain an equivalent operator comparison table with the execution times added.
For example, after the second execution time of each replaced equivalent operator is determined, the first execution time of the initial operator before each replacement and the second execution time of the equivalent operator after each replacement may be added to the equivalent operator comparison table.
Step S603, sending the equivalent operator comparison table with the execution times added to the management node, so that the management node synchronizes it to the other computing power nodes.
For example, the computing power node may send the equivalent operator comparison table with the execution times added to the management node, so that the management node synchronizes it to the other computing power nodes.
It should be noted that when the management node synchronizes the table to the other computing power nodes, those nodes can update their local equivalent operator comparison tables accordingly, so that each local table records the execution time of each equivalent operator. This makes it convenient, during operator replacement, to quickly decide whether to replace an initial operator according to the recorded execution times of the equivalent operators.
According to the above embodiment, the equivalent operator comparison table with the execution times added is sent to the management node, and the management node synchronizes it to the other computing power nodes. When those nodes perform operator replacement, they can decide whether to replace an operator directly from this table, without measuring the execution time of the replaced equivalent operator again, which effectively improves the efficiency of operator replacement.
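How another computing power node might update its local table from the synchronized one can be sketched as follows; keeping the smaller recorded time when both tables contain one is an illustrative policy, not specified here.

```python
# Illustrative sketch of updating a local comparison table from a synchronized one.
def merge_tables(local, synchronized):
    """Copy in newly recorded times; keep the smaller time when both record one
    (an assumed policy for this sketch)."""
    for op, t in synchronized.items():
        if op not in local or t < local[op]:
            local[op] = t
    return local

local = {"B^T x A^T": 0.6}
synchronized = {"B^T x A^T": 0.4, "C+D": 0.7}
merge_tables(local, synchronized)   # local now records both operators' times
```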
Referring to fig. 8, fig. 8 is a schematic flowchart of a sub-step of marking an abnormal operator in an equivalent operator comparison table according to an embodiment of the present application, which may include the following steps S701 to S704.
Step S701, performing operator replacement in sequence, based on the equivalent operator comparison table, on the initial operators in the sub-computational graph that meet the preset replacement condition.
It is understood that the step S701 is the same as the step S502, and will not be described herein.
Step S702, determining the storage space required by the equivalent operator after each operator replacement.
For example, the equivalent operator after each operator substitution may be performed and the memory space used to perform the equivalent operator may be read.
Step S703, if the storage space is greater than a preset storage space threshold, marking the equivalent operator of this replacement as an abnormal operator in the equivalent operator comparison table, and canceling this operator replacement operation.
For example, after the storage space required by the replaced equivalent operator is determined, if that storage space is greater than a preset storage space threshold, the equivalent operator of this replacement is marked as an abnormal operator in the equivalent operator comparison table.
The preset storage space threshold may be determined according to the maximum storage space the computing power node can provide.
Illustratively, if the storage space is greater than the preset storage space threshold, this operator replacement operation is also canceled.
It should be noted that an abnormal operator is treated as having an infinite execution time. Because the storage space required to execute the replaced equivalent operator exceeds the maximum storage space the computing power node can provide, executing it would affect the node's performance. To avoid this, the replaced equivalent operator is marked as an abnormal operator and this operator replacement operation is canceled.
According to the above embodiment, when the storage space is greater than the preset threshold, the replaced equivalent operator is marked as an abnormal operator in the comparison table and the replacement is canceled. This prevents the replaced equivalent operator from degrading the performance of the computing power node, ensuring its running performance.
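The memory check of steps S702 and S703 can be sketched as follows; the function name, byte units, and threshold are illustrative assumptions, with the abnormal operator recorded as an infinite execution time.

```python
# Illustrative sketch of steps S702-S703: mark over-budget operators as abnormal.
ABNORMAL = float("inf")   # abnormal operator: treated as infinite execution time

def check_memory(table, op, required_bytes, threshold_bytes):
    """Return False (replacement canceled) and mark op abnormal if it needs more
    memory than the node's threshold; otherwise the replacement may be retained."""
    if required_bytes > threshold_bytes:
        table[op] = ABNORMAL      # mark as abnormal operator in the comparison table
        return False              # cancel this operator replacement
    return True

table = {}
ok = check_memory(table, "huge_op", required_bytes=2**34, threshold_bytes=2**33)
# ok is False and table["huge_op"] is infinite, so this operator is never chosen again
```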
Step S704, sending the marked equivalent operator comparison table to the management node, so that the management node synchronizes it to the other computing power nodes.
For example, after the replaced equivalent operator has been marked as an abnormal operator and this operator replacement operation has been canceled, the computing power node may send the marked equivalent operator comparison table to the management node, so that the management node synchronizes it to the other computing power nodes.
It should be noted that the other computing power nodes may update their local equivalent operator comparison tables according to the marked table, so that the abnormal operator is also marked in each local table and operator replacement based on the abnormal operator is avoided.
According to the above embodiment, sending the marked equivalent operator comparison table to the management node enables the management node to synchronize it to the other computing power nodes, preventing those nodes from performing operator replacement based on the abnormal operator.
The embodiment of the application also provides a computer readable storage medium, the computer readable storage medium stores a computer program, the computer program comprises program instructions, and a processor executes the program instructions to realize any computing graph processing method based on federal learning. For example, the computer program is loaded by a processor, the following steps may be performed:
Acquiring a federation learning task, wherein the federation learning task comprises an initial calculation graph to be processed and an equivalent operator comparison table; splitting the initial calculation graph based on the node type corresponding to each calculation node to obtain a sub calculation graph corresponding to each calculation node; distributing the sub-calculation graphs corresponding to each calculation node and the equivalent operator comparison table to each corresponding calculation node so that each calculation node can perform operator replacement on the distributed sub-calculation graphs according to the equivalent operator comparison table to obtain a target sub-calculation graph, wherein operators in the target sub-calculation graph are operators with shortest execution time; and acquiring a target sub-calculation graph returned by each calculation power node, and determining the target calculation graph according to all the target sub-calculation graphs.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
The computer readable storage medium may be an internal storage unit of the computer device of the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of a computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital Card (SD), a Flash memory Card (Flash Card), etc. which are provided on the computer device.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any equivalent modifications or substitutions will be apparent to those skilled in the art within the scope of the present application, and these modifications or substitutions should be covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. A method of processing a computational graph based on federal learning, applied to nodes in a computer cluster, the computer cluster including at least one management node and at least one computational power node, the method comprising:
acquiring a federation learning task, wherein the federation learning task comprises an initial calculation graph to be processed and an equivalent operator comparison table;
splitting the initial computational graph based on the node type corresponding to each computational power node to obtain a sub computational graph corresponding to each computational power node;
distributing a sub-calculation graph corresponding to each calculation node and the equivalent operator comparison table to each corresponding calculation node, so that each calculation node can replace the distributed sub-calculation graph according to the equivalent operator comparison table to obtain a target sub-calculation graph, wherein an operator in the target sub-calculation graph is an operator with shortest execution time, and the equivalent operator comparison table comprises a plurality of pairs of mathematically equivalent operators;
Acquiring the target sub-calculation graphs returned by each calculation node, and determining a target calculation graph according to all the target sub-calculation graphs;
splitting the initial computational graph based on the node type corresponding to each computational power node to obtain a sub-computational graph corresponding to each computational power node, including: classifying all the computing power nodes in the computer cluster to obtain at least one computing power node set, wherein all the computing power nodes in each computing power node set belong to the same node type; splitting the initial computational graph according to the number of nodes in each computational power node set to obtain sub computational graphs corresponding to the computational power nodes in each computational power node set;
the determining the target calculation graph according to all the target sub calculation graphs comprises the following steps: combining the target sub-calculation graphs corresponding to all the calculation nodes of the same node type in each calculation node set to obtain candidate calculation graphs corresponding to each calculation node set; adding the execution time of each target sub-calculation graph in each candidate calculation graph to obtain the corresponding total execution time of each candidate calculation graph; and screening the total execution time of the candidate calculation graphs, and determining the candidate calculation graph corresponding to the minimum total execution time as the target calculation graph.
2. The computational graph processing method based on federated learning according to claim 1, wherein the sub-computational graph comprises at least one initial operator, and the method further comprises:
upon receiving a sub-computational graph and an equivalent operator comparison table sent by the management node, executing the sub-computational graph and recording a first execution time for each initial operator in the sub-computational graph;
sequentially performing operator replacement, based on the equivalent operator comparison table, on the initial operators in the sub-computational graph that satisfy a preset replacement condition;
after each operator replacement operation, determining a second execution time for the equivalent operator substituted in that operation; if the second execution time is less than the first execution time, retaining the sub-computational graph produced by the replacement, and if the second execution time is greater than or equal to the first execution time, canceling the replacement operation;
after operator replacement on the sub-computational graph is completed, determining the target sub-computational graph from the sub-computational graph with the replacements applied; and
returning the target sub-computational graph to the management node.
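The greedy replace-and-measure loop claimed above can be sketched as follows. This is an assumption-laden illustration, not the patented implementation: `run` stands in for actually executing an operator and timing it, and the table maps an initial operator to one equivalent operator.

```python
def optimize_subgraph(subgraph, table, run):
    """Replace operators that have a faster equivalent; cancel slower replacements.

    subgraph: list of operator names; table: dict initial -> equivalent operator;
    run(op): executes one operator and returns its execution time.
    All names are hypothetical.
    """
    # first execution times, recorded while running the original sub-graph
    first_times = {op: run(op) for op in subgraph}
    optimized = list(subgraph)
    for i, op in enumerate(subgraph):
        eq = table.get(op)
        if eq is None:
            continue                      # no equivalent operator in the table
        second = run(eq)                  # second execution time after replacement
        if second < first_times[op]:
            optimized[i] = eq             # retain the replacement
        # otherwise the replacement is canceled and the initial operator kept
    return optimized
```

For example, with fake timings `{"matmul": 1.0, "fused_matmul": 0.5, "relu": 0.1}`, only `matmul` is replaced, since `fused_matmul` runs faster.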
3. The computational graph processing method based on federated learning according to claim 2, wherein the replacement condition comprises: an initial operator in the sub-computational graph has a corresponding equivalent operator in the equivalent operator comparison table, and the execution time of that equivalent operator recorded in the table is less than the first execution time of the initial operator; or
an initial operator in the sub-computational graph has a corresponding equivalent operator in the equivalent operator comparison table, and no execution time for that equivalent operator is recorded in the table.
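The two-branch condition of claim 3 reduces to a small predicate. A hypothetical sketch (function and argument names are mine, not the patent's):

```python
def satisfies_replacement_condition(op, first_time, table, recorded_times):
    """True if op should be considered for replacement under claim 3.

    table: dict initial operator -> equivalent operator;
    recorded_times: dict equivalent operator -> recorded execution time.
    """
    eq = table.get(op)
    if eq is None:
        return False                        # no equivalent operator at all
    if eq not in recorded_times:
        return True                         # equivalent exists, time not yet recorded
    return recorded_times[eq] < first_time  # recorded time beats the initial operator
```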
4. The computational graph processing method based on federated learning according to claim 2, wherein determining the second execution time for the equivalent operator after each replacement comprises:
sequentially taking each substituted equivalent operator as the current equivalent operator;
if the equivalent operator comparison table records an execution time for the current equivalent operator, determining that recorded execution time as the second execution time of the current equivalent operator; and
if the equivalent operator comparison table does not record an execution time for the current equivalent operator, executing the current equivalent operator and recording the measured time as its second execution time.
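The lookup-or-measure step of claim 4 is essentially a memoized timing query. A minimal sketch under the same hypothetical naming as before:

```python
def second_execution_time(eq_op, recorded_times, run):
    """Return the second execution time for an equivalent operator.

    Uses the time recorded in the comparison table if present; otherwise
    executes the operator once and records the measured time for later lookups.
    """
    if eq_op in recorded_times:
        return recorded_times[eq_op]   # cached time from the comparison table
    t = run(eq_op)                     # execute the equivalent operator to measure it
    recorded_times[eq_op] = t          # record the second execution time
    return t
```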
5. The computational graph processing method based on federated learning according to claim 2, wherein after determining the second execution time for each substituted equivalent operator, the method further comprises:
adding the first execution time of the initial operator before each replacement and the second execution time of the equivalent operator after each replacement to the equivalent operator comparison table, to obtain an equivalent operator comparison table with the execution times added; and
sending the equivalent operator comparison table with the execution times added to the management node, so that the management node synchronizes it to the other computing power nodes.
6. The computational graph processing method based on federated learning according to claim 2, wherein after sequentially performing operator replacement on the initial operators in the sub-computational graph that satisfy the preset replacement condition, the method further comprises:
determining the storage space required by the equivalent operator after each operator replacement;
if the storage space is greater than a preset storage space threshold, marking the substituted equivalent operator as an abnormal operator in the equivalent operator comparison table and canceling that replacement operation; and
sending the marked equivalent operator comparison table to the management node, so that the management node synchronizes it to the other computing power nodes.
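The storage guard of claim 6 can be sketched as a threshold check that marks an oversized equivalent operator as abnormal and cancels the replacement. The threshold value and all names are assumptions for illustration only:

```python
STORAGE_THRESHOLD = 1 << 20  # hypothetical 1 MiB limit on an operator's storage

def check_replacement_storage(eq_op, required_bytes, abnormal_ops):
    """Return True if the replacement is kept, False if it must be canceled.

    An equivalent operator needing more storage than the threshold is added
    to abnormal_ops (standing in for the mark in the comparison table).
    """
    if required_bytes > STORAGE_THRESHOLD:
        abnormal_ops.add(eq_op)   # mark as an abnormal operator
        return False              # cancel this replacement operation
    return True                   # replacement stays within the storage budget

abnormal = set()
ok = check_replacement_storage("big_fused_op", 2 << 20, abnormal)
```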
7. A computer device, comprising a memory and a processor;
wherein the memory is configured to store a computer program; and
the processor is configured to execute the computer program and, when executing the computer program, implement the computational graph processing method based on federated learning according to any one of claims 1 to 6.
8. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the computational graph processing method based on federated learning according to any one of claims 1 to 6.
CN202311385248.XA 2023-10-25 2023-10-25 Calculation graph processing method based on federal learning, computer equipment and storage medium Active CN117114091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311385248.XA CN117114091B (en) 2023-10-25 2023-10-25 Calculation graph processing method based on federal learning, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117114091A CN117114091A (en) 2023-11-24
CN117114091B true CN117114091B (en) 2024-03-05

Family

ID=88809589

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311385248.XA Active CN117114091B (en) 2023-10-25 2023-10-25 Calculation graph processing method based on federal learning, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117114091B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111309479A (en) * 2020-02-14 2020-06-19 北京百度网讯科技有限公司 Method, device, equipment and medium for realizing task parallel processing
CN112241321A (en) * 2020-09-24 2021-01-19 北京影谱科技股份有限公司 Computing power scheduling method and device based on Kubernetes
CN113031966A (en) * 2021-05-20 2021-06-25 之江实验室 Deep learning compilation optimization method for intelligently selecting compilation acceleration library
CN113886092A (en) * 2021-12-07 2022-01-04 苏州浪潮智能科技有限公司 Computation graph execution method and device and related equipment
CN114862656A (en) * 2022-05-18 2022-08-05 北京百度网讯科技有限公司 Method for acquiring training cost of distributed deep learning model based on multiple GPUs
CN115169587A (en) * 2022-09-02 2022-10-11 第四范式(北京)技术有限公司 Federal learning system and method and equipment for realizing multi-party combined processing task
CN115357356A (en) * 2022-08-10 2022-11-18 西安邮电大学 Method, device and medium for parallel scheduling among operators based on computational graph optimization
CN116432736A (en) * 2021-12-31 2023-07-14 华为技术有限公司 Neural network model optimization method and device and computing equipment
CN116909748A (en) * 2023-07-24 2023-10-20 中国电信股份有限公司技术创新中心 Computing power resource allocation method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11551144B2 (en) * 2018-01-30 2023-01-10 Deepmind Technologies Limited Dynamic placement of computation sub-graphs


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant