CN115587222B - Distributed graph calculation method, system and equipment


Info

Publication number
CN115587222B
CN115587222B (application CN202211587848.XA)
Authority
CN
China
Prior art keywords
graph
working node
working
active point
calculation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211587848.XA
Other languages
Chinese (zh)
Other versions
CN115587222A
Inventor
孟轲
耿亮
李雪
陶乾
于文渊
周靖人
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba China Co Ltd
Original Assignee
Alibaba China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba China Co Ltd filed Critical Alibaba China Co Ltd
Priority to CN202211587848.XA priority Critical patent/CN115587222B/en
Publication of CN115587222A publication Critical patent/CN115587222A/en
Application granted granted Critical
Publication of CN115587222B publication Critical patent/CN115587222B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/901Indexing; Data structures therefor; Storage structures
    • G06F16/9024Graphs; Linked lists
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5083Techniques for rebalancing the load in a distributed system
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a distributed graph calculation method, system and equipment. In the distributed graph computing system, a plurality of working nodes execute a graph computation task, performing multiple rounds of graph computation on the graph data iteratively to obtain a processing result. A work transfer mechanism is introduced: in each iteration of the graph computation task, if the load of the working nodes participating in the computation is unbalanced, an active point transfer strategy is generated according to the time cost required for work transfer between different working nodes; the active points that a first working node takes over from a second working node are determined according to the active point transfer strategy; and the first working node acquires the information of the transferred active points and completes the graph computation of the transferred active points in the current iteration. By adaptively balancing the load of the working nodes in each iteration through this dynamic work transfer method, the dynamic load balancing problem of the distributed graph computing system is solved, and the computing efficiency and performance of the distributed graph computing system are improved.

Description

Distributed graph calculation method, system and equipment
Technical Field
The present application relates to computer technologies, and in particular, to a distributed graph computation method, system, and device.
Background
Graph algorithms are widely used in application fields such as social network analysis, network routing, data mining, and the like. The process of performing analytical calculations on graph data based on graph algorithms to obtain valuable information is called graph computation. As graph data grows larger and larger, graph computing systems employ distributed graph computation, running graph algorithms/graph computation tasks in a distributed environment with multiple computing devices (e.g., Central Processing Units (CPUs), Graphics Processing Units (GPUs)) to accelerate graph computation, which requires balanced and efficient use of computing, memory, and communication resources across the devices.
In a distributed graph computing system, a large graph is usually divided into a plurality of subgraphs, each computing device holds one subgraph, and each subgraph is processed in a data parallel mode in the computing device, so that the problem of load imbalance of the multiple computing devices is easily caused, and the efficiency and the performance of the distributed graph computing system are poor.
Disclosure of Invention
The application provides a distributed graph computing method, a distributed graph computing system and distributed graph computing equipment, which are used for solving the problems of low efficiency and poor performance of the system caused by unbalanced load of different computing equipment in the distributed graph computing system.
In one aspect, the present application provides a distributed graph computation method applied to a decision control node in a distributed graph computation system, where the distributed graph computation system includes a plurality of work nodes, and the method includes:
acquiring graph calculation tasks to be executed and graph data of the graph calculation tasks, controlling the plurality of working nodes to execute the graph calculation tasks, performing multiple rounds of graph calculation on the graph data in an iteration mode, and outputting processing results of the graph calculation tasks; in each iteration of the graph calculation task, if the load of the working nodes participating in the calculation is determined to be unbalanced, determining a first active point transfer strategy according to the time cost required by the work transfer among different working nodes; and sending the first active point transfer strategy to the working node with the work transfer to control the working node with the work transfer to complete the graph calculation of the transferred active point according to the first active point transfer strategy.
In another aspect, the present application provides a distributed graph computation method, applied to a first work node in a distributed graph computation system, where the distributed graph computation system includes multiple work nodes, and the multiple work nodes execute a graph computation task in a distributed manner, and perform multiple rounds of graph computations on graph data iterations, where the method includes:
receiving a first active point transfer strategy, wherein the first active point transfer strategy is determined according to the time cost required by work transfer among different working nodes under the condition that the load of the working nodes participating in calculation is determined to be unbalanced, and the first active point transfer strategy comprises the following steps: the first working node transfers the information of the active points to be calculated from the second working node; in the iteration, the information of the transferred active points is obtained according to the first active point transfer strategy, and the graph calculation of the transferred active points is completed.
In another aspect, the present application provides a distributed graph computation method applied to a cloud server of an e-commerce platform, where the cloud server includes a plurality of working nodes, and the method includes:
acquiring a graph calculation task and a constructed consumer-product relation graph in an e-commerce scene; controlling the plurality of working nodes to execute the graph calculation task according to the consumer-product relation graph, performing multiple rounds of graph calculation on the consumer-product relation graph iteration, and outputting a processing result of the graph calculation task; in each iteration of the graph calculation task, if the load of the working nodes participating in the calculation is determined to be unbalanced, determining a first active point transfer strategy according to the time cost required by the work transfer among different working nodes; and sending the first active point transfer strategy to the working node with the work transfer to control the working node with the work transfer to complete the graph calculation of the transferred active point according to the first active point transfer strategy.
On the other hand, the present application provides a distributed graph computing system, including a decision control node and multiple working nodes, where the decision control node is configured to obtain a graph computation task to be executed and graph data of the graph computation task, control the multiple working nodes to execute the graph computation task, perform multiple rounds of graph computation on the graph data iteratively, and output a processing result of the graph computation task; the decision control node is further configured to: in each iteration of the graph computation task, if the load of the working nodes participating in the computation is determined to be unbalanced, determine a first active point transfer strategy according to the time cost required for carrying out work transfer among different working nodes, and send the first active point transfer strategy to the first working node and the second working node where the work transfer occurs;
the first working node is used for receiving the first active point transfer strategy, acquiring the information of the active points transferred from the second working node according to the first active point transfer strategy in the iteration, and completing the graph calculation of the transferred active points.
In another aspect, the present application provides an electronic device comprising: a processor, and a memory communicatively coupled to the processor, the memory storing computer-executable instructions. Wherein the processor executes computer-executable instructions stored in the memory to implement the method of any of the above aspects.
In another aspect, the present application provides a computer-readable storage medium having stored thereon computer-executable instructions for implementing the method of any one of the above aspects when executed by a processor.
According to the distributed graph computing method, system and equipment provided by the application, the graph computation task is executed by a plurality of working nodes, multiple rounds of graph computation are carried out on the graph data iteratively, and the processing result of the graph computation task is output. In this process, a work transfer mechanism is introduced: when the loads of the working nodes participating in the computation are unbalanced, a first active point transfer strategy is determined according to the time cost required for work transfer among different working nodes; according to the first active point transfer strategy, the active points to be computed that the first working node takes over from the second working node are determined; the first working node obtains the information of the transferred active points and completes the graph computation of the transferred active points in this iteration. The load of the working nodes is adaptively balanced in each round of iteration through this dynamic work transfer method, which solves the Dynamic Load Balancing (DLB) problem of the distributed graph computing system and improves the computing efficiency and performance of the distributed graph computing system.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
Fig. 1 is a schematic diagram of an example system architecture provided in an embodiment of the present application;
FIG. 2 is an architecture diagram of a distributed graph computing system provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a distributed graph computation method provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic diagram of an active point transition policy provided by an exemplary embodiment of the present application;
FIG. 5 is a flowchart of generating an active point transition policy provided by an exemplary embodiment of the present application;
FIG. 6 is a flowchart of a distributed graph computation method provided in another exemplary embodiment of the present application;
FIG. 7 is a flow diagram of an ownership transfer mechanism provided in an exemplary embodiment of the present application;
fig. 8 is a schematic diagram of a topology of an interconnection channel among multiple GPUs according to an exemplary embodiment of the present application;
fig. 9 is a schematic diagram of a specification tree of an ownership transfer policy provided in an exemplary embodiment of the present application;
FIG. 10 is a flowchart of a distributed graph computation method incorporating ownership transfer and livepoint transfer as provided in an exemplary embodiment of the present application;
fig. 11 is a flowchart framework of a distributed graph computation method according to an exemplary embodiment of the present application.
Specific embodiments of the present application have been shown by way of example in the drawings and will be described in more detail below. These drawings and written description are not intended to limit the scope of the inventive concepts in any manner, but rather to illustrate the inventive concepts to those skilled in the art by reference to specific embodiments.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims.
The terms referred to in this application are explained first:
Graph: an abstract data structure composed of vertices and edges. A graph may be represented as G = (V, E), where V represents a set containing a finite number of vertices. Each vertex v ∈ V has a unique identifier (e.g., id) and some associated attributes. E ⊆ V × V represents the set of edges. For each edge (s, d), the first vertex s is its source vertex and the second vertex d is its destination vertex. The number of vertices in the graph is denoted by |V| and the number of edges by |E|. A weighted graph is denoted by G = (V, E, w), where w is a function that maps edges to real values, so each edge is associated with a weight. In practice, {V, E} is fixed and usually very sparse, i.e., |E| is much smaller than |V| × |V|. This embodiment considers graphs that fit in the aggregate memory of the devices.
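For illustration only (this is not part of the patent text), a weighted graph G = (V, E, w) of this kind is commonly held in compressed sparse row (CSR) form on GPUs; the sketch below assumes that layout, and all names in it are hypothetical.

    import numpy as np

    class CSRGraph:
        """Minimal CSR container for a directed, weighted graph G = (V, E, w)."""

        def __init__(self, num_vertices, edges):
            # edges: iterable of (source, destination, weight) triples
            self.num_vertices = num_vertices
            edges = sorted(edges)  # group out-edges by source vertex
            self.row_offsets = np.zeros(num_vertices + 1, dtype=np.int64)
            self.col_indices = np.empty(len(edges), dtype=np.int64)
            self.weights = np.empty(len(edges), dtype=np.float64)
            for k, (s, d, w) in enumerate(edges):
                self.row_offsets[s + 1] += 1  # count out-degree of s
                self.col_indices[k] = d
                self.weights[k] = w
            self.row_offsets = np.cumsum(self.row_offsets)  # prefix sums -> offsets

        def out_degree(self, v):
            return int(self.row_offsets[v + 1] - self.row_offsets[v])

        def out_edges(self, v):
            lo, hi = self.row_offsets[v], self.row_offsets[v + 1]
            return list(zip(self.col_indices[lo:hi], self.weights[lo:hi]))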
Graph calculation (Graph Processing): the graph data is analyzed and calculated to obtain valuable information.
Graph algorithm: the implementation logic of a graph computation task, i.e., a set of procedures for performing data analysis and computation on the graph G to solve a practical problem. Since the GAS (Gather-Apply-Scatter) abstraction of graph algorithms is widely used and works well in practical applications, this embodiment considers graph algorithms based on the GAS abstraction (also referred to as GAS graph algorithms) running in Bulk Synchronous Parallel (BSP) mode on multiple GPUs. Specifically, for an input graph G = (V, E), only a subset of V participates in the operation in each step; the vertices participating in the operation are referred to as the boundary (also called active points or active vertices), and the outgoing edges of the boundary are referred to as active edges. Gather aggregates incoming messages: an aggregation function is defined on it to aggregate the information collected by the current boundary through its edges. Apply applies the aggregated result to the current boundary, e.g., updating the distance of each vertex in a Single Source Shortest Path (SSSP) algorithm. Scatter propagates the latest results of the boundary along all edges of the boundary.
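To make the Gather-Apply-Scatter pattern concrete, the following self-contained sketch (illustrative only, not the patent's implementation; the graph layout and all names are hypothetical) expresses one BSP superstep of the Single Source Shortest Path algorithm with the three operators: the boundary scatters its latest distances, incoming messages are gathered with a minimum, and the aggregated result is applied to produce the next boundary.

    import math

    def sssp_superstep(graph, dist, frontier):
        # graph:    dict vertex -> list of (neighbor, edge_weight)
        # dist:     dict vertex -> current tentative distance
        # frontier: set of active points (the boundary) for this superstep
        # Scatter: each active point propagates its latest distance along its out-edges.
        messages = {}
        for u in frontier:
            for v, w in graph[u]:
                messages.setdefault(v, []).append(dist[u] + w)
        # Gather: aggregate incoming messages per destination (min for SSSP).
        # Apply: write back the aggregated value; vertices whose value changed
        # form the boundary of the next superstep.
        next_frontier = set()
        for v, candidates in messages.items():
            best = min(candidates)
            if best < dist[v]:
                dist[v] = best
                next_frontier.add(v)
        return next_frontier

    # Usage on a tiny hypothetical graph:
    g = {0: [(1, 2.0), (2, 5.0)], 1: [(2, 1.0)], 2: []}
    dist = {0: 0.0, 1: math.inf, 2: math.inf}
    frontier = {0}
    while frontier:
        frontier = sssp_superstep(g, dist, frontier)
    # dist is now {0: 0.0, 1: 2.0, 2: 3.0}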
Graph computing Framework/System (Graph Processing Framework/System): a programming framework/system dedicated to performing graph computations.
Superstep (SuperStep): also written super step. A graph algorithm following the Bulk Synchronous Parallel (BSP) mode consists of a series of supersteps, which are executed synchronously.
Operator: the method is a unit for realizing certain computational logic, and realizes the computational logic of an upper layer interface.
Dividing the graph: the term "graph partitioning", graph partitioning, and graph cutting "refers to dividing a large graph into point sets or edge sets that do not overlap each other. In order to perform graph computation in a distributed system with n work nodes (worker), the input graph should be divided into n non-overlapping subgraphs. Usually, the graph is divided into two modes, namely vertex cutting and edge cutting, for the edge cutting mode, each vertex only belongs to one sub-graph, and for the vertex cutting mode, each edge only belongs to one sub-graph. A good graph partitioning scheme is that the size distribution of the individual subgraphs is more uniform, and the number of edges (vertices) replicated across the subgraphs is as small as possible.
GPU: the general term Graphics Processing Unit, also called image Processing Unit or display Processing Unit, is a microprocessor that is specialized in image and Graphics related operations.
NVLink: the bus is a point-to-point serial transmission bus and a communication protocol thereof, and is used for interconnection communication between GPUs.
Work transfer: also called work stealing. In parallel computing, an idle processing unit steals a task from the task queue of a busy processing unit and processes it, so as to achieve load balancing.
For the problem of load imbalance of multiple computing devices in a distributed graph computing system, a traditional solution depends heavily on a static graph partitioning algorithm, and aims to solve the problem of load imbalance by balancing the number of vertices/edges of sub-graphs held by each computing device as much as possible through a better graph partitioning scheme.
However, in most cases, the boundary (active points) involved in one iteration (one superstep) of graph computation covers only a small fraction of the vertices. Due to the irregularity of the graph structure and the dynamic behavior of graph algorithms, the size of the boundary on each working node also changes dynamically as the iterations proceed, which causes load imbalance among the working nodes. Even if the graph is well partitioned by an advanced graph partitioner, the graph computing system still faces the Dynamic Load Balancing (DLB) problem during graph computation, which lowers the computing efficiency and performance of the graph computing system.
For the distributed graph computing system based on the GPU, a work node for performing graph computation in the system is the GPU, which is more sensitive to load imbalance and involves more data transmission overhead, such as synchronization between the host and the GPU, so the distributed graph computing system based on the GPU also has a serious load imbalance problem, and is low in computation efficiency and poor in performance.
The distributed graph calculation method provided by the application can be applied to a distributed graph calculation system based on a GPU, or a distributed graph calculation system with high-speed communication channels (such as a high-speed interconnection channel NVLink) among other working nodes, or a distributed graph calculation system with high-speed communication channels among the working nodes in a subsequent evolution process, and can balance the working load of each working node by adopting a work transfer mode so as to increase a small amount of communication time and save more calculation time instead, thereby improving the calculation efficiency and performance of the distributed graph calculation system.
Before introducing the technical solution provided by the present application, a system architecture of the technical solution of the present application is first described in detail below.
Fig. 1 is a schematic diagram of an example system architecture provided in an embodiment of the present application. As shown in fig. 1, the system architecture includes: the cloud side device is in communication connection with the end side device through an end cloud link.
In this embodiment, the cloud-side device is a distributed cluster, and may be implemented based on a distributed cloud architecture. The distributed graph computing system is deployed on the cloud side equipment and comprises a plurality of working nodes, and the distributed graph computing system provides services for implementing graph computing tasks through distributed graph computing. The cloud side equipment acquires graph computing tasks to be executed and graph data of the graph computing tasks, executes the graph computing tasks through a plurality of working nodes, performs multiple rounds of graph computing on the graph data in an iterative manner, and updates information of vertexes in the graph data; moreover, by introducing a work transfer mechanism, in each iteration of a graph calculation task, if the load of the working nodes participating in the calculation is determined to be unbalanced according to the number of active points on the working nodes participating in the calculation, a first active point transfer strategy is determined according to the time cost required for carrying out work transfer among different working nodes; and sending the first active point transfer strategy to the first working node and the second working node, wherein the first active point transfer strategy comprises the information of the active points to be calculated, which are transferred from the second working node by the first working node in the iteration, so as to control the first working node to acquire the information of the transferred active points and replace the second working node to finish the graph calculation of the transferred active points. And after the graph calculation task is executed, the cloud side equipment outputs a processing result of the graph calculation task according to the updated graph data.
The end-side device may be a user-side device that needs to use graph computation capability to implement graph computation tasks, and may be a cloud server, a local device, a client device, and the like of various platforms. For example, the server may be a server of a platform such as e-commerce, social network, intelligent transportation, etc., or a device having a network routing function, etc. Different end-side devices may have different application domains/scenarios, and may have different graph computation requirements.
In one possible usage scenario, the cloud-side device may provide control flow information of one or more graph computation tasks and provide information of executable graph computation tasks to the end-side device, and the end-side device specifies which graph computation task to perform. In addition, the end-side device can also submit control flow information of the customized graph computing task to the cloud-side device. And the cloud side equipment executes the corresponding graph calculation task by executing the control flow information.
A user can send a graph calculation request to the cloud-side device through the end-side device, where the graph calculation request carries graph data and information of the graph calculation task. The cloud-side device acquires the graph data according to the received graph calculation request, acquires the corresponding control flow information according to the information of the graph calculation task, executes the graph calculation task through a plurality of working nodes, performs multiple rounds of graph calculation on the graph data iteratively, updates the information of the vertices in the graph data, and determines the processing result of the graph calculation task according to the updated graph data. The cloud-side device outputs the processing result of the graph computing task to the end-side device.
In order to facilitate understanding of the interaction process executed between the devices in the system architecture shown in fig. 1, the following describes the interaction process executed between the devices in the above usage scenario with reference to several specific application scenarios.
In one possible application scenario, the end-side device is a cloud server of an e-commerce platform. The end-side device can acquire data of the e-commerce platform, for example, behavior data such as a consumer browsing a certain product, a consumer ordering a certain product, or a consumer sharing a certain product with relatives and friends, as well as consumer groups, product categories, suppliers, and the like. The end-side device constructs a consumer-product relation graph in the e-commerce field according to the collected data of the e-commerce platform. The user sends a graph calculation request to the cloud-side device through the end-side device, and the graph calculation request carries the consumer-product relation graph and information of the graph calculation task. The cloud-side device executes the graph calculation task through the multiple working nodes based on the consumer-product relation graph and the information of the graph calculation task sent by the end-side device, performs multiple rounds of graph calculation on the consumer-product relation graph iteratively, updates the information of the vertices in the consumer-product relation graph, determines the processing result of the graph calculation task according to the updated vertex information after the graph calculation is completed, and feeds back the processing result to the end-side device. In addition, the end-side device can also directly acquire the constructed consumer-product relationship graph from other devices.
For example, the graph calculation task may be consumer preference analysis, and the information of the vertex in the consumer-product relationship graph may be used to record the preference information of the consumer for the product, and the output processing result is the product preferred by the user.
For example, the graph calculation task may be precise recommendation of a product, the information of the vertex in the consumer-product relationship graph may be used to record the possibility information that the product is recommended to the user, and the output execution result is the product information recommended to the user.
For example, the graph computation task may be hot product statistics, the information of the vertex in the consumer-product relationship graph may be used to record the hot information of the product, and the output execution result is the hot product information.
In one possible application scenario, the end-side device is a server of a social networking platform. The end-side device can obtain usage data of the social network platform, such as articles and comments published by the user, articles and comments liked by the user, information of friends added by the user, locations visited by the user, and the like. The end-side device constructs a social network information graph of the social domain according to the collected usage data of the social network platform. The user sends the social network information graph and information of the graph calculation task to the cloud-side device through the end-side device. The cloud-side device executes the graph computing task through the multiple working nodes based on the social network information graph and the graph computing task information sent by the end-side device, performs multiple rounds of graph computing on the social network information graph iteratively, updates the information of the vertices in the social network information graph, determines the processing result of the graph computing task according to the updated vertex information after the graph computing is completed, and feeds back the processing result of the graph computing task to the end-side device. In addition, the end-side device can also directly acquire the constructed social network information graph from other devices.
For example, the graph computation task may be discovery of communities, information of vertices in the social network information graph may be used to record probability information of users in the same community, and the output processing result may be information of one or more communities.
For example, the graph calculation task may be accurately recommending friends, information of vertices in the social network information graph may be used to record the closeness of relationships between users, and the output processing result is user information for recommending and adding friends to the user.
In one possible application scenario, the end-side device is a device with network routing functionality. The end-side device is able to obtain network configuration data, such as the number of nodes in the network, the location of each network node, performance parameters, etc. And the end-side equipment builds a network topological graph according to the network structure data. And the user sends the network topology map and the task information for planning the shortest network path to the cloud side equipment through the end side equipment. The cloud side equipment executes a graph calculation task for planning the shortest network path through a plurality of working nodes based on a network topological graph and task information sent by the end side equipment, performs multiple rounds of graph calculation on the network topological graph in an iteration mode, updates information of a vertex in the network topological graph, determines the shortest network path according to the updated information of the vertex in the network topological graph after the graph calculation is completed, and feeds back the shortest network path to the end side equipment. Wherein, the information of the vertex in the network topology graph can indicate the cost information of the path. In addition, the end-side device can also directly acquire the constructed network topology map from other devices.
In addition to the system architecture shown in fig. 1, the distributed graph computing system may be deployed on an electronic device with distributed computing capability on a user side, where the electronic device runs with multiple work nodes. The electronic device may be one computing device or a distributed cluster comprising a plurality of computing devices.
The electronic device has the capability of the end-side device and is responsible for acquiring graph calculation tasks and graph data to be executed. The electronic equipment has the distributed graph computing capacity of the cloud side equipment, executes graph computing tasks through a plurality of working nodes, performs multiple rounds of graph computing on graph data in an iterative manner, and updates the information of vertexes in the graph data; a work transfer mechanism is introduced, and in each iteration of a graph calculation task, if the unbalanced load of the working nodes participating in the calculation is determined according to the number of active points on the working nodes participating in the calculation, a first active point transfer strategy is determined according to the time cost required for carrying out work transfer among different working nodes; and sending the first active point transfer strategy to the first working node and the second working node so as to control the first working node to acquire the information of the transferred active points and complete the graph calculation of the transferred active points. And after the graph calculation task is executed, outputting the processing result of the graph calculation task according to the updated information of the vertex in the graph data.
Taking the above e-commerce scenario as an example, the electronic device may be a cloud server of the e-commerce platform. The electronic device is capable of obtaining a consumer-product relationship graph in the merchant domain, and a graph computation task to be performed. The electronic equipment executes the graph calculation task through the multiple working nodes based on the consumer-product relation graph, carries out multiple rounds of graph calculation on the consumer-product relation graph in an iterative mode, updates information of vertexes in the consumer-product relation graph, determines a processing result of the graph calculation task according to the updated information of the vertexes in the consumer-product relation graph after the graph calculation is completed, and outputs the processing result of the graph calculation task.
For any scenario, the end-side device collects data of the specific application scenario, constructs graph data, and sends the graph data to the cloud-side device. For privacy protection, the end-side device promptly deletes any personal data and private data of users that may be involved, so that service is provided to users while ensuring that user privacy is not leaked.
The distributed graph calculation method provided by the present application may also be applied to various application scenarios in multiple fields such as financial security, internet, industry, biomedicine, public security, smart city (intelligent transportation), and the like, and the embodiment is not limited in this respect.
In general, a distributed graph computing system includes a plurality of worker nodes (workers). For example, a GPU-based distributed graph computing system includes a plurality of GPUs, each GPU acting as a working node. Each working node performs part of the distributed computation, and different working nodes communicate through a high-speed communication channel (such as the high-speed interconnection channel NVLink between GPUs), so that the communication time added by work transfer is small relative to the computation time saved.
In order to realize distributed graph computation on graph data, a distributed graph computation system divides the graph data into n subgraphs, and respectively stores the vertex sets of the n subgraphs to n working nodes. Each working node has a set of vertices of a subgraph responsible for processing its corresponding computations and communications. Where n represents the total number of worker nodes (workers) in the distributed graph computing system that may participate in graph computation.
In this embodiment, the graph division may be performed by edge cutting: the original graph G(V, E) is divided n ways to obtain n subgraphs (F1, …, Fn). The union of the vertex sets of the n subgraphs is the vertex set of graph G, the union of the edge sets of the n subgraphs is the edge set of graph G, and each vertex belongs to only one subgraph, i.e., the vertex sets of different subgraphs do not contain the same vertex. Each working node holds one subgraph (vertex set) and is responsible for handling its corresponding computation and communication.
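A minimal sketch of this edge-cut division, assuming a simple modulo placement of vertices onto the n working nodes (the patent does not prescribe a particular partitioner, and every name below is hypothetical):

    def edge_cut_partition(num_vertices, edges, n):
        # Split G = (V, E) into n subgraphs F1..Fn by edge cutting: each vertex
        # belongs to exactly one subgraph, and every edge is stored with the
        # subgraph that owns its source vertex.
        owner = lambda v: v % n  # hypothetical placement rule
        vertex_sets = [set() for _ in range(n)]
        edge_sets = [[] for _ in range(n)]
        for v in range(num_vertices):
            vertex_sets[owner(v)].add(v)
        for s, d, w in edges:
            edge_sets[owner(s)].append((s, d, w))  # out-edges live with their source's owner
        return vertex_sets, edge_sets

    # The union of the vertex sets is V, the union of the edge sets is E, and no
    # vertex appears in two subgraphs, matching the edge-cut mode described above.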
The distributed graph computation method provided by the application follows the Bulk Synchronous Parallel (BSP) mode. A graph computation task/algorithm comprises a plurality of supersteps, and the distributed graph computing system runs the graph computation task/algorithm through multiple rounds of iteration to realize synchronous execution of the supersteps. Each iteration corresponds to the execution of one superstep, and each iteration is one complete pass of the system updating the vertices of the graph.
Specifically, in the method of this embodiment, a plurality of working nodes execute the processing logic of a graph computation task or graph algorithm, performing multiple rounds of graph computation on the graph data (input graph G) and updating the information of the vertices in the graph data. In each round of iteration of the graph computation task, when load imbalance of the working nodes participating in the computation is determined according to the number of active points on those working nodes, a first active point transfer policy is determined according to the time cost required for performing work transfer between different working nodes, where the first active point transfer policy includes: the information of the active points to be computed that the first working node takes over from the second working node, the active points being vertices in the graph data. The first active point transfer policy is sent to the first working node and the second working node, so as to control the first working node to acquire the information of the transferred active points and complete the graph computation of the transferred active points. After the graph computation task is executed, the processing result of the graph computation task is output according to the updated graph data. In the graph computation process, a dynamic work transfer method is adopted to adaptively balance the load of each working node (GPU), so as to solve the Dynamic Load Balancing (DLB) problem of the GPU-based distributed graph computing system and improve its computing efficiency and performance.
Illustratively, the active point transfer policy may be expressed in terms of sets B(i→j): the set of active points that GPU j takes over from GPU i, where GPU j and GPU i may be any two different GPUs, i and j are integers in the interval [1, m] with i ≠ j, and m is the number of GPUs participating in the computation. The active points in B(i→j) are stored on GPU i; after they are taken over by GPU j, the graph computation associated with B(i→j) in this iteration is performed by GPU j instead of GPU i, so that part of the workload of GPU i is shared by GPU j. In different iterations of the graph computation process, the active point transfer strategy used in each iteration is dynamically adjusted, thereby solving the Dynamic Load Balancing (DLB) problem of the distributed graph computing system and improving the computing efficiency and performance of the graph computing system.
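As a purely illustrative sketch of what such a policy looks like when applied (not the patent's code; the data layout and all names are hypothetical), a transfer policy can be held as a mapping from worker pairs to the sets of active points being taken over, and each worker's per-iteration work list adjusted accordingly, while the vertex data itself stays on its original owner and is read remotely over the high-speed interconnect:

    def apply_transfer_policy(active_points, policy):
        # active_points[i]: set of active points currently owned by worker i
        # policy[(i, j)]:   set of active points that worker j takes over from
        #                   worker i in this iteration (B(i -> j) in the text above)
        # Returns, per worker, the active points it will actually process.
        work = {i: set(pts) for i, pts in active_points.items()}
        for (i, j), transferred in policy.items():
            work[i] -= transferred  # worker i no longer computes these points
            work[j] |= transferred  # worker j computes them instead; their data
                                    # remains stored on worker i and is accessed remotely
        return work

    # Hypothetical example matching FIG. 4: worker 0 (GPU0) is overloaded.
    active = {0: {1, 2, 3}, 1: {5}}
    policy = {(0, 1): {3}}  # GPU1 takes over active point v3 from GPU0
    print(apply_transfer_policy(active, policy))  # {0: {1, 2}, 1: {3, 5}}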
The following describes the technical solutions of the present application and how to solve the above technical problems with specific embodiments. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments. Embodiments of the present application will be described below with reference to the accompanying drawings.
The distributed graph computing method provided by the application can be applied to a distributed graph computing system, and the distributed graph computing system is deployed on the cloud-side device. Fig. 2 is an architecture diagram of a distributed graph computing system provided in an embodiment of the present application. The distributed graph computing system comprises a plurality of working nodes, and the working nodes perform distributed graph computation to execute a graph computation task/graph algorithm and obtain a processing result. Take a GPU-based distributed graph computing system as an example: it comprises a plurality of GPUs with high-speed interconnection channels (such as NVLink) between them, and each GPU serves as a working node.
As shown in fig. 2, the distributed graph computing system includes a decision control node, a first working node, and a second working node. The working nodes perform remote data access through the high-speed interconnection channel. In this embodiment, any working node in the distributed graph computing system may serve as the decision control node. The first working node refers to a working node that takes over active points from another working node, and the second working node refers to a working node whose active points are taken over by another working node. If active point transfer occurs, there may be one or more first working nodes and one or more second working nodes; only one first working node and one second working node are shown in FIG. 2 for illustrative purposes, and the distributed graph computing system may include multiple first working nodes and multiple second working nodes.
The decision control node is used for acquiring graph calculation tasks to be executed and graph data of the graph calculation tasks, controlling the plurality of working nodes to execute the graph calculation tasks, performing multiple rounds of graph calculation on the graph data in an iteration mode, and outputting processing results of the graph calculation tasks.
Specifically, in each iteration of the graph calculation task, if the decision control node determines that the load of the working nodes participating in the calculation is unbalanced, the decision control node determines a first active point transfer strategy according to the time cost required for performing work transfer between different working nodes, and sends the first active point transfer strategy to the first working node and the second working node. The first active point transition policy includes: and the first working node transfers the information of the active points to be calculated from the second working node, wherein the active points are vertexes in the graph data. And the first working node is used for receiving the first active point transfer strategy, acquiring the information of the transferred active points according to the first active point transfer strategy in the iteration, and finishing the graph calculation of the transferred active points. The second working node also receives the first active point transfer strategy, and does not perform graph calculation of the active points transferred by the first working node in the current iteration according to the first active point transfer strategy.
Fig. 3 is a flowchart of a distributed graph computation method according to an exemplary embodiment of the present application. The distributed graph calculation method provided by the embodiment is applied to a decision control node in a distributed graph calculation system. As shown in fig. 3, the method comprises the following specific steps:
step S301, graph calculation tasks to be executed and graph data of the graph calculation tasks are acquired.
In this embodiment, the graph calculation task to be executed may be a task for implementing graph calculation by executing any one graph algorithm in a specific application scenario, and specifically may be a graph calculation task in various application scenarios in various fields such as financial security, internet, industry, biomedicine, public security, smart city (intelligent transportation), and the like. When the method is applied to different graph calculation tasks, graph data needing graph calculation can be different.
And S302, controlling the plurality of working nodes to execute the graph calculation task, performing multiple rounds of graph calculation on graph data iteration, and outputting a processing result of the graph calculation task.
In this embodiment, the distributed graph computation is performed by a plurality of work nodes based on the distributed graph computation system, so as to implement the execution of the graph computation task. In the process of executing the control flow information (computation logic) of the graph computation task, multiple rounds of graph computation are generally required to be iterated, and each round of graph computation includes the process of computation and update of vertex information in graph data.
In this embodiment, through the following sub-steps S3021-S3022, a dynamic work transfer mechanism is added in each iteration of the graph calculation performed in step S302 to adaptively balance the load of each working node, so as to solve the Dynamic Load Balancing (DLB) problem of the distributed graph computing system and improve the computing efficiency and performance of the GPU-based distributed graph computing system.
Step S3021, in each iteration of the graph calculation task, if the load of the working nodes involved in the calculation is not balanced, determining a first active point transfer strategy according to the time cost required for the work transfer among different working nodes.
The active points on each working node may be determined as follows: the vertices whose information changed according to the previous round's calculation result are identified, and whether to mark a vertex as an active point is decided according to the specific processing logic of the graph calculation task. This is consistent with the method for determining the active point set in existing distributed graph computation and is not described here again.
The decision control node in the distributed graph computing system can acquire the working load conditions of other working nodes and judge whether the working nodes participating in the computation are in load balance or not.
When the load of the working nodes participating in the calculation is determined to be unbalanced, the active point transfer mechanism is started in the current iteration, and a first active point transfer strategy is generated according to the time cost required for carrying out work transfer among different working nodes, where the first active point transfer strategy specifies which working nodes (called first working nodes) take over which active points from which working nodes (called second working nodes). There may be one or more first working nodes that take over active points from other working nodes, and there may also be one or more second working nodes whose active points are taken over.
And step S3022, sending the first active point transfer strategy to the working node where the work transfer occurs, so as to control the working node where the work transfer occurs to complete the graph calculation of the transferred active point according to the first active point transfer strategy.
The working nodes for sending the work transfer comprise a first working node and a second working node. The first working node and the second working node are any two different working nodes. The first active point transition policy may include: the first working node transfers the information of the active points to be calculated from the second working node.
In this embodiment, the information of the active point acquired by the first working node refers to data required for performing graph computation on the active point, and includes, but is not limited to, attribute information (or state information) of the active point, a neighbor list, attribute information (state information) of a neighbor, and a weight of an edge.
It should be noted that the information of the active point transferred from the second working node by the first working node is still stored in the second working node, and the first working node can remotely access the information of the active point stored in the second working node through the high-speed interconnection channel.
According to the first active point transfer strategy, in the current iteration, the first working node with smaller workload transfers part of the active points (transferred active points) of the second working node with larger workload, and the first working node is responsible for the processing of the transferred active points, so that the workload of the second working node is reduced.
Illustratively, graph G, shown on the left side of FIG. 4, contains 8 vertices, denoted v0-v7 and numbered 0-7, respectively. Take a distributed graph computing system comprising two working nodes, GPU0 and GPU1, as an example: graph G is divided into two subgraphs. One subgraph includes vertices v0-v3 and the associated edges and is assigned to GPU0; the other subgraph includes the other vertices (v4-v7) and edges and is assigned to GPU1. Assume that, in this iteration, before using the active point transfer policy, the active points activated by the root vertex on GPU0 include v1, v2, v3, and v5, so there are 3 active points on GPU0 (v1, v2, v3) and 1 active point (v5) on GPU1. GPU0 needs to process 8 edges when processing v1, v2, and v3, while GPU1 only needs to process 2 edges for v5. After the work transfer mechanism is introduced, the active point transfer strategy can transfer active point v3 on GPU0 to GPU1; in this iteration, GPU0 processes v1 and v2 and needs to process 5 edges, while GPU1 processes v3 and v5 and also needs to process 5 edges, so the loads of GPU0 and GPU1 are balanced. It should be noted that FIG. 4 illustrates only one example of an active point transfer policy; when generating the first active point transfer policy in practical applications, the time cost caused by performing active point transfer between different working nodes is taken into account.
In this embodiment, by introducing a work transfer mechanism in each iteration of distributed graph computation, when a working node participating in computation satisfies a load imbalance condition, a first active point transfer policy is determined according to a time cost required for work transfer between different working nodes, where the first active point transfer policy includes: the first working node transfers the active point information to be calculated from the second working node. And sending the first active point transfer strategy to the first working node and the second working node to control the first working node to acquire the information of the transferred active points and complete graph calculation of the transferred active points, and adaptively balancing the load of each working node in each iteration by adopting a dynamic work transfer method to solve the Dynamic Load Balance (DLB) problem of the distributed graph calculation system so as to improve the calculation efficiency and the performance of the distributed graph calculation system.
In an optional embodiment, in step S3021, determining that the workload of the working nodes participating in the calculation is unbalanced may be implemented as follows: determining the maximum active point quantity and the minimum active point quantity according to the quantity of the active points on the working nodes participating in calculation; and if the maximum active point quantity is greater than or equal to the first active quantity threshold value, and the difference value between the maximum active point quantity and the minimum active point quantity is greater than or equal to the second active quantity threshold value, determining that the loads of the working nodes participating in the calculation are unbalanced. The first active quantity threshold and the second active quantity threshold may be determined according to an empirical value in combination with a configuration and a processing capability of a working node used in an actual application scenario, and are not specifically limited herein. For example, the first activity number threshold may be generally set to 131072, and the second activity number threshold may be generally set to 4096, which is not specifically limited herein.
In this embodiment, considering that in a distributed graph computing system the active points (boundary) can be regarded as the basic work items to be processed in each iteration, the number of active points on a working node is a good representation of that node's load. According to the number of active points on the working nodes participating in the calculation, load imbalance among the working nodes participating in the calculation can be accurately identified. When the workload of the working nodes participating in the calculation is unbalanced, a working node with a lower workload can use its otherwise idle time to take over active points from other working nodes with heavier workloads, thereby avoiding starvation of working nodes and shortening the end-to-end graph computation time.
In addition, by requiring that the maximum number of active points is greater than or equal to the first active number threshold, the active point transfer strategy can be generated only when the working nodes have dense active points, and the calculation time saved by active point transfer can be ensured to be greater than the time overhead for generating the active point transfer strategy, so that unnecessary work transfer is avoided under the condition that the active points on the working nodes are fewer (sparse), and the efficiency and performance of distributed calculation are improved.
In another optional embodiment, the load imbalance of the working nodes participating in the computation may also be determined based on the second active number threshold alone: load imbalance is determined when the difference between the maximum number of active points and the minimum number of active points is greater than or equal to the second active number threshold.
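A minimal sketch of the two imbalance tests just described; the two threshold values are the example defaults mentioned above, and everything else (function and variable names, the boolean switch between the two variants) is hypothetical:

    FIRST_ACTIVE_THRESHOLD = 131072   # example value from the text: frontier dense enough
    SECOND_ACTIVE_THRESHOLD = 4096    # example value from the text: max-min gap counted as imbalance

    def load_imbalanced(active_counts, require_dense=True):
        # active_counts[i]: number of active points on working node i in this iteration
        max_active = max(active_counts)
        min_active = min(active_counts)
        gap_too_large = (max_active - min_active) >= SECOND_ACTIVE_THRESHOLD
        if require_dense:
            # Variant 1: also require a dense frontier, so that the computation time
            # saved by the transfer exceeds the cost of generating the transfer policy.
            return max_active >= FIRST_ACTIVE_THRESHOLD and gap_too_large
        # Variant 2: judge by the max-min gap alone.
        return gap_too_large

    # e.g. load_imbalanced([200_000, 1_000, 150_000]) -> True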
The goal of the first active point transfer policy in this embodiment is to balance and minimize the processing time of this iteration. Exemplarily, denoting any working node by GPUi, the processing time for GPUi to complete its work in the k-th iteration can be denoted by $T_i^{(k)}$. The goal of the first active point transfer policy is then to minimize

$$\max_{1 \le i \le m} T_i^{(k)}$$

i.e., to minimize the processing time of the working node with the longest processing time (the straggler), where m represents the number of working nodes participating in the computation and i is an integer in $[1, m]$.
In graph data (graph G) of a practical application scenario, edge distributions of vertices are usually unbalanced, i.e., the number of associated edges of different vertices in the graph data is very different. Therefore, in generating the first active point transition policy, it is desirable to balance the number of edges that each worker node needs to process in this iteration.
Considering that the two vertices of an edge may be stored on different working nodes, transferring an edge between working nodes may result in atomic operations across working nodes, which incurs a high time cost. In an optional embodiment, when determining the first active point transfer policy according to the time cost required for work transfer between different working nodes, first, through steps S501-S502, the number of edges that the first working node needs to transfer from the second working node in the current iteration, called the target edge number, is determined according to the time cost required for the first working node to process one edge on the second working node and the number of associated edges of the active points on the second working node. The target edge number indicates the number of edges transferred between different working nodes in the active point transfer policy, and balancing the target edge numbers transferred between different working nodes balances their workloads well. However, because directly transferring edges between working nodes increases the time cost, in this embodiment edges are not transferred directly; instead, after the target number of edges to be transferred between working nodes is calculated, the active points to be transferred between working nodes are determined according to that target edge number, so that transferring the active points achieves the effect of transferring the target number of edges.
Further, through step S503, according to the target number of edges that the first working node needs to transfer from the second working node in the current iteration and the data of the subgraph on the second working node, the active point that the first working node transfers from the second working node in the current iteration is determined, and a certain number of active points are transferred from the second working node through the first working node, so as to achieve the same effect that the first working node transfers the target number of edges from the second working node, thereby avoiding the atomic operation across the working nodes and reducing the time overhead brought by the work transfer.
Referring to fig. 5, fig. 5 is a flowchart for generating an active point transfer policy provided in an embodiment of the present application. As shown in fig. 5, the first active point transfer policy is generated according to the time cost required for performing work transfer between different working nodes, with the following specific implementation steps:
step S501, constructing a minimum-maximum problem model according to the time cost required by the first working node to process one edge on the second working node and the number of the associated edges of the active points on the second working node.
Wherein the optimization objective of the min-max problem model is to minimize the objective function value z. The constraints of the min-max problem model include:
constraint 1: the objective function value z is greater than or equal to the total time cost of work transfer in the iteration, wherein the total time cost is as follows: the sum of the product of the number of edges that the first worker node needs to transfer from the second worker node and the time cost that the first worker node needs to process one edge on the second worker node.
Constraint 2: for any second working node, the sum of the number of edges that the first working node needs to transfer from the second working node in the iteration is equal to the number of active edges on the second working node.
Constraint 3: the number of edges that any first working node needs to transfer from any second working node in this iteration is an integer less than or equal to the number of edges of the subgraph on that second working node.
In this embodiment, the time cost coefficients for processing an edge across working nodes may be stored as a cost coefficient matrix C. The cost coefficient matrix C is an n × n matrix, where n is the total number of working nodes in the distributed computing system, and the entry $c_{j,i}$ represents the time cost for working node j to process one edge on working node i. The cost coefficient matrix C may be determined offline in advance according to the network bandwidth between the respective working nodes and the feature data of the subgraph owned by each working node.
It should be noted that, in general, the number m of working nodes participating in the calculation is equal to the total number n of working nodes in the distributed computing system; however, when the ownership transfer policy of the present application (described in later embodiments) is used, m may be less than n. For working nodes that do not participate in the calculation, when the cost coefficient matrix C is used to evaluate the time cost of the active point transfer policy, the corresponding entries $c_{j,i}$ in C are set to infinity, so that when the active point transfer policy is generated, no active points are transferred by working nodes that do not participate in the computation.
In this embodiment, taking the generation of the first active point transfer policy in the k-th iteration as an example, with m denoting the number of working nodes participating in the computation, the first active point transfer policy may be denoted by $F^{(k)}$ and stored as an m × m matrix, where the entry $F^{(k)}_{j,i}$ denotes the set of active points that working node j transfers from working node i. The transferred active points $F^{(k)}_{j,i}$ are vertices in the subgraph owned by working node i in the k-th iteration. When executing the Gather processing based on the GAS abstraction, working node j needs to access the remote data (such as the weights of edges) on working node i.
The target number of edges that the first working node needs to transfer from the second working node in this iteration may be stored as an m × m associated edge matrix X, where the entry $x_{j,i}$ is the number of associated edges of $F^{(k)}_{j,i}$.
Let $A^{(k)} = (a^{(k)}_1, \dots, a^{(k)}_m)$ denote the numbers of active edges on the m working nodes participating in the computation in the k-th iteration, where $a^{(k)}_i$ is the number of active edges on working node i in the k-th iteration. In order to balance the processing time of the working nodes, the optimization goal for generating the first active point transfer policy is:

$$\min_{X} \; \max_{1 \le j \le m} \; \sum_{i=1}^{m} c_{j,i}\, x_{j,i} \tag{1}$$

Moreover, generating the first active point transfer policy needs to satisfy the following constraints:

$$\sum_{j=1}^{m} x_{j,i} = a^{(k)}_i, \quad \forall\, 1 \le i \le m$$

$$x_{j,i} \in \mathbb{N}, \quad x_{j,i} \le |E_i|, \quad \forall\, 1 \le i, j \le m$$

where $E_i$ denotes the set of edges of the subgraph owned by the working node i participating in the computation.
The Dynamic Load Balancing (DLB) problem of generating the first active point transfer policy in equation (1) is NP-hard, so in this embodiment it is simplified to the following min-max problem model. The optimization goal is to minimize the objective function value z. The constraints are:

1) $z \ge \sum_{i=1}^{m} c_{j,i}\, x_{j,i}, \quad \forall\, 1 \le j \le m$;

2) $\sum_{j=1}^{m} x_{j,i} = a^{(k)}_i, \quad \forall\, 1 \le i \le m$;

3) $x_{j,i} \in \mathbb{N}, \quad x_{j,i} \le |E_i|, \quad \forall\, 1 \le i, j \le m$.
And S502, solving a minimum-maximum problem model by using a mixed integer linear programming solver, and determining the number of target edges of the first working node to be transferred from the second working node in the current iteration according to the solving result.
The simplified minimum-maximum problem model is a Mixed Integer Linear Programming (MILP) model, and the Mixed Integer Linear programming model can be solved through a Mixed Integer Linear programming solver to obtain an associated edge matrix X, so that the target edge number of the working node j to be transferred from the working node i in the kth iteration is obtained. The complexity of the mixed integer linear programming solver is only related to the number of working nodes participating in calculation, and for the current GPU-based distributed server, the number of GPUs is usually small (for example, 8), so that the time overhead of generating the active point transfer strategy is usually small and is within an acceptable range.
The associated edge numbers $x_{j,i}$ in the exact solution of the MILP model obtained by the mixed integer linear programming solver may not be integers, so the exact solution is rounded to integers to obtain the final associated edge matrix X. In the associated edge matrix X, $x_{j,i}$ is the number of associated edges of $F^{(k)}_{j,i}$, i.e., the number of edges that working node j should transfer from working node i in the k-th iteration. In particular, when i = j, $x_{i,i}$ indicates the number of edges that working node i needs to process locally.
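For illustration only, the simplified min-max model can be written down with an off-the-shelf mixed integer linear programming toolkit. The patent does not name a particular solver, so the open-source PuLP package is used here purely as a stand-in; the integrality of x is relaxed and recovered by rounding, matching the rounding step described above. C is the cost coefficient matrix, a the per-node active edge counts, and E the per-node subgraph edge counts.

```python
# A minimal sketch of the simplified min-max model, under the assumptions above.
import pulp

def solve_transfer_plan(C, a, E, m):
    prob = pulp.LpProblem("dynamic_load_balance", pulp.LpMinimize)
    z = pulp.LpVariable("z", lowBound=0)
    x = pulp.LpVariable.dicts("x", [(j, i) for j in range(m) for i in range(m)],
                              lowBound=0, cat="Continuous")
    prob += z  # objective: minimize the slowest node's estimated processing time
    for j in range(m):  # constraint 1: z bounds every node's total time cost
        prob += pulp.lpSum(C[j][i] * x[(j, i)] for i in range(m)) <= z
    for i in range(m):  # constraint 2: all active edges of node i are assigned
        prob += pulp.lpSum(x[(j, i)] for j in range(m)) == a[i]
    for j in range(m):  # constraint 3: no assignment exceeds the subgraph size
        for i in range(m):
            prob += x[(j, i)] <= E[i]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # round to integers as described above to obtain the associated edge matrix X
    return [[round(pulp.value(x[(j, i)])) for i in range(m)] for j in range(m)]
```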
Step S503, determining the information of the active point transferred from the second working node by the first working node in the iteration according to the target edge number transferred from the second working node by the first working node in the iteration and the data of the subgraph on the second working node.
After the target number of edges that the first working node needs to transfer from the second working node in this iteration is determined, i.e., after the associated edge matrix X is obtained, for any second working node i the out-degrees of the active points on working node i can be stored sequentially in an array. A prefix sum over consecutive elements of this array is then computed, and the active points corresponding to a run of consecutive elements whose prefix sum reaches the target edge number are taken as the transferred active points. Transferring consecutive active points in this way reduces the atomic operations and the time cost incurred by the active point transfer.
Alternatively, several continuous/discontinuous active points can also be found from the second working node i by other means, so that the sum of the out-degrees of the transferred active points is equal to the target edge number of the second working node i.
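A minimal sketch of the selection step, assuming each working node keeps its active points and their out-degrees in arrays: a contiguous run of active points is accumulated until the running (prefix) sum of out-degrees reaches the target edge number. The function and argument names are illustrative, not from the patent.

```python
def select_transferred_points(active_points, out_degree, target_edges, start=0):
    """Return (points, next_start): a contiguous slice of active points whose
    out-degrees sum to roughly target_edges, beginning at index `start`."""
    chosen, covered = [], 0
    for idx in range(start, len(active_points)):
        if covered >= target_edges:       # target reached: stop the contiguous run
            break
        chosen.append(active_points[idx])
        covered += out_degree[active_points[idx]]
    return chosen, start + len(chosen)
```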
In this embodiment, considering the irregularity of the graph topology, the total traffic of each channel in the communication network between the working nodes cannot be estimated in advance, so it is difficult to generate an active point transfer policy for the whole graph computation process as a whole; the active point transfer policy is therefore generated independently for each iteration. In addition, because the information of the transferred active points is still stored on the original working nodes, the min-max problem model of each iteration is independent and has no influence on subsequent iterations.
In an alternative embodiment, considering that the time cost for the first working node j to process the edge on the second working node i includes a communication time cost and a computation time cost, a communication cost function and a computation cost function may be predefined for computing the communication time cost and the computation time cost for the first working node j to process the edge on the second working node i, respectively.
Illustratively, under any possible active point transfer policy, if the communication time cost and the computation time cost of the first working node j in processing its edges are denoted by $T^{comm}_j$ and $T^{comp}_j$ respectively, then the total time cost of the first working node j may be:

$$T_j = T^{comm}_j + T^{comp}_j$$
specifically, the communication time cost required for the first working node to process one edge on the second working node may be determined according to the network bandwidth between the first working node and the second working node. Typically, a worker node processes a local edge faster than remote edges of other worker nodes, and a worker node processes a remote edge of another worker node connected by two high-speed interconnect channels (e.g., NVLink) faster than a remote edge of another worker node connected by only one high-speed interconnect channel (e.g., NVLink).
Illustratively, the communication time cost may be determined by:

$$comm(j, i) = \frac{1}{B_{i,j}}$$

where $B_{i,j}$ is the network bandwidth between the second working node i and the first working node j, which can be determined through micro-benchmarking and is not described in detail here. When i = j, $B_{i,i}$ represents the local memory bandwidth of working node i.
Specifically, when the computation time cost required by the first working node to process the edge on the second working node is determined, a set of feature data of a vertex in the subgraph on the second working node may be obtained, and the computation time cost required by other first working nodes to process the edge on the second working node is predicted based on the set of feature data and a pre-trained polynomial regression model. The computation time cost mainly comprises the memory access of the working node and the running time of the atomic operation brought by the updating operation.
Optionally, the feature data of the vertices in the subgraph owned by the second working node may include at least one of the following features of the vertices in the subgraph: average in-degree, average out-degree, range of in-degree (including maximum and minimum in-degree), range of out-degree (including maximum and minimum out-degree), Gini coefficient, and degree distribution entropy. The average out-degree (or average in-degree) of the vertices determines the number of neighbors that may be accessed during graph computation, the range of in-degree (or out-degree) describes the diversity of the edge distribution, and the Gini coefficient and degree distribution entropy describe the distribution of the edges; these features affect the atomic operations, memory access patterns, and so on of the graph computation.
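The following sketch, under the assumption that per-vertex in-degree and out-degree arrays are available for each subgraph, shows one way such a feature vector could be assembled; the Gini coefficient and degree-distribution entropy follow their standard definitions, and the helper names are not from the patent.

```python
import numpy as np

def subgraph_features(in_degrees, out_degrees):
    def gini(x):
        x = np.sort(np.asarray(x, dtype=float))
        cum = np.cumsum(x)
        if cum[-1] == 0:
            return 0.0
        n = x.size
        return (n + 1 - 2 * np.sum(cum) / cum[-1]) / n   # Gini of the degree values
    def degree_entropy(x):
        counts = np.bincount(np.asarray(x, dtype=int))
        p = counts[counts > 0] / counts.sum()
        return float(-(p * np.log2(p)).sum())            # entropy of the degree distribution
    return np.array([
        np.mean(in_degrees), np.mean(out_degrees),
        np.min(in_degrees), np.max(in_degrees),          # in-degree range
        np.min(out_degrees), np.max(out_degrees),        # out-degree range
        gini(out_degrees), degree_entropy(out_degrees),
    ])
```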
Illustratively, let $f_i = (f_{i,1}, f_{i,2}, \dots, f_{i,r})$ denote the feature data corresponding to working node i, where $f_{i,t}$ denotes one feature item and r denotes the number of feature items. Each feature item $f_{i,t}$ is a characteristic of the vertices in the subgraph owned by working node i and is therefore closely related to that subgraph. With a computational cost function $g(\cdot)$ defined over the feature data $f_i$, the computation time cost required by the first working node j to process an edge on the second working node i may be determined as:

$$comp(j, i) = g(f_i)$$
Further, the total time cost of the first working node j is:

$$T_j = \sum_{i=1}^{m} \left(\frac{1}{B_{i,j}} + g(f_i)\right) x_{j,i}$$

It can be seen that the time cost required for the first working node j to process one edge on the second working node i is:

$$c_{j,i} = \frac{1}{B_{i,j}} + g(f_i)$$
Further, the computational cost function $g(\cdot)$ is modeled as a polynomial regression model, and the model is trained on training samples to obtain the trained polynomial regression model.
Specifically, the training samples may include the feature data of a plurality of working nodes and the observed actual computation costs. Taking any working node i as an example, one training sample of working node i may be denoted as $[f_i, y_i]$, where $y_i$ represents the actual computation cost. Graph computation is performed on a plurality of real graph datasets and synthetic graph datasets, and run logs are recorded during the process. Training samples are extracted from the run logs of the working nodes at a plurality of different moments, so that multiple different training samples of the same working node can be obtained. The polynomial regression model is then trained on the collected training samples to obtain the trained polynomial regression model.
Alternatively, when training the polynomial regression model, a Stochastic Gradient Descent (SGD) algorithm, or another training method that balances efficiency and accuracy such as Adam (Adaptive Moment Estimation), may be used to train the polynomial regression model with Root Mean Square Relative Error (RMSRE) as the loss function.
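A minimal training sketch under these choices, using PyTorch only as an illustrative framework (the patent does not prescribe one): the feature vector is expanded polynomially, and the weights are fitted by SGD against an RMSRE loss. Costs are assumed to be positive, and all names are illustrative.

```python
import torch

def polynomial_expand(X, degree):
    # concatenate element-wise powers 1..degree as a simple polynomial feature map
    return torch.cat([X ** d for d in range(1, degree + 1)], dim=-1)

def fit_cost_model(features, costs, degree=2, epochs=500, lr=1e-2):
    X = polynomial_expand(torch.as_tensor(features, dtype=torch.float32), degree)
    y = torch.as_tensor(costs, dtype=torch.float32)        # observed per-edge costs (> 0)
    w = torch.zeros(X.shape[1], requires_grad=True)
    b = torch.zeros(1, requires_grad=True)
    opt = torch.optim.SGD([w, b], lr=lr)
    for _ in range(epochs):
        pred = X @ w + b
        loss = torch.sqrt(torch.mean(((pred - y) / y) ** 2))   # RMSRE loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    def g(f_i):  # the learned computation-cost function g(.)
        x = polynomial_expand(torch.as_tensor(f_i, dtype=torch.float32), degree)
        return float(x @ w.detach() + b.detach())
    return g
```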
After the trained polynomial regression model is obtained, the feature data of working node i is input into the trained model, which yields the computational cost $g(f_i)$. According to the network bandwidth $B_{i,j}$ between working node j and working node i, the communication cost function $comm(j, i) = 1 / B_{i,j}$ is determined. The time cost required by working node j to process an edge on working node i can then be determined:

$$c_{j,i} = \frac{1}{B_{i,j}} + g(f_i)$$
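Putting the two cost functions together, a sketch of assembling the cost coefficient matrix C might look as follows, where B[i][j] is the measured bandwidth (local memory bandwidth when i == j), g is the trained cost model, and f[i] is the feature vector of node i's subgraph; all names are illustrative.

```python
def build_cost_matrix(B, f, g, n):
    # C[j][i]: estimated time cost for node j to process one edge stored on node i
    return [[1.0 / B[i][j] + g(f[i]) for i in range(n)] for j in range(n)]
```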
fig. 6 is a flowchart of a distributed graph calculation method according to another embodiment of the present application. The method of the embodiment is applied to a first working node in the distributed graph computing system. The distributed graph computing system comprises a plurality of working nodes, the plurality of working nodes execute graph computing tasks in a distributed mode, and multiple rounds of graph computing are conducted on graph data iteration. The first working node refers to any working node to which the active points of other working nodes (i.e., the second working node) are transferred. As shown in fig. 6, the method comprises the following specific steps:
step S601, receiving a first active point transfer strategy, wherein the first active point transfer strategy comprises: and the first working node transfers the information of the active points to be calculated from the second working node.
The first active point transfer policy is generated by a decision control node in the distributed graph computing system; it is generated, at the beginning of the current iteration, according to the time cost required for work transfer between different working nodes when the load of the working nodes participating in the computation is determined to be unbalanced. For the specific implementation of the decision control node generating the first active point transfer policy, reference is made to the method flow executed by the decision control node in the foregoing method embodiment, which is not repeated in this embodiment.
Step S602, in the current iteration, the information of the transferred active points is obtained according to the first active point transfer strategy, and the graph calculation of the transferred active points is completed.
And if the first working node determines that the first working node transfers the active point of the second working node according to the received first active point transfer strategy, in the iteration, the first working node acquires the information of the transferred active point and completes the graph calculation of the transferred active point so as to share part of the working load of the second working node.
For the second working node transferred with the active point in the distributed graph computing system, according to the received first active point transfer strategy, if determining that part of the active points are transferred by another first working node, the second working node does not execute the graph computation of the transferred active point in the current iteration.
In an alternative embodiment, for any working node in the distributed graph computing system, at least one vertex can be selected as a hyper-point based on the in-degree of the vertices in the graph data, and the adjacency lists of the hyper-points are pre-cached on the plurality of working nodes. Considering that vertices with high in-degree are easily activated multiple times during graph computation, vertices whose in-degree is greater than a preset in-degree threshold are taken as hyper-points according to the in-degree of the vertices in the graph data. Pre-caching the adjacency lists of the hyper-points avoids heavy remote memory accesses, thereby reducing the communication time cost of remote data access and improving the overall efficiency and performance of the distributed graph computation.
Further, in the iteration, for the first working node which has transferred the active points of other working nodes, when the information of the transferred active point needs to be acquired, if the first working node has cached the information of the transferred active point in advance, the first working node reads the information of the transferred active point from the local cache. If the first working node does not cache the information of the transferred active point in advance, the first working node accesses the information of the active point stored by the second working node through a high-speed interconnection channel (such as an inter-GPU high-speed interconnection NVLink) between the first working node and the second working node.
Specifically, when the first working node needs to acquire the information of the transferred active point, it is first determined whether the information of the transferred active point is cached in the local cache. And if the information of the transferred active point exists in the local cache, reading the information of the transferred active point from the local cache. And if the information of the transferred active point does not exist in the local cache, accessing the information of the active point transferred by the first working node, which is stored by the second working node, through a high-speed interconnection (NVLink) between the second working node and the second working node.
Illustratively, the vertices of the cache may be marked using a bitmap, and if the active point transferred by the first working node exists in the bitmap, the first working node reads information of the transferred active point in a local cache instead of remotely accessing the information of the transferred active point from another working node, which may reduce the time overhead of acquiring the information of the transferred active point. If the active point transferred by the first working node does not exist in the bitmap, the first working node accesses the information of the active point transferred by the first working node, which is stored by the second working node, through a high speed interconnect channel (NVLink) with the second working node.
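A sketch of this bitmap-guided read path, with remote_read standing in (hypothetically) for the NVLink access to the owning node:

```python
def read_vertex_info(vertex, cached_bitmap, local_cache, remote_read):
    if cached_bitmap[vertex]:          # hyper-point adjacency pre-cached locally
        return local_cache[vertex]
    return remote_read(vertex)         # fall back to the owner's copy over the interconnect
```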
In the above embodiment, by using the active point transfer policy, the goal is to balance the workload of different working nodes, so as to balance the computation time of each working node.
As shown in FIG. 4, by using the active point transfer policy, GPU0 and GPU1 are both enabled to process 2 vertices and 5 edges. But since GPU1 needs to access the neighborhood of vertex v3 stored on GPU0 via a high speed interconnect channel (e.g., NVLink), rather than via a local memory bus, this increases the computational time cost of the worker node, keeping the worker node processing the same number of edges is not necessarily the most suitable load balancing policy.
In practical applications, graph traversal algorithms such as the SSSP algorithm, the Breadth-First Search (BFS) algorithm, and Delta-PageRank often need many iterations to converge, and when implementing such graph algorithms a distributed graph computing system often suffers from the Long Tail (LT) problem. At the end of such graph algorithms, only a small fraction of the vertices are activated as the frontier; in this case the runtime is dominated not by computation but by latency overhead, including data synchronization between working nodes (e.g., preparing message buffers for communication) and inevitable data movement (e.g., transferring data between the CPU and GPU). These overheads are on the order of milliseconds and are negligible in busy iterations, but over thousands of such latency-bound iterations the overhead is significant, as the proportion of synchronization overhead in the total time can be as high as 21%. Generally, the more working nodes participate in the computation, the longer the latency. The long tail problem severely limits the scalability of the graph algorithm.
In this embodiment, in the process of performing multiple rounds of graph calculation on graph data iteration by executing a graph calculation task through multiple working nodes, if it is determined that the working nodes participating in the calculation satisfy the workload adjustment condition, the number of the working nodes participating in the calculation may also be adjusted to balance the synchronization time between different working nodes and the calculation time of the working nodes, thereby improving the overall efficiency and performance of the distributed graph calculation.
Specifically, if the running time of the previous iteration is less than or equal to the first time threshold, it is determined that the workload adjustment condition is met, and the number of the working nodes participating in the calculation is adjusted. The first time threshold is a smaller value, the running time of one iteration is less than or equal to the first time threshold, which indicates that the work load of the working nodes at the moment is very sparse, and the synchronization overhead among the working nodes is the main overhead of the distributed graph calculation, so that the number of the working nodes participating in the calculation is adjusted, the working nodes participating in the calculation are reduced, the synchronization overhead is reduced, and the overall efficiency and performance of the distributed graph calculation are improved. The first time threshold may be configured and adjusted according to the needs of an actual application scenario, and is not limited specifically here.
And if the running time of the previous iteration is greater than or equal to the second time threshold and the working nodes which do not participate in the calculation exist, determining that the working load adjustment condition is met, and adjusting the number of the working nodes which participate in the calculation. The second time threshold is larger than the first time threshold and is a relatively large value, the running time of one iteration is larger than or equal to the second time threshold, which indicates that the workload of each working node is large, and the number of the working nodes participating in the calculation is increased by adjusting the number of the working nodes participating in the calculation, so that the efficiency of the calculation of the graph is improved, and the overall efficiency and performance of the calculation of the distributed graph are improved. The second time threshold may be configured and adjusted according to the needs of the actual application scenario, and is not limited specifically here.
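The two threshold checks can be summarized as a small decision helper; the sketch below is illustrative, and both thresholds are deployment-specific as noted above.

```python
def should_adjust_workers(prev_iter_time, idle_workers_exist,
                          first_time_threshold, second_time_threshold):
    if prev_iter_time <= first_time_threshold:
        return "shrink"   # synchronization overhead dominates: use fewer nodes
    if prev_iter_time >= second_time_threshold and idle_workers_exist:
        return "grow"     # computation dominates and capacity is unused: use more nodes
    return "keep"
```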
Referring to fig. 7, fig. 7 is a flowchart of an ownership transfer mechanism provided in an embodiment of the present application. As shown in fig. 7, based on the ownership transfer mechanism, adjusting the number of working nodes participating in the computation may be implemented by the following steps:
step S701, determining an ownership transfer policy when different numbers of working nodes participate in the computation according to the high-speed interconnection network among the plurality of working nodes, where the ownership transfer policy includes a third working node participating in the computation and mapping information between the third working node and a fourth working node, where the fourth working node is a working node to which the third working node has transferred the subgraph ownership.
In this embodiment, the ownership transfer policies when different numbers of working nodes participate in the computation are predetermined and stored according to the characteristics of the high-speed interconnection network among the plurality of working nodes, so as to keep that the communication network among the working nodes participating in the computation after using the ownership transfer policies has a larger staggered bandwidth, and the time overhead caused by enumerating all possible ownership transfer policies can be reduced.
For example, fig. 8 shows the topology of the high-speed interconnect channel NVLink among multiple GPUs, and as shown in fig. 8, one bidirectional arrow represents one high-speed interconnect channel NVLink. The communication network between the GPUs has a certain degree of equivalence, for example, deleting GPU2 and GPU3 from the communication network, and deleting GPU4 and GPU5 from the communication network will lose equal bandwidth, and an ownership transfer policy that GPU2 and GPU3 do not participate in the calculation and an ownership transfer policy that GPU4 and GPU5 do not participate in the calculation can be called an equivalent ownership transfer policy. When the ownership transfer strategy is used for reducing the working nodes participating in the calculation, the running time is mainly the synchronization overhead, and the objective of the ownership transfer strategy is to reduce the synchronization time when the calculation time is equivalent to the synchronization time. Thus, the performance resulting from using an equivalent ownership transfer policy is also similar. In addition, the interconnection of the communication networks has asymmetry, as shown in fig. 8, there may be two NVLink channels (50 GB/s) or one NVLink channel (25 GB/s) between two GPUs, or there may be no NVLink channel between any two GPUs, and the speed of the communication links between different GPUs is very different, resulting in higher time cost for communication between some GPU pairs than other GPU pairs. For example, removing GPU0 and GPU3 from the communication network may have a much better communication performance than removing GPU0 and GPU 7. Additionally, there are multiple transfer paths between GPU pairs, e.g., GPU0 may transfer some of the workload of GPU7 by taking GPU1 or GPU6 as an intermediate transfer. Since the amount of transmitted data in each path may vary in each iteration, the performance of the different transfer paths may vary significantly.
In this embodiment, considering the asymmetry of the high-speed interconnect between GPUs, the number of working nodes participating in the computation is enumerated, the ownership transfer policies for different numbers of participating working nodes are determined respectively, and a reduction tree is generated to store the information of the different ownership transfer policies. Using the reduction tree reduces the search space of all transfer policies.
Illustratively, fig. 9 shows the reduction tree of ownership transfer policies determined for the topology of the high-speed interconnect channel NVLink between GPUs shown in fig. 8. As shown in fig. 9, m represents the number of working nodes participating in the computation; when m = 8, all GPUs participate in the computation. When m = 6, the working nodes participating in the calculation include GPU0, GPU6, GPU1, GPU7, GPU2 and GPU3; ownership of GPU4 is transferred by GPU2, and ownership of GPU5 is transferred by GPU3. When m = 4, the working nodes participating in the calculation include GPU0, GPU1, GPU2 and GPU3; ownership of GPU4 is transferred by GPU2, ownership of GPU5 is transferred by GPU3, ownership of GPU6 is transferred by GPU0, and ownership of GPU7 is transferred by GPU1. When m = 2, the working nodes participating in the calculation include GPU0 and GPU3; ownership of GPU6, GPU1 and GPU7 is transferred by GPU0, and ownership of GPU2, GPU4 and GPU5 is transferred by GPU3. When m = 1, only GPU0 participates in the computation, and ownership of the other GPUs 1-7 is transferred by GPU0.
In this embodiment, the working nodes participating in the calculation in different ownership transfer policies are different. The ownership transfer strategy comprises a third working node participating in calculation and mapping information between the third working node and a fourth working node. Wherein the fourth working node is the working node to which the third working node has transferred the ownership of the subgraph.
The ownership transfer policy may be expressed as an n-dimensional vector $O$, where n is the total number of working nodes in the distributed graph computing system, and $O_j = i$ (with $1 \le i, j \le n$) indicates that working node i takes over ownership of the whole subgraph of working node j. The main goal of the ownership transfer policy is to reduce the synchronization overhead between working nodes by excluding some working nodes (fourth working nodes) from the communication network, at the cost of increasing the workload of other working nodes (third working nodes).
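For illustration, the ownership transfer policy vector and the pre-built reduction tree can be held as plain arrays; the 8-GPU mapping below mirrors the example of FIG. 9 and is illustrative only.

```python
# owner[j] = i means working node i holds ownership of node j's subgraph.
REDUCTION_TREE = {
    8: [0, 1, 2, 3, 4, 5, 6, 7],   # every GPU keeps its own subgraph
    6: [0, 1, 2, 3, 2, 3, 6, 7],   # GPU2 takes GPU4, GPU3 takes GPU5
    4: [0, 1, 2, 3, 2, 3, 0, 1],   # plus GPU0 takes GPU6, GPU1 takes GPU7
    2: [0, 0, 3, 3, 3, 3, 0, 0],   # only GPU0 and GPU3 participate
    1: [0, 0, 0, 0, 0, 0, 0, 0],   # single-GPU fallback
}

def participating_nodes(owner):
    return sorted(set(owner))      # nodes that still own at least one subgraph
```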
Step S702, respectively calculating time costs of different ownership transfer policies, and selecting an ownership transfer policy with the smallest time cost as a target ownership transfer policy.
In this embodiment, for a given ownership transfer policy, the time cost of a single iteration is considered, which mainly includes the computation time cost and the synchronization time cost of the working nodes.
The calculation time cost of the ownership transfer policy comprises the time for performing distributed calculation and communication (updating), and is determined by the time used by the last working node for completing the calculation.
The synchronization time cost of the ownership transfer policy includes the start-up time of the working node, inevitable data movement, time to prepare message buffers, etc. The synchronization time cost is essential in each iteration, and is generally proportional to the size of the working nodes participating in the calculation in the distributed graph calculation system, and can be determined by multiplying the number of the working nodes participating in the calculation by a preset synchronization time estimation parameter. The synchronization time estimation parameter may be determined according to an empirical value in combination with an actual application scenario, and is not specifically limited herein.
Specifically, for each ownership transfer policy, the working nodes participating in the computation include only the third working nodes. According to the mapping information between the third working nodes and the fourth working nodes, the active points of a fourth working node are treated as active points of the corresponding third working node, and the subgraph of the fourth working node is treated as a subgraph of that third working node. A second active point transfer policy and its computation time cost are then determined according to the time cost required for work transfer between different third working nodes, and the computation time cost of the second active point transfer policy is taken as the computation time cost of the ownership transfer policy. The synchronization time cost of the ownership transfer policy is determined according to the number of third working nodes participating in the computation under that policy. The total time cost of the ownership transfer policy is determined from its computation time cost and synchronization time cost.
The calculation time costs of the second active point transfer policy and the second active point transfer policy are determined according to the time costs required for performing work transfer between different third working nodes, and the method for generating the first active point transfer policy with the minimum time cost according to the time costs required for performing work transfer between different working nodes in the corresponding embodiment of fig. 5 may be used, and this embodiment is not described again here.
Alternatively, when the total time cost of the ownership transfer policy is determined based on the calculated time cost and the synchronized time cost of the ownership transfer policy, the sum of the calculated time cost and the synchronized time cost of the ownership transfer policy may be used as the total time cost of the ownership transfer policy.
Alternatively, when determining the total time cost of the ownership transfer policy according to the calculated time cost and the synchronization time cost of the ownership transfer policy, different weighting coefficients may be set for the calculated time cost and the synchronization time cost of the ownership transfer policy, respectively, and the result of weighted summation of the calculated time cost and the synchronization time cost of the ownership transfer policy may be used as the total time cost of the ownership transfer policy.
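A sketch of evaluating candidate policies with the total cost z + p·m (computation cost from the min-max model plus a per-node synchronization estimate) and keeping the cheapest one; estimate_z stands in (hypothetically) for solving the min-max model under the given policy.

```python
def pick_ownership_policy(candidates, estimate_z, sync_param_p):
    best, best_cost = None, float("inf")
    for owner in candidates:
        m = len(set(owner))                       # participating (third) working nodes
        total = estimate_z(owner) + sync_param_p * m
        if total < best_cost:
            best, best_cost = owner, total
    return best
```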
Step S703, sending the target ownership transfer policy to the third working node and the fourth working node, so as to control the third working node to obtain the data of the subgraph of the fourth working node and complete the graph computation of the subgraph of the fourth working node.
In this embodiment, using an ownership transfer policy (i.e., adjusting the number of working nodes participating in the computation) has a long-term effect. Once an ownership transfer policy is used, it changes not only the workload of the third working node that takes over ownership and of the fourth working node whose ownership is transferred in the current iteration, but also continues to act in subsequent iterations, with messages sent to the fourth working node being forwarded to the corresponding third working node, until a new ownership transfer policy is used.
Illustratively, when a GPUi transfers ownership of a sub-graph in the GPUj, the GPUj will have no workload to do in the next iteration, only being responsible for forwarding the received message to the GPUi that transferred its ownership, unless it is again involved in the computation using the new ownership transfer policy.
And for a third working node in the distributed graph computing system, acquiring the ownership of the subgraph corresponding to the fourth working node according to the used target ownership transfer strategy, and in the current round and subsequent iterations, the third working node replaces the fourth working node to perform the related graph computation of the subgraph of which the ownership is transferred. And when the third working node needs to acquire the information of the vertex in the transferred subgraph, the third working node can remotely access the related information stored on the fourth working node through the high-speed interconnection channel. At this time, the fourth working node is equivalent to the extended memory of the third working node.
In an alternative embodiment, at least one vertex is selected as a hyper-point based on the in-degree of the vertices in the graph data, and the adjacency lists of the hyper-points are pre-cached on the plurality of working nodes. Considering that vertices with high in-degree are easily activated multiple times during graph computation, vertices whose in-degree is greater than a preset in-degree threshold are taken as hyper-points according to the in-degree of the vertices in the graph data. Pre-caching the adjacency lists of the hyper-points avoids heavy remote memory accesses, thereby reducing the communication time cost of remote data access and improving the overall efficiency and performance of the distributed graph computation.
Further, when the third working node needs to acquire information of vertices in the subgraph transferred based on the ownership transfer policy, if the third working node locally caches the information of the vertices, the third working node reads the information of the vertices from the local cache. If the information of the vertexes is not cached locally by the third working node, the information of the vertexes stored on the fourth working node is remotely accessed by the third working node through a high-speed interconnection channel (such as an NVLink between GPUs).
Illustratively, a bitmap may be used to mark the vertices of the cache, and if a vertex for which the third worker node needs to obtain information exists in the bitmap, the third worker node reads the information of the vertex in the local cache. And if the vertex of the information which needs to be acquired by the third working node does not exist in the bitmap, the third working node remotely accesses the information of the vertex stored by the fourth working node through a high-speed interconnection channel (NVLink).
In this embodiment, let z denote the computation time cost of the ownership transfer policy, p the preset synchronization time estimation parameter, and m the number of working nodes participating in the computation; the total time cost of the ownership transfer policy is then:

$$T_{total} = z + p \cdot m$$

When the workload of each working node is low, especially late in the execution of the graph computation task/graph algorithm, the synchronization time cost ($p \cdot m$) accounts for a major portion of the run time. In this case, the ownership transfer policy in use is adjusted to reduce the number of working nodes participating in the computation and thus reduce the total runtime; for example, the number of working nodes participating in the computation is reduced from 8 to 4. When the workload of the working nodes is large, the computation time cost (z) accounts for a major portion of the run time, so the ownership transfer policy in use is adjusted to increase the number of working nodes participating in the computation, which improves the efficiency of the overall distributed graph computation. By using the ownership transfer mechanism, the ownership transfer policy in use is adaptively adjusted and the scale of the working nodes participating in the computation is adjusted, thereby balancing the parallelism and efficiency of the distributed graph computation.
For example, in the initial phase of graph computation task/graph algorithm execution, all n work nodes participate in the computation, each work node holding a subgraph under the n-way partition of the graph G. In each of the following iterations, the least time-cost ownership transfer policy is selected and used when the workload adjustment condition is satisfied. Once the ownership transfer policy is used, the ownership transfer policy will be distributed to each worker node and a new set of worker nodes will participate in the post-processing. For example, when a breadth-first search (BFS) algorithm is performed on a sparse graph, the early workload may be low, and thus a less-working-node-involved ownership transfer strategy is used to scale down the working nodes involved in the computation, leaving only a few working nodes involved in the computation. As the workload increases, by adjusting the ownership transfer policy used, more worker nodes will participate in the computation, and some/all stolen ownership will be returned. In the case that the workload is extremely low at the later stage of the execution of the graph calculation task/graph algorithm, the ownership transfer strategy using the working nodes participating in the calculation with less number can be adjusted again, so that the sizes of the working nodes participating in the calculation are reduced again to reduce the total overhead.
In this embodiment, based on the asymmetry of the communication network interconnection between the working nodes, the capability of processing edges across the working nodes is described by the cost coefficient matrix, and the accuracy of the cost coefficient matrix is ensured by using an offline polynomial regression model, so that the time cost of the active point transfer strategy can be accurately determined. The distributed graph computing system can automatically sense the condition that the work load is far lower than the computing capacity of the system, so that the scale of the work nodes participating in the computation is reduced in a self-adaptive mode through an ownership transfer strategy, the time overhead of communication and synchronization is reduced, and the overall performance of the distributed graph computation is improved.
In this embodiment, an ownership transfer mechanism is adopted to address the problem that, when the workload is particularly sparse, the computation time of a working node is much shorter than the synchronization time. In this case, at least one third working node directly takes over ownership of the entire data of another fourth working node and removes the fourth working node from the group of working nodes that need to synchronize data, thereby reducing the number of working nodes participating in synchronization and communication. In iterations with more workload, a larger number of working nodes participate in the graph computation; in iterations with less workload, fewer working nodes, or even a single working node, participate in the graph computation, avoiding synchronization waiting overhead. However, choosing which working nodes to remove, and how many, would result in a huge decision space, so the choice of ownership transfer policy is fixed using the reduction tree, thereby reducing the decision space.
In this embodiment, an overall flow of distributed graph computation using an ownership transfer policy and an active point transfer policy will be described. Fig. 10 is a flowchart of a distributed graph computation method combining ownership transfer and active point transfer according to an embodiment of the present application. As shown in fig. 10, the method comprises the following specific steps:
step S101, graph calculation tasks to be executed and graph data of the graph calculation tasks are obtained.
Step S102, dividing the graph data into n sub graphs according to the number n of the working nodes in the distributed graph computing system, and respectively distributing the n sub graphs to the n working nodes.
In this embodiment, the total memory of n working nodes (GPUs) is sufficient to store the entire graph data. For any working node, the vertex in the subgraph assigned to the working node is the internal vertex of the working node, and the other vertices are external vertices.
Each working node is responsible for communication of distributed graph computation, such as sending vertex data to remote working nodes, receiving the vertex data from the remote working nodes, and forwarding messages as transfer stations of other working nodes.
Optionally, each working node has a unique identification ID, and the working node with the lowest ID (or any other working node) can be used as a coordinator to perform processing related to the active point transfer policy and the ownership transfer policy, for example, generating the ownership transfer policy and broadcasting the ownership transfer policy to all the working nodes; and generating an active point transfer strategy and broadcasting the active point transfer strategy.
And step S103, when each iteration starts, each working node processes the message buffer area and determines the active point of the iteration of the current round.
In this step, the vertices whose data changed according to the computation results of the previous round are observed, and whether each such vertex is marked as an active point is then determined according to the computation rules of the graph computation task/graph algorithm.
And step S104, determining whether the workload adjustment condition is met.
If the workload adjustment condition is satisfied, executing step S105; if the workload adjustment condition is not met, the step S106 is directly executed without using a new ownership transfer policy, and the active point transfer policy used in the current iteration is determined.
And step S105, if the workload adjustment condition is met, adjusting the used ownership transfer strategy so as to adjust the number of the working nodes participating in the calculation.
In this embodiment, to solve the DLB and LT problems simultaneously, the ownership transfer policy and the active point transfer policy are allowed to be used simultaneously during distributed graph computation. Since the ownership transfer policy needs to transfer ownership of the entire subgraph, it has a great influence on the livepoint transfer policy. For example, if GPUi transfers ownership of the subgraph of GPUj, GPUj will work like extending memory without participating in subsequent distributed graph computations. Thus, prior to generating the livepoint transfer policy, the ownership transfer policy in use is pre-determined.
The number of working nodes participating in the calculation is enumerated, the corresponding ownership transfer policy is determined through the reduction tree, the synchronization time cost and the computation time cost of each ownership transfer policy are calculated, the ownership transfer policy with the minimum total time cost is selected, and the selected ownership transfer policy is used.
Next, the determination of the active point transfer policy used by the current iteration continues with the use of the selected ownership transfer policy.
And step S106, determining whether the working nodes participating in the calculation meet the load unbalance condition.
And if the working nodes participating in the calculation meet the load imbalance condition, executing the step S107 and generating an active point transfer strategy. If the working nodes participating in the calculation do not satisfy the load imbalance condition, the step S108 is directly executed without using an active point transfer policy.
And S107, if the working nodes participating in the calculation meet the load unbalance condition, using an active point transfer strategy in the iteration.
After an ownership transfer policy is used, some working nodes may not participate in the computation. To allow the ownership transfer policy and the active point transfer policy to interact, on the basis of the embodiment for generating the first active point transfer policy, the entries $c_{j,i}$ in the cost coefficient matrix C corresponding to working nodes that do not participate in the computation are set to infinity, so that when the active point transfer policy is generated, no active points are transferred by working nodes that do not participate in the computation.
And S108, the working nodes perform distributed graph calculation processing based on the current ownership transfer strategy and the active point transfer strategy.
If the ownership transfer strategy and/or the active point transfer strategy are/is used currently, distributed graph calculation is carried out according to the used transfer strategy, information of the vertex is updated according to the GAS calculation model, a message is sent to the neighbor vertex, and the new vertex is activated. If the ownership transfer policy and the livepoint transfer policy are not currently used, conventional distributed graph computation may be performed.
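A condensed, illustrative control loop tying steps S103-S108 together; every member of ops is a placeholder for the corresponding mechanism described in this embodiment, not a literal API.

```python
def run_iteration(node, ops):
    """ops bundles the operations of steps S103-S108; all members are hypothetical."""
    messages = ops.process_message_buffer(node)                       # S103
    active = ops.determine_active_points(node, messages)
    if ops.workload_adjustment_needed(node):                          # S104
        node.ownership_policy = ops.select_ownership_policy(node)     # S105
    if ops.load_imbalanced(node, active):                             # S106
        node.transfer_policy = ops.generate_transfer_policy(node, active)  # S107
    return ops.compute_and_send(node, active)                         # S108: GAS update + messaging
```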
Illustratively, taking a GPU-based distributed graph computing system executing the Single Source Shortest Path (SSSP) algorithm as an example, in each iteration the number of GPUs participating in the computation (denoted by m) is first enumerated and the corresponding ownership transfer policy is determined based on the reduction tree. Then, for each ownership transfer policy, the MILP problem is solved to generate the active point transfer policy, and the runtime (total time cost) estimated when generating the active point transfer policy is recorded. After enumerating all ownership transfer policies, the ownership transfer policy and active point transfer policy that minimize the estimated runtime are selected and used. For example, during execution of the Single Source Shortest Path (SSSP) algorithm, the first 20 iterations process a large number of edges and the computation time cost is much larger than the synchronization time cost, so the ownership transfer policy keeps as many GPUs as possible participating in the computation and does not decide to transfer any ownership. When the algorithm enters the late stage of execution, the synchronization time cost becomes comparable to the computation time cost, and this situation may last for around 200 iterations, so a new ownership transfer policy is used to reduce the number of GPUs participating in the computation. With this work transfer method combining the ownership transfer policy and the active point transfer policy, the execution efficiency and performance of the Single Source Shortest Path (SSSP) algorithm can be significantly improved.
Fig. 11 is a flowchart framework of a distributed graph computation method according to an exemplary embodiment of the present application. As shown in fig. 11 and as an example of a distributed graph computing system with GPUs, edge cutting is used to perform graph partitioning on input graph data to obtain as many sub-graphs as GPUs, each GPU has ownership of sub-graph data, GPU0 and GPU1 are shown in the graph, GPU0 has ownership of data of sub-graph 0, GPU1 has ownership of data of sub-graph 1, and other GPUs are omitted and not shown. When the GPU accesses the locally stored sub-graph data, the local access is only needed, and the data stored on other GPUs need to be remotely accessed. Using the GAS computing model, the distributed graph computing system takes the workflow of GPU0 as an example, and in each iteration, GPU0 receives a message, which may be a message generated by GPU0 in the last iteration (as shown in the figure) or may receive a message generated by another GPU in the last iteration (not shown in the figure). GPU0 decompresses the message, activates a new active point, and makes ownership transfer and active point transfer decisions according to the number of the current active points and the workload to determine the transfer strategies (including ownership transfer strategies and active point transfer strategies) to use. According to the used transfer strategy, the active points (including transferred active points) are processed through a GPU kernel, the output message is compressed according to the processing result, and the message is sent to other GPUs to update the vertex data and complete one round of iteration. Fig. 11 shows a complete work flow of the distributed graph computing system, the work flows of the GPUs in the system are the same, which is only exemplarily described in one possible case, and different processing logics and flows may be possible according to different actual computing logics, which is not described herein again.
The embodiment of the application provides a distributed graph computation method applied to a cloud server of an e-commerce platform, wherein the cloud server comprises a plurality of working nodes, and the method comprises the following steps: acquiring a graph calculation task and a constructed consumer-product relationship graph in an e-commerce scenario; and controlling the plurality of working nodes to execute the graph calculation task according to the consumer-product relationship graph, iteratively performing multiple rounds of graph calculation on the consumer-product relationship graph, and outputting the processing result of the graph calculation task.
In each iteration of the graph calculation task, if the load of the working nodes participating in the calculation is determined to be unbalanced, a first active point transfer strategy is determined according to the time cost required for work transfer among different working nodes; the first active point transfer strategy is then sent to the working nodes where the work transfer occurs, so as to control those working nodes to complete the graph calculation of the transferred active points according to the first active point transfer strategy. This may be implemented in a manner similar to step S302; for details, reference is made to the relevant content of step S302, which is not repeated here.
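As a hedged sketch of the imbalance check and the per-edge time cost estimate used to size such a transfer (see also claims 2 and 3 below), the following Python fragment uses a toy linear stand-in for the pre-trained polynomial regression model. The threshold values, function names, and cost formula details are illustrative assumptions, not values taken from the patent.

```python
# Illustrative sketch only: detects load imbalance and sizes an active point
# transfer using a per-edge cost = communication time cost + computation time cost.
# THRESHOLD_* values and poly_regression_cost are assumptions.

THRESHOLD_MAX_ACTIVE = 1000      # first active-quantity threshold (assumed)
THRESHOLD_GAP = 500              # second active-quantity threshold (assumed)

def is_unbalanced(active_counts):
    """active_counts: list of active point counts, one per participating node."""
    return (max(active_counts) >= THRESHOLD_MAX_ACTIVE and
            max(active_counts) - min(active_counts) >= THRESHOLD_GAP)

def poly_regression_cost(vertex_features):
    """Toy stand-in for the pre-trained polynomial regression model."""
    avg_degree = vertex_features["avg_degree"]
    return 1e-6 * avg_degree + 1e-7 * avg_degree ** 2

def per_edge_cost(bandwidth_gb_per_s, vertex_features, bytes_per_edge=16):
    communication = bytes_per_edge / (bandwidth_gb_per_s * 1e9)  # seconds per edge
    computation = poly_regression_cost(vertex_features)
    return communication + computation

def edges_to_transfer(total_edges_on_busy_node, cost_local, cost_remote):
    """Split edges so both nodes finish at roughly the same time."""
    share = cost_local / (cost_local + cost_remote)
    return int(total_edges_on_busy_node * share)

if __name__ == "__main__":
    counts = [5000, 120]                                         # active points per node
    if is_unbalanced(counts):
        c_remote = per_edge_cost(10.0, {"avg_degree": 8})        # idle node pulling work
        c_local = poly_regression_cost({"avg_degree": 8})        # busy node, no communication
        print("transfer", edges_to_transfer(20000, c_local, c_remote), "edges")
```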
The embodiment of the application provides an electronic device. The electronic device includes: a processor, and a memory communicatively coupled to the processor, the memory storing computer-executable instructions. The processor executes the computer-executable instructions stored in the memory to implement the scheme provided by the decision control node, the first working node, or any other working node in any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
The embodiments of the present application provide a computer-readable storage medium in which computer-executable instructions are stored. When executed by a processor, the computer-executable instructions are used to implement the method provided in any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
An embodiment of the present application provides a computer program product comprising a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device performs the method provided by any of the above method embodiments; the specific functions and achievable technical effects are not described again here.
The embodiment of the application provides a chip including a processing module capable of executing the technical scheme of the cloud device in the above method embodiments. Optionally, the chip further includes a storage module (e.g., a memory) configured to store instructions, and the processing module is configured to execute the instructions stored in the storage module so that the processing module performs the method provided in the foregoing embodiments. The implementation principle and technical effects are similar and are not described again here.
In addition, some of the flows described in the above embodiments and drawings include a plurality of operations in a specific order, but it should be clearly understood that these operations may be executed out of the order presented herein or in parallel; the sequence numbers are only used to distinguish different operations and do not represent any execution order. The flows may also include more or fewer operations, which may be executed sequentially or in parallel. It should be noted that the descriptions of "first", "second", etc. herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequential order, nor do they limit the types of the "first" and "second" items. Unless specifically limited otherwise, "a plurality" means two or more.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the application, including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. The specification and examples are to be considered exemplary only, with the true scope and spirit of the application being indicated by the following claims. It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the application is limited only by the appended claims.

Claims (12)

1. A distributed graph computation method is applied to a decision control node in a distributed graph computation system, wherein the distributed graph computation system comprises a plurality of working nodes, and the method comprises the following steps:
acquiring graph calculation tasks to be executed and graph data of the graph calculation tasks, controlling the plurality of working nodes to execute the graph calculation tasks, performing multiple rounds of graph calculation on the graph data in an iteration mode, and outputting processing results of the graph calculation tasks;
in each iteration of the graph computation task, if it is determined that the loads of the working nodes participating in the computation are unbalanced, determining a first active point transfer strategy according to a time cost required by a first working node to process an edge on a second working node, wherein the first active point transfer strategy comprises: the first working node transfers the graph computation work of the transferred active point from the second working node to the first working node, the information of the transferred active point is still stored on the second working node, and the first active point transfer strategy is only effective in the current iteration, wherein the time cost required for the first working node to process one edge on the second working node comprises a communication time cost and a computation time cost, the communication time cost is determined according to the network bandwidth between the first working node and the second working node, and the computation time cost is determined according to the characteristic data of the vertices in the subgraph stored by the second working node through a pre-trained polynomial regression model;
and sending the first active point transfer strategy to the first working node and the second working node so as to control the first working node to acquire the information of the transferred active points according to the first active point transfer strategy and complete the graph calculation of the transferred active points.
2. The method of claim 1, wherein determining the load imbalance of the working nodes participating in the computation comprises:
determining the maximum number of active points and the minimum number of active points according to the number of active points on the working nodes participating in calculation;
and if the maximum number of active points is greater than or equal to a first active quantity threshold, and the difference between the maximum number of active points and the minimum number of active points is greater than or equal to a second active quantity threshold, determining that the load of the working nodes participating in the calculation is unbalanced.
3. The method of claim 1, wherein determining the first active point transition policy based on a cost of time required for the first working node to process an edge on the second working node comprises:
determining the target edge number of the first working node to be transferred from the second working node in the iteration according to the time cost of the first working node for processing one edge on the second working node and the number of the associated edges of the active points on the second working node;
determining the active point of the first working node transferred from the second working node in the iteration according to the target edge number of the first working node to be transferred from the second working node in the iteration and the data of the subgraph on the second working node;
the first active point transfer policy includes active point information to be calculated for the first working node to transfer from the second working node.
4. The method according to any one of claims 1-3, further comprising:
and executing the graph calculation task through a plurality of working nodes, and adjusting the number of the working nodes participating in calculation if the working load adjustment condition is determined to be met in the process of carrying out multiple rounds of graph calculation on the graph data iteration.
5. The method of claim 4, wherein the adjusting the number of working nodes participating in the computation comprises:
determining an ownership transfer strategy when different numbers of working nodes participate in computation according to the interconnected network among the working nodes, wherein the ownership transfer strategy comprises a third working node participating in computation and mapping information between the third working node and a fourth working node, and the fourth working node is a working node to which the third working node transfers the sub-graph ownership;
respectively calculating the time cost of different ownership transfer strategies, and selecting the ownership transfer strategy with the minimum time cost as a target ownership transfer strategy;
and sending the target ownership transfer strategy to the third working node and the fourth working node so as to control the third working node to acquire data of the subgraph of the fourth working node and complete graph calculation of the subgraph of the fourth working node.
6. The method of claim 5, wherein separately calculating the time cost for different ownership transfer policies comprises:
respectively aiming at a third working node participating in calculation in each ownership transfer strategy and mapping information between the third working node and a fourth working node, taking an active point of the fourth working node as an active point of the third working node, determining a second active point transfer strategy and a calculation time cost of the second active point transfer strategy according to time costs required by work transfer between different third working nodes, and taking the calculation time cost of the second active point transfer strategy as the calculation time cost of the ownership transfer strategy;
determining the synchronization time cost of the ownership transfer strategy according to the number of the third working nodes participating in calculation in the ownership transfer strategy;
determining a total time cost of the ownership transfer policy based on the calculated time cost and the synchronized time cost of the ownership transfer policy.
7. A distributed graph computation method is applied to a first working node in a distributed graph computation system, the distributed graph computation system comprises a plurality of working nodes, the plurality of working nodes execute graph computation tasks in a distributed mode, and multiple rounds of graph computation are conducted on graph data iteration, and the method comprises the following steps:
receiving a first active point transfer policy, the first active point transfer policy being determined according to a time cost required for a first working node to process an edge on a second working node when determining load imbalance of working nodes participating in computation, the first active point transfer policy comprising: the first working node transfers the graph computation work of the transferred active point from the second working node to the first working node, the information of the transferred active point is still stored on the second working node, the first active point transfer strategy is only effective in the current iteration, the time cost required for the first working node to process one edge on the second working node comprises communication time cost and computation time cost, the communication time cost is determined according to network bandwidth between the first working node and the second working node, and the computation time cost is determined according to characteristic data of a vertex in a subgraph stored by the second working node through a pre-trained polynomial regression model;
in the iteration of the current round, the information of the transferred active points is obtained according to the first active point transfer strategy, and the graph calculation of the transferred active points is completed.
8. The method of claim 7, further comprising:
and selecting at least one vertex as a super point according to the in-degree of the vertex in the graph data, and caching an adjacent list of the super point on the plurality of working nodes in advance.
9. The method of claim 8, wherein obtaining information of the transferred active point comprises:
if the information of the transferred active point exists in the local cache, reading the information of the transferred active point from the local cache;
and if the information of the transferred active point does not exist in the local cache, accessing the information of the transferred active point stored by the second working node through an interconnection channel between the local cache and the second working node.
10. A distributed graph computing method is applied to a cloud server of an e-commerce platform, wherein the cloud server comprises a plurality of working nodes, and the method comprises the following steps:
acquiring a graph calculation task and a constructed consumer-product relation graph in an e-commerce scene;
controlling the plurality of working nodes to execute the graph calculation tasks according to the consumer-product relationship graph, performing multiple rounds of graph calculation on the consumer-product relationship graph in an iteration mode, and outputting processing results of the graph calculation tasks;
in each iteration of the graph computation task, if it is determined that the loads of the working nodes participating in the computation are unbalanced, determining a first active point transfer strategy according to a time cost required by a first working node to process an edge on a second working node, wherein the first active point transfer strategy comprises: the first working node transfers the graph computation work of the transferred active point from the second working node to the first working node, the information of the transferred active point is still stored on the second working node, and the first active point transfer strategy is only effective in the current iteration, wherein the time cost required for the first working node to process one edge on the second working node comprises a communication time cost and a computation time cost, the communication time cost is determined according to the network bandwidth between the first working node and the second working node, and the computation time cost is determined according to the characteristic data of the vertices in the subgraph stored by the second working node through a pre-trained polynomial regression model;
and sending the first active point transfer strategy to the working nodes where work transfer occurs, so as to control the first working node to acquire the information of the transferred active point according to the first active point transfer strategy and complete the graph calculation of the transferred active point.
11. A distributed graph computing system, comprising a decision control node and a plurality of working nodes, wherein:
the decision control node is used for acquiring graph calculation tasks to be executed and graph data of the graph calculation tasks, controlling the plurality of working nodes to execute the graph calculation tasks, performing multiple rounds of graph calculation on the graph data in an iteration mode, and outputting processing results of the graph calculation tasks;
the decision control node is further adapted to: in each iteration of the graph computation task, if the load of the working nodes participating in the computation is determined to be unbalanced, determining a first active point transfer strategy according to the time cost required by the first working node to process one edge on a second working node, and sending the first active point transfer strategy to the first working node and the second working node where the work transfer occurs, wherein the first active point transfer strategy comprises the following steps: the first working node transfers the graph computation work of the transferred active point from the second working node to the first working node, the information of the transferred active point is still stored on the second working node, the first active point transfer strategy is only effective in the current iteration, the time cost required for the first working node to process one edge on the second working node comprises communication time cost and computation time cost, the communication time cost is determined according to network bandwidth between the first working node and the second working node, and the computation time cost is determined according to characteristic data of a vertex in a subgraph stored by the second working node through a pre-trained polynomial regression model;
and the first working node is used for receiving the first active point transfer strategy, acquiring the information of the active points transferred from the second working node according to the first active point transfer strategy in the iteration, and completing the graph calculation of the transferred active points.
12. A computer-readable storage medium having computer-executable instructions stored therein, which when executed by a processor, are configured to implement the method of any one of claims 1-10.


