CN117218459B - Distributed node classification method and device - Google Patents

Distributed node classification method and device

Info

Publication number
CN117218459B
CN117218459B
Authority
CN
China
Prior art keywords
node
nodes
order vector
graph
file system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311483982.XA
Other languages
Chinese (zh)
Other versions
CN117218459A (en)
Inventor
朱仲书
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202311483982.XA priority Critical patent/CN117218459B/en
Publication of CN117218459A publication Critical patent/CN117218459A/en
Application granted granted Critical
Publication of CN117218459B publication Critical patent/CN117218459B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 30/00: Reducing energy consumption in communication networks
    • Y02D 30/70: Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of this specification relate to a distributed node classification method and device. The method is applied to any first working device among multiple working devices of a distributed system and comprises the following steps: obtaining first sub-graph data of the full graph data, where the data in the first sub-graph may be privacy data; performing K rounds of model processing on all nodes in the first sub-graph data using a K-layer graph neural network to obtain K-order vector representations of the nodes, where the i-th round of model processing includes obtaining the (i-1)-order vector representations of each node and its several neighbor nodes from a distributed file system, inputting them into the i-th layer of the graph neural network to obtain i-order vector representations of each node, and storing the i-order vector representations of each node in the distributed file system; and obtaining the K-order vector representation of a target node to be classified from the distributed file system and inputting it into a node classification model to obtain a classification result for the target node.

Description

Distributed node classification method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of graph processing, and in particular, to a distributed node classification method and apparatus.
Background
In recent years, as a tool for expressing complex relationships between data in the real world, graph data has received increasing attention, and one important application is to model nodes in a graph using a graph neural network (Graph Neural Networks, GNNs), and then predict whether the nodes have a certain attribute, i.e., node classification, by using a trained model. The graph data may be private data, such as data generated by a user during a transaction.
As the scale of graph data continues to expand and graph models grow more complex, performing node classification tasks on graph data at the scale of billions of nodes or more requires significant resources. Since GNNs essentially compute layer by layer following a message-passing paradigm, the traditional sample-by-sample computing mode introduces a large amount of repeated computation during the model prediction phase, which limits scalability.
Disclosure of Invention
One or more embodiments of the present disclosure describe a distributed node classification method and apparatus which, in view of the computing characteristics of graph neural networks, store intermediate results generated during computation in a distributed file system for reuse, thereby reducing data redundancy and improving operation efficiency.
In a first aspect, a distributed node classification method is provided, which is applied to any first working device in a plurality of working devices in a distributed system, and includes:
acquiring first sub-graph data of the full graph data;
performing K-round model processing on all nodes in the first sub-graph data by using a K-layer graph neural network to obtain K-order vector representations of all nodes, wherein the ith-round model processing comprises the steps of acquiring i-1-order vector representations of all nodes and a plurality of neighbor nodes thereof from a distributed file system, inputting the i-1-order vector representations into the ith-layer graph neural network to obtain i-order vector representations of all nodes, and storing the i-order vector representations of all nodes into the distributed file system; the distributed file system is shared by the plurality of working devices;
and obtaining the K-order vector representation of the target node to be classified from the distributed file system, and inputting the K-order vector representation into a node classification model to obtain a classification result of the target node.
In one possible embodiment, the method further comprises:
for any node in the first subgraph, a graph sampling algorithm is used to determine N neighbor nodes from all the one-hop neighbor nodes.
In one possible implementation, obtaining an i-1 order vector representation of each node and its multiple neighbor nodes from a distributed file system includes:
the i-1 order vector representations of each node and N neighbor nodes thereof are obtained from the distributed file system.
In one possible implementation, the graph sampling algorithm includes: random sampling, uniform sampling, weighted sampling, and type sampling.
In one possible implementation, the first sub-graph data is obtained by the plurality of working devices in the distributed system dividing the full graph data using a graph segmentation algorithm.
In one possible implementation, the graph cut algorithm includes: METIS and a distributed neighbor expansion algorithm.
In one possible implementation, the node classification model is a multi-layer perceptron (MLP).
In a second aspect, there is provided a distributed node classification apparatus deployed on any first working device of a plurality of working devices of a distributed system, including:
an acquisition unit configured to acquire first sub-graph data of the full graph data;
the vector calculation unit is configured to perform K-round model processing on all nodes in the first sub-graph data by using a K-layer graph neural network to obtain K-order vector representations of all the nodes, wherein the ith-round model processing comprises the steps of acquiring i-1-order vector representations of all the nodes and a plurality of neighbor nodes thereof from a distributed file system, inputting the i-1-order vector representations into the ith-layer graph neural network to obtain i-order vector representations of all the nodes, and storing the i-order vector representations of all the nodes into the distributed file system; the distributed file system is shared by the plurality of working devices;
the node classification unit is configured to acquire a K-order vector representation of a target node to be classified from the distributed file system, and input the K-order vector representation into a node classification model to obtain a classification result of the target node.
In one possible embodiment, the apparatus further comprises:
and the neighbor sampling unit is configured to determine N neighbor nodes from all the one-hop neighbor nodes by using a graph sampling algorithm for any node in the first subgraph.
In a third aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory and a processor, wherein the memory has executable code stored therein, and wherein the processor, when executing the executable code, implements the method of the first aspect.
According to the distributed node classification method and device provided by the embodiments of this specification, the method leverages the computing characteristics of the graph neural network and stores intermediate results generated during computation in a distributed file system for reuse, thereby reducing data redundancy, improving operation efficiency and scalability, and making it possible to complete node classification tasks on ultra-large-scale graph data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments disclosed in the present specification, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only examples of the embodiments disclosed in the present specification, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 illustrates a schematic diagram of computing node vector representations with a conventional GNN, according to one embodiment;
FIG. 2 illustrates an implementation scenario diagram of a distributed node classification method according to one embodiment;
FIG. 3 illustrates a flow diagram of a distributed node classification method according to one embodiment;
FIG. 4 shows a schematic block diagram of a distributed node classification apparatus according to an embodiment.
Detailed Description
The following describes the scheme provided in the present specification with reference to the drawings.
In graph inference, a node classification task refers to predicting whether a node to be predicted is of a certain specific type according to the known graph structure and node attributes. For example, given a natural person node, it is predicted whether there is a house registered under that natural person's name.
The node classification task relies on the encoded representations of the nodes in the graph. Specifically, the nodes in the graph are encoded by a graph neural network (GNN) to obtain their encoded representations; at prediction time, a prediction network derives the classification result for a given node from its encoded representation.
When a conventional graph neural network (GNN) model encodes nodes, the M-hop neighbor nodes of each node in the graph are sampled in batches, and an M-layer GNN then performs aggregation computation for each node to obtain its M-order vector representation. In this process, the intermediate results computed for any node u are discarded directly, and only the final M-order vector representation of node u is kept. As a result, when the M-order vector representation of a node v adjacent to node u is computed, the intermediate results from computing node u's M-order vector representation cannot be reused and must be recomputed from scratch, which reduces the efficiency of the overall node classification task. Meanwhile, batch-sampling the M-hop neighbor nodes of every node in the graph also causes a great deal of data redundancy.
For example, FIG. 1 shows a schematic diagram of computing node vector representations with a conventional GNN, according to one embodiment. As shown in FIG. 1, the 2-order vector representations of node 1 and node 2 are computed using a 2-layer GNN model. When the 2-hop neighbor subgraphs of node 1 and node 2 are sampled, node 3 is sampled into both subgraphs. When the 2-order vector representation of node 1 is computed by the 2-layer GNN model, an intermediate result, namely the 1-order vector representation of node 3, is produced, but it is discarded as soon as the computation for node 1 finishes. When the 2-order vector representation of node 2 is subsequently computed, the 1-order vector representation of node 3 has to be recomputed. This repeated computation makes the process inefficient overall and thereby affects the efficiency of subsequent node classification tasks.
To solve the above problem, FIG. 2 shows a schematic diagram of an implementation scenario of a distributed node classification method according to one embodiment. In the example of FIG. 2, the full graph data on which the node classification task is performed is partitioned by a graph splitting algorithm into multiple sub-graphs, which are sent to the multiple working devices in the distributed system, respectively. A K-layer graph neural network (GNN) runs on each working device. The layer-1 GNN calculates a 1-order vector representation of each node from the initial vector representations (0-order vector representations) of that node and its several neighbor nodes in the subgraph, and stores the 1-order vector representations in a distributed file system, whose data may be shared by the multiple working devices in the distributed system. Then, when the layer-2 GNN calculates the 2-order vector representations from the 1-order vector representations of each node and its several neighbor nodes in the subgraph, it does not need to recompute them from scratch; it simply reads them from the distributed file system, and then stores the calculated 2-order vector representations back into the distributed file system, and so on. The i-th layer of the graph neural network acquires the (i-1)-order vector representations of each node and its several neighbor nodes from the distributed file system, calculates the i-order vector representation of each node, and stores it in the distributed file system. Finally, the K-th layer GNN stores the calculated K-order vector representation of each node into the distributed file system for use by the subsequent node classification model. These steps are executed by the multiple working devices in the distributed system respectively, so that the K-order vector representations of all nodes in the whole graph are obtained and stored in the distributed file system.
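As a minimal sketch of the per-device loop described above (assuming hypothetical helpers dfs_read, dfs_write, and sample_neighbors for the shared distributed file system and one-hop sampling, and gnn_layers for the K graph neural network layers; none of these names are defined by this specification), each working device might proceed roughly as follows:

```python
# Illustrative sketch only: one working device runs K rounds over its sub-graph,
# reading (i-1)-order vectors from the shared distributed file system and
# writing i-order vectors back so that any device can reuse them.

def compute_k_order_vectors(subgraph_nodes, K, gnn_layers,
                            sample_neighbors, dfs_read, dfs_write):
    for i in range(1, K + 1):
        for v in subgraph_nodes:
            neighbors = sample_neighbors(v)                     # one-hop neighbors only
            h_self = dfs_read(v, order=i - 1)                   # (i-1)-order vector of v
            h_nbrs = [dfs_read(u, order=i - 1) for u in neighbors]
            h_v = gnn_layers[i - 1](h_self, h_nbrs)             # i-order vector of v
            dfs_write(v, order=i, vector=h_v)                   # cached for reuse by any device
```

After the K-th round, the K-order vector representation of every node handled by this device is available in the distributed file system for the node classification model.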
The node classification model also runs on any of the working devices. For any target node to be classified, the K-order vector representation of the target node is obtained from the distributed file system and input into the node classification model to obtain a classification result for the target node, which may specifically include whether the target node has a target attribute.
The specific implementation steps of the above distributed node classification method are described below with reference to specific embodiments. FIG. 3 illustrates a flow chart of a distributed node classification method according to an embodiment; the method may be executed by any platform, server, or device cluster with computing and processing capabilities. It should be noted that the distributed system includes a plurality of working devices, and FIG. 3 only shows the steps performed on any first working device. The steps performed on the other working devices in the distributed system can be deduced by reference to the steps in FIG. 3.
As shown in FIG. 3, the distributed node classification method according to an embodiment is applied to any first working device among the multiple working devices of a distributed system and includes at least: step 302, obtaining first sub-graph data of the full graph data; step 306, performing K rounds of model processing on all nodes in the first sub-graph data using a K-layer graph neural network to obtain K-order vector representations of the nodes, where the i-th round of model processing includes acquiring the (i-1)-order vector representations of each node and its several neighbor nodes from a distributed file system, inputting them into the i-th layer of the graph neural network to obtain the i-order vector representations of each node, and storing the i-order vector representations of each node in the distributed file system, the distributed file system being shared by the multiple working devices; and step 308, obtaining the K-order vector representation of the target node to be classified from the distributed file system and inputting it into a node classification model to obtain a classification result for the target node.
First, in step 302, first sub-graph data of full graph data is acquired.
The first sub-graph data may be graph structure data indicating only the connection relationships of the nodes in the sub-graph, and not including vector representations of the nodes, to conserve storage resources. The vector representation of the node may be obtained from a distributed file system.
In one embodiment, the first sub-graph data is obtained by the multiple working devices in the distributed system dividing the full graph data using a graph segmentation algorithm. The full graph may be partitioned using a variety of graph cut algorithms, such as METIS or the distributed neighbor expansion algorithm DistributedNE (Distributed Neighbor Expansion). By using a graph splitting algorithm, nodes adjacent to each other in the whole graph can be divided into the same sub-graph, so that when the neighbor nodes of each node are sampled in the subsequent step 306, the sampling can be performed directly from the sub-graph on a single working device, reducing communication between working devices and further improving operation efficiency.
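For illustration only, the sketch below shows the general shape of such a partitioning step. The hash-based assignment is a simplified stand-in for METIS or DistributedNE, and full_graph and num_workers are assumed names, so this is not the partitioning actually mandated by the specification:

```python
from collections import defaultdict

def partition_graph(full_graph, num_workers):
    """Simplified stand-in for METIS / DistributedNE: assign each node to a worker,
    then keep, per worker, the edges (and neighbor node ids) that worker needs."""
    assignment = {v: hash(v) % num_workers for v in full_graph.nodes}
    subgraphs = defaultdict(lambda: {"nodes": set(), "edges": []})
    for u, v in full_graph.edges:
        w = assignment[u]                      # the edge follows its source node's worker
        subgraphs[w]["nodes"].update((u, v))   # neighbor ids are kept as structure only
        subgraphs[w]["edges"].append((u, v))
    return subgraphs  # node vectors are not stored here; they live in the distributed file system
```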
Then, in step 306, performing K-round model processing on all nodes in the first sub-graph data by using a K-layer graph neural network to obtain K-order vector representations of each node, where the i-th round model processing includes obtaining i-1-order vector representations of each node and a plurality of neighbor nodes thereof from a distributed file system, inputting the i-1-order vector representations into the i-th layer graph neural network to obtain i-order vector representations of each node, and storing the i-order vector representations of each node in the distributed file system; the distributed file system is shared by the plurality of work devices.
The several neighbor nodes of any node are its direct neighbors, i.e., one-hop neighbor nodes. Specifically, the initial vector representations (0-order vector representations) of each node and its several neighbor nodes in the first sub-graph are first obtained, then input into the layer-1 graph neural network to calculate the 1-order vector representation of each node, and the 1-order vector representations of each node are stored in the distributed file system. The initial vector representation may be a one-hot encoded feature, or an embedding vector obtained by encoding text or pictures with a corresponding encoder, which is not limited here.
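As a hedged example of how the 0-order representations might be produced and cached before the first round (the one-hot scheme over a categorical node attribute and the dfs_write helper are illustrative assumptions):

```python
import numpy as np

def init_order_zero(subgraph_nodes, node_type, num_types, dfs_write):
    """Write a one-hot 0-order vector for each node into the shared distributed file system."""
    for v in subgraph_nodes:
        h0 = np.zeros(num_types, dtype=np.float32)
        h0[node_type[v]] = 1.0          # one-hot encoding of the node's categorical attribute
        dfs_write(v, order=0, vector=h0)
```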
And then, acquiring 1-order vector representations of each node and a plurality of neighbor nodes in the first sub-graph from the distributed file system, inputting the 1-order vector representations into the layer 2 graph neural network, calculating 2-order vector representations of each node, and storing the 2-order vector representations of each node into the distributed file system.
And by analogy, the ith round of model processing comprises the steps of obtaining i-1 order vector representations of each node and a plurality of neighbor nodes thereof in a first sub-graph from a distributed file system, inputting the i-1 order vector representations into an ith layer of graph neural network, calculating the i-order vector representations of each node, and storing the i-order vector representations of each node into the distributed file system.
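The specification does not fix the internal form of each layer; as one common possibility, the i-th layer could be a GraphSAGE-style mean aggregation followed by a linear transform, sketched below under that assumption:

```python
import numpy as np

class MeanAggregatorLayer:
    """One possible i-th layer: mean-aggregate the neighbors' (i-1)-order vectors,
    concatenate with the node's own (i-1)-order vector, then apply a linear map and ReLU."""
    def __init__(self, in_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(2 * in_dim, out_dim)).astype(np.float32)

    def __call__(self, h_self, h_nbrs):
        h_agg = np.mean(h_nbrs, axis=0) if h_nbrs else np.zeros_like(h_self)
        z = np.concatenate([h_self, h_agg]) @ self.W
        return np.maximum(z, 0.0)  # ReLU
```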
The distributed file system is shared by a plurality of working devices. The steps 302 and 306 are performed by a plurality of working devices in the distributed system, so that the K-order vector representation of all nodes in the whole graph can be obtained and stored in the distributed file system.
By caching, in step 302 and step 306, the intermediate data generated during the computation of each GNN layer in the distributed file system, the intermediate results for each node can be reused when computing different nodes and shared among multiple working devices, eliminating the large amount of repeated computation in conventional GNNs. Taking FIG. 1 as an example, with the scheme of the above embodiment the 1-order vector representation of node 3 is stored in the distributed file system and read when computing the 2-order vectors of node 1 and node 2. For another example, assume that in the full graph data node A and node B are second-order neighbors but are assigned to a first working device and a second working device, respectively. The intermediate-order vector representation of node A calculated by the first working device can be stored in the distributed file system and read by the second working device to calculate the higher-order vector representation of node B.
Furthermore, according to the above embodiment, the working device only needs to sample the one-hop neighbor nodes of any node in the first subgraph, rather than the N-hop neighbor nodes as in a conventional GNN, which further reduces the amount of computation during sampling.
In some possible implementations, before step 306, the method further includes step 304 of determining, for any node in the first subgraph, N neighbor nodes from all its one-hop neighbor nodes using a graph sampling algorithm.
Wherein the graph sampling algorithm may include: random sampling, uniform sampling, weighted sampling, and type sampling.
In this case, acquiring the (i-1)-order vector representations of each node and its several neighbor nodes from the distributed file system in step 306 specifically includes: acquiring the (i-1)-order vector representations of each node and its N neighbor nodes from the distributed file system.
By using the graph sampling algorithm, data expansion caused when the graph data size is excessively large can be prevented.
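A minimal sketch of step 304 is given below, covering the random/uniform and weighted cases; the function name and arguments are assumptions for illustration (type-based sampling would additionally filter the candidate list by node type):

```python
import numpy as np

def sample_one_hop_neighbors(neighbors, N, weights=None, seed=None):
    """Pick N neighbors from a node's one-hop neighbor list.
    weights=None gives uniform sampling; otherwise sampling is weighted."""
    rng = np.random.default_rng(seed)
    neighbors = list(neighbors)
    if len(neighbors) <= N:
        return neighbors
    if weights is None:
        idx = rng.choice(len(neighbors), size=N, replace=False)
    else:
        p = np.asarray(weights, dtype=np.float64)
        idx = rng.choice(len(neighbors), size=N, replace=False, p=p / p.sum())
    return [neighbors[i] for i in idx]
```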
Finally, in step 308, a K-order vector representation of the target node to be classified is obtained from the distributed file system, and is input into a node classification model to obtain a classification result for the target node.
The output of the node classification model may be a probability value indicating the probability that the target node has a certain target attribute, and when the probability value is greater than a preset first threshold, the classification result indicates that the target node has the target attribute.
In one embodiment, any node may have multiple attributes at the same time. In this case, the output of the node classification model may be multiple probability values corresponding to the multiple attributes; based on the comparison between the probability value of each attribute and its corresponding preset threshold, the attributes whose probability values are greater than their thresholds are determined as the classification result of the node.
In one embodiment, the node classification model may be a multi-layer perceptron (MLP) model.
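As a minimal sketch of such a classification head (the two-layer structure, random weights, and per-attribute thresholds are assumptions; in practice the weights would come from training):

```python
import numpy as np

class MLPClassifier:
    """Two-layer MLP head: one sigmoid output per candidate attribute of the target node."""
    def __init__(self, in_dim, hidden_dim, num_attrs, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(in_dim, hidden_dim)).astype(np.float32)
        self.W2 = rng.normal(scale=0.1, size=(hidden_dim, num_attrs)).astype(np.float32)

    def predict(self, h_k, thresholds):
        hidden = np.maximum(h_k @ self.W1, 0.0)                 # hidden layer with ReLU
        probs = 1.0 / (1.0 + np.exp(-(hidden @ self.W2)))       # per-attribute probabilities
        # An attribute is part of the classification result when its probability
        # exceeds its preset threshold, as described above.
        return probs, probs > np.asarray(thresholds)
```

For a single target node, h_k would be the K-order vector representation read from the distributed file system.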
In the embodiments of this specification, subgraph sampling and model inference can be performed in a pipelined manner, so the N-hop neighbor subgraph data of each node does not need to be produced in advance, saving the time otherwise consumed by subgraph sampling. In addition, one-hop neighbor subgraphs are produced and consumed in real time, requiring no extra storage resources. Meanwhile, because the scheme caches intermediate computation results in the distributed file system and does not depend on the MapReduce framework of distributed computing, it can be seamlessly adapted to various graph learning frameworks.
Meanwhile, by caching intermediate results, the scheme avoids the large amount of repeated computation in conventional GNN models, thereby improving performance and scalability. In addition, since the information of multi-hop neighbors is captured by the cached intermediate results, only one-hop neighbors need to be sampled during subgraph sampling, which further reduces the amount of computation.
According to an embodiment of another aspect, a distributed node classification apparatus is also provided. Fig. 4 illustrates a schematic block diagram of a distributed node classification apparatus according to an embodiment, which may be deployed in any device, platform or cluster of devices having computing, processing capabilities. As shown in fig. 4, the apparatus 400 is deployed on any first working device of the plurality of working devices of the distributed system, and includes:
an acquisition unit 401 configured to acquire first sub-graph data of the full graph data;
the vector calculation unit 403 is configured to perform K-round model processing on all nodes in the first sub-graph data by using a K-layer graph neural network to obtain K-order vector representations of each node, where the i-th round model processing includes obtaining i-1-order vector representations of each node and a plurality of neighbor nodes thereof from a distributed file system, inputting the i-1-order vector representations into the i-th layer graph neural network to obtain i-order vector representations of each node, and storing the i-order vector representations of each node in the distributed file system; the distributed file system is shared by the plurality of working devices;
the node classification unit 404 is configured to obtain a K-order vector representation of a target node to be classified from the distributed file system, and input the K-order vector representation into a node classification model to obtain a classification result for the target node.
In some possible embodiments, the apparatus further comprises:
the neighbor sampling unit 402 is configured to determine, for any node in the first sub-graph, N neighbor nodes from all its one-hop neighbor nodes using a graph sampling algorithm.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in any of the above embodiments.
According to an embodiment of yet another aspect, there is also provided a computing device including a memory and a processor, wherein the memory has executable code stored therein, and the processor, when executing the executable code, implements the method described in any of the above embodiments.
In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.
The foregoing describes specific embodiments of the present disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the embodiments is provided to illustrate the general principles of the invention and is not intended to limit the invention to the particular embodiments disclosed; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the invention shall be included within the scope of the invention.

Claims (11)

1. A distributed node classification method, applied to any first working device among a plurality of working devices of a distributed system, and comprising the following steps:
acquiring first sub-graph data of the full graph data;
performing K-round model processing on all nodes in the first sub-graph data by using a K-layer graph neural network to obtain K-order vector representations of all nodes, wherein the ith-round model processing comprises the steps of acquiring i-1-order vector representations of all nodes and a plurality of neighbor nodes thereof from a distributed file system, inputting the i-1-order vector representations into the ith-layer graph neural network to obtain i-order vector representations of all nodes, and storing the i-order vector representations of all nodes into the distributed file system; the distributed file system is shared by the plurality of working devices; the vector representations of each order of each node can be multiplexed when different nodes are calculated;
and obtaining the K-order vector representation of the target node to be classified from the distributed file system, and inputting the K-order vector representation into a node classification model to obtain a classification result of the target node.
2. The method of claim 1, further comprising:
for any node in the first subgraph, a graph sampling algorithm is used to determine N neighbor nodes from all the one-hop neighbor nodes.
3. The method of claim 2, obtaining an i-1 order vector representation of each node and its plurality of neighbor nodes from a distributed file system, comprising:
the i-1 order vector representations of each node and N neighbor nodes thereof are obtained from the distributed file system.
4. The method of claim 2, wherein the graph sampling algorithm comprises: random sampling, uniform sampling, weighted sampling, and type sampling.
5. The method of claim 1, wherein the first sub-graph data is partitioned by a plurality of working devices in a distributed system performing a graph cut algorithm on the full graph data.
6. The method of claim 5, wherein the graph cut algorithm comprises: METIS, distributed neighbor extension algorithm.
7. The method of claim 1, wherein the node classification model is a multi-layer perceptron (MLP).
8. A distributed node classification apparatus deployed on any first working device of a plurality of working devices of a distributed system, comprising:
an acquisition unit configured to acquire first sub-graph data of the full graph data;
the vector calculation unit is configured to perform K-round model processing on all nodes in the first sub-graph data by using a K-layer graph neural network to obtain K-order vector representations of all the nodes, wherein the ith-round model processing comprises the steps of acquiring i-1-order vector representations of all the nodes and a plurality of neighbor nodes thereof from a distributed file system, inputting the i-1-order vector representations into the ith-layer graph neural network to obtain i-order vector representations of all the nodes, and storing the i-order vector representations of all the nodes into the distributed file system; the distributed file system is shared by the plurality of working devices; the vector representations of each order of each node can be multiplexed when different nodes are calculated;
the node classification unit is configured to acquire a K-order vector representation of a target node to be classified from the distributed file system, and input the K-order vector representation into a node classification model to obtain a classification result of the target node.
9. The apparatus of claim 8, further comprising:
and the neighbor sampling unit is configured to determine N neighbor nodes from all the one-hop neighbor nodes by using a graph sampling algorithm for any node in the first subgraph.
10. A computer readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of any of claims 1-7.
11. A computing device comprising a memory and a processor, wherein the memory has executable code stored therein, which when executed by the processor, implements the method of any of claims 1-7.
CN202311483982.XA 2023-11-08 2023-11-08 Distributed node classification method and device Active CN117218459B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311483982.XA CN117218459B (en) 2023-11-08 2023-11-08 Distributed node classification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311483982.XA CN117218459B (en) 2023-11-08 2023-11-08 Distributed node classification method and device

Publications (2)

Publication Number Publication Date
CN117218459A CN117218459A (en) 2023-12-12
CN117218459B true CN117218459B (en) 2024-01-26

Family

ID=89051501

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311483982.XA Active CN117218459B (en) 2023-11-08 2023-11-08 Distributed node classification method and device

Country Status (1)

Country Link
CN (1) CN117218459B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021082681A1 (en) * 2019-10-29 2021-05-06 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of graph neural network
WO2021179838A1 (en) * 2020-03-10 2021-09-16 支付宝(杭州)信息技术有限公司 Prediction method and system based on heterogeneous graph neural network model
CN113867983A (en) * 2021-09-14 2021-12-31 杭州海康威视数字技术股份有限公司 Graph data mining method and device, electronic equipment and machine-readable storage medium
CN116681104A (en) * 2023-05-11 2023-09-01 中国地质大学(武汉) Model building and realizing method of distributed space diagram neural network
CN117235032A (en) * 2023-11-08 2023-12-15 支付宝(杭州)信息技术有限公司 Distributed link prediction method and device
CN117241215A (en) * 2023-06-13 2023-12-15 南京蜘蛛网络科技有限公司 Wireless sensor network distributed node cooperative positioning method based on graph neural network

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021082681A1 (en) * 2019-10-29 2021-05-06 支付宝(杭州)信息技术有限公司 Method and device for multi-party joint training of graph neural network
WO2021179838A1 (en) * 2020-03-10 2021-09-16 支付宝(杭州)信息技术有限公司 Prediction method and system based on heterogeneous graph neural network model
CN113867983A (en) * 2021-09-14 2021-12-31 杭州海康威视数字技术股份有限公司 Graph data mining method and device, electronic equipment and machine-readable storage medium
CN116681104A (en) * 2023-05-11 2023-09-01 中国地质大学(武汉) Model building and realizing method of distributed space diagram neural network
CN117241215A (en) * 2023-06-13 2023-12-15 南京蜘蛛网络科技有限公司 Wireless sensor network distributed node cooperative positioning method based on graph neural network
CN117235032A (en) * 2023-11-08 2023-12-15 支付宝(杭州)信息技术有限公司 Distributed link prediction method and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
Cambricon-G: A Polyvalent Energy-Efficient Accelerator for Dynamic Graph Neural Networks; Xinkai Song et al.; IEEE; entire document *
Dynamic depth-width optimization for capsule graph convolutional network; Wu, SW et al.; Frontiers of Computer Science; entire document *
Optimizing Task Placement and Online Scheduling for Distributed GNN Training Acceleration; Luo, ZY et al.; IEEE; entire document *
图神经网络加速结构综述 (A survey of graph neural network accelerator architectures); Li Han et al.; Journal of Computer Research and Development, Vol. 58, No. 06; entire document *
大规模图神经网络研究综述 (A survey of large-scale graph neural networks); Xiao Guoqing et al.; http://kns.cnki.net/kcms/detail/11.1826.tp.20230817.0856.002.html; entire document *

Also Published As

Publication number Publication date
CN117218459A (en) 2023-12-12

Similar Documents

Publication Publication Date Title
CN111382868B (en) Neural network structure searching method and device
CN113128678A (en) Self-adaptive searching method and device for neural network
CN114915630B (en) Task allocation method, network training method and device based on Internet of Things equipment
CN117235032B (en) Distributed link prediction method and device
CN110287820B (en) Behavior recognition method, device, equipment and medium based on LRCN network
CN111813539B (en) Priority and collaboration-based edge computing resource allocation method
CN115066694A (en) Computation graph optimization
CN113705811A (en) Model training method, device, computer program product and equipment
CN112580789B (en) Training graph coding network, and method and device for predicting interaction event
CN113988464A (en) Network link attribute relation prediction method and equipment based on graph neural network
US20230289618A1 (en) Performing knowledge graph embedding using a prediction model
KR102189811B1 (en) Method and Apparatus for Completing Knowledge Graph Based on Convolutional Learning Using Multi-Hop Neighborhoods
CN109471971B (en) Semantic prefetching method and system for resource cloud storage in education field
CN117218459B (en) Distributed node classification method and device
CN113674152A (en) Image processing method, image processing device, electronic equipment and computer readable storage medium
CN111258968B (en) Enterprise redundant data cleaning method and device and big data platform
WO2023179609A1 (en) Data processing method and apparatus
WO2023143570A1 (en) Connection relationship prediction method and related device
Kaushik et al. Traffic prediction in telecom systems using deep learning
CN117223005A (en) Accelerator, computer system and method
CN115048425A (en) Data screening method and device based on reinforcement learning
CN110188219A (en) Deeply de-redundancy hash algorithm towards image retrieval
CN115809372B (en) Click rate prediction model training method and device based on decoupling invariant learning
CN117875520B (en) Public safety event prediction method and system based on dynamic graph space-time evolution mining
CN117409209B (en) Multi-task perception three-dimensional scene graph element segmentation and relationship reasoning method

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant