CN112287182A - Graph data storage and processing method and device and computer storage medium - Google Patents

Info

Publication number
CN112287182A
Authority
CN
China
Prior art keywords
vertex
data
target
edge
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011192437.1A
Other languages
Chinese (zh)
Other versions
CN112287182B (en)
Inventor
余利峰
陈哲嘉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd filed Critical Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN202011192437.1A priority Critical patent/CN112287182B/en
Publication of CN112287182A publication Critical patent/CN112287182A/en
Application granted granted Critical
Publication of CN112287182B publication Critical patent/CN112287182B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a graph data storage and processing method, a graph data storage and processing device and a computer storage medium, and belongs to the technical field of graph databases. The method comprises the following steps: acquiring a target vertex sequence number of a target vertex stored in a target partition, wherein a plurality of vertexes are stored in the target partition and correspond to the vertex sequence numbers respectively; acquiring the partition number of a partition where another vertex of each edge of one or more edges associated with the target vertex is located, the vertex sequence number of another vertex and the direction of each edge to obtain target edge data corresponding to the target vertex sequence number; and writing the target edge data into an edge data file in the target partition, wherein the edge data file is used for storing the edge data corresponding to each vertex sequence number. The embodiment of the application takes the partition number of the partition where the vertex is located and the vertex sequence number of the vertex in the partition as the identification, so that the memory required by graph calculation is reduced, and the calculation performance of the calculation node is improved.

Description

Graph data storage and processing method and device and computer storage medium
Technical Field
The embodiment of the application relates to the technical field of graph databases, in particular to a graph data storage and processing method, a graph data storage and processing device and a computer storage medium.
Background
Graph databases are non-relational databases that use graphical representations of vertices and edges to characterize entities and relationships between entities. Wherein a vertex in the graph database indicates an entity and an edge between two vertices indicates a relationship between the two entities. An entity may represent an object in real life, and a relationship may represent a connection between different objects. Data stored in a graph database may be referred to as graph data, which includes data for vertices and data for edges. Wherein, the data of the vertex is used for indicating the related information of the entity, and the data of the edge is used for indicating the related information of the relationship. In addition, different storage manners of graph data in a graph database have different influences on subsequent graph data processing, and therefore, how to store the graph data in the graph database is a hot spot of current research.
In the related art, data of vertices and data of edges in graph data are usually stored in a graph database as key-value pairs. The key of a vertex indicates the ID (identity) of the vertex, and the value of the vertex indicates the attributes of the vertex. The key of an edge indicates the IDs of the two vertices at the two ends of the edge, and the value of the edge indicates the attributes of the edge.
After the graph database is organized according to the above manner of storing graph data, when the subsequent computing nodes perform the related computation of the graph database, all graph data in the graph database need to be loaded into the memory of the computing nodes first. In this scenario, if the storage space occupied by the ID of the vertex is large, the storage space occupied by the key value of the vertex itself and the key value of the edge corresponding to the vertex are both large. Therefore, the graph data loaded into the memory occupies a larger memory space, thereby affecting the computing performance of the computing node.
Disclosure of Invention
The embodiment of the application provides a graph data storage method, a graph data processing method, a graph data storage device, a graph data processing device and a computer storage medium, which can improve the computing performance of a computing node. The technical scheme is as follows:
in one aspect, a graph data storage method is provided, and is applied to a target node in a storage system storing graph data, where the target node is any node in the storage system, and the method includes:
acquiring a target vertex sequence number of a target vertex stored in a target partition, wherein the target partition is any storage partition on the target node, a plurality of vertexes are stored in the target partition, the vertexes are respectively corresponding to the vertex sequence numbers, the vertex sequence number of any vertex indicates the sequence of the vertex in the vertexes, and the target vertex is any vertex in the vertexes;
acquiring the partition number of the partition where the other vertex of each edge of the one or more edges associated with the target vertex is located, the vertex sequence number of the other vertex and the direction of each edge to obtain target edge data corresponding to the target vertex sequence number;
and writing the target edge data into an edge data file in the target partition, wherein the edge data file is used for storing edge data corresponding to each vertex sequence number.
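The three steps above can be sketched as follows. This is a minimal illustration only: the byte layout (`EDGE_RECORD`), the direction codes `OUT` and `IN`, and the helper names `build_edge_data` and `write_edge_data` are assumptions made for the example, not details given in the application, and an in-memory buffer stands in for the edge data file.

```python
import io
import struct

# Assumed record layout (not specified by the source):
# partition number (uint32), vertex sequence number (uint32), direction (uint8).
EDGE_RECORD = struct.Struct("<IIB")
OUT, IN = 0, 1  # assumed direction codes


def build_edge_data(edges):
    """Pack all edges associated with one vertex into a single bytes object.

    `edges` is a list of (other_partition, other_vertex_seq, direction) tuples,
    where "other" refers to the vertex at the far end of each edge.
    """
    return b"".join(EDGE_RECORD.pack(p, s, d) for p, s, d in edges)


def write_edge_data(edge_file, edge_data):
    """Append one vertex's edge data to the edge data file; return its length."""
    edge_file.write(edge_data)
    return len(edge_data)


# The target vertex has one outgoing edge to vertex 7 in partition 2
# and one incoming edge from vertex 3 in partition 1.
edge_file = io.BytesIO()
target_edges = [(2, 7, OUT), (1, 3, IN)]
written = write_edge_data(edge_file, build_edge_data(target_edges))
```

Note that no vertex ID appears in the packed record; each neighbour is identified only by its partition number and vertex sequence number.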
Optionally, the edge data in the edge data file are stored in sequence according to the corresponding vertex sequence numbers;
and the target partition is also stored with an edge index file, the edge index file stores the length of each edge data in the edge data file, and the lengths of each edge data stored in the edge index file are sequentially stored according to the corresponding vertex sequence number.
Optionally, the data type of the length of each edge data stored in the edge index file is an integer data type of 32 bytes or an integer data type of 64 bytes.
Optionally, the method further comprises:
and writing the corresponding relation between the target vertex sequence number and the vertex identification of the target vertex into a mapping table in the target partition, wherein the mapping table stores vertex sequence numbers respectively corresponding to the vertex identifications of the plurality of vertexes.
Optionally, the method further comprises:
and writing the vertex data of the target vertex into a vertex data file in the target partition, wherein the vertex data file stores vertex data respectively corresponding to the vertex sequence numbers.
Optionally, the vertex data in the vertex data file are stored in sequence according to the corresponding vertex sequence numbers;
and the target partition is also stored with a vertex index file, the vertex index file stores the length of each vertex data in the vertex data file, and the lengths of the vertex data stored in the vertex index file are sequentially stored according to the vertex sequence number corresponding to the vertex data.
Optionally, the vertex data of the target vertex includes an attribute of the target vertex.
In another aspect, a graph data processing method is provided, which is applied to a computing node in a storage system storing graph data, and the method includes:
determining edge data to be processed in an edge data file stored in a target partition of a target node;
processing graph data according to the edge data to be processed;
the target node is any node in the storage system, the target partition is any storage partition on the target node, the edge data file stores edge data respectively corresponding to the vertex sequence numbers of the multiple vertices stored in the target partition, the edge data includes the partition number of the partition in which the other vertex of each edge of one or more edges associated with the corresponding vertex is located, the vertex sequence number of the other vertex, and the direction of each edge, and the vertex sequence number of any vertex in the target partition indicates the order of that vertex among the multiple vertices.
Optionally, the edge data in the edge data file are sequentially stored according to corresponding vertex sequence numbers, an edge index file is further stored in the target partition, the length of each edge data in the edge data file is stored in the edge index file, and the length of each edge data stored in the edge index file is sequentially stored according to corresponding vertex sequence numbers;
the determining of the to-be-processed edge data in the edge data file stored in the target partition of the target node includes:
and determining the to-be-processed edge data according to the length of each edge data stored in the edge index file.
Optionally, in a case that the graph data processing is iterative processing, the to-be-processed edge data is the current to-be-processed edge data obtained by iterating sequentially according to the length of each edge data.
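A sequential iteration of this kind might look like the sketch below. It assumes vertex sequence numbers start at 1, that the edge index file has already been decoded into a list of lengths, and the same hypothetical 9-byte record layout (`EDGE_RECORD`) as an illustration; none of these details are fixed by the application.

```python
import struct

# Assumed record layout: partition number, vertex sequence number, direction.
EDGE_RECORD = struct.Struct("<IIB")


def unpack_edges(chunk):
    """Decode one vertex's edge data into a list of (partition, seq, direction)."""
    return [
        EDGE_RECORD.unpack_from(chunk, i)
        for i in range(0, len(chunk), EDGE_RECORD.size)
    ]


def iter_edge_data(edge_bytes, lengths):
    """Yield (vertex_seq, edges) by walking the edge data file front to back.

    The running offset is advanced by each length from the edge index file,
    so no per-vertex seek is needed.
    """
    offset = 0
    for seq, length in enumerate(lengths, start=1):
        yield seq, unpack_edges(edge_bytes[offset:offset + length])
        offset += length


# Vertex 1 has one edge, vertex 2 has none, vertex 3 has two.
edge_bytes = (
    EDGE_RECORD.pack(1, 2, 0)
    + EDGE_RECORD.pack(1, 1, 1)
    + EDGE_RECORD.pack(2, 5, 0)
)
lengths = [9, 0, 18]
result = dict(iter_edge_data(edge_bytes, lengths))
```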
Optionally, in a case that the graph data processing is query processing, the to-be-processed edge data is edge data located according to the length of each edge data and a vertex sequence number of a vertex to be queried currently.
Optionally, the computing node further stores a position of edge data corresponding to a reference vertex sequence number in the edge data file, where the reference vertex sequence number is one or more vertex sequence numbers of multiple vertices stored in the target partition;
and the current to-be-processed edge data is positioned according to the length of each edge data, the position of the edge data corresponding to the reference vertex sequence number in the edge data file, and the vertex sequence number of the current to-be-inquired vertex.
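One way to read this is that the computing node keeps a sparse set of known start positions, so a lookup only has to sum the lengths between the nearest reference vertex and the queried vertex. The sketch below assumes 1-based sequence numbers and a reference taken every `step` vertices; the function names and the choice of `step` are illustrative assumptions.

```python
import bisect


def build_reference_positions(lengths, step):
    """Record the start offset of every `step`-th vertex (1-based numbering)."""
    refs = {}
    offset = 0
    for seq, length in enumerate(lengths, start=1):
        if (seq - 1) % step == 0:
            refs[seq] = offset
        offset += length
    return refs


def locate(lengths, refs, target_seq):
    """Return (start, length) of the edge data for `target_seq`.

    Starts from the closest reference vertex at or before the target and
    adds only the lengths of the vertices in between.
    """
    ref_seqs = sorted(refs)
    ref_seq = ref_seqs[bisect.bisect_right(ref_seqs, target_seq) - 1]
    start = refs[ref_seq] + sum(lengths[ref_seq - 1:target_seq - 1])
    return start, lengths[target_seq - 1]


lengths = [9, 0, 18, 27, 9, 9]          # decoded edge index file
refs = build_reference_positions(lengths, step=3)
```

With these inputs, `locate(lengths, refs, 5)` starts from the reference stored for vertex 4 and adds a single length instead of four.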
Optionally, a vertex data file is stored in the target partition, and vertex data corresponding to vertex sequence numbers of the multiple vertices is stored in the vertex data file;
the method further comprises the following steps:
and acquiring vertex data corresponding to the target vertex sequence number from the vertex data file according to the target vertex sequence number to be inquired.
Optionally, the vertex data of each of the plurality of vertices are sequentially stored according to a corresponding vertex sequence number, a vertex index file is further stored in the target partition, the vertex index file stores the length of each vertex data in the vertex data file, and the length of each vertex data stored in the vertex index file is sequentially stored according to the corresponding vertex sequence number;
the obtaining of the vertex data corresponding to the target vertex sequence number from the vertex data file according to the target vertex sequence number includes:
determining the position of the vertex data corresponding to the target vertex sequence number in the vertex data file according to the target vertex sequence number and the vertex index file;
and acquiring the vertex data corresponding to the target vertex sequence number from the vertex data file according to the position of the vertex data corresponding to the target vertex sequence number in the vertex data file.
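The two steps above (determine the position, then read) can be sketched as follows. The sketch assumes 1-based sequence numbers, treats each vertex's data as an opaque byte string, and uses in-memory values in place of the vertex data file and vertex index file; the JSON-like payloads and function names are purely illustrative.

```python
def write_vertex_files(vertex_blobs):
    """Build the vertex data file and vertex index file contents.

    Blobs are concatenated in sequence-number order; the index holds the
    length of each blob in the same order.
    """
    data = b"".join(vertex_blobs)
    index = [len(blob) for blob in vertex_blobs]
    return data, index


def read_vertex_data(data, index, target_seq):
    """Fetch the vertex data for `target_seq` using only the lengths."""
    start = sum(index[:target_seq - 1])        # position from the index file
    return data[start:start + index[target_seq - 1]]


blobs = [b'{"name":"B"}', b'{"name":"C","age":3}', b'{"name":"A"}']
data, index = write_vertex_files(blobs)
```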
Optionally, a mapping table is stored in the target partition, and a plurality of vertex sequence numbers respectively corresponding to the plurality of vertex identifications are stored in the mapping table;
after processing the graph data stored in the target partition according to the edge index file, the method further includes:
determining a vertex sequence number of a vertex obtained after processing the graph data;
and acquiring a vertex identifier corresponding to the determined vertex sequence number according to the mapping table, and taking the vertex identifier as a graph data processing result.
In another aspect, a graph data storage apparatus is provided, which is applied to a target node in a storage system storing graph data, where the target node is any node in the storage system, and the apparatus includes:
an obtaining module, configured to obtain a target vertex sequence number of a target vertex stored in a target partition, where the target partition is any storage partition on the target node, the target partition stores multiple vertices, the multiple vertices respectively correspond to vertex sequence numbers, the vertex sequence number of any vertex indicates a sequence of the vertex in the multiple vertices, and the target vertex is any vertex in the multiple vertices;
the obtaining module is further configured to obtain a partition number of a partition where another vertex of each edge of the one or more edges associated with the target vertex is located, a vertex sequence number of the another vertex, and a direction of each edge, so as to obtain target edge data corresponding to the target vertex sequence number;
and the writing module is used for writing the target edge data into an edge data file in the target partition, and the edge data file is used for storing the edge data corresponding to each vertex sequence number.
Optionally, the edge data in the edge data file are stored in sequence according to the corresponding vertex sequence numbers;
and the target partition is also stored with an edge index file, the edge index file stores the length of each edge data in the edge data file, and the lengths of each edge data stored in the edge index file are sequentially stored according to the corresponding vertex sequence number.
Optionally, the data type of the length of each edge data stored in the edge index file is an integer data type of 32 bytes or an integer data type of 64 bytes.
Optionally,
and the writing module is further configured to write the correspondence between the target vertex sequence number and the vertex identifier of the target vertex into a mapping table in the target partition, where vertex sequence numbers respectively corresponding to the vertex identifiers of the multiple vertices are stored in the mapping table.
Optionally,
and the writing module is also used for writing the vertex data of the target vertex into a vertex data file in the target partition, wherein the vertex data file stores vertex data respectively corresponding to the sequence numbers of the vertexes.
Optionally, the vertex data in the vertex data file are stored in sequence according to the corresponding vertex sequence numbers;
and the target partition is also stored with a vertex index file, the vertex index file stores the length of each vertex data in the vertex data file, and the lengths of the vertex data stored in the vertex index file are sequentially stored according to the vertex sequence number corresponding to the vertex data.
Optionally, the vertex data of the target vertex includes an attribute of the target vertex.
In another aspect, a graph data processing apparatus is provided, which is applied to a compute node in a storage system storing graph data, and includes:
the determining module is used for determining the edge data to be processed in the edge data file stored in the target partition of the target node;
the processing module is used for processing the graph data according to the edge data to be processed;
the target node is any node in the storage system, the target partition is any storage partition on the target node, the edge data file stores edge data respectively corresponding to the vertex sequence numbers of the multiple vertices stored in the target partition, the edge data includes the partition number of the partition in which the other vertex of each edge of one or more edges associated with the corresponding vertex is located, the vertex sequence number of the other vertex, and the direction of each edge, and the vertex sequence number of any vertex in the target partition indicates the order of that vertex among the multiple vertices.
Optionally, the edge data in the edge data file are sequentially stored according to corresponding vertex sequence numbers, an edge index file is further stored in the target partition, the length of each edge data in the edge data file is stored in the edge index file, and the length of each edge data stored in the edge index file is sequentially stored according to corresponding vertex sequence numbers;
the determination module is to:
and determining the to-be-processed edge data according to the length of each edge data stored in the edge index file.
Optionally, in a case that the graph data processing is iterative processing, the to-be-processed edge data is the current to-be-processed edge data obtained by iterating sequentially according to the length of each edge data.
Optionally, when the graph data processing is query processing, the to-be-processed edge data is edge data located according to the length of each edge data and a vertex sequence number of a vertex to be queried currently.
Optionally, the computing node further stores a position of edge data corresponding to a reference vertex sequence number in the edge data file, where the reference vertex sequence number is one or more vertex sequence numbers of multiple vertices stored in the target partition;
and the current to-be-processed edge data is the edge data positioned according to the length of each edge data, the position of the edge data corresponding to the reference vertex sequence number in the edge data file, and the vertex sequence number of the vertex to be inquired currently.
Optionally, a vertex data file is stored in the target partition, and vertex data corresponding to vertex sequence numbers of the multiple vertices is stored in the vertex data file;
the device further comprises:
and the acquisition module is used for acquiring the vertex data corresponding to the target vertex sequence number from the vertex data file according to the target vertex sequence number to be inquired.
Optionally, the vertex data of each of the plurality of vertices are sequentially stored according to a corresponding vertex sequence number, a vertex index file is further stored in the target partition, the vertex index file stores the length of each vertex data in the vertex data file, and the length of each vertex data stored in the vertex index file is sequentially stored according to the corresponding vertex sequence number;
the acquisition module is used for determining the position of the vertex data corresponding to the target vertex sequence number in the vertex data file according to the target vertex sequence number and the vertex index file;
and acquiring the vertex data corresponding to the target vertex sequence number from the vertex data file according to the position of the vertex data corresponding to the target vertex sequence number in the vertex data file.
Optionally, a mapping table is stored in the target partition, and a plurality of vertex sequence numbers respectively corresponding to the plurality of vertex identifications are stored in the mapping table;
the determining module is used for determining the vertex sequence number of the vertex obtained after the graph data is processed;
the device also comprises an acquisition module which is used for acquiring the vertex identification corresponding to the determined vertex sequence number according to the mapping table and taking the vertex identification as the graph data processing result.
In another aspect, a server is provided, including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of any one of the graph data storage methods or the graph data processing methods described above.
In another aspect, a computer-readable storage medium is provided, which stores instructions thereon, and when executed by a processor, implements the steps of any one of the graph data storage method or the graph data processing method described above.
In another aspect, a computer program product is provided, comprising instructions which, when run on a computer, cause the computer to perform the steps of any one of the graph data storage methods or the graph data processing methods described above.
The beneficial effects brought by the technical scheme provided by the embodiment of the application at least comprise:
(1) The vertex and the information of the edges associated with the vertex can be stored in the form of the partition number and the vertex sequence number, so that in subsequent graph calculation, the amount of data processed when working with partition numbers and vertex sequence numbers is smaller than when working with vertex IDs, which improves graph calculation efficiency.
(2) In addition, in the graph database, the edge data of the edge associated with the vertex is only used for storing the partition number where the vertex at the other end is located, the vertex sequence number and the edge direction, and compared with the related art in which the IDs of the vertices at the two ends of the edge need to be stored, the storage method provided by the embodiment of the present application can also save the storage space in the storage system.
(3) When graph calculation is subsequently performed, the ID of the vertex can be replaced by the partition number and the vertex sequence number of the vertex in the related information of the vertex loaded into the memory, so that the amount of data loaded into the memory is reduced, the problem that graph calculation cannot be completed due to memory overflow is avoided, and the calculation performance of the calculation node is improved.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of a storage system provided by an embodiment of the present application;
FIG. 2 is a schematic format diagram of an edge data file according to an embodiment of the present application;
FIG. 3 is a flowchart of a graph data storage method according to an embodiment of the present disclosure;
FIG. 4 is a schematic flow chart of identifier replacement provided by an embodiment of the present application;
FIG. 5 is a flowchart of a graph data processing method according to an embodiment of the present disclosure;
FIG. 6 is a schematic structural diagram of a graph data storage device according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a graph data processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application more clear, the embodiments of the present application will be further described in detail with reference to the accompanying drawings.
Before explaining the embodiments of the present application in detail, an application scenario of the embodiments of the present application is explained.
In order to increase the data processing speed of graph databases, current graph databases are generally graph databases based on distributed storage systems. The distributed storage system comprises a plurality of nodes, and each node is used for storing part of graph data in the graph database. The distributed storage system may also be referred to as a clustered storage environment.
In addition, in order to facilitate management of graph data stored on a node, the storage space of the node is divided into different partitions, and the graph data are then placed in the different partitions. The storage space on any node may include storage space on a storage medium such as a local disk of the node, storage space on other storage devices attached to the node, and virtual storage space configured on the node, which is not specifically limited in this embodiment of the present application.
In addition, there are currently many scenarios in which graph computations need to be performed based on graph databases. Graph computation typically involves determining the distribution of vertices or edges in a graph database. For example, if the number of vertices with an in-degree greater than 2 needs to be counted, graph computation can be performed on the graph database. The in-degree of a vertex is the number of edges pointing to the vertex, and the out-degree of a vertex is the number of edges pointing from the vertex to other vertices.
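As a small illustration of such a computation, the sketch below counts the vertices in one partition whose in-degree is greater than 2. It assumes that each vertex's edge data is available as a list of (partition number, vertex sequence number, direction) tuples and that the direction codes are `OUT = 0` and `IN = 1`; both are assumptions made for the example.

```python
OUT, IN = 0, 1  # assumed direction codes


def degrees(edges):
    """Return (in_degree, out_degree) for one vertex's edge list."""
    in_degree = sum(1 for edge in edges if edge[2] == IN)
    out_degree = sum(1 for edge in edges if edge[2] == OUT)
    return in_degree, out_degree


def count_in_degree_greater_than(partition_edges, threshold):
    """Count vertices whose in-degree exceeds `threshold`."""
    return sum(
        1 for edges in partition_edges.values() if degrees(edges)[0] > threshold
    )


# Edge data keyed by vertex sequence number; each tuple is
# (partition of the other vertex, sequence number of the other vertex, direction).
partition_edges = {
    1: [(1, 2, IN), (1, 3, IN), (2, 1, IN)],
    2: [(1, 1, OUT), (1, 3, IN)],
    3: [(1, 1, OUT), (1, 2, OUT)],
}
answer = count_in_degree_greater_than(partition_edges, 2)
```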
The method provided by the embodiment of the application can be applied to a scene of graph calculation of a graph database. In addition, the method provided by the embodiment of the present application may be applied to the distributed storage system, and certainly may also be applied to a centralized storage system. Wherein the centralized storage system comprises only one node. The embodiments of the present application do not limit the specific category of the storage system.
In order to solve the problem that the storage structure of graph data in the related art causes a large memory occupation when graph computation is performed on a large-scale graph database, an embodiment of the present application provides a storage system in which each vertex is identified by the partition number of the partition where the vertex is located together with the vertex sequence number of the vertex in that partition. The partition number and the vertex sequence number occupy only a small number of bytes. Therefore, compared with loading the ID of the vertex into the memory as the identifier, loading the partition number and the vertex sequence number as the identifier reduces the space that the vertex identifier occupies in the memory, thereby improving the computing performance of the computing node.
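The size difference can be made concrete with a back-of-the-envelope comparison. The figures below rest entirely on assumptions chosen for illustration: a vertex ID that is a 36-character UUID string, and a compact identifier made of a 32-bit partition number and a 32-bit vertex sequence number. The application does not fix either format.

```python
import struct

# Assumption: vertex IDs in the key-value scheme are UUID-style strings.
vertex_id = "123e4567-e89b-12d3-a456-426614174000"
id_size = len(vertex_id.encode("utf-8"))

# Assumption: partition number and vertex sequence number are both uint32.
COMPACT_ID = struct.Struct("<II")
compact_size = COMPACT_ID.size

# An edge key in the key-value scheme carries the IDs of both endpoints,
# whereas one edge entry here carries a compact identifier plus a direction byte.
edge_key_size = 2 * id_size
edge_entry_size = struct.calcsize("<IIB")
```

Under these assumptions a vertex identifier shrinks from 36 bytes to 8, and an edge reference from 72 bytes to 9; the actual savings depend on the real ID format.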
It should be noted that the partition numbers are uniformly configured by the storage system for the partitions on the respective nodes based on the partition policy, so a partition number is enough to determine which partition on which node is meant. Therefore, in the embodiment of the present application, the partition number of the partition in which the vertex is located and the vertex sequence number of the vertex in the partition may be used as the unique identifier of the vertex.
For convenience of description later, the data organization method in the storage system provided by the embodiment of the present application is explained first.
Fig. 1 is a schematic diagram of a storage system according to an embodiment of the present application. As shown in fig. 1, a node (node) i is included in the storage system, where the node i is any one of one or more nodes included in the storage system. The storage system shown in fig. 1 may include one node or may include a plurality of nodes. Only one node is illustrated in fig. 1 as an example. For other nodes in the storage system 100 in fig. 1, the data organization manner on the other nodes may refer to the data organization manner in the node i shown in fig. 1.
As shown in FIG. 1, the storage space on node i includes n partitions. The n partitions are denoted as P1, P2, P3, …, Pn, respectively (FIG. 1 illustrates P1, …, Pn). The data organization in a partition is described below by taking the first partition P1 as an example. The data in the other partitions is organized in substantially the same manner as in P1, except that the stored vertices are different, and is not explained again here.
As shown in FIG. 1, partition P1 includes an edge data file, an edge index file, a mapping table indicating a mapping relationship between vertex IDs and vertex sequence numbers, a vertex data file, and a vertex index file.
It is assumed that the partition P1 stores information about a plurality of vertices, and each of the plurality of vertices corresponds to a vertex sequence number. The vertex sequence numbers may be assigned in the order in which the vertex-related information is imported into the partition. For example, the partition P1 stores 3 vertices, namely vertex A, vertex B, and vertex C, and the information about the 3 vertices is imported into the partition P1 in the order vertex B, vertex C, vertex A. Thus, vertex B has vertex sequence number 1, vertex C has vertex sequence number 2, and vertex A has vertex sequence number 3.
In order to facilitate the subsequent query of vertex IDs, after determining the vertex sequence number of a vertex, the corresponding relationship between the vertex ID and the vertex sequence number of the vertex can be written into a mapping table for indicating the mapping relationship between the vertex ID and the vertex sequence number. The mapping table may be stored in the partition in a manner of a dictionary table, and may also be stored in the partition in other storage manners, which is not limited in this embodiment of the application.
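A dictionary-style mapping table of this kind can be sketched as below, using the import order B, C, A from the example above. The class name `MappingTable` and its methods are hypothetical, and an in-memory structure stands in for whatever on-disk form the partition actually uses.

```python
class MappingTable:
    """Bidirectional mapping between vertex IDs and vertex sequence numbers."""

    def __init__(self):
        self._id_to_seq = {}
        self._seq_to_id = []  # index i holds the ID with sequence number i + 1

    def add(self, vertex_id):
        """Assign the next sequence number on first import; return the number."""
        if vertex_id not in self._id_to_seq:
            self._seq_to_id.append(vertex_id)
            self._id_to_seq[vertex_id] = len(self._seq_to_id)
        return self._id_to_seq[vertex_id]

    def seq_of(self, vertex_id):
        """Vertex ID -> vertex sequence number (used when storing)."""
        return self._id_to_seq[vertex_id]

    def id_of(self, seq):
        """Vertex sequence number -> vertex ID (used when reporting results)."""
        return self._seq_to_id[seq - 1]


table = MappingTable()
for vertex_id in ["B", "C", "A"]:  # import order from the example
    table.add(vertex_id)
```

The reverse lookup `id_of` is what lets a computation run entirely on sequence numbers and translate back to vertex IDs only for the final result.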
The edge data file is used to store edge data for each of the plurality of vertices. The edge data of each vertex comprises a vertex sequence number of another vertex of each edge in one or more edges related to the vertex, a partition number of a partition where the vertex sequence number is located, and the direction of each edge. The edge data of each vertex in the edge data file corresponds to the vertex sequence number of the vertex. Thus, the edge data of a vertex can be queried based on the vertex sequence number of the vertex.
The initial storage position of the edge data of each vertex in the edge data file can be added with a corresponding vertex sequence number, so that the edge data of any vertex can be directly obtained based on the edge data file.
Alternatively, the edge data of each vertex in the edge data file may be stored sequentially according to the corresponding vertex sequence number, and an edge index file may then be configured in the partition P1. The length of each piece of edge data is stored in the edge index file, and these lengths are also stored in sequence according to the corresponding vertex sequence numbers. Therefore, the storage position of the edge data of a given vertex in the edge data file can be quickly located according to the edge index file. For example, when the edge data of the vertex with vertex sequence number 3 needs to be determined, the start position of that edge data in the edge data file can be located according to the first length and the second length stored in the edge index file.
It should be noted that which position in the edge index file corresponds to the first length, and which position corresponds to the second length, may be predetermined. In one possible implementation, if the space used to store each edge data length has the same size, each edge data length stored in the edge index file may be identified according to that fixed size. In another possible implementation, if the spaces storing the respective edge data lengths differ in size, the size of the space occupied by each edge data length may be determined according to the compression algorithm used when the edge data lengths were stored, so that the respective edge data lengths stored in the edge index file can be identified.
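As an illustration of the fixed-size case, the sketch below locates the edge data of a vertex by summing the lengths that precede it in the edge index file. The 4-byte little-endian length encoding, the 1-based numbering, and the function names are assumptions made for this example only.

```python
import struct

LEN_FMT = "<I"                       # assumed: 4-byte little-endian unsigned length
LEN_SIZE = struct.calcsize(LEN_FMT)  # 4 bytes per stored length


def build_index(lengths):
    # Serialize the lengths in vertex-sequence-number order.
    return b"".join(struct.pack(LEN_FMT, n) for n in lengths)


def locate(index_bytes, seq, first_seq=1):
    """Return (start, length) of the edge data of vertex sequence number seq."""
    slot = seq - first_seq
    start = 0
    for i in range(slot):  # sum the lengths of all preceding vertices
        (n,) = struct.unpack_from(LEN_FMT, index_bytes, i * LEN_SIZE)
        start += n
    (length,) = struct.unpack_from(LEN_FMT, index_bytes, slot * LEN_SIZE)
    return start, length


edge_data = b"aaaa" + b"bb" + b"cccccc"  # edge data of vertices 1, 2, 3
index = build_index([4, 2, 6])
start, length = locate(index, 3)         # uses the first and second lengths
record = edge_data[start:start + length]
```

Here the start position of vertex 3 is the sum of the first two lengths, matching the example in the text.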
In addition, in current graph databases, the number of edges of a single vertex usually does not exceed the order of ten million, so the length of each edge data can be stored in the edge index file using a 32-bit integer type (INT type). In this case, the size M of the space occupied by the edge index file in memory can be estimated as follows:
M = S × len(INT)
where S represents the number of vertices stored in a partition, and len(INT) denotes the size of the space occupied by each edge data length, here 32 bits.
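A small worked estimate may help here. The sketch multiplies the number of vertices by the per-entry size; the 4-byte entry size (a 32-bit integer) and the vertex count of 100 million are assumed values chosen only for illustration.

```python
INT_BYTES = 4  # assumed: a 32-bit integer occupies 4 bytes


def index_size_bytes(num_vertices, len_bytes=INT_BYTES):
    # M = S * len(INT): one stored length per vertex in the partition.
    return num_vertices * len_bytes


size = index_size_bytes(100_000_000)  # hypothetical partition with 1e8 vertices
size_mib = size / (1024 ** 2)         # same estimate expressed in MiB
```

Under these assumptions the edge index file of the partition occupies 400,000,000 bytes, which is small enough to keep in memory on a typical compute node.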
It should be noted that, when the length of each edge data is stored in the edge index file using the INT type, the 32-bit INT type allows at most 2^32 values, and thus at most 2^32 edge data lengths can be stored in the edge index file. Thus, at most 2^32 vertices are allowed in the current partition.
If the estimated number of vertices stored in a single partition might be greater than 2^32, the INT type can be replaced by a 64-bit integer type (LONG type), which increases the theoretical maximum number of vertices allowed in the partition. Alternatively, the number of partitions on a node may be increased, thereby reducing the number of vertices stored in each partition.
Therefore, in the embodiment of the present application, the data type used to store the length of each edge data in the edge index file may be a 32-bit integer data type or a 64-bit integer data type. Which one is used may be configured adaptively based on requirements, which is not limited in the embodiment of the present application.
Fig. 2 is a schematic diagram of a data organization method in an edge data file according to an embodiment of the present application. As shown in fig. 2, it is assumed that k vertices with vertex sequence numbers 0, 1, …, and k are stored in the partition. In FIG. 2, the edge symbol (formula image BDA0002753109810000111 in the original) represents the j-th edge of the vertex with vertex sequence number i. The edge data of each edge stored in the edge data file includes the edge direction, the partition number of the partition where the vertex at the other end of the edge is located, and the vertex sequence number of the vertex at the other end.
l_i indicates the length of the edge data of the vertex with vertex sequence number i. As shown in fig. 2, the edge data of each vertex is stored in the edge data file in the direction of the arrows in fig. 2, so that the edge data of the vertices is stored in order of the corresponding vertex sequence numbers. In this case, the lengths of the respective edge data in the edge index file are also stored in order of the corresponding vertex sequence numbers, so that the storage position of the edge data of a given vertex in the edge data file can be located quickly.
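The layout of FIG. 2 can be sketched as follows. The binary record layout (1 byte for the direction, 2 bytes for the partition number, 4 bytes for the vertex sequence number) and the constant and function names are assumptions for illustration; the embodiment does not prescribe field widths.

```python
import struct

# Assumed record layout: direction (1 byte), partition number (2 bytes),
# vertex sequence number of the other endpoint (4 bytes); "<" means no padding.
EDGE_FMT = "<BHI"
EDGE_SIZE = struct.calcsize(EDGE_FMT)
OUT, IN = 0, 1  # hypothetical direction codes


def encode_edges(edges):
    # Concatenate all edge records of one vertex into one piece of edge data.
    return b"".join(struct.pack(EDGE_FMT, d, p, s) for d, p, s in edges)


def decode_edges(blob):
    # Split one piece of edge data back into (direction, partition, seq) tuples.
    return [struct.unpack_from(EDGE_FMT, blob, off)
            for off in range(0, len(blob), EDGE_SIZE)]


per_vertex = [
    [(OUT, 2, 5), (IN, 1, 0)],  # edges of vertex sequence number 0
    [],                         # vertex sequence number 1 has no edges
    [(OUT, 1, 0)],              # edges of vertex sequence number 2
]
blobs = [encode_edges(e) for e in per_vertex]
edge_file = b"".join(blobs)           # edge data in sequence-number order
edge_index = [len(b) for b in blobs]  # l_0, l_1, l_2 in the same order
```

A vertex without edges still contributes a zero length to the index, which keeps the position of every later entry aligned with its vertex sequence number.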
In addition, in performing the graph calculation, if the correlation calculation of the attributes of the vertices is involved, the vertex data file may also be configured in the partition P1. The vertex data file is for storing vertex data for each of the plurality of vertices, the vertex data including attributes of the vertex. Each vertex data in the vertex data file also corresponds to a vertex sequence number for the corresponding vertex, such that the vertex data for the corresponding vertex is determined based on the vertex sequence number. Alternatively, when the correlation calculation of the attributes of the vertices is not involved in the graph calculation process, the vertex data file may be configured, but the vertex data of each vertex in the vertex data file may be empty.
In addition, when the vertex data file is configured, the vertex data may be stored in order of the vertex sequence numbers. In this case, a vertex index file may also be configured. The vertex index file is used for storing the length of each vertex data in the vertex data file. Therefore, the storage position of the vertex data of a given vertex in the vertex data file can be quickly located according to the vertex index file. For the specific implementation, reference may be made to the functions of the edge index file, which are not described in detail herein.
As with the edge index file, which position in the vertex index file stores the vertex data length of which vertex may be predetermined. For the specific implementation, reference may also be made to the functions of the edge index file, which are not described in detail herein.
Based on the storage system shown in fig. 1, the following explains a graph data storage method and a graph data processing method provided in the embodiments of the present application.
Fig. 3 is a flowchart of a graph data storage method according to an embodiment of the present application. The method shown in fig. 3 may be applied to any node in the storage system shown in fig. 1, and the following embodiment takes a target node in the storage system as an example, that is, the target node is any node in the storage system storing graph data. As shown in fig. 3, the method includes the following steps.
Step 301: Acquire a target vertex sequence number of a target vertex stored in a target partition, where the target partition is any storage partition on the target node.
The target partition stores a plurality of vertices, each of which corresponds to a vertex sequence number. The vertex sequence number of any vertex indicates the order of that vertex among the plurality of vertices, and the target vertex is any one of the plurality of vertices.
As can be seen from the storage system shown in fig. 1, the target partition may also store a mapping table. The mapping table stores the vertex sequence numbers corresponding to the respective vertex identifiers, where a vertex identifier is a vertex ID. Therefore, in step 301, the target vertex sequence number of the target vertex stored in the target partition may be obtained as follows: look up the vertex sequence number corresponding to the vertex identifier of the target vertex in the mapping table. If the vertex sequence number can be obtained from the mapping table, the obtained vertex sequence number is used as the target vertex sequence number. If it cannot be obtained from the mapping table, a vertex sequence number is configured for the target vertex.
The vertex sequence number may be configured for the target vertex according to a preset vertex sequence number generation rule. For example, the rule may be that the newly generated vertex sequence number is obtained by adding a reference value to the largest vertex sequence number in the mapping table. The reference value may be 1, 2, etc. It should be noted that the embodiments of the present application do not limit the specific vertex sequence number generation rule.
In addition, when the target vertex sequence number is a newly configured vertex sequence number, the target node may write the correspondence between the target vertex sequence number and the vertex identifier of the target vertex into the mapping table, so as to complete the update of the mapping table.
Furthermore, in the case where there is no mapping table in the target partition, the target node may obtain the vertex sequence number of the target vertex based on other information. For example, the vertex sequence number of the target vertex is generated automatically based on a vertex storage log recorded in the target partition, where the vertex storage log records the time at which each vertex stored in the target partition was stored to the target partition, so that vertex sequence numbers can be configured for the vertices in the order in which they were stored in the target partition.
Step 302: Acquire, for each of the one or more edges associated with the target vertex, the partition number of the partition where the other vertex of the edge is located, the vertex sequence number of the other vertex, and the direction of the edge, to obtain target edge data corresponding to the target vertex sequence number.
In step 302, each of the one or more edges associated with the target vertex refers to all the edges associated with the target vertex, so as to ensure the integrity of the edge data.
The edge associated with the target vertex may be an edge pointing to the target vertex, an edge pointing from the target vertex to another vertex, or both. When only one of these two kinds of edges is treated as associated with the target vertex, the edges associated with the other vertices are determined according to the same uniform rule. For example, every piece of edge data in the target partition stores information about the edges pointing to the corresponding vertex. Alternatively, every piece of edge data in the target partition stores information about the edges pointing from the corresponding vertex to other vertices.
In the embodiment of the present application, in order to improve the efficiency of the subsequent graph calculation process, the edges associated with the target vertex may include both the edges pointing to the target vertex and the edges pointing from the target vertex to other vertices. Thus, after the vertex sequence number of a vertex is obtained, the information about all the edges associated with the vertex can be obtained based on the edge data corresponding to the vertex sequence number in the edge data file, without traversing the edge data of other vertices.
Step 303: Write the target edge data into an edge data file in the target partition, where the edge data file is used for storing the edge data corresponding to each vertex sequence number.
After the target edge data is obtained, it can be written into the edge data file so that graph calculation can be performed based on the edge data file.
In one possible implementation, the target edge data is stored to the edge data file according to the position of the target vertex sequence number among the vertex sequence numbers of all vertices stored by the target partition. Therefore, each piece of edge data in the edge data file is stored in sequence according to the corresponding vertex sequence number.
In this scenario, because the edge index file is also stored in the target partition, and the length of each piece of edge data in the edge data file is stored in the edge index file, the lengths of each piece of edge data stored in the edge index file are sequentially stored according to the corresponding vertex sequence number. Therefore, after the target edge data is written into the edge data file, the length of the target edge data needs to be stored in the edge index file according to the sequence of the target vertex sequence number in the vertex sequence numbers of all the vertices stored in the target partition.
In another possible implementation manner, if the edge index file is not stored in the target partition, in this scenario, when the target edge data is written into the edge data file, an identifier may be further added to the start position or the end position of the storage of the target edge data in the edge data file. The identifier is used to indicate that the currently written target edge data is the edge data corresponding to the target vertex sequence number, so that the edge index file does not need to be configured in the target partition.
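The identifier-based alternative can be sketched as follows. The embodiment only states that an identifier marks which vertex sequence number a piece of edge data belongs to; the header used here (vertex sequence number followed by the payload length) is an assumption added so that record boundaries can be found without an index file, and the function names are hypothetical.

```python
import struct

HDR_FMT = "<II"  # assumed header: vertex sequence number, payload length
HDR_SIZE = struct.calcsize(HDR_FMT)


def append_record(buf, seq, payload):
    # Write the identifier at the start position, then the edge data itself.
    buf.extend(struct.pack(HDR_FMT, seq, len(payload)))
    buf.extend(payload)


def find_record(buf, seq):
    # Scan the file record by record until the identifier matches.
    off = 0
    while off < len(buf):
        cur, n = struct.unpack_from(HDR_FMT, buf, off)
        off += HDR_SIZE
        if cur == seq:
            return bytes(buf[off:off + n])
        off += n
    return None


edge_file = bytearray()
append_record(edge_file, 1, b"edges-of-1")
append_record(edge_file, 2, b"")
append_record(edge_file, 3, b"edges-of-3")
```

The trade-off in this sketch is that each record carries 8 extra bytes and a lookup is a linear scan, whereas the index-file variant keeps the data file compact and moves the bookkeeping into a separate file.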
In addition, as shown in fig. 1, alternatively, in the case where the graph calculation may involve processing of attributes of vertices, a vertex data file may also be configured in the target partition. The vertex data file stores vertex data corresponding to the vertex sequence numbers. In this scenario, the target node may also write vertex data for the target vertices to a vertex data file.
The vertex data of the target vertex may include an attribute of the target vertex, or may be a null value.
In addition, the vertex data in the vertex data file can be stored in sequence according to the corresponding vertex sequence numbers. In this scenario, the target partition further stores a vertex index file, the vertex index file stores the lengths of the vertex data in the vertex data file, and these lengths are stored in sequence according to the vertex sequence numbers corresponding to the vertex data. The vertex index file makes it possible to subsequently locate the vertex data of a given vertex in the vertex data file quickly.
In such a scenario, after the vertex data of the target vertex is written into the vertex data file, the length of the vertex data of the target vertex needs to be written into the vertex index file. The specific implementation manner may refer to the aforementioned writing of the target edge data into the edge data file and the writing of the length of the target edge data into the edge index file, which is not described herein again.
As with the edge data file, in the embodiment of the present application, the vertex index file may not be configured for the vertex data file. In this case, when the vertex data of the target vertex is written into the vertex data file, an identifier may be added at the start position or the end position at which the vertex data of the target vertex is stored in the vertex data file. The identifier is used to indicate that the currently written vertex data is the vertex data corresponding to the target vertex sequence number, so that the vertex index file does not need to be configured in the target partition.
The foregoing steps 301 to 303 explain how to update the corresponding edge data file, edge index file, vertex data file, and vertex index file for a vertex. Alternatively, the foregoing steps 301 to 303 may also be applied in a scenario in which an edge data file, an edge index file, a vertex data file, and a vertex index file are generated based on a graph database that has already been constructed in the related art. In this scenario, the generation of the data in these files can be illustrated by the following steps.
1. Suppose that the current partition Ps needs to generate the files described above. First, all the original edge data e<id_s, id_t> of all the vertices stored in the current partition Ps is traversed in parallel by means of distributed computing. Here e represents an edge, id_s is the ID of one vertex of the edge, and id_t is the ID of the other vertex of the edge. id_s is taken to be a vertex stored in the current partition Ps.
2. The current partition Ps determines, according to the data partition strategy, the partition Pt where the vertex id_t is located, and queries the vertex sequence number of the vertex id_t within the partition Pt.
3. After receiving the request to query the vertex sequence number of id_t, the partition Pt first queries the stored mapping table map<id_i, Nt> of vertex IDs and vertex sequence numbers. If id_t is found, the corresponding vertex sequence number is returned directly to the current partition Ps. Otherwise, the vertex sequence number Nt = Nmax + 1 (where Nmax is the maximum vertex sequence number already allocated in the partition Pt) is allocated to the vertex id_t, <id_t, Nt> is added to the mapping table, and the vertex sequence number allocated to id_t is returned to the current partition Ps.
4. id_t in e<id_s, id_t> is replaced according to the queried Nt and Pt. Then step 3 is repeated to query the vertex sequence number Ns of id_s in the partition Ps, and id_s in e<id_s, id_t> is replaced with Ns. The resulting e<Ns, Pt+Nt> and the direction of the edge e are stored in the edge log file.
5. When the IDs of all edges have been replaced, the most recently updated mapping table is stored into the current partition Ps. The records in the edge log file are then sorted according to Ns, and the edges of the same vertex are merged together to obtain the edge data of that vertex. The edge data of each vertex is then written into the edge data file in the order of the vertex sequence numbers, and the length of the merged edge data is written into the edge index file in the order of the vertex sequence numbers.
6. If the subsequent graph calculation process involves vertex attributes, the attributes corresponding to a vertex can be written into the vertex data file as vertex data after the vertex is assigned a sequence number, and the length of the vertex data is written into the vertex index file.
The replacement of e<id_s, id_t> by e<Ns, Pt+Nt> in steps 1 to 4 above is further illustrated by fig. 4. The detailed process in fig. 4 is not described herein.
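Steps 1 to 5 can be sketched in a single process as follows. This is a simplified illustration under several assumptions: vertex IDs are integers, the data partition strategy is the ID modulo the number of partitions, only outgoing edges are recorded, numbering starts at 1, and lengths are counted in records rather than bytes. All function and variable names are hypothetical.

```python
from collections import defaultdict

NUM_PARTITIONS = 2
mappings = defaultdict(dict)  # partition number -> {vertex ID: sequence number}


def partition_of(vertex_id):
    # Hypothetical data partition strategy: integer ID modulo partition count.
    return vertex_id % NUM_PARTITIONS


def seq_of(partition, vertex_id):
    # Step 3: return the known sequence number, else allocate Nmax + 1.
    table = mappings[partition]
    if vertex_id not in table:
        table[vertex_id] = max(table.values(), default=0) + 1
    return table[vertex_id]


def build_partition(ps, raw_edges):
    edge_log = []
    for id_s, id_t in raw_edges:              # step 1: traverse e<id_s, id_t>
        pt = partition_of(id_t)               # step 2: partition of id_t
        nt = seq_of(pt, id_t)                 # step 3: sequence number of id_t
        ns = seq_of(ps, id_s)                 # step 4: sequence number of id_s
        edge_log.append((ns, "out", pt, nt))  # e<Ns, Pt+Nt> plus direction
    edge_log.sort(key=lambda r: r[0])         # step 5: sort the log by Ns
    merged = defaultdict(list)
    for ns, direction, pt, nt in edge_log:    # merge the edges of each vertex
        merged[ns].append((direction, pt, nt))
    count = len(mappings[ps])
    edge_data = [merged.get(n, []) for n in range(1, count + 1)]
    edge_index = [len(e) for e in edge_data]  # lengths in records (assumed)
    return edge_data, edge_index


edge_data, edge_index = build_partition(0, [(10, 11), (12, 10), (10, 12)])
```

In this run, vertices 10 and 12 in partition 0 receive sequence numbers 1 and 2, and vertex 11 in partition 1 receives sequence number 1, so the three original edges are stored without any vertex ID.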
Through the graph data storage method shown in fig. 3, the embodiment of the present application can achieve at least the following technical effects:
(1) The vertex and the information of the vertex-associated edge can be stored in the form of the partition number and the vertex sequence number, so that when graph calculation is performed subsequently, compared with graph calculation based on the vertex ID, the data volume processed by the graph calculation based on the partition number and the vertex sequence number of the vertex is less, and the graph calculation efficiency is improved.
(2) In addition, in the graph database, the edge data of the edge associated with the vertex is only used for storing the partition number where the vertex at the other end is located, the vertex sequence number and the edge direction, and compared with the related art in which the IDs of the vertices at the two ends of the edge need to be stored, the storage method provided by the embodiment of the present application can also save the storage space in the storage system.
(3) When graph calculation is subsequently performed, the vertex ID can be replaced by the partition number and the vertex sequence number of the vertex in the related information of the vertex loaded into the memory, so that the data amount loaded into the memory is reduced, and the problem that graph calculation cannot be completed due to memory overflow is avoided.
Based on the storage system shown in fig. 1 and the graph data storage method shown in fig. 3, the embodiment of the present application further provides a graph data processing method for explaining how to perform graph data processing in a graph calculation process. Fig. 5 is a flowchart of a graph data processing method according to an embodiment of the present application. The method is applied to a computing node, and it should be noted that the computing node may be a node in the storage system shown in fig. 1, and in this scenario, the computing node is the same as a target node in the following embodiments. Alternatively, the computing node may be a separate node for graph computation, which is distributed with the nodes of the storage system. As shown in fig. 5, the method includes the following steps:
Step 501: Determine the edge data to be processed in the edge data file stored in the target partition of the target node.
In this embodiment of the present application, the edge index file stored in the target partition of the target node may be loaded into the memory of the compute node in advance. Based on the function of the edge index file, the edge data of each vertex in the edge data file in the target partition on the target node can be sequentially inquired based on the edge index file loaded in the memory until the edge data to be processed is inquired.
That is, under the condition that the edge index file is cached in the memory, the edge data to be processed may be determined according to the length of each edge data stored in the edge index file.
In addition, the graph data processing involved in the current graph computation process typically includes two types: graph data processing for vertices and graph data processing for edges. The following description takes graph data processing for edges as an example. Graph data processing for edges can be divided into two types: one is iterating over all the edge data to compute certain metrics, and the other is querying the edge data of a specified vertex.
Therefore, when the graph data processing is iterative processing, the edge data to be processed is the current edge data reached by iterating sequentially according to the length of each edge data. In this scenario, both the edge data and the entries in the edge index file are read sequentially during the iteration, and the designed storage structure is small, so the iteration performance is good.
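The iterative case can be sketched as a single sequential pass that reads each piece of edge data using the lengths from the edge index file. The metric computed here (the out-degree of each vertex), the record layout, and the direction codes are assumptions chosen for illustration.

```python
import io
import struct

EDGE_FMT = "<BHI"  # assumed layout: direction, partition number, sequence number
EDGE_SIZE = struct.calcsize(EDGE_FMT)
OUT, IN = 0, 1     # hypothetical direction codes


def out_degrees(edge_stream, lengths):
    # Read the edge data file front to back, one vertex at a time.
    degrees = []
    for length in lengths:
        blob = edge_stream.read(length)
        out = 0
        for off in range(0, length, EDGE_SIZE):
            direction, _partition, _seq = struct.unpack_from(EDGE_FMT, blob, off)
            if direction == OUT:
                out += 1
        degrees.append(out)
    return degrees


records = [
    [(OUT, 0, 1), (IN, 1, 4)],
    [],
    [(OUT, 1, 4), (OUT, 0, 0), (IN, 0, 1)],
]
blobs = [b"".join(struct.pack(EDGE_FMT, *r) for r in rs) for rs in records]
degrees = out_degrees(io.BytesIO(b"".join(blobs)), [len(b) for b in blobs])
```

Because both the index and the data are consumed strictly in order, the pass needs no seeking, which is the property the paragraph above relies on.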
When the graph data processing is query processing, the edge data to be processed is the edge data located according to the length of each edge data and the vertex sequence number of the vertex currently being queried. The start position Pos_n of the edge data of the vertex in the edge data file is calculated from the vertex sequence number n and the edge index file, and can be represented by the following formula:
Pos_n = \sum_{i=0}^{n-1} l_i
where l_i is the length of the edge data corresponding to vertex sequence number i, for each vertex sequence number i that precedes n.
In addition, because each query would otherwise repeat the calculation from the start of the edge index file, the position information of the edge data at some intermediate positions can be cached in memory during the first iteration. For example, the position information of the edge data whose vertex sequence number is an integer multiple of 1000 is cached in memory. Therefore, when the edge data of a vertex sequence number near one of these positions is queried later, only a small number of entries in the edge index file need to be summed, starting from the cached data.
That is, the positions of the edge data corresponding to the reference vertex sequence numbers in the edge data file are also stored in the computing node, and the reference vertex sequence numbers are one or more vertex sequence numbers in the vertex sequence numbers of the multiple vertices stored in the target partition. In this case, the current edge data to be processed is the edge data located according to the length of each edge data, the position of the edge data corresponding to the reference vertex sequence number in the edge data file, and the vertex sequence number of the vertex to be queried currently.
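The reference-position cache can be sketched as follows. The sketch assumes 0-based vertex sequence numbers, consistent with the summation from i = 0, and a reference stride of 1000 as in the example above; the class name `EdgeLocator` is hypothetical.

```python
class EdgeLocator:
    """Hypothetical helper that caches the start position of every
    stride-th vertex so a query sums at most stride - 1 lengths."""

    def __init__(self, lengths, stride=1000):
        self.lengths = lengths
        self.stride = stride
        self.checkpoints = []
        pos = 0
        for i, n in enumerate(lengths):  # the first full pass over the index
            if i % stride == 0:
                self.checkpoints.append(pos)
            pos += n

    def start_of(self, n):
        # Start from the nearest cached reference position at or before n,
        # then add the lengths between that reference and n.
        base = n // self.stride
        pos = self.checkpoints[base]
        for i in range(base * self.stride, n):
            pos += self.lengths[i]
        return pos


lengths = [(i % 7) + 1 for i in range(5000)]  # synthetic edge data lengths
locator = EdgeLocator(lengths)
start = locator.start_of(2503)                # sums only 503 entries
```

A smaller stride shortens each query at the cost of more cached positions; the value 1000 is simply the one used in the example in the text.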
Optionally, both the edge index file and the edge data file stored in the target partition of the target node may be loaded into the memory of the compute node in advance. In this scenario, the query for the edge data to be processed can be performed entirely in the memory of the compute node, which improves the efficiency of determining the edge data to be processed.
In addition, when the edge index file is not configured in the target partition, when the graph data is processed for the first time and the edge data of the edge data file in the target partition is sequentially traversed, the edge index file for the edge data of the edge data file in the target partition can be generated in the memory. When the graph data is processed again subsequently, the to-be-processed edge data is positioned based on the edge index file instead of traversing each edge data of the edge data file in the target partition again and then positioning the to-be-processed edge data, so that the computing performance of the computing node is improved.
Step 502: Perform graph data processing on the edge data to be processed.
The specific operation of graph data processing on the edge data to be processed in step 502 depends on the function that needs to be implemented in the current graph calculation. The embodiment of the present application does not limit the specific implementation of performing graph data processing on the edge data to be processed.
In addition, step 501 and step 502 are described by taking edge data as an example. Optionally, in this embodiment of the present application, the graph data processing may further include an iterative processing or a query processing on the vertex data.
In one possible implementation, a vertex data file is stored in the target partition, and the vertex data corresponding to the vertex sequence number of each of the plurality of vertices is stored in the vertex data file. In this scenario, the vertex data corresponding to the target vertex sequence number can be obtained from the vertex data file according to the target vertex sequence number to be queried.
In addition, the vertex data of the plurality of vertices is stored in sequence according to the corresponding vertex sequence numbers, a vertex index file is further stored in the target partition, the lengths of the vertex data in the vertex data file are stored in the vertex index file, and these lengths are stored in sequence according to the corresponding vertex sequence numbers. In this scenario, the specific process of obtaining the vertex data corresponding to the target vertex sequence number from the vertex data file may be as follows: determine the position of the vertex data corresponding to the target vertex sequence number in the vertex data file according to the target vertex sequence number and the vertex index file; then obtain the vertex data corresponding to the target vertex sequence number from the vertex data file according to that position.
In addition, the target partition stores a mapping table, and the mapping table stores a plurality of vertex sequence numbers respectively corresponding to a plurality of vertex identifiers. In this scenario, after the graph data is processed, the vertex sequence number of each vertex obtained by the graph data processing can be determined, the vertex identifier corresponding to the determined vertex sequence number is obtained according to the mapping table, and the vertex identifier, that is, the vertex ID, is used as the graph data processing result. The graph data processing result is then displayed to a user. That is, the vertex ID is used as the identifier of a vertex when results are presented externally, while the partition number of the partition where the vertex is located and the vertex sequence number are used as the identifier of the vertex inside the storage system. Therefore, the calculation performance of the compute node and the storage performance of the storage system can be improved while the external display of vertices is preserved, so that a user can readily analyze the graph data processing result in the next step.
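Translating a processing result back to vertex IDs can be sketched as a reverse lookup in the mapping tables. The vertex IDs, the partition numbers, and the function names below are hypothetical values chosen only for illustration.

```python
def invert(mapping):
    # Turn {vertex ID: sequence number} into {sequence number: vertex ID}.
    return {seq: vid for vid, seq in mapping.items()}


# Hypothetical mapping tables of two partitions.
mappings = {
    0: {"camera-17": 1, "person-42": 2},
    1: {"vehicle-9": 1},
}
reverse = {p: invert(m) for p, m in mappings.items()}


def to_vertex_ids(results):
    # results holds (partition number, vertex sequence number) pairs
    # produced by the graph computation.
    return [reverse[p][seq] for p, seq in results]


result_ids = to_vertex_ids([(0, 2), (1, 1)])
```

The computation itself only ever handles the compact (partition number, sequence number) pairs; the vertex IDs are restored once, at the point where results are shown to the user.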
Through the graph data processing method shown in fig. 5, the embodiment of the present application can achieve at least the following technical effects:
(1) The vertex and the information of the vertex-associated edge can be stored in the form of the partition number and the vertex sequence number, so that when graph calculation is performed subsequently, compared with graph calculation based on the vertex ID, the data volume processed by the graph calculation based on the partition number and the vertex sequence number of the vertex is less, and the graph calculation efficiency is improved.
(2) When graph calculation is subsequently performed, the vertex ID can be replaced by the partition number and the vertex sequence number of the vertex in the related information of the vertex loaded into the memory, so that the data amount loaded into the memory is reduced, and the problem that graph calculation cannot be completed due to memory overflow is avoided.
All the above optional technical solutions can be combined arbitrarily to form an optional embodiment of the present application, and the present application embodiment is not described in detail again.
Fig. 6 is a schematic structural diagram of a graph data storage device according to an embodiment of the present application. The device is applied to a target node in a storage system storing graph data, and the target node is any node in the storage system. As shown in fig. 6, the apparatus 600 includes:
an obtaining module 601, configured to obtain a target vertex sequence number of a target vertex stored in a target partition, where the target partition is any storage partition on a target node, the target partition stores multiple vertices, the multiple vertices respectively correspond to the vertex sequence numbers, the vertex sequence number of any vertex indicates a sequence of any vertex in the multiple vertices, and the target vertex is any vertex in the multiple vertices;
the obtaining module 601 is further configured to obtain a partition number of a partition where another vertex of each edge in one or more edges associated with the target vertex is located, a vertex sequence number of another vertex, and a direction of each edge, so as to obtain target edge data corresponding to the target vertex sequence number;
a writing module 602, configured to write the target edge data into an edge data file in the target partition, where the edge data file is used to store edge data corresponding to each vertex sequence number.
Optionally, the edge data in the edge data file are stored in sequence according to the corresponding vertex sequence numbers;
and the target partition is also stored with an edge index file, the length of each edge data in the edge data file is stored in the edge index file, and the lengths of each edge data stored in the edge index file are sequentially stored according to the corresponding vertex sequence number.
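The layout described above (an edge data file ordered by vertex sequence number, plus an edge index file holding one length per vertex) can be sketched as follows. This is only an illustrative sketch: the record layout `EDGE` (4-byte partition number, 4-byte vertex sequence number, 1-byte direction), the 4-byte length entries, and the function name `write_partition_edges` are assumptions, not details taken from the embodiment.

```python
import io
import struct

# Hypothetical fixed-width record: partition number, vertex sequence number, direction.
EDGE = struct.Struct("<IIB")
# Hypothetical index entry: one unsigned integer length per vertex.
LENGTH = struct.Struct("<I")

def write_partition_edges(adjacency, edge_file, index_file):
    """adjacency[i] lists the (partition, seq, direction) edges of the vertex
    whose sequence number is i; records are appended in sequence-number order."""
    for edges in adjacency:
        blob = b"".join(EDGE.pack(p, s, d) for p, s, d in edges)
        edge_file.write(blob)                     # edge data for this vertex
        index_file.write(LENGTH.pack(len(blob)))  # its length, in the same order

edge_file, index_file = io.BytesIO(), io.BytesIO()
adjacency = [
    [(0, 1, 0), (2, 5, 1)],  # vertex 0: two edges
    [],                      # vertex 1: no edges
    [(1, 3, 0)],             # vertex 2: one edge
]
write_partition_edges(adjacency, edge_file, index_file)
```

Because both files follow the same order, the sequence number itself never needs to be stored: it is implied by the position of the entry in the edge index file.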
Optionally, the data type of the length of each edge data stored in the edge index file is an integer data type of 32 bytes or an integer data type of 64 bytes.
Optionally, the writing module is further configured to write the correspondence between the target vertex sequence number and the vertex identification of the target vertex into a mapping table in the target partition, where the mapping table stores the vertex sequence numbers respectively corresponding to the vertex identifications of the plurality of vertices.
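A minimal sketch of such a mapping table is shown below. It assumes that sequence numbers start at 0 and are assigned in the order in which vertices are first seen within the partition; the function name `build_mapping_table` and the sample vertex identifications are hypothetical.

```python
def build_mapping_table(vertex_ids):
    """Assign each distinct vertex identification the next free sequence number."""
    table = {}
    for vid in vertex_ids:
        if vid not in table:
            table[vid] = len(table)  # next sequence number in this partition
    return table

mapping = build_mapping_table(
    ["camera-17", "person-a91f", "vehicle-03", "person-a91f"]
)
```

Here the repeated identification `person-a91f` keeps the sequence number it was given the first time.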
Optionally, the writing module is further configured to write the vertex data of the target vertex into a vertex data file in the target partition, where the vertex data file stores the vertex data respectively corresponding to each vertex sequence number.
Optionally, the vertex data in the vertex data file are stored in sequence according to the corresponding vertex sequence numbers;
and the target partition is also stored with a vertex index file, the vertex index file stores the length of each vertex data in the vertex data file, and the lengths of the vertex data stored in the vertex index file are sequentially stored according to the vertex sequence number corresponding to the vertex data.
Optionally, the vertex data of the target vertex includes attributes of the target vertex.
Through the device shown in fig. 6, the embodiment of the present application can achieve at least the following technical effects:
(1) the vertex and the information of the vertex-associated edge can be stored in the form of the partition number and the vertex sequence number, so that when graph calculation is performed subsequently, compared with graph calculation based on the vertex ID, the data volume processed by the graph calculation based on the partition number and the vertex sequence number of the vertex is less, and the graph calculation efficiency is improved.
(2) In addition, in the graph database, the edge data of the edge associated with the vertex is only used for storing the partition number where the vertex at the other end is located, the vertex sequence number and the edge direction, and compared with the related art in which the IDs of the vertices at the two ends of the edge need to be stored, the storage method provided by the embodiment of the present application can also save the storage space in the storage system.
(3) When graph calculation is subsequently performed, the vertex ID can be replaced by the partition number and the vertex sequence number of the vertex in the related information of the vertex loaded into the memory, so that the data amount loaded into the memory is reduced, and the problem that graph calculation cannot be completed due to memory overflow is avoided.
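The storage saving mentioned in effect (2) can be illustrated with simple arithmetic. The figures below are assumptions chosen only for illustration: a 32-byte vertex ID, a 4-byte partition number, a 4-byte vertex sequence number, and a 1-byte direction. None of these widths is specified by the embodiment.

```python
import struct

ID_BYTES = 32  # assumed width of a vertex ID; not specified in the source

# Related-art style record: IDs of both end vertices plus a direction byte.
id_based = struct.Struct(f"<{ID_BYTES}s{ID_BYTES}sB")
# Record described above: partition number, vertex sequence number, direction.
seq_based = struct.Struct("<IIB")

edges = 1_000_000  # illustrative edge count
id_based_total = id_based.size * edges
seq_based_total = seq_based.size * edges
print(id_based.size, seq_based.size)
```

Under these assumptions each edge record shrinks from 65 bytes to 9 bytes; the actual ratio depends on how long the vertex IDs are in a given deployment.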
It should be noted that: in the graph data storage device provided in the above embodiment, when the graph data is stored, only the division of the above functional modules is taken as an example, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the apparatus is divided into different functional modules, so as to complete all or part of the above described functions. In addition, the graph data storage device provided by the above embodiment and the graph data storage method embodiment belong to the same concept, and specific implementation processes thereof are detailed in the method embodiment and are not described herein again.
Fig. 7 is a schematic structural diagram of a graph data processing apparatus according to an embodiment of the present application. The apparatus is applied to a compute node in a storage system storing graph data. As shown in fig. 7, the apparatus 700 includes:
a determining module 701, configured to determine to-be-processed edge data in an edge data file stored in a target partition of a target node;
a processing module 702, configured to perform graph data processing according to the edge data to be processed;
the target node is any node in a storage system which stores graph data, the target partition is any storage partition on the target node, the edge data files store edge data which respectively correspond to vertex sequence numbers of a plurality of vertexes stored in the target partition, the edge data comprise partition numbers of partitions in which another vertex of each edge in one or more edges related to the corresponding vertexes is located, vertex sequence numbers of the other vertexes and directions of the edges, and the vertex sequence number of any vertex in the target partition indicates the sequence of any vertex in the vertexes.
Optionally, the edge data in the edge data file are sequentially stored according to corresponding vertex sequence numbers, an edge index file is further stored in the target partition, the length of each edge data in the edge data file is stored in the edge index file, and the length of each edge data stored in the edge index file is sequentially stored according to corresponding vertex sequence numbers;
the determining module is configured to:
determine the edge data to be processed according to the length of each edge data stored in the edge index file.
Optionally, when the graph data processing is iterative processing, the edge data to be processed is the current to-be-processed edge data obtained by iterating sequentially according to the length of each edge data.
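For the iterative case, the lengths in the edge index file are enough to walk the edge data file from start to end. The sketch below assumes the same hypothetical layout as elsewhere in these examples (9-byte edge records, 4-byte length entries); `iter_edge_data` is an illustrative name.

```python
import struct

EDGE = struct.Struct("<IIB")   # hypothetical: partition, vertex sequence number, direction
LENGTH = struct.Struct("<I")   # hypothetical: one length entry per vertex

def iter_edge_data(edge_bytes, index_bytes):
    """Yield (vertex_seq, edges) in sequence-number order."""
    offset = 0
    for seq, (length,) in enumerate(LENGTH.iter_unpack(index_bytes)):
        chunk = edge_bytes[offset:offset + length]
        # Decode every fixed-width record inside this vertex's chunk.
        yield seq, [EDGE.unpack_from(chunk, i) for i in range(0, length, EDGE.size)]
        offset += length  # the next vertex starts where this one ends

index_bytes = b"".join(LENGTH.pack(n) for n in (18, 0, 9))
edge_bytes = EDGE.pack(0, 1, 0) + EDGE.pack(2, 5, 1) + EDGE.pack(1, 3, 0)
result = list(iter_edge_data(edge_bytes, index_bytes))
```

The running `offset` is the only state carried between vertices, so the file can be consumed as a stream without loading it whole.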
Optionally, in a case that the graph data processing is query processing, the to-be-processed edge data is edge data located according to a length of each edge data and a vertex sequence number of a vertex to be queried currently.
Optionally, the position of the edge data corresponding to the reference vertex sequence number in the edge data file is also stored in the computing node, and the reference vertex sequence number is one or more vertex sequence numbers in the vertex sequence numbers of the multiple vertices stored in the target partition;
the current edge data to be processed is the edge data located according to the length of each edge data, the position in the edge data file of the edge data corresponding to the reference vertex sequence number, and the vertex sequence number of the vertex currently to be queried.
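One way to read the query case: the compute node keeps the known file positions of a few reference vertices, and the position of any other vertex is found by summing lengths from the nearest preceding reference. The sketch below follows that reading; the function name `locate_edge_data`, the dictionary `reference_positions`, and the 4-byte length entries are assumptions.

```python
import struct

LENGTH = struct.Struct("<I")  # hypothetical: one length entry per vertex

def locate_edge_data(index_bytes, target_seq, reference_positions):
    """Return (offset, length) of the edge data for target_seq.

    reference_positions maps a reference vertex sequence number to the
    offset of its edge data in the edge data file.
    """
    # Start from the closest reference at or before the target, else from vertex 0.
    candidates = [s for s in reference_positions if s <= target_seq]
    start_seq = max(candidates) if candidates else 0
    offset = reference_positions[start_seq] if candidates else 0
    # Add up the lengths of the vertices between the reference and the target.
    for seq in range(start_seq, target_seq):
        (length,) = LENGTH.unpack_from(index_bytes, seq * LENGTH.size)
        offset += length
    (length,) = LENGTH.unpack_from(index_bytes, target_seq * LENGTH.size)
    return offset, length

index_bytes = b"".join(LENGTH.pack(n) for n in (18, 0, 9, 27, 9))
reference_positions = {3: 27}  # edge data of vertex 3 starts at byte 27
after_reference = locate_edge_data(index_bytes, 4, reference_positions)
before_reference = locate_edge_data(index_bytes, 2, reference_positions)
```

More reference positions mean fewer lengths to sum per query, at the cost of more state held on the compute node.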
Optionally, a vertex data file is stored in the target partition, and vertex data corresponding to vertex sequence numbers of a plurality of vertexes is stored in the vertex data file;
the apparatus further includes:
an obtaining module, configured to obtain the vertex data corresponding to the target vertex sequence number from the vertex data file according to the target vertex sequence number to be queried.
Optionally, the vertex data of each of the multiple vertices are sequentially stored according to the corresponding vertex sequence number, a vertex index file is further stored in the target partition, the length of each vertex data in the vertex data file is stored in the vertex index file, and the length of each vertex data stored in the vertex index file is sequentially stored according to the corresponding vertex sequence number;
the obtaining module is configured to determine the position of the vertex data corresponding to the target vertex sequence number in the vertex data file according to the target vertex sequence number and the vertex index file;
and obtain the vertex data corresponding to the target vertex sequence number from the vertex data file according to the position of the vertex data corresponding to the target vertex sequence number in the vertex data file.
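The two steps above (derive the position from the vertex index file, then read from the vertex data file) can be sketched as below. Storing the vertex attributes as JSON, the 4-byte length entries, and the name `read_vertex_data` are assumptions made for the example only.

```python
import io
import json
import struct
from itertools import accumulate

LENGTH = struct.Struct("<I")  # hypothetical: one length entry per vertex

def read_vertex_data(vertex_file, index_bytes, target_seq):
    """Locate and read the vertex data for target_seq."""
    lengths = [n for (n,) in LENGTH.iter_unpack(index_bytes)]
    starts = [0, *accumulate(lengths)]     # starts[i] = offset of vertex i
    vertex_file.seek(starts[target_seq])   # step 1: position from the index
    return vertex_file.read(lengths[target_seq])  # step 2: read the data

records = [
    json.dumps(attrs).encode()
    for attrs in (
        {"type": "person"},
        {"type": "vehicle", "plate": "A123"},
        {"type": "camera"},
    )
]
vertex_file = io.BytesIO(b"".join(records))
index_bytes = b"".join(LENGTH.pack(len(r)) for r in records)
attrs = json.loads(read_vertex_data(vertex_file, index_bytes, 1))
```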
Optionally, a mapping table is stored in the target partition, and a plurality of vertex sequence numbers respectively corresponding to the plurality of vertex identifications are stored in the mapping table;
the apparatus further includes:
a determining module, configured to determine the vertex sequence number of the vertex obtained after the graph data processing;
and an obtaining module, configured to obtain, according to the mapping table, the vertex identification corresponding to the determined vertex sequence number, and to use the vertex identification as the graph data processing result.
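Translating results back to vertex identifications only has to happen once, after processing is finished. A minimal sketch, assuming the mapping table is held as a dictionary from vertex identification to sequence number (the name `to_vertex_ids` and the sample data are hypothetical):

```python
def to_vertex_ids(result_seqs, mapping_table):
    """Map vertex sequence numbers produced by graph processing back to vertex IDs."""
    seq_to_id = {seq: vid for vid, seq in mapping_table.items()}
    return [seq_to_id[seq] for seq in result_seqs]

mapping_table = {"camera-17": 0, "person-a91f": 1, "vehicle-03": 2}
result_ids = to_vertex_ids([2, 0], mapping_table)
```

All intermediate computation can therefore work on the compact sequence numbers, with the longer identifications touched only at the boundary.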
Through the device shown in fig. 7, the embodiment of the present application can achieve at least the following technical effects:
(1) the vertex and the information of the vertex-associated edge can be stored in the form of the partition number and the vertex sequence number, so that when graph calculation is performed subsequently, compared with graph calculation based on the vertex ID, the data volume processed by the graph calculation based on the partition number and the vertex sequence number of the vertex is less, and the graph calculation efficiency is improved.
(2) When graph calculation is subsequently performed, the vertex ID can be replaced by the partition number and the vertex sequence number of the vertex in the related information of the vertex loaded into the memory, so that the data amount loaded into the memory is reduced, and the problem that graph calculation cannot be completed due to memory overflow is avoided.
It should be noted that: in the graph data processing apparatus provided in the above embodiment, when the graph data processing is performed, only the division of the functional modules is illustrated, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the graph data processing apparatus provided in the above embodiment and the graph data processing method embodiment belong to the same concept, and specific implementation processes thereof are described in the method embodiment and are not described herein again.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present application. Any of the nodes or compute nodes in the foregoing embodiments may be implemented by the server. Specifically:
the server 800 includes a Central Processing Unit (CPU)801, a system memory 804 including a Random Access Memory (RAM)802 and a Read Only Memory (ROM)803, and a system bus 805 connecting the system memory 804 and the central processing unit 801. The server 800 also includes a basic input/output system (I/O system) 806, which facilitates transfer of information between devices within the computer, and a mass storage device 807 for storing an operating system 813, application programs 814, and other program modules 815.
The basic input/output system 806 includes a display 808 for displaying information and an input device 809, such as a mouse or keyboard, for user input of information. The display 808 and the input device 809 are both connected to the central processing unit 801 through an input/output controller 810 connected to the system bus 805. The basic input/output system 806 may also include the input/output controller 810 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input/output controller 810 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 807 is connected to the central processing unit 801 through a mass storage controller (not shown) connected to the system bus 805. The mass storage device 807 and its associated computer-readable media provide non-volatile storage for the server 800. That is, the mass storage device 807 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
Without loss of generality, computer readable media may comprise computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM, EEPROM, flash memory or other solid state memory technology, CD-ROM, DVD, or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media is not limited to the foregoing. The system memory 804 and mass storage 807 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 800 may also operate by connecting, through a network such as the Internet, to a remote computer on the network. That is, the server 800 may be connected to the network 812 through the network interface unit 811 coupled to the system bus 805, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 811.
The memory further includes one or more programs, and the one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the graph data storage or processing methods provided by embodiments of the present application.
The embodiment of the application also provides a non-transitory computer readable storage medium, and when the instructions in the storage medium are executed by a processor of the server, the server is enabled to execute the graph data storage or processing method provided by the embodiment.
The embodiment of the present application further provides a computer program product containing instructions, which when run on a server, causes the server to execute the graph data storage or processing method provided by the foregoing embodiment.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only a preferred embodiment of the present application and should not be taken as limiting the present application, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (19)

1. A graph data storage method is applied to a target node in a storage system storing graph data, wherein the target node is any node in the storage system, and the method comprises the following steps:
acquiring a target vertex sequence number of a target vertex stored in a target partition, wherein the target partition is any storage partition on the target node, a plurality of vertexes are stored in the target partition, the vertexes are respectively corresponding to the vertex sequence numbers, the vertex sequence number of any vertex indicates the sequence of the vertex in the vertexes, and the target vertex is any vertex in the vertexes;
acquiring the partition number of the partition where the other vertex of each edge of the one or more edges associated with the target vertex is located, the vertex sequence number of the other vertex and the direction of each edge to obtain target edge data corresponding to the target vertex sequence number;
and writing the target edge data into an edge data file in the target partition, wherein the edge data file is used for storing edge data corresponding to each vertex sequence number.
2. The method of claim 1, wherein the edge data in the edge data file is stored in order by corresponding vertex sequence numbers;
and the target partition is also stored with an edge index file, the edge index file stores the length of each edge data in the edge data file, and the lengths of each edge data stored in the edge index file are sequentially stored according to the corresponding vertex sequence number.
3. The method of claim 2, wherein the edge index file stores therein a data type of a length of each edge data as an integer data type of 32 bytes or an integer data type of 64 bytes.
4. The method of claim 1, wherein the method further comprises:
and writing the corresponding relation between the target vertex sequence number and the vertex identification of the target vertex into a mapping table in the target partition, wherein the mapping table stores vertex sequence numbers respectively corresponding to the vertex identifications of the plurality of vertexes.
5. The method of any of claims 1 to 4, further comprising:
and writing the vertex data of the target vertex into a vertex data file in the target partition, wherein the vertex data file stores vertex data respectively corresponding to the vertex sequence numbers.
6. The method of claim 5, wherein the vertex data in the vertex data file is stored sequentially by respective vertex sequence numbers;
and the target partition is also stored with a vertex index file, the vertex index file stores the length of each vertex data in the vertex data file, and the lengths of the vertex data stored in the vertex index file are sequentially stored according to the vertex sequence number corresponding to the vertex data.
7. The method of claim 5, wherein the vertex data for the target vertex comprises attributes of the target vertex.
8. A graph data processing method applied to a compute node in a storage system storing graph data, the method comprising:
determining edge data to be processed in an edge data file stored in a target partition of a target node;
processing graph data according to the edge data to be processed;
wherein the target node is any node in the storage system, the target partition is any storage partition on the target node, the edge data file stores therein edge data respectively corresponding to vertex sequence numbers of multiple vertices stored in the target partition, the edge data includes a partition number of a partition in which the other vertex of each edge of one or more edges associated with the corresponding vertex is located, a vertex sequence number of the other vertex, and a direction of each edge, and the vertex sequence number of any vertex in the target partition indicates a sequence of that vertex in the multiple vertices.
9. The method of claim 8, wherein the edge data in the edge data file are stored sequentially according to corresponding vertex sequence numbers, the target partition further stores an edge index file, the edge index file stores therein lengths of the edge data in the edge data file, and the lengths of the edge data stored in the edge index file are stored sequentially according to corresponding vertex sequence numbers;
the determining of the to-be-processed edge data in the edge data file stored in the target partition of the target node includes:
and determining the to-be-processed edge data according to the length of each edge data stored in the edge index file.
10. The method according to claim 9, wherein, in a case where the graph data processing is iterative processing, the edge data to be processed is the current to-be-processed edge data obtained by iterating sequentially in accordance with the respective edge data lengths.
11. The method of claim 9, wherein in a case where the graph data processing is query processing, the edge data to be processed is edge data located according to the respective edge data lengths and vertex sequence numbers of vertices currently to be queried.
12. The method of claim 11, wherein the computing node further stores therein a location in the edge data file of edge data corresponding to a reference vertex sequence number, the reference vertex sequence number being one or more of the vertex sequence numbers of the plurality of vertices stored in the target partition;
and the current to-be-processed edge data is located according to the length of each edge data, the position in the edge data file of the edge data corresponding to the reference vertex sequence number, and the vertex sequence number of the vertex currently to be queried.
13. The method of claim 8, wherein said target partition has stored therein a vertex data file having stored therein vertex data corresponding to vertex sequence numbers of said plurality of vertices;
the method further comprises the following steps:
and acquiring vertex data corresponding to the target vertex sequence number from the vertex data file according to the target vertex sequence number to be inquired.
14. The method of claim 13, wherein the vertex data of each of the plurality of vertices is stored sequentially by corresponding vertex sequence number, the target partition further has stored therein a vertex index file, the vertex index file has stored therein a length of each vertex data in the vertex data file, and the lengths of the vertex data stored in the vertex index file are stored sequentially by corresponding vertex sequence number;
the obtaining of the vertex data corresponding to the target vertex sequence number from the vertex data file according to the target vertex sequence number includes:
determining the position of the vertex data corresponding to the target vertex sequence number in the vertex data file according to the target vertex sequence number and the vertex index file;
and acquiring the vertex data corresponding to the target vertex sequence number from the vertex data file according to the position of the vertex data corresponding to the target vertex sequence number in the vertex data file.
15. The method according to any one of claims 8 to 14, wherein the target partition has stored therein a mapping table having stored therein a plurality of vertex sequence numbers corresponding to the plurality of vertex identifications, respectively;
after processing the graph data stored in the target partition according to the edge index file, the method further includes:
determining a vertex sequence number of a vertex obtained after processing the graph data;
and acquiring a vertex identifier corresponding to the determined vertex sequence number according to the mapping table, and taking the vertex identifier as a graph data processing result.
16. A graph data storage apparatus, applied to a target node in a storage system storing graph data, where the target node is any node in the storage system, the apparatus comprising:
an obtaining module, configured to obtain a target vertex sequence number of a target vertex stored in a target partition, where the target partition is any storage partition on the target node, the target partition stores multiple vertices, the multiple vertices respectively correspond to vertex sequence numbers, the vertex sequence number of any vertex indicates a sequence of the vertex in the multiple vertices, and the target vertex is any vertex in the multiple vertices;
the obtaining module is further configured to obtain a partition number of a partition where another vertex of each edge of the one or more edges associated with the target vertex is located, a vertex sequence number of the another vertex, and a direction of each edge, so as to obtain target edge data corresponding to the target vertex sequence number;
and the writing module is used for writing the target edge data into an edge data file in the target partition, and the edge data file is used for storing the edge data corresponding to each vertex sequence number.
17. A graph data processing apparatus applied to a compute node in a storage system storing graph data, the apparatus comprising:
the determining module is used for determining the edge data to be processed in the edge data file stored in the target partition of the target node;
the processing module is used for processing the graph data according to the edge data to be processed;
wherein the target node is any node in the storage system, the target partition is any storage partition on the target node, the edge data file stores therein edge data respectively corresponding to vertex sequence numbers of multiple vertices stored in the target partition, the edge data includes a partition number of a partition in which the other vertex of each edge of one or more edges associated with the corresponding vertex is located, a vertex sequence number of the other vertex, and a direction of each edge, and the vertex sequence number of any vertex in the target partition indicates a sequence of that vertex in the multiple vertices.
18. A server, characterized in that the server comprises:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the steps of the method of any of claims 1 to 7, or of any of claims 8 to 15.
19. A computer-readable storage medium having stored thereon instructions which, when executed by a processor, carry out the steps of the method of any of claims 1 to 7, or claims 8 to 15.
CN202011192437.1A 2020-10-30 2020-10-30 Graph data storage and processing method and device and computer storage medium Active CN112287182B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011192437.1A CN112287182B (en) 2020-10-30 2020-10-30 Graph data storage and processing method and device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011192437.1A CN112287182B (en) 2020-10-30 2020-10-30 Graph data storage and processing method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN112287182A true CN112287182A (en) 2021-01-29
CN112287182B CN112287182B (en) 2023-09-19

Family

ID=74352978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011192437.1A Active CN112287182B (en) 2020-10-30 2020-10-30 Graph data storage and processing method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN112287182B (en)

Also Published As

Publication number Publication date
CN112287182B (en) 2023-09-19

CN111949439B (en) Database-based data file updating method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant