CN112905598B

CN112905598B - Method and system for storing graph task intermediate results based on separation of interface and implementation

Info

Publication number: CN112905598B
Application number: CN202110275558.0A
Authority: CN
Inventors: 陈榕; 姚子航; 陈海波; 臧斌宇
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2021-03-15
Filing date: 2021-03-15
Publication date: 2022-06-28
Anticipated expiration: 2041-03-15
Also published as: CN112905598A

Abstract

The invention provides a method and a system for storing intermediate results of graph tasks based on the separation of interface and implementation, wherein the method comprises the following steps: step S1: the server receives a combined graph task request from the client, parses the query task and the analysis task in the request, and sends them to a query engine for execution; step S2: the query engine determines the underlying data structure used by the intermediate result, executes the query task, and transmits the query result to the analysis engine; step S3: the analysis engine constructs the data structure used by the analysis algorithm by using a defined data interface, and runs the analysis algorithm on the constructed data structure; step S4: the analysis engine adds the analysis result to the original query result by using the data interface, and returns the result to the client. The invention designs a unified data interface and several types of underlying data structures, reduces the data format conversion overhead between the query task and the analysis task, and enables the combined graph task to be executed efficiently in a single system.

Description

Method and system for storing graph task intermediate results based on separation of interface and implementation

Technical Field

The invention relates to the technical field of graph task intermediate result storage, and in particular to a method and a system for storing graph task intermediate results based on the separation of interface and implementation.

Background

With the advent of the big data era, complex relationships arise among data. Graph-structured data can express such relationships among massive data well by abstracting data instances into nodes and the relationships among instances into edges. Many application scenarios, such as social networks and knowledge graphs, use graphs as the data storage structure, and these applications execute two main types of workloads on graph data: graph data query and graph data analysis. A combined graph task is a task that comprises both a graph query and a graph analysis. The triples referred to herein consist of the IDs of a subject, a predicate and an object.

The graph query task retrieves the node and edge data that satisfy the semantics specified by the user. SPARQL is a query language for graph data stored in the RDF format; a query statement written in SPARQL generally consists of a plurality of triples (subject, predicate, object), which describe the semantic requirements that the target data must meet. In graph query systems, the intermediate results of a query are typically represented as simple two-dimensional data, in which each row represents one set of node data that conforms to the query semantics.

The graph analysis task executes a complex analysis algorithm, such as SSSP (single-source shortest path) or BFS (breadth-first search), on graph data. During execution, the analysis algorithm needs to traverse the neighbors of each node to perform message passing. To obtain faster execution, graph analysis systems use compressed sparse matrix formats (Compressed Sparse Row, Compressed Sparse Column), which make traversing the neighbors of a node convenient, in the underlying storage.
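As an illustration of why a Compressed Sparse Row layout makes neighbor traversal convenient, the following minimal sketch builds the two CSR arrays from a directed edge list. The function names `build_csr` and `neighbors` and the sample edges are assumptions made for this example and do not appear in the patent.

```python
from typing import List, Tuple


def build_csr(num_nodes: int, edges: List[Tuple[int, int]]) -> Tuple[List[int], List[int]]:
    """Build CSR arrays (row_offsets, col_indices) from a directed edge list."""
    # Count the out-degree of every source node, shifted by one position.
    row_offsets = [0] * (num_nodes + 1)
    for src, _ in edges:
        row_offsets[src + 1] += 1
    # A prefix sum turns the counts into start offsets.
    for i in range(num_nodes):
        row_offsets[i + 1] += row_offsets[i]
    # Write each destination into the next free slot of its source node.
    col_indices = [0] * len(edges)
    cursor = row_offsets[:-1].copy()
    for src, dst in edges:
        col_indices[cursor[src]] = dst
        cursor[src] += 1
    return row_offsets, col_indices


def neighbors(row_offsets: List[int], col_indices: List[int], node: int) -> List[int]:
    """Return the out-neighbors of a node as one contiguous slice."""
    return col_indices[row_offsets[node]:row_offsets[node + 1]]


# Hypothetical sample graph with four nodes.
edges = [(0, 1), (0, 2), (1, 2), (3, 0)]
row_offsets, col_indices = build_csr(4, edges)
print(neighbors(row_offsets, col_indices, 0))
```

All neighbors of a node sit in one contiguous slice, so a traversal touches memory sequentially instead of scanning a two-dimensional table row by row.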

Chinese invention patent publication No. CN105210058A discloses graph query processing using multiple engines, in which a graph query submitted to a graph database modeled as a property graph is received. The graph query is decomposed into a plurality of query components. For each of the query components, a query execution engine that is available to process the query component is identified among the query execution engines, a sub-query that represents the query component is generated, the sub-query is sent to the identified query execution engine for processing, and the results of the sub-query are received from the identified query execution engine. The received results are then combined to generate a response to the graph query.

Existing systems are developed for either the graph query task or the graph analysis task only, but in practical application scenarios a single task often involves both graph query and graph analysis. For example, in a movie recommendation system, a typical recommendation task consists of a graph query stage and a graph analysis stage: after a user watches a movie, the system queries the other users who have watched that movie and the movies those users have watched recently; it then runs the PageRank algorithm on the query result to find the movies that have recently attracted the most attention and returns them to the user as recommendations. In current solutions, the query operation has to be executed in a graph query system, and the query result is then transmitted to a graph analysis system, which executes the analysis algorithm. During the transmission, the query result has to be converted into the underlying data structure used by the graph analysis system, and this format conversion incurs a large performance overhead.

Existing systems usually reduce the format conversion overhead of such combined tasks by modifying the intermediate result data structure of the graph query system, but this solution has an extensibility problem: different graph analysis algorithms use different underlying data stores, and when the analysis algorithm in an application changes, a developer has to rewrite the execution logic of the graph query to adapt it to the new analysis algorithm. Some systems instead design one general data structure that is used in both the query process and the analysis process, but such a data structure sacrifices the execution performance of the analysis system for the sake of generality.

Therefore, how to design a general data structure scheme for combined query and analysis tasks, such that the data conversion overhead between query and analysis is reduced and the execution performance under various analysis algorithms is close to that of the optimal data structure, thereby improving the execution performance of the combined task, has become a technical problem to be solved by those skilled in the art.

Disclosure of Invention

In view of the defects in the prior art, the invention aims to provide a method and a system for storing graph task intermediate results based on the separation of interface and implementation.

The scheme of the method and the system for storing graph task intermediate results based on the separation of interface and implementation according to the invention is as follows:

In a first aspect, a method for storing graph task intermediate results based on the separation of interface and implementation is provided, the method including:

step S1: the server receives a combined graph task request from the client, parses the query task and the analysis task in the combined graph task request, and sends the query task and the analysis task to a query engine for execution;

step S2: after receiving the request, the query engine determines the underlying data structure used by the intermediate result according to the characteristics of the subsequent analysis task, starts to execute the query task, and transmits the query result to the analysis engine after the query task has been executed;

step S3: after receiving the query result, the analysis engine constructs the data structure used by the analysis algorithm by using a defined data interface, and runs the analysis algorithm on the constructed data structure;

step S4: after the analysis algorithm has finished running, the analysis engine adds the analysis result to the original query result by using the data interface and returns the result to the client.

Preferably, the step S1 is specifically as follows:

receiving a combined task request sent by a client, wherein the combined task request comprises a query task and an analysis task which need to be executed;

parsing the query task, namely converting the SPARQL query statement represented by a character string into triples represented by numbers;

parsing the analysis task, wherein the analysis task comprises the name of the analysis algorithm, the parameters used by the algorithm and the naming of the algorithm output result;

after the parsing is completed, the query task and the analysis task are transmitted to a query engine, and the query engine executes the query task first.

Preferably, the query task in step S2 includes:

the query engine sequentially executes the triple query statements, and modifies the intermediate result by using a defined data interface after the graph data corresponding to each triple has been queried;

step S2 is repeated until all the triples have been executed, and execution then proceeds to step S3.

Preferably, the step S2 includes:

the query engine determines the underlying data structure used for the query intermediate result according to the analysis algorithm to be executed subsequently;

the intermediate result data is initialized and the query starts to execute: the triples in the query statement are executed sequentially, the edge data in the key-value store is accessed according to the subject and the predicate, and the query intermediate result is modified with that edge data, wherein the modification operates on the intermediate result through a defined data interface and comprises an add operation and a pruning operation.

Preferably, the step S3 includes: the analysis engine receives the query result sent by the query engine, acquires from the query result, by using a defined data interface, the two columns of node data and the edge data to be analyzed by the analysis engine, constructs a data structure convenient for executing the analysis algorithm, and executes the analysis algorithm.

In a second aspect, a system for storing graph task intermediate results based on the separation of interface and implementation is provided, the system including:

module M1: the server receives a combined graph task request from the client, parses the query task and the analysis task in the combined graph task request, and sends the query task and the analysis task to a query engine for execution;

module M2: after receiving the request, the query engine determines the underlying data structure used by the intermediate result according to the characteristics of the subsequent analysis task, starts to execute the query task, and transmits the query result to the analysis engine after the query task has been executed;

module M3: after receiving the query result, the analysis engine constructs the data structure used by the analysis algorithm by using a defined data interface, and runs the analysis algorithm on the constructed data structure;

module M4: after the analysis algorithm has finished running, the analysis engine adds the analysis result to the original query result by using the data interface and returns the result to the client.

Preferably, the module M1 includes:

Preferably, the query task in the module M2 includes:

module M2 is executed repeatedly until all the triples have been executed, and execution then proceeds to module M3.

Preferably, the module M2 includes:

Preferably, the module M3 includes: the analysis engine receives the query result sent by the query engine, acquires from the query result, by using a defined data interface, the two columns of node data and the edge data to be analyzed by the analysis engine, constructs a data structure convenient for executing the analysis algorithm, and executes the analysis algorithm.

Compared with the prior art, the invention has the following beneficial effects:

1. The storage method of the invention stores graph task intermediate results based on the separation of interface and implementation, makes full use of the execution characteristics of the graph query system, and defines a unified intermediate result data interface, so that the query engine does not need to consider the underlying intermediate result storage structure when executing a query, which improves the generality of the query engine;

2. According to the execution characteristics of graph analysis algorithms, several intermediate result storage structures are designed and implemented, so that a graph analysis algorithm can improve its running performance by using a traversal-friendly data structure; the invention also allows the user to add implementations of new storage structures according to specific requirements;

3. According to the method, the underlying data structure used by the query engine can be selected according to the characteristics of the graph analysis algorithm, so that the storage structure of the query result can be quickly converted into the data structure used by the analysis engine, which reduces the data format conversion cost between the query stage and the analysis stage and improves the overall execution performance of the combined graph task;

4. The invention provides a scheme for storing graph task intermediate results based on the separation of interface and implementation, which eliminates the data format conversion overhead incurred by existing systems when executing combined tasks, has high generality and extensibility, and can serve as a reference for the intermediate result storage of other types of combined tasks.

Drawings

Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments with reference to the following drawings:

FIG. 1 is a flow chart of the method for storing graph task intermediate results based on the separation of interface and implementation according to the present invention.

Detailed Description

The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but are not intended to limit the invention in any way. It should be noted that those skilled in the art can make various changes and modifications without departing from the spirit of the invention, all of which fall within the scope of the present invention.

The embodiment of the invention provides a method for storing graph task intermediate results based on the separation of interface and implementation. As shown in FIG. 1, firstly, a server receives a combined graph task request from a client, parses the query task and the analysis task in the combined graph task request, and sends the query task and the analysis task to a query engine for execution, specifically as follows:

A combined task request sent by a client is received, wherein the combined task request comprises a query task and an analysis task which need to be executed. The query task is parsed: the SPARQL query statement represented by a character string is converted into a query statement represented by numeric triples (subject, predicate, object), in which the subject, the predicate and the object of each triple are each represented by a unique number (ID).
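The conversion from string terms to numeric IDs can be sketched as follows. This is a minimal illustration and not a SPARQL parser: it assumes the triple patterns have already been split into three strings, and the class name `IdDictionary` and the convention of giving variables negative IDs are assumptions made for this example only.

```python
from typing import Dict, List, Tuple

Triple = Tuple[int, int, int]


class IdDictionary:
    """Hypothetical string-to-ID mapping: constants get positive IDs, variables negative IDs."""

    def __init__(self) -> None:
        self.const_ids: Dict[str, int] = {}
        self.var_ids: Dict[str, int] = {}

    def encode(self, term: str) -> int:
        # Terms starting with '?' are query variables, as in SPARQL syntax.
        if term.startswith("?"):
            return self.var_ids.setdefault(term, -(len(self.var_ids) + 1))
        return self.const_ids.setdefault(term, len(self.const_ids) + 1)


def encode_patterns(patterns: List[Tuple[str, str, str]], dictionary: IdDictionary) -> List[Triple]:
    """Replace every subject, predicate and object string by its numeric ID."""
    return [(dictionary.encode(s), dictionary.encode(p), dictionary.encode(o))
            for s, p, o in patterns]


# Hypothetical triple patterns in the spirit of the movie recommendation example.
patterns = [("?user", "watched", "Inception"), ("?user", "watched", "?movie")]
encoded = encode_patterns(patterns, IdDictionary())
print(encoded)
```

In a real system the constant IDs would come from the dictionary built when the graph data was loaded; here they are assigned on first use purely to keep the sketch self-contained.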

The analysis task is parsed to obtain the analysis algorithm to be used, the algorithm parameters and the result naming information. After the parsing is completed, the query task and the analysis task are transmitted to a query engine, and the query engine executes the query task first.

Secondly, after receiving the request, the query engine determines the underlying data structure used by the intermediate result according to the characteristics of the subsequent analysis task and starts to execute the query task; after the query task has been executed, the query engine transmits the query result to the analysis engine, specifically as follows:

The query engine selects a suitable underlying data structure for the intermediate result according to the characteristics of the subsequent analysis task, where suitable means that, after the query is finished, the result data generated by the query can be converted into the data structure used by the analysis engine with the least overhead. After selecting one of the underlying data structures, the query engine initializes the intermediate result object in preparation for executing the query.
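One simple way to picture this selection step is a lookup table from the analysis algorithm to a preferred layout. The sketch below is only an illustration under stated assumptions: the table `PREFERRED_LAYOUT`, the layout names and the fallback value are hypothetical, and the patent does not prescribe how the choice is made.

```python
from typing import Dict

# Hypothetical mapping from algorithm name to intermediate result layout.
# Graph computation algorithms favor a sparse matrix layout, while neural
# network style neighbor aggregation favors an adjacency matrix.
PREFERRED_LAYOUT: Dict[str, str] = {
    "pagerank": "csr",
    "sssp": "csr",
    "bfs": "csr",
    "gnn_aggregate": "adjacency_matrix",
}


def choose_layout(algorithm_name: str) -> str:
    """Pick the layout for the intermediate result; fall back to a plain row table."""
    return PREFERRED_LAYOUT.get(algorithm_name.lower(), "row_table")


print(choose_layout("PageRank"))
print(choose_layout("custom_algorithm"))
```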

The query engine sequentially executes the triple query statements, specifically: for each triple, the query engine accesses the key-value store according to the subject and the predicate of the triple to obtain the neighbor node data of the nodes corresponding to the subject, and operates on the intermediate result through the defined unified data interface. If the object of the triple does not yet exist in the current intermediate result, an add operation is performed on the intermediate result: each row in the existing intermediate result is traversed, and the objects, namely the neighbor nodes, are looked up in the graph store according to the subject and the predicate; the returned node list is traversed, and each neighbor node is spliced with the current row into a new row, which is stored in a new intermediate result; after the node list has been traversed, the operation moves on to the next row.
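The add operation can be sketched on a plain row table as follows. This is a minimal illustration: the key-value store is modeled as a dictionary keyed by (subject ID, predicate ID), and the names `GraphStore` and `add_operation` as well as the sample IDs are assumptions for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical key-value graph store: (subject ID, predicate ID) -> object IDs.
GraphStore = Dict[Tuple[int, int], List[int]]


def add_operation(rows: List[List[int]], subject_col: int, predicate: int,
                  store: GraphStore) -> List[List[int]]:
    """Extend every row by each neighbor reachable from its subject node."""
    new_rows: List[List[int]] = []
    for row in rows:
        # Look up the neighbors of the node bound to the subject in this row.
        for neighbor in store.get((row[subject_col], predicate), []):
            # Splice the neighbor onto a copy of the current row.
            new_rows.append(row + [neighbor])
    return new_rows


store: GraphStore = {(1, 10): [2, 3], (4, 10): [3]}
rows = [[1], [4], [5]]
expanded = add_operation(rows, subject_col=0, predicate=10, store=store)
print(expanded)
```

A row whose subject has no matching edge (node 5 here) produces no output row, which matches the join semantics of a triple pattern.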

If the object of the triple already exists in the current intermediate result, a pruning operation is performed on the intermediate result: each row in the existing intermediate result is traversed, and the objects, namely the neighbor nodes, are looked up in the graph store according to the subject and the predicate; the returned node list is searched for the node ID that the current row holds for the object; if a match is found, the row is retained, and otherwise the row is deleted.
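The pruning operation can be sketched in the same style. As before, the dictionary-based store, the function name `prune_operation` and the sample IDs are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical key-value graph store: (subject ID, predicate ID) -> object IDs.
GraphStore = Dict[Tuple[int, int], List[int]]


def prune_operation(rows: List[List[int]], subject_col: int, object_col: int,
                    predicate: int, store: GraphStore) -> List[List[int]]:
    """Keep only rows whose object value is a neighbor of their subject value."""
    kept: List[List[int]] = []
    for row in rows:
        neighbors = store.get((row[subject_col], predicate), [])
        # Retain the row on a match, drop it otherwise.
        if row[object_col] in neighbors:
            kept.append(row)
    return kept


store: GraphStore = {(1, 20): [2], (4, 20): [9]}
rows = [[1, 2], [4, 3]]
pruned = prune_operation(rows, subject_col=0, object_col=1, predicate=20, store=store)
print(pruned)
```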

After the query engine has executed all the triple query statements, it sends the query result to the analysis engine, and execution continues with the next steps.

Then, the analysis engine receives the query result, constructs the data structure used by the analysis algorithm by using the defined data interface, and runs the analysis algorithm on the constructed data structure, specifically as follows:

The analysis engine receives the query result and uses it to construct the data structure used by the analysis engine, specifically: according to the nodes to be analyzed, the analysis engine calls the data interface to acquire the node data and the edge data between the nodes, and constructs the traversal-friendly data structure used by the analysis engine.
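The construction step can be illustrated by extracting two columns of a row-oriented query result as edges and grouping them into adjacency lists. This is a sketch under stated assumptions: the function names and the choice of adjacency lists as the traversal-friendly structure are illustrative, and an actual implementation might build CSR arrays or another layout instead.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def edges_from_columns(rows: List[List[int]], src_col: int, dst_col: int) -> List[Tuple[int, int]]:
    """Read two columns of the query result as distinct (source, target) edges."""
    seen = set()
    edges: List[Tuple[int, int]] = []
    for row in rows:
        edge = (row[src_col], row[dst_col])
        if edge not in seen:  # the same pair can occur in several result rows
            seen.add(edge)
            edges.append(edge)
    return edges


def build_adjacency(edges: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Group edges by source node; every node gets an entry, even without out-edges."""
    adjacency: Dict[int, List[int]] = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
        adjacency.setdefault(dst, [])
    return dict(adjacency)


# Hypothetical query result with columns (user ID, movie ID).
rows = [[1, 7], [2, 7], [1, 8], [1, 7]]
edges = edges_from_columns(rows, src_col=0, dst_col=1)
adjacency = build_adjacency(edges)
print(adjacency)
```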

The analysis engine executes the analysis algorithm, specifically: the analysis engine traverses the adjacent edge information of the nodes in the data and executes the logic of the particular analysis algorithm, aggregating the state information of the adjacent nodes with the current state information of each node to calculate the state information of that node for the next iteration. After the state information of all nodes has been updated, the current iteration count is checked; if all iterations have been completed, the results are collected and the subsequent steps are executed; otherwise, the next iteration begins.
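The iteration scheme can be illustrated with PageRank, the algorithm named in the movie recommendation example. The sketch below is a textbook formulation with a fixed iteration count, not the patent's implementation; the damping factor, the iteration count and the handling of nodes without out-edges are assumptions.

```python
from typing import Dict, List


def pagerank(adjacency: Dict[int, List[int]], iterations: int = 20,
             damping: float = 0.85) -> Dict[int, float]:
    """Run a fixed number of PageRank iterations over adjacency lists."""
    n = len(adjacency)
    rank = {node: 1.0 / n for node in adjacency}
    for _ in range(iterations):
        # Every node starts the next iteration with the teleport share.
        next_rank = {node: (1.0 - damping) / n for node in adjacency}
        # Rank held by nodes without out-edges is spread over all nodes.
        dangling = sum(rank[node] for node, outs in adjacency.items() if not outs)
        for node, outs in adjacency.items():
            if outs:
                share = damping * rank[node] / len(outs)
                for target in outs:
                    next_rank[target] += share  # message passed along the edge
        for node in next_rank:
            next_rank[node] += damping * dangling / n
        rank = next_rank  # all node states are updated before the next iteration
    return rank


# Hypothetical graph: users 1 and 2 point at movies 7 and 8.
adjacency = {1: [7, 8], 2: [7], 7: [], 8: []}
scores = pagerank(adjacency)
print(max(scores, key=scores.get))
```

Movie 7 receives rank from both users while movie 8 receives only half of user 1's rank, so movie 7 ends up with the higher score.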

Finally, the analysis engine merges the analysis result with the original query result and returns them to the client, specifically: the analysis engine collects the state information of all nodes by using the data interface of the intermediate result, traverses each row of the query result, appends the state data corresponding to the node to the row as a new column, and returns the result to the client after the traversal is finished.
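Merging the analysis result back into the query result can be sketched as attaching each node's state value to the rows that contain that node. The function name `append_state`, the default value for nodes without a state and the sample data are assumptions made for this example.

```python
from typing import Dict, List, Union

Value = Union[int, float]


def append_state(rows: List[List[int]], node_col: int, state: Dict[int, float],
                 default: float = 0.0) -> List[List[Value]]:
    """Attach the analysis state of the node in node_col as an extra field of every row."""
    merged: List[List[Value]] = []
    for row in rows:
        # Nodes the analysis did not touch receive the default value.
        merged.append(list(row) + [state.get(row[node_col], default)])
    return merged


# Hypothetical query result (user ID, movie ID) and per-movie scores.
rows = [[1, 7], [2, 7], [1, 8]]
state = {7: 0.5, 8: 0.25}
result = append_state(rows, node_col=1, state=state)
print(result)
```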

The invention designs a storage method for combined graph task intermediate results based on the idea of separating interface from implementation. The main motivation is that traditional storage methods do not consider the inconsistency of data structures between the query stage and the analysis stage of a combined task, so that a large data format conversion cost arises between the stages. Traditional storage methods have the following problems:

1. The inconsistency of data storage formats between the stages of a combined task is not considered. A traditional storage method usually designs its storage structure for a single type of graph task. In the combined task scenario, the result data generated by the query task serves as the input data of the analysis task in the next stage, and the storage structure used by the query task differs from the data structure used by the analysis task. When the result data is transmitted between the stages, an additional data format conversion is needed, and this process causes high overhead.

2. The different requirements of different analysis algorithms on data structures are not considered. For some graph computation algorithms, the analysis engine usually uses a sparse matrix format for data storage; for some neural network methods, the analysis engine usually uses an adjacency matrix for message aggregation over neighbor nodes. A traditional storage method uses a single data structure for storage, so the system can achieve good performance on only a few algorithms.

The storage method adopted by the invention is an intermediate result storage method for combined graph tasks and has the following advantages:

1. To address the inconsistency of data structures between the query task and the analysis task, performance optimization is carried out for the combined graph task, and a rich set of intermediate result data structures is designed, so that the query result produced by the query engine can be converted into the data structure used by the analysis engine at a lower cost, which reduces the performance cost between the stages of the combined task.

2. Considering that different analysis algorithms have different requirements on the intermediate result data, the invention designs a unified intermediate result data interface, so that the query engine does not need to change its query execution code when the underlying structure of the intermediate result changes, and a user can develop a specific data structure as the storage structure of the intermediate result according to a specific application scenario. Because the intermediate result data interface is separated from the underlying implementation, the invention has good generality and extensibility.
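The separation of the data interface from the underlying implementation can be illustrated with an abstract base class and two interchangeable storage layouts. This is a sketch only: the interface name `IntermediateResult`, its three methods and the two layouts are assumptions chosen for illustration and are not the interface defined by the patent.

```python
from abc import ABC, abstractmethod
from typing import Iterable, Iterator, List, Tuple


class IntermediateResult(ABC):
    """Hypothetical unified interface seen by the query and analysis engines."""

    @abstractmethod
    def append_row(self, row: List[int]) -> None: ...

    @abstractmethod
    def rows(self) -> Iterator[List[int]]: ...

    @abstractmethod
    def column_pair(self, a: int, b: int) -> List[Tuple[int, int]]: ...


class RowTable(IntermediateResult):
    """Row-oriented layout: cheap to extend row by row during the query."""

    def __init__(self) -> None:
        self._rows: List[List[int]] = []

    def append_row(self, row: List[int]) -> None:
        self._rows.append(list(row))

    def rows(self) -> Iterator[List[int]]:
        return iter(self._rows)

    def column_pair(self, a: int, b: int) -> List[Tuple[int, int]]:
        return [(row[a], row[b]) for row in self._rows]


class ColumnTable(IntermediateResult):
    """Column-oriented layout: cheap to hand two whole columns to the analysis engine."""

    def __init__(self, width: int) -> None:
        self._cols: List[List[int]] = [[] for _ in range(width)]

    def append_row(self, row: List[int]) -> None:
        for col, value in zip(self._cols, row):
            col.append(value)

    def rows(self) -> Iterator[List[int]]:
        return (list(values) for values in zip(*self._cols))

    def column_pair(self, a: int, b: int) -> List[Tuple[int, int]]:
        return list(zip(self._cols[a], self._cols[b]))


def fill(result: IntermediateResult, data: Iterable[List[int]]) -> IntermediateResult:
    """Query-side code: written once against the interface, works with any layout."""
    for row in data:
        result.append_row(row)
    return result


data = [[1, 7], [2, 7], [1, 8]]
row_based = fill(RowTable(), data)
column_based = fill(ColumnTable(width=2), data)
print(row_based.column_pair(0, 1) == column_based.column_pair(0, 1))
```

The `fill` function never inspects the concrete class, so swapping the layout requires no change to the query-side code, which is the property the unified interface is meant to provide.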

In summary, the method for storing combined graph task intermediate results based on the separation of interface and implementation provided by the invention fully considers the data conversion cost in combined graph tasks and designs a rich set of intermediate result data structures according to the characteristics of analysis tasks, so that query results can be quickly converted into data structures that are friendly to the analysis task, and the data structure with the minimum cost can be selected in different application scenarios, thereby improving the overall performance of the combined graph task.

Those skilled in the art know that, in addition to implementing the system and its devices, modules and units provided by the present invention purely as computer-readable program code, the same functions can be implemented by logically programming the method steps in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Therefore, the system and its devices, modules and units provided by the present invention can be regarded as a hardware component, and the devices, modules and units included therein for implementing various functions can also be regarded as structures within the hardware component; the devices, modules and units for implementing various functions can also be regarded both as software modules for implementing the method and as structures within the hardware component.

The foregoing description has described specific embodiments of the present invention. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.

Claims

1. A method for storing graph task intermediate results based on the separation of interface and implementation, characterized by comprising the following steps:

step S1: the server receives the combined graph task request of the client, analyzes the query task and the analysis task in the combined graph task request, and sends the query task and the analysis task to a query engine for execution;

step S4: after the operation of the analysis algorithm is finished, the analysis engine adds the analysis result to the original query result by using a data interface and returns the result to the client;

wherein, the query task in step S2 includes:

step S2 is repeated until all triples have been executed, and execution then proceeds to step S3;

The step S2 specifically includes:

the query engine determines a bottom data structure used for querying the intermediate result according to a subsequently executed analysis algorithm;

2. The method for storing graph task intermediate results based on interface implementation separation according to claim 1, wherein the step S1 is specifically as follows:

3. The method for storing graph task intermediate results based on interface implementation separation of claim 1, wherein the step S3 includes: the analysis engine receives the query result sent by the query engine, acquires two columns of node data and edge data to be analyzed by the analysis engine from the query result by using a defined data interface, constructs a data structure convenient for executing an analysis algorithm, and executes the analysis algorithm.

4. A system for storing graph task intermediate results based on the separation of interface and implementation, comprising:

Module M4: after the operation of the analysis algorithm is finished, the analysis engine adds the analysis result to the original query result by using a data interface and returns the result to the client;

wherein, the query task in the module M2 includes:

module M2 is executed repeatedly until all triples have been executed, and execution then proceeds to module M3;

the module M2 specifically includes:

5. The system according to claim 4, wherein said module M1 comprises:

6. The system according to claim 4, wherein said module M3 comprises: the analysis engine receives the query result sent by the query engine, acquires two columns of node data and edge data to be analyzed by the analysis engine from the query result by using a defined data interface, constructs a data structure convenient for executing an analysis algorithm, and executes the analysis algorithm.