CN112487015B

CN112487015B - Distributed RDF system based on incremental repartitioning and query optimization method thereof

Info

Publication number: CN112487015B
Application number: CN202011371750.1A
Authority: CN
Inventors: 冯钧; 王秉发; 陆佳民; 杨程
Original assignee: Hohai University HHU
Current assignee: Hohai University HHU
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-10-14
Anticipated expiration: 2040-11-30
Also published as: CN112487015A

Abstract

The invention discloses a distributed RDF system based on incremental repartitioning and a query optimization method thereof, belonging to the technical field of knowledge graph data storage. The invention provides a storage framework that stores RDF data in a hybrid relational schema, which reduces data preprocessing time and system storage cost; by adopting a hybrid storage mode that combines hash partitioning and vertical partitioning, multiple types of query patterns are optimized and the performance of distributed SPARQL queries is significantly improved; an incremental repartitioning model based on frequent patterns is designed, realizing dynamic adaptation to changes in the query workload.

Description

Distributed RDF system based on incremental repartitioning and query optimization method thereof

Technical Field

The invention belongs to the technical field of knowledge graph data storage, and particularly relates to a distributed RDF system based on incremental repartitioning and a query optimization method thereof.

Background

RDF, as a data model for publishing, sharing, and interlinking data on the Web, has been widely used in a variety of applications. As the scale of RDF data grows, the storage of RDF data and the processing of SPARQL queries have exceeded the capability of a single machine, and a high-performance distributed RDF data management system is needed to manage and reuse large-scale RDF data.

Existing distributed RDF data management systems process large-scale RDF data efficiently on shared-nothing clusters. According to the execution model, existing systems can be classified into Hadoop-based systems and in-memory (RAM-based) systems. To improve the performance and flexibility of distributed SPARQL query evaluation, both the storage mode of RDF data and the query translation from SPARQL to SQL have been studied and optimized.

However, existing distributed RDF data management systems still have the following problems:

(1) Optimization targets only specific query types, so overall query efficiency is low;

(2) Expensive preprocessing overhead and long data loading times are required;

(3) Data redundancy is high;

(4) They cannot adapt dynamically to changes in the workload.

Disclosure of Invention

Purpose of the invention: one object of the invention is to provide a distributed RDF system based on incremental repartitioning; another object is to provide a query optimization method for the distributed RDF system based on incremental repartitioning.

Technical solution: to achieve the above objects, the invention provides the following technical solution:

The distributed RDF system based on incremental repartitioning comprises an RDF data partitioning module, an RDF data incremental repartitioning module and a distributed query module, wherein:

the RDF data partitioning module comprises a relational storage mode selector and a storage executor; the storage mode selector selects the required hybrid storage scheme from the constructed relational storage mode library; the storage executor stores the RDF data using the storage modes specified by the selected hybrid storage scheme;

the RDF data incremental repartitioning module comprises a frequent pattern miner and an incremental repartitioning executor; the frequent pattern miner receives the query workload collected by the query monitor and mines frequent patterns from the co-occurrence of predicates in SPARQL queries; the incremental repartitioning executor constructs frequent predicate extended vertical partition tables according to the frequent patterns, realizing incremental repartitioning of the RDF data;

the distributed query module comprises a query monitor, a query planner and a query executor; the query monitor monitors the query workload and periodically forwards it to the frequent pattern miner; the query planner designs the distributed query plan, generating a logical query plan after applying algebraic optimization and join order optimization to the SPARQL query; and the query executor generates the corresponding physical query plan through Spark SQL and evaluates the query.

Further, the relational storage mode library comprises hash partitioning, vertical partitioning (VP), extended vertical partitioning (ExtVP), and property tables.

The query optimization method of the distributed RDF system based on the incremental repartitioning comprises the following steps:

(1) Cleaning the RDF data set;

(2) Storing the RDF data based on the hybrid relational schema;

(3) Designing a distributed query plan;

(4) Executing the distributed query plan;

(5) Incrementally repartitioning the RDF data.

Further, the step (1) specifically comprises the following steps:

(11) Converting the data sets that require format conversion into N-Triples format in batches;

(12) Inputting the format-converted data set into a regular-expression parser, filtering out useless information such as blank lines, redundant data and ontology description information, and converting the RDF data into prefixed form.
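The cleaning step (12) can be illustrated with a minimal sketch. It assumes N-Triples input, a hypothetical prefix map (`PREFIXES`, using an `example.org` namespace) and a simplified regular expression that is not a complete N-Triples grammar; none of these names come from the patent.

```python
import re

# Hypothetical prefix map; a real data set would supply its own namespaces.
PREFIXES = {"http://example.org/": "ex:"}

# Simplified pattern for one N-Triples statement: subject, predicate, object, " ."
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def shorten(term):
    """Replace a known namespace IRI with its prefix; leave other terms unchanged."""
    if term.startswith("<") and term.endswith(">"):
        iri = term[1:-1]
        for namespace, prefix in PREFIXES.items():
            if iri.startswith(namespace):
                return prefix + iri[len(namespace):]
    return term


def clean(lines):
    """Drop blank lines, comments, malformed lines and duplicate triples."""
    seen = set()
    result = []
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match is None:           # blank line, comment, or malformed statement
            continue
        triple = tuple(shorten(term) for term in match.groups())
        if triple not in seen:      # skip redundant (duplicate) triples
            seen.add(triple)
            result.append(triple)
    return result
```

The function returns the cleaned triples in their original order, so the output can be fed directly to a partitioning step.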

Further, the step (2) specifically comprises the following steps:

(21) Selecting, by a storage mode selector, a hybrid storage mode based on hash partitioning and vertical partitioning;

(22) Loading the RDF data through the storage executor, applying in turn the storage modes of the schemes selected by the storage mode selector, to realize the initial partitioning of the RDF data;

wherein, the concrete process of step (22) is as follows:

1) Hash-partitioning the initial data on the subject of each RDF triple to form hash shards;

2) Completing the initial partitioning of the RDF data by applying vertical partitioning within each hash shard.
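The two-level partitioning of step (22) can be sketched as follows. This is an illustrative in-memory version, not the Spark-based implementation; the use of `zlib.crc32` as the hash function and the function names are assumptions, since the text does not name a hash function.

```python
import zlib
from collections import defaultdict


def hash_partition(triples, num_partitions):
    """Assign each triple to a shard by hashing its subject."""
    shards = [[] for _ in range(num_partitions)]
    for s, p, o in triples:
        # crc32 is chosen only because it is stable across runs (assumption).
        shard_id = zlib.crc32(s.encode("utf-8")) % num_partitions
        shards[shard_id].append((s, p, o))
    return shards


def vertical_partition(shard):
    """Split one shard into per-predicate tables of (subject, object) rows."""
    tables = defaultdict(list)
    for s, p, o in shard:
        tables[p].append((s, o))
    return dict(tables)


def initial_partition(triples, num_partitions):
    """Hash-partition on subject, then vertically partition every shard."""
    return [vertical_partition(shard)
            for shard in hash_partition(triples, num_partitions)]
```

Because the hash is taken on the subject, all triples with the same subject land in the same shard, so subject-subject joins need no data movement between shards.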

Further, the step (3) specifically comprises the following steps:

(31) Performing query translation through the query planner: the SPARQL query is parsed into the corresponding algebra tree using the Jena ARQ component, and an equivalent Spark SQL expression is generated after basic algebraic optimization is applied; that is, a basic graph pattern $BGP = \{tp_1, tp_2, \ldots, tp_n\}$ is converted into a set of equivalent sub-queries.

The specific transformation steps are as follows:

1) Query candidate table selection: for a given basic graph pattern $BGP$, select any two triple patterns $tp_i$ and $tp_j$, and match the corresponding vertical partition tables (VPs) according to the predicates of $tp_i$ and $tp_j$:

(1) first determine whether the two triple patterns are correlated: if $tp_i$ and $tp_j$ are not correlated, go to (2); if they are correlated, go to (3);

(2) select the vertical partition tables $VP_i$ and $VP_j$ corresponding to the predicates of $tp_i$ and $tp_j$, and add them to the query candidate tables of $tp_i$ and $tp_j$ respectively;

(3) determine whether the predicates of $tp_i$ and $tp_j$ are both frequent predicates; if so, go to (5); otherwise, go to (2);

(4) for $tp_i$, compare the statistics of its vertical partition table $VP_i$ with those of the FP-ExtVP table for the corresponding correlation, select the smaller table, and put it into the ascending queue of query candidate tables of $tp_i$; the comparison for $tp_j$ is the same as for $tp_i$;

(5) take the head element of the ascending candidate queue of $tp_i$ and of $tp_j$ as the final query table, that is, select the smallest table as the final query table;
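The candidate table selection above can be sketched as follows. This is one reading of the steps, under stated assumptions: variables are terms starting with `?`, table statistics are plain row counts in a `sizes` dictionary, and the table naming scheme (`VP_<p>`, `FP-ExtVP_<kind>_<p>|<q>`) is hypothetical. The sketch compares the VP table with the FP-ExtVP table only when the patterns are correlated and both predicates are frequent.

```python
def correlation(tp_i, tp_j):
    """Return 'SS', 'SO' or 'OS' for a shared variable, else None (OO is not searched)."""
    s_i, _, o_i = tp_i
    s_j, _, o_j = tp_j

    def is_var(term):
        return term.startswith("?")

    if is_var(s_i) and s_i == s_j:
        return "SS"
    if is_var(s_i) and s_i == o_j:
        return "SO"
    if is_var(o_i) and o_i == s_j:
        return "OS"
    return None


def select_tables(tp_i, tp_j, frequent, sizes):
    """Return the final query tables for tp_i and tp_j (smallest candidate each)."""
    p_i, p_j = tp_i[1], tp_j[1]
    cand_i, cand_j = ["VP_" + p_i], ["VP_" + p_j]
    kind = correlation(tp_i, tp_j)
    if kind is not None and p_i in frequent and p_j in frequent:
        mirrored = {"SS": "SS", "SO": "OS", "OS": "SO"}[kind]
        ext_i = f"FP-ExtVP_{kind}_{p_i}|{p_j}"      # hypothetical table name
        ext_j = f"FP-ExtVP_{mirrored}_{p_j}|{p_i}"  # hypothetical table name
        if ext_i in sizes:
            cand_i.append(ext_i)
        if ext_j in sizes:
            cand_j.append(ext_j)
    # Ascending queue by size; the head is the final query table.
    cand_i.sort(key=sizes.__getitem__)
    cand_j.sort(key=sizes.__getitem__)
    return cand_i[0], cand_j[0]
```

When either predicate is not frequent, or the patterns share no variable, the function falls back to the plain VP tables, which matches the behaviour of step (2).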

2) Converting SPARQL into SQL: the triple patterns are mapped onto the corresponding algebra tree using relational algebra operators, and the SQL query statement is generated;

3) Optimizing the join order of the query: for a basic graph pattern BGP with n triple patterns, the generated SQL query needs n-1 join operations to compute the query result; the join order of the triple patterns is ranked by a join cost evaluation model to reduce the number of intermediate results and thereby optimize query performance; the specific steps are as follows:

(1) constructing an undirected join graph (Join Graph) based on the correlations of the triple patterns, where each triple pattern is a vertex and an edge connects any two correlated triple patterns;

(2) calculating the selectivity of each vertex in the undirected join graph and selecting the vertex with the lowest selectivity as the starting point of graph traversal; the selectivity is calculated as follows:

wherein

represents the selectivity of the triple pattern in the whole RDF graph; $|C|$ represents the number of constants in the triple pattern; $|E|$ represents the degree of the vertex in the join graph;

(3) traversing the undirected join graph depth-first; calculating the join cost of each traversal order according to the cost evaluation model, and selecting the traversal order with the minimum cost as the join order; the cost evaluation model is calculated as follows:

wherein $T_{I/O}$ represents the time required for one disk I/O operation, $Disk(tp_i)$ represents the number of disk pages required to query the triples contained in $tp_i$, $Net(tp_i)$ represents the number of network transmissions required to transfer the triples contained in $tp_i$, and $T_{net}$ represents the time required for one network transmission;

(4) writing the SQL query statement with the optimized join order into a query file;
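The join-order optimization of step 3) can be sketched as follows. The selectivity and cost formulas are not reproduced in this text, so the sketch takes the starting vertex as a parameter and uses a stand-in cost function: each pattern costs $Disk(tp) \cdot T_{I/O} + Net(tp) \cdot T_{net}$, weighted by how early it is joined. That weighting, the unit costs `T_IO` and `T_NET`, and all function names are assumptions made for illustration.

```python
from itertools import combinations

T_IO, T_NET = 1.0, 5.0  # assumed unit costs of one disk I/O and one network transfer


def variables(tp):
    """Variables in subject/object position (terms starting with '?')."""
    return {t for t in (tp[0], tp[2]) if t.startswith("?")}


def build_join_graph(patterns):
    """Undirected join graph: one vertex per pattern, an edge per shared variable."""
    graph = {i: set() for i in range(len(patterns))}
    for i, j in combinations(range(len(patterns)), 2):
        if variables(patterns[i]) & variables(patterns[j]):
            graph[i].add(j)
            graph[j].add(i)
    return graph


def dfs_orders(graph, start):
    """Yield every depth-first visiting order that begins at start."""
    def extend(order, stack):
        if len(order) == len(graph):
            yield tuple(order)
            return
        stack = list(stack)
        # Backtrack to the deepest vertex that still has an unvisited neighbour.
        while stack and not (graph[stack[-1]] - set(order)):
            stack.pop()
        if not stack:
            return  # disconnected graph: no complete traversal from start
        for nxt in sorted(graph[stack[-1]] - set(order)):
            yield from extend(order + [nxt], stack + [nxt])
    yield from extend([start], [start])


def order_cost(order, disk, net):
    """Stand-in cost: patterns joined earlier are carried through more joins."""
    n = len(order)
    return sum((disk[v] * T_IO + net[v] * T_NET) * (n - k)
               for k, v in enumerate(order))


def best_join_order(patterns, start, disk, net):
    """Pick the cheapest depth-first traversal order of the join graph."""
    graph = build_join_graph(patterns)
    return min(dfs_orders(graph, start), key=lambda o: order_cost(o, disk, net))
```

Enumerating only depth-first orders keeps every join connected to the patterns already joined, which avoids Cartesian products in the generated SQL.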

(32) Optimizing the Spark SQL logical execution plan, with the following specific steps:

1) Traversing the frequent predicate tree (P-Tree) from top to bottom, performing a bitwise AND between the hash code of the query table and the hash code of each tree node, and locating the cluster nodes where the query table resides;

2) Distributing the SPARQL query only to the nodes that hold the corresponding tables, which reduces the search space.

Further, the step (4) specifically comprises the following steps:

The generated Spark SQL logical execution plan is handed to the in-memory computing framework Spark; Spark SQL generates the corresponding physical query plan from the logical plan, evaluates the query, and returns the query result.
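To make the hand-off to Spark SQL concrete, the following sketch builds the kind of SQL string that such a plan could start from. It assumes every chosen table has two columns named `s` and `o`, that variables start with `?`, and that table names are valid SQL identifiers; the function name `bgp_to_sql` is hypothetical and the translation rules are a simplification, not the patent's exact mapping.

```python
def bgp_to_sql(patterns, tables):
    """Translate triple patterns into one SQL join over (s, o) tables.

    patterns: list of (subject, predicate, object) triple patterns.
    tables:   table chosen for each pattern, in the same (join) order.
    """
    select, where, first_seen = [], [], {}
    for idx, (s, _, o) in enumerate(patterns):
        alias = f"t{idx}"
        for term, column in ((s, "s"), (o, "o")):
            ref = f"{alias}.{column}"
            if term.startswith("?"):
                if term in first_seen:
                    # Shared variable: emit a join condition.
                    where.append(f"{first_seen[term]} = {ref}")
                else:
                    first_seen[term] = ref
                    select.append(f"{ref} AS {term[1:]}")
            else:
                # Constant term: emit a filter (no escaping in this sketch).
                where.append(f"{ref} = '{term}'")
    from_clause = ", ".join(f"{t} t{i}" for i, t in enumerate(tables))
    sql = f"SELECT {', '.join(select)} FROM {from_clause}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    return sql
```

In a PySpark session the resulting string could be passed to `spark.sql(...)`, after which Spark SQL derives the physical plan on its own.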

Further, the step (5) specifically comprises the following steps:

(51) Monitoring the query workload through the query monitor, and mining frequent patterns from the co-occurrence of predicates in SPARQL queries;

(52) Through the frequent pattern miner, weighing the average query response time against the space replication ratio, and selecting an optimal frequency threshold to screen the frequent patterns;

(53) Through the incremental repartitioning executor, constructing frequent predicate extended vertical partition tables according to the frequent patterns: the semi-join of two correlated triple patterns is computed in advance, and the extended vertical partition table corresponding to the frequent pattern is then materialized, realizing incremental repartitioning of the data;

in step (53), the frequent predicate extended vertical partition tables are constructed from the frequent patterns as follows:

1) Inputting the set of frequent patterns screened in step (52);

2) Finding the correlations among the triple patterns: for any two triple patterns $tp_i$ and $tp_j$, let $tp_i = \langle x, fp_1, y \rangle$ and $tp_j = \langle x, fp_1, z \rangle$; if two triple patterns share a variable, $tp_i$ and $tp_j$ are said to be correlated, denoted $correlations(tp_i, tp_j)$; for this $tp_i$ and $tp_j$, because the variable $x$ appears as the subject of both triple patterns, there is a subject-subject correlation (SS) between them; depending on the relative positions of the shared variable in the two triple patterns, there may also be a subject-object correlation (SO), an object-subject correlation (OS) or an object-object correlation (OO); since object-object correlations are rarely used in typical SPARQL queries, OO correlations are not searched;

3) Obtaining the positions of the join columns according to the correlation between the triple patterns, and computing the left semi-join of the two vertical partition tables (VPs) on the corresponding columns; if $correlations(tp_i, tp_j)$ is SS, SO or OS, the left semi-join view is materialized into the FP-ExtVP table for the SS, SO or OS correlation, respectively;

(54) Constructing a frequent predicate index tree (P-Tree) that indexes the vertical partition tables and the frequent predicate extended vertical partition tables on the cluster, reducing the search space during query execution.
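Steps (51) and (52) can be illustrated with a small sketch that counts how often pairs of predicates co-occur in the monitored queries. It assumes each query has already been reduced to the list of predicates in its basic graph pattern and that the frequency threshold is given; how the threshold is tuned against response time and replication ratio is not modelled here, and the function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(workload, min_support):
    """Return predicate pairs whose co-occurrence support reaches min_support.

    workload:    list of queries, each given as a list of predicate names.
    min_support: fraction of queries (0..1) a pair must appear in.
    """
    if not workload:
        return {}
    counts = Counter()
    for predicates in workload:
        # Count each unordered pair once per query.
        for pair in combinations(sorted(set(predicates)), 2):
            counts[pair] += 1
    total = len(workload)
    return {pair: n / total
            for pair, n in counts.items()
            if n / total >= min_support}
```

Raising `min_support` yields fewer frequent pairs and therefore fewer materialized FP-ExtVP tables, which is the storage side of the trade-off described in step (52).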

Beneficial effects: compared with the prior art, the distributed RDF system based on incremental repartitioning effectively reduces data loading time, storage overhead and data redundancy. It uses a hybrid storage mode based on subject hash partitioning and vertical partitioning, monitors the query workload and mines frequent patterns to guide the incremental repartitioning of RDF data, and optimizes query translation by combining a join graph (Join Graph) with a cost evaluation model, thereby realizing intelligent storage and query optimization of RDF data.

Drawings

FIG. 1 is a schematic diagram of a distributed RDF system based on incremental repartitioning;

FIG. 2 is a system architecture diagram of a query optimization method;

FIG. 3 is a diagram of a distributed query plan implemented by the query optimization method;

FIG. 4 is a structure diagram of the frequent predicate extended vertical partition data model designed by the query optimization method;

FIG. 5 is a diagram of a predicate tree structure designed by the query optimization method.

Detailed Description

The invention will be further described with reference to the following drawings and specific embodiments.

the RDF data partitioning module comprises a relational storage mode selector and a storage executor; the storage mode selector selects the required hybrid storage scheme from the constructed relational storage mode library; the storage executor stores the RDF data using the storage modes specified by the selected hybrid storage scheme; the relational storage mode library comprises hash partitioning, vertical partitioning (VP), extended vertical partitioning (ExtVP) and property tables;

the RDF data incremental repartitioning module comprises a frequent pattern miner and an incremental repartitioning executor; the frequent pattern miner receives the query workload collected by the query monitor and mines frequent patterns from the co-occurrence of predicates in SPARQL queries; the incremental repartitioning executor constructs frequent predicate extended vertical partition tables according to the frequent patterns, realizing incremental repartitioning of the RDF data;

The query optimization method of the distributed RDF system based on incremental repartitioning comprises the following steps:

(1) Cleaning the RDF data set;

(2) Storing the RDF data based on the hybrid relational schema;

(3) Designing a distributed query plan;

(4) Executing the distributed query plan;

(5) Performing incremental repartitioning on the RDF data;

the step (1) of cleaning the RDF data set comprises the following steps:

(12) Inputting the format-converted data set into a regular-expression parser, filtering out useless information such as blank lines, redundant data and ontology description information, and converting the RDF data into prefixed form;

Step (2), storing the RDF data based on the hybrid relational schema, comprises the following steps:

(22) Loading the RDF data through the storage executor, applying in turn the storage modes of the schemes selected by the storage mode selector, to realize the initial partitioning of the RDF data.

The specific process of step (22) is as follows:

1) Hash-partitioning the initial data on the subject of each RDF triple to form hash shards;

2) Completing the initial partitioning of the RDF data by applying vertical partitioning within each hash shard;

Step (3), designing a distributed query plan, comprises the following steps:

(31) Performing query translation through the query planner: the SPARQL query is parsed into the corresponding algebra tree using the Jena ARQ component, and an equivalent Spark SQL expression is generated after basic algebraic optimization is applied; that is, a basic graph pattern $BGP = \{tp_1, tp_2, \ldots, tp_n\}$ is converted into a set of equivalent sub-queries.

The specific transformation steps are as follows:

1) Query candidate table selection: for a given basic graph pattern $BGP$, select any two triple patterns $tp_i$ and $tp_j$, and match the corresponding vertical partition tables (VPs) according to the predicates of $tp_i$ and $tp_j$.

(1) First determine whether the two triple patterns are correlated: if $tp_i$ and $tp_j$ are not correlated, go to (2); if they are correlated, go to (3).

(2) Select the vertical partition tables $VP_i$ and $VP_j$ corresponding to the predicates of $tp_i$ and $tp_j$, and add them to the query candidate tables of $tp_i$ and $tp_j$ respectively.

(3) Determine whether the predicates of $tp_i$ and $tp_j$ are both frequent predicates; if so, go to (5); otherwise, go to (2).

(4) For $tp_i$, compare the statistics of its vertical partition table $VP_i$ with those of the FP-ExtVP table for the corresponding correlation, select the smaller table, and put it into the ascending queue of query candidate tables of $tp_i$; the comparison for $tp_j$ is the same as for $tp_i$.

(5) Take the head element of the ascending candidate queue of $tp_i$ and of $tp_j$ as the final query table, that is, select the smallest table as the final query table.

2) Converting SPARQL into SQL: the triple patterns are mapped onto the corresponding algebra tree using relational algebra operators, and the SQL query statement is generated.

3) Optimizing the join order of the query: for a basic graph pattern BGP with n triple patterns, the generated SQL query needs n-1 join operations to compute the query result. The join order of the triple patterns is ranked by a join cost evaluation model to reduce the number of intermediate results and thereby optimize query performance. The specific steps are as follows:

(1) Constructing an undirected join graph (Join Graph) based on the correlations of the triple patterns, where each triple pattern is a vertex and an edge connects any two correlated triple patterns.

(2) Calculating the selectivity of each vertex in the undirected join graph and selecting the vertex with the lowest selectivity as the starting point of graph traversal. The selectivity is calculated as follows:

wherein

represents the selectivity of the triple pattern in the whole RDF graph; $|C|$ represents the number of constants in the triple pattern; $|E|$ represents the degree of the vertex in the join graph.

(3) Traversing the undirected join graph depth-first, calculating the join cost of each traversal order according to the cost evaluation model, and selecting the traversal order with the minimum cost as the join order. The cost evaluation model is calculated as follows:

wherein $T_{I/O}$ represents the time required for one disk I/O operation, $Disk(tp_i)$ represents the number of disk pages required to query the triples contained in $tp_i$, $Net(tp_i)$ represents the number of network transmissions required to transfer the triples contained in $tp_i$, and $T_{net}$ represents the time required for one network transmission.

(4) Writing the SQL query statement with the optimized join order into a query file.

2) Distributing the SPARQL query only to the nodes that hold the corresponding tables, which reduces the search space;

Step (4), executing the distributed query plan, comprises the following steps:

(41) The generated Spark SQL logical execution plan is handed to the in-memory computing framework Spark; Spark SQL generates the corresponding physical query plan from the logical plan, evaluates the query, and returns the query result;

Step (5), incrementally repartitioning the RDF data, comprises the following steps:

(53) Through the incremental repartitioning executor, constructing frequent predicate extended vertical partition tables according to the frequent patterns: the semi-join of two correlated triple patterns is computed in advance, and the extended vertical partition table corresponding to the frequent pattern is then materialized, realizing incremental repartitioning of the data;

1) Inputting the set of frequent patterns screened in step (52);

2) Finding the correlations among the triple patterns. For any two triple patterns $tp_i$ and $tp_j$, let $tp_i = \langle x, fp_1, y \rangle$ and $tp_j = \langle x, fp_1, z \rangle$; if two triple patterns share a variable, $tp_i$ and $tp_j$ are said to be correlated, denoted $correlations(tp_i, tp_j)$. For this $tp_i$ and $tp_j$, because the variable $x$ appears as the subject of both triple patterns, there is a subject-subject correlation (SS) between them. Depending on the relative positions of the shared variable in the two triple patterns, there may also be a subject-object correlation (SO), an object-subject correlation (OS) or an object-object correlation (OO). Since object-object correlations are rarely used in typical SPARQL queries, OO correlations are not searched.

3) Obtaining the positions of the join columns according to the correlation between the triple patterns, and computing the left semi-join of the two vertical partition tables (VPs) on the corresponding columns. If $correlations(tp_i, tp_j)$ is SS or OS, the left semi-join view is materialized into the FP-ExtVP table for the SS or OS correlation, respectively.
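The left semi-join that produces an extended table can be sketched as follows. It assumes each vertical partition table is a list of (subject, object) rows held in memory; the function name and the two-letter `kind` argument are illustrative, and a Spark-based implementation would more likely express this as a `LEFT SEMI JOIN` in Spark SQL.

```python
def left_semi_join(vp_left, vp_right, kind):
    """Keep the rows of vp_left that have a join partner in vp_right.

    vp_left, vp_right: lists of (subject, object) rows.
    kind: 'SS', 'SO' or 'OS'; the first letter names the join column of
          vp_left, the second the join column of vp_right.
    """
    left_col = 0 if kind[0] == "S" else 1
    right_col = 0 if kind[1] == "S" else 1
    keys = {row[right_col] for row in vp_right}
    return [row for row in vp_left if row[left_col] in keys]
```

The result never has more rows than `vp_left`, which is why a query that joins the two predicates can read the extended table instead of the full vertical partition table.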

Examples

As shown in FIG. 1, the invention discloses a distributed RDF system based on incremental repartitioning and a query optimization method thereof; the techniques and architecture provided can be applied to the storage of distributed RDF data and to SPARQL querying. The overall technical architecture is shown in FIG. 2. This embodiment takes the storage and querying of the synthetic data set WatDiv as an example, with the following specific steps:

Step one: cleaning the RDF data set and completing the data format conversion, comprising the following steps:

(11) Inputting the generated WatDiv data set file into an RDF data format converter, and converting the RDF data into N-Triples format;

(12) Inputting the converted data file into the constructed regular-expression parser, parsing the RDF data, filtering out useless information such as empty lines and duplicate data, and converting the RDF data into prefixed form;

Step two: storing the RDF data based on the hybrid relational schema, comprising the following steps:

(21) Selecting a data partitioning scheme based on subject hash partitioning and vertical partitioning;

(22) Hash-partitioning the initial data on the subject of each RDF triple to form hash shards;

(23) Completing the initial partitioning of the RDF data by applying vertical partitioning within each hash shard, which yields the three vertical partition tables $VP_{offers}$, $VP_{includes}$ and $VP_{subscribes}$ shown in the Vertical Partitioning area of FIG. 4;

Step three: designing a distributed query plan through the query planner, comprising the following steps:

(31) Parsing the SPARQL basic query subgraph into the corresponding algebra tree using the Jena ARQ component, and generating an SQL expression after basic algebraic optimization is applied;

(32) According to the SPARQL basic graph pattern BGP, each triple pattern is taken as a vertex, and an undirected join graph is constructed based on the correlations of the triple patterns. There is an object-object correlation (OO) between triple patterns $TP_1$ and $TP_4$, so an edge $e_1$ is set between the two vertices. Edges $e_2$, $e_3$ and $e_4$ are set in the same way, finally forming the join graph structure shown in FIG. 3;

(33) Calculating the selectivity of the vertices in the undirected join graph and selecting the vertex $TP_3$, which has the lowest selectivity, as the starting point of graph traversal;

(34) Traversing the undirected join graph depth-first to obtain all traversal orders;

(35) Inputting all traversal orders into the cost evaluation model, calculating the join cost of each, and selecting the traversal order with the minimum cost, namely $TP_3$, $TP_4$, $TP_2$, $TP_1$, as the join order;

(36) Generating an equivalent Spark SQL expression;

(37) Traversing the frequent predicate tree (P-Tree) from top to bottom, performing a bitwise AND between the hash code of the query table and the hash code of each tree node, and locating the cluster nodes where the query table resides;

(38) Generating the Spark SQL logical execution plan, and distributing the SPARQL query to the nodes that hold the corresponding tables, which reduces the search space.

Step four: executing the distributed query plan by the query executor;

(41) Spark SQL generates a corresponding physical query plan according to the generated logic execution plan, performs query evaluation calculation, and returns a corresponding query result;

Step five: mining frequent patterns in the query workload to guide the incremental repartitioning of the data, comprising the following steps:

(51) Mining candidate frequent patterns by monitoring the query workload and using the co-occurrence of predicates in SPARQL queries;

(52) Weighing the average query response time against the space replication ratio, and selecting an optimal frequency threshold to screen the frequent patterns;

(53) As shown in FIG. 4, frequent predicate extended vertical partition tables (FP-ExtVP) are constructed according to the frequent patterns, and the semi-join of two correlated triple patterns is computed in advance. Because the triple patterns corresponding to $VP_{offers}$ and $VP_{includes}$ are both frequent patterns and an object-subject correlation exists between the two triple patterns, the left semi-join view is materialized into the corresponding FP-ExtVP table. A subject-subject correlation also exists between $VP_{offers}$ and $VP_{includes}$, and the left semi-join view is likewise materialized into the corresponding FP-ExtVP table. The other frequent predicate extended vertical partition tables are generated in the same way;

(54) As shown in FIG. 5, the frequent predicate index tree (P-Tree) is constructed from the bottom up. The data element of each leaf node stores the hash code of a vertical partition table (VP) or a frequent predicate extended vertical partition table (FP-ExtVP) on the corresponding cluster node,

is encoded by

And

the hash codes of the leaf nodes being obtained by a bitwise AND operation; the hash code of a non-leaf node is the bitwise OR of the hash codes of all its child nodes, and its pointer element stores pointers to all of its child nodes; each parent of leaf nodes corresponds to one cluster node.
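The P-Tree lookup can be sketched as a small signature tree. The sketch assumes a three-level tree (root, one node per cluster node, leaves per table), a one-bit signature derived from `zlib.crc32`, and hypothetical names (`PNode`, `build_ptree`, `locate`); it models only the bitwise OR for non-leaf codes and the bitwise AND used during lookup, not how the leaf codes themselves are derived.

```python
import zlib


def table_code(name, bits=64):
    """Signature with a single bit set; the hash function is an assumption."""
    return 1 << (zlib.crc32(name.encode("utf-8")) % bits)


class PNode:
    def __init__(self, code, children=(), cluster_node=None):
        self.code = code                  # hash code (bit signature)
        self.children = list(children)    # pointers to child nodes
        self.cluster_node = cluster_node  # set on the parents of leaves


def build_ptree(tables_by_cluster_node):
    """Bottom-up build: leaves, then one parent per cluster node, then the root."""
    parents = []
    for cluster_node, tables in tables_by_cluster_node.items():
        leaves = [PNode(table_code(t)) for t in tables]
        code = 0
        for leaf in leaves:
            code |= leaf.code             # non-leaf code = OR of child codes
        parents.append(PNode(code, leaves, cluster_node))
    root_code = 0
    for parent in parents:
        root_code |= parent.code
    return PNode(root_code, parents)


def locate(root, table):
    """Top-down lookup: follow only subtrees whose code ANDs non-zero with the query."""
    query = table_code(table)
    if (root.code & query) == 0:
        return []
    return [p.cluster_node for p in root.children if (p.code & query) != 0]
```

Because different table names can hash to the same bit, `locate` may return extra cluster nodes, but it never misses a node that actually holds the table.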

Claims

1. A distributed RDF system based on incremental repartitioning, characterized in that the system comprises an RDF data partitioning module, an RDF data incremental repartitioning module and a distributed query module, wherein:

the RDF data partitioning module comprises a relational storage mode selector and a storage executor; the storage mode selector selects the required hybrid storage scheme from the constructed relational storage mode library; the storage executor stores the RDF data using the storage modes specified by the selected hybrid storage scheme;

2. The distributed RDF system based on incremental repartitioning of claim 1, wherein: the relational storage mode library comprises hash partitioning, vertical partitioning, extended vertical partitioning and property tables.

3. The query optimization method for the incremental repartitioning-based distributed RDF system according to claim 1 or 2, wherein: the method comprises the following steps:

(1) Cleaning the RDF data set;

(2) Storing the RDF data based on the hybrid relational schema;

(3) Designing a distributed query plan;

(4) Executing the distributed query plan;

(5) Incrementally repartitioning the RDF data.

4. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (1) specifically comprises the following steps:

5. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (2) specifically comprises the following steps:

(22) Loading the RDF data through the storage executor, applying in turn the storage modes of the schemes selected by the storage mode selector, to realize the initial partitioning of the RDF data;

wherein the specific process of step (22) is as follows:

2) Completing the initial partitioning of the RDF data by applying vertical partitioning within each hash shard.

6. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (3) specifically comprises the following steps:

(31) Performing query translation through the query planner: the SPARQL query is parsed into the corresponding algebra tree using the Jena ARQ component, and an equivalent Spark SQL expression is generated after algebraic optimization; that is, a basic graph pattern $BGP = \{tp_1, tp_2, \ldots, tp_n\}$ is converted into a set of equivalent sub-queries.

The specific transformation steps are as follows:

1) Query candidate table selection: for a given basic graph pattern $BGP$, select any two triple patterns $tp_i$ and $tp_j$, and match the corresponding vertical partition tables according to the predicates of $tp_i$ and $tp_j$;

(1) first determine whether the two triple patterns are correlated: if $tp_i$ and $tp_j$ are not correlated, go to (2); if they are correlated, go to (3);

(3) determine whether the predicates of $tp_i$ and $tp_j$ are both frequent predicates; if so, go to (5); otherwise, go to (2);

(4) for $tp_i$, compare the statistics of its vertical partition table $VP_i$ with those of the FP-ExtVP table for the corresponding correlation, select the smaller table, and put it into the ascending queue of query candidate tables of $tp_i$; the comparison for $tp_j$ is the same as for $tp_i$;

(1) constructing an undirected join graph based on the correlations of the triple patterns, where an edge connects any two correlated triple patterns;

wherein

wherein $T_{I/O}$ represents the time required for one disk I/O operation, $Disk(tp_i)$ represents the number of disk pages required to query the triples contained in $tp_i$, $Net(tp_i)$ represents the number of network transmissions required to transfer the triples contained in $tp_i$, and $T_{net}$ represents the time required for one network transmission;

1) Traversing the frequent predicate tree from top to bottom, performing a bitwise AND between the hash code of the query table and the hash code of each tree node, and locating the cluster nodes where the query table resides;

7. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (4) specifically comprises the following steps:

8. The query optimization method for the incremental repartitioning-based distributed RDF system according to claim 3, wherein: the step (5) specifically comprises the following steps:

1) Inputting the set of frequent patterns screened in step (52);

2) Finding the correlations among the triple patterns: for any two triple patterns $tp_i$ and $tp_j$, let $tp_i = \langle x, fp_1, y \rangle$ and $tp_j = \langle x, fp_1, z \rangle$; if two triple patterns share a variable, $tp_i$ and $tp_j$ are said to be correlated, denoted $correlations(tp_i, tp_j)$; for this $tp_i$ and $tp_j$, because the variable $x$ appears as the subject of both triple patterns, there is a subject-subject correlation between them; depending on the relative positions of the shared variable in the two triple patterns, there may also be a subject-object correlation, an object-subject correlation or an object-object correlation;

3) Obtaining the positions of the join columns according to the correlation between the triple patterns, and computing the left semi-join of the two vertical partition tables (VPs) on the corresponding columns; if $correlations(tp_i, tp_j)$ is SS or SO, the left semi-join view is materialized into the FP-ExtVP table for the SS or SO correlation, respectively;

(54) Constructing a frequent predicate index tree that indexes the vertical partition tables and the frequent predicate extended vertical partition tables on the cluster, reducing the search space during query execution.