CN112487015B - Distributed RDF system based on incremental repartitioning and query optimization method thereof - Google Patents

Info

Publication number
CN112487015B
Authority
CN
China
Prior art keywords
query
frequent
rdf
repartitioning
incremental
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011371750.1A
Other languages
Chinese (zh)
Other versions
CN112487015A (en)
Inventor
冯钧
王秉发
陆佳民
杨程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hohai University HHU
Original Assignee
Hohai University HHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hohai University HHU
Priority to CN202011371750.1A
Publication of CN112487015A
Application granted
Publication of CN112487015B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a distributed RDF system based on incremental repartitioning and a query optimization method thereof, belonging to the technical field of knowledge graph data storage. The invention provides a storage framework that stores RDF data in a hybrid relational schema, which reduces data preprocessing time and system storage cost. By adopting a hybrid storage schema that combines hash partitioning and vertical partitioning, various types of query patterns are optimized and distributed SPARQL query performance is significantly improved. An incremental repartitioning model based on frequent patterns is designed, which adapts dynamically to changes in the query workload.

Description

Distributed RDF system based on incremental repartitioning and query optimization method thereof
Technical Field
The invention belongs to the technical field of knowledge graph data storage, and particularly relates to a distributed RDF system based on incremental repartitioning and a query optimization method thereof.
Background
RDF, as a data model for publishing, sharing, and linking data on the Web, is widely used in many applications. As RDF data sets grow, storing RDF data and processing SPARQL queries exceed the capacity of a single machine, so a high-performance distributed RDF data management system is needed to manage and reuse large-scale RDF data.
Existing distributed RDF data management systems process large-scale RDF data efficiently on shared-nothing clusters. According to the execution model of the distributed RDF system, existing systems can be classified into Hadoop-based systems and in-memory (RAM-based) systems. To improve the performance and flexibility of distributed SPARQL query evaluation, research has focused on optimizing the storage schema of RDF data and the query translation from SPARQL to SQL.
However, existing distributed RDF data management systems still have the following problems:
(1) They are optimized only for specific query types, so query efficiency is low for other types;
(2) They incur expensive preprocessing overhead and long data loading times;
(3) Their data redundancy is high;
(4) They cannot adapt dynamically to changes in the workload.
Disclosure of Invention
Purpose of the invention: one object of the invention is to provide a distributed RDF system based on incremental repartitioning; another object is to provide a query optimization method for the distributed RDF system based on incremental repartitioning.
Technical solution: to achieve the above objects, the invention provides the following technical solution:
The distributed RDF system based on incremental repartitioning comprises an RDF data partitioning module, an RDF data incremental repartitioning module, and a distributed query module, wherein:
the RDF data partitioning module comprises a relational storage schema selector and a storage executor; the storage schema selector selects the required hybrid storage scheme from the constructed relational storage schema library; the storage executor stores the RDF data with the corresponding storage schemas according to the selected hybrid storage scheme;
the RDF data incremental repartitioning module comprises a frequent pattern miner and an incremental repartitioning executor; the frequent pattern miner receives the query workload observed by the query monitor and mines frequent patterns from the co-occurrence of predicates in SPARQL queries; the incremental repartitioning executor constructs the frequent-predicate extended vertical partition (FP-ExtVP) tables from the frequent patterns to realize the incremental repartitioning of the RDF data;
the distributed query module comprises a query monitor, a query planner, and a query executor; the query monitor monitors the query workload and periodically passes it to the frequent pattern miner; the query planner designs the distributed query plan, generating a logical query plan after applying algebraic optimization and join order optimization to the SPARQL query; the query executor generates the corresponding physical query plan through Spark SQL and performs the query computation.
Further, the relational storage schema library comprises hash partitioning, vertical partitioning (VP), extended vertical partitioning (ExtVP), and property tables.
The query optimization method of the distributed RDF system based on the incremental repartitioning comprises the following steps:
(1) Cleaning the RDF data set;
(2) Storing the RDF data based on a hybrid relational schema;
(3) Designing a distributed query plan;
(4) Executing the distributed query plan;
(5) The RDF data is incrementally repartitioned.
Further, step (1) specifically comprises the following steps:
(11) Converting the format of the data sets that require conversion, transforming them into N-Triples format in batches;
(12) Inputting the format-converted data set into a regular-expression parser, filtering useless information such as blank lines, redundant data, and ontology description information, and converting the RDF data into prefix format.
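The cleaning step can be illustrated with a minimal Python sketch. It is not the patent's parser: the prefix map, the set of predicates treated as ontology description, and the function names `shorten` and `clean` are all assumptions made for the example.

```python
import re

# Hypothetical prefix map; a real data set would supply its own namespaces.
PREFIXES = {
    "http://example.org/wsdbm/": "wsdbm:",
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf:",
}
# Predicates treated as ontology description (an assumption for this sketch).
ONTOLOGY_PREDICATES = {"http://www.w3.org/2000/01/rdf-schema#subClassOf"}

# One N-Triples statement: subject, predicate, object, terminating dot.
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')


def shorten(term):
    """Replace a known namespace in <IRI> by its prefix; leave other terms as-is."""
    if term.startswith("<") and term.endswith(">"):
        iri = term[1:-1]
        for namespace, prefix in PREFIXES.items():
            if iri.startswith(namespace):
                return prefix + iri[len(namespace):]
    return term


def clean(lines):
    """Drop blank/malformed lines, ontology triples, and duplicates; emit prefixed triples."""
    seen = set()
    result = []
    for line in lines:
        match = TRIPLE_RE.match(line)
        if match is None:
            continue  # blank line, comment, or malformed statement
        s, p, o = match.groups()
        if p[1:-1] in ONTOLOGY_PREDICATES:
            continue  # ontology description, not instance data
        triple = (shorten(s), shorten(p), shorten(o))
        if triple not in seen:
            seen.add(triple)
            result.append(triple)
    return result
```

Calling `clean` on the lines of an N-Triples file returns de-duplicated `(subject, predicate, object)` tuples in prefixed form, in input order.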
Further, step (2) specifically comprises the following steps:
(21) Selecting, through the storage schema selector, a hybrid storage schema based on hash partitioning and vertical partitioning;
(22) Loading the RDF data, through the storage executor, into the storage schemes chosen by the storage schema selector, one storage schema after another, to realize the initial partitioning of the RDF data;
wherein the specific process of step (22) is as follows:
1) Hash-partitioning the initial data on the subject of the RDF triples to form hash shards;
2) Completing the initial partitioning of the RDF data on each hash shard by vertical partitioning.
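The two-level initial partitioning described in step (22) can be sketched in a few lines of Python. This is an in-memory illustration only; the function names and the use of CRC32 as the hash are assumptions, and a real deployment would let the cluster framework place the shards.

```python
import zlib
from collections import defaultdict


def hash_partition(triples, num_shards):
    """Assign each triple to a shard by hashing its subject (CRC32 is an assumed choice)."""
    shards = [[] for _ in range(num_shards)]
    for s, p, o in triples:
        shard_id = zlib.crc32(s.encode("utf-8")) % num_shards
        shards[shard_id].append((s, p, o))
    return shards


def vertical_partition(shard):
    """Split one shard into per-predicate two-column tables: predicate -> [(subject, object)]."""
    tables = defaultdict(list)
    for s, p, o in shard:
        tables[p].append((s, o))
    return dict(tables)


def initial_partition(triples, num_shards):
    """Hash-partition on subject, then vertically partition every shard."""
    return [vertical_partition(shard) for shard in hash_partition(triples, num_shards)]
```

Because the hash is taken over the subject, all triples of one subject land in the same shard, so subject-subject joins can be evaluated locally on each shard.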
Further, step (3) specifically comprises the following steps:
(31) Performing query conversion through the query planner: the SPARQL query is parsed into the corresponding algebra tree using the Jena ARQ component, and an equivalent Spark SQL expression is generated after basic algebraic optimization is applied; that is, a basic graph pattern BGP = {tp_1, tp_2, …, tp_n} is converted into a set of equivalent sub-queries
[Formula image GDA0002885240510000034]
The specific conversion steps are as follows:
1) Selecting the query candidate tables: for a given basic graph pattern bgp, select any two triple patterns tp_i and
[Formula image GDA0002885240510000031]
(referred to below as tp_j), and match the corresponding vertical partition tables (VP) according to the predicates of tp_i and tp_j, respectively;
(1) first, determine whether the two triple patterns are correlated; if tp_i and tp_j are not correlated, go to (2); if they are correlated, go to (3);
(2) select the vertical partition tables VP_i and VP_j corresponding to the predicates of tp_i and tp_j, and add them to the query candidate tables of tp_i and tp_j, respectively;
(3) determine whether the predicates of tp_i and tp_j are both frequent predicates; if so, go to (5); otherwise, go to (2);
(4) for tp_i, first compare the statistics of the corresponding vertical partition table VP_i and of the corresponding FP-ExtVP table, select the smaller table, and put it into the ascending queue of query candidate tables of tp_i; the comparison procedure for tp_j is the same as for tp_i;
(5) take the head elements of the ascending queues of query candidate tables of tp_i and tp_j as the final query tables, that is, select the smallest table as the final query table;
2) Converting SPARQL to SQL: map the triple patterns to the corresponding algebra tree using relational algebra operators, and generate the SQL query statement;
3) Optimizing the join order of the query: for a basic graph pattern bgp with n triple patterns, the generated SQL query requires n-1 join operations to compute the query result; the join order of the triple patterns is ranked with a join cost evaluation model so as to reduce the number of intermediate results and thereby optimize query performance; the specific steps are as follows:
(1) construct an undirected join graph (Join Graph) based on the correlations between triple patterns: if any two triple patterns are correlated, an edge is placed between the two corresponding vertices;
(2) compute the selectivity of each vertex in the undirected join graph, and select the vertex with the lowest selectivity as the starting point of the graph traversal; the selectivity is computed as:
[Formula image GDA0002885240510000032]
where
[Formula image GDA0002885240510000033]
denotes the selectivity of the triple pattern in the whole RDF graph, |C| denotes the number of constants in the triple pattern, and |E| denotes the degree of the vertex in the join graph;
(3) traverse the undirected join graph depth-first; compute the join cost with the cost evaluation model, and select the traversal sequence with the minimum cost as the join order; the cost evaluation model is computed as:
[Formula image GDA0002885240510000041]
where T_I/O denotes the time required for one disk I/O operation, Disk(tp_i) denotes the number of disk pages required to query the triples contained in tp_i, Net(tp_i) denotes the number of network transmissions required to transfer the triples contained in tp_i, and T_net denotes the time required for one network transmission;
(4) write the SQL query statement with the optimized join order into the query file;
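The join-order procedure of step 3) can be sketched as follows. The selectivity and cost formulas survive only as images in this text, so the sketch takes per-pattern selectivity as an input and uses a stand-in cost function (disk pages read plus intermediate rows shipped, with a deliberately crude intermediate-size estimate). All function names, the `stats` layout, and the cost form are assumptions, not the patent's model.

```python
def is_var(term):
    return term.startswith("?")


def build_join_graph(patterns):
    """Vertices are pattern indexes; an edge links two patterns that share a variable."""
    graph = {i: set() for i in range(len(patterns))}
    for i, (s1, _, o1) in enumerate(patterns):
        vars_i = {t for t in (s1, o1) if is_var(t)}
        for j in range(i + 1, len(patterns)):
            s2, _, o2 = patterns[j]
            if vars_i & {t for t in (s2, o2) if is_var(t)}:
                graph[i].add(j)
                graph[j].add(i)
    return graph


def dfs_orders(graph, start):
    """Yield every vertex order that a depth-first traversal from start can produce."""
    def extend(order, stack):
        if len(order) == len(graph):
            yield order
            return
        # Backtrack to the deepest vertex that still has an unvisited neighbour.
        while stack and all(n in order for n in graph[stack[-1]]):
            stack = stack[:-1]
        if not stack:
            return  # some vertex is unreachable from start
        for n in sorted(graph[stack[-1]]):
            if n not in order:
                yield from extend(order + [n], stack + [n])
    yield from extend([start], [start])


def estimate_cost(order, stats, t_io, t_net):
    """Stand-in cost (assumed form): pages read * t_io + intermediate rows shipped * t_net."""
    cost, intermediate = 0.0, 0
    for k, tp in enumerate(order):
        pages, rows = stats[tp]
        cost += pages * t_io
        if k > 0:
            cost += intermediate * t_net  # ship the current intermediate result
        intermediate = max(intermediate, rows)  # crude size estimate, assumption
    return cost


def best_join_order(patterns, selectivity, stats, t_io=1.0, t_net=0.01):
    """Start at the vertex with the lowest selectivity value; pick the cheapest DFS order."""
    graph = build_join_graph(patterns)
    start = min(graph, key=lambda v: selectivity[v])
    return min(dfs_orders(graph, start),
               key=lambda order: estimate_cost(order, stats, t_io, t_net))
```

Here `stats[i]` is a hypothetical `(disk_pages, rows)` pair for pattern `i`. Swapping in the real cost model only requires replacing `estimate_cost`; the graph construction and traversal stay the same.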
(32) Optimizing the Spark SQL logical execution plan, with the following specific steps:
1) Traversing the frequent predicate tree (P-Tree) from top to bottom, performing an AND operation on the hash code of the query table and the tree nodes, and locating the cluster nodes where the query table resides;
2) Distributing the SPARQL query only to the nodes that contain the corresponding tables for query processing, which reduces the search space.
Further, step (4) specifically comprises the following step:
the generated Spark SQL logical execution plan is handed to the in-memory computing framework Spark; Spark SQL generates the corresponding physical query plan from the logical plan, performs the query evaluation, and returns the query result.
Further, step (5) specifically comprises the following steps:
(51) Monitoring the query workload through the query monitor, and mining frequent patterns from the co-occurrence of predicates in SPARQL queries;
(52) Weighing the average query response time against the space replication ratio through the frequent pattern miner, and selecting the optimal frequency threshold to filter the frequent patterns;
(53) Constructing the frequent-predicate extended vertical partition (FP-ExtVP) tables from the frequent patterns through the incremental repartitioning executor: the semi-join of two correlated triple patterns is computed in advance, and the extended vertical partition table corresponding to the frequent pattern is then materialized, which realizes the incremental repartitioning of the data;
in step (53), the FP-ExtVP tables are constructed from the frequent patterns as follows:
1) Input the set of frequent patterns filtered in (52);
2) Find the correlations between triple patterns; for any triple patterns tp_i and
[Formula image GDA0002885240510000042]
let tp_i = <x, fp_1, y> and tp_j = <x, fp_1, z>; if the two triple patterns share a variable, tp_i and tp_j are said to be correlated, denoted correlations(tp_i, tp_j); for tp_i and tp_j above, the variable x appears as the subject of both triple patterns, so a subject-subject correlation (SS) exists between them; depending on the relative positions of the shared variable, a subject-object correlation (SO), an object-subject correlation (OS), or an object-object correlation (OO) may also exist between two different triple patterns; because object-object correlations are rarely used in typical SPARQL queries, OO correlations are not searched;
3) Obtain the positions of the join columns from the correlation between the triple patterns, and compute the left semi-join of the two vertical partition tables (VP) on the corresponding columns; if correlations(tp_i, tp_j) = SS, the left semi-join view is materialized into the
[Formula image GDA0002885240510000051]
table; if correlations(tp_i, tp_j) = SO, the left semi-join view is materialized into the
[Formula image GDA0002885240510000052]
table; if correlations(tp_i, tp_j) = OS, the left semi-join view is materialized into the
[Formula image GDA0002885240510000053]
table;
(54) Indexing the vertical partition tables and the FP-ExtVP tables on the cluster with the frequent predicate index tree (P-Tree), which reduces the search space during query execution.
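The semi-join construction in step (53) can be illustrated with a small in-memory Python sketch. Vertical partition tables are modelled as lists of `(subject, object)` rows; the dictionary keys used to name the resulting tables, and the function names, are assumptions, since the patent's table names appear only as images.

```python
# Column positions in a VP table: 0 = subject, 1 = object.
# For each correlation type: (join column of the left table, join column of the right table).
JOIN_COLUMNS = {"SS": (0, 0), "SO": (0, 1), "OS": (1, 0)}


def left_semi_join(left, right, left_col, right_col):
    """Keep the rows of `left` whose join value also occurs in `right`."""
    keys = {row[right_col] for row in right}
    return [row for row in left if row[left_col] in keys]


def build_fp_extvp(vp, frequent_patterns):
    """Materialize one reduced table per frequent pattern (p1, p2, correlation)."""
    ext = {}
    for p1, p2, corr in frequent_patterns:
        left_col, right_col = JOIN_COLUMNS[corr]
        # Hypothetical naming scheme: (correlation, left predicate, right predicate).
        ext[(corr, p1, p2)] = left_semi_join(vp[p1], vp[p2], left_col, right_col)
    return ext


vp = {
    "offers": [("retailer1", "offer1"), ("retailer2", "offer2")],
    "includes": [("offer1", "product1"), ("offer3", "product2")],
}
ext = build_fp_extvp(vp, [("offers", "includes", "OS")])
```

In this toy data, only `offer1` appears both as an object of `offers` and as a subject of `includes`, so the materialized OS table holds the single row `("retailer1", "offer1")`. A query that joins the two predicates can then scan this smaller table instead of the full `offers` table.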
Beneficial effects: compared with the prior art, the distributed RDF system based on incremental repartitioning effectively reduces data loading time, storage overhead, and data redundancy. It uses a hybrid storage schema based on subject hash partitioning and vertical partitioning, monitors the query workload and mines frequent patterns to guide the incremental repartitioning of RDF data, and optimizes query translation by combining the query join graph (Join Graph) with a cost evaluation model, thereby achieving intelligent storage and query optimization of RDF data.
Drawings
FIG. 1 is a schematic diagram of a distributed RDF system based on incremental repartitioning;
FIG. 2 is a system architecture diagram of a query optimization method;
FIG. 3 is a diagram of a distributed query plan implemented by the query optimization method;
FIG. 4 is a diagram of the frequent-predicate extended vertical partition (FP-ExtVP) data model designed by the query optimization method;
FIG. 5 is a diagram of a predicate tree structure designed by the query optimization method.
Detailed Description
The invention will be further described with reference to the following drawings and specific embodiments.
The distributed RDF system based on the incremental repartitioning comprises an RDF data partitioning module, an RDF data incremental repartitioning module and a distributed query module, wherein:
the RDF data partitioning module comprises a relational storage schema selector and a storage executor; the storage schema selector selects the required hybrid storage scheme from the constructed relational storage schema library; the storage executor stores the RDF data with the corresponding storage schemas according to the selected hybrid storage scheme; the relational storage schema library comprises hash partitioning, vertical partitioning (VP), extended vertical partitioning (ExtVP), and property tables;
the RDF data incremental repartitioning module comprises a frequent pattern miner and an incremental repartitioning executor; the frequent pattern miner receives the query workload observed by the query monitor and mines frequent patterns from the co-occurrence of predicates in SPARQL queries; the incremental repartitioning executor constructs the frequent-predicate extended vertical partition (FP-ExtVP) tables from the frequent patterns to realize the incremental repartitioning of the RDF data;
the distributed query module comprises a query monitor, a query planner, and a query executor; the query monitor monitors the query workload and periodically passes it to the frequent pattern miner; the query planner designs the distributed query plan, generating a logical query plan after applying algebraic optimization and join order optimization to the SPARQL query; the query executor generates the corresponding physical query plan through Spark SQL and performs the query computation.
The query optimization method of the distributed RDF system based on incremental repartitioning comprises the following steps:
(1) Cleaning the RDF data set;
(2) Storing the RDF data based on a hybrid relational schema;
(3) Designing a distributed query plan;
(4) Executing the distributed query plan;
(5) Incrementally repartitioning the RDF data;
Step (1), cleaning the RDF data set, comprises the following steps:
(11) Converting the format of the data sets that require conversion, transforming them into N-Triples format in batches;
(12) Inputting the format-converted data set into a regular-expression parser, filtering useless information such as blank lines, redundant data, and ontology description information, and converting the RDF data into prefix format;
Step (2), storing the RDF data based on the hybrid relational schema, comprises the following steps:
(21) Selecting, through the storage schema selector, a hybrid storage schema based on hash partitioning and vertical partitioning;
(22) Loading the RDF data, through the storage executor, into the storage schemes chosen by the storage schema selector, one storage schema after another, to realize the initial partitioning of the RDF data.
The specific process of step (22) is as follows:
1) Hash-partitioning the initial data on the subject of the RDF triples to form hash shards;
2) Completing the initial partitioning of the RDF data on each hash shard by vertical partitioning;
Step (3), designing the distributed query plan, comprises the following steps:
(31) Performing query conversion through the query planner: the SPARQL query is parsed into the corresponding algebra tree using the Jena ARQ component, and an equivalent Spark SQL expression is generated after basic algebraic optimization is applied; that is, a basic graph pattern BGP = {tp_1, tp_2, …, tp_n} is converted into a set of equivalent sub-queries
[Formula image GDA0002885240510000061]
The specific conversion steps are as follows:
1) Selecting the query candidate tables: for a given basic graph pattern bgp, select any two triple patterns tp_i and
[Formula image GDA0002885240510000071]
(referred to below as tp_j), and match the corresponding vertical partition tables (VP) according to the predicates of tp_i and tp_j, respectively.
(1) First, determine whether the two triple patterns are correlated; if tp_i and tp_j are not correlated, go to (2); if they are correlated, go to (3).
(2) Select the vertical partition tables VP_i and VP_j corresponding to the predicates of tp_i and tp_j, and add them to the query candidate tables of tp_i and tp_j, respectively.
(3) Determine whether the predicates of tp_i and tp_j are both frequent predicates; if so, go to (5); otherwise, go to (2).
(4) For tp_i, first compare the statistics of the corresponding vertical partition table VP_i and of the corresponding FP-ExtVP table, select the smaller table, and put it into the ascending queue of query candidate tables of tp_i; the comparison procedure for tp_j is the same as for tp_i.
(5) Take the head elements of the ascending queues of query candidate tables of tp_i and tp_j as the final query tables, that is, select the smallest table as the final query table.
2) Converting SPARQL to SQL: map the triple patterns to the corresponding algebra tree using relational algebra operators, and generate the SQL query statement.
3) Optimizing the join order of the query: for a basic graph pattern bgp with n triple patterns, the generated SQL query requires n-1 join operations to compute the query result. The join order of the triple patterns is ranked with a join cost evaluation model so as to reduce the number of intermediate results and thereby optimize query performance. The specific steps are as follows:
(1) Construct an undirected join graph (Join Graph) based on the correlations between triple patterns: if any two triple patterns are correlated, an edge is placed between the two corresponding vertices.
(2) Compute the selectivity of each vertex in the undirected join graph, and select the vertex with the lowest selectivity as the starting point of the graph traversal. The selectivity is computed as:
[Formula image GDA0002885240510000072]
where
[Formula image GDA0002885240510000073]
denotes the selectivity of the triple pattern in the whole RDF graph, |C| denotes the number of constants in the triple pattern, and |E| denotes the degree of the vertex in the join graph.
(3) Traverse the undirected join graph depth-first. Compute the join cost with the cost evaluation model, and select the traversal sequence with the minimum cost as the join order. The cost evaluation model is computed as:
[Formula image GDA0002885240510000074]
where T_I/O denotes the time required for one disk I/O operation, Disk(tp_i) denotes the number of disk pages required to query the triples contained in tp_i, Net(tp_i) denotes the number of network transmissions required to transfer the triples contained in tp_i, and T_net denotes the time required for one network transmission.
(4) Write the SQL query statement with the optimized join order into the query file.
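The candidate-table selection of step 1) in (31) can be sketched in Python. The step numbering in the text leaves the branch for two frequent predicates ambiguous; this sketch reads it as leading to the size comparison between the VP table and the FP-ExtVP table. The table keys, the `table_rows` statistics map, and the function names are assumptions made for the example.

```python
def is_var(term):
    return term.startswith("?")


def correlation(tp_i, tp_j):
    """Return 'SS', 'SO' or 'OS' for a shared variable, else None (OO is not searched)."""
    s_i, _, o_i = tp_i
    s_j, _, o_j = tp_j
    if is_var(s_i) and s_i == s_j:
        return "SS"
    if is_var(s_i) and s_i == o_j:
        return "SO"
    if is_var(o_i) and o_i == s_j:
        return "OS"
    return None


# Correlation type as seen from the other triple pattern.
REVERSED = {"SS": "SS", "SO": "OS", "OS": "SO"}


def select_tables(tp_i, tp_j, frequent_predicates, table_rows):
    """Pick, for each triple pattern, the smallest available candidate table."""
    p_i, p_j = tp_i[1], tp_j[1]
    candidates_i = [("VP", p_i)]  # the plain vertical partition table is always a candidate
    candidates_j = [("VP", p_j)]
    corr = correlation(tp_i, tp_j)
    if corr and p_i in frequent_predicates and p_j in frequent_predicates:
        candidates_i.append((corr, p_i, p_j))
        candidates_j.append((REVERSED[corr], p_j, p_i))

    def smallest(candidates):
        # Equivalent to taking the head of an ascending queue ordered by row count.
        existing = [t for t in candidates if t in table_rows]
        return min(existing, key=lambda t: table_rows[t])

    return smallest(candidates_i), smallest(candidates_j)
```

`table_rows` is a hypothetical statistics map from table key to row count; FP-ExtVP tables that have not been materialized are simply absent from it and are skipped.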
(32) Optimizing the Spark SQL logical execution plan, with the following specific steps:
1) Traversing the frequent predicate tree (P-Tree) from top to bottom, performing an AND operation on the hash code of the query table and the tree nodes, and locating the cluster nodes where the query table resides;
2) Distributing the SPARQL query only to the nodes that contain the corresponding tables for query processing, which reduces the search space;
Step (4), executing the distributed query plan, comprises the following step:
(41) The generated Spark SQL logical execution plan is handed to the in-memory computing framework Spark; Spark SQL generates the corresponding physical query plan from the logical plan, performs the query evaluation, and returns the query result;
Step (5), incrementally repartitioning the RDF data, comprises the following steps:
(51) Monitoring the query workload through the query monitor, and mining frequent patterns from the co-occurrence of predicates in SPARQL queries;
(52) Weighing the average query response time against the space replication ratio through the frequent pattern miner, and selecting the optimal frequency threshold to filter the frequent patterns;
(53) Constructing the frequent-predicate extended vertical partition (FP-ExtVP) tables from the frequent patterns through the incremental repartitioning executor: the semi-join of two correlated triple patterns is computed in advance, and the extended vertical partition table corresponding to the frequent pattern is then materialized, which realizes the incremental repartitioning of the data;
in step (53), the FP-ExtVP tables are constructed from the frequent patterns as follows:
1) Input the set of frequent patterns filtered in (52);
2) Find the correlations between triple patterns. For any triple patterns tp_i and
[Formula image GDA0002885240510000081]
let tp_i = <x, fp_1, y> and tp_j = <x, fp_1, z>; if the two triple patterns share a variable, tp_i and tp_j are said to be correlated, denoted correlations(tp_i, tp_j). For tp_i and tp_j above, the variable x appears as the subject of both triple patterns, so a subject-subject correlation (SS) exists between them. Depending on the relative positions of the shared variable, a subject-object correlation (SO), an object-subject correlation (OS), or an object-object correlation (OO) may also exist between two different triple patterns. Because object-object correlations are rarely used in typical SPARQL queries, OO correlations are not searched.
3) Obtain the positions of the join columns from the correlation between the triple patterns, and compute the left semi-join of the two vertical partition tables (VP) on the corresponding columns. If correlations(tp_i, tp_j) = SS, the left semi-join view is materialized into the
[Formula image GDA0002885240510000091]
table; if correlations(tp_i, tp_j) = SO, the left semi-join view is materialized into the
[Formula image GDA0002885240510000092]
table; if correlations(tp_i, tp_j) = OS, the left semi-join view is materialized into the
[Formula image GDA0002885240510000093]
table.
(54) Indexing the vertical partition tables and the FP-ExtVP tables on the cluster with the frequent predicate index tree (P-Tree), which reduces the search space during query execution.
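The co-occurrence mining of steps (51) and (52) can be illustrated with a short Python sketch. It counts how often two predicates appear in the same query and keeps the pairs whose support reaches a threshold. The threshold is simply passed in here; the trade-off between response time and replication ratio that the text uses to choose it is not modelled, and the function name and workload layout are assumptions.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(workload, min_support):
    """Return {predicate pair: support} for pairs co-occurring in at least min_support of queries.

    `workload` is a list of queries, each given as the list of predicates it uses.
    """
    counts = Counter()
    for predicates in workload:
        # Count each unordered pair once per query, even if a predicate repeats.
        for pair in combinations(sorted(set(predicates)), 2):
            counts[pair] += 1
    total = len(workload)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}


workload = [
    ["offers", "includes"],
    ["offers", "includes", "likes"],
    ["follows", "likes"],
    ["offers", "includes"],
]
frequent = mine_frequent_pairs(workload, min_support=0.5)
```

With this toy workload, the pair `("includes", "offers")` occurs in three of four queries and is the only one to pass the 0.5 threshold. Lowering the threshold admits more pairs and therefore more materialized tables, which is the storage side of the trade-off named in step (52).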
Examples
As shown in FIG. 1, the invention discloses a distributed RDF query optimization method and system architecture based on incremental repartitioning; the techniques and architecture provided can be applied to the storage of distributed RDF data and to SPARQL querying. The overall technical architecture is shown in FIG. 2. This embodiment takes the storage and querying of the synthetic data set WatDiv as an example; the specific steps are as follows:
Step one: cleaning the RDF data set and completing the data format conversion, comprising the following steps:
(11) Inputting the generated WatDiv data set file into the RDF data format converter, and converting the RDF data format into N-Triples;
(12) Inputting the converted data file into the constructed regular-expression parser, which extracts and parses the RDF data, filters useless information such as blank lines and duplicate data, and converts the RDF data into prefix format;
Step two: storing the RDF data based on a hybrid relational schema, comprising the following steps:
(21) Selecting a storage data partitioning scheme based on subject hash partitioning and vertical partitioning;
(22) Hash-partitioning the initial data on the subject of the RDF triples to form hash shards;
(23) Completing the initial partitioning of the RDF data on each hash shard by vertical partitioning, which yields the three vertical partition tables VP_offers, VP_includes, and VP_subscribes shown in the Vertical Partitioning area of FIG. 4;
Step three: designing the distributed query plan with the query planner, comprising the following steps:
(31) Parsing the SPARQL basic query subgraph into the corresponding algebra tree using the Jena ARQ component, and generating an SQL expression after applying basic algebraic optimization;
(32) For the SPARQL basic graph pattern (BGP), taking each triple pattern as a vertex and constructing an undirected join graph based on the correlations between triple patterns. Because an object-object (OO) correlation exists between triple patterns TP_1 and TP_4, a connecting edge e_1 is set between the two vertices. The connecting edges e_2, e_3 and e_4 are set in the same way, finally forming the join graph structure shown in FIG. 3;
(33) Computing the selectivity of each vertex in the undirected join graph according to the formula in
Figure GDA0002885240510000101
and selecting the vertex with the lowest selectivity, TP_3, as the starting point of the graph traversal;
(34) Traversing the undirected join graph depth-first to obtain all traversal orders;
(35) Inputting all traversal orders into the cost evaluation model, computing the join cost, and selecting the traversal order with the minimum cost, namely TP_3, TP_4, TP_2, TP_1, as the join order;
(36) Generating the equivalent Spark SQL expression;
(37) Traversing the frequent predicate tree (P-Tree) from top to bottom, performing a bitwise AND between the hash code of the query table and the hash codes of the tree nodes, and locating the cluster nodes on which the query table resides;
(38) Generating the Spark SQL logical execution plan and distributing the SPARQL query to the nodes that contain the corresponding tables for query processing, thereby reducing the search space.
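Steps (32) to (35) can be sketched as below. The sketch is written under stated assumptions: triple patterns are tuples whose variables start with `?`, the starting vertex is passed in rather than computed (the selectivity formula is only available as a figure), and `toy_cost` is a hypothetical stand-in for the cost evaluation model, not the model of the invention. The example patterns and cardinalities are invented for illustration.

```python
from itertools import combinations


def variables(tp):
    """Variables in subject/object position (predicates are assumed to be bound)."""
    return {term for term in (tp[0], tp[2]) if term.startswith("?")}


def build_join_graph(patterns):
    """patterns: {name: (s, p, o)}. Two patterns are adjacent if they share a variable."""
    graph = {name: [] for name in patterns}
    for a, b in combinations(sorted(patterns), 2):
        if variables(patterns[a]) & variables(patterns[b]):
            graph[a].append(b)
            graph[b].append(a)
    return graph


def dfs_orders(graph, start):
    """Yield every vertex order a depth-first traversal from start can produce.

    Assumes the graph is connected; otherwise nothing is yielded.
    """
    def extend(order, stack):
        if len(order) == len(graph):
            yield tuple(order)
            return
        while stack:
            candidates = [v for v in graph[stack[-1]] if v not in order]
            if candidates:
                for v in candidates:
                    yield from extend(order + [v], stack + [v])
                return
            stack = stack[:-1]  # backtrack to the previous vertex

    yield from extend([start], [start])


def best_order(graph, start, cost):
    """Pick the traversal order with the smallest cost."""
    return min(dfs_orders(graph, start), key=cost)


# Hypothetical example data.
PATTERNS = {
    "TP1": ("?user", "likes", "?product"),
    "TP2": ("?retailer", "offers", "?offer"),
    "TP3": ("?offer", "includes", "?product"),
    "TP4": ("?friend", "follows", "?user"),
}
CARDINALITY = {"TP1": 100, "TP2": 50, "TP3": 10, "TP4": 200}


def toy_cost(order):
    """Stand-in cost: patterns joined earlier weigh more, as they feed later joins."""
    n = len(order)
    return sum((n - k) * CARDINALITY[name] for k, name in enumerate(order))
```

Enumerating only depth-first orders keeps every prefix of the join order connected, which avoids Cartesian products between unrelated triple patterns.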
Step four: executing the distributed query plan with the query executor:
(41) Spark SQL generates the corresponding physical query plan from the generated logical execution plan, performs the query evaluation, and returns the corresponding query result;
Step five: mining frequent patterns in the query workload and using them to guide incremental repartitioning of the data, comprising the following steps:
(51) Monitoring the query workload and mining candidate frequent patterns using the co-occurrence of predicates in SPARQL queries;
(52) Weighing the average query response time against the space replication ratio, and selecting the optimal frequency threshold to screen the frequent patterns;
(53) As shown in FIG. 4, constructing frequent predicate extended vertical partition tables (FP-ExtVP) according to the frequent patterns, with the semi-join of each pair of correlated triple patterns computed in advance. Because the triple patterns corresponding to VP_offers and VP_includes are both frequent patterns, and an object-subject correlation exists between the two triple patterns, the left semi-join view is materialized into the table
Figure GDA0002885240510000102
A subject-subject correlation exists between VP_offers and VP_includes, so the left semi-join view is materialized into the table
Figure GDA0002885240510000103
Similarly, the other frequent predicate extended vertical partition tables are generated:
Figure GDA0002885240510000104
(54) As shown in FIG. 5, constructing the frequent predicate index tree (P-Tree) from the bottom up. The data element of each leaf node stores the hash code of a vertical partition table (VP) or a frequent predicate extended vertical partition table (FP-ExtVP) held on the corresponding cluster node. The hash code of
Figure GDA0002885240510000111
is obtained from the hash codes of
Figure GDA0002885240510000112
and
Figure GDA0002885240510000113
by a bitwise AND operation. The hash code of a non-leaf node is the bitwise OR of the hash codes of all its child nodes, and its pointer element stores pointers to all its child nodes. Each parent node of leaf nodes corresponds to one cluster node.
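The P-Tree lookup can be sketched as a small signature tree. The hash function is not spelled out in this passage, so the sketch assumes a hypothetical encoding that gives every table its own bit (`assign_codes`); the class `PTreeNode` and the functions `build_ptree` and `locate` are likewise illustrative names. Only the two properties stated in the text are modelled: a non-leaf code is the bitwise OR of its children, and a lookup descends only where the bitwise AND with the query table's code leaves that code intact.

```python
class PTreeNode:
    """Node of a frequent predicate index tree (illustrative structure)."""

    def __init__(self, code=0, children=None, label=None):
        self.children = children or []
        self.label = label
        # A non-leaf code is the bitwise OR of all child codes.
        self.code = code
        for child in self.children:
            self.code |= child.code


def assign_codes(tables):
    """Hypothetical encoding: one distinct bit per table name."""
    return {name: 1 << i for i, name in enumerate(sorted(set(tables)))}


def build_ptree(cluster_tables, codes):
    """cluster_tables: {cluster_node: [table names]}; codes: {table name: bitmask}."""
    cluster_nodes = []
    for name, tables in sorted(cluster_tables.items()):
        leaves = [PTreeNode(code=codes[t], label=t) for t in tables]
        cluster_nodes.append(PTreeNode(children=leaves, label=name))
    return PTreeNode(children=cluster_nodes, label="root")


def locate(root, table_code):
    """Top-down search for the cluster nodes that hold the table with this code."""
    if (root.code & table_code) != table_code:
        return []  # the table is not stored anywhere in the cluster
    hits = []
    for cluster in root.children:
        if (cluster.code & table_code) == table_code:
            if any(leaf.code == table_code for leaf in cluster.children):
                hits.append(cluster.label)
    return hits
```

With a real hash function several tables may share bits, which is why the sketch still checks the leaves of a matching cluster node before reporting a hit.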

Claims (8)

1. A distributed RDF system based on incremental repartitioning, characterized in that: the system comprises an RDF data partitioning module, an RDF data incremental repartitioning module and a distributed query module, wherein:
the RDF data partitioning module comprises a relational storage mode selector and a storage executor; the storage mode selector is used for selecting the required hybrid storage scheme from the constructed relational storage mode library; the storage executor is used for storing the RDF data using the storage modes of the selected hybrid storage scheme;
the RDF data incremental repartitioning module comprises a frequent pattern miner and an incremental repartitioning executor; the frequent pattern miner monitors the query workload through the query monitor and mines frequent patterns using the co-occurrence of predicates in SPARQL queries; the incremental repartitioning executor is used for constructing frequent predicate extended vertical partition tables according to the frequent patterns, realizing incremental repartitioning of the RDF data;
the distributed query module comprises a query monitor, a query planner and a query executor; the query monitor is used for monitoring the query workload and periodically forwarding it to the frequent pattern miner; the query planner is used for designing the distributed query plan, generating a logical query plan after applying algebraic optimization and join order optimization to the SPARQL query; and the query executor generates the corresponding physical query plan through Spark SQL and performs the query computation.
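The predicate co-occurrence mining performed by the frequent pattern miner can be sketched as a pair-counting pass over the monitored workload. This is a simplified illustration: it assumes each query is reduced to the list of predicates in its basic graph pattern, considers only predicate pairs, and takes the frequency threshold `min_support` as a parameter instead of deriving it from the trade-off between response time and space replication ratio. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_frequent_pairs(workload, min_support):
    """Return {(pred_a, pred_b): support} for predicate pairs that co-occur often.

    workload: list of queries, each given as the list of predicates in its BGP.
    min_support: minimum fraction of queries in which a pair must co-occur.
    """
    counts = Counter()
    for predicates in workload:
        # Count each unordered pair once per query, regardless of repetitions.
        for pair in combinations(sorted(set(predicates)), 2):
            counts[pair] += 1
    total = len(workload)
    return {pair: n / total for pair, n in counts.items() if n / total >= min_support}
```

For example, with a workload of four queries in which `offers` and `includes` appear together three times, a threshold of 0.5 keeps only that pair.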
2. The incremental repartitioning-based distributed RDF system of claim 1, wherein: the relational storage mode library comprises hash partitioning, vertical partitioning, extended vertical partitioning and property tables.
3. The query optimization method for the incremental repartitioning-based distributed RDF system according to claim 1 or 2, wherein: the method comprises the following steps:
(1) Cleaning the RDF data set;
(2) Storing the RDF data based on the hybrid relational schema;
(3) Designing a distributed query plan;
(4) Executing the distributed query plan;
(5) The RDF data is incrementally repartitioned.
4. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (1) specifically comprises the following steps:
(11) Performing format conversion on the data set to be converted, converting it into the N-Triples format in batches;
(12) Inputting the format-converted data set into a regular-expression parser, filtering out useless information such as blank lines, redundant data and ontology description information, and converting the RDF data into a prefixed format.
5. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (2) specifically comprises the following steps:
(21) Selecting, by the storage mode selector, a hybrid storage mode based on hash partitioning and vertical partitioning;
(22) Loading the RDF data in sequence, by the storage executor, according to the storage modes of the storage scheme chosen by the storage mode selector, thereby realizing the initial partitioning of the RDF data;
wherein the specific process of the step (22) is as follows:
1) Hash-partitioning the initial data on the subject of each RDF triple to form hash shards;
2) Completing the initial partitioning of the RDF data on each hash shard by vertical partitioning.
6. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (3) specifically comprises the following steps:
(31) Performing query conversion through the query planner: parsing the SPARQL query into the corresponding algebra tree using the Jena ARQ component, and generating an equivalent Spark SQL expression after algebraic optimization, that is, converting the basic graph pattern BGP = {tp_1, tp_2, ..., tp_n} into a set of equivalent sub-queries
Figure FDA0003752579890000021
The specific conversion steps are as follows:
1) Query candidate table selection: for a given basic graph pattern bgp, selecting any two triple patterns tp_i and tp_j, where
Figure FDA0003752579890000022
and matching tp_i and tp_j to their corresponding vertical partition tables according to their predicates;
(1) first judging whether the two triple patterns are correlated; if there is no correlation between tp_i and tp_j, going to (2); if there is a correlation between tp_i and tp_j, going to (3);
(2) selecting the vertical partition tables VP_i and VP_j corresponding to the predicates of tp_i and tp_j, and adding them to the query candidate tables of tp_i and tp_j respectively;
(3) judging whether the predicates of tp_i and tp_j are both frequent predicates; if so, going to (5); otherwise, going to (2);
(4) for tp_i, comparing the statistics of its vertical partition table VP_i with those of the corresponding FP-ExtVP table, and putting the smaller table into the ascending queue of query candidate tables of tp_i; for tp_j, the comparison proceeds in the same way as for tp_i;
(5) taking the head elements of the ascending candidate-table queues of tp_i and tp_j as the final query tables, that is, selecting the smallest table as the final query table;
2) Converting SPARQL into SQL: mapping the triple patterns to the corresponding algebra tree using relational algebra operators, and generating the SQL query statement;
3) Optimizing the join order of the query: for a basic graph pattern bgp with n triple patterns, the generated SQL query requires n-1 join operations to compute the query result; the join order of the triple patterns is determined through a join cost evaluation model so as to reduce the number of intermediate results and thereby optimize query performance; the specific steps are as follows:
(1) constructing an undirected join graph based on the correlations between triple patterns: if a correlation exists between any two triple patterns, a connecting edge is set between the two vertices;
(2) computing the selectivity of each vertex in the undirected join graph, and selecting the vertex with the lowest selectivity as the starting point of the graph traversal; the selectivity is calculated as follows:
Figure FDA0003752579890000031
wherein
Figure FDA0003752579890000032
denotes the selectivity of the triple pattern over the whole RDF graph, |C| denotes the number of constants in the triple pattern, and |E| denotes the degree of the vertex in the join graph;
(3) traversing the undirected join graph depth-first; computing the join cost according to the cost evaluation model, and selecting the traversal order with the minimum cost as the join order; the cost evaluation model is calculated as follows:
Figure FDA0003752579890000033
wherein T_I/O denotes the time required for one disk I/O operation, Disk(tp_i) denotes the number of disk pages required to query the triples contained in tp_i, Net(tp_i) denotes the number of network transmissions required to transmit the triples contained in tp_i, and T_net denotes the time required for one network transmission;
(4) writing the SQL query statement with the optimized join order into a query file;
(32) Optimizing the Spark SQL logical execution plan, with the following specific steps:
1) Traversing the frequent predicate tree from top to bottom, performing a bitwise AND between the hash code of the query table and the hash codes of the tree nodes, and locating the cluster nodes on which the query table resides;
2) Distributing the SPARQL query only to the nodes that contain the corresponding tables for query processing, thereby reducing the search space.
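The query candidate table selection of step (31) 1) can be sketched for one triple pattern of a pair as follows. This is an illustration under assumptions, not the claimed procedure verbatim: table statistics are reduced to row counts, the table naming scheme (`VP_<p>`, `ExtVP_<corr>_<p_i>|<p_j>`) and the function name `choose_table` are hypothetical, and the ascending queue collapses to a single `min` because only two candidates exist per pattern.

```python
def choose_table(pred_i, pred_j, correlation, frequent, table_sizes):
    """Pick the table to scan for the pattern with predicate pred_i.

    pred_j: predicate of the partner triple pattern.
    correlation: "SS", "SO", "OS", "OO", or None when no variable is shared.
    frequent: set of frequent predicates.
    table_sizes: {table name: row count}; must contain the VP table of pred_i.
    """
    vp = f"VP_{pred_i}"
    # Uncorrelated patterns, or a non-frequent predicate: fall back to the VP table.
    if correlation is None or pred_i not in frequent or pred_j not in frequent:
        return vp
    ext = f"ExtVP_{correlation}_{pred_i}|{pred_j}"
    candidates = [vp]
    if ext in table_sizes:  # the extended table may not have been materialized yet
        candidates.append(ext)
    # Equivalent to taking the head of an ascending queue ordered by size.
    return min(candidates, key=lambda table: table_sizes[table])
```

Falling back to the VP table whenever the extended table is missing keeps the query answerable while repartitioning is still in progress.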
7. The query optimization method for the incremental repartitioning-based distributed RDF system of claim 3, wherein: the step (4) specifically comprises the following steps:
handing the generated Spark SQL logical execution plan to the in-memory computing framework Spark, whereupon Spark SQL generates the corresponding physical query plan from the logical plan, performs the query evaluation, and returns the corresponding query result.
8. The query optimization method for the incremental repartitioning-based distributed RDF system according to claim 3, wherein: the step (5) specifically comprises the following steps:
(51) Monitoring the query workload through the query monitor, and mining frequent patterns using the co-occurrence of predicates in SPARQL queries;
(52) Weighing, by the frequent pattern miner, the average query response time against the space replication ratio, and selecting the optimal frequency threshold to screen the frequent patterns;
(53) Constructing, by the incremental repartitioning executor, frequent predicate extended vertical partition tables according to the frequent patterns: computing in advance the semi-join of each pair of correlated triple patterns, and then materializing the extended vertical partition tables corresponding to the frequent patterns, realizing incremental repartitioning of the data;
wherein in the step (53), the frequent predicate extended vertical partition tables are constructed from the frequent patterns by the following specific process:
1) Inputting the set of frequent patterns screened in (52);
2) Searching for correlations among the triple patterns; for arbitrary triple patterns tp_i and tp_j, where
Figure FDA0003752579890000044
Figure FDA0003752579890000045
let tp_i = <x, fp_1, y> and tp_j = <x, fp_1, z>; if two triple patterns share a variable, tp_i and tp_j are said to be correlated, denoted correlations(tp_i, tp_j); for tp_i and tp_j, because the variable x appears as the subject of both triple patterns, a subject-subject correlation exists between the two triple patterns; in addition, depending on the relative positions of the shared variable in two different triple patterns, subject-object, object-subject and object-object correlations may also exist between them;
3) Obtaining the positions of the join columns according to the correlation between the triple patterns, and computing the left semi-join of the two vertical partition tables VP on the corresponding columns; if correlations(tp_i, tp_j) = SS, materializing the left semi-join view into the table
Figure FDA0003752579890000041
if correlations(tp_i, tp_j) = SO, materializing the left semi-join view into the table
Figure FDA0003752579890000042
if correlations(tp_i, tp_j) = OS, materializing the left semi-join view into the table
Figure FDA0003752579890000043
(54) Constructing the frequent predicate index tree to index the vertical partition tables and the frequent predicate extended vertical partition tables on the cluster, reducing the search space during query execution.
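The correlation test and the left semi-join of step (53) can be sketched on in-memory tables. The sketch assumes that variables are written with a leading `?`, that each VP table is a list of `(subject, object)` rows, and that only the first shared variable is reported; the function names and the example rows are hypothetical.

```python
def correlation(tp_i, tp_j):
    """Return "SS", "SO", "OS" or "OO" for the first shared variable, else None.

    Triple patterns are (subject, predicate, object) tuples.
    """
    positions = {"S": 0, "O": 2}
    for a, index_a in positions.items():
        for b, index_b in positions.items():
            if tp_i[index_a].startswith("?") and tp_i[index_a] == tp_j[index_b]:
                return a + b
    return None


def left_semi_join(vp_left, vp_right, corr):
    """Keep the rows of vp_left that have at least one join partner in vp_right.

    corr names the join columns: first letter for vp_left, second for vp_right.
    """
    left_col = 0 if corr[0] == "S" else 1
    right_col = 0 if corr[1] == "S" else 1
    keys = {row[right_col] for row in vp_right}
    return [row for row in vp_left if row[left_col] in keys]


# Hypothetical example rows.
VP_OFFERS = [("Retailer1", "Offer1"), ("Retailer1", "Offer2"), ("Retailer2", "Offer3")]
VP_INCLUDES = [("Offer1", "Product1"), ("Offer3", "Product2")]
```

Materializing the result of `left_semi_join(VP_OFFERS, VP_INCLUDES, "OS")` as an extended table drops the `Offer2` row, which can never contribute to a join with `includes`.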
CN202011371750.1A 2020-11-30 2020-11-30 Distributed RDF system based on incremental repartitioning and query optimization method thereof Active CN112487015B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011371750.1A CN112487015B (en) 2020-11-30 2020-11-30 Distributed RDF system based on incremental repartitioning and query optimization method thereof

Publications (2)

Publication Number Publication Date
CN112487015A CN112487015A (en) 2021-03-12
CN112487015B true CN112487015B (en) 2022-10-14

Family

ID=74937243

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011371750.1A Active CN112487015B (en) 2020-11-30 2020-11-30 Distributed RDF system based on incremental repartitioning and query optimization method thereof

Country Status (1)

Country Link
CN (1) CN112487015B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825738A (en) * 2019-10-22 2020-02-21 天津大学 Data storage and query method and device based on distributed RDF
CN110909111A (en) * 2019-10-16 2020-03-24 天津大学 Distributed storage and indexing method based on knowledge graph RDF data characteristics

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8489649B2 (en) * 2010-12-13 2013-07-16 Oracle International Corporation Extensible RDF databases
US9639575B2 (en) * 2012-03-30 2017-05-02 Khalifa University Of Science, Technology And Research Method and system for processing data queries

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant