CN109992593A

CN109992593A - A large-scale data parallel query method based on subgraph matching

Info

Publication number: CN109992593A
Application number: CN201910187235.9A
Authority: CN
Inventors: 季雅雯; 杨柳
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2019-03-13
Filing date: 2019-03-13
Publication date: 2019-07-09

Abstract

In recent years, with the rapid development of computer networks, the volume of RDF data on the Web has grown sharply, and many large-scale RDF datasets have appeared. The data in these massive datasets are often connected by complex network relationships, so traditional centralized query schemes cannot obtain query results quickly and accurately. The present invention discloses a large-scale data parallel query method based on subgraph matching. The invention is combined with a distributed platform, primarily to improve the efficiency of querying large-scale datasets. First, an adjacency list storage scheme is used for both the data graph and the query graph, making full use of the topology and attribute information of the graphs, so that the query process is converted into a neighborhood containment check. Then, the matching order selection problem is solved by accurately estimating the number of candidates for each query vertex in each candidate region, which reduces the generation of intermediate results, and the exploration of multiple candidate regions can be carried out in parallel. In this way, the invention can effectively improve query efficiency and obtain accurate query results.

Description

A large-scale data parallel query method based on subgraph matching

Technical field

The invention belongs to the field of data querying over large-scale data, and in particular relates to a large-scale data parallel query method based on subgraph matching.

Background

With the wide application of the Semantic Web in fields such as unstructured data management, bioinformatics, and digital libraries, the volume of RDF data on the Web has grown sharply, and many large-scale RDF datasets have appeared. As of June 2018, the Linked Open Data cloud contained 1,231 open datasets and 16,132 links. Organizing RDF data efficiently and conveniently has become an urgent problem.

RDF is the general-purpose language proposed by the W3C for describing Semantic Web information. RDF data uses the triple as its basic unit. Each triple consists of a subject, a predicate, and an object, with the predicate describing the relationship between subject and object, and is written <s, p, o>. Compared with relational data, RDF data is typically schema-free, which makes it difficult to optimize storage according to structural information and leads to poor query performance.

RDF triple data can be represented as an RDF graph made up of labeled nodes and edges. Each triple corresponds to a "node-edge-node" subgraph and states the relationship between the things denoted by the subject and the object; this relationship is expressed by the predicate. The nodes of an RDF graph are subjects and objects, and the edges are predicates. An RDF graph is defined as a directed graph in which each edge points from the subject of a triple to its object.

SPARQL is a query language and data access protocol developed for RDF. It is defined over the RDF data model developed by the W3C and uses triple patterns as the basis of queries. Almost every RDF store supports SPARQL, either directly or through a dedicated SPARQL wrapper.

A query over an RDF graph can be treated as a subgraph matching problem, that is, subgraph isomorphism, which is well known to be NP-hard. Many algorithms have been proposed to address it. Improving subgraph matching efficiency on large graphs is very important. A well-designed indexing strategy not only reduces storage overhead but also enables efficient subgraph matching over RDF graphs. In the query process, moreover, the efficiency gained from a good search algorithm clearly exceeds what an indexing strategy alone can bring, effectively reducing the time overhead of RDF data queries and improving operational efficiency.

Summary of the invention

The purpose of the invention is to improve the efficiency of querying large-scale datasets. Because today's data is huge in scale and connected by complex network relationships, traditional data query schemes cannot solve the query problem quickly and accurately. The invention therefore provides a large-scale data parallel query method based on subgraph matching.

To achieve this purpose, the invention is combined with a distributed platform and uses adjacency list storage, which makes full use of the topology and attribute information of the graph. The subgraph matching and region exploration processes are converted into a neighborhood containment check, and, through matching order selection, subgraph matching queries are carried out on the candidate regions in parallel. The invention thus uses a large-scale data parallel query method based on subgraph matching to improve data query efficiency. The detailed steps of the method are as follows:

(1) Build adjacency lists for the partitioned subgraphs and for the input query graph;

(2) Compute the rank value of every vertex in the query graph and select the vertex u with the smallest rank value;

(3) With vertex u as the root node, build a query tree in breadth-first fashion; if non-tree edges exist (that is, edges that are not regular tree edges), retain them;

(4) In the data graph G, match the candidate set of vertex u and, taking each vertex v in the candidate set as an initial query vertex, explore the candidate sets level by level according to the query tree;

(5) During candidate set exploration, determine the matching order, giving priority to query vertices with smaller candidate sets;

(6) Perform subgraph matching in the resulting matching order, using the adjacency information containment check to remove redundant or erroneous vertices (those determined to be unable to match any further query vertex);

(7) If a complete and correct matching result is obtained, return it; if the candidate set of some query vertex is empty, the match fails.

Brief description of the drawings

Fig. 1 is a system structure diagram of the method of the invention.

Fig. 2 is an algorithm flow chart of the method of the invention.

Fig. 3 is a query graph used by the method of the invention.

Detailed description

To make the purpose, features, and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the basic theory, formulas, and drawings, in the order of basic principles, overall process, and specific steps.

Before the query process itself, an adjacency list must first be built for the RDF data graph. The format of the adjacency list is as follows:

In the table, each vertex u is represented by an adjacency list entry [uID, uLabel, adjList], where uID is the vertex ID, uLabel is the attribute (label) of the vertex, and adjList is a list giving the attributes of the outgoing edges and of the adjacent vertices those edges connect to. In general, adjList(u) = {(ei.eLabel, ei.nLabel)}, where ei is an edge adjacent to vertex u, eLabel is the label of edge ei, and nLabel is the label of the vertex adjacent to u through edge ei. The query scheme used here is briefly introduced next.
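The entry format [uID, uLabel, adjList] can be illustrated with a minimal Python sketch. The function name `build_adjacency_list`, the dictionary layout, and the toy triples are assumptions made for illustration; the text above specifies only the three fields and the (eLabel, nLabel) pairs.

```python
from collections import defaultdict

def build_adjacency_list(triples, labels):
    """Build {uID: (uLabel, adjList)} from (subject, predicate, object) triples.

    labels maps each vertex ID to its label. adjList holds one
    (eLabel, nLabel) pair per outgoing edge of the vertex.
    """
    adj = defaultdict(list)
    for s, p, o in triples:
        adj[s].append((p, labels[o]))
    # Vertices without outgoing edges still get an entry with an empty adjList.
    return {v: (labels[v], adj[v]) for v in labels}

# Hypothetical toy data graph.
labels = {1: "Person", 2: "Person", 3: "City"}
triples = [(1, "knows", 2), (1, "livesIn", 3), (2, "livesIn", 3)]
table = build_adjacency_list(triples, labels)
```

Here `table[1]` holds the label of vertex 1 together with one pair per outgoing edge, which is all the later containment check needs to look at.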

An adjacency list must also be built for the input query graph, in the same way as for the data graph. The uID column of this table holds the unique identifier V_i of each query vertex. Note that in most queries the edge attributes are known, so the vertex cases are discussed as follows: if the attribute value of a vertex is known, its uLabel column holds that value; if the attribute is unknown, _ (blank) is used in place of the value. The same applies to the ei.nLabel items in the adjList column. The query adjacency list is as follows:

Then, in the query graph Q, a rank value is computed for each vertex, and the vertex with the smallest rank value is chosen. The rank value is computed as follows:

Rank(u) = freq(G, u) / deg_out(u)

The numerator freq(G, u) is the number of vertices in the data graph G that match vertex u; clearly, the fewer candidate vertices v that u has, the better its rank value. The denominator deg_out(u) is the out-degree of the query vertex in the query graph; if the query vertex has no outgoing edges, deg_out(u) is 0 and Rank(u) tends to infinity. The calculation of freq(G, u) is discussed by case below:

If the attribute value of query vertex u is unknown, then a vertex v is a candidate match for u if adjList(v) in the data graph adjacency list contains everything in adjList(u) in the query graph adjacency list, where _ can stand for any content;

If the attribute value of query vertex u is known, first find the vertices v in the data graph G that have the same attribute value as u, then check whether adjList(v) contains everything in adjList(u) in the query graph adjacency list, where _ can stand for any content. If so, v is a candidate match for u.
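The two cases can be sketched together in Python. This is an illustrative reading, not the patented implementation: it treats containment set-wise (each query pair needs at least one matching data pair), and it takes the rank value to be the candidate count divided by the out-degree, as the description of the numerator and denominator suggests. All function names and the toy data are hypothetical.

```python
import math

WILDCARD = "_"  # stands for an unknown attribute in the query graph

def label_matches(query_label, data_label):
    return query_label == WILDCARD or query_label == data_label

def contains(data_adj, query_adj):
    """True if every query (eLabel, nLabel) pair is matched by some data pair."""
    return all(
        any(label_matches(qe, de) and label_matches(qn, dn) for de, dn in data_adj)
        for qe, qn in query_adj
    )

def candidates(data_table, q_label, q_adj):
    """Data vertices whose label and adjList satisfy the query vertex."""
    return [
        vid
        for vid, (v_label, v_adj) in data_table.items()
        if label_matches(q_label, v_label) and contains(v_adj, q_adj)
    ]

def rank(data_table, q_label, q_adj):
    out_degree = len(q_adj)
    if out_degree == 0:
        return math.inf  # no outgoing edges: never chosen as the start vertex
    return len(candidates(data_table, q_label, q_adj)) / out_degree

# Hypothetical data adjacency list: {uID: (uLabel, adjList)}.
data_table = {
    1: ("Person", [("knows", "Person"), ("livesIn", "City")]),
    2: ("Person", [("livesIn", "City")]),
    3: ("City", []),
}
rank_a = rank(data_table, "_", [("knows", "_"), ("livesIn", "City")])
rank_b = rank(data_table, "Person", [("livesIn", "City")])
```

A known label simply narrows the scan before the same containment test runs, so one `candidates` function covers both cases.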

Next, the vertex u with the smallest rank value is selected. The previous step has already produced the candidate set of u in the data graph G, so each vertex in that set is now taken in turn as an initial query vertex from which a candidate region is explored. After the set of initial query vertices has been chosen, the query graph must first be converted into a corresponding query tree. The tree is built by the following rules:

(1) The vertex u with the smallest rank value is the root node of the query tree;

(2) The tree is built in breadth-first fashion;

(3) Non-tree edges are retained and represented in the query tree.
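The three rules can be sketched as a breadth-first traversal that also records the edges left out of the tree. The sketch assumes that the traversal may follow query edges in either direction, which the text does not state; `build_query_tree` and the sample query are hypothetical.

```python
from collections import deque

def build_query_tree(edges, root):
    """Breadth-first query tree over (source, label, target) query edges.

    Returns (levels, tree_edges, non_tree_edges). Assumption: edges are
    traversed in both directions so that every connected vertex is reached.
    """
    neighbours = {}
    for u, _label, v in edges:
        neighbours.setdefault(u, []).append(v)
        neighbours.setdefault(v, []).append(u)

    parent = {root: None}
    depth = {root: 0}
    levels = [[root]]
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in neighbours.get(u, []):
            if v not in parent:
                parent[v] = u
                depth[v] = depth[u] + 1
                if depth[v] == len(levels):
                    levels.append([])
                levels[depth[v]].append(v)
                queue.append(v)

    tree_edges, non_tree_edges = [], []
    for u, label, v in edges:
        if parent.get(v) == u or parent.get(u) == v:
            tree_edges.append((u, label, v))
        else:
            non_tree_edges.append((u, label, v))  # retained for later checks
    return levels, tree_edges, non_tree_edges

# Hypothetical triangle query rooted at "a".
query_edges = [("a", "knows", "b"), ("a", "livesIn", "c"), ("b", "livesIn", "c")]
levels, tree_edges, non_tree_edges = build_query_tree(query_edges, "a")
```

In this example the edge from b to c closes a cycle, so it ends up in `non_tree_edges` and must be verified separately during matching.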

Candidate sets are then explored starting from an initial query vertex v. The exploration proceeds as follows. Starting from an initial query vertex and following the query tree, for each vertex on the second level of the tree, find the vertices in the data graph G that satisfy the adjacent-edge constraints of that query vertex, and add them to its candidate set. After all vertices adjacent to the initial query vertex have been processed, if the candidate set of some vertex is empty, that vertex cannot be matched, and the initial query vertex is removed from the initial query vertex set. If no candidate set is empty, the same procedure is applied starting from the non-leaf nodes of the second level to find the candidate sets of the third-level vertices, and so on, until the candidate sets of all query vertices have been determined and are non-empty.
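A simplified sketch of this level-by-level exploration for a single initial query vertex follows. It assumes every tree edge points from parent to child and checks only the connecting edge label and the child's own label, which is a lighter test than the full adjacency containment check; `explore_region` and the toy graph are hypothetical.

```python
WILDCARD = "_"

def explore_region(start, root, levels, q_labels, data_out, data_labels):
    """Explore one candidate region, starting from data vertex `start`.

    levels: one list per tree level below the root, each holding
            (parent, edge_label, child) query tree edges.
    Returns {query_vertex: set_of_candidates}, or None if some candidate
    set is empty, meaning the region is rejected.
    """
    cand = {root: {start}}
    for level in levels:
        for parent, e_label, child in level:
            found = {
                w
                for x in cand[parent]
                for el, w in data_out.get(x, [])
                if el == e_label and q_labels[child] in (WILDCARD, data_labels[w])
            }
            if not found:
                return None  # this initial query vertex is dropped
            cand[child] = found
    return cand

# Hypothetical data graph: outgoing edges and vertex labels.
data_labels = {1: "Person", 2: "Person", 3: "City", 4: "Person"}
data_out = {
    1: [("knows", 2), ("livesIn", 3)],
    2: [("livesIn", 3)],
    4: [("livesIn", 3)],
}
# Query: a -knows-> b (Person), a -livesIn-> c (City), with a unlabeled.
q_labels = {"a": "_", "b": "Person", "c": "City"}
levels = [[("a", "knows", "b"), ("a", "livesIn", "c")]]

region_1 = explore_region(1, "a", levels, q_labels, data_out, data_labels)
region_4 = explore_region(4, "a", levels, q_labels, data_out, data_labels)
```

Vertex 4 has no outgoing "knows" edge, so its region is rejected as soon as the candidate set for b comes back empty.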

After this exploration, the candidate set of every query vertex is known. Starting from the initial query vertex, the matching order is therefore determined level by level, ordering the query vertices of each level by candidate set size from smallest to largest. Within a candidate region, every vertex must satisfy the adjacent-edge constraints, so matching within the explored region must satisfy not only the regular tree edges of the query tree but also its non-tree edges. Note that if, during matching, the candidate set of some query vertex u is pruned to empty, the candidate region that starts from this initial query vertex contains no qualifying match. After all query vertices have been matched, the matching result is returned.
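The ordering rule and the final verification can be sketched as below. The backtracking search is a generic stand-in chosen for illustration, and the requirement that distinct query vertices map to distinct data vertices follows from treating the problem as subgraph isomorphism; both function names and the sample data are hypothetical.

```python
def matching_order(root, levels, cand):
    """Root first, then each level sorted by ascending candidate set size."""
    order = [root]
    for level in levels:
        order.extend(sorted(level, key=lambda u: len(cand[u])))
    return order

def match(order, cand, query_edges, data_edges):
    """Enumerate assignments that satisfy every query edge, tree or non-tree."""
    results = []

    def consistent(assign):
        return all(
            (assign[u], label, assign[v]) in data_edges
            for u, label, v in query_edges
            if u in assign and v in assign
        )

    def extend(i, assign):
        if i == len(order):
            results.append(dict(assign))
            return
        u = order[i]
        for v in sorted(cand[u]):
            if v in assign.values():  # assumption: mapping must be injective
                continue
            assign[u] = v
            if consistent(assign):
                extend(i + 1, assign)
            del assign[u]

    extend(0, {})
    return results

# Hypothetical region: b has two candidates, c has one, so c is matched first.
cand = {"a": {1}, "b": {2, 4}, "c": {3}}
order = matching_order("a", [["b", "c"]], cand)
query_edges = [("a", "knows", "b"), ("a", "livesIn", "c"), ("b", "livesIn", "c")]
data_edges = {(1, "knows", 2), (1, "livesIn", 3), (2, "livesIn", 3), (4, "livesIn", 3)}
results = match(order, cand, query_edges, data_edges)
```

Candidate 4 for b is discarded because the data graph has no "knows" edge from 1 to 4, which is the kind of erroneous vertex the containment check is meant to remove.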

Note that if the query is a star query, a single matching pass on the central vertex is enough to obtain the matching data.

This scheme divides the candidate data into multiple regions according to the number of candidates for the initial query vertex. Obtaining the matching order and carrying out the matching query can be done for each region in parallel.
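Because the regions are independent, each initial query vertex can be handed to a separate worker. The sketch below uses a local thread pool from Python's standard library purely as a stand-in for the distributed platform, which the text does not name; the fixed path query inside `match_region` and the toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data graph: outgoing (edge_label, target) pairs per vertex.
DATA_OUT = {
    1: [("knows", 2), ("livesIn", 3)],
    2: [("livesIn", 3)],
    4: [("knows", 1)],
}

def match_region(start):
    """Match the path query a -knows-> b -livesIn-> c with a fixed to `start`."""
    results = []
    for e1, b in DATA_OUT.get(start, []):
        if e1 != "knows":
            continue
        for e2, c in DATA_OUT.get(b, []):
            if e2 == "livesIn":
                results.append({"a": start, "b": b, "c": c})
    return results

def parallel_query(start_candidates, workers=4):
    """Process one region per initial query vertex and merge the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_region = pool.map(match_region, start_candidates)
        return [m for region in per_region for m in region]

matches = parallel_query([1, 2, 4])
```

No region reads or writes another region's state, so the merge step is a plain concatenation, and a region with no match contributes an empty list.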

Claims

1. A large-scale data parallel query method based on subgraph matching, characterized in that: an adjacency list storage scheme is used, making full use of the topology and attribute information of the graph, so that the subgraph matching process is converted into a containment check on the adjacency information of corresponding vertices between the data adjacency list and the query adjacency list; and the candidate data set is divided into multiple regions, within each of which, in parallel, the query vertex with the fewest candidates is matched first, the query result being obtained in combination with a distributed platform.

2. The method according to claim 1, wherein adjacency lists are built for the data graph and the query graph from the outgoing edges of each vertex and the adjacent vertices those edges connect to, and a vertex of unknown attribute in the query graph is represented by _ (null attribute); matching requires that the adjacency information of a vertex in the query graph be fully contained in the adjacency list information of the corresponding vertex in the data graph, and the null attribute can match any information.

3. The method according to claim 1, wherein the vertices are sorted by Rank value to determine the starting query vertex; the candidate data set is divided into n regions according to the n vertices in the data graph that match the starting query vertex, and the regions can be matched in parallel; during matching, the candidate region size of each vertex adjacent to a known vertex can be determined from that vertex, and the query vertex with the smaller candidate region is selected first for the matching containment check, with redundant or erroneous vertices rejected.

4. The method according to claim 1, wherein, during matching, if the candidate set of some query vertex in a region is pruned to empty, that region has no matching result; if all query vertices have been matched and no candidate set is empty, the query result is returned.

5. The method according to claim 1, wherein, when the query is a star query, the matching result can be obtained by a containment check on the information of the central vertex alone.