CN109992593A - A large-scale data parallel query method based on subgraph matching - Google Patents

A large-scale data parallel query method based on subgraph matching

Info

Publication number
CN109992593A
Authority
CN
China
Prior art keywords
query
vertex
inquiry
data
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910187235.9A
Other languages
Chinese (zh)
Inventor
季雅雯
杨柳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN201910187235.9A
Publication of CN109992593A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

In recent years, with the rapid development of computer networks, the volume of RDF data on the Web has grown sharply, and many large-scale RDF data sets have appeared. Complex network relationships often exist among such massive data, so traditional centralized query schemes cannot obtain query results quickly and accurately. The present invention discloses a large-scale data parallel query method based on subgraph matching. The invention is combined with a distributed platform, primarily to improve the efficiency of data queries over large-scale data sets. First, an adjacency list storage scheme is used for both the data graph and the query graph, making full use of the topology information and attribute information of the graphs, so that the query process is converted into a neighborhood containment judgment process. Then, the matching order selection problem is solved by accurately estimating the number of candidates for the query vertices in each candidate region, which reduces the generation of intermediate results, and the exploration of multiple candidate regions can be carried out in parallel. In this way, the invention can effectively improve query efficiency and obtain accurate query results.

Description

A large-scale data parallel query method based on subgraph matching
Technical field
The invention belongs to the field of data query over large-scale data, and in particular relates to a large-scale data parallel query method based on subgraph matching.
Background art
As the Semantic Web is widely applied in many fields such as unstructured data management, bioinformatics and digital libraries, the volume of RDF data on the Web has grown sharply, and many large-scale RDF data sets have appeared. As of June 2018, the Linked Open Data cloud contained 1,231 open data sets and 16,132 links. Organizing RDF data efficiently and conveniently has become an urgent problem.
RDF is the general-purpose language proposed by the W3C for describing Semantic Web information. RDF data uses the triple as its basic unit. Each triple consists of a subject, a predicate and an object; the relationship between the subject and the object is described by the predicate, and the triple is written <s, p, o>. Compared with relational data, RDF data is typical schema-free data, which makes it difficult to optimize storage according to structural information and leads to low query performance.
RDF triple data can be represented as an RDF graph. An RDF graph can be represented by labeled nodes and edges, where each triple corresponds to a "node-edge-node" subgraph that states the relationship between the things denoted by the subject and the object; this relationship is represented by the predicate. The nodes of the RDF graph are subjects and objects, and the edges are predicates. An RDF graph is defined as a directed graph in which each edge points from the subject of a triple to its object.
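The triple-to-graph correspondence described above can be illustrated with a minimal Python sketch. The sample triples and the function name build_rdf_graph are hypothetical and are not taken from the patent; the sketch only shows that each triple becomes one directed, predicate-labeled edge from subject to object.

```python
from collections import defaultdict

# Hypothetical sample triples (subject, predicate, object); not from the patent.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "csu"),
    ("bob", "worksAt", "csu"),
]

def build_rdf_graph(triples):
    """Return (nodes, out_edges); out_edges[s] lists (predicate, object) pairs."""
    nodes = set()
    out_edges = defaultdict(list)
    for s, p, o in triples:
        nodes.update((s, o))          # subjects and objects are the nodes
        out_edges[s].append((p, o))   # the predicate labels an edge s -> o
    return nodes, dict(out_edges)

nodes, out_edges = build_rdf_graph(triples)
```

A node that appears only as an object, such as "csu" here, has no outgoing edges and therefore no entry in out_edges.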
SPARQL is a query language and data access protocol developed for RDF. It is defined on the RDF data model developed by the W3C and takes triple patterns as the basis of queries. Almost all RDF stores support SPARQL, either directly or through a dedicated SPARQL wrapper.
The RDF graph query problem can be regarded as the subgraph matching problem, i.e. subgraph isomorphism, which is well known to be NP-hard. Many algorithms have been proposed to solve it. Improving the efficiency of subgraph matching in large-scale graphs is very important. A well-designed indexing strategy can both reduce storage overhead and enable efficient matching of RDF subgraphs; during the query process, the efficiency gain brought by a good search algorithm clearly exceeds what an indexing strategy can deliver, effectively reducing the time overhead of RDF data queries and improving computational efficiency.
Summary of the invention
The purpose of the invention is to improve the efficiency of data queries over large-scale data sets. Because today's data is huge in scale and complex network relationships exist among the data, traditional data query schemes cannot solve the query problem quickly and accurately. The invention therefore provides a large-scale data parallel query method based on subgraph matching.
To achieve the purpose of the invention, the method is combined with a distributed platform and uses adjacency list storage to make full use of the topology information and attribute information of the graph. The subgraph matching process and the region exploration process are converted into neighborhood containment judgment processes, and, through matching order selection, subgraph matching queries are performed on each candidate region in parallel. The invention uses a large-scale data parallel query method based on subgraph matching to improve data query efficiency. The detailed process of the method is as follows:
(1) adjacency lists are established for the partitioned subgraphs and for the input query graph;
(2) a rank value is calculated for every vertex in the query graph, and the vertex u with the smallest rank value is selected;
(3) with vertex u as the root node, a query tree is built in a breadth-first manner; if non-tree edges exist (i.e. edges that are not regular tree edges), these edges are retained;
(4) in the data graph G, the candidate set of vertex u is matched, and each vertex v in the candidate set is taken as an initial query point from which the candidate sets are explored layer by layer according to the query tree;
(5) the matching order can be determined during candidate set exploration, with vertices that have smaller candidate sets matched first;
(6) subgraph matching is performed according to the resulting matching order, and redundant or erroneous vertices (those determined to be unable to match any further query vertex) are eliminated by containment judgment on the adjacency information;
(7) if a complete and correct matching result is obtained, the result is returned; if the candidate set of some query vertex is empty, the match fails.
Brief description of the drawings
Fig. 1 is a system structure diagram of the method of the present invention.
Fig. 2 is an algorithm flow chart of the method of the present invention.
Fig. 3 is a query graph of the method of the present invention.
Detailed description of the embodiments
In order to make the purpose, features and advantages of the invention more apparent and easier to understand, the invention is described in further detail below with reference to the basic theory, formulas and drawings, in the order of basic principle, overall process and specific steps.
Before the query itself is carried out, an adjacency list must first be established for the RDF data graph. The format of the adjacency list is as follows:
In the table, each vertex u is represented by an adjacency entry [uID, uLabel, adjList], where uID is the vertex ID, uLabel is the attribute of the vertex, and adjList is a list giving, for each edge, the edge attribute and the attribute of the adjacent vertex connected by that edge. In general, adjList(u) = {(ei.eLabel, ei.nLabel)}, where ei is an edge adjacent to vertex u, eLabel is the label attribute of edge ei, and nLabel is the label attribute of the vertex adjacent to u through edge ei. The query scheme adopted here is briefly introduced next.
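As an illustration of the [uID, uLabel, adjList] entry described above, the following Python sketch builds such a table from a small labeled graph. The class AdjEntry, the function build_adjacency_list and the sample data are hypothetical. The sketch assumes that adjList records outgoing edges only, which follows claim 2 but is otherwise an assumption.

```python
from dataclasses import dataclass, field

@dataclass
class AdjEntry:
    uid: str                                     # vertex ID
    ulabel: str                                  # vertex attribute (label)
    adjlist: list = field(default_factory=list)  # (eLabel, nLabel) pairs

def build_adjacency_list(vertex_labels, edges):
    """Build one entry per vertex from (source, edge_label, target) edges."""
    table = {uid: AdjEntry(uid, label) for uid, label in vertex_labels.items()}
    for src, elabel, dst in edges:
        # Store the edge label and the label of the vertex the edge leads to.
        table[src].adjlist.append((elabel, vertex_labels[dst]))
    return table

# Hypothetical sample data graph; not from the patent.
vertex_labels = {"v1": "Person", "v2": "Person", "v3": "University"}
edges = [("v1", "knows", "v2"), ("v1", "worksAt", "v3"), ("v2", "worksAt", "v3")]
table = build_adjacency_list(vertex_labels, edges)
```

Storing neighbor labels rather than neighbor IDs means a vertex can be tested against a query vertex using only its own entry, which is what makes the containment judgment local.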
An adjacency list is established for the input query graph in the same way as for the data graph; the uID column of the table holds the unique identifier Vi of each query vertex. Note that in most queries the edge attributes are known, so the vertices are discussed case by case: if the attribute value of a vertex is determined, the uLabel column of that vertex holds its attribute value; if the attribute of the vertex is unknown, _ (blank) is used in place of its attribute value. The same applies to the ei.nLabel entries in the adjList column. The query adjacency list is as follows:
Then, in the query graph Q, a rank value is calculated for each vertex, and the vertex with the smallest rank value is chosen. The rank value is calculated as follows:
Rank(u) = freq(G, u) / deg_out(u)
The numerator freq(G, u) is the number of vertices in the data graph G that match vertex u; clearly, the fewer candidate vertices v vertex u has, the better its rank value. The denominator deg_out(u) is the out-degree of the query vertex in the query graph; if the query vertex has no outgoing edge, deg_out(u) is 0 and Rank(u) tends to infinity. The calculation of freq(G, u) is discussed case by case below:
If the attribute value of query vertex u is unknown: if adjList(v) in the data graph adjacency list contains the full content of adjList(u) in the query graph adjacency list, where _ can stand for any content, then vertex v is a candidate matching vertex of u;
If the attribute value of query vertex u is known: the vertices v in the data graph G that have the same attribute value as u are found first, and then it is judged whether adjList(v) contains the full content of adjList(u) in the query graph adjacency list, where _ can stand for any content; if so, vertex v is a candidate matching vertex of u.
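The two cases above, together with the rank value, can be sketched in Python as follows. All names and the sample data are hypothetical. The sketch makes two assumptions: the rank is the candidate count divided by the out-degree, as the surrounding description indicates, and containment only requires that every query entry has at least one matching data entry, which is a simplification that does not enforce distinct matches for repeated entries.

```python
import math

WILDCARD = "_"  # stands for an unknown attribute in the query graph

def label_matches(query_label, data_label):
    return query_label == WILDCARD or query_label == data_label

def contains(data_adj, query_adj):
    """True if every (eLabel, nLabel) of the query has a match in data_adj."""
    return all(
        any(label_matches(qe, de) and label_matches(qn, dn) for de, dn in data_adj)
        for qe, qn in query_adj
    )

def candidates(query_vertex, data_table):
    qlabel, qadj = query_vertex
    return [vid for vid, (dlabel, dadj) in data_table.items()
            if label_matches(qlabel, dlabel) and contains(dadj, qadj)]

def rank(query_vertex, data_table):
    out_degree = len(query_vertex[1])  # adjList holds outgoing edges (assumed)
    if out_degree == 0:
        return math.inf
    return len(candidates(query_vertex, data_table)) / out_degree

# Hypothetical tables: vertex ID -> (uLabel, adjList).
data_table = {
    "v1": ("Person", [("knows", "Person"), ("worksAt", "University")]),
    "v2": ("Person", [("worksAt", "University")]),
    "v3": ("University", []),
}
query = {
    "u1": ("_", [("knows", "Person"), ("worksAt", "_")]),
    "u2": ("Person", [("worksAt", "University")]),
    "u3": ("University", []),
}
start = min(query, key=lambda q: rank(query[q], data_table))
```

With this sample data, u1 has one candidate and two outgoing edges, so it has the smallest rank and is chosen as the starting query vertex.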
Next, the vertex u with the smallest rank value is selected. The previous step has already produced the candidate vertex set of u in the data graph G; each vertex in this set is now taken as an initial query point, and its candidate region is explored separately. After the set of initial query points has been selected, the query graph must first be converted into the corresponding query tree. The tree-building rules are as follows:
(1) the vertex u with the smallest rank value is the root node of the query tree;
(2) the tree is built in a breadth-first manner;
(3) non-tree edges are retained and reflected in the query tree.
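The three tree-building rules can be sketched in Python as below. The function build_query_tree and the sample query edges are hypothetical. The sketch assumes that the breadth-first traversal ignores edge direction so that every vertex of a connected query graph is reached; the patent does not state how direction is handled at this step.

```python
from collections import deque

def build_query_tree(root, edges):
    """Return (level, parent, non_tree_edges) for a BFS tree rooted at root."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)  # direction ignored (assumption)
    level = {root: 0}
    parent = {root: None}
    tree_edges = set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for w in neighbours.get(u, []):
            if w not in level:  # first visit: (u, w) becomes a tree edge
                level[w] = level[u] + 1
                parent[w] = u
                tree_edges.add(frozenset((u, w)))
                queue.append(w)
    # Edges not used by the BFS are kept so they can still be checked later.
    non_tree_edges = [e for e in edges if frozenset(e) not in tree_edges]
    return level, parent, non_tree_edges

# Hypothetical triangle-shaped query graph.
query_edges = [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]
level, parent, non_tree_edges = build_query_tree("u1", query_edges)
```

In this example u2 and u3 both sit on the level below the root, and the edge between them is retained as a non-tree edge.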
Then the candidate sets are explored starting from an initial query point v. The exploration proceeds as follows: starting from an initial query point and following the query tree, for each vertex on the second level of the tree, the vertices in the data graph G that satisfy the adjacent-edge constraints of that query vertex are found and added to its candidate set. After all vertices adjacent to the initial query point have been queried, if the candidate set of some vertex is empty, that vertex cannot be matched, and the initial query point is removed from the set of initial query points. If no candidate set is empty, the candidate sets of the third-level vertices are queried in the same way, starting from the non-leaf nodes of the second level. Subsequent levels are handled likewise, until the candidate sets of all query vertices have been determined and are non-empty.
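The level-by-level exploration can be sketched in Python as follows. The function explore_region, its parameters and the sample graphs are hypothetical. As a simplifying assumption, the adjacent-edge constraint is reduced to a matching edge label from parent to child plus a matching child label, with tree edges treated as directed from parent to child.

```python
WILDCARD = "_"

def explore_region(start, root, children, query_labels, data_labels, data_out):
    """Explore one candidate region; return candidate sets or None on failure.

    children[q] = [(edge_label, child_query_vertex), ...]
    data_out[v] = [(edge_label, target_vertex), ...]
    """
    cand = {root: {start}}
    frontier = [root]
    while frontier:
        next_frontier = []
        for q in frontier:
            for elabel, child in children.get(q, []):
                found = {
                    w
                    for v in cand[q]
                    for el, w in data_out.get(v, [])
                    if el == elabel
                    and query_labels[child] in (WILDCARD, data_labels[w])
                }
                if not found:
                    return None  # an empty candidate set rejects this start
                cand[child] = found
                next_frontier.append(child)
        frontier = next_frontier  # move one level down the query tree
    return cand

# Hypothetical data graph and query tree.
data_labels = {"v1": "Person", "v2": "Person", "v3": "University", "v4": "Person"}
data_out = {
    "v1": [("knows", "v2"), ("worksAt", "v3")],
    "v2": [("worksAt", "v3")],
    "v4": [("knows", "v1")],
}
query_labels = {"u1": "_", "u2": "Person", "u3": "University"}
children = {"u1": [("knows", "u2"), ("worksAt", "u3")]}

region_v1 = explore_region("v1", "u1", children, query_labels, data_labels, data_out)
region_v4 = explore_region("v4", "u1", children, query_labels, data_labels, data_out)
```

Here the region started at v1 yields a non-empty candidate set for every query vertex, while the region started at v4 is rejected because v4 has no worksAt edge.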
After the preceding exploration, the matching candidate set of each query vertex is known. Starting from the initial query point, the matching order is therefore determined layer by layer, from the smallest candidate set to the largest among the query vertices of each layer. Within a candidate region, every vertex must be guaranteed to satisfy the adjacent-edge constraints, so the matches found by exploring the candidate region satisfy not only the structure of the regular tree edges in the query tree but also the non-tree edge structure in the query tree. Note that if, during matching, the candidate set of some query vertex u is pruned to empty, then the candidate region starting from that initial query point contains no qualifying matching result. After all query vertices have been matched, the matching result is returned.
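A minimal Python sketch of the ordering and matching step is given below. The functions matching_order and match_region and the sample data are hypothetical. The sketch substitutes a plain backtracking search for the patent's containment-based pruning, and it assumes that a match must map distinct query vertices to distinct data vertices; every query edge, tree or non-tree, is checked against the data graph.

```python
def matching_order(levels, cand):
    """Order query vertices by tree level, then by candidate-set size."""
    return sorted(cand, key=lambda q: (levels[q], len(cand[q])))

def match_region(order, cand, query_edges, data_edges):
    """Enumerate assignments that satisfy every query edge."""
    results = []

    def extend(i, assignment):
        if i == len(order):
            results.append(dict(assignment))
            return
        q = order[i]
        for v in sorted(cand[q]):
            if v in assignment.values():
                continue  # keep the mapping injective (assumption)
            assignment[q] = v
            ok = all(
                (assignment[a], label, assignment[b]) in data_edges
                for a, label, b in query_edges
                if a in assignment and b in assignment
            )
            if ok:
                extend(i + 1, assignment)
            del assignment[q]

    extend(0, {})
    return results

# Hypothetical region: v5 is a candidate for u2 but lacks the worksAt edge.
data_edges = {
    ("v1", "knows", "v2"), ("v1", "worksAt", "v3"),
    ("v2", "worksAt", "v3"), ("v1", "knows", "v5"),
}
query_edges = [("u1", "knows", "u2"), ("u1", "worksAt", "u3"),
               ("u2", "worksAt", "u3")]  # the last one is a non-tree edge
levels = {"u1": 0, "u2": 1, "u3": 1}
cand = {"u1": {"v1"}, "u2": {"v2", "v5"}, "u3": {"v3"}}

order = matching_order(levels, cand)
matches = match_region(order, cand, query_edges, data_edges)
```

Because u3 has a smaller candidate set than u2, it is matched first; the candidate v5 is then rejected by the non-tree edge between u2 and u3.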
Note that if the query is a star query, a single matching step at the central vertex is enough to obtain the matching data.
This scheme divides the candidate data into multiple regions according to the number of candidates for the initial query point; obtaining the matching order and performing the matching query for each region can be carried out in parallel.
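The per-region parallelism can be sketched in Python as follows. The patent does not name its distributed platform, so a thread pool from the standard library is used here purely as a stand-in. The worker match_star_region is hypothetical and handles only a fixed star query (one knows edge and one worksAt edge leaving the central vertex), which also illustrates that a star query needs just one step per initial query point.

```python
from concurrent.futures import ThreadPoolExecutor

def match_star_region(start_vertex, data_out):
    """Hypothetical worker: match a star query centred on start_vertex."""
    out = data_out.get(start_vertex, [])
    knows = [w for label, w in out if label == "knows"]
    works = [w for label, w in out if label == "worksAt"]
    return [(start_vertex, k, w) for k in knows for w in works]

def parallel_query(start_vertices, data_out, max_workers=4):
    """Run one matching task per candidate region and merge the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_region = pool.map(
            lambda v: match_star_region(v, data_out), start_vertices
        )
        # pool.map yields results in the order of start_vertices.
        return [m for region in per_region for m in region]

# Hypothetical data: v4 has no worksAt edge, so its region yields nothing.
data_out = {
    "v1": [("knows", "v2"), ("worksAt", "v3")],
    "v4": [("knows", "v1")],
}
results = parallel_query(["v1", "v4"], data_out)
```

Since the regions share no mutable state, the tasks need no coordination beyond merging their result lists.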

Claims (5)

1. A large-scale data parallel query method based on subgraph matching, characterized in that an adjacency list storage mode is used to make full use of the topology information and attribute information of the graph, and the subgraph matching process is converted into a containment judgment on the adjacency information of corresponding vertices between the data adjacency list and the query adjacency list; the data candidate set is divided into multiple regions, and, in parallel, the query vertex with the smaller candidate data set is preferentially selected for matching in each region, the query result being obtained in combination with a distributed platform.
2. The method according to claim 1, wherein adjacency lists are established for the data graph and the query graph from the outgoing edges of each vertex and the information of the adjacent vertices connected by those edges, and a vertex with an unknown attribute in the query graph is represented by _ (null attribute). The matching process requires that the known information in the query graph be completely contained in the adjacency list information of the corresponding vertex in the data graph, and the null attribute can match any information.
3. The method according to claim 1, wherein the starting query vertex is determined by sorting according to the Rank value, and the data candidate set is divided into n regions by the n vertices in the data graph that match the query vertex; the regions can be matched in parallel; during matching, the candidate region size of each vertex adjacent to a known vertex can be determined from that vertex, and the query vertex with the smaller candidate region is preferentially selected for matching and containment judgment, with redundant or erroneous vertices rejected.
4. The method according to claim 1, wherein, during matching, if the candidate set of a query vertex in a region is pruned to empty, that region has no matching result; if all query vertices have been matched and no matching candidate set is empty, the query result is returned.
5. The method according to claim 1, wherein, when the query is a star query, the matching result can be obtained solely by a containment judgment on the information of the central vertex.
CN201910187235.9A 2019-03-13 2019-03-13 A large-scale data parallel query method based on subgraph matching Pending CN109992593A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910187235.9A CN109992593A (en) 2019-03-13 2019-03-13 A large-scale data parallel query method based on subgraph matching

Publications (1)

Publication Number Publication Date
CN109992593A (en) 2019-07-09

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110347846A (en) * 2019-07-15 2019-10-18 苏州工业职业技术学院 The non-interconnected knowledge mapping querying method of having time constraint
CN110347846B (en) * 2019-07-15 2023-05-26 苏州工业职业技术学院 Non-connected knowledge graph query method with time constraint
CN110990426A (en) * 2019-12-05 2020-04-10 桂林电子科技大学 RDF query method based on tree search
CN110990426B (en) * 2019-12-05 2022-10-14 桂林电子科技大学 RDF query method based on tree search
CN111309979A (en) * 2020-02-27 2020-06-19 桂林电子科技大学 RDF Top-k query method based on neighbor vector
CN111309979B (en) * 2020-02-27 2022-08-05 桂林电子科技大学 RDF Top-k query method based on neighbor vector

Legal Events

Date Code Title Description
PB01 Publication
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190709