CN105630881A

CN105630881A - Data storage method and query method for RDF (Resource Description Framework)

Info

Publication number: CN105630881A
Application number: CN201510955821.5A
Authority: CN
Inventors: 袁柳; 张鸿洋; 翟梅
Original assignee: Shaanxi Normal University
Current assignee: Shaanxi Normal University
Priority date: 2015-12-18
Filing date: 2015-12-18
Publication date: 2016-06-01
Anticipated expiration: 2035-12-18
Also published as: CN105630881B

Abstract

The invention relates to a data storage method and a query method for RDF (Resource Description Framework) data. The data storage method comprises: designing an entity-oriented RDF storage structure and storage mapping; converting the URIs (Uniform Resource Identifiers) and literals of the RDF data into 64-bit binary data; and storing the 64-bit binary data according to the designed storage structure. The data query method comprises: parsing and converting a SPARQL query statement; estimating the cost of each individual query for the multiple query triples in the SPARQL statement, according to an analysis result of the whole data set and the join relationships among the queries; and finally generating a least-cost query process. The storage method greatly increases the speed of data comparison and reduces storage space. Compared with the conventional approach of directly converting SPARQL into SQL, the query method greatly improves query efficiency. The methods can be used in fields such as Web data management and Web semantic retrieval.

Description

A data storage method and query method for RDF

Technical field

The invention belongs to the technical field of Web data management, and specifically relates to a data storage method and a query method for RDF that reduce the storage space of RDF data and improve the efficiency of SPARQL queries.

Background art

RDF (Resource Description Framework) is a framework proposed by the W3C for describing information on the World Wide Web (WWW); it provides an information description standard for the various applications on the Web. RDF describes resources on the Web as triples of subject S (Subject), predicate P (Predicate), and object O (Object). The subject generally denotes an information entity (or concept) on the Web, identified by a Uniform Resource Identifier (URI); the predicate describes an attribute of the entity; and the object is the corresponding attribute value. This form of expression allows RDF to represent any identifiable information on the Web and to exchange it among applications without losing semantic information. RDF has therefore become the standard for describing semantic data and is widely used to describe metadata, ontologies, and the Semantic Web. As the volume of Semantic Web data keeps growing, building systems that store and query this data efficiently has become essential to the adoption of Semantic Web applications. Since RDF is the descriptive basis of Semantic Web data, the efficient storage and querying of RDF data has become a focus of Semantic Web research. Current storage and optimization approaches for RDF data fall into three main kinds.

First, storage based on a relational database

Because RDF data can be regarded as a set of <Subject, Predicate, Object> triples, the most natural approach is to store the data directly in a triple table. Many RDF storage systems therefore use a relational database directly and design a triple table or a similar scheme to store RDF data. The steps of this method are: (1) parse the RDF data into triples; (2) encode the URIs in the triples with the MD5 (Message-Digest Algorithm 5) hash and take the first 64 bits of the MD5 hash as the resource identifier; (3) store the data in a relational database using a three-column table, and build the corresponding indexes. However, to run a SPARQL query this method must convert the SPARQL query into an SQL query, which requires multiple layers of conversion. Because RDF data differs greatly from relational data, storing RDF data in relational tables also requires mapping operations between tables. Both space utilization and query efficiency are therefore reduced.

Second, storage based on local binary files

RDF documents can be stored in files in a given format, and a large number of RDF documents on the Semantic Web exist in RDF/XML form. RDF data differs considerably in structure from relational data, and its description syntax is much more complex than that of a relational database, but describing resources with RDF offers greater flexibility. Storing RDF documents in files on disk can achieve better storage efficiency while still responding to queries quickly. Several systems with disk-based storage structures have been designed, and they typically rely on the B-tree, B+ tree, and hash table techniques common in databases. However, file-based storage has a relatively high development cost, and because RDF is the descriptive basis of Semantic Web data, a great deal of additional work is needed if query reasoning over the data must also be supported on top of the basic storage structure.

Third, storage based on main memory

With the development of hardware, memory has become cheaper and larger, and building memory-based RDF storage systems has become a focus of recent research. Memory provides very fast access, allows data to be operated on in real time, and avoids disk I/O overhead; a well-designed in-memory RDF storage structure can further improve the efficiency of querying and analysis. However, this approach is not suitable for storing large-scale RDF data, and current schemes such as BRAHMS and BitMat do not support direct SPARQL queries. Memory-based RDF storage structures are thus still being studied and improved.

Summary of the invention

An object of the invention is to overcome the above deficiencies of the prior art and to provide, for RDF educational resources, an RDF data storage method that makes comparison between data fast and reduces storage space.

The invention also provides an RDF data query method that matches the above storage method and supports fast queries, thereby improving the retrieval efficiency of RDF educational resources.

To achieve these objects, the technical solution adopted by the invention is as follows:

The RDF data storage method of the invention consists of the following steps:

(1) Design the entity-oriented storage structure for RDF data

(1.1) Using an entity-oriented approach, the data is stored in a relational database table of n rows and k columns, where k is the average number of predicates over all subjects in the RDF data and n is the total of the row counts (line) required by all subjects. When the number of predicates sum of a single subject satisfies sum ≤ k, the required number of rows is line = 1; when sum > k, the subject is stored across multiple rows, and the required number of rows is line = (sum/k) + 1;

(1.2) Once k is determined, each predicate is converted to a column index according to the predicate mapping algorithm, yielding a table structure of n rows and k columns;

The specific method of converting a predicate into a column index in step (1.2) is:

(1.2.1) Compute the column index with the predicate mapping algorithm, whose formula is:

$$h_1 \oplus h_2 \oplus \cdots \oplus h_j(\mathrm{uri}) = i, \quad i \in [0, k]$$

where h₁, h₂, ..., h_j are the j hash functions and i is the column index;

(1.2.2) If all j hash functions have been evaluated and no free index has been found, a new row is opened and the data is stored at the index computed by h₁.

(2) Design the storage mapping for RDF data

A hash algorithm is used to convert the URIs and literals of the RDF data into 64-bit binary data: a URI takes the high 64 bits of the hash value, and a literal takes the low 64 bits. The converted binary data is stored in a hash index table whose rows are sorted in ascending order, so that a lookup can perform the mapping conversion quickly by binary search;

(3) RDF data storage

After the RDF data has been mapped and converted according to step (2), it is stored a first time in the table structure of step (1). The data stored in the table structure is then analyzed to create an analysis table S, which records the number of triples contained in each Subject and each Object, together with the 20 most frequent URIs and the 20 most frequent literals and their frequencies. Then, again following the table structure of step (1) but with Object as the storage entity, the data stored in the table structure is mapped and converted as in step (2) and stored a second time, which completes the storage of the RDF data.

An RDF data query method matching the above RDF data storage method consists of the following steps:

(a.1) Extraction and conversion of variables

The triple basic graph patterns in the SPARQL query statement are decomposed, and the number of variables in the query statement is determined as count. The URIs and literals in the query statement are converted into 64-bit binary data using the mapping of step (2) of the storage method, and the variables are assigned the values -1 to -count;

(a.2) Conversion of basic graph patterns

According to the decomposition result of step (a.1), each basic graph pattern is converted into a triple query node structure, defined as:

Triple query node structure
{
Id of the node;
Id of the subject;
Id of the predicate;
Id of the object;
Flag of the storage mode;
}

The storage-mode flag selects either the first storage or the second storage of step (3) of the RDF data storage method;

For URIs and literals, the Ids of the subject, predicate, and object are the respective 64-bit binary data; for variables, the Ids of the subject, predicate, and object are the assigned values;

(a.3) Representation of the join operations of the query

The triples decomposed from the basic graph patterns in step (a.1) are compared with one another. For triples that share a variable, a join relationship is established using the node Id of the structure in step (a.2) as the unique identifier, and the join relationship is converted into a join operation edge structure, defined as:

Join operation edge structure
{
Id of the node of the start triple,
Id of the node of the end triple,
Id of the shared variable
}

(a.4) Compute the query cost of each query

Based on the triple query node structures obtained in step (a.2), a cost analysis is performed on each join operation edge structure obtained in step (a.3) according to the cost algorithm, giving the cost value c of the join operation edge structure. The formula of the cost algorithm is:

$$\mathrm{TMC}(t, m, S) \rightarrow c$$

where t is the triple to be queried; m is the first storage or the second storage of step (3) of the RDF data storage method; and S is the analysis table;

(a.5) Generation of the query plan

The cost values c of all join operation edge structures obtained in step (a.4) are sorted in ascending order, giving a node sequence ordered by cost. The node with the smallest c in the sequence is chosen as the start node, and the following nodes in the sequence are then taken in turn: if the variables in a node have not yet been queried, a join query is performed. This continues until the variables in all nodes have been queried, which completes the query of the statement.

The method may further include, after step (a.5), a step (a.6) of establishing a caching mechanism, specifically: a hash operation is performed on the user's query statement over the set of triple query node structures obtained in step (a.2), yielding a hash value. If this value exists in the cache list, the cached result is fetched directly and returned to the user; otherwise, steps (a.3) to (a.5) are carried out, the result is stored on disk, and the corresponding address identifier and the hash value are stored in the cache list.

The RDF data storage method and query method of the invention optimize the storage structure of the data and, for this structure, optimize SPARQL queries, enabling fast retrieval and querying of RDF-based educational resources. Compared with the prior art, the invention has the following advantages:

(1) 64-bit binary data replaces the original storage of URIs and literals, which greatly increases the speed of comparison between data and reduces storage space. The high 64 bits and the low 64 bits of the hash value are taken for URIs and literals respectively, so that a URI and a literal with the same character string can be distinguished. The records of the hash index are sorted, so that a lookup can locate the required record quickly by binary search.

(2) For the storage structure of RDF data, an entity-oriented (entry-oriented) approach is adopted, and the data is stored in two ways at once: with the subject (Subject) as the entity and with the object (Object) as the entity. The former makes querying predicates (Predicate) from a subject (Subject) efficient and avoids the large number of join operations that conventional storage requires at query time; the latter makes queries from predicate (Predicate) to subject (Subject) efficient.

(3) The SPARQL query statement is parsed and converted. For the multiple query triples in the SPARQL statement, the cost of each individual query is estimated according to the analysis result of the whole data set and the join relationships among the queries, and a least-cost query process is finally generated. Compared with the conventional approach of directly converting SPARQL into SQL, query efficiency is significantly improved.

(4) A caching mechanism is added to the query process to cache the data sets that are queried most frequently. The cache list is kept in memory, and each row of the cache list contains a hash value and an address identifier, which improves query efficiency.

(5) The data storage model and query optimization plan proposed by the invention can be extended to fields such as Web data management and Web semantic retrieval, and even to the storage and retrieval of other RDF resource data.

Brief description of the drawings

Fig. 1 is a schematic diagram of the parsing and conversion of SPARQL in step (a.2) of the embodiment.

Fig. 2 illustrates the generation of the query tree from SPARQL in step (a.3) of the embodiment.

Fig. 3 is a schematic diagram of the cache model of step (a.6) of the embodiment.

Detailed description of the embodiments

The invention is further described below with reference to the drawings and an embodiment.

In this embodiment, the RDF data storage method is implemented by the following steps:

(1) Design the entity-oriented storage structure for RDF data

For the storage structure of RDF data, an entity-oriented (entry-oriented) approach is adopted: the data is stored in a relational database table of n rows and k columns, where k is the average number of predicates over all subjects in the RDF data and n is the total of the row counts (line) required by all subjects.

(1.1) Determine the number of columns k and the required number of rows n of the table structure

When the number of predicates (Predicate) sum of a single subject (Subject) satisfies sum ≤ k, the required number of rows is line = 1; when sum > k, multiple row tuples are needed for storage, and the required number of rows is line = (sum/k) + 1;

For example, consider the following data:

(CharlesFlint,born,1850)

(CharlesFlint,died,1934)

(CharlesFlint,founder,IBM)

(LarryPage,born,1973)

(LarryPage,founder,Google)

(LarryPage,board,Google)

(LarryPage,home,PaloAlto)

(Android,developer,Google)

(Android,version,4.1)

(Android,kernel,Linux)

(Android,preceded,4.0)

(Android,graphics,OpenGL)

The storage form is shown in Table 1:

Table 1. Storage table with Object as the entity
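As a rough illustration of step (1.1), the sketch below computes k and the per-subject row counts for the twelve example triples. It assumes that k is rounded down to an integer and that sum/k denotes integer division; the function and variable names are hypothetical and do not come from the patent.

```python
from collections import Counter

# The twelve example triples listed above.
triples = [
    ("CharlesFlint", "born", "1850"),
    ("CharlesFlint", "died", "1934"),
    ("CharlesFlint", "founder", "IBM"),
    ("LarryPage", "born", "1973"),
    ("LarryPage", "founder", "Google"),
    ("LarryPage", "board", "Google"),
    ("LarryPage", "home", "PaloAlto"),
    ("Android", "developer", "Google"),
    ("Android", "version", "4.1"),
    ("Android", "kernel", "Linux"),
    ("Android", "preceded", "4.0"),
    ("Android", "graphics", "OpenGL"),
]


def rows_needed(pred_count: int, k: int) -> int:
    """Rows for one subject: 1 if sum <= k, otherwise (sum / k) + 1.

    Assumption: sum / k is integer division.
    """
    return 1 if pred_count <= k else pred_count // k + 1


# Number of predicates per subject.
pred_counts = Counter(subject for subject, _, _ in triples)
# k: average number of predicates per subject (rounded down, an assumption).
k = sum(pred_counts.values()) // len(pred_counts)
rows = {subject: rows_needed(c, k) for subject, c in pred_counts.items()}
# n: total number of rows over all subjects.
n = sum(rows.values())
```

Under these assumptions the example data gives k = 4, one row each for CharlesFlint and LarryPage, two rows for Android, and n = 4.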

(1.2) Determine the index i at which each predicate (Predicate) is stored

Once k is determined, each predicate is converted to a column index according to the predicate mapping algorithm. When multiple predicates of the same entity are mapped to the same index, this is called a conflict; multiple hash functions must be defined to use as much of the column space as possible and to avoid conflicts. When all of the hash functions have been evaluated and a conflict still exists, one more tuple row is added for storing this Subject. The predicate mapping function is:

$$h_1 \oplus h_2 \oplus \cdots \oplus h_j(\mathrm{uri}) = i, \quad i \in [0, k]$$

where h₁, h₂, ..., h_j are the j hash functions and i is the column index.

If all j hash functions have been evaluated and no free index has been found, a new row is opened and the data is stored at the index computed by h₁.

With reference to Table 1, consider the triples whose Subject is Android. Assume these triples are inserted into the database one by one and that j is set to 2, so that there are two hash functions h₁ and h₂. The process of computing the index of each predicate (pred) is shown in Table 2:

Table 2. Process of computing the predicate indexes

developer: h₁ yields index 1; index 1 is empty, so the predicate is placed there directly.

version: likewise placed at index 2.

kernel: h₁ yields index 1, which is no longer free, meaning a conflict occurs; h₂ is then used and yields index 3, where the predicate is placed.

preceded: h₁ yields index k, where the predicate is placed.

graphics: the indexes 3 and 2 yielded by h₁ and h₂ both conflict, so a new row is created and the predicate is placed in pred₃.
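The placement procedure walked through above can be sketched as follows. This is a minimal illustration, not the patented implementation: the two hash functions are replaced by hypothetical lookup tables that return exactly the indexes quoted in the example, k is assumed to be 4 for the index of preceded, and it is an assumption that only the most recently opened row is probed for free slots.

```python
def place_predicates(predicates, hash_functions):
    """Place the predicates of one subject into rows of column slots.

    Each hash function is tried in order; if every candidate slot in the
    current row is occupied, a new row is opened and the predicate is
    stored at the index given by the first hash function (h1).
    """
    rows = [{}]  # each row maps column index -> predicate
    for pred in predicates:
        current = rows[-1]
        for h in hash_functions:
            index = h(pred)
            if index not in current:
                current[index] = pred
                break
        else:
            # All j hash functions conflicted: open a new row, use h1's index.
            rows.append({hash_functions[0](pred): pred})
    return rows


# Hypothetical stand-ins for h1 and h2, returning the indexes from the example
# (k is assumed to be 4, so "preceded" maps to index 4).
H1 = {"developer": 1, "version": 2, "kernel": 1, "preceded": 4, "graphics": 3}
H2 = {"kernel": 3, "graphics": 2}

rows = place_predicates(
    ["developer", "version", "kernel", "preceded", "graphics"],
    [H1.__getitem__, H2.__getitem__],
)
```

With these stand-in hash functions the first row holds developer, version, kernel, and preceded at indexes 1, 2, 3, and 4, and graphics lands at index 3 of a second row, matching the walkthrough.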

(2) Design the storage mapping for RDF data

The terms in RDF triple data usually fall into two classes: URIs and literals.

A hash algorithm is used to convert URIs and literals into 64-bit binary data: the high 64 bits of the hash value are taken for a URI and the low 64 bits for a literal, so that a URI and a literal with the same character string can be distinguished. The converted binary data is stored in a hash index table whose rows are sorted in ascending order, so that a lookup can perform the mapping conversion quickly by binary search;
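The mapping of step (2) might look like the sketch below. The patent does not name the hash algorithm used by the invention, so MD5 (which yields 128 bits) is an assumption here, as are the function names and the representation of the index table as a sorted list of (id, string) pairs.

```python
import bisect
import hashlib


def to_id(term: str, is_literal: bool) -> int:
    """Map a term to a 64-bit unsigned integer.

    Assumption: a 128-bit MD5 digest is used; a URI takes the high 64 bits
    and a literal takes the low 64 bits.
    """
    digest = hashlib.md5(term.encode("utf-8")).digest()  # 16 bytes
    half = digest[8:] if is_literal else digest[:8]
    return int.from_bytes(half, "big")


def build_index(terms):
    """Build the hash index table as a list of (id, string) sorted by id."""
    return sorted({(to_id(term, lit), term) for term, lit in terms})


def lookup(index, term_id):
    """Binary search for term_id; return the original string or None."""
    pos = bisect.bisect_left(index, (term_id,))
    if pos < len(index) and index[pos][0] == term_id:
        return index[pos][1]
    return None


index = build_index([("Google", False), ("Google", True), ("PaloAlto", True)])
uri_id = to_id("Google", False)
literal_id = to_id("Google", True)
```

Because the two halves of the digest are used, the URI Google and the literal "Google" receive different identifiers even though their character strings are identical.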

(3) RDF data storage

After the RDF data has been mapped and converted according to step (2), it is stored a first time in the table structure of step (1). The data stored in the table structure is then analyzed to create an analysis table S, which records the number of triples contained in each Subject and each Object, together with the 20 most frequent URIs and the 20 most frequent literals and their frequencies. Then, again following the table structure of step (1) but with Object as the storage entity, the data stored in the table structure is mapped and converted as in step (2) and stored a second time, which completes the storage of the RDF data.
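One possible shape for the analysis table S is sketched below. The dictionary layout, the function name, and the is_literal predicate are hypothetical; for readability the sketch counts plain strings rather than the 64-bit identifiers of step (2).

```python
from collections import Counter


def build_analysis_table(triples, is_literal, top_n=20):
    """Collect the statistics described for analysis table S.

    Records the number of triples per Subject and per Object, plus the
    top_n most frequent URIs and literals with their frequencies.
    """
    subject_counts = Counter(s for s, _, _ in triples)
    object_counts = Counter(o for _, _, o in triples)
    uri_freq, literal_freq = Counter(), Counter()
    for triple in triples:
        for term in triple:
            (literal_freq if is_literal(term) else uri_freq)[term] += 1
    return {
        "subject_counts": subject_counts,
        "object_counts": object_counts,
        "top_uris": uri_freq.most_common(top_n),
        "top_literals": literal_freq.most_common(top_n),
    }


sample = [
    ("Android", "developer", "Google"),
    ("Android", "version", "4.1"),
    ("LarryPage", "founder", "Google"),
]
# Hypothetical rule for this sample only: terms starting with a digit are literals.
S = build_analysis_table(sample, is_literal=lambda term: term[0].isdigit())
```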

For the data in Table 1, the storage form is shown in Table 3:

Table 3. The data of Table 1 stored with Object as the entity

An efficient, fast query method for RDF data stored by the above method is implemented by the following steps:

Take as an example a SPARQL statement containing 6 triple basic graph patterns (Basic Graph Pattern, BGP). The SPARQL query statement first needs to be converted so that the underlying stored results can be operated on conveniently. After conversion, the query cost of each triple is estimated, and a least-cost execution flow is finally formed. This is implemented by the following steps:

(a.1) Extraction and conversion of variables

The triple basic graph patterns (Basic Graph Pattern, BGP) of the SPARQL query statement are decomposed, and the number of variables in the query statement is determined as count. The URIs and literals in the query statement are converted into 64-bit binary data using the mapping and conversion of step (2) of the above RDF data storage method, and the variables contained in the query statement are assigned the values -1 to -count;

For example, consider the following query:

SELECT ?x ?y WHERE {
  ?x home "PaloAlto" .    // q1
  ?y founder "IBM" .      // q2
  ?z founder "Google" .   // q3
  ?x memberOf ?z .        // q4
  ?z revenue ?y .         // q5
  ?x developer ?y .       // q6
}

Parsing the above query statement yields three variables, ?x, ?y, and ?z, whose ids are encoded as -1, -2, and -3. The other URIs and literals are looked up directly in the index table of step (2).
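The variable encoding of step (a.1) can be sketched as below. The function name and the tuple representation of the six patterns are hypothetical, and assigning ids in order of first appearance is an assumption that happens to reproduce the -1, -2, -3 encoding quoted above.

```python
def encode_variables(patterns):
    """Assign -1, -2, ..., -count to variables in order of first appearance."""
    var_ids = {}
    for triple in patterns:
        for term in triple:
            if term.startswith("?") and term not in var_ids:
                var_ids[term] = -(len(var_ids) + 1)
    return var_ids


# The six triple patterns q1..q6 of the example query.
bgp = [
    ("?x", "home", '"PaloAlto"'),
    ("?y", "founder", '"IBM"'),
    ("?z", "founder", '"Google"'),
    ("?x", "memberOf", "?z"),
    ("?z", "revenue", "?y"),
    ("?x", "developer", "?y"),
]
var_ids = encode_variables(bgp)
count = len(var_ids)
```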

(a.2) Conversion of basic graph patterns

Referring to Fig. 1, according to the decomposition result of the triple basic graph patterns (Basic Graph Pattern, BGP) in step (a.1), each basic graph pattern is converted into a triple query node structure, defined as:

Triple query node structure
{
Id of the node;
Id of the subject;
Id of the predicate;
Id of the object;
Flag of the storage mode;
}

For URIs and literals, the Ids of the subject, predicate, and object are the respective 64-bit binary data; for variables, the Ids of the subject, predicate, and object are the assigned values;
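A triple query node could be modelled as in the following sketch. The class and field names are hypothetical, the small integer ids stand in for real 64-bit values, and the rule used to pick the storage mode (by object when the subject is a variable, otherwise by subject) is an assumption rather than something the text fixes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripleQueryNode:
    node_id: int
    subject_id: int    # 64-bit id for a constant, negative value for a variable
    predicate_id: int
    object_id: int
    storage_mode: str  # "by_subject" (first storage) or "by_object" (second)


def make_node(node_id, triple, term_id):
    """Build a node from a (subject, predicate, object) pattern.

    term_id maps each term to its id: variables to negative values,
    URIs and literals to their (non-negative) 64-bit ids.
    """
    s, p, o = (term_id(term) for term in triple)
    # Assumption: use the Object-keyed storage when the subject is a variable.
    mode = "by_object" if s < 0 else "by_subject"
    return TripleQueryNode(node_id, s, p, o, mode)


# Toy id table for illustration only; real ids would come from step (2).
ids = {"?x": -1, "?y": -2, "home": 101, "developer": 102,
       "PaloAlto": 201, "Android": 301}
q1 = make_node(1, ("?x", "home", "PaloAlto"), ids.__getitem__)
q_known_subject = make_node(2, ("Android", "developer", "?y"), ids.__getitem__)
```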

The storage-mode flag may select the first storage (access-by-Subject) or the second storage (access-by-Object) of step (3) of the above RDF data storage method. The first storage makes querying predicates (Predicate) from a subject (Subject) efficient and avoids the large number of join operations that conventional storage requires at query time; when the subject is unknown, the second storage can be chosen for the query.

Before any single triple is queried, the number of variables and the number of constants in each triple, and the associations between the variables and constants of the triples, must first be determined; the order of the queries can be decided from these relationships.

(a.3) Representation of the join operations of the query

All triples decomposed from the basic graph patterns in step (a.1) are compared with one another. For triples that share a variable, a join relationship is established using the node Id of the structure in step (a.2) as the unique identifier, and the join relationship is converted into a join operation edge structure, defined as:

Join operation edge structure
{
Id of the node of the start triple,
Id of the node of the end triple,
Id of the shared variable
}

This finally forms the join operation structure shown in Fig. 2.
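The pairwise comparison of step (a.3) can be sketched as follows. The function name and the tuple form of an edge are hypothetical, and the sketch assumes that constants carry non-negative ids so that any negative id denotes a variable. The small positive ids used for constants are placeholders for the 64-bit values.

```python
from itertools import combinations


def build_join_edges(nodes):
    """Return (start_node_id, end_node_id, shared_variable_id) edges.

    nodes maps a node id to its (subject, predicate, object) ids.
    Assumption: variables are the negative ids.
    """
    edges = []
    for (a, triple_a), (b, triple_b) in combinations(sorted(nodes.items()), 2):
        vars_a = {t for t in triple_a if t < 0}
        vars_b = {t for t in triple_b if t < 0}
        for var in sorted(vars_a & vars_b, reverse=True):
            edges.append((a, b, var))
    return edges


# q1..q6 of the example query with ?x = -1, ?y = -2, ?z = -3.
nodes = {
    1: (-1, 10, 20),  # ?x home "PaloAlto"
    2: (-2, 11, 21),  # ?y founder "IBM"
    3: (-3, 11, 22),  # ?z founder "Google"
    4: (-1, 12, -3),  # ?x memberOf ?z
    5: (-3, 13, -2),  # ?z revenue ?y
    6: (-1, 14, -2),  # ?x developer ?y
}
edges = build_join_edges(nodes)
```

For the example query this yields nine edges, for instance (1, 4, -1) for q1 and q4 sharing ?x and (4, 5, -3) for q4 and q5 sharing ?z.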

Through the above conversion and processing of the query statement, the variables are encoded and collected, the basic graph patterns are represented as triples, and the join operations of the query are represented.

(a.4) Compute the query cost of each query

Based on the triple query node structures obtained in step (a.2), a cost analysis is performed on the join operation edge structures obtained in step (a.3) according to the cost algorithm, giving the cost value c of each join operation edge structure. The formula of the cost algorithm is:

$$\mathrm{TMC}(t, m, S) \rightarrow c$$

where t is the triple to be queried; m is the first storage or the second storage of step (3) of the RDF data storage method; and S is the analysis table;

For example:

(?x founder Google)

If access-by-Object is used for this triple, the result of the TMC function is the number of triples contained in the Object, as recorded in the analysis table S.
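A minimal sketch of a TMC-style cost function is given below. Only the access-by-Object case is described in the text; the by-subject branch, the fallback to the total triple count when the key term is a variable, and the layout of S are all assumptions made for illustration.

```python
def tmc(triple, mode, S):
    """Estimate the cost c of one triple pattern from analysis table S.

    mode is "by_object" (second storage) or "by_subject" (first storage).
    Assumptions: a variable key falls back to the total triple count, and
    a constant that is absent from S costs 0.
    """
    subject, _, obj = triple
    key, table = (
        (obj, S["object_counts"])
        if mode == "by_object"
        else (subject, S["subject_counts"])
    )
    if key.startswith("?"):
        return S["total"]
    return table.get(key, 0)


# Counts taken from the twelve example triples of the storage embodiment.
S = {
    "subject_counts": {"CharlesFlint": 3, "LarryPage": 4, "Android": 5},
    "object_counts": {"Google": 3, "IBM": 1, "PaloAlto": 1},
    "total": 12,
}
c = tmc(("?x", "founder", "Google"), "by_object", S)
```

In the example data Google appears as the object of three triples, so this sketch assigns the pattern (?x founder Google) a cost of 3.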

(a.5) Generation of the query plan

The cost values c of all join operation edge structures obtained in step (a.4) are sorted in ascending order, giving a node sequence ordered by cost. The node with the smallest c in the sequence is chosen as the start node, and the following nodes in the sequence are then taken in turn: if the variables in a node have not yet been queried, a join query is performed. This continues until the variables in all nodes have been queried, which completes the query of the statement.

With reference to Fig. 2, the query plan first takes the first triple query node as the starting point, then takes the fourth query node in the query plan structure and, according to the information provided by the query plan, performs a join on variable ?x, obtaining an intermediate result set over the two variables <?x ?z>. This intermediate result set is then joined with the fifth triple query node on variable ?z, obtaining an intermediate table over the three variables <?z ?x ?y>. Proceeding in the same way through all query statements yields the final <?z ?x ?y> intermediate table. Finally, the SELECT operation is applied to the query result to extract the values of variables ?x and ?y.
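One way to read the plan generation of step (a.5) is the greedy ordering sketched below: start from the cheapest node, then repeatedly take the cheapest remaining node that shares a variable with what has already been bound. This is an interpretation, not the patented procedure; the function name is hypothetical and the cost values are made up so that the resulting order follows the Fig. 2 walkthrough.

```python
def make_plan(node_vars, costs):
    """Order nodes for execution as a list of (node_id, join_variables).

    node_vars maps node id -> set of variables; costs maps node id -> c.
    Assumption: prefer the cheapest node sharing a variable with the
    variables bound so far; otherwise take the cheapest remaining node.
    """
    remaining = sorted(node_vars, key=lambda n: (costs[n], n))
    first = remaining.pop(0)
    plan = [(first, set())]
    bound = set(node_vars[first])
    while remaining:
        nxt = next((n for n in remaining if node_vars[n] & bound), remaining[0])
        remaining.remove(nxt)
        plan.append((nxt, node_vars[nxt] & bound))
        bound |= node_vars[nxt]
    return plan


# Variables of q1..q6 from the example query.
node_vars = {
    1: {"?x"}, 2: {"?y"}, 3: {"?z"},
    4: {"?x", "?z"}, 5: {"?z", "?y"}, 6: {"?x", "?y"},
}
# Hypothetical cost values, chosen only for illustration.
costs = {1: 1, 4: 2, 5: 3, 2: 4, 3: 5, 6: 6}
plan = make_plan(node_vars, costs)
```

With these illustrative costs the plan starts at node 1, joins node 4 on ?x, and then joins node 5 on ?z.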

(a.6) Establish a caching mechanism

During data querying, a caching mechanism is established to cache query results (see Fig. 3) and thereby improve query efficiency. The specific operations are:

A hash operation is performed on the user's query statement over the set of triple query node structures obtained in step (a.2), yielding a hash value. If this value exists in the cache list, the cached result is fetched directly and returned to the user; otherwise, steps (a.3) to (a.5) above are carried out, the result is stored on disk, and the corresponding address identifier and the hash value are stored in the cache list. When the cache exceeds its preset capacity, the entries with the lowest query frequency are deleted.
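The cache list of step (a.6) might be organized as in the sketch below. The class name, the use of SHA-1 over a sorted representation of the node set, and the string address identifiers are assumptions; the text only requires a hash of the node set, a stored address identifier, and removal of the least frequently queried entry when capacity is exceeded.

```python
import hashlib


class QueryCache:
    """In-memory cache list: hash of the node set -> [address, hit count]."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    @staticmethod
    def key(nodes):
        # Sort so that the order of the triple patterns does not matter.
        canonical = repr(sorted(nodes)).encode("utf-8")
        return hashlib.sha1(canonical).hexdigest()

    def get(self, nodes):
        entry = self.entries.get(self.key(nodes))
        if entry is None:
            return None  # miss: the caller runs steps (a.3) to (a.5)
        entry[1] += 1
        return entry[0]

    def put(self, nodes, address):
        k = self.key(nodes)
        if k not in self.entries and len(self.entries) >= self.capacity:
            # Evict the entry with the lowest query frequency.
            coldest = min(self.entries, key=lambda e: self.entries[e][1])
            del self.entries[coldest]
        self.entries[k] = [address, 1]


cache = QueryCache(capacity=2)
query_a = [(-1, 10, 20), (-1, 12, -3)]
query_b = [(-2, 11, 21)]
query_c = [(-3, 11, 22)]
cache.put(query_a, "results/a")
cache.put(query_b, "results/b")
hit = cache.get(query_a)         # raises the frequency of query_a
cache.put(query_c, "results/c")  # evicts query_b, the least frequent entry
```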

Claims

1. An RDF data storage method, characterized by consisting of the following steps:

(1) Design the entity-oriented storage structure for RDF data

(2) Design the storage mapping for RDF data

(3) RDF data storage

2. The RDF data storage method according to claim 1, characterized in that the method of converting a predicate into a column index in step (1.2) is:

$$h_1 \oplus h_2 \oplus \cdots \oplus h_j(\mathrm{uri}) = i, \quad i \in [0, k]$$

3. An RDF data query method matching the RDF data storage method of claim 1, characterized by consisting of the following steps:

(a.1) Extraction and conversion of variables

(a.2) Conversion of basic graph patterns

Triple query node structure
{
Id of the node;
Id of the subject;
Id of the predicate;
Id of the object;
Flag of the storage mode;
}

(a.3) Representation of the join operations of the query

Join operation edge structure
{
Id of the node of the start triple,
Id of the node of the end triple,
Id of the shared variable
}

(a.4) Compute the query cost of each query

$$\mathrm{TMC}(t, m, S) \rightarrow c$$

(a.5) Generation of the query plan

4. The RDF data query method according to claim 3, characterized in that step (a.5) is further followed by a step (a.6) of establishing a caching mechanism, specifically:

A hash operation is performed on the user's query statement over the set of triple query node structures obtained in step (a.2), yielding a hash value. If this value exists in the cache list, the cached result is fetched directly and returned to the user; otherwise, steps (a.3) to (a.5) are carried out, the result is stored on disk, and the corresponding address identifier and the hash value are stored in the cache list.