CN104778277A

CN104778277A - Redis-based distributed storage and query method for RDF (Resource Description Framework) data

Info

Publication number: CN104778277A
Application number: CN201510213313.XA
Authority: CN
Inventors: 汪璟玢; 董书暕
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2015-04-30
Filing date: 2015-04-30
Publication date: 2015-07-15

Abstract

The invention relates to a Redis-based distributed storage and query method for RDF (Resource Description Framework) data. The storage method uses a Redis-based distributed RDF storage system that partitions data by Type-P. The query method judges the complexity of the statement to be queried and selects a different query procedure accordingly, so that queries are answered quickly and effectively. By combining the distributed storage system with the optimized query method, the invention effectively narrows the query range and improves query efficiency. It also works efficiently when a query contains many triple patterns with complex semantics, and therefore meets the storage and query requirements of large volumes of RDF data.

Description

A Redis-based distributed storage and query method for RDF data

Technical field

The present invention relates to the field of RDF data storage and query technology, and in particular to a Redis-based distributed storage and query method for RDF data.

Background technology

At present, academia has developed a number of mature RDF (Resource Description Framework) management systems, such as Jena, Sesame and RDF-3X. Most of these systems store RDF data in a traditional centralized relational database: the data are organized in the relational database according to a certain scheme, a mature relational or object-relational database serves as the back-end store, and SPARQL queries are translated into SQL statements for execution.

With the rapid growth of RDF data, traditional RDF storage management systems based on relational databases can no longer cope with the explosively growing volume of RDF data, and more and more researchers have turned to the mass storage and parallel computing capabilities of distributed systems to manage massive RDF data. Distributed RDF storage and query systems usually store data in a distributed file system as files, or in a NoSQL database as multiple index tables. On the query side, they usually use the MapReduce computing framework to process the joins of SPARQL clauses, or implement query processing through the API provided by the database. This has been a research hotspot in the last two years, but the work is still at an early stage and no mature system has emerged. Storing RDF data in a traditional relational database runs into many storage bottlenecks, and the schema-free nature of RDF data makes it difficult to apply the query optimization strategies of a relational DBMS. Existing studies show that, for massive RDF data, relational databases have lower storage and query efficiency than distributed databases; that file-system storage gives very low query efficiency for large-scale RDF data; and that in-memory storage, although fast for both storage and query, is limited by memory capacity and is therefore suitable only for small-scale RDF data.

Summary of the invention

The object of the present invention is to provide a Redis-based distributed storage and query method for RDF data, so as to solve the problem that storage and query on an existing centralized Redis (Remote Dictionary Server) are limited by memory capacity.

To achieve the above object, the technical solution of the present invention is a Redis-based distributed storage method for RDF data, characterized in that it is implemented in the following steps:

S1: Provide a Redis-based distributed RDF storage system comprising a management node (Manage Node), together with processing nodes (Process Node) and storage nodes (Storage Node) that cooperate with the management node. The management node provides the external interface and is responsible for receiving and parsing external RDF data.

S2: In the Redis instance of a storage node (Storage Node), first create, according to the definitions in the RDF ontology, a class database named after the class to which the subject belongs. In that class database, create for each property a property set named after the property, i.e. a predicate set. After the RDF data have been parsed into triples, use the type of the subject and the predicate of each triple to place the subjects that belong to the same class and share the same predicate, without duplicates, into the corresponding predicate set of that class database. For each subject in the predicate set, create an object set named after the subject and its predicate, which holds all objects of that subject under that predicate. Then create a reverse backup for each predicate: for the same predicate, create a reverse predicate set named after the reversed predicate, and place into it, without duplicates, the objects of the parsed triples whose subjects belong to the same class and share that predicate. For each object in the reverse predicate set, create a subject set named after the object and its predicate, which holds all subjects of that object under that predicate.

In an embodiment of the present invention, each storage node (Storage Node) contains one Redis instance, and N class databases can be created in each Redis instance, where N is a positive integer.

In an embodiment of the present invention, the management node (Manage Node) accesses a class database in the Redis instance of a storage node (Storage Node) through that storage node's IP address, port number and class database number.

In an embodiment of the present invention, within the storage system, the processing nodes (Process Node) communicate with the storage nodes (Storage Node) through the API provided by Redis.

In an embodiment of the present invention, within the storage system, the processing nodes (Process Node) and the storage nodes (Storage Node) are in a many-to-many relationship.

Further, a Redis-based distributed query method for RDF data is also provided, characterized in that it comprises the following steps:

S31: The management node (Manage Node) determines the Type corresponding to the query statement. If the class of the subject is known and the predicate is unknown, go to step S32; if the class of the subject is unknown and the predicate is known, go to step S33; if both the class of the subject and the predicate are known, go to step S34.

S32: Search the corresponding class database in the Redis instance of each storage node (Storage Node) of the storage system.

S33: Obtain the domain of each predicate from the ontology file of the query statement, take the intersection of the domains of all predicates as the subject type of the query, thereby converting the query into the case where both the class of the subject and the predicate are known, and go to step S34.

S34: The management node (Manage Node) determines whether the subject or the object of the query statement is known. If either is known, the management node performs the query directly, with time complexity O(1); if both the subject and the object are unknown, go to step S35.

S35: The management node (Manage Node) looks up the registry of processing nodes (ProcessNode), divides the whole query task into as many subtasks as there are processing nodes registered in the registry, and assigns one to each processing node. Each processing node queries the storage node (Storage Node) that corresponds to the query statement in its subtask and, once the query is complete, returns the result set to the management node (Manage Node).
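The branching in steps S31 to S35 can be sketched as a small routing function. This is only an illustrative sketch in Python: the function names `route` and `intersect_domains`, the step labels returned as strings, and the `domain_of` mapping are assumptions made for the example and do not appear in the original text. The case where both the class and the predicate are unknown is not covered by the steps above and is returned as "unspecified".

```python
def intersect_domains(predicates, domain_of):
    """S33: candidate subject classes = intersection of the predicates' domains.

    domain_of is a hypothetical mapping predicate -> set of classes, standing in
    for the domain information read from the ontology file.
    """
    domains = [set(domain_of[p]) for p in predicates]
    return set.intersection(*domains) if domains else set()


def route(class_known, predicate_known, subject_known, object_known):
    """Return the step a query pattern is sent to, following S31 and S34."""
    if class_known and not predicate_known:
        return "S32"             # search the class databases
    if not class_known and predicate_known:
        return "S33"             # infer the class, then continue at S34
    if class_known and predicate_known:
        if subject_known or object_known:
            return "S34-direct"  # single key lookup by the ManageNode
        return "S35"             # split the task across ProcessNodes
    return "unspecified"         # not covered by the steps in the text
```

For example, a pattern with a known class, a known predicate and a known object is answered directly by the management node, while the same pattern with neither subject nor object known is handed to the processing nodes.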

In an embodiment of the present invention, for a query whose subject or object is known, the management node (Manage Node) divides the query into three phases: analyzing the query statement, locating the data set, and executing the query operation.

In an embodiment of the present invention, for query statements with many triple patterns and complex semantics, the management node (Manage Node) divides the query task into multiple subquery tasks and sends them to the processing nodes (ProcessNode) for execution.

In an embodiment of the present invention, for query statements with many triple patterns and complex semantics, the management node (Manage Node) generates the join strategy through a join selection strategy tree (SST). The SST has one root node, the Decision node, which generates the join strategy. The level below the Decision node consists of Pi nodes, one generated for the predicate of each query statement. Each Pi node has two kinds of child node: an Si (subject) node, which contains all subject instances whose predicate is Pi, and an Oi (object) node, which contains all object instances whose predicate is Pi. Every node except the Decision node has its own weight. The symbols are defined as follows: Pi: the i-th P node in the SST; Si: the S child node of the i-th P node; Oi: the O child node of the i-th P node; sj: the j-th s child node under the Si node; oj: the j-th o child node under the Oi node. The weights are computed as follows:

According to the weight value(Si) of the Si node and the weight value(Oi) of the Oi node of each Pi node, the SST obtains the data set to query, specifically through the following steps:

S41: If value(Si) > value(Oi), go to step S42; if value(Si) < value(Oi), go to step S43; if value(Si) = value(Oi), go to either step S42 or step S43 at random.

S42: Store the objects of the query statement as keys and the corresponding subject sets as values in a Map, and set value(Pi) = value(Oi).

S43: Store the subjects of the query statement as keys and the corresponding object sets as values in a Map, and set value(Pi) = value(Si).
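The rule in steps S41 to S43 amounts to loading the side of a predicate that has the smaller weight. The following Python sketch illustrates only that comparison; the function name `choose_side` and the returned labels are assumptions for the example, and the weights are taken as given inputs because the weight formula itself is not reproduced in this text.

```python
import random


def choose_side(value_s, value_o, rng=random):
    """Pick which side of predicate Pi becomes the Map keys, per S41 to S43.

    Returns (key_side, value_pi): the side whose instances are used as keys
    and the weight assigned to Pi, which is the smaller of the two weights.
    """
    if value_s > value_o:
        return "object", value_o   # S42: objects as keys, subject sets as values
    if value_s < value_o:
        return "subject", value_s  # S43: subjects as keys, object sets as values
    # Tie: either branch may be taken; both assign the same weight to Pi.
    return rng.choice([("object", value_o), ("subject", value_s)])
```

Whichever branch is taken on a tie, value(Pi) is the same, so the random choice affects only which side is used as the key.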

Compared with the prior art, the present invention has the following beneficial effects. The proposed Redis-based distributed storage and query method suits the schema-free nature of RDF data: RDF data are stored in the Redis cache as key-value pairs, which gives faster queries than file storage. For simple query statements, query time does not grow with data volume and is close to constant. For complex query statements, the proposed storage method combined with the selection strategy tree (SST) join selection method also gives good query performance. Because of the distributed cluster design, a query need not be concerned with how many actual Redis instances provide the storage behind it, so storage nodes can be scaled out without limit by adding Redis master servers in parallel. The method effectively narrows the query range, improves query efficiency, and also works efficiently when a query has many triple patterns and complex semantics.

Brief description of the drawings

Fig. 1 is a system schematic diagram of the Redis-based distributed RDF storage method of the present invention.

Fig. 2 is a storage schematic diagram of the Redis-based distributed RDF storage system of the present invention.

Fig. 3 is a system schematic diagram of the Redis-based distributed RDF query method of the present invention.

Fig. 4 is a structural diagram of the selection strategy tree (SST) of the present invention.

Detailed description of the embodiments

The technical solution of the present invention is described in detail below with reference to the accompanying drawings.

The invention provides a Redis-based distributed storage method for RDF data, implemented in the following steps:

S1: Provide a Redis-based distributed RDF storage system. As shown in Fig. 1, the storage system comprises a management node (Manage Node), together with processing nodes (Process Node) and storage nodes (Storage Node) that cooperate with the management node. The management node provides the external interface, is responsible for receiving and parsing external RDF data, partitions these RDF data by the class to which the subject of each triple belongs, and stores each resulting class in the corresponding storage node. The management node also provides an external query interface, through which external systems can query the data.

S2: As shown in Fig. 2, in the Redis instance of a storage node (Storage Node), first create, according to the definitions in the RDF ontology, a class database named after the class to which the subject belongs, such as DB_Type_1, DB_Type_n and so on. In that class database, create for each property a property set named after the property, i.e. a predicate set, such as P1_Set in DB_Type_1. After the RDF data have been parsed into triples, use the type of the subject and the predicate of each triple to place the subjects that belong to the same class and share the same predicate, without duplicates, into the corresponding predicate set of that class database. For each subject in the predicate set, create an object set named after the subject and its predicate, which holds all objects of that subject under that predicate, such as S1 in P1_Set and its corresponding S1_P1_Set. Then create a reverse backup for each predicate: for the same predicate, create a reverse predicate set named after the reversed predicate, such as P1_Reverse_Set, and place into it, without duplicates, the objects of the parsed triples whose subjects belong to the same class and share that predicate. For each object in the reverse predicate set, create a subject set named after the object and its predicate, which holds all subjects of that object under that predicate, such as O1 in P1_Reverse_Set and its corresponding O1_P1_Reverse_Set.

With this method, the query range is effectively narrowed and query efficiency is improved, whether the subject or the object of the query statement is known.
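The storage layout of step S2 can be illustrated with a minimal in-memory sketch in Python. It makes several assumptions that are not part of the original text: plain Python sets in a dictionary stand in for the Redis sets of one class database, the helper names `new_class_db` and `store_triple` are hypothetical, and the key names follow the convention used in the query pseudocode of this description (subject + "_" + predicate, and object + "_" + predicate + "_R") rather than the set labels of Fig. 2.

```python
from collections import defaultdict


def new_class_db():
    """One class database: set name -> members (a stand-in for Redis sets)."""
    return defaultdict(set)


def store_triple(dbs, subject_class, s, p, o):
    """Store one (s, p, o) triple in the database of the subject's class."""
    db = dbs.setdefault("DB_" + subject_class, new_class_db())
    db[p].add(s)                    # predicate set: distinct subjects of p
    db[s + "_" + p].add(o)          # object set of subject s under p
    db[p + "_R"].add(o)             # reverse predicate set: distinct objects of p
    db[o + "_" + p + "_R"].add(s)   # subject set of object o under p
    return db
```

Because every set is a true set, storing the same triple twice leaves the layout unchanged, which matches the "without duplicates" requirement of S2. With a real Redis instance, each `add` would correspond to an SADD on the same key.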

Further, as shown in Fig. 1 and Fig. 2, in this embodiment each storage node (Storage Node) contains one Redis instance, and N databases can be created in each Redis instance, where N is a positive integer. The management node (Manage Node) accesses a class database in the Redis instance of a storage node through that storage node's IP address, port number and database number. Throughout the storage system, the processing nodes (Process Node) communicate with the storage nodes (Storage Node) through the API provided by Redis. In this system, the processing nodes and the storage nodes are in a many-to-many relationship.
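Addressing a class database by IP address, port and database number can be pictured as a lookup table from class name to a three-part address. The Python sketch below is illustrative only: the type `ClassDbAddress`, the function `build_registry` and the policy of filling storage nodes in order are assumptions made for the example, since the text does not state how classes are assigned to nodes.

```python
from typing import NamedTuple


class ClassDbAddress(NamedTuple):
    host: str      # IP address of the storage node
    port: int      # port of its Redis instance
    db_index: int  # number of the class database inside that instance


def build_registry(classes, storage_nodes, dbs_per_node):
    """Assign each class to a (host, port, db_index) slot, filling nodes in order.

    storage_nodes is a list of (host, port) pairs; dbs_per_node is the N of the
    text. The fill-in-order policy is an assumption, not taken from the patent.
    """
    if len(classes) > len(storage_nodes) * dbs_per_node:
        raise ValueError("not enough class databases for all classes")
    registry = {}
    for i, cls in enumerate(classes):
        host, port = storage_nodes[i // dbs_per_node]
        registry[cls] = ClassDbAddress(host, port, i % dbs_per_node)
    return registry
```

A management node holding such a table can resolve the class of a query's subject to exactly one Redis database before issuing any command.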

Further, a Redis-based distributed query method for RDF data is also provided. Once the data have been stored in distributed form, the Type of the data set to be queried can be determined from the query statement, which locates the StorageNode holding the data; the predicate then locates the data set on that node, so that the query runs over a reduced data set. As shown in Fig. 3, the method is implemented in the following steps:

S34: The management node (Manage Node) determines whether the subject or the object of the query statement is known. Because the storage method of this embodiment stores key-value pairs, when either the subject or the object is known the management node performs the query directly, with time complexity O(1). If both the subject and the object are unknown, go to step S35.

S35: The management node (Manage Node) looks up the registry of processing nodes (ProcessNode); in this embodiment the registry stores information such as the IP address of each processing node. The management node divides the whole query task into as many subtasks as there are processing nodes registered in the registry, and assigns one to each processing node. Each processing node queries the storage node (Storage Node) that holds the data corresponding to the query statement in its subtask and, once the query is complete, returns the result set to the management node.

Without the storage and query methods proposed in this embodiment, a complex query statement requires finding all related data first and then performing the join. With the proposed storage method and its corresponding query method, a topology linking the management node (Manage Node), each processing node (ProcessNode) and each storage node (Storage Node) is established when the storage system is set up. Each processing node only needs to fetch, from the databases assigned to it, the data required for its own join, perform the join, and finally send its query results to the management node for aggregation. This makes effective use of distributed processing and also relieves the memory pressure on any single machine.

In this embodiment, for a query whose subject or object is known, the management node (ManageNode) divides the query into three phases: analyzing the query statement, locating the data set (i.e. the class database that holds it), and executing the query operation. Take Q1 as an example:

The query steps for Q1 are as follows:

S51: The ManageNode analyzes the query statement and obtains the class of the result set, type: GraduateStudent;

S52: Obtain the predicate, predicate: takesCourse; because the object is known, the predicate actually used is takesCourse_R;

S53: Obtain the StorageNode holding the data set according to type;

S54: The ManageNode executes the query operation.

The specific procedure is as follows:

Begin
    // sparqlQuery: the SPARQL query statement to execute
    // getDataBaseByType(type): returns the database on the StorageNode that holds the given Type
    // Pipeline (pl): sends several commands without waiting for each response; after processing them, the Redis server packs their results and returns them to the client together
    // pl.smembers(key): returns the members of the Redis set stored at key
    type = sparqlQuery.getType();
    dataBase = getDataBaseByType(type);
    predicate = sparqlQuery.getPredicate();
    IF (subject is known)
        key = subject + "_" + predicate;
    ELSE  // object is known
        key = object + "_" + predicate + "_R";
    END IF
    pl = dataBase.getPipeline();
    Set<String> response = pl.smembers(key);
    pl.sync();
End

Here, response holds the result set of the query.
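The key construction used for a query with a known subject or object can be tried out with a small Python sketch. It assumes a plain dictionary of sets in place of a Redis class database; the function name `lookup` and the sample data are hypothetical and serve only to show how the forward key and the reverse ("_R") key are built.

```python
def lookup(db, predicate, subject=None, obj=None):
    """Return the result set of a pattern whose subject or object is known.

    db maps set names to Python sets and stands in for one Redis class database.
    """
    if subject is not None:
        key = subject + "_" + predicate        # object set of the subject
    elif obj is not None:
        key = obj + "_" + predicate + "_R"     # subject set of the object
    else:
        raise ValueError("subject or object must be known")
    return set(db.get(key, set()))             # one hash lookup, as in S34


# Hypothetical sample data for a GraduateStudent class database.
graduate_db = {
    "stu1_takesCourse": {"course0", "course1"},
    "course0_takesCourse_R": {"stu1", "stu2"},
}
students_of_course0 = lookup(graduate_db, "takesCourse", obj="course0")
```

The lookup touches exactly one key, which is why the cost does not depend on how many triples the database holds.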

In this embodiment, for query statements with many triple patterns and complex semantics, the management node (Manage Node) divides the query task into multiple subquery tasks and sends them to the processing nodes (ProcessNode) for execution. Take Q9 as an example:

The query steps for Q9 are as follows:

S61: The ManageNode obtains the sizes of the advisor, advisor_R, takeCourse and takeCourse_R sets in all Student storage nodes and of the teacherOf and teacherOf_R sets in all Faculty storage nodes, and uses them to build the selection strategy tree and generate the join selection strategy;

S62: The ManageNode looks up the ProcessNode registry, divides the whole query task into as many subtasks as there are registered ProcessNodes, packs each subtask together with the information on the StorageNodes holding the data it needs, and sends the package to a registered ProcessNode for computation;

S63: After completing its query, each ProcessNode returns its result set to the ManageNode.

Pseudocode of the ManageNode task division algorithm:

Begin
    // dataBase: the database holding the subject of the first statement in the query plan
    // predicate: the predicate of the first statement in the query plan
    // keySet: the set of subjects shared by the first and second statements in the query plan
    // processNum: the number of ProcessNodes connected to the ManageNode
    // dataMap: the data a ProcessNode needs to execute its subtask
    // separateSet(Set set, int num): splits set into num sets
    // DataPacket: the packet exchanged between the ManageNode and a ProcessNode
    Pipeline pl = dataBase.getPipeline();  // pipeline connected to the database
    keySet = pl.smembers(predicate);
    List<Set<String>> list = separateSet(keySet, processNum);
    FOR (int i = 0; i < processNum; i++)
        dataMap.put("keySet", list.get(i));
        DataPacket dataPacket = new DataPacket(DataPacket.search_type, dataMap);
        objectOutputStream[i].writeObject(dataPacket);
    END FOR
End
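The pseudocode leaves separateSet unspecified beyond "splits set into num sets". One possible way to do this is sketched below in Python; the function name `separate_set`, the round-robin assignment and the sorting step are assumptions chosen so that the chunks are disjoint, cover the whole key set, and differ in size by at most one.

```python
def separate_set(keys, num):
    """Split keys into num disjoint chunks whose sizes differ by at most one."""
    if num <= 0:
        raise ValueError("num must be positive")
    ordered = sorted(keys)  # fixed order so the split is reproducible
    # Round-robin: chunk i takes elements i, i + num, i + 2 * num, ...
    return [set(ordered[i::num]) for i in range(num)]
```

With five keys and two processing nodes this gives one chunk of three keys and one of two. When there are more processing nodes than keys, the surplus chunks are simply empty.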

Pseudocode of the ProcessNode join algorithm:

Begin
    // dataBasei: the database holding data set i
    // predicatei: the predicate of the i-th statement in the query plan
    // pli = dataBasei.getPipeline(): pipeline connected to the i-th database
    FOR (KEY : keySet)
        Set L1 = pl1.smembers(KEY + "_" + predicate1);
        Set L2 = pl2.smembers(KEY + "_" + predicate2);
        FOR (STRING1 : L1)
            FOR (STRING2 : L2)
                IF (pl3.sismember(STRING1 + "_" + predicate3, STRING2))
                    // match found: handle (KEY, STRING1, STRING2) as a result
                END IF
            END FOR
        END FOR
    END FOR
End
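The nested-loop join of a ProcessNode can be exercised on toy data with the following Python sketch. Dictionaries of sets stand in for the three Redis databases, and the function name `join_three` and the sample triples are hypothetical. The loop structure mirrors the pseudocode: for each shared subject, fetch its two object sets and keep the pairs that a membership test against the third pattern confirms.

```python
def join_three(keys, db1, db2, db3, p1, p2, p3):
    """Return (key, a, b) such that key -p1-> a, key -p2-> b and a -p3-> b."""
    results = []
    for key in sorted(keys):
        for a in sorted(db1.get(key + "_" + p1, ())):      # like SMEMBERS
            for b in sorted(db2.get(key + "_" + p2, ())):  # like SMEMBERS
                if b in db3.get(a + "_" + p3, ()):         # like SISMEMBER
                    results.append((key, a, b))
    return results


# Hypothetical data: students with advisors and courses, faculty with courses taught.
student_db = {
    "stu1_advisor": {"prof1"}, "stu1_takesCourse": {"c1", "c2"},
    "stu2_advisor": {"prof2"}, "stu2_takesCourse": {"c1"},
}
faculty_db = {"prof1_teacherOf": {"c1"}, "prof2_teacherOf": {"c3"}}
matches = join_three({"stu1", "stu2"}, student_db, student_db, faculty_db,
                     "advisor", "takesCourse", "teacherOf")
```

Only stu1 qualifies here, because prof1 teaches c1 and stu1 takes c1, whereas prof2 teaches none of the courses stu2 takes.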

Further, in the pattern matching of a BGP (Basic Graph Pattern), a method that reduces the time cost of the query process while keeping the query result correct is called an optimization strategy for the BGP. The SST (SelectivityStrategyTree) join selection strategy analyzes the query statement and obtains, from the corresponding storage nodes, the number of distinct subjects and the number of distinct objects in each relevant predicate set, and from these generates the selection strategy tree. In this embodiment, as shown in Fig. 4, for query statements with many triple patterns and complex semantics, the management node (Manage Node) generates the join strategy through the selection strategy tree (SST). The SST has one root node, the decision node (Decision Node), which is responsible for generating the join strategy. The level below the decision node consists of predicate nodes (Predicate Node), one generated for the predicate of each query statement (excluding type, i.e. the class to which each subject belongs). Each predicate node has two kinds of child node: a subject node (Subject Node), which contains all subjects whose predicate is that of the predicate node, and an object node (Object Node), which contains all objects whose predicate is that of the predicate node. In other words, each Pi node below the Decision node has an Si (subject) node containing all subject instances whose predicate is Pi, and an Oi (object) node containing all object instances whose predicate is Pi. For example, in Fig. 4 the subject node S1 of predicate node P1 contains all subject instances whose predicate is P1, and the object node O1 contains all object instances whose predicate is P1.
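The shape of the tree, and the counts it is generated from, can be sketched in Python as a nested dictionary. This is an assumption-laden illustration: the function name `build_sst`, the dictionary layout and the "_R" key suffix for the reverse predicate set are choices made for the example, and the numbers stored are only the distinct-subject and distinct-object counts mentioned above, not the node weights, whose formula is not reproduced in this text.

```python
def build_sst(predicates, db):
    """Build {"Decision": {predicate: {"S": n_subjects, "O": n_objects}}}.

    db maps set names to Python sets and stands in for the storage nodes;
    with Redis, each count would be the cardinality of the corresponding set.
    """
    tree = {"Decision": {}}
    for p in predicates:
        tree["Decision"][p] = {
            "S": len(db.get(p, ())),         # distinct subjects with predicate p
            "O": len(db.get(p + "_R", ())),  # distinct objects with predicate p
        }
    return tree
```

Each entry under "Decision" plays the role of one Pi node, with its "S" and "O" fields corresponding to the Si and Oi child nodes.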

Further, in this embodiment, every node of the selection strategy tree (SST) except the Decision node has its own weight. The symbols are defined as follows: Pi: the i-th P node in the SST; Si: the S child node of the i-th P node; Oi: the O child node of the i-th P node; sj: the j-th s child node under the Si node; oj: the j-th o child node under the Oi node. The weights are computed as follows:

Here, the Map used in this process is a container of key-value pairs, which guarantees that when the key is known the value is found in O(1) time.

The selection strategy tree (SST) obtains the weights of all predicate nodes (Predicate Node) and sorts them in ascending order. The two query statements corresponding to the predicate nodes with the smallest weights are joined first, and the result is then joined with the next query statement, until all query statements have been joined. In this embodiment, take Q9, the most complex query statement in LUBM, as an example:

Calculating by the above steps gives the corresponding join order: 2->1->3.
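Deriving a join order from predicate-node weights reduces to an ascending sort. The Python sketch below shows only that sorting step; the function name `join_order` and the tie-break by statement number are assumptions, and the weights in the usage note are made-up numbers, not values computed for Q9.

```python
def join_order(weights):
    """Return statement numbers sorted by ascending weight (ties by number).

    weights maps the number of each query statement to the weight of its
    predicate node; statements are joined in the returned order.
    """
    ranked = sorted(weights.items(), key=lambda item: (item[1], item[0]))
    return [number for number, _ in ranked]
```

For instance, with purely illustrative weights {1: 20, 2: 10, 3: 30}, the function returns [2, 1, 3]: statements 2 and 1 are joined first, and their result is then joined with statement 3.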

The above are preferred embodiments of the present invention. Any change made according to the technical solution of the present invention falls within the protection scope of the present invention, provided that the resulting function does not go beyond the scope of the technical solution.

Claims

1. A Redis-based distributed storage method for RDF data, characterized in that it is implemented in the following steps:

S2: In the Redis instance of a storage node (Storage Node), first create, according to the definitions in the RDF ontology, a class database named after the class to which the subject belongs. In that class database, create for each property a property set named after the property, i.e. a predicate set. After the RDF data have been parsed into triples, use the type of the subject and the predicate of each triple to place the subjects that belong to the same class and share the same predicate, without duplicates, into the corresponding predicate set of that class database. For each subject in the predicate set, create an object set named after the subject and its predicate, which holds all objects of that subject under that predicate. Then create a reverse backup for each predicate: for the same predicate, create a reverse predicate set named after the reversed predicate, and place into it, without duplicates, the objects of the parsed triples whose subjects belong to the same class and share that predicate. For each object in the reverse predicate set, create a subject set named after the object and its predicate, which holds all subjects of that object under that predicate.

2. The Redis-based distributed RDF storage method according to claim 1, characterized in that: each storage node (Storage Node) contains one Redis instance, and N class databases can be created in each Redis instance, where N is a positive integer.

3. The Redis-based distributed RDF storage method according to claim 1, characterized in that: the management node (Manage Node) accesses a class database in the Redis instance of a storage node (Storage Node) through that storage node's IP address, port number and class database number.

4. The Redis-based distributed RDF storage method according to claim 1, characterized in that: within the storage system, the processing nodes (Process Node) communicate with the storage nodes (Storage Node) through the API provided by Redis.

5. The Redis-based distributed RDF storage method according to claim 1, characterized in that: within the storage system, the processing nodes (Process Node) and the storage nodes (Storage Node) are in a many-to-many relationship.

6. A Redis-based distributed query method for RDF data, based on the Redis-based distributed RDF storage method according to any one of claims 1 to 5, characterized in that it comprises the following steps:

S35: The management node (Manage Node) looks up the registry of processing nodes (ProcessNode), divides the whole query task into as many subtasks as there are processing nodes registered in the registry, and assigns one to each processing node. Each processing node queries the storage node (Storage Node) that corresponds to the query statement in its subtask and, once the query is complete, returns the result to the management node (Manage Node).

7. The Redis-based distributed RDF query method according to claim 6, characterized in that: for a query whose subject or object is known, the management node (Manage Node) divides the query into three phases: analyzing the query statement, locating the data set, and executing the query operation.

8. The Redis-based distributed RDF query method according to claim 6, characterized in that: for query statements with many triple patterns and complex semantics, the management node (Manage Node) divides the query task into multiple subquery tasks and sends them to the processing nodes (ProcessNode) for execution.

9. The Redis-based distributed RDF query method according to claim 6, characterized in that: for query statements with many triple patterns and complex semantics, the management node (Manage Node) generates the join strategy through a join selection strategy tree (SST). The SST has one root node, the Decision node, which generates the join strategy. The level below the Decision node consists of Pi nodes, one generated for the predicate of each query statement. Each Pi node has two kinds of child node: an Si (subject) node, which contains all subject instances whose predicate is Pi, and an Oi (object) node, which contains all object instances whose predicate is Pi. Every node except the Decision node has its own weight. The symbols are defined as follows: Pi: the i-th P node in the SST; Si: the S child node of the i-th P node; Oi: the O child node of the i-th P node; sj: the j-th s child node under the Si node; oj: the j-th o child node under the Oi node. The weights are computed as follows:

S43: Store the subjects of the query statement as keys and the corresponding object sets as values in a Map, and set value(Pi) = value(Si).