CN109033314B - Real-time query method and system for large-scale knowledge graph under condition of limited memory

Publication number
CN109033314B
Authority
CN
China
Prior art keywords
knowledge graph
index
query
vocabulary
hash list
Legal status
Active
Application number
CN201810787762.9A
Other languages
Chinese (zh)
Other versions
CN109033314A (en
Inventor
王宏志
万晓珑
高宏
Current Assignee
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date
Filing date
Publication date
Application filed by Harbin Institute of Technology
Priority to CN201810787762.9A
Publication of CN109033314A
Application granted
Publication of CN109033314B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of data processing, and provides a real-time query method and system for a large-scale knowledge graph under the condition of limited memory, wherein the method comprises the following steps: processing and analyzing the original knowledge graph to obtain an inverted file hash list; constructing a multi-level structure index based on the original knowledge graph; and analyzing the query statement to obtain a target vocabulary, and searching for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph. The invention greatly improves single-machine knowledge graph query capability, and can provide a result set that meets both the time requirement and the precision requirement of the user when memory is extremely limited.

Description

Real-time query method and system for large-scale knowledge graph under condition of limited memory
Technical Field
The invention relates to the technical field of data processing, in particular to a real-time query method and a real-time query system for a large-scale knowledge graph under the condition of limited memory.
Background
Since its birth, the World Wide Web has grown into a huge network whose nodes are individual web pages related to each other through hyperlinks. Building on the simple, open technology of the World Wide Web, modern search engines can find web pages relevant to a question in this huge network space. However, with the development of the mobile internet and the limited screen space of mobile devices, users expect search engines to return accurate results rather than having to look through search results one by one. Because of this accuracy requirement, storing web pages alone is no longer sufficient.
To address this demand, XML (Extensible Markup Language), RDF (Resource Description Framework), OWL (Web Ontology Language) and the like were proposed for describing information in the network. XML facilitates data exchange by tagging documents and data content; RDF describes the semantic relations of resources in the network in the form of (subject, predicate, object) triples; OWL allows concepts to be described with very strong expressive and interpretative capabilities. Building on these three ways of describing internet information, the concept of the knowledge graph has been proposed in recent years. Entities and entity attributes in web pages are identified and stored in the knowledge graph; when a user initiates a search, the user's intention can be understood from the known nodes in the knowledge graph and an accurate answer can be given.
At present, the main storage and query methods for knowledge graphs in RDF triple form are based on: one large triple table; a plurality of tables clustered by attribute; and a plurality of vertically partitioned tables. The large triple table form stores all triples in one huge three-column table; the main systems using this method are RDF-3x and Hexastore. The attribute-clustered form has two main types of tables: tuple attribute cluster tables and tables of objects with similar attributes. The vertically partitioned form constructs a separate 2-column table for each attribute, storing subjects and objects. RDF storage systems based on these three forms include Jena, Yars2, Sesame 2.0, SW-store, RDF-3x, x-RDF-3x, Hexastore, gStore and the like.
Existing RDF storage and query systems such as Jena, Yars2 and Sesame 2.0 work poorly on larger RDF datasets. SW-store, RDF-3x, x-RDF-3x and Hexastore handle large RDF datasets by using a mapping dictionary, but can only support the fixed SPARQL language. Moreover, most current methods cannot quickly handle online updates of RDF data. For example, in Jena, which is based on attribute-clustered tables, updating the attribute information of data requires re-clustering and reconstructing an attribute table. In the SW-store system, updates are also quite expensive because an update requires many columns to be rewritten; although the "overflow table + batch write" method is used, it is difficult to adopt in applications requiring high real-time performance. In addition, many RDF data tend not to be strictly structured; for example, data of the same type do not all have the same attributes. This non-strict structure facilitates data integration but slows down the data synthesis query process of many classical relational approaches. Although gStore adopts a T-index method to solve some of these problems, the size of the dataset a single machine can support is limited by the T-index structure, and only data management tasks for RDF knowledge graphs on the scale of a billion triples can be supported.
However, as human knowledge keeps growing, the size of knowledge graphs is also increasing, with sizes well in excess of one billion. The computing power of common computing devices cannot keep pace with the growth of knowledge graphs, and it is increasingly difficult for ordinary users to query them. For example, Freebase is about 380 GB, whereas a typical machine currently has about 8 GB of memory, so a large number of I/O operations are generated when an ordinary PC user queries it directly, which greatly wastes the user's time. However, most ordinary users do not need very accurate results and only need the query program to give an approximate solution. With the rise of approximate query processing technology, more and more research results show that in most cases an approximate result can meet the user's requirement, greatly save computing time and reduce the requirement on computing equipment.
Disclosure of Invention
The technical problem to be solved by the present invention is to provide a method and a system for real-time query of a large-scale knowledge graph under the condition of limited memory, aiming at one or more of the above defects in the prior art.
In order to solve the technical problem, the invention provides a real-time query method of a large-scale knowledge graph under the condition of limited memory, which comprises the following steps:
processing and analyzing the original knowledge graph to obtain an inverted file hash list;
constructing a multilevel structure index based on the original knowledge graph;
and analyzing the query statement to obtain a target vocabulary, and searching for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph.
Optionally, the processing and analyzing the original knowledge graph to obtain an inverted file hash list includes:
extracting tuple information in an offset-first, vocabulary-second form from the original knowledge graph;
converting the extracted tuple information into a vocabulary-first, offset-second form;
sorting the tuple information in the vocabulary-first, offset-second form by vocabulary to obtain an inverted file;
and hashing the obtained inverted file to obtain the inverted file hash list.
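The first three steps amount to inverting offset-first records into vocabulary-first records. A minimal Python sketch, assuming each record carries the disk offset of one triple followed by its words; the function name `invert_records` and the sample data are illustrative assumptions, not part of the patent:

```python
from collections import defaultdict

def invert_records(records):
    """Turn (offset, word, ..., word) records into sorted (word, offset, ..., offset) records."""
    postings = defaultdict(list)
    for offset, *words in records:
        for word in set(words):          # record each offset once per word
            postings[word].append(offset)
    # Sorting by word yields the inverted file; offsets are kept ascending.
    return [(word, *sorted(offsets)) for word, offsets in sorted(postings.items())]

# Hypothetical offset-first records: (disk offset, subject, predicate, object)
records = [
    (0, "Alice", "wins", "Award"),
    (18, "Bob", "wins", "Award"),
    (34, "Alice", "knows", "Bob"),
]
inverted = invert_records(records)
```

Each output record starts with a word and lists every offset at which that word occurs, which is the shape the later hashing step consumes.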
Optionally, the constructing a multi-level structure index based on the original knowledge graph includes:
performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a data classification and simplification result of the knowledge graph;
extracting bottom-layer structure nodes based on the data classification and simplification result of the knowledge graph;
and further extracting from the data classification and simplification result of the knowledge graph to realize an upper-level structure index.
Optionally, the step of analyzing the query statement to obtain a target vocabulary, and searching the triple corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph includes:
receiving a query statement Q, a lower limit min of the number of returned tuples, an upper limit max of the number of returned tuples and a sampling ratio input by a user;
analyzing the query statement Q to obtain a vocabulary set needing to be queried;
for each vocabulary in the vocabulary set, searching in parallel in the inverted file hash list for the corresponding disk index sets {S_1, S_2, ..., S_n}, and taking their intersection to obtain a disk index intersection S;
judging whether the length of the disk index intersection S is smaller than the lower limit min of the number of returned tuples:
if so, adding the index and the position information thereof as a node into the result subgraph for any index position in the disk index intersection S;
otherwise, judging whether the length of the disk index intersection S is greater than the upper limit max of the number of returned tuples; if so, setting the sampling number to max; otherwise, setting the sampling number to the product of the length of the disk index intersection S and the sampling ratio, and, if that sampling number is less than the lower limit min of the number of returned tuples, setting the sampling number to min; after the sampling number is determined, semi-randomly sampling the disk index intersection S, using the information of the auxiliary sampling node superNode obtained in step S102; and adding each sampled index, together with its structure subgraph in the multi-level structure index and its position information, to the result subgraph.
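The rule for choosing the sampling number can be written as a small function. This is a sketch under assumptions: the name `sample_count` is hypothetical, the product is truncated to an integer (the text does not say how it is rounded), and `None` is used to signal the case in which every index position is returned without sampling.

```python
def sample_count(intersection_len, min_tuples, max_tuples, ratio):
    """Return how many index positions to sample, or None to return everything."""
    if intersection_len < min_tuples:
        return None                       # small result: every position is returned
    if intersection_len > max_tuples:
        return max_tuples                 # cap very large results at the upper limit
    count = int(intersection_len * ratio) # assumption: truncate the product
    return max(count, min_tuples)         # never sample fewer than the lower limit
```

For example, with min = 10, max = 100 and a sampling ratio of 0.5, an intersection of 80 positions gives a sampling number of 40, while an intersection of 1000 positions is capped at 100.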
The invention also provides a real-time query system for a large-scale knowledge graph under the condition of limited memory, which comprises: a hash list establishing unit, a multi-level index construction unit and a query unit;
the hash list establishing unit is used for processing and analyzing the original knowledge graph to obtain an inverted file hash list;
the multi-level index construction unit is used for constructing a multi-level structure index based on the original knowledge graph;
and the query unit is used for analyzing the query statement to obtain a target vocabulary, and searching for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph.
Optionally, the hash list establishing unit is configured to perform the following steps:
extracting tuple information in an offset-first, vocabulary-second form from the original knowledge graph;
converting the extracted tuple information into a vocabulary-first, offset-second form;
sorting the tuple information in the vocabulary-first, offset-second form by vocabulary to obtain an inverted file;
and hashing the obtained inverted file to obtain the inverted file hash list.
Optionally, the multi-level index building unit is configured to perform the following steps:
performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the original knowledge graph to obtain a data classification and simplification result of the knowledge graph;
extracting bottom-layer structure nodes based on the data classification and simplification result of the knowledge graph;
and further extracting from the data classification and simplification result of the knowledge graph to realize an upper-level structure index.
Optionally, the query unit is configured to perform the following steps:
receiving a query statement Q, a lower limit min of the number of returned tuples, an upper limit max of the number of returned tuples and a sampling ratio input by a user;
analyzing the query statement Q to obtain a vocabulary set needing to be queried;
for each vocabulary in the vocabulary set, searching in parallel in the inverted file hash list for the corresponding disk index sets {S_1, S_2, ..., S_n}, and taking their intersection to obtain a disk index intersection S;
judging whether the length of the disk index intersection S is smaller than the lower limit min of the number of returned tuples:
if so, adding the index and the position information thereof as a node into the result subgraph for any index position in the disk index intersection S;
otherwise, judging whether the length of the disk index intersection S is greater than the upper limit max of the number of returned tuples; if so, setting the sampling number to max; otherwise, setting the sampling number to the product of the length of the disk index intersection S and the sampling ratio, and, if that sampling number is less than the lower limit min of the number of returned tuples, setting the sampling number to min; after the sampling number is determined, semi-randomly sampling the disk index intersection S, using the information of the auxiliary sampling node superNode obtained in step S102; and adding each sampled index, together with its structure subgraph in the multi-level structure index and its position information, to the result subgraph.
The method and system for real-time query of a large-scale knowledge graph under the condition of limited memory provided by the embodiments of the invention have at least the following beneficial effects:
1. The invention balances the user's requirements against the capability of the user's equipment, improves the user's single-machine data processing capability through the inverted index and the structure index, and finds the user's result set in less time.
2. By incorporating approximate query processing technology and adopting ideas from the approximate query processing field, the invention further extracts a subgraph structure after obtaining the large-scale result set specified by the user. This saves the user's query time, reduces the constraint that memory space places on the query engine, and returns a result that the user can quickly understand according to the user's intention.
Drawings
Fig. 1 is a flowchart of a method for querying a large-scale knowledge graph in real time under a condition of limited memory according to an embodiment of the present invention;
FIG. 2 is a schematic diagram in accordance with the principles of the present invention;
FIGS. 3a, 3b and 3c are a bottom layer structure diagram, a bottom layer node and relationship diagram and an upper layer node and relationship diagram, respectively, extracted by the present invention;
Fig. 4 is a schematic diagram of a real-time query system of a large-scale knowledge graph under a limited memory condition according to a fifth embodiment of the present invention;
In the figure: 401: hash list establishing unit; 402: multi-level index construction unit; 403: query unit.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example one
Fig. 1 is a flowchart of a real-time query method of a large-scale knowledge graph under a limited memory condition according to an embodiment of the present invention; fig. 2 is a schematic diagram according to the principles of the present invention. As shown in the figure, the method for querying a large-scale knowledge graph in real time under the condition of limited memory provided by the embodiment of the invention may include the following steps:
Step S101: executing the inverted file hash list establishing step, namely processing and analyzing the original knowledge graph to obtain the inverted file hash list. Because the repetition rate of vocabulary in a non-numerical, very-large-scale knowledge graph is high, inverted files can be used to quickly locate triples by vocabulary, and the inverted files are hashed to speed up vocabulary lookup and reduce file I/O (input/output) operations.
Step S102: executing the multi-level structure index construction step, namely constructing a multi-level structure index based on the original knowledge graph.
Step S103: at query time, answering the query statement according to the inverted file hash list, the multi-level structure index and the original knowledge graph.
The key point of the invention is that the query time and precision requirements of common PC users are realized by using a small memory space, namely: the knowledge graph is queried in real time under the condition that the memory space is limited, so that the conditions that the CPU utilization rate is low, the time consumption for reading files is excessive, and the user waiting time is extremely long due to a large amount of I/O operations generated when the memory space of the user is small are avoided.
For current knowledge graphs based on the RDF structure, the invention adopts a structure extraction method and processes the data vertices hierarchically to obtain a simplified vertex structure. A hash structure is added to the design of the inverted file, so that a tuple can be looked up in O(1) time. Combining the two structures, the user's result set can be found in O(1) time. By incorporating approximate query processing technology and adopting ideas from the approximate query processing field, a subgraph structure is extracted after the large-scale result set specified by the user is obtained. This saves the user's query time, reduces the constraint that memory space places on the query engine, and returns a result that the user can quickly understand according to the user's intention.
The invention overcomes the difficulty of extracting knowledge graph structure on the non-strict structural knowledge graph and the time complexity of operation on a large data set, and can ensure shorter off-line data processing time and on-line searching time.
Example two
On the basis of the real-time query method for the large-scale knowledge graph under the condition of limited memory provided in the first embodiment, the process of processing and analyzing the original knowledge graph to obtain the hash list of the inverted files in step S101 may be specifically implemented in the following manner:
Step 1: extracting tuple information in offset-first, vocabulary-second form from the original knowledge graph. The offset-first, vocabulary-second form refers to the (offset, vocabulary, ..., vocabulary) format; that is, step 1 extracts tuple information in the (offset, vocabulary, ..., vocabulary) format from the original knowledge graph.
Step 2: converting the extracted tuple information into vocabulary-first, offset-second form. The vocabulary-first, offset-second form refers to the (vocabulary, offset, ..., offset) format; that is, step 2 converts the tuple information from the (offset, vocabulary, ..., vocabulary) format into the (vocabulary, offset, ..., offset) format.
Step 3: sorting the tuple information in (vocabulary, offset, ..., offset) form by vocabulary to obtain inverted files.
Step 3 comprises the following sub-steps:
Step 3.1: merging the offset information of repeated vocabularies within adjacent sets of 100,000 tuples;
Step 3.2: sorting in memory in units of 100,000 tuples;
Step 3.3: merge-sorting the resulting files;
Step 3.4: obtaining the sorted (vocabulary, offset, ..., offset) tuples.
Step 4: hashing the obtained inverted files to obtain the inverted file hash list, so as to improve subsequent search efficiency.
The algorithm for constructing the inverted file hash list is shown as Algorithm 1 below; lines 1 to 11 correspond to steps 1 to 3. Lines 1 to 7 extract (v, p, …, p) tuples from the original knowledge graph, namely the large-scale knowledge graph G; the number of tuples extracted into each inverted file does not exceed a preset number Max, and "list. A sorted result set within the Max range is obtained and exported to a file to obtain an inverted file. Lines 8 to 11 obtain the number of sorted inverted files produced in the previous step. Lines 12 to 18 select a hash function and merge all the inverted files to obtain the inverted file hash list fileList.
[Algorithm 1: construction of the inverted file hash list (figures BDA0001734070810000081 and BDA0001734070810000091)]
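Steps 3 and 4 describe an external sort followed by hashing. The following is a minimal Python sketch under stated assumptions, not a transcription of Algorithm 1, which is only available as a figure: sorted runs are written as tab-separated text files, a Python dict stands in for the chosen hash function and hash list, words are assumed to contain no tab characters, and the names `write_sorted_runs`, `read_run` and `build_hash_list` are hypothetical.

```python
import heapq
import itertools
import os
import tempfile

def write_sorted_runs(pairs, max_per_run, directory):
    """Sort (word, offset) pairs in memory, max_per_run at a time, one file per run."""
    paths = []
    iterator = iter(pairs)
    while True:
        chunk = sorted(itertools.islice(iterator, max_per_run))
        if not chunk:
            return paths
        path = os.path.join(directory, f"run{len(paths)}.txt")
        with open(path, "w", encoding="utf-8") as handle:
            handle.writelines(f"{word}\t{offset}\n" for word, offset in chunk)
        paths.append(path)

def read_run(path):
    """Stream (word, offset) pairs back from one sorted run file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word, offset = line.rstrip("\n").split("\t")
            yield word, int(offset)

def build_hash_list(paths):
    """Merge the sorted runs and hash each word to its merged offset list."""
    merged = heapq.merge(*(read_run(path) for path in paths))
    return {word: [offset for _, offset in group]
            for word, group in itertools.groupby(merged, key=lambda pair: pair[0])}

# Hypothetical (word, offset) pairs; the patent uses runs of 100,000 tuples.
pairs = [("wins", 0), ("Award", 0), ("Alice", 0),
         ("wins", 18), ("Award", 18), ("Bob", 18)]
with tempfile.TemporaryDirectory() as directory:
    runs = write_sorted_runs(pairs, max_per_run=4, directory=directory)
    file_list = build_hash_list(runs)
```

Only one run is held in memory while sorting, and `heapq.merge` streams the runs during the merge, so peak memory is bounded by the run size rather than by the size of the knowledge graph.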
Example three
On the basis of the real-time query method for the large-scale knowledge graph under the condition of limited memory provided in the second embodiment, the process of constructing the multilevel structure index based on the original knowledge graph in step S102 may be specifically implemented in the following manner:
The invention performs ontology layer separation on the original knowledge graph to obtain a preliminary structure discovery result, and then constructs the multi-level index structure in three parts: knowledge graph structure depth analysis, knowledge graph storage node index establishment and overall structure index establishment.
(1) Knowledge graph structure depth analysis: performing data classification, cleaning and simplified data representation on the preliminary structure discovery result of the knowledge graph to obtain the data classification and simplification result of the knowledge graph. The simplified data representation is a conversion of the original RDF knowledge graph: the original knowledge graph contains much redundant information, which is deleted here.
(2) Knowledge graph storage node index establishment: extracting the disk position at which each vertex in the original knowledge graph first appears (in principle, RDF triples with the same subject are stored adjacently on disk), and sorting the disk position tuples by the size relationship between nodes using quicksort to obtain the knowledge graph storage node index; here a node is a disk position.
(3) Overall structure index establishment: further extracting the disk positions of the data classification and simplification result of the knowledge graph to realize the upper-level structure index, and then combining the knowledge graph storage node index with the upper-level structure index to obtain the multi-level structure index.
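Part (2) can be illustrated with a short Python sketch. It assumes the input is a list of (disk offset, (subject, predicate, object)) pairs and treats each subject as a vertex; the name `build_storage_node_index` is hypothetical, and Python's built-in `sorted` (Timsort) is used here in place of the quicksort named in the text.

```python
def build_storage_node_index(triples_with_offsets):
    """Map each subject to the first disk offset at which it occurs, sorted by offset."""
    first_seen = {}
    for offset, (subject, _predicate, _obj) in triples_with_offsets:
        # Keep the smallest offset seen so far for this subject.
        first_seen[subject] = min(offset, first_seen.get(subject, offset))
    return sorted(first_seen.items(), key=lambda item: item[1])

# Hypothetical sample data
data = [
    (0, ("Alice", "wins", "Award")),
    (18, ("Bob", "wins", "Award")),
    (34, ("Alice", "knows", "Bob")),
]
node_index = build_storage_node_index(data)
```

Because triples with the same subject are stored adjacently, the first offset of a subject is enough to locate all of its triples with one sequential read.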
Knowledge graphs developed from ontology languages possess basic ontology concepts, including collections of real-world objects and collections of relationships between real-world objects. Such a knowledge graph can easily be divided into an ontology (concept) layer and a fact (object) layer. Clearly, in a large-scale knowledge graph the ontology layer possesses many instances in the fact layer. Using this characteristic, the invention can readily extract the ontology layer of the knowledge graph with data mining techniques, separate the ontology layer from the fact layer, and complete the construction of the multi-level structure index. The invention realizes automatic layering of the knowledge graph using a bottom-up method. An important step must be completed before layering: the knowledge graph cleaning operation, which reduces redundant information in the knowledge graph using a certain coding rule and, at the same time, extracts the leaf nodes of the knowledge graph and their disk position information as the bottom structure of the multi-level index. The bottom nodes are then used to extract the relationship information between bottom nodes and the information of the upper nodes, further separating the knowledge graph. For example, starting from the bottom structure shown in fig. 3a, the next loop iteration obtains the node relationship information of this level (as shown in fig. 3b), together with the node information of the level above and the relationship information between the upper and lower levels (as shown in fig. 3c).
In an embodiment of the present invention, the building process of the multi-level structure index specifically includes the following steps:
Step 1: extracting the bottom-layer structure nodes of the large-scale knowledge graph G, specifically: traversing each triple in the large-scale knowledge graph G and judging whether the object of the triple is a leaf node; if so, adding the subject and position information of the triple to the set N_0, and adding the subject and position information of the triple to the multi-level structure index as a node; otherwise, adding the subject and position information of the triple to the set N_1.
Step 2: constructing the upper-level node index of the current level's nodes and the association relationship information of the current level's nodes, specifically:
while the set N_1 is not empty, letting S_0 = N_0 and S_1 = N_1 and resetting N_0 and N_1 to empty sets; traversing the set S_1 and, for the current (triple, position):
if the object of the triple is in the set S_0 and the subject is not in the set S_0, extracting the information (triple subject, position) and adding it to the set N_0, and extracting the information (triple subject, position, position of the triple object in the set S_0) and adding it to the multi-level structure index;
if the object of the triple is in the set S_0 and the subject is in the set S_0, extracting the information (position of the triple subject in the set S_0, position of the triple object in the set S_0) and adding it to the multi-level structure index;
otherwise, adding the (triple, position) of the triple to the set N_1.
Step 3: extracting the high-level nodes (high-level node information) in the multi-level structure index.
Algorithm 2 below shows in detail how the multi-level structure index is extracted by an automated knowledge graph hierarchy mining method. Lines 1 to 7 of the algorithm extract the bottom-layer structure nodes of the large-scale knowledge graph G. The algorithm then builds the structure index gradually in the loop that follows. Lines 11 to 13 construct the upper-level node index of the current node level and the index relationship between the two levels. Lines 14 and 15 construct the association relationship information of the current level's nodes. Note that, in order to establish the association between the hierarchical index and the inverted file hash list, both store nodes as key-value pairs, where the "key" is the position of each node on disk and the "value" is the information needed by the various algorithms. The process is a loop: N_0 represents the extracted lower-level nodes and N_1 represents the extracted upper-level nodes, and they are assigned to S_0 and S_1 in the next iteration. Finally, line 18 extracts a superNode (high-level node information) from the resulting structure index to serve the subsequent search algorithm.
[Algorithm 2: construction of the multi-level structure index (figure BDA0001734070810000121)]
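The layering loop of steps 1 and 2 can be sketched in Python as follows. This is one reading of the text rather than a transcription of Algorithm 2, which is only available as a figure. Assumptions: the input is a list of (position, (subject, predicate, object)) pairs, a leaf is an object that never occurs as a subject, N_1 holds (triple, position) pairs, the loop stops when a pass promotes no new node (a guard added here so that the sketch terminates on cyclic data), and step 3 (superNode extraction) is omitted because the text does not describe how it is done. All function and variable names are hypothetical.

```python
def build_multilevel_index(triples_with_positions):
    """Layer subjects bottom-up from a list of (position, triple) pairs.

    Returns (levels, cross_edges, same_level_edges).
    """
    subjects = {triple[0] for _, triple in triples_with_positions}
    n0, n1 = {}, []
    for pos, triple in triples_with_positions:
        subject, _predicate, obj = triple
        if obj not in subjects:                 # object is a leaf node
            n0.setdefault(subject, pos)         # bottom-layer node
        else:
            n1.append((triple, pos))
    levels, cross_edges, same_level_edges = [dict(n0)], [], []
    while n1:
        s0, s1, n0, n1 = n0, n1, {}, []
        for triple, pos in s1:
            subject, _predicate, obj = triple
            if obj in s0 and subject not in s0:
                n0.setdefault(subject, pos)     # subject becomes an upper-level node
                cross_edges.append((subject, pos, s0[obj]))
            elif obj in s0:
                same_level_edges.append((s0[subject], s0[obj]))
            else:
                n1.append((triple, pos))        # retry in a later pass
        if not n0:                              # assumption: stop when no new level forms
            break
        levels.append(dict(n0))
    return levels, cross_edges, same_level_edges

# Hypothetical sample data
graph = [
    (0, ("Alice", "name", "A")),
    (10, ("Bob", "name", "B")),
    (20, ("Alice", "knows", "Bob")),
    (30, ("Team", "member", "Alice")),
]
levels, cross_edges, same_level_edges = build_multilevel_index(graph)
```

In this example Alice and Bob form the bottom layer, Team is promoted to the layer above, and the edges are recorded by disk position so that they can be joined with the inverted file hash list.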
Example four
On the basis of the real-time query method for a large-scale knowledge graph under limited memory provided in the third embodiment, the process in step S103 of parsing the query statement to obtain the target vocabulary, and searching for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph, is specifically implemented as follows:
Step 1: receive a query statement Q, a lower limit min on the number of returned tuples, an upper limit max on the number of returned tuples, and a sampling ratio, all input by the user;
Step 2: parse the query statement Q to obtain the vocabulary set to be queried;
Step 3: for each vocabulary item in the vocabulary set, look up the corresponding disk index set in the inverted file hash list fileList in parallel, giving the sets {S1, S2, ……, Sn}, and intersect them to obtain the disk index intersection S, where n is the number of vocabulary items in the vocabulary set;
Step 4: determine whether the length of the disk index intersection S is smaller than the lower limit min on the number of returned tuples:
if so, for each index position in the disk index intersection S, add the index and its position information to the result subgraph as a node;
otherwise, determine whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples. If so, set the sampling number to max; otherwise, set the sampling number to the product of the length of the disk index intersection S and the sampling ratio, and, if this sampling number is smaller than the lower limit min, set the sampling number to min. After the sampling number is determined, semi-randomly sample the disk index intersection S, using the auxiliary sampling node information superNode obtained in step S102. For each index obtained by sampling, add its structure subgraph in the multi-level structure index and its position information to the result subgraph.
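Steps 3 and 4 can be sketched as follows. This is a minimal illustration, assuming the inverted file hash list behaves like a dictionary from vocabulary item to a list of disk positions; the function names are hypothetical, and rounding the product down to an integer is an assumption, since the text does not say how a fractional sampling number is handled.

```python
import math


def intersect_positions(words, hash_list):
    """Step 3: look up each word and intersect the disk index sets S1..Sn."""
    sets = [set(hash_list.get(w, ())) for w in words]
    return set.intersection(*sets) if sets else set()


def sample_count(length, lo, hi, ratio):
    """Step 4: how many indexes to return for an intersection of this length."""
    if length < lo:
        return length                     # small result: keep every index, no sampling
    if length > hi:
        return hi                         # oversized result: cap at the upper limit
    # Otherwise use the ratio, but never drop below the lower limit.
    return max(lo, math.floor(length * ratio))


hash_list = {"Award": [0, 64, 128], "winner": [64, 128, 256]}
s = intersect_positions(["Award", "winner"], hash_list)
```

The sampling itself is left out here; only the intersection and the size decision are shown.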
The present invention serves searches using the results of steps S101 and S102. The inverted file hash list is used to find the positions of the tuples the user is looking for, and the multi-level index is used to find the structure of adjacent vertices, which enables fast structure queries under limited memory. However, the result obtained from the inverted file hash list alone is far from sufficient, and several problems remain in the search process, for example whether the query entered by the user is precise, and how to handle a huge result set returned by the inverted file hash list. The present invention places no restriction on the queries a user may enter, and it is precisely this lack of restriction that allows queries to be imprecise. An imprecise query yields a widely distributed result set, and even with the inverted file hash list and the multi-level index the user's memory may be unable to process the query results. For a large-scale knowledge graph, the present invention can give an accurate and complete result within a reasonable time provided that the query entered by the user is highly precise, for example a query for an upper-level node term such as 'Award winner'. But what if the user wishes to look up the word 'Award'? In this case, even if the returned result (the result that the user's memory must process) contains no information about adjacent nodes, its size is several times or even dozens of times the memory that the user can provide to the search algorithm, to say nothing of the descriptors in a SPARQL statement.
Moreover, users issue query statements very frequently, and a user who knows little about the content being queried can only explore the knowledge graph from macro to micro. In short, in most cases the user cannot provide a precise query statement. Therefore, to achieve efficient querying in this situation, handling the problems caused by imprecise query statements is the main problem that the query method of the present invention must overcome.
To address the problems above, and to balance the accuracy of user queries against query time requirements, the search method of the present invention incorporates some concepts from approximate query processing, namely, the user is given a semi-accurate result when searching. From a certain point of view, the search algorithm of the present invention is very close to online querying in approximate query processing. In an approximate query processing system, however, the user's query targets the entire data set, so the user must specify the sampling rate. For each query, no matter how the sampling is pushed down to rewrite the query statement, a sampler must in fact be used in an approximate query processing system, and guaranteeing precision becomes increasingly complex and difficult as the sampler is pushed further down.
In the present invention, because of the inverted file hash list structure, not all queries require sampling, which guarantees that a portion of the queries are answered exactly. For fuzzy queries, the user specifies a result subgraph size interval [Min, Max] and an expected sampling ratio E, and a semi-random sampler provides the precision guarantee. When the size of the result set obtained from the inverted file hash list is smaller than or equal to Min, the result need not be sampled: the subgraph structure is synthesized directly from the triple positions found and sent to the user for viewing. When the result size exceeds Max, the result set is semi-randomly sampled at a sampling rate of Max ÷ length(results). When the result set size lies within the interval specified by the user, the result set is semi-randomly sampled at the user's expected sampling ratio E; if the resulting sample would be smaller than Min, sampling is instead performed at an actual sampling ratio of Min ÷ length(results), and otherwise the actual sampling ratio A equals the expected sampling ratio E. The result set size range [Min, Max] specified by the user is therefore absolute, and the algorithm works strictly within that interval. The expected sampling ratio E, however, may be adjusted according to the actual situation, and the algorithm finally returns the actual sampling ratio A. It is also worth mentioning that the precision guarantee is a very important measure in approximate query processing. The present invention uses a semi-random sampling function to safeguard the accuracy of the result. "Semi-random" means that the superNode obtained in the previous step is used to retain upper-level nodes during sampling.
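The relationship between the expected ratio E and the actual ratio A, and the idea of retaining upper-level nodes while sampling, can be illustrated with a short sketch. The function names are hypothetical, and the way superNode positions are kept (retained first, with the remainder drawn uniformly at random) is an assumption; the text only states that the superNode is used to retain upper-level nodes.

```python
import random


def actual_ratio(length, lo, hi, expected):
    """Return the actual sampling ratio A for a result set of the given length."""
    if length <= lo:
        return 1.0               # at most Min results: no sampling needed
    if length > hi:
        return hi / length       # more than Max results: sample down to Max
    if length * expected < lo:
        return lo / length       # E would leave fewer than Min results: raise the ratio
    return expected              # E can be used unchanged, so A equals E


def semi_random_sample(positions, k, super_nodes, rng=None):
    """Keep up to k positions, retaining superNode positions before random ones."""
    rng = rng or random.Random()
    ordered = sorted(positions)
    kept = [p for p in ordered if p in super_nodes][:k]
    rest = [p for p in ordered if p not in super_nodes]
    kept += rng.sample(rest, min(k - len(kept), len(rest)))
    return kept


sample = semi_random_sample(range(10), 4, {2, 7}, random.Random(0))
```

With these definitions, `actual_ratio(8, 5, 20, 0.5)` gives 0.625, because applying E = 0.5 to 8 results would leave only 4, fewer than Min = 5.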
The pseudocode for implementing step S103 is shown in Algorithm 3 below. In line 1, the query statement input by the user is parsed to find the query target vocabulary. Then, in lines 2-6, the inverted file hash list obtained by Algorithm 2 and the target vocabulary obtained in the previous step are used to locate all triples related to the user's query and to determine how the results are distributed. Since the user does not know whether the specified query statement is precise, the size of the query result cannot be predicted accurately. To ensure that the result set fits the user's memory and that query efficiency is maintained, the present invention requires the user to give the result set size range [Min, Max] and the expected sampling rate E before each query. Thus, in line 7, the result subset size range is set to [Min, Max]. In addition, based on the result distribution positions obtained from the inverted index and the size of the result set, line 9 determines whether to sample and in what manner. A subgraph structure is then constructed in lines 10-11 and 20-21. It is worth explaining that G* is a structure that automatically adjusts the subgraph structure each time a new node is added; G* can adjust the result set automatically according to the hierarchical index. The multi-level index is stored in a key-value structure, which means that the time complexity of extracting the multi-level hierarchical index is O(1), so constructing the subgraph structure G* in O(1) time per lookup is clearly feasible.
Figure BDA0001734070810000161
Figure BDA0001734070810000171
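The constant-time argument for G* rests on the key-value storage of the multi-level index. A minimal sketch of that idea follows; the class `ResultSubgraph` and its fields are hypothetical stand-ins and not the structure defined in Algorithm 3, and the index is assumed to map each disk position to the position of its upper-level node.

```python
class ResultSubgraph:
    """Hypothetical stand-in for G*: links each added node to its upper-level node."""

    def __init__(self, parent_index):
        self.parent_index = parent_index   # key: disk position, value: upper-node position
        self.nodes = set()
        self.edges = set()

    def add(self, position):
        """Add a node and attach it to its upper-level node, if one is indexed."""
        self.nodes.add(position)
        parent = self.parent_index.get(position)   # average-case O(1) dictionary lookup
        if parent is not None:
            self.nodes.add(parent)
            self.edges.add((position, parent))


parent_index = {64: 1000, 128: 1000, 1000: None}
g = ResultSubgraph(parent_index)
for pos in (64, 128):
    g.add(pos)
```

Each call to `add` performs a fixed number of hash-table operations, so the cost of growing the subgraph is proportional to the number of sampled positions rather than to the size of the knowledge graph.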
Example five
As shown in FIG. 4, the real-time query system for a large-scale knowledge graph under limited memory according to the fifth embodiment of the present invention may include: a hash list establishing unit 401, a multi-level index building unit 402 and a query unit 403.
The hash list establishing unit 401 is configured to process and analyze the original knowledge graph to obtain an inverted file hash list. The operation performed by the hash list establishing unit 401 is the same as step S101 in the foregoing method.
The multi-level index building unit 402 is configured to build a multi-level structure index based on the original knowledge graph. The operation performed by the multi-level index building unit 402 is the same as step S102 in the foregoing method.
The query unit 403 is configured to parse the query statement to obtain a target vocabulary, and to search for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph. The operation performed by the query unit 403 is the same as step S103 in the foregoing method.
Preferably, the hash list establishing unit 401 is configured to perform the following steps:
extracting tuple information in the form of offset first and then vocabulary from the original knowledge graph;
converting the extracted tuple information into a form of vocabulary firstly and then offset;
sorting the tuple information in the form of first vocabularies and then offsets according to the vocabularies to obtain an inverted file;
and carrying out hash processing on the obtained inverted file to obtain an inverted file hash list.
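The four steps performed by the hash list establishing unit can be sketched as follows. This is an illustration only: it assumes each tuple is available as an (offset, words) pair, where the offset is the tuple's position on disk, and it uses a Python dictionary as a stand-in for the hash list. The function name and input layout are hypothetical.

```python
from collections import defaultdict


def build_inverted_hash_list(tuples_with_offsets):
    """Build a word -> [offsets] mapping from (offset, words) records."""
    # Steps 1-2: extract (offset, word) information and flip it to (word, offset).
    pairs = [(word, offset)
             for offset, words in tuples_with_offsets
             for word in words]
    # Step 3: sort by vocabulary to obtain the inverted file.
    pairs.sort()
    # Step 4: hash the inverted file; a dict plays the role of the hash list.
    hash_list = defaultdict(list)
    for word, offset in pairs:
        hash_list[word].append(offset)
    return dict(hash_list)


records = [(0, ["Alice", "won", "Award"]), (64, ["Bob", "won", "Award"])]
index = build_inverted_hash_list(records)
```

Here `index["Award"]` is `[0, 64]`, so a query for the word "Award" reaches both tuples by disk position without scanning the graph.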
Preferably, the multi-level index building unit 402 is configured to perform the following steps:
carrying out data classification, cleaning and simplified data representation on the primary structure discovery result of the original knowledge graph to obtain a data classification and simplification result of the knowledge graph;
extracting bottom layer structure nodes based on the data classification and simplification results of the knowledge graph;
and further extracting from the data classification and simplification result of the knowledge graph to build the upper-level structure index.
Preferably, the querying unit 403 is configured to perform the following steps:
receiving a query statement Q, a lower limit min of the number of returned tuples, an upper limit max of the number of returned tuples and a sampling ratio input by a user;
analyzing the query statement Q to obtain a vocabulary set needing to be queried;
for each vocabulary item in the vocabulary set, looking up the corresponding disk index set in the inverted file hash list D in parallel, giving the sets {S1, S2, ……, Sn}, and intersecting them to obtain the disk index intersection S;
determining whether the length of the disk index intersection S is smaller than the lower limit min on the number of returned tuples:
if so, for each index position in the disk index intersection S, adding the index and its position information to the result subgraph as a node;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max on the number of returned tuples; if so, setting the sampling number to max; otherwise, setting the sampling number to the product of the length of the disk index intersection S and the sampling ratio, and, if this sampling number is smaller than the lower limit min, setting the sampling number to min. After the sampling number is determined, the disk index intersection S is semi-randomly sampled, using the auxiliary sampling node information superNode obtained in step S102. For each index obtained by sampling, its structure subgraph in the multi-level structure index and its position information are added to the result subgraph.
It should be noted that the real-time query system for a large-scale knowledge graph under limited memory provided by the embodiment of the present invention may be implemented in software, in hardware, or in a combination of hardware and software. Taking a software implementation as an example, the system shown in FIG. 4 is, in a logical sense, formed by the CPU of the device on which it resides reading the corresponding computer program instructions from nonvolatile memory into memory and running them.
In conclusion, compared with the prior art, the present invention greatly improves single-machine knowledge graph query capability and can provide a result set that meets the user's time and precision requirements under extremely limited memory. Existing knowledge graph query systems mainly provide complete query processing capability; they overlook the needs of individual users for knowledge graph queries in the current era of knowledge explosion, and they consume a large amount of memory to find results that exceed what an ordinary user can make sense of.
The present invention balances user requirements against the capability of the user's equipment, improves the user's single-machine data processing capability through the inverted index and the structure index, and provides the user with a result set of suitable size through approximate query processing and automatic structure understanding of the large-scale knowledge graph.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified or replaced with equivalents; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (4)

1. A real-time query method of a large-scale knowledge graph under the condition of limited memory is characterized by comprising the following steps:
processing and analyzing the original knowledge graph to obtain an inverted file hash list;
constructing a multilevel structure index based on the original knowledge graph;
analyzing the query statement to obtain a target vocabulary, and searching for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph;
the processing and analyzing of the original knowledge graph to obtain the hash list of the inverted files comprises the following steps:
extracting tuple information in a form of firstly offsetting and then vocabulary in an original knowledge graph;
converting the extracted tuple information into a form of vocabulary firstly and then offset;
sorting the tuple information in the form of first vocabularies and then offsets according to the vocabularies to obtain an inverted file;
performing hash processing on the obtained inverted file to obtain an inverted file hash list;
the step of analyzing the query statement to obtain a target vocabulary, and searching for the triples corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph, comprises the following steps:
receiving a query statement Q, a lower limit min of the number of returned tuples, an upper limit max of the number of returned tuples and a sampling ratio input by a user;
analyzing the query statement Q to obtain a vocabulary set needing to be queried;
for each vocabulary in the vocabulary set, searching in parallel in the inverted file hash list for the corresponding disk index sets {S1, S2, ……, Sn}, and obtaining the disk index intersection S by intersecting them; wherein n is the number of words in the vocabulary set;
determining whether the length of the disk index intersection S is less than the lower limit min of the number of returned tuples:
if yes, for each index position in the disk index intersection S, adding the index and the position information thereof to the result subgraph as a node;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max of the number of returned tuples; if so, setting the sampling number to max; otherwise, setting the sampling number to the product of the length of the disk index intersection S and the sampling ratio, and, if the sampling number is less than the lower limit min of the number of returned tuples, setting the sampling number to the lower limit min of the number of returned tuples; after the sampling number is determined, performing semi-random sampling on the disk index intersection S, and, for each index obtained by sampling, adding its structure subgraph in the multi-level structure index and the position information thereof to the result subgraph.
2. The method of claim 1, wherein the constructing a multi-level structure index based on the original knowledge-graph comprises:
carrying out data classification, cleaning and simplified data representation on the primary structure discovery result of the original knowledge graph to obtain a data classification and simplification result of the knowledge graph;
extracting bottom layer structure nodes based on the data classification and simplification results of the knowledge graph;
and further extracting from the data classification and simplification result of the knowledge graph to build the upper-level structure index.
3. A real-time query system of a large-scale knowledge graph under the condition of limited memory, characterized by comprising: a hash list establishing unit, a multi-level index building unit and a query unit;
the hash list establishing unit is used for processing and analyzing the original knowledge graph to obtain an inverted file hash list;
the multi-level index building unit is used for constructing a multi-level structure index based on the original knowledge graph;
the query unit is used for analyzing the query statement to obtain a target vocabulary, and searching the triple corresponding to the target vocabulary according to the inverted file hash list and the multi-level structure index to generate a result subgraph;
the hash list establishing unit is used for executing the following steps:
extracting tuple information in a form of firstly offsetting and then vocabulary in an original knowledge graph;
converting the extracted tuple information into a form of vocabulary firstly and then offset;
sorting the tuple information in the form of first vocabularies and then offsets according to the vocabularies to obtain an inverted file;
performing hash processing on the obtained inverted file to obtain an inverted file hash list;
the query unit is configured to perform the following steps:
receiving a query statement Q, a lower limit min of the number of returned tuples, an upper limit max of the number of returned tuples and a sampling ratio input by a user;
analyzing the query statement Q to obtain a vocabulary set needing to be queried;
for each vocabulary in the vocabulary set, searching in parallel in the inverted file hash list D for the corresponding disk index sets {S1, S2, ……, Sn}, and obtaining the disk index intersection S by intersecting them; wherein n is the number of words in the vocabulary set;
determining whether the length of the disk index intersection S is less than the lower limit min of the number of returned tuples:
if yes, for each index position in the disk index intersection S, adding the index and the position information thereof to the result subgraph as a node;
otherwise, determining whether the length of the disk index intersection S is greater than the upper limit max of the number of returned tuples; if so, setting the sampling number to max; otherwise, setting the sampling number to the product of the length of the disk index intersection S and the sampling ratio, and, if the sampling number is less than the lower limit min of the number of returned tuples, setting the sampling number to the lower limit min of the number of returned tuples; after the sampling number is determined, performing semi-random sampling on the disk index intersection S, and, for each index obtained by sampling, adding its structure subgraph in the multi-level structure index and the position information thereof to the result subgraph.
4. The system of claim 3, wherein the multi-level index building unit is configured to perform the following steps:
carrying out data classification, cleaning and simplified data representation on the primary structure discovery result of the original knowledge graph to obtain a data classification and simplification result of the knowledge graph;
extracting bottom layer structure nodes based on the data classification and simplification results of the knowledge graph;
and further extracting from the data classification and simplification result of the knowledge graph to build the upper-level structure index.
CN201810787762.9A 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory Active CN109033314B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810787762.9A CN109033314B (en) 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810787762.9A CN109033314B (en) 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory

Publications (2)

Publication Number Publication Date
CN109033314A CN109033314A (en) 2018-12-18
CN109033314B true CN109033314B (en) 2020-10-23

Family

ID=64643743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810787762.9A Active CN109033314B (en) 2018-07-18 2018-07-18 Real-time query method and system for large-scale knowledge graph under condition of limited memory

Country Status (1)

Country Link
CN (1) CN109033314B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110275894B (en) * 2019-06-24 2021-12-14 恒生电子股份有限公司 Knowledge graph updating method and device, electronic equipment and storage medium
CN112445890A (en) * 2019-08-27 2021-03-05 北京国双科技有限公司 Data processing method based on contract knowledge graph and related device
CN113010746B (en) * 2021-03-19 2023-08-29 厦门大学 Medical record graph sequence retrieval method and system based on sub-tree inverted index
CN112905806B (en) * 2021-03-25 2022-11-01 哈尔滨工业大学 Knowledge graph materialized view generator based on reinforcement learning and generation method
CN113094449B (en) * 2021-04-09 2023-04-18 天津大学 Large-scale knowledge map storage method based on distributed key value library
CN113254720A (en) * 2021-05-06 2021-08-13 天津大学深圳研究院 Hash sorting construction method in storage based on novel memory
CN113486092B (en) * 2021-07-30 2023-07-21 苏州工业职业技术学院 Time constraint-based time chart approximate query method and device
CN114911844B (en) * 2022-05-11 2024-04-05 复旦大学 Approximate query optimization system based on machine learning

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653706A (en) * 2015-12-31 2016-06-08 北京理工大学 Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN105760468A (en) * 2016-02-05 2016-07-13 大连大学 Large-scale image querying system based on inverted position-sensitive Hash indexing in mobile environment
CN105868313A (en) * 2016-03-25 2016-08-17 浙江大学 Mapping knowledge domain questioning and answering system and method based on template matching technique
CN108256065A (en) * 2018-01-16 2018-07-06 智言科技(深圳)有限公司 Knowledge mapping inference method based on relationship detection and intensified learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160224637A1 (en) * 2013-11-25 2016-08-04 Ut Battelle, Llc Processing associations in knowledge graphs

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LKAQ: Large-scale knowledge graph approximate query algorithm; Xiaolong Wan et al.; Information Sciences; 20190725; pp. 306-324 *
Querying Knowledge Graphs by Example Entity Tuples; Jayaram N et al.; IEEE Transactions on Knowledge & Data Engineering; 20150427; vol. 27, no. 10; pp. 2797-2811 *
Querying Large-scale Knowledge Graphs; Yang Shengqi; Dissertations & Theses Gradworks; 20151231; pp. 1-294 *
The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain; Grainger T et al.; 2016 IEEE 3rd International Conference on Data Science and Advanced Analytics; 20160905; pp. 420-429 *

Also Published As

Publication number Publication date
CN109033314A (en) 2018-12-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant