CN114020781B - Query task optimization method based on technological consultation large-scale graph data

Info

Publication number: CN114020781B
Application number: CN202111316037.1A
Authority: CN
Inventors: 鄂海红; 宋美娜; 梁静茹; 刘雨薇; 魏秋实
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2021-11-08
Filing date: 2021-11-08
Publication date: 2024-05-31
Anticipated expiration: 2041-11-08
Also published as: WO2023077731A1; CN114020781A

Abstract

In the query task optimization method, system, and storage medium based on technological consultation large-scale graph data, an identifier of a query task is obtained, and a corresponding query optimization method is selected according to that identifier. The query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views. A graph database is then queried using the selected query optimization method, and the query result is output. Because the query optimization method is selected according to the identifier of the query task, the flexibility of querying is improved. In addition, the query optimization methods improve query efficiency across the different scenarios of technological consultation large-scale graph data, reduce the computational complexity of queries, and shorten query time.

Description

Query task optimization method based on technological consultation large-scale graph data

Technical Field

The application relates to the field of large-scale graph data query, in particular to a query task optimization method, a query task optimization device and a storage medium based on technological consultation of large-scale graph data.

Background

Querying graph data is one of the most fundamental problems in the field of knowledge graphs. Efficient query processing is therefore generally required on large-scale graph data so that users can obtain query results quickly.

At present, although query optimization techniques for graph data have advanced considerably, some problems remain. For example, graph partitioning techniques used for graph query optimization can split graph data across multiple servers, but the servers then incur higher communication costs and processing overhead. In addition, most query optimization techniques are designed for social-network graph data and are not applicable to the complex topology of graph data in technological consultation scenarios. How to optimize query tasks on technological consultation large-scale graph data is therefore a problem to be solved.

Disclosure of Invention

The application provides a query task optimization method, system, and storage medium based on technological consultation large-scale graph data, with the aim of optimizing query tasks on such data.

An embodiment of a first aspect of the present application provides a query task optimization method based on technological consultation large-scale graph data, including:

acquiring an identification of a query task;

selecting a corresponding query optimization method according to the identification of the query task, wherein the query optimization method comprises adjusting a graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views; and

querying the graph database by using the selected query optimization method, and outputting a query result.

An embodiment of a second aspect of the present application provides a query task optimization system based on technological consultation of large-scale graph data, including:

an acquisition module, used for acquiring the identification of the query task;

a selection module, used for selecting a corresponding query optimization method according to the identification of the query task, wherein the query optimization method comprises adjusting a graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views; and

a display module, used for querying the graph database by using the selected query optimization method and outputting a query result.

The embodiment of the third aspect of the application provides a computer storage medium, wherein the computer storage medium stores computer executable instructions; the computer executable instructions, when executed by a processor, are capable of implementing the method as described in the first aspect above.

An embodiment of the fourth aspect of the present application provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor is capable of implementing the method according to the first aspect.

The technical scheme provided by the embodiment of the application at least has the following beneficial effects:

Additional aspects and advantages of the application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the application.

Drawings

The foregoing and/or additional aspects and advantages of the application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a query task optimization method based on technological consultation large-scale graph data according to an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a query task optimization system based on technological consultation large-scale graph data according to an embodiment of the present application.

Detailed Description

Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present application and should not be construed as limiting the application.

The following describes a query task optimization method and a query task optimization system based on technological consultation large-scale graph data according to an embodiment of the application with reference to the accompanying drawings.

Example 1

FIG. 1 is a flow chart of a query task optimization method based on technological consultation large-scale graph data according to an embodiment of the present application. As shown in FIG. 1, the method may include:

Step 101, obtaining the identification of the query task.

It should be noted that, in the embodiments of the present disclosure, the query task may involve organizations, talents, and industry chains. In the embodiment of the disclosure, an organization may be identified by a company ID, and the talents may be personnel.

In the embodiment of the disclosure, the identification of the query task may be obtained according to the content of the query task. By way of example, if the query task is to view the companies and patents associated with a person, the identification of that query task is obtained.

Step 102, selecting a corresponding query optimization method according to the identification of the query task, wherein the query optimization method comprises adjusting a graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views.

In the embodiment of the disclosure, different identifiers correspond to different query optimization methods, and the corresponding query methods can be selected according to the identifiers of the query tasks.

The query optimization methods in embodiments of the present disclosure may include adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views.
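The selection in step 102 can be pictured as a lookup from task identifier to optimization method. The following is a minimal sketch under stated assumptions: the identifier strings, the function names, and the string return values are all hypothetical, since the source does not specify the format of the identification or how the methods are invoked.

```python
from typing import Callable, Dict


# Hypothetical strategy functions; each stands in for one optimization method.
def adjust_traversal_order(task: str) -> str:
    return f"bidirectional BFS plan for {task}"


def reduce_cardinality(task: str) -> str:
    return f"early-distinct plan for {task}"


def advance_pattern(task: str) -> str:
    return f"pre-filtered plan for {task}"


def use_materialized_view(task: str) -> str:
    return f"materialized-view plan for {task}"


# Hypothetical identifiers: the source does not define their format.
STRATEGY_BY_TASK_ID: Dict[str, Callable[[str], str]] = {
    "tag_to_person_path": adjust_traversal_order,
    "person_to_tag_tuples": reduce_cardinality,
    "tag_companies_without_abnormality": advance_pattern,
    "tag_company_patent_counts": use_materialized_view,
}


def select_optimization(task_id: str) -> Callable[[str], str]:
    """Return the optimization method registered for this task identifier."""
    if task_id not in STRATEGY_BY_TASK_ID:
        raise ValueError(f"no optimization registered for task id {task_id!r}")
    return STRATEGY_BY_TASK_ID[task_id]
```

A table-driven lookup keeps the mapping between identifiers and methods in one place, so adding a new query scenario only requires a new entry.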

Further, in the embodiment of the present disclosure, adjusting the graph traversal expansion order strategy means designing, in light of the actual query scenarios of technological consultation, a graph traversal expansion order based on bidirectional BFS. The search starts from both the start point and the end point; once one direction reaches a position already visited from the other direction (that is, a state has been visited from both directions), a shortest path connecting the start point and the end point has been found. Because the two searches meet near the midpoint of the shortest path, the number of nodes expanded by the bidirectional BFS is on the order of 2 × n^(m/2) + 1.

Specifically, in an embodiment of the present disclosure, adjusting the graph traversal expansion order strategy may include the following steps:

S11, inputting a source entity node and a target entity node, an intermediate entity node type mtype, and a path pattern (pattern);

S12, initializing two node sets S1 and S2, wherein S1 is initialized to the input source entity node and S2 is initialized to the input target entity node;

S13, calculating the expansion order of the bidirectional BFS from pattern and mtype, with pattern1 denoting the left-side expansion order and pattern2 denoting the right-side expansion order;

S14, if S1 or S2 is not empty, continuing to step S15; otherwise, executing step S111;

S15, S is the set of nodes expanded at the current layer;

S16, swapping S1 and S2, so that expansion alternates between the left end and the right end;

S17, expanding the next-layer neighbor nodes of each node in the set S1 according to the pattern, denoted next_nodes;

S18, checking each node in next_nodes, and if the node is in the set S, a path has been found and step S111 is executed;

S19, adding all nodes in next_nodes expanded at this layer to the set S, copying the set S to S1, and storing the paths;

S110, repeating from step S14;

S111, ending.
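The alternating expansion in steps S14 to S110 can be sketched as a plain bidirectional BFS. This is a simplified illustration, not the patented procedure itself: it assumes an undirected graph stored as adjacency sets, ignores the pattern and mtype constraints, and uses hypothetical function names. Parent maps play the role of the visited sets and allow the path to be rebuilt once the two searches meet.

```python
from typing import Dict, List, Optional, Set


def _chain(node: Optional[str], parents: Dict[str, Optional[str]]) -> List[str]:
    """Walk parent links from node back to the root of its search tree."""
    out = []
    while node is not None:
        out.append(node)
        node = parents[node]
    return out


def bidirectional_bfs(graph: Dict[str, Set[str]], source: str,
                      target: str) -> Optional[List[str]]:
    """Return one shortest path from source to target, or None if unreachable."""
    if source == target:
        return [source]
    parents_a: Dict[str, Optional[str]] = {source: None}  # visited from one end
    parents_b: Dict[str, Optional[str]] = {target: None}  # visited from the other
    frontier_a, frontier_b = {source}, {target}
    while frontier_a and frontier_b:
        next_nodes = set()
        for node in frontier_a:
            for nb in graph.get(node, ()):
                if nb in parents_a:
                    continue
                parents_a[nb] = node
                if nb in parents_b:  # the two searches have met (cf. S18)
                    a, b = _chain(nb, parents_a), _chain(nb, parents_b)
                    if a[-1] != source:  # roles may currently be swapped
                        a, b = b, a
                    return a[::-1] + b[1:]
                next_nodes.add(nb)
        # Swap roles so the other end expands next (cf. S16).
        frontier_a, frontier_b = frontier_b, next_nodes
        parents_a, parents_b = parents_b, parents_a
    return None
```

Each side only explores about half of the path depth, which is where the saving over a unidirectional BFS comes from.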

For example, in the embodiment of the disclosure, the query task gives an industry chain tag and personnel information person, and queries, starting from the tag, its child industry chain tags, the patents belonging to those child tags, the companies that own the patents, and the persons associated with those companies through relations such as employment or investment. In the constructed technological consultation knowledge graph, the path industry chain tag - child industry chain tag - patent produces 146284 intermediate patent nodes. Expanding all 146284 patents with a unidirectional BFS would produce an explosion of intermediate results and seriously degrade query performance.

If the bidirectional BFS graph traversal expansion order strategy of the embodiment of the present disclosure is used instead, the search proceeds from both the start point and the end point, that is, along the two directions industry chain tag - child industry chain tag - patent and person - company - patent. The 146284 intermediate patent nodes produced by the industry chain tag - child industry chain tag - patent direction are stored in a hash table. The search then proceeds in reverse from the person node, and the person - company - patent path produces a set of results. Finally, this set is intersected with the hash table to find the paths that satisfy the conditions and connect the start point and the end point, with a time complexity of only O(n).
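The hash-table intersection described above can be sketched as follows. This is an illustrative toy, assuming an undirected graph with one type per node; the node names (T, S1, P1, ...), the type labels, and the function names are hypothetical and the data is made up. Each side is expanded along its own type sequence, the results are keyed by the middle (patent) node, and only keys present on both sides are joined.

```python
from typing import Dict, List, Set


def add_edge(graph: Dict[str, Set[str]], a: str, b: str) -> None:
    """Store an undirected edge in an adjacency-set graph."""
    graph.setdefault(a, set()).add(b)
    graph.setdefault(b, set()).add(a)


def expand(graph: Dict[str, Set[str]], node_type: Dict[str, str],
           start: str, type_seq: List[str]) -> Dict[str, List[List[str]]]:
    """Follow type_seq from start; return a hash table {end node: paths to it}."""
    paths = [[start]]
    for wanted in type_seq:
        paths = [p + [nb] for p in paths
                 for nb in sorted(graph.get(p[-1], ()))
                 if node_type[nb] == wanted]
    table: Dict[str, List[List[str]]] = {}
    for p in paths:
        table.setdefault(p[-1], []).append(p)
    return table


def meet_in_middle(graph, node_type, source, target, left_types, right_types):
    """Join left and right partial paths on their shared middle node."""
    left = expand(graph, node_type, source, left_types)
    right = expand(graph, node_type, target, right_types)
    results = []
    for mid in sorted(left.keys() & right.keys()):  # hash lookups, no rescans
        for lp in left[mid]:
            for rp in right[mid]:
                results.append(lp + rp[-2::-1])  # drop duplicate middle node
    return results


graph: Dict[str, Set[str]] = {}
for a, b in [("T", "S1"), ("S1", "P1"), ("S1", "P2"), ("S1", "P3"),
             ("P1", "C1"), ("P2", "C2"), ("C1", "A")]:
    add_edge(graph, a, b)
node_type = {"T": "tag", "S1": "subtag", "P1": "patent", "P2": "patent",
             "P3": "patent", "C1": "company", "C2": "company", "A": "person"}

paths = meet_in_middle(graph, node_type, "T", "A",
                       ["subtag", "patent"], ["company", "patent"])
```

In this toy graph three patents are reachable from the tag side, but only the one also reachable from the person side survives the intersection, so the other patents are never expanded further.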

Further, cardinality denotes the number of unique values after deduplication; for example, column cardinality is the number of distinct values that a column contains. This number directly affects model compression and the engine's scan performance, so cardinality should be minimized to reduce the time required for a query.
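As a small illustration of the definition, column cardinality can be computed by collecting a column's values into a set. The rows and the function name below are made up for the example.

```python
from typing import Dict, Hashable, List


def column_cardinality(rows: List[Dict[str, Hashable]], column: str) -> int:
    """Number of distinct values that appear in the given column."""
    return len({row[column] for row in rows})


# Illustrative rows: one person reached through three relationship edges.
rows = [
    {"person": "A", "company": "C1"},
    {"person": "A", "company": "C1"},
    {"person": "A", "company": "C2"},
]
```

Here the table has three rows, but the company column has only two distinct values and the person column only one.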

In an embodiment of the present disclosure, cardinality reduction may include the following steps:

S21, inputting a source entity node and a path pattern (pattern);

S22, next_nodes is the node set of the next layer, initialized to the next-layer neighbor nodes of the source entity node expanded according to the pattern;

S23, de-duplicating next_nodes;

S24, q is a node queue, initialized to next_nodes;

S25, if q is not empty, continuing to step S26; otherwise, executing step S212;

S26, size is the number of nodes currently in the queue;

S27, if size is not zero, continuing to step S28; otherwise, executing step S211;

S28, popping the current node from the queue;

S29, expanding the next-layer neighbor nodes next_nodes of the node according to the pattern;

S210, adding next_nodes to the queue q;

S211, if the pattern has been fully traversed, continuing to step S212; otherwise, executing step S25;

S212, ending.
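The effect of de-duplicating a layer before expanding it can be sketched with a layer-by-layer traversal. This is a simplified illustration, assuming directed edges given as (source, relation, destination) triples and a fixed traversal depth in place of the path pattern; the relation names, node names, and function names are hypothetical. Unlike step S23, which de-duplicates the first layer, the sketch applies the distinct step at every layer when enabled.

```python
from typing import List, Tuple

Edge = Tuple[str, str, str]  # (source, relation, destination)


def expand_layer(edges: List[Edge], nodes: List[str]) -> List[str]:
    """Neighbors of the given nodes, one entry per matching edge (duplicates kept)."""
    return [dst for n in nodes for (src, _rel, dst) in edges if src == n]


def traverse(edges: List[Edge], source: str, depth: int,
             dedupe: bool) -> Tuple[List[str], int]:
    """Expand `depth` layers; return the last layer and the total nodes expanded."""
    frontier = [source]
    expanded = 0
    for _ in range(depth):
        frontier = expand_layer(edges, frontier)
        if dedupe:
            frontier = list(dict.fromkeys(frontier))  # order-preserving distinct
        expanded += len(frontier)
    return frontier, expanded


edges = [
    ("A", "invests_in", "C1"),
    ("A", "shareholder_of", "C1"),
    ("A", "works_at", "C1"),
    ("C1", "owns", "P1"),
    ("C1", "owns", "P2"),
]
raw_nodes, raw_count = traverse(edges, "A", depth=2, dedupe=False)
distinct_nodes, distinct_count = traverse(edges, "A", depth=2, dedupe=True)
```

Both runs reach the same two patents, but without the early distinct step the company is expanded three times, once per parallel edge.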

For example, in the embodiment of the disclosure, in the knowledge graph of an actual technological consultation scenario, two nodes may be connected by multiple edges or by edges of different types; for example, three relations "company-investor"/"company-public stakeholder-person"/"company-tenninal" exist between a "company" node and a "person" node. Thus, when searching from a company for its adjacent "person" nodes, the same "person" node may be reached through each of these three relations, which generates duplicate nodes. These redundant nodes increase the cardinality: when the duplicated "person" nodes go on to search for their own adjacent nodes, the same traversal is repeated, which increases the number of intermediate nodes and the query time. Embodiments of the present disclosure therefore use an early-distinct optimization strategy to reduce cardinality.

Specifically, in the embodiment of the present disclosure, the query task in the technological consultation scenario gives a person, queries from that person the associated companies, the patents owned by those companies, and the industry chain tags to which the patents belong, and outputs without repetition the (company, patent, industry chain tag) tuples that match the path. The embodiment uses the early-distinct strategy to reduce cardinality: the deduplication operation is moved forward to the point where duplicate nodes are generated, that is, deduplication is performed immediately after the traversal from the "person" node reaches the "company" nodes. This reduces 201 intermediate company nodes containing duplicates to 131 distinct company nodes, thereby reducing the number of intermediate nodes generated and the subsequent traversal time.

Further, in the embodiment of the present disclosure, target data must be retrieved and screened according to business conditions; this process is the filtering of a data query. Large-scale graph query tasks contain a large number of filtering operations, and the filter conditions used, such as basic comparison operators (<, >, =), logical operations (AND, OR, NOT), and pattern matching, are necessary for obtaining accurate data.

In an embodiment of the present disclosure, pattern advance may include the following steps:

S31, inputting a source entity node, a path pattern (pattern), and filter patterns (filter_patterns);

S32, initializing the pattern advance set filter_nodeset;

S33, q is a node queue, initialized to the input source entity node;

S34, if q is not empty, continuing to step S35; otherwise, executing step S313;

S35, size is initialized to the number of nodes currently in the queue;

S36, if size is not zero, continuing to step S37; otherwise, executing step S312;

S37, popping the current node from the queue;

S38, expanding the next-layer neighbor nodes next_nodes of the node according to the pattern;

S39, judging whether the node type of the current next_nodes is the node type of filter_nodeset; if yes, continuing to step S310; otherwise, executing step S311;

S310, traversing each node next_node in the set next_nodes, and filtering out next_node if it is in the set filter_nodeset;

S311, adding next_nodes to the queue q;

S312, if the pattern has been fully traversed, continuing to step S313; otherwise, executing step S35;

S313, ending.
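The idea behind steps S32 and S39 to S310 can be sketched as building the filter set once and then replacing a per-node traversal with a set lookup. This is a minimal illustration under stated assumptions: edges are directed (source, relation, destination) triples, and the relation names (has_company, owns, has_abnormality), node names, and function names are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, destination)


def build_filter_nodeset(edges: List[Edge], filter_relation: str) -> Set[str]:
    """Evaluate the filter pattern once: nodes that have an outgoing filter edge."""
    return {src for (src, rel, _dst) in edges if rel == filter_relation}


def companies_and_patents(edges: List[Edge], tag: str,
                          filter_nodeset: Set[str]) -> List[Tuple[str, str]]:
    """Distinct (company, patent) pairs under a tag, skipping filtered companies."""
    pairs = set()
    for (src, rel, company) in edges:
        if src != tag or rel != "has_company":
            continue
        if company in filter_nodeset:  # O(1) average lookup, no extra traversal
            continue
        for (owner, rel2, patent) in edges:
            if owner == company and rel2 == "owns":
                pairs.add((company, patent))
    return sorted(pairs)


edges = [
    ("tag1", "has_company", "C1"),
    ("tag1", "has_company", "C2"),
    ("C1", "owns", "P1"),
    ("C2", "owns", "P2"),
    ("C2", "has_abnormality", "X1"),
]
abnormal_companies = build_filter_nodeset(edges, "has_abnormality")
result = companies_and_patents(edges, "tag1", abnormal_companies)
```

The linear scans over the edge list are only there to keep the example short; the point is that the abnormality check itself is a single set membership test per company.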

For example, in the embodiment of the present disclosure, the query task in the technological consultation scenario gives an industry chain tag and queries from the tag the companies associated with it and the patents owned by those companies, subject to a filter condition: the company must not have any abnormal operation, that is, no company - abnormal operation pattern may exist. The (company, patent) tuples are output without repetition.

In particular, pattern advance in embodiments of the present disclosure replaces the traversal operations of a pattern with efficient set lookups. The company - abnormal operation pattern is evaluated in advance, and the IDs of the companies associated with abnormal-operation nodes are placed in a hash table. The filter condition then only needs to check whether a company is in the hash table; if it is not, the company has no abnormal operation. Only 3292 set lookups of O(1) time complexity are needed, which improves query efficiency.

Further, in the embodiment of the disclosure, a materialized view is used mainly to compute and store in advance the results of time-consuming operations such as table joins or aggregations, so that these operations can be avoided when the query task is later executed and the query result can be obtained quickly. In the technological consultation scenario, materialized views greatly improve query performance for hot-spot queries whose results are reused frequently, because the data is read directly from the materialized view.

For example, in the embodiment of the disclosure, a query task in the technological consultation scenario gives an industry chain tag, queries from the tag its child industry chain tags and the companies belonging to those child tags, that is, the paths that start at a child industry chain tag node, pass through patent nodes, and end at a company node, and counts the companies and the number of patents that match this pattern. Running this query separately for each company is very time-consuming. With the materialized view method of the embodiment, the patents owned by each company are obtained in advance, the industry chain tag of each patent is determined, the patents are counted per industry chain tag, and the resulting patent count is stored as an attribute of the company - industry chain tag edge. The precomputed materialized view thus improves query efficiency.
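The precomputation described above can be sketched as an aggregation keyed by (company, industry chain tag). This is an illustrative toy with made-up data, assuming in-memory dictionaries in place of the graph database; the function and variable names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_materialized_view(company_patents: Dict[str, List[str]],
                            patent_tags: Dict[str, List[str]]) -> Counter:
    """Precompute the patent count for every (company, tag) pair."""
    view: Counter = Counter()
    for company, patents in company_patents.items():
        for patent in patents:
            for tag in patent_tags.get(patent, ()):
                view[(company, tag)] += 1
    return view


def patent_counts_for_tag(view: Counter, tag: str) -> Dict[str, int]:
    """Answer the query from the stored view without touching the patents again."""
    return {company: n for (company, t), n in view.items() if t == tag}


company_patents = {"C1": ["P1", "P2"], "C2": ["P3"]}
patent_tags = {"P1": ["battery"], "P2": ["battery"], "P3": ["chip"]}

view = build_materialized_view(company_patents, patent_tags)
battery_counts = patent_counts_for_tag(view, "battery")
```

In a graph store, each entry of the view would correspond to the count attribute written onto a company - industry chain tag edge, and it would need refreshing whenever patents or their tags change.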

Step 103, querying the graph database by using the selected query optimization method, and outputting a query result.

In the embodiment of the present disclosure, the query optimization method selected in step 102 is used to query the graph database, and the query result is output. In embodiments of the present disclosure, the query result may include associations between nodes in the graph database.

In summary, in the query task optimization method based on technological consultation large-scale graph data, an identification of a query task is obtained, and a corresponding query optimization method is selected according to that identification, wherein the query optimization methods comprise adjusting the graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views. The graph database is then queried using the selected query optimization method, and the query result is output. Because the query optimization method is selected according to the identification of the query task, the flexibility of querying is improved. In addition, the query optimization methods improve query efficiency across the different scenarios of technological consultation large-scale graph data, reduce the computational complexity of queries, and shorten query time.

FIG. 2 is a schematic structural diagram of a query task optimization system based on technological consultation large-scale graph data according to an embodiment of the present application, where the system may include:

An obtaining module 201, configured to obtain an identifier of a query task;

a selection module 202, configured to select a corresponding query optimization method according to the identification of the query task, where the query optimization method includes adjusting a graph traversal expansion order strategy, cardinality reduction, pattern advance, and materialized views; and

a display module 203, configured to query the graph database by using the selected query optimization method, and output a query result.

In embodiments of the present disclosure, the query task may include an organization, talents, and industry chains, among others.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine the different embodiments or examples described in this specification, and the features of those embodiments or examples, provided they do not contradict one another.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present application.

While embodiments of the present application have been shown and described above, it will be understood that the above embodiments are illustrative and are not to be construed as limiting the application, and that changes, modifications, substitutions, and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the application.

Claims

1. The query task optimization method based on technological consultation large-scale graph data is characterized by comprising the following steps of:

acquiring an identification of a query task;

querying the graph database by using the query optimization method, and outputting a query result;

the query task comprises an organization, talents, and an industry chain;

the cardinality reduction comprises:

S21, inputting a source entity node and a path pattern (pattern);

S23, de-duplicating next_nodes;

S24, q is a node queue, initialized to next_nodes;

S26, size is the number of nodes currently in the queue;

S28, popping the current node from the queue;

S210, adding next_nodes to the queue q;

S212, ending.

2. The query task optimization method of claim 1, wherein adjusting the graph traversal expansion order strategy comprises:

S15, S is the set of nodes expanded at the current layer;

S110, repeating from step S14;

S111, ending.

3. The query task optimization method of claim 1, wherein the pattern advance comprises:

S31, inputting a source entity node, a path pattern (pattern), and filter patterns (filter_patterns);

S32, initializing the pattern advance set filter_nodeset;

S33, q is a node queue, initialized to the input source entity node;

S35, size is initialized to the number of nodes currently in the queue;

S37, popping the current node from the queue;

S311, adding next_nodes to the queue q;

S313, ending.

4. A query task optimization system based on technological consultation of large-scale graph data, the system comprising:

a display module, used for querying the graph database by using the query optimization method and outputting a query result;

the query task comprises an organization, talents, and an industry chain;

the cardinality reduction comprises:

S21, inputting a source entity node and a path pattern (pattern);

S23, de-duplicating next_nodes;

S24, q is a node queue, initialized to next_nodes;

S26, size is the number of nodes currently in the queue;

S28, popping the current node from the queue;

S210, adding next_nodes to the queue q;

S212, ending.

5. A computer storage medium, wherein the computer storage medium stores computer-executable instructions; the computer executable instructions, when executed by a processor, are capable of implementing the method of any of claims 1-3.

6. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any of claims 1-3 when the program is executed.