CN109597895B

CN109597895B - Knowledge graph-based official document searching method

Info

Publication number: CN109597895B
Application number: CN201811332469.XA
Authority: CN
Inventors: 熊子奇; 王鹏
Original assignee: CETC Big Data Research Institute Co Ltd
Current assignee: CETC Big Data Research Institute Co Ltd
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2021-10-22
Anticipated expiration: 2038-11-09
Also published as: CN109597895A

Abstract

The invention provides a knowledge graph-based official document searching method comprising the following steps: data collection, official document semantic characterization, official document graph characterization, searching, and display. The invention effectively addresses the problems of one word having multiple meanings (polysemy) and multiple words sharing one meaning (synonymy), keeps up to date with publicly issued official documents, covers most of the main issuing bodies, is tailored to a specialized field, and yields more accurate and reasonable search results.

Description

Knowledge graph-based official document searching method

Technical Field

The invention relates to a knowledge graph-based official document searching method, and belongs to the field of search engines.

Background

A traditional search scheme is word-based. For Chinese, the documents to be searched must first be segmented into words and then inverted-indexed, which converts the structure "document 1: (word 1, word 2, …)" into the structure "word 1: (document 1, document 2, …)" for storage. When a user searches, the search engine first segments the user's search terms, then looks up the documents containing each word in the inverted index, scores each document according to the number of times the words appear in it and other relevance factors, and finally returns the relevant documents ranked by score.
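The word-based scheme above can be sketched in Python as follows. This is a minimal illustration, assuming the documents are already segmented into word lists; the function names and sample documents are hypothetical, and the score is a raw term count rather than any particular engine's ranking formula.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Convert {doc_id: [word, ...]} into {word: {doc_id: term_count}}."""
    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for word, count in Counter(words).items():
            index[word][doc_id] = count
    return index


def search(index, query_words):
    """Score each document by how often the query words occur in it."""
    scores = Counter()
    for word in query_words:
        for doc_id, count in index.get(word, {}).items():
            scores[doc_id] += count
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical, already-segmented documents.
docs = {
    "doc1": ["notice", "budget", "budget"],
    "doc2": ["notice", "training"],
}
index = build_inverted_index(docs)
ranked = search(index, ["budget", "notice"])
print(ranked)  # ['doc1', 'doc2']
```

Because matching is purely lexical, a query word only finds documents that contain that exact word, which is the limitation the later sections address with entities and concepts.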

Official documents cover a wide range of subject matter, and general-purpose word segmentation tools struggle to meet the needs of the official document field. Training a segmentation tool for this field from scratch, however, requires a massive manually labeled corpus. Because official document search is mainly provided by the search engines that individual official document publishing websites build for themselves, that approach is not practical in the current setting.

Such self-built services cannot effectively handle polysemy (one word with multiple meanings) or synonymy (multiple words with one meaning). Their scope is also limited to their own website, so they cannot track the official documents of other publishing websites in time. General-purpose search engines, in turn, cannot promptly and effectively distinguish official documents from other documents, so their results are unfriendly to users retrieving official documents.

In summary, conventional search engines have the following problems in the official document field:

1. domain-specific search engines have narrow coverage and do not cover all official document issuing bodies;

2. the search results of general-purpose search engines are mixed with other content and are poorly adapted to official document search;

3. both have difficulty handling polysemy and synonymy in the official document field.

Disclosure of Invention

To solve the above technical problems, the invention provides a knowledge graph-based official document searching method that addresses polysemy and synonymy in official document search, as well as data coverage and query accuracy in the official document field.

The invention is realized by the following technical scheme.

The invention provides a knowledge graph-based official document searching method, which comprises the following steps:

Step 1, data collection: crawl the official documents published by each official document publishing website, together with the data of encyclopedia websites, to obtain official document data and encyclopedia data;

Step 2, official document semantic characterization: clean the official document data and encyclopedia data, extract terms from the encyclopedia data, build a specialized dictionary, and use an entity recognition tool to obtain entities, concepts, attributes and relationship weights from the specialized dictionary;

Step 3, official document graph characterization: store the entities, concepts, attributes and weights in knowledge-graph form to build the graph;

Step 4, searching: interpret the user's query request against the concept graph from step 3 and return the relevant official documents;

Step 5, display: present the relevant official documents as conventional text content or as a concept graph and knowledge graph.

Step 2 comprises the following sub-steps:

(2.1) term extraction: clean the encyclopedia data and extract a large number of terms from it;

(2.2) official document cleaning: clean the official document data;

(2.3) specialized dictionary construction: combine the terms from step (2.1) with the cleaned official documents from step (2.2) into a specialized dictionary;

(2.4) entity identification: obtain an entity set from the dictionary with an entity recognition tool, and supplement the entities it fails to identify by using the pointwise mutual information (PMI) formula;

(2.5) concept identification: based on part-of-speech tags, extract the keyword expressions that link entities to concepts, extract the instanceOf relations, and obtain a concept set;

(2.6) attribute classification: classify the attributes of the entities in the entity set and the concepts in the concept set;

(2.7) entity statistics: for the entities from step (2.4), count, in each document of the official document publishing websites and encyclopedia websites, the number of occurrences of each entity and the total number of occurrences of all entities;

(2.8) concept statistics: for the concepts from step (2.5), accumulate, across the official document publishing websites and encyclopedia websites, the occurrence counts of the entities that belong to the same concept within the same document, yielding the number of occurrences of that concept and the total number of occurrences of all concepts in that document.

In step (2.4), the pointwise mutual information formula is as follows:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

where $p(x, y)$ is the number of times terms $x$ and $y$ appear in the same sentence in the specialized dictionary, $p(x)$ is the number of times term $x$ appears in the specialized dictionary, and $p(y)$ is the number of times term $y$ appears in the specialized dictionary.
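As an illustration of how pointwise mutual information can flag term pairs that belong together as one entity, the following Python sketch scores every pair of terms that co-occur in a sentence. It rests on assumptions not stated in the text: the raw counts are normalized by the number of sentences to obtain probabilities, the sentences are already segmented, and the threshold of 0.5 is an arbitrary illustrative value.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_table(sentences):
    """Return {(x, y): PMI} for term pairs that co-occur in a sentence."""
    n = len(sentences)
    term_counts = Counter()
    pair_counts = Counter()
    for terms in sentences:
        unique = sorted(set(terms))  # count each term once per sentence
        term_counts.update(unique)
        pair_counts.update(combinations(unique, 2))
    table = {}
    for (x, y), n_xy in pair_counts.items():
        p_xy = n_xy / n
        p_x = term_counts[x] / n
        p_y = term_counts[y] / n
        table[(x, y)] = math.log(p_xy / (p_x * p_y))
    return table


# Hypothetical segmented sentences.
sentences = [
    ["big", "data"],
    ["big", "data", "bureau"],
    ["bureau", "notice"],
    ["notice"],
]
scores = pmi_table(sentences)
# Pairs scoring above the (assumed) threshold become entity candidates.
candidates = [pair for pair, score in scores.items() if score > 0.5]
print(candidates)  # [('big', 'data')]
```

Here "big" and "data" always appear together, so their PMI is log 2, while pairs that co-occur only as often as chance would predict score 0.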

In step (2.5), the hierarchical relations among the concepts in the concept set are extracted to form subClassOf relations, yielding a set of concept hierarchy relations.

In step (2.7), the probability of occurrence of each entity is:

$$P(w) = \frac{N_w}{N}$$

where $N_w$ is the number of times entity $w$ occurs in a given document and $N$ is the total number of entity occurrences in that document.

In step (2.8), the probability is:

$$P(c) = \frac{N_c}{N}$$

where $N_c$ is the number of occurrences of concept $c$ in a given document and $N$ is the number of occurrences of all concepts in that document.
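The two statistics of steps (2.7) and (2.8) are relative frequencies within one document, which the following Python sketch computes. The entity names, the entity-to-concept mapping and the function names are hypothetical; in the method they would come from the entity set and the instanceOf relations.

```python
from collections import Counter


def occurrence_probabilities(items):
    """Return {item: N_item / N} for one document's list of occurrences."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


def concept_occurrences(entities, entity_to_concept):
    """Replace each entity occurrence by its concept so counts accumulate."""
    return [entity_to_concept[e] for e in entities if e in entity_to_concept]


# Hypothetical entity occurrences found in a single document.
doc_entities = ["Company A", "Company A", "City B", "Company A"]
entity_to_concept = {"Company A": "company", "City B": "city"}

entity_probs = occurrence_probabilities(doc_entities)
concept_probs = occurrence_probabilities(
    concept_occurrences(doc_entities, entity_to_concept)
)
print(entity_probs)   # {'Company A': 0.75, 'City B': 0.25}
print(concept_probs)  # {'company': 0.75, 'city': 0.25}
```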

In step 3, the nodes of the concept graph are official documents, concepts, entities and attributes, and the edges are weighted relations among the entities.

The weighted relation is represented by the probability from step (2.7).
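One way to hold such a graph in memory is a typed node table plus a weighted adjacency map, sketched below in Python. The class, the relation name "appearsIn" and the sample nodes are assumptions for illustration only; the text does not name a storage engine or an edge vocabulary beyond instanceOf and subClassOf.

```python
from collections import defaultdict


class ConceptGraph:
    """Typed nodes with weighted, labeled edges (illustrative only)."""

    def __init__(self):
        self.node_types = {}            # node name -> node type
        self.edges = defaultdict(dict)  # source -> {target: (relation, weight)}

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_edge(self, source, target, relation, weight=1.0):
        self.edges[source][target] = (relation, weight)

    def neighbors(self, source, relation):
        """Return {target: weight} for edges of one relation type."""
        return {
            target: weight
            for target, (rel, weight) in self.edges.get(source, {}).items()
            if rel == relation
        }


graph = ConceptGraph()
graph.add_node("doc1", "document")
graph.add_node("Company A", "entity")
graph.add_node("company", "concept")
# Edge weight taken from the step (2.7) probability of the entity in doc1.
graph.add_edge("Company A", "doc1", "appearsIn", weight=0.75)
graph.add_edge("Company A", "company", "instanceOf")

print(graph.neighbors("Company A", "appearsIn"))  # {'doc1': 0.75}
```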

Step 4 comprises the following sub-steps:

(4.1) perform entity identification on the user's query request;

(4.2) use the entity recognition tool to segment the identified entities into words and tag their parts of speech;

(4.3) map the query words to the corresponding entities and concepts in the graph, and obtain the modification relations between those entities and concepts;

(4.4) retrieve the most relevant official documents according to the modification relations between the entities and concepts and their relationship weights with the documents, and return those documents.
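The search sub-steps can be approximated by the Python sketch below, which maps query terms to entities or concepts and ranks documents by summed relationship weights. It is a simplification under stated assumptions: a longest-first substring lookup stands in for the entity recognition tool, a concept simply expands to its member entities, modification relations are not modeled, and all names and weights are hypothetical.

```python
from collections import defaultdict

# Hypothetical graph fragments: entity-to-document weights from step (2.7)
# and concept membership from the instanceOf relations.
entity_doc_weights = {
    "Company A": {"doc1": 0.75, "doc2": 0.10},
    "Company B": {"doc2": 0.40},
}
concept_members = {"company": ["Company A", "Company B"]}


def recognize(query, vocabulary):
    """Stand-in for the entity recognition tool: longest-first lookup."""
    found = []
    for term in sorted(vocabulary, key=len, reverse=True):
        if term in query:
            found.append(term)
            query = query.replace(term, " ")  # avoid matching inside it again
    return found


def rank_documents(query):
    vocabulary = set(entity_doc_weights) | set(concept_members)
    scores = defaultdict(float)
    for term in recognize(query, vocabulary):
        # A concept expands to its member entities; an entity stands alone.
        for entity in concept_members.get(term, [term]):
            for doc, weight in entity_doc_weights.get(entity, {}).items():
                scores[doc] += weight
    return sorted(scores, key=lambda d: (-scores[d], d))


print(rank_documents("notices about Company A"))  # ['doc1', 'doc2']
print(rank_documents("notices about Company B"))  # ['doc2']
```

A query that names only the concept, such as "company notices", reaches documents through every member entity, which is how a graph lookup can return documents that never contain the literal query word.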

In step 2, the entity recognition tool is built from the specialized dictionary, publicly available annotated corpora, and entity recognition annotated corpora from the official document field, forming an entity recognition tool for the official document field;

the entity recognition tool provides word segmentation and part-of-speech tagging, and the relationship weights among entities, concepts and documents can be obtained by counting the entities and concepts that appear in the documents.
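To show how a specialized dictionary can drive word segmentation, the Python sketch below uses forward maximum matching. This is a substitute technique chosen for brevity, not the tool described above, which is built from annotated corpora as well; the dictionary entries and the sample sentence are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy segmentation: take the longest dictionary word at each position."""
    max_len = max_len or max(len(word) for word in dictionary)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary: "big data", "research institute",
# "official document", "publish".
dictionary = {"大数据", "研究院", "公文", "发布"}
tokens = forward_max_match("大数据研究院发布公文", dictionary)
print(tokens)  # ['大数据', '研究院', '发布', '公文']
```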

The beneficial effects of the invention are as follows: the problems of polysemy and synonymy are effectively addressed, publicly issued official documents can be updated in time, most of the main issuing bodies are covered, the method is tailored to a specialized field, and the search results are more accurate and reasonable.

Detailed Description

The technical solution of the invention is further described below, but the claimed scope is not limited to what is described.

A knowledge graph-based official document searching method comprises the following steps:

Preferably, an NLP tool is used to extract the sets of related entities, concepts and attributes from the specialized dictionary by pattern extraction, specifically through the following steps:

(2.2) document cleaning: cleaning the document data;

Specifically, given language habits and the characteristics of Chinese, an entity belongs to some concept (for example, "China electronic department is a company"), so entities and their concepts frequently co-occur in the same sentence. A small number of labeled, high-quality seed templates can therefore be expanded, through a template self-learning strategy, into high-quality and reliable recall templates that mine the keyword expressions linking entities to concepts. The instanceOf relations between entities and concepts are then extracted in this way, yielding the set of concepts to which each entity belongs;
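The template self-learning idea can be sketched in Python as one bootstrapping round: seed templates yield entity and concept pairs, and any connecting phrase that links enough known pairs is promoted to a new template. The sketch assumes a template is just the text between an entity and its concept, uses English sentences in place of Chinese, and picks a support threshold of 2 arbitrarily; all names and data are hypothetical.

```python
import re


def extract_pairs(sentences, infixes):
    """Return (entity, concept) pairs from '<entity><infix><concept>' sentences."""
    pairs = set()
    for infix in infixes:
        pattern = re.compile(r"^(.+?)" + re.escape(infix) + r"(.+)$")
        for sentence in sentences:
            match = pattern.match(sentence)
            if match:
                pairs.add((match.group(1), match.group(2)))
    return pairs


def learn_infixes(sentences, known_pairs, min_support=2):
    """Propose infixes that connect at least `min_support` known pairs."""
    support = {}
    for entity, concept in known_pairs:
        for sentence in sentences:
            if sentence.startswith(entity) and sentence.endswith(concept):
                infix = sentence[len(entity):len(sentence) - len(concept)]
                if infix.strip():
                    support.setdefault(infix, set()).add((entity, concept))
    return {infix for infix, pairs in support.items() if len(pairs) >= min_support}


sentences = [
    "Company A is a company",
    "Company B is a company",
    "Company A belongs to the category company",
    "Company B belongs to the category company",
    "Company C belongs to the category company",
]
seed = {" is a "}
seed_pairs = extract_pairs(sentences, seed)       # pairs for Company A and B
learned = learn_infixes(sentences, seed_pairs)    # adds the "belongs to" infix
expanded = extract_pairs(sentences, seed | learned)
print(("Company C", "company") in expanded)  # True
```

The seed template alone misses "Company C"; the learned template recovers it, which is the recall gain the passage describes.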

Further, in step (2.4), the pointwise mutual information formula is as follows:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

Further, in step (2.5), the hierarchical relations among the concepts in the concept set are extracted to form subClassOf relations, yielding a set of concept hierarchy relations. For example, patterns such as "is a" and "is a subclass of …" are used to obtain the hierarchical relations between concepts, that is, the subClassOf relations among the concept sets.
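A small Python sketch of pattern-based subClassOf extraction follows, together with a helper that walks the resulting hierarchy upward. The two English patterns, the sample sentences and the function names are illustrative assumptions; the patterns actually used on Chinese text are not given in the source.

```python
import re

# Assumed English stand-ins for the hierarchy patterns.
SUBCLASS_PATTERNS = [
    re.compile(r"^(?P<child>.+?) is a subclass of (?P<parent>.+)$"),
    re.compile(r"^(?P<child>.+?) is a kind of (?P<parent>.+)$"),
]


def extract_subclass_of(sentences):
    """Return the set of (child_concept, parent_concept) relations."""
    relations = set()
    for sentence in sentences:
        for pattern in SUBCLASS_PATTERNS:
            match = pattern.match(sentence)
            if match:
                relations.add((match.group("child"), match.group("parent")))
                break
    return relations


def ancestors(concept, relations):
    """Collect every concept reachable by following subClassOf upward."""
    parents = {}
    for child, parent in relations:
        parents.setdefault(child, set()).add(parent)
    seen, stack = set(), [concept]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


sentences = [
    "state-owned enterprise is a subclass of company",
    "company is a kind of organization",
]
relations = extract_subclass_of(sentences)
print(sorted(ancestors("state-owned enterprise", relations)))
# ['company', 'organization']
```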

Further, in step (2.7), the probability of occurrence of each entity is:

$$P(w) = \frac{N_w}{N}$$

Further, in step (2.8), the probability is:

$$P(c) = \frac{N_c}{N}$$

Step 3, official document graph characterization: the entities, concepts, official documents and relationship weights form an official document connotation knowledge graph, and the official documents, entities and attributes form an official document field concept graph; the nodes of the concept graph are official documents, concepts, entities and attributes, and the edges are weighted relations among the entities; the degree of correlation between the two is described in graph form.

Further, the weighted relation is represented by the probability from step (2.7).

Step 4, searching: interpret the user's query request against the concept graph from step 3 and return the relevant official documents; this comprises the following sub-steps:

(4.1) performing entity identification on the query request of the user;

Step 5, display: present the relevant official documents as conventional text content or as a concept graph and knowledge graph; the text display lets the user read the document content, and the graph display shows the official document together with its associated entities.

Further, in step 2, the entity recognition tool is built from the specialized dictionary, publicly available annotated corpora, and entity recognition annotated corpora from the official document field, forming an entity recognition tool for the official document field;

Claims

1. A knowledge graph-based official document searching method, characterized in that the method comprises the following steps:

Step 3, official document graph characterization: the entities, concepts, official documents and relationship weights form an official document connotation knowledge graph, and the official documents, entities and attributes form an official document field concept graph;

Step 4, searching: interpreting the user's query request against the concept graph from step 3 and returning the relevant official documents;

Step 5, display: presenting the relevant official documents as conventional text content or as a concept graph and knowledge graph;

step 2 comprises the following sub-steps:

(2.2) document cleaning: cleaning the document data;

(2.4) entity identification: obtaining an entity set from the dictionary with an entity recognition tool, and supplementing the entities it fails to identify by using the pointwise mutual information formula;

2. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: in step (2.4), the pointwise mutual information formula is as follows:

$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)}$$

3. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: in step (2.5), the hierarchical relations among the concepts in the concept set are extracted to form subClassOf relations, yielding a set of concept hierarchy relations.

4. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: in step (2.7), the probability of occurrence of each entity is:

$$P(w) = \frac{N_w}{N}$$

5. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: in step (2.8), the probability is:

$$P(c) = \frac{N_c}{N}$$

6. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: in step 3, the nodes of the concept graph are official documents, concepts, entities and attributes, and the edges are weighted relations among the entities.

7. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: the weighted relation is represented by the probability from step (2.7).

8. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: step 4 comprises the following sub-steps:

(4.1) performing entity identification on the query request of the user;

9. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: in step 2, the entity recognition tool is built from the specialized dictionary, publicly available annotated corpora, and entity recognition annotated corpora from the official document field, forming an entity recognition tool for the official document field.

10. The knowledge-graph-based official document searching method as claimed in claim 1, wherein: the entity recognition tool has the functions of word segmentation and part-of-speech tagging, and can obtain the weight of the relationship among entities, concepts and documents by counting the entities and concepts appearing in the documents.