CN109597895B - Knowledge graph-based official document searching method - Google Patents

Info

Publication number
CN109597895B
CN109597895B CN201811332469.XA CN201811332469A CN109597895B CN 109597895 B CN109597895 B CN 109597895B CN 201811332469 A CN201811332469 A CN 201811332469A CN 109597895 B CN109597895 B CN 109597895B
Authority
CN
China
Prior art keywords
document
entity
concept
graph
official
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811332469.XA
Other languages
Chinese (zh)
Other versions
CN109597895A (en)
Inventor
熊子奇
王鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC Big Data Research Institute Co Ltd
Original Assignee
CETC Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC Big Data Research Institute Co Ltd
Priority to CN201811332469.XA
Publication of CN109597895A
Application granted
Publication of CN109597895B

Abstract

The invention provides a knowledge graph-based official document searching method comprising the following steps: data collection, official document semantic description, official document graph description, searching, and display. The invention effectively addresses the problems of "one word with multiple meanings" (polysemy) and "multiple words with one meaning" (synonymy), updates publicly issued official documents in a timely manner, covers most major issuing bodies, and, because it is applied to a specialized field, produces more accurate and reasonable search results.

Description

Knowledge graph-based official document searching method
Technical Field
The invention relates to a knowledge graph-based official document searching method, and belongs to the field of search engines.
Background
A traditional search scheme is word-based. For Chinese, the documents to be searched must first be segmented into words, after which an inverted index is built: the structure "document 1- (word 1, word 2, …)" is converted into the structure "word 1- (document 1, document 2, …)" for storage. When a user searches, the search engine first segments the user's query into words, then looks up the documents containing those words in the inverted index, scores each document according to how often the words appear in it and other relevance factors, and finally returns the relevant documents ranked by score.
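The word-based scheme described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: it assumes the documents are already segmented into word lists, scores documents only by raw occurrence counts, and uses hypothetical names (`build_inverted_index`, `search`) and sample data.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, words in docs.items():
        for word in words:
            index[word].add(doc_id)
    return index

def search(index, docs, query_words):
    """Rank documents by how many times the query words occur in them."""
    scores = defaultdict(int)
    for word in query_words:
        for doc_id in index.get(word, ()):
            scores[doc_id] += docs[doc_id].count(word)
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical, already segmented documents.
docs = {
    "doc1": ["notice", "budget", "budget"],
    "doc2": ["notice", "personnel"],
}
index = build_inverted_index(docs)
ranked = search(index, docs, ["budget", "notice"])
```

Here `doc1` ranks ahead of `doc2` because the query words occur in it more often. A scheme of this kind matches surface words only, which is why it cannot tell apart different senses of the same word.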
The official document field covers a wide range of content, so general-purpose word segmentation tools struggle to meet its requirements. Training a word segmentation tool suited to the official document field from scratch, however, requires a massive amount of manually labeled corpus data. Official document search is mainly provided by search engines that the individual official document publishing websites build for themselves, so this approach is not suited to the current scenario.
These self-built services cannot effectively handle one word having multiple meanings or multiple words sharing one meaning. In addition, each self-built service covers only its own website and cannot track the official documents of other publishing websites in a timely manner. General-purpose search engines cannot distinguish official documents from other documents promptly and effectively, so their results are unfriendly to users retrieving official documents.
In summary, conventional search engines have the following problems in the official document field:
1. domain-specific search engines have limited coverage and do not cover all official document issuing bodies;
2. the search results of general-purpose search engines are mixed with other content and are poorly suited to official document search;
3. both have difficulty with "one word, multiple meanings" (polysemy) and "multiple words, one meaning" (synonymy) in the official document field.
Disclosure of Invention
To solve the above technical problems, the invention provides a knowledge graph-based official document searching method that addresses polysemy (one word with multiple meanings) and synonymy (multiple words with one meaning) in official document search, as well as the problems of data coverage and accurate querying in the official document field.
The invention is realized by the following technical scheme.
The invention provides a knowledge graph-based official document searching method, which comprises the following steps:
Step 1, data collection: crawl the official documents published by each official document publishing website and the data of encyclopedia websites to obtain official document data and encyclopedia data;
Step 2, official document semantic description: clean the official document data and encyclopedia data, extract terms from the encyclopedia data, construct a specialized dictionary, and use an entity recognition tool to obtain entities, concepts, attributes and relationship weights from the specialized dictionary;
Step 3, official document graph description: store the entities, concepts, attributes and weighted relationships in a knowledge graph storage format to form a graph;
Step 4, searching: interpret the user's query request according to the concept graph from Step 3, and return the relevant official documents;
Step 5, display: display the relevant official documents as traditional text content or as concept graph and knowledge graph views.
Step 2 comprises the following sub-steps:
(2.1) term extraction: clean the encyclopedia data and extract a large number of terms from it;
(2.2) official document cleaning: clean the official document data;
(2.3) specialized dictionary formation: combine the terms from step (2.1) and the official documents from step (2.2);
(2.4) entity recognition: obtain an entity set from the dictionary using an entity recognition tool, and supplement unrecognized entities using the pointwise mutual information formula;
(2.5) concept recognition: extract the keyword expressions that link entities and concepts according to part-of-speech tags, extract instanceOf relationships, and obtain a concept set;
(2.6) attribute classification: classify the attributes of the entities in the entity set and the concepts in the concept set;
(2.7) entity statistics: based on step (2.4), for each document from the official document publishing websites and encyclopedia websites, count the number of occurrences of each entity and the total number of occurrences of all entities;
(2.8) concept statistics: based on step (2.5), for each document from the official document publishing websites and encyclopedia websites, sum the occurrence counts of the entities belonging to the same concept to obtain the number of occurrences of that concept and the total number of occurrences of all concepts in the document.
In step (2.4), the pointwise mutual information formula is:
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$
where $p(x, y)$ represents the number of times the terms x and y appear in the same sentence in the specialized dictionary, $p(x)$ represents the number of times the term x appears in the specialized dictionary, and $p(y)$ represents the number of times the term y appears in the specialized dictionary.
In step (2.5), the hierarchical relationships among the concepts in the concept set are extracted to form subClassOf relationships, yielding a set of concept hierarchy relationships.
In step (2.7), the probability of each entity occurring is:
$$P(w) = \frac{N_w}{N}$$
where $N_w$ represents the number of times the entity w occurs in a given document, and $N$ is the total number of entity occurrences in that document.
In step (2.8), the probability is:
$$P(c) = \frac{N_c}{N}$$
where $N_c$ represents the number of occurrences of the concept c in a given document, and $N$ is the number of occurrences of all concepts in that document.
In Step 3, the nodes of the concept graph are official documents, concepts, entities and attributes, and the edges are the weighted relationships among the entities.
The weighted relationship is represented by the probability from step (2.7).
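A graph of this shape can be sketched with a small in-memory structure. The sketch below is illustrative only: the source does not name a storage engine or edge labels, so the `WeightedGraph` class, the labels `mentions` and `hasAttribute`, and the sample nodes are all assumptions. The weight on the document-to-entity edge stands in for the occurrence probability from step (2.7).

```python
from collections import defaultdict

class WeightedGraph:
    """Tiny in-memory graph with typed nodes and weighted, labeled, directed edges."""

    def __init__(self):
        self.node_type = {}              # node name -> "document" | "concept" | ...
        self.edges = defaultdict(dict)   # source -> {target: (label, weight)}

    def add_node(self, name, node_type):
        self.node_type[name] = node_type

    def add_edge(self, source, target, label, weight=1.0):
        self.edges[source][target] = (label, weight)

graph = WeightedGraph()
graph.add_node("doc1", "document")
graph.add_node("State Council", "entity")
graph.add_node("government body", "concept")
graph.add_node("organization type", "attribute")

# Hypothetical edges; 0.5 plays the role of the entity's probability in doc1.
graph.add_edge("doc1", "State Council", "mentions", weight=0.5)
graph.add_edge("State Council", "government body", "instanceOf")
graph.add_edge("State Council", "organization type", "hasAttribute")
```

A production system would more likely persist these nodes and edges in a graph database; the dictionary form is used here only to make the node types and edge weights explicit.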
Step 4 comprises the following sub-steps:
(4.1) perform entity recognition on the user's query request;
(4.2) use the entity recognition tool to segment the entities into words and tag their parts of speech;
(4.3) map the query words to the corresponding entities and concepts in the graph, and obtain the modification relationships between the entities and the concepts;
(4.4) retrieve the most relevant official documents according to the modification relationships between the entities and concepts and the relationship weights of the documents, and return them.
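The search sub-steps can be approximated as follows. This is a simplified sketch under stated assumptions: entity recognition is replaced by naive substring matching against known terms, the modification relationships between entities and concepts are not modeled, and a matched concept is simply expanded to all of its instance entities. The function names and sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical graph fragments: document -> {entity: weight}, where each weight
# plays the role of the entity's occurrence probability in that document.
doc_entity_weight = {
    "doc1": {"State Council": 0.5, "budget": 0.5},
    "doc2": {"Ministry of Finance": 0.75, "budget": 0.25},
}
# Hypothetical instanceOf mapping: entity -> concept.
entity_to_concept = {
    "State Council": "government body",
    "Ministry of Finance": "government body",
}

def recognize(query, vocabulary):
    """Naive recognizer: return the known terms that occur in the query string."""
    return [term for term in vocabulary if term.lower() in query.lower()]

def search_graph(query):
    """Return document ids ranked by the summed weights of matched entities."""
    entities = {e for weights in doc_entity_weight.values() for e in weights}
    concepts = set(entity_to_concept.values())
    matched_entities = set(recognize(query, entities))
    matched_concepts = set(recognize(query, concepts))
    # Expand each matched concept to every entity that is an instance of it.
    for entity, concept in entity_to_concept.items():
        if concept in matched_concepts:
            matched_entities.add(entity)
    scores = defaultdict(float)
    for doc_id, weights in doc_entity_weight.items():
        for entity in matched_entities:
            scores[doc_id] += weights.get(entity, 0.0)
    ranked = [d for d in scores if scores[d] > 0]
    return sorted(ranked, key=lambda d: (-scores[d], d))
```

With this data, a query naming the concept "government body" retrieves both documents even though neither contains that phrase, which illustrates how concept-level matching can help with the synonymy problem the source describes.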
In Step 2, the entity recognition tool is built from the specialized dictionary, publicly available annotated corpora, and entity recognition annotated corpora from the official document field, forming an entity recognition tool for the official document field;
the entity recognition tool provides word segmentation and part-of-speech tagging, and can derive the relationship weights among entities, concepts and documents by counting the entities and concepts that appear in the documents.
The beneficial effects of the invention are as follows: the problems of polysemy (one word with multiple meanings) and synonymy (multiple words with one meaning) are effectively addressed; publicly issued official documents can be updated in a timely manner; most major issuing bodies are covered; and, because the method is applied to a specialized field, the search results are more accurate and reasonable.
Detailed Description
The technical solution of the invention is further described below, but the claimed scope of protection is not limited to what is described.
A knowledge graph-based official document searching method comprises the following steps:
Step 1, data collection: crawl the official documents published by each official document publishing website and the data of encyclopedia websites to obtain official document data and encyclopedia data;
Step 2, official document semantic description: clean the official document data and encyclopedia data, extract terms from the encyclopedia data, construct a specialized dictionary, and use an entity recognition tool to obtain entities, concepts, attributes and relationship weights from the specialized dictionary;
preferably, an NLP tool is used to extract the sets of related entities, concepts and attributes from the specialized dictionary by means of pattern extraction, which specifically comprises the following sub-steps:
(2.1) term extraction: clean the encyclopedia data and extract a large number of terms from it;
(2.2) official document cleaning: clean the official document data;
(2.3) specialized dictionary formation: combine the terms from step (2.1) and the official documents from step (2.2);
(2.4) entity recognition: obtain an entity set from the dictionary using an entity recognition tool, and supplement unrecognized entities using the pointwise mutual information formula;
(2.5) concept recognition: extract the keyword expressions that link entities and concepts according to part-of-speech tags, extract instanceOf relationships, and obtain a concept set;
specifically, according to language habits and the characteristics of Chinese, a statement that an entity belongs to a certain concept (for example, "China electronic department is a company") places the entity and the concept in the same sentence, and such sentences occur in large numbers. Accordingly, a template self-learning strategy can expand a small number of labeled, high-quality seed templates into a set of high-quality, reliable recall templates, which are used to mine the keyword expressions that link entities and concepts. The instanceOf relationships between entities and concepts are then extracted in this way, yielding the set of concepts to which the entities belong;
(2.6) attribute classification: classify the attributes of the entities in the entity set and the concepts in the concept set;
(2.7) entity statistics: based on step (2.4), for each document from the official document publishing websites and encyclopedia websites, count the number of occurrences of each entity and the total number of occurrences of all entities;
(2.8) concept statistics: based on step (2.5), for each document from the official document publishing websites and encyclopedia websites, sum the occurrence counts of the entities belonging to the same concept to obtain the number of occurrences of that concept and the total number of occurrences of all concepts in the document.
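The template-based instanceOf extraction of step (2.5) can be illustrated with a few hand-written patterns. The sketch below covers only the matching stage with fixed seed templates; the template self-learning strategy that expands the seed set is not implemented. The regular expressions, the English sample sentences, and the name `extract_instance_of` are assumptions made for illustration (the source works on Chinese text).

```python
import re

# Hypothetical seed templates; each exposes named groups "entity" and "concept".
SEED_TEMPLATES = [
    re.compile(r"^(?P<entity>.+?) is an? (?P<concept>.+?)\.?$"),
    re.compile(r"^(?P<entity>.+?) belongs to the category of (?P<concept>.+?)\.?$"),
]

def extract_instance_of(sentences, templates=SEED_TEMPLATES):
    """Return the set of (entity, concept) pairs matched by any template."""
    pairs = set()
    for sentence in sentences:
        for template in templates:
            match = template.match(sentence.strip())
            if match:
                pairs.add((match.group("entity"), match.group("concept")))
                break  # the first matching template wins for this sentence
    return pairs

# Hypothetical input sentences.
sentences = [
    "Example Corp is a company.",
    "Guiyang belongs to the category of city.",
    "The notice takes effect immediately.",
]
pairs = extract_instance_of(sentences)
```

The third sentence matches no template and contributes nothing, which is the behavior wanted from high-precision seed templates; recall would then come from the learned templates.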
Further, in step (2.4), the pointwise mutual information formula is:
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$
where $p(x, y)$ represents the number of times the terms x and y appear in the same sentence in the specialized dictionary, $p(x)$ represents the number of times the term x appears in the specialized dictionary, and $p(y)$ represents the number of times the term y appears in the specialized dictionary.
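A pointwise mutual information score of this kind can be computed as sketched below, using the standard definition PMI(x, y) = log(p(x, y) / (p(x) p(y))). Two points are assumptions rather than details from the source: the occurrence counts are normalized by the number of sentences to obtain probabilities, and the natural logarithm is used. The function name `pmi_table` and the sample sentences are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_table(sentences):
    """Compute PMI for every pair of terms that co-occur in at least one sentence."""
    n = len(sentences)
    term_count = Counter()
    pair_count = Counter()
    for terms in sentences:
        unique = sorted(set(terms))          # count each term once per sentence
        term_count.update(unique)
        pair_count.update(combinations(unique, 2))
    scores = {}
    for (x, y), c_xy in pair_count.items():
        p_xy = c_xy / n
        p_x = term_count[x] / n
        p_y = term_count[y] / n
        scores[(x, y)] = math.log(p_xy / (p_x * p_y))
    return scores

# Hypothetical segmented sentences from the dictionary corpus.
sentences = [
    ["big", "data"],
    ["big", "data"],
    ["data", "office"],
    ["office", "notice"],
]
scores = pmi_table(sentences)
```

Pairs with a high positive score, such as ("big", "data") here, co-occur more often than chance would predict and are natural candidates for multi-word entities that the recognition tool missed. The source does not state a threshold, so none is applied.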
Further, in step (2.5), the hierarchical relationships among the concepts in the concept set are extracted to form subClassOf relationships, yielding a set of concept hierarchy relationships. For example, the hierarchical relationships between concepts are obtained through patterns such as "is a" and "is a subclass of …", which form the subClassOf relationships among the concepts.
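Once subClassOf pairs have been extracted, the concept hierarchy can be traversed to find every broader concept of a given concept. The sketch below assumes each concept has at most one parent, which the source does not state; the function name `ancestors` and the sample hierarchy are hypothetical.

```python
def ancestors(concept, sub_class_of):
    """Return all ancestors of a concept, nearest first, following subClassOf edges."""
    chain = []
    seen = {concept}  # guards against cycles in noisy extracted data
    current = sub_class_of.get(concept)
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = sub_class_of.get(current)
    return chain

# Hypothetical subClassOf pairs: child concept -> parent concept.
sub_class_of = {
    "state-owned enterprise": "company",
    "company": "organization",
}
broader = ancestors("state-owned enterprise", sub_class_of)
```

A lookup of this kind would let a query for a broad concept also reach documents that mention only its narrower sub-concepts.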
Further, in step (2.7), the probability of each entity occurring is:
$$P(w) = \frac{N_w}{N}$$
where $N_w$ represents the number of times the entity w occurs in a given document, and $N$ is the total number of entity occurrences in that document.
Further, in step (2.8), the probability is:
$$P(c) = \frac{N_c}{N}$$
where $N_c$ represents the number of occurrences of the concept c in a given document, and $N$ is the number of occurrences of all concepts in that document.
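Both probabilities are relative frequencies within a single document, so one helper covers the entity and the concept case. In the sketch below the entity mentions and the entity-to-concept mapping are hypothetical sample data, and concept counts are obtained by mapping each entity mention to its concept before counting, as step (2.8) describes.

```python
from collections import Counter

def occurrence_probabilities(mentions):
    """Return {item: N_item / N} for the list of mentions found in one document."""
    counts = Counter(mentions)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}

# Hypothetical entity mentions recognized in one document.
entity_mentions = [
    "State Council",
    "Ministry of Finance",
    "State Council",
    "Guizhou Province",
]
# Hypothetical instanceOf mapping: entity -> concept.
entity_to_concept = {
    "State Council": "government body",
    "Ministry of Finance": "government body",
    "Guizhou Province": "region",
}

entity_prob = occurrence_probabilities(entity_mentions)
concept_prob = occurrence_probabilities(
    [entity_to_concept[e] for e in entity_mentions]
)
```

In this document "State Council" has probability 2/4 and the concept "government body" has probability 3/4, since three of the four entity mentions are instances of it.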
Step 3, official document graph description: the entities, concepts, official documents and relationship weights form an official document connotation knowledge graph, and the official documents, entities and attributes form an official document field concept graph; the nodes of the concept graph are official documents, concepts, entities and attributes, and the edges are the weighted relationships among the entities; the degree of correlation between the two is described by means of the graph.
Further, the weighted relationship is represented by the probability from step (2.7).
Step 4, searching: interpret the user's query request according to the concept graph from Step 3, and return the relevant official documents; this comprises the following sub-steps:
(4.1) perform entity recognition on the user's query request;
(4.2) use the entity recognition tool to segment the entities into words and tag their parts of speech;
(4.3) map the query words to the corresponding entities and concepts in the graph, and obtain the modification relationships between the entities and the concepts;
(4.4) retrieve the most relevant official documents according to the modification relationships between the entities and concepts and the relationship weights of the documents, and return them.
Step 5, display: display the relevant official documents as traditional text content or as concept graph and knowledge graph views; the text display lets the user read the document content, and the graph display shows the document together with its related entities.
Further, in Step 2, the entity recognition tool is built from the specialized dictionary, publicly available annotated corpora, and entity recognition annotated corpora from the official document field, forming an entity recognition tool for the official document field;
the entity recognition tool provides word segmentation and part-of-speech tagging, and can derive the relationship weights among entities, concepts and documents by counting the entities and concepts that appear in the documents.

Claims (10)

1. A knowledge graph-based official document searching method, characterized in that the method comprises the following steps:
Step 1, data collection: crawling the official documents published by each official document publishing website and the data of encyclopedia websites to obtain official document data and encyclopedia data;
Step 2, official document semantic description: cleaning the official document data and encyclopedia data, extracting terms from the encyclopedia data, constructing a specialized dictionary, and using an entity recognition tool to obtain entities, concepts, attributes and relationship weights from the specialized dictionary;
Step 3, official document graph description: forming an official document connotation knowledge graph from the entities, concepts, official documents and relationship weights, and forming an official document field concept graph from the official documents, entities and attributes;
Step 4, searching: interpreting the user's query request according to the concept graph from Step 3, and returning the relevant official documents;
Step 5, display: displaying the relevant official documents as traditional text content or as concept graph and knowledge graph views;
wherein Step 2 comprises the following sub-steps:
(2.1) term extraction: cleaning the encyclopedia data and extracting a large number of terms from it;
(2.2) official document cleaning: cleaning the official document data;
(2.3) specialized dictionary formation: combining the terms from step (2.1) and the official documents from step (2.2);
(2.4) entity recognition: obtaining an entity set from the dictionary using an entity recognition tool, and supplementing unrecognized entities using the pointwise mutual information formula;
(2.5) concept recognition: extracting the keyword expressions that link entities and concepts according to part-of-speech tags, extracting instanceOf relationships, and obtaining a concept set;
(2.6) attribute classification: classifying the attributes of the entities in the entity set and the concepts in the concept set;
(2.7) entity statistics: based on step (2.4), for each document from the official document publishing websites and encyclopedia websites, counting the number of occurrences of each entity and the total number of occurrences of all entities;
(2.8) concept statistics: based on step (2.5), for each document from the official document publishing websites and encyclopedia websites, summing the occurrence counts of the entities belonging to the same concept to obtain the number of occurrences of that concept and the total number of occurrences of all concepts in the document.
2. The knowledge graph-based official document searching method as claimed in claim 1, wherein: in step (2.4), the pointwise mutual information formula is:
$$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$
where $p(x, y)$ represents the number of times the terms x and y appear in the same sentence in the specialized dictionary, $p(x)$ represents the number of times the term x appears in the specialized dictionary, and $p(y)$ represents the number of times the term y appears in the specialized dictionary.
3. The knowledge graph-based official document searching method as claimed in claim 1, wherein: in step (2.5), the hierarchical relationships among the concepts in the concept set are extracted to form subClassOf relationships, yielding a set of concept hierarchy relationships.
4. The knowledge graph-based official document searching method as claimed in claim 1, wherein: in step (2.7), the probability of each entity occurring is:
$$P(w) = \frac{N_w}{N}$$
where $N_w$ represents the number of times the entity w occurs in a given document, and $N$ is the total number of entity occurrences in that document.
5. The knowledge graph-based official document searching method as claimed in claim 1, wherein: in step (2.8), the probability is:
$$P(c) = \frac{N_c}{N}$$
where $N_c$ represents the number of occurrences of the concept c in a given document, and $N$ is the number of occurrences of all concepts in that document.
6. The knowledge graph-based official document searching method as claimed in claim 1, wherein: in Step 3, the nodes of the concept graph are official documents, concepts, entities and attributes, and the edges are the weighted relationships among the entities.
7. The knowledge graph-based official document searching method as claimed in claim 1, wherein: the weighted relationship is represented by the probability from step (2.7).
8. The knowledge graph-based official document searching method as claimed in claim 1, wherein: Step 4 comprises the following sub-steps:
(4.1) performing entity recognition on the user's query request;
(4.2) using the entity recognition tool to segment the entities into words and tag their parts of speech;
(4.3) mapping the query words to the corresponding entities and concepts in the graph, and obtaining the modification relationships between the entities and the concepts;
(4.4) retrieving the most relevant official documents according to the modification relationships between the entities and concepts and the relationship weights of the documents, and returning them.
9. The knowledge graph-based official document searching method as claimed in claim 1, wherein: in Step 2, the entity recognition tool is built from the specialized dictionary, publicly available annotated corpora, and entity recognition annotated corpora from the official document field, forming an entity recognition tool for the official document field.
10. The knowledge graph-based official document searching method as claimed in claim 1, wherein: the entity recognition tool provides word segmentation and part-of-speech tagging, and can derive the relationship weights among entities, concepts and documents by counting the entities and concepts that appear in the documents.
CN201811332469.XA 2018-11-09 2018-11-09 Knowledge graph-based official document searching method Active CN109597895B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811332469.XA CN109597895B (en) 2018-11-09 2018-11-09 Knowledge graph-based official document searching method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811332469.XA CN109597895B (en) 2018-11-09 2018-11-09 Knowledge graph-based official document searching method

Publications (2)

Publication Number Publication Date
CN109597895A (en) 2019-04-09
CN109597895B (en) 2021-10-22

Family

ID=65957210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811332469.XA Active CN109597895B (en) 2018-11-09 2018-11-09 Knowledge graph-based official document searching method

Country Status (1)

Country Link
CN (1) CN109597895B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111859974A (en) * 2019-04-22 2020-10-30 广东小天才科技有限公司 Semantic disambiguation method and device combined with knowledge graph and intelligent learning equipment
CN110188186A (en) * 2019-04-24 2019-08-30 平安科技(深圳)有限公司 Content recommendation method, electronic device, equipment and the storage medium of medical field
CN111966816B (en) * 2020-07-09 2022-07-12 福建亿榕信息技术有限公司 Intelligent association method and system for official documents
CN112364172A (en) * 2020-10-16 2021-02-12 上海晏鼠计算机技术股份有限公司 Method for constructing knowledge graph in government official document field
CN116028597B (en) * 2023-03-27 2023-07-21 南京燧坤智能科技有限公司 Object retrieval method, device, nonvolatile storage medium and computer equipment

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8843434B2 (en) * 2006-02-28 2014-09-23 Netseer, Inc. Methods and apparatus for visualizing, managing, monetizing, and personalizing knowledge search results on a user interface
CN103488724A (en) * 2013-09-16 2014-01-01 复旦大学 Book-oriented reading field knowledge map construction method
CN104462501A (en) * 2014-12-19 2015-03-25 北京奇虎科技有限公司 Knowledge graph construction method and device based on structural data
CN107908671A (en) * 2017-10-25 2018-04-13 南京擎盾信息科技有限公司 Knowledge mapping construction method and system based on law data
CN108009299A (en) * 2017-12-28 2018-05-08 北京市律典通科技有限公司 Law tries method and device for business processing
CN108446367A (en) * 2018-03-15 2018-08-24 湖南工业大学 A kind of the packaging industry data search method and equipment of knowledge based collection of illustrative plates

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
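An accurate and efficient method for constructing domain knowledge graphs; Yang Yuji (杨玉基) et al.; Journal of Software (《软件学报》); 2018-02-08; pp. 2932-2949 *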

Also Published As

Publication number Publication date
CN109597895A (en) 2019-04-09

Similar Documents

Publication Publication Date Title
CN109597895B (en) Knowledge graph-based official document searching method
CN109492077B (en) Knowledge graph-based petrochemical field question-answering method and system
CN109800284B (en) Task-oriented unstructured information intelligent question-answering system construction method
JP5252725B2 (en) System, method, and software for hyperlinking names
US7593920B2 (en) System, method, and software for identifying historically related legal opinions
Sarawagi et al. Open-domain quantity queries on web tables: annotation, response, and consensus models
CN108197117A (en) A kind of Chinese text keyword extracting method based on document subject matter structure with semanteme
CN103488724A (en) Book-oriented reading field knowledge map construction method
CN105045852A (en) Full-text search engine system for teaching resources
CN105528411B (en) Apparel interactive electronic technical manual full-text search device and method
CN108509521B (en) Image retrieval method for automatically generating text index
CN112559684A (en) Keyword extraction and information retrieval method
CN112364172A (en) Method for constructing knowledge graph in government official document field
CN105608232A (en) Bug knowledge modeling method based on graphic database
CN110705292B (en) Entity name extraction method based on knowledge base and deep learning
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN102929902A (en) Character splitting method and device based on Chinese retrieval
CN112148885A (en) Intelligent searching method and system based on knowledge graph
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN112149422B (en) Dynamic enterprise news monitoring method based on natural language
CN114090861A (en) Education field search engine construction method based on knowledge graph
CN112036178A (en) Distribution network entity related semantic search method
US11487795B2 (en) Template-based automatic software bug question and answer method
CN112148938B (en) Cross-domain heterogeneous data retrieval system and retrieval method
Chang Domain specific word extraction from hierarchical Web documents: A first step toward building lexicon trees from Web corpora

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant