CN110750698A - Knowledge graph construction method and device, computer equipment and storage medium

Info

Publication number: CN110750698A
Application number: CN201910848696.6A
Authority: CN
Inventors: 董润华; 徐国强
Original assignee: OneConnect Smart Technology Co Ltd
Current assignee: OneConnect Smart Technology Co Ltd
Priority date: 2019-09-09
Filing date: 2019-09-09
Publication date: 2020-02-04
Also published as: WO2021047188A1

Abstract

The application relates to the field of data analysis and knowledge graphs, and in particular to a knowledge graph construction method and device, computer equipment and a storage medium. The method comprises: crawling knowledge information of a plurality of candidate webpages according to seed vocabulary; acquiring extended vocabulary according to the jump links; acquiring the vocabulary labels of the seed vocabulary on the candidate webpages, and constructing a hypernym set (upper word set); performing word filtering on the extended vocabulary according to the hypernym set, and acquiring a target word set according to the filtered extended vocabulary and the seed vocabulary; and constructing a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations. The method obtains seed vocabulary, follows jump links from the seed vocabulary to obtain extended vocabulary, classifies the extended vocabulary by means of the hypernym set to obtain the target word set, and then constructs the knowledge graph from the target word set, so that construction is efficient in a specialized (vertical) field.

Description

Knowledge graph construction method and device, computer equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for constructing a knowledge graph, a computer device, and a storage medium.

Background

A knowledge graph, also called a scientific knowledge graph and known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphs that display the development process and structural relationships of knowledge. It describes knowledge resources and their carriers by means of visualization technology, and mines, analyzes, constructs, draws and displays knowledge and the interrelations among knowledge items. It takes entities or concepts as nodes, which are connected through semantic relations. By discovering associations between entities and integrating semi-structured and unstructured data, a knowledge graph can help a machine understand data, explain phenomena and perform knowledge reasoning, thereby discovering deep-level relationships and enabling intelligent search and intelligent interaction.

In traditional methods for constructing a knowledge graph in a vertical field, the knowledge domain of each vocabulary item must be identified and screened repeatedly during the knowledge acquisition stage to ensure that the acquired knowledge belongs to the current field, so the knowledge graph is constructed inefficiently.

Disclosure of Invention

In view of this, and to address the problem that existing knowledge graph construction processes lose efficiency in ensuring that the acquired knowledge belongs to the current field, it is necessary to provide an efficient knowledge graph construction method, device, computer equipment and storage medium.

A method of knowledge-graph construction, the method comprising:

crawling knowledge information of a plurality of candidate webpages according to seed vocabulary, wherein the candidate webpages contain jump links related to the seed vocabulary, and the seed vocabulary is knowledge vocabulary in the field to which the knowledge graph to be constructed belongs;

acquiring extended vocabulary according to the jump links;

acquiring the vocabulary labels of the seed vocabulary on the candidate webpages, and constructing a hypernym set (upper word set);

performing word filtering on the extended vocabulary according to the hypernym set, and acquiring a target word set according to the filtered extended vocabulary and the seed vocabulary;

and constructing a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations.

In one embodiment, before the constructing of a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations, the method further comprises:

crawling knowledge information of a plurality of candidate webpages according to the extended vocabulary that remains after word filtering, and acquiring iterative jump links related to that extended vocabulary;

acquiring iterative extended vocabulary according to the iterative jump links;

performing word filtering on the iterative extended vocabulary according to the hypernym set, and updating the target word set according to the filtered iterative extended vocabulary;

taking the iterative extended vocabulary as the new filtered extended vocabulary, and returning to the step of crawling knowledge information of a plurality of candidate webpages according to the filtered extended vocabulary and acquiring the related iterative jump links, until all of the latest iterative extended vocabulary is filtered out by the hypernym set;

the constructing of a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations comprises:

constructing a knowledge graph according to the hypernym set, the seed vocabulary in the updated target word set, the extended vocabulary, the jump links, the iterative extended vocabulary and the iterative jump links.

In one embodiment, before the performing of word filtering on the iterative extended vocabulary according to the hypernym set and the updating of the target word set according to the filtered iterative extended vocabulary, the method further comprises:

reconstructing the hypernym set according to the vocabulary labels, on the candidate webpages, of the seed vocabulary, the extended vocabulary and the iterative extended vocabulary in the target word set.

In one embodiment, the performing of word filtering on the extended vocabulary according to the hypernym set, and the acquiring of a target word set according to the filtered extended vocabulary and the seed vocabulary, comprises:

acquiring each vocabulary label corresponding to the extended vocabulary;

filtering out, from the extended vocabulary, every item for which the proportion of vocabulary labels belonging to the hypernym set, among all vocabulary labels of that item, is less than or equal to a preset classification threshold;

and acquiring the target word set according to the filtered extended vocabulary and the seed vocabulary.

In one embodiment, before the crawling of knowledge information of a plurality of candidate webpages according to seed vocabulary, wherein the candidate webpages contain jump links related to the seed vocabulary, the method further comprises:

acquiring domain information corresponding to the knowledge graph to be constructed;

and, according to the domain information, crawling the seed vocabulary of the field to which the knowledge graph to be constructed belongs from the domain classification tree of a third-party platform, based on the Scrapy crawler framework and an XPath parsing library.

In one embodiment, before the crawling of knowledge information of a plurality of candidate webpages according to the seed vocabulary, the method further comprises:

searching for identical vocabulary in the seed vocabulary;

searching for synonymous vocabulary in the seed vocabulary through semantic dependency analysis;

and deduplicating the seed vocabulary according to the identical vocabulary and the synonymous vocabulary.

In one embodiment, the acquiring of the vocabulary labels of the seed vocabulary on the candidate webpages and the constructing of the hypernym set comprises:

acquiring the vocabulary labels corresponding to the seed vocabulary;

and, when a vocabulary label is associated with a preset core vocabulary item, assigning the vocabulary label to the hypernym set.

A knowledge-graph building apparatus, the apparatus comprising:

a vocabulary information acquisition module, configured to crawl knowledge information of a plurality of candidate webpages according to seed vocabulary, wherein the candidate webpages contain jump links related to the seed vocabulary, and the seed vocabulary is knowledge vocabulary in the field to which the knowledge graph to be constructed belongs;

an extended vocabulary identification module, configured to acquire extended vocabulary according to the jump links;

a hypernym set construction module, configured to acquire the vocabulary labels of the seed vocabulary on the candidate webpages and construct a hypernym set;

a word set filtering module, configured to perform word filtering on the extended vocabulary according to the hypernym set, and acquire a target word set according to the filtered extended vocabulary and the seed vocabulary;

and a graph construction module, configured to construct a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations.

In one embodiment, the system further includes a seed vocabulary acquiring module, configured to:

A computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:

acquiring an extended vocabulary according to the jump link;

A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the steps of:

acquiring an extended vocabulary according to the jump link;

With the above knowledge graph construction method, device, computer equipment and storage medium, knowledge information of a plurality of candidate webpages is first crawled according to seed vocabulary; extended vocabulary is acquired according to the jump links; the vocabulary labels of the seed vocabulary on the candidate webpages are acquired and a hypernym set is constructed; word filtering is performed on the extended vocabulary according to the hypernym set, and a target word set is acquired according to the filtered extended vocabulary and the seed vocabulary; and a knowledge graph is constructed according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations. The method obtains seed vocabulary, follows jump links from the seed vocabulary to obtain extended vocabulary, classifies the extended vocabulary by means of the hypernym set to obtain the target word set, and then constructs the knowledge graph from the target word set, so that construction is efficient in a specialized (vertical) field.

Drawings

FIG. 1 is a diagram of an application environment of a method for knowledge graph construction in one embodiment;

FIG. 2 is a schematic flow diagram of a method for knowledge graph construction in one embodiment;

FIG. 3 is a schematic sub-flow chart of step S100 of FIG. 2 in one embodiment;

FIG. 4 is a schematic flow chart diagram of a method of knowledge graph construction in another embodiment;

FIG. 5 is a block diagram showing the structure of a knowledge-graph constructing apparatus according to an embodiment;

FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The knowledge graph construction method provided by the application can be applied to the application environment shown in FIG. 1, in which a graph construction server 102 communicates with a third-party platform server 104 over a network and looks up the corresponding data through the third-party platform server 104. The graph construction server 102 first crawls knowledge information of a plurality of candidate webpages from the third-party platform server 104 according to seed vocabulary, where the candidate webpages contain jump links related to the seed vocabulary and the seed vocabulary is knowledge vocabulary in the field to which the knowledge graph to be constructed belongs. The graph construction server 102 then acquires extended vocabulary according to the jump links; acquires the vocabulary labels of the seed vocabulary on the candidate webpages and constructs a hypernym set; performs word filtering on the extended vocabulary according to the hypernym set, and acquires a target word set according to the filtered extended vocabulary and the seed vocabulary; and constructs a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations.

As shown in fig. 2, in one embodiment, the method for constructing a knowledge graph of the present application is implemented by a graph construction server, and specifically includes the following steps:

S100, crawling knowledge information of a plurality of candidate webpages according to seed vocabulary, wherein the candidate webpages contain jump links related to the seed vocabulary, and the seed vocabulary is knowledge vocabulary in the field to which the knowledge graph to be constructed belongs.

The knowledge graph, also called a scientific knowledge graph, is a large-scale semantic network that takes entities or concepts as nodes connected through semantic relations. By discovering associations between entities and integrating semi-structured and unstructured data, the knowledge graph can help a machine understand data, explain phenomena and perform knowledge reasoning, thereby discovering deep-level relationships and enabling intelligent search and intelligent interaction. The field is the domain that the knowledge graph serves; the knowledge graph to be constructed is specialized and is directed at a particular vertical field. The seed vocabulary refers to concept vocabulary that is common or important in that vertical field. As to the source of the seed vocabulary, in one embodiment the seed vocabulary may be crawled from the classification tree that an encyclopedia website platform provides for the current field. In another embodiment, technical terms may be obtained from a knowledge base of the current field as the seed vocabulary. In another embodiment, related concept vocabulary may be obtained from the literature of the current field as the seed vocabulary. The candidate webpages are webpages that can be provided by a third-party platform; in one embodiment, a plurality of encyclopedia webpages corresponding to the seed vocabulary can be searched for and used as the candidate webpages of the seed vocabulary.

The knowledge information is the information that explains and describes the seed vocabulary, including its definition, extended explanations and the like. One example is the definition given at the beginning of an encyclopedia entry: for the term "blockchain", the encyclopedia's explanation "blockchain is distributed, which is executed from 2019, 2/15." is the knowledge information of the seed vocabulary "blockchain". The knowledge information contains a number of words that help explain the seed vocabulary, and these include jump vocabulary, that is, words from which the page can jump to another vocabulary entry. A jump link is the link corresponding to a jump vocabulary item. In one embodiment, the jump link corresponding to a jump vocabulary item can be located by reading the webpage source code of the candidate webpage. In another embodiment, the jump vocabulary in the knowledge information can be identified by a feature recognition technique, and the jump operation is then performed at the position of the jump vocabulary to obtain the jump link.

S300, acquiring the extended vocabulary according to the jump links.

According to the seed vocabulary, the server can look up the concept information corresponding to the seed vocabulary on the encyclopedia website platform, identify the jump links in the concept information, and acquire the extended vocabulary according to the jump links, so that the knowledge graph is enriched by the extended vocabulary.
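
One way to picture the step of locating jump links by reading the webpage code is the minimal sketch below, which uses only Python's standard `html.parser`. The `/item/` link prefix, the sample markup and the names `JumpLinkParser` and `extract_jump_links` are assumptions made for illustration; they are not taken from the application or from any particular encyclopedia platform.

```python
from html.parser import HTMLParser


class JumpLinkParser(HTMLParser):
    """Collect (jump word, jump link) pairs from the HTML of a candidate page."""

    def __init__(self):
        super().__init__()
        self.links = []      # list of (jump word, jump link)
        self._href = None    # href of the <a> element currently open, if any
        self._text = []      # text fragments seen inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Assumption: entry-to-entry links share a hypothetical "/item/" prefix.
            if href and href.startswith("/item/"):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            word = "".join(self._text).strip()
            if word:
                self.links.append((word, self._href))
            self._href = None


def extract_jump_links(html):
    """Return the jump vocabulary and jump links found in one page."""
    parser = JumpLinkParser()
    parser.feed(html)
    return parser.links


# Hypothetical knowledge information for the seed word "blockchain".
sample = (
    '<p>A blockchain is a <a href="/item/distributed-ledger">distributed ledger</a> '
    'secured by a <a href="/item/consensus">consensus mechanism</a>. '
    '<a href="/login">Log in</a></p>'
)
jump_links = extract_jump_links(sample)
```

The anchor text of each retained link plays the role of an extended vocabulary item, while links that do not point to another entry (here the login link) are ignored.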

S500, acquiring the vocabulary labels of the seed vocabulary on the candidate webpages, and constructing a hypernym set.

The vocabulary labels are highly generalized terms for the seed vocabulary and are hypernyms of the seed vocabulary, so the hypernym set for the field can be constructed from the vocabulary labels of all the seed vocabulary. A vocabulary label is a content label that an online encyclopedia platform attaches to a vocabulary entry, for example the labels shown at the bottom of an encyclopedia entry page. For the term "blockchain", the labels "scientific encyclopedia vocabulary scientific classification", "finance" and "internet" can be recognized as the vocabulary labels of the seed vocabulary "blockchain". After acquiring the vocabulary labels of each seed vocabulary item, the server can construct the corresponding hypernym set from them.

S700, performing word filtering on the extended vocabulary according to the hypernym set, and acquiring a target word set according to the filtered extended vocabulary and the seed vocabulary.

The vocabulary obtained by jumping is screened against the hypernym set, and vocabulary that does not belong to the current vertical field is removed. Because the extended vocabulary is produced through a jump, it may no longer belong to the current vertical field. The field of each extended vocabulary item therefore needs to be checked against the hypernym set; the vocabulary that belongs to the current vertical field is selected as iterative seed vocabulary, and its concept information is acquired to form the knowledge graph.

S900, constructing a knowledge graph according to the hypernym set, the seed vocabulary and the extended vocabulary in the target word set, and the jump link relations.

The jump relations between the seed vocabulary and the extended vocabulary can be acquired, and the knowledge graph is built from the information determined so far. For example, the hypernym set provides the highest-level knowledge nodes, the seed vocabulary provides the nodes of the next level, and the extended vocabulary provides the nodes of the level below that. The connection network of the knowledge graph is established from the label relations between the hypernym set and the seed vocabulary and from the jump link relations between the seed vocabulary and the extended vocabulary, and the knowledge information corresponding to each node is stored at that node.
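
The layered structure described above can be sketched with plain Python dictionaries and sets. This is only an illustration under stated assumptions: the function name `build_graph`, the relation names `label_of` and `jump_link`, the numeric levels and the sample data are all hypothetical, and a real system would more likely persist the result in a graph store.

```python
def build_graph(hypernym_tags, jump_links, knowledge):
    """Assemble nodes and edges of a small knowledge graph.

    hypernym_tags: dict mapping each seed word to the hypernym labels it carries
    jump_links:    dict mapping a source word to the words reached by its jump links
    knowledge:     dict mapping a word to its knowledge information text
    """
    nodes = {}     # word -> {"level": int, "info": str}
    edges = set()  # (source, relation, target)

    def add_node(word, level):
        # setdefault keeps the level assigned the first time a word is seen.
        nodes.setdefault(word, {"level": level, "info": knowledge.get(word, "")})

    # Level 0: hypernym labels, linked to the seed words they label.
    for seed, tags in hypernym_tags.items():
        for tag in tags:
            add_node(tag, 0)
            edges.add((tag, "label_of", seed))

    # Level 1: seed words.
    for seed in hypernym_tags:
        add_node(seed, 1)

    # Level 2: extended words, linked by the jump links that produced them.
    for source, targets in jump_links.items():
        add_node(source, 2)
        for target in targets:
            add_node(target, 2)
            edges.add((source, "jump_link", target))

    return nodes, edges


# Hypothetical sample data.
nodes, edges = build_graph(
    hypernym_tags={"blockchain": {"finance", "internet"}},
    jump_links={"blockchain": {"consensus mechanism"}},
    knowledge={"blockchain": "A distributed ledger technology."},
)
```

Storing the knowledge information on the node itself mirrors the statement that each node keeps its own knowledge information.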

In the above knowledge graph construction method, knowledge information of a plurality of candidate webpages is crawled according to seed vocabulary; extended vocabulary is acquired according to the jump links; the vocabulary labels of the seed vocabulary on the candidate webpages are acquired and a hypernym set is constructed; word filtering is performed on the extended vocabulary according to the hypernym set, and a target word set is acquired according to the filtered extended vocabulary and the seed vocabulary; and a knowledge graph is constructed according to the seed vocabulary and the extended vocabulary in the target word set and the jump links. The method obtains seed vocabulary, follows jump links from the seed vocabulary to obtain extended vocabulary, classifies the extended vocabulary by means of the hypernym set to obtain the target word set, and then constructs the knowledge graph from the target word set, so that construction is efficient in a specialized (vertical) field.

In one embodiment, step S900 is preceded by:

crawling knowledge information of a plurality of candidate webpages according to the extended vocabulary that remains after word filtering, and acquiring iterative jump links related to that extended vocabulary;

acquiring iterative extended vocabulary according to the iterative jump links;

performing word filtering on the iterative extended vocabulary according to the hypernym set, and updating the target word set according to the filtered iterative extended vocabulary;

taking the iterative extended vocabulary as the new filtered extended vocabulary, and returning to the step of crawling knowledge information of a plurality of candidate webpages according to the filtered extended vocabulary and acquiring the related iterative jump links, until all of the latest iterative extended vocabulary is filtered out by the hypernym set.

The extended vocabulary obtained for the seed vocabulary can be used as new seed vocabulary for a further round of jumping. Iterative extended vocabulary is obtained through repeated jumping, and the knowledge information corresponding to the iterative extended vocabulary also needs to be acquired. The vocabulary reached in each iteration is screened against the hypernym set so that vocabulary not belonging to the current vertical field is filtered out, and repeated iteration can greatly enrich the vertical-field knowledge graph. The iteration ends when no new iterative extended vocabulary is found, or when all of the latest iterative extended vocabulary is filtered out by the hypernym set.
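
The iteration and its stopping condition can be sketched as a breadth-first expansion. In this hedged example, `get_jump_words` stands in for crawling a page and reading its jump links, `keep` stands in for the hypernym-set filter, and `max_rounds` is a safety cap added for the sketch; none of these names come from the application, and the toy data is invented.

```python
def expand_iteratively(seeds, get_jump_words, keep, max_rounds=10):
    """Expand a seed vocabulary round by round until nothing new survives filtering.

    seeds:          iterable of seed words
    get_jump_words: callable word -> iterable of words reached by its jump links
    keep:           callable word -> bool, True if the word passes the domain filter
    """
    target = set(seeds)    # target word set, grows each round
    frontier = set(seeds)  # words whose pages are crawled in the current round
    links = set()          # retained (source, target) jump link relations
    rounds = 0

    while frontier and rounds < max_rounds:
        next_frontier = set()
        for word in frontier:
            for candidate in get_jump_words(word):
                if candidate in target:
                    links.add((word, candidate))  # link between known words
                elif keep(candidate):
                    links.add((word, candidate))
                    target.add(candidate)
                    next_frontier.add(candidate)  # crawl it in the next round
        frontier = next_frontier  # empty when every new word was filtered out
        rounds += 1

    return target, links


# Hypothetical toy data standing in for crawled pages and the domain filter.
PAGES = {
    "blockchain": ["consensus mechanism", "football"],
    "consensus mechanism": ["proof of work", "blockchain"],
    "proof of work": [],
}
IN_DOMAIN = {"consensus mechanism", "proof of work"}

target, links = expand_iteratively(
    ["blockchain"],
    get_jump_words=lambda w: PAGES.get(w, []),
    keep=lambda w: w in IN_DOMAIN,
)
```

The loop stops in exactly the two situations named in the text: no new vocabulary is reached, or every newly reached word is rejected by the filter, so the next frontier is empty.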

In one embodiment, before the performing of word filtering on the iterative extended vocabulary according to the hypernym set and the updating of the target word set according to the filtered iterative extended vocabulary, the method further includes:

reconstructing the hypernym set according to the vocabulary labels, on the candidate webpages, of the seed vocabulary, the extended vocabulary and the iterative extended vocabulary in the target word set.

The hypernym set can be reconstructed from the vocabulary labels of the filtered extended vocabulary or of the iterative extended vocabulary, and continually reconstructing the hypernym set can steadily improve the coverage of the vertical-field knowledge graph being built.

As shown in fig. 3, in one embodiment, S700 includes:

S720, acquiring each vocabulary label corresponding to the extended vocabulary;

S740, filtering out, from the extended vocabulary, every item for which the proportion of vocabulary labels belonging to the hypernym set, among all vocabulary labels of that item, is less than or equal to a preset classification threshold;

S760, acquiring the target word set according to the filtered extended vocabulary and the seed vocabulary.

Whether an extended vocabulary item belongs to the current vertical field can be judged by whether the proportion of its vocabulary labels that belong to the hypernym set is greater than the preset threshold. For example, suppose the preset threshold is 20% and the encyclopedia page of extended vocabulary item A has 3 labels, 1 of which is in the hypernym set. Since 1/3 is greater than 20%, the item reached by the jump is determined to be vocabulary of the field and is not filtered out. With a threshold of 20%, the extended vocabulary may saturate after roughly three rounds of jumping. The preset classification threshold can be set according to the specific vertical field and the breadth and verticality required of the knowledge graph.
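
The threshold test can be written in a few lines. The sketch below reproduces the worked example from the text (3 labels, 1 in the hypernym set, threshold 20%); the function names, the treatment of unlabeled words as out of field, and the sample labels are assumptions made for illustration.

```python
def hypernym_ratio(tags, hypernym_set):
    """Share of a word's labels that belong to the hypernym set."""
    tags = set(tags)
    if not tags:
        return 0.0  # assumption: a word without labels is treated as out of field
    return len(tags & hypernym_set) / len(tags)


def filter_extended(extended_tags, hypernym_set, threshold=0.2):
    """Keep only words whose ratio is strictly greater than the threshold.

    Words with a ratio less than or equal to the threshold are filtered out.
    """
    return {
        word
        for word, tags in extended_tags.items()
        if hypernym_ratio(tags, hypernym_set) > threshold
    }


# Hypothetical labels for three extended words.
hypernym_set = {"finance", "internet"}
extended_tags = {
    "smart contract": ["finance", "law", "software"],  # 1 of 3 labels match
    "football": ["sports", "leisure"],                 # 0 of 2 labels match
    "untagged word": [],                               # no labels at all
}
kept = filter_extended(extended_tags, hypernym_set, threshold=0.2)
```

Raising the threshold makes the graph more strictly vertical at the cost of breadth, which is the trade-off the text refers to when it says the threshold depends on the field.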

In one embodiment, S100 is preceded by:

acquiring the domain information corresponding to the knowledge graph to be constructed;

and, according to the domain information, crawling the seed vocabulary of the field to which the knowledge graph to be constructed belongs from the domain classification tree of an encyclopedia website platform, based on the Scrapy crawler framework and an XPath parsing library.

Structured seed vocabulary refers to seed vocabulary that exists in the form of structured data. Structured data can be represented and stored in a relational database such as MySQL, Oracle or SQL Server, and is expressed in two-dimensional (tabular) form, so that the corresponding information can be retrieved through its key. Its general characteristics are that data is organized in rows, each row represents the information of one entity, and every row has the same attributes. Structured data is stored and arranged very regularly, which facilitates operations such as querying and modification. The server obtains the seed vocabulary through web crawler technology. For example, the seed vocabulary of the vertical field to which the knowledge graph to be constructed belongs can be obtained from the classification tree of an encyclopedia website based on the Scrapy crawler framework and an XPath parsing library; in another embodiment, the seed vocabulary of that vertical field can also be obtained from the classification knowledge base of the encyclopedia website.
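
As an illustration of pulling seed vocabulary out of a classification tree with an XPath expression, the sketch below parses a small, well-formed snippet. In place of the crawler framework and XPath library mentioned above, it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` so that it stays self-contained. The markup, class names, URLs and the function `extract_seed_words` are hypothetical; a real encyclopedia page would have different structure and would normally need an HTML-tolerant parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup standing in for a domain classification tree.
CATEGORY_PAGE = """
<div id="category-tree">
  <ul>
    <li class="category"><a href="/cat/finance">Finance</a>
      <ul>
        <li class="term"><a href="/item/blockchain">blockchain</a></li>
        <li class="term"><a href="/item/smart-contract">smart contract</a></li>
      </ul>
    </li>
  </ul>
</div>
"""


def extract_seed_words(page_source):
    """Return the text of every link that sits inside a 'term' list item."""
    root = ET.fromstring(page_source.strip())
    # XPath: any <li class="term"> below the root, then its direct <a> child.
    anchors = root.findall(".//li[@class='term']/a")
    return [a.text.strip() for a in anchors if a.text]


seeds = extract_seed_words(CATEGORY_PAGE)
```

The category link ("Finance") is skipped because its list item does not carry the `term` class, so only leaf entries of the tree are returned as seed vocabulary.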

In one embodiment, S100 is preceded by:

searching for identical vocabulary in the seed vocabulary;

searching for synonymous vocabulary in the seed vocabulary through semantic dependency analysis;

and deduplicating the seed vocabulary according to the identical vocabulary and the synonymous vocabulary.

Because data crawled from different websites may be duplicated, the crawled seed vocabulary needs to be filtered. Since the vocabulary belongs to a vertical field and is highly specialized, synonymous vocabulary within the seed vocabulary can be identified through semantic analysis, grouped together and deduplicated. Identical seed vocabulary is likewise removed by deduplication, which can be performed on the vocabulary names of the seed vocabulary with a Python set operation.
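
The deduplication step can be sketched as follows. The semantic analysis that identifies synonyms is not implemented here; a hand-written synonym table (`SYNONYMS`) stands in for its output, and that table, the function name and the sample words are assumptions for illustration only.

```python
def deduplicate_seeds(words, synonym_of):
    """Remove identical and synonymous seed words, keeping first-seen order.

    words:      crawled seed words, possibly with repeats
    synonym_of: dict mapping a variant to the canonical form it is a synonym of
    """
    seen = set()  # canonical forms already kept (a Python set gives O(1) lookups)
    result = []
    for word in words:
        canonical = synonym_of.get(word, word)  # fold synonyms onto one form
        if canonical not in seen:
            seen.add(canonical)
            result.append(canonical)
    return result


# Hypothetical crawl result with one exact repeat and one synonym.
crawled = [
    "blockchain",
    "block chain",
    "smart contract",
    "blockchain",
    "distributed ledger",
]
SYNONYMS = {"block chain": "blockchain"}  # stand-in for semantic analysis output
unique = deduplicate_seeds(crawled, SYNONYMS)
```

When order does not matter and there are no synonyms to merge, `set(crawled)` alone removes the exact repeats.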

As shown in fig. 4, in one embodiment, S500 includes:

S520, acquiring the vocabulary labels of the seed vocabulary on the candidate webpages.

S540, when a vocabulary label is associated with a preset core vocabulary item, assigning the vocabulary label to the hypernym set.

Some core vocabulary of the field can be constructed in advance, and the preset core vocabulary is used to judge which of the labels of the seed vocabulary are hypernyms belonging to the current vertical field. That is, a label that has some connection with the core vocabulary can be regarded as a hypernym of the field, and unrelated labels are filtered out. The core vocabulary serves to keep the hypernyms within the scope of the vertical field. One specific implementation is to look up each vocabulary label and judge whether the concept explanation of that label contains a core vocabulary item. In addition, the hypernyms can be reviewed manually, or manual review can be combined with machine review to make the review of hypernyms more efficient.
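
The check of whether a label's concept explanation contains a core vocabulary item can be sketched as a substring test. The function name `build_hypernym_set`, the case-insensitive matching, and the sample labels, explanations and core words are assumptions for illustration; the text leaves the exact notion of association open.

```python
def build_hypernym_set(seed_tags, tag_explanations, core_words):
    """Keep the labels whose concept explanation mentions a core word.

    seed_tags:        dict mapping each seed word to its vocabulary labels
    tag_explanations: dict mapping a label to its concept explanation text
    core_words:       preset core vocabulary of the vertical field
    """
    hypernyms = set()
    for tags in seed_tags.values():
        for tag in tags:
            explanation = tag_explanations.get(tag, "").lower()
            # Assumption: "association" means the explanation contains a core word.
            if any(core.lower() in explanation for core in core_words):
                hypernyms.add(tag)
    return hypernyms


# Hypothetical labels and explanations for a finance-oriented graph.
seed_tags = {"blockchain": ["finance", "internet", "popular science"]}
tag_explanations = {
    "finance": "Finance covers money, banking, credit and investment.",
    "internet": "A global network of interconnected computers.",
    "popular science": "Communication of science to a general audience.",
}
core_words = ["banking", "investment"]
hypernym_set = build_hypernym_set(seed_tags, tag_explanations, core_words)
```

Labels that fail the test could still be queued for the manual review mentioned above rather than discarded outright.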

In one embodiment, the method for constructing a knowledge graph comprises the following steps: acquiring the domain information corresponding to the knowledge graph to be constructed; and, according to the domain information, crawling the seed vocabulary of the field to which the knowledge graph to be constructed belongs from the domain classification tree of an encyclopedia website platform, based on the Scrapy crawler framework and an XPath parsing library. Searching for identical vocabulary in the seed vocabulary; searching for synonymous vocabulary in the seed vocabulary through semantic dependency analysis; and deduplicating the seed vocabulary according to the identical vocabulary and the synonymous vocabulary. Crawling knowledge information of a plurality of candidate webpages according to the seed vocabulary, wherein the candidate webpages contain jump links related to the seed vocabulary, and the seed vocabulary is knowledge vocabulary in the field to which the knowledge graph to be constructed belongs; acquiring extended vocabulary according to the jump links; acquiring the vocabulary labels corresponding to the seed vocabulary; and, when a vocabulary label is associated with a preset core vocabulary item, assigning the vocabulary label to the hypernym set. Performing word filtering on the extended vocabulary according to the hypernym set, and acquiring a target word set according to the filtered extended vocabulary and the seed vocabulary; crawling knowledge information of a plurality of candidate webpages according to the filtered extended vocabulary, and acquiring iterative jump links related to the filtered extended vocabulary; acquiring iterative extended vocabulary according to the iterative jump links; and constructing a knowledge graph according to the hypernym set, the seed vocabulary in the updated target word set, the extended vocabulary, the jump links, the iterative extended vocabulary and the iterative jump links. Acquiring each vocabulary label corresponding to the extended vocabulary; filtering out, from the extended vocabulary, every item for which the proportion of vocabulary labels belonging to the hypernym set, among all vocabulary labels of that item, is less than or equal to the preset classification threshold; and acquiring the target word set according to the filtered extended vocabulary and the seed vocabulary. Taking the iterative extended vocabulary as the new filtered extended vocabulary, and returning to the step of crawling knowledge information of a plurality of candidate webpages according to the filtered extended vocabulary and acquiring the related iterative jump links, until all of the latest iterative extended vocabulary is filtered out by the hypernym set; and constructing a knowledge graph according to the hypernym set, the seed vocabulary in the updated target word set, the extended vocabulary, the jump links, the iterative extended vocabulary and the iterative jump links.

It should be understood that although the steps in the flow charts of fig. 2-4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise, the steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least some of the steps in fig. 2-4 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

As shown in fig. 5, a knowledge-graph constructing apparatus includes:

a vocabulary information acquisition module 100, configured to crawl knowledge information of a plurality of candidate web pages according to the seed vocabulary, wherein the candidate web pages comprise jump links related to the seed vocabulary, and the seed vocabulary consists of knowledge words in the domain to which the knowledge graph to be constructed belongs;

an extended vocabulary recognition module 300, configured to obtain an extended vocabulary according to the jump links;

an upper word set building module 500, configured to acquire the vocabulary tags of the seed vocabulary on the candidate web pages and construct a hypernym set (upper word set);

a word set filtering module 700, configured to perform word filtering on the extended vocabulary according to the hypernym set, and obtain a target word set according to the filtered extended vocabulary and the seed vocabulary;

and a graph building module 900, configured to construct a knowledge graph according to the hypernym set and the seed vocabulary, extended vocabulary and jump-link relationships in the target word set.
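
As a rough illustration of how modules 300 and 900 might cooperate, the sketch below pulls the anchor text of each link out of a page and keeps only the links whose two ends are both in the target word set. The page markup, the `links_to` relation name and all function names are assumptions made for this example, and the hypernym set is left out of the graph for brevity.

```python
from html.parser import HTMLParser


class JumpLinkParser(HTMLParser):
    """Collects the anchor text of every <a href=...> element on a page."""

    def __init__(self):
        super().__init__()
        self._in_link = False
        self.linked_terms = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href"):
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link and data.strip():
            self.linked_terms.append(data.strip())


def extended_vocabulary(page_html):
    """Return the words reached through the jump links of one page."""
    parser = JumpLinkParser()
    parser.feed(page_html)
    return parser.linked_terms


def build_graph(pages, target_words):
    """Return (source, 'links_to', target) triples between target words.

    `pages` maps a word to the markup of its page; 'links_to' is a
    hypothetical relation name, not one taken from the source.
    """
    edges = set()
    for word, page_html in pages.items():
        if word not in target_words:
            continue
        for term in extended_vocabulary(page_html):
            if term in target_words and term != word:
                edges.add((word, "links_to", term))
    return edges


# Hypothetical page for the seed word "bond".
pages = {
    "bond": '<p>A <a href="/item/coupon">coupon</a> is paid by the issuer; '
            'see also <a href="/item/tennis">tennis</a>.</p>',
}
edges = build_graph(pages, target_words={"bond", "coupon"})
```

Here "tennis" is extracted as an extended word but produces no edge, because it is assumed to have been removed by the word set filtering module.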

In one embodiment, the apparatus further comprises an iterative expansion module configured to: crawl knowledge information of a plurality of candidate web pages according to the filtered extended vocabulary, and acquire iterative jump links related to the filtered extended vocabulary; obtain an iterative extended vocabulary according to the iterative jump links; perform word filtering on the iterative extended vocabulary according to the hypernym set, and update the target word set according to the filtered iterative extended vocabulary; and take the iterative extended vocabulary as the new filtered extended vocabulary and return to the step of crawling knowledge information of a plurality of candidate web pages according to the filtered extended vocabulary and acquiring the related iterative jump links, until all of the latest iterative extended vocabulary is filtered out by the hypernym set. The graph building module 900 is further configured to construct the knowledge graph according to the hypernym set and the seed vocabulary, extended vocabulary, jump links, iterative extended vocabulary and iterative jump links in the updated target word set.
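
The iteration described above can be read as a breadth-first expansion that stops once a round yields no surviving words. The following is a minimal sketch under that reading; crawling is replaced by two in-memory dictionaries (`links` and `tags`), and these, the function names and the default threshold of 0.5 are all assumptions made for illustration.

```python
def hypernym_ratio(word, tags, hypernyms):
    """Share of a word's tags that belong to the hypernym set (0.0 if untagged)."""
    word_tags = tags.get(word, [])
    if not word_tags:
        return 0.0
    return sum(tag in hypernyms for tag in word_tags) / len(word_tags)


def expand(seeds, links, tags, hypernyms, threshold=0.5):
    """Iteratively follow jump links, keeping only words that pass the filter.

    Returns the target word set and the set of (source, target) jump links
    whose target is in the target word set.
    """
    target = set(seeds)
    jump_links = set()
    frontier = set(seeds)
    while frontier:  # stops when every newly found word was filtered out
        survivors = set()
        for word in sorted(frontier):
            for linked in links.get(word, []):
                if linked not in target:
                    if hypernym_ratio(linked, tags, hypernyms) <= threshold:
                        continue  # filtered out by the hypernym set
                    target.add(linked)
                    survivors.add(linked)
                jump_links.add((word, linked))
        frontier = survivors
    return target, jump_links


# Toy stand-ins for crawled pages: word -> linked words, word -> page tags.
links = {"bond": ["coupon", "tennis"], "coupon": ["yield", "bond"], "yield": []}
tags = {
    "coupon": ["finance", "securities"],
    "tennis": ["sport"],
    "yield": ["finance", "agriculture", "economics"],
}
hypernyms = {"finance", "securities", "economics"}

target, jump_links = expand(["bond"], links, tags, hypernyms)
```

With this toy data the loop runs three rounds: "coupon" survives the first, "yield" the second, and the third finds nothing new, so expansion ends.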

In one embodiment, the iterative expansion module is further configured to reconstruct the hypernym set according to the vocabulary tags, on the candidate web pages, of the seed vocabulary, the extended vocabulary and the iterative extended vocabulary in the target word set.

In one embodiment, the word set filtering module is configured to: acquire each vocabulary tag corresponding to each extended word; filter out every extended word for which the proportion of its vocabulary tags belonging to the hypernym set, among all of its vocabulary tags, is less than or equal to a preset classification threshold; and obtain the target word set according to the filtered extended vocabulary and the seed vocabulary.
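
The filtering rule can be written compactly as follows. The symbols are introduced here for illustration only: T(w) is the set of vocabulary tags of an extended word w, H is the hypernym set, and θ is the preset classification threshold.

```latex
r(w) = \frac{\lvert T(w) \cap H \rvert}{\lvert T(w) \rvert},
\qquad
w \text{ is filtered out if } r(w) \le \theta .
```

As a worked example with assumed values: if a word carries the tags {finance, agriculture, economics} and H contains finance and economics, then r(w) = 2/3 ≈ 0.67. With θ = 0.5 the word is kept, whereas a word whose only tag is sport has r(w) = 0 and is filtered out.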

In one embodiment, the apparatus further comprises a seed vocabulary acquisition module configured to: acquire domain information corresponding to the knowledge graph to be constructed; and, according to the domain information, crawl seed vocabulary of the domain to which the knowledge graph belongs from the domain classification tree of a third-party platform using the Scrapy crawler framework and an XPath parsing library.
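
The embodiment pairs a crawler framework with an XPath parsing library. The sketch below leaves the crawling out and shows only the XPath step, using the limited XPath subset of Python's standard `xml.etree.ElementTree` on an inline snippet. The markup, the `domain` and `class` attributes, and the function name are hypothetical; a real classification tree would need its own expressions.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment of a domain classification tree.
CATEGORY_PAGE = """
<div id="category-tree">
  <ul domain="finance">
    <li class="term"><a href="/item/bond">bond</a></li>
    <li class="term"><a href="/item/stock">stock</a></li>
  </ul>
  <ul domain="sports">
    <li class="term"><a href="/item/tennis">tennis</a></li>
  </ul>
</div>
"""


def crawl_seed_words(page_markup, domain):
    """Return the terms listed under the given domain, in document order."""
    root = ET.fromstring(page_markup.strip())
    # Select the links of every term entry under the requested domain.
    xpath = f".//ul[@domain='{domain}']/li[@class='term']/a"
    return [a.text.strip() for a in root.findall(xpath) if a.text]


seed_words = crawl_seed_words(CATEGORY_PAGE, "finance")
```

Real pages are rarely well-formed XML, so a tolerant HTML parser would normally stand in for `ElementTree` here.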

In one embodiment, the apparatus further comprises a seed vocabulary deduplication module configured to: find identical words in the seed vocabulary; find synonymous words in the seed vocabulary through semantic dependency analysis; and deduplicate the seed vocabulary according to the identical words and the synonymous words.
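
A minimal sketch of the deduplication step follows. It does not implement semantic dependency analysis; a hand-written synonym table (`synonym_of`, a hypothetical stand-in) plays the role of that analysis, mapping each synonym to the word that should represent it.

```python
def deduplicate_seeds(seeds, synonym_of):
    """Drop identical words and collapse synonyms, preserving first-seen order.

    `synonym_of` maps a word to its canonical form; in the source this
    information would come from semantic dependency analysis.
    """
    seen = set()
    unique = []
    for word in seeds:
        canonical = synonym_of.get(word, word)
        if canonical not in seen:
            seen.add(canonical)
            unique.append(canonical)
    return unique


seeds = ["bond", "stock", "bond", "equity", "debenture"]
synonym_of = {"equity": "stock", "debenture": "bond"}  # assumed synonym pairs
unique_seeds = deduplicate_seeds(seeds, synonym_of)
```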

In one embodiment, the upper word set building module is configured to: acquire the vocabulary tags corresponding to the seed vocabulary; and, when a vocabulary tag is associated with a preset core vocabulary, add that vocabulary tag to the hypernym set.
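
The source does not say how the association between a tag and the core vocabulary is tested. Purely as an assumption, the sketch below treats a tag as associated when it contains one of the core words as a substring; the data and the function name are likewise illustrative.

```python
def build_hypernym_set(seed_tags, core_words):
    """Collect the seed-word tags that are associated with a core word.

    Association is approximated by substring containment, which is an
    assumption of this sketch rather than the method of the source.
    """
    hypernyms = set()
    for tags in seed_tags.values():
        for tag in tags:
            if any(core in tag for core in core_words):
                hypernyms.add(tag)
    return hypernyms


# Hypothetical tags found on the pages of two seed words.
seed_tags = {
    "bond": ["financial instrument", "fixed income"],
    "stock": ["financial market", "company"],
}
core_words = {"financial", "income"}
hypernyms = build_hypernym_set(seed_tags, core_words)
```

The tag "company" is left out because it matches no core word, so it will not count in favour of an extended word during filtering.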

For specific limitations of the knowledge graph construction apparatus, reference may be made to the above limitations of the knowledge graph construction method, which are not repeated here. The modules in the knowledge graph construction apparatus may be implemented wholly or partially in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.

In one embodiment, a computer device is provided, which may be a server, and whose internal structure may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The database of the computer device is used for storing data related to the knowledge graph. The computer program, when executed by the processor, implements a knowledge graph construction method.
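
The source does not specify the database or its schema. As one possible arrangement, assumed here for illustration only, the graph could be stored as (head, relation, tail) triples in a single table; the sketch uses an in-memory SQLite database, and the table, column and function names are hypothetical.

```python
import sqlite3


def store_graph(conn, triples):
    """Persist (head, relation, tail) triples, ignoring exact duplicates."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS edge ("
        "head TEXT, relation TEXT, tail TEXT, "
        "UNIQUE (head, relation, tail))"
    )
    conn.executemany("INSERT OR IGNORE INTO edge VALUES (?, ?, ?)", triples)
    conn.commit()


def neighbours(conn, word):
    """Return the words that `word` links to, in alphabetical order."""
    rows = conn.execute(
        "SELECT tail FROM edge WHERE head = ? ORDER BY tail", (word,)
    )
    return [tail for (tail,) in rows]


conn = sqlite3.connect(":memory:")
store_graph(conn, [
    ("bond", "links_to", "coupon"),
    ("bond", "links_to", "yield"),
    ("bond", "links_to", "coupon"),  # duplicate, silently ignored
])
linked = neighbours(conn, "bond")
```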

Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program:

crawling knowledge information of a plurality of candidate web pages according to the seed vocabulary, wherein the candidate web pages comprise jump links related to the seed vocabulary, and the seed vocabulary consists of knowledge words in the domain to which the knowledge graph to be constructed belongs;

acquiring an extended vocabulary according to the jump links;

acquiring the vocabulary tags of the seed vocabulary on the candidate web pages, and constructing a hypernym set;

performing word filtering on the extended vocabulary according to the hypernym set, and obtaining a target word set according to the filtered extended vocabulary and the seed vocabulary;

and constructing a knowledge graph according to the hypernym set and the seed vocabulary, extended vocabulary and jump-link relationships in the target word set.

In one embodiment, the processor, when executing the computer program, further performs the steps of: crawling knowledge information of a plurality of candidate web pages according to the filtered extended vocabulary, and acquiring iterative jump links related to the filtered extended vocabulary; obtaining an iterative extended vocabulary according to the iterative jump links; performing word filtering on the iterative extended vocabulary according to the hypernym set, and updating the target word set according to the filtered iterative extended vocabulary; taking the iterative extended vocabulary as the new filtered extended vocabulary and returning to the step of crawling knowledge information of a plurality of candidate web pages according to the filtered extended vocabulary and acquiring the related iterative jump links, until all of the latest iterative extended vocabulary is filtered out by the hypernym set; and constructing the knowledge graph according to the hypernym set and the seed vocabulary, extended vocabulary, jump links, iterative extended vocabulary and iterative jump links in the updated target word set.

In one embodiment, the processor, when executing the computer program, further performs the step of: reconstructing the hypernym set according to the vocabulary tags, on the candidate web pages, of the seed vocabulary, the extended vocabulary and the iterative extended vocabulary in the target word set.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring each vocabulary tag corresponding to each extended word; filtering out every extended word for which the proportion of its vocabulary tags belonging to the hypernym set, among all of its vocabulary tags, is less than or equal to a preset classification threshold; and obtaining the target word set according to the filtered extended vocabulary and the seed vocabulary.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring domain information corresponding to the knowledge graph to be constructed; and, according to the domain information, crawling seed vocabulary of the domain to which the knowledge graph belongs from the domain classification tree of a third-party platform using the Scrapy crawler framework and an XPath parsing library.

In one embodiment, the processor, when executing the computer program, further performs the steps of: finding identical words in the seed vocabulary; finding synonymous words in the seed vocabulary through semantic dependency analysis; and deduplicating the seed vocabulary according to the identical words and the synonymous words.

In one embodiment, the processor, when executing the computer program, further performs the steps of: acquiring the vocabulary tags corresponding to the seed vocabulary; and, when a vocabulary tag is associated with a preset core vocabulary, adding that vocabulary tag to the hypernym set.

In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of:

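crawling knowledge information of a plurality of candidate web pages according to the seed vocabulary, wherein the candidate web pages comprise jump links related to the seed vocabulary, and the seed vocabulary consists of knowledge words in the domain to which the knowledge graph to be constructed belongs;

acquiring an extended vocabulary according to the jump links;

acquiring the vocabulary tags of the seed vocabulary on the candidate web pages, and constructing a hypernym set;

performing word filtering on the extended vocabulary according to the hypernym set, and obtaining a target word set according to the filtered extended vocabulary and the seed vocabulary;

and constructing a knowledge graph according to the hypernym set and the seed vocabulary, extended vocabulary and jump-link relationships in the target word set.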

In one embodiment, the computer program, when executed by the processor, further performs the steps of: crawling knowledge information of a plurality of candidate web pages according to the filtered extended vocabulary, and acquiring iterative jump links related to the filtered extended vocabulary; obtaining an iterative extended vocabulary according to the iterative jump links; performing word filtering on the iterative extended vocabulary according to the hypernym set, and updating the target word set according to the filtered iterative extended vocabulary; taking the iterative extended vocabulary as the new filtered extended vocabulary and returning to the step of crawling knowledge information of a plurality of candidate web pages according to the filtered extended vocabulary and acquiring the related iterative jump links, until all of the latest iterative extended vocabulary is filtered out by the hypernym set; and constructing the knowledge graph according to the hypernym set and the seed vocabulary, extended vocabulary, jump links, iterative extended vocabulary and iterative jump links in the updated target word set.

In one embodiment, the computer program, when executed by the processor, further performs the step of: reconstructing the hypernym set according to the vocabulary tags, on the candidate web pages, of the seed vocabulary, the extended vocabulary and the iterative extended vocabulary in the target word set.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring each vocabulary tag corresponding to each extended word; filtering out every extended word for which the proportion of its vocabulary tags belonging to the hypernym set, among all of its vocabulary tags, is less than or equal to a preset classification threshold; and obtaining the target word set according to the filtered extended vocabulary and the seed vocabulary.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring domain information corresponding to the knowledge graph to be constructed; and, according to the domain information, crawling seed vocabulary of the domain to which the knowledge graph belongs from the domain classification tree of a third-party platform using the Scrapy crawler framework and an XPath parsing library.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: finding identical words in the seed vocabulary; finding synonymous words in the seed vocabulary through semantic dependency analysis; and deduplicating the seed vocabulary according to the identical words and the synonymous words.

In one embodiment, the computer program, when executed by the processor, further performs the steps of: acquiring the vocabulary tags corresponding to the seed vocabulary; and, when a vocabulary tag is associated with a preset core vocabulary, adding that vocabulary tag to the hypernym set.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of the present application. Their description is relatively specific and detailed, but should not be construed as limiting the scope of the patent. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the scope of protection of the present patent shall be subject to the appended claims.

Claims

1. A method of knowledge-graph construction, the method comprising:

acquiring an extended vocabulary according to the jump link;

2. The method of claim 1, wherein before constructing the knowledge graph according to the superordinate word set, the seed words in the target word set, the extended words and the jump link relationship, further comprising:

3. The method of claim 2, wherein, before performing word filtering on the iterative extended vocabulary according to the hypernym set and updating the target word set according to the filtered iterative extended vocabulary, the method further comprises:

4. The method of claim 1, wherein performing word filtering on the extended vocabulary according to the hypernym set, and obtaining the target word set according to the filtered extended vocabulary and the seed vocabulary, comprises:

obtaining each vocabulary tag corresponding to each extended word;

filtering out every extended word for which the proportion of its vocabulary tags belonging to the hypernym set, among all of its vocabulary tags, is less than or equal to a preset classification threshold;

5. The method of claim 1, wherein crawling knowledge information of a plurality of candidate web pages according to a seed vocabulary, the candidate web pages including jump links related to the seed vocabulary, further comprises:

6. The method of claim 5, wherein prior to crawling the knowledge information of the plurality of candidate web pages according to the seed vocabulary, the method further comprises:

searching the same vocabulary in the seed vocabulary;

7. The method of claim 1, wherein the obtaining of the vocabulary tags of the seed vocabulary in the to-be-selected web page and the constructing of the upper vocabulary set comprise:

acquiring a vocabulary label corresponding to the seed vocabulary;

8. An apparatus for knowledge-graph construction, the apparatus comprising:

9. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 7 when executing the computer program.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.