CN111143448A

CN111143448A - Knowledge base construction method

Info

Publication number: CN111143448A
Application number: CN201911221375.XA
Authority: CN
Inventors: 孙晓光; 刘为民; 张利达
Original assignee: Beijing Borui Tongyun Technology Co Ltd
Current assignee: Beijing Borui Tongyun Technology Co Ltd
Priority date: 2019-12-03
Filing date: 2019-12-03
Publication date: 2020-05-12
Anticipated expiration: 2039-12-03
Also published as: CN111143448B

Abstract

The invention relates to a knowledge base construction method, comprising the following steps: determining a data source; acquiring ontology data according to the data source; generating data to be processed according to the ontology data; determining whether the data to be processed is semi-structured data; if the data to be processed is not semi-structured data, extracting knowledge information from the data to be processed to obtain result data; if the data to be processed is semi-structured data, performing unified processing on the data to be processed to obtain uniformly processed data, and extracting knowledge information from the uniformly processed data to obtain result data; and adding the result data to the knowledge base.

Description

Knowledge base construction method

Technical Field

The invention relates to the technical field of data processing, in particular to a knowledge base construction method.

Background

Besides accurately resolving the user's question, a good voice interaction system needs the support of a large amount of underlying data. For the expression of health knowledge, a relational database is a poor fit: the various relationships between health concepts are difficult to define in a relational schema, and in particular those relationships may need to be maintained dynamically, which a fixed table structure can hardly accommodate.

Disclosure of Invention

The invention aims to provide a knowledge base construction method aiming at the defects of the prior art, so that the extraction of knowledge information can be realized no matter whether the data is semi-structured data or not.

In order to achieve the above object, the present invention provides a knowledge base construction method, including:

determining a data source;

acquiring ontology data according to the data source;

generating data to be processed according to the ontology data;

determining whether the data to be processed is semi-structured data;

if the data to be processed is not the semi-structured data, extracting knowledge information of the data to be processed to obtain result data;

if the data to be processed is the semi-structured data, performing unified processing on the data to be processed to obtain uniformly processed data;

extracting the knowledge information from the uniformly processed data to obtain result data;

and adding the result data to a knowledge base.
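The branching described in the steps above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function `build_knowledge_base`, the `Record` type and the three callables passed to it are hypothetical names introduced here, and the toy lambdas merely stand in for the real detection, unification and extraction steps.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def build_knowledge_base(
    records: List[Record],
    is_semi_structured: Callable[[Record], bool],
    unify: Callable[[Record], Record],
    extract: Callable[[Record], Record],
) -> List[Record]:
    """Route each record through the two branches of the method."""
    knowledge_base: List[Record] = []
    for record in records:
        if is_semi_structured(record):
            # Semi-structured data is unified before extraction.
            record = unify(record)
        # Both branches end with knowledge information extraction.
        knowledge_base.append(extract(record))
    return knowledge_base


# Toy stand-ins for the real steps (all hypothetical).
records = [
    {"name": "influenza", "Symptoms": "fever"},
    {"text": "patient reports cough and fever"},
]
kb = build_knowledge_base(
    records,
    is_semi_structured=lambda r: "text" not in r,
    unify=lambda r: {k.lower(): v for k, v in r.items()},
    extract=lambda r: dict(r, extracted="yes"),
)
```

Passing the three steps in as callables keeps the routing logic separate from whatever concrete unification and extraction procedures are used.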

Preferably, before the acquiring of the ontology data, the method further comprises:

acquiring research information;

the research information comprises: the domain of the ontology, the possibility of reusing existing ontologies, the terms of the ontology, the class hierarchy, the class attributes, and the constraint information of the attributes;

and constructing the ontology data according to the research information.

Preferably, the unified processing specifically includes:

determining whether an instance with the same name as the current data to be processed exists;

if an instance with the same name as the current data to be processed exists, performing attribute rule mapping between the current data to be processed and that instance, and adding the current data to be processed to an instance library;

if no instance with the same name as the current data to be processed exists, determining whether the attribute vector space of the current data to be processed corresponds to the instance library;

if the attribute vector space of the current data to be processed corresponds to the instance library, adding the name of the current data to be processed, and adding the current data to be processed to the instance library;

and if the attribute vector space of the current data to be processed does not correspond to the instance library, adding the current data to be processed to the instance library.

Further preferably, the determining whether the attribute vector space of the current data to be processed corresponds to the instance base specifically includes:

setting the attribute name of the current data to be processed as a keyword;

establishing a keyword vector of the keyword;

and calculating to obtain the cosine value of the included angle of the keyword vector to determine whether the attribute vector space of the current data to be processed corresponds to the instance library.

Preferably, the extracting knowledge information of the data to be processed specifically includes:

and extracting knowledge information through a learning model.

Further preferably, the extracting knowledge information through the learning model specifically includes:

class labeling is carried out on the training data set to obtain labeling information;

determining keywords through a log-likelihood ratio algorithm according to the labeling information;

determining the combination of the keywords through an FP-Growth algorithm, and generating a matching rule pattern string according to a combination result;

and extracting knowledge information of the data to be processed through a regular template according to the matching rule pattern string.

Further preferably, the determining the keyword according to the labeling information by using a log-likelihood ratio algorithm specifically includes:

determining, in the training data set, a first sample number of samples that contain the target feature, a second sample number of samples that do not contain the target feature, and a third sample number of samples that contain both the target feature and the target word;

carrying out maximum likelihood calculation on the first sample number, the second sample number and the third sample number to obtain a first likelihood function and a second likelihood function;

obtaining a final likelihood ratio according to the first likelihood function and the second likelihood function;

and determining the keywords according to the final likelihood ratio.

Further preferably, the determining, by using the FP-Growth algorithm, the combination of the keywords specifically is:

constructing an FP tree for the keywords;

mining a frequent item combination according to the FP tree;

calculating the Kulc and IR values of the frequent term combinations,

and if frequent items with Kulc values larger than a first threshold value and IR values smaller than a second threshold value exist, screening the frequent items, and determining the combination of the keywords according to the screening result.

Further preferably, the building of the FP-tree for the keyword specifically includes:

determining keywords through a log-likelihood ratio algorithm according to the labeling information in a traversal mode to obtain a complete set of the keywords and the support degree of each keyword;

obtaining a frequent item list according to the complete set and the support degree of the keywords;

sequencing each keyword according to the frequent item list to obtain frequent items;

determining whether a node in a constructed FP tree is the same as the frequent item and whether a prefix of the node in the constructed FP tree is the same as the prefix of the frequent item;

if the node in the constructed FP tree is the same as a frequent item and the prefix of the node in the constructed FP tree is the same as the prefix of the frequent item, the count of the node in the constructed FP tree is increased by one;

if the node in the constructed FP tree is different from the frequent item, or the prefix of the node in the constructed FP tree is different from the prefix of the frequent item, generating a new node and inserting it at the tail of the linked list of the current frequent item.

Further preferably, the mining of the frequent item combination according to the FP tree specifically is:

acquiring a conditional pattern base of each frequent item according to the head pointer list;

generating a conditional tree according to the conditional pattern base of the frequent item;

and recursively searching for frequent item sets through the conditional tree to obtain the frequent item combination.

According to the knowledge base construction method provided by the embodiment of the invention, the semi-structured data and the non-semi-structured data are processed in a distinguishing manner, so that the extraction of knowledge information can be realized no matter whether the data is semi-structured data or not, and the accuracy of the construction of the knowledge base in the health field is improved.

Drawings

FIG. 1 is a flow chart of a knowledge base construction method provided by an embodiment of the invention;

FIG. 2 is a flowchart of a method for performing unified processing on data to be processed according to an embodiment of the present invention;

FIG. 3 is a flowchart of a method for extracting knowledge information from data to be processed according to an embodiment of the present invention;

FIG. 4 is a flowchart of a method for building an FP tree for keywords according to an embodiment of the present invention;

FIG. 5 is a flowchart of a method for mining frequent item combinations according to the FP tree according to an embodiment of the present invention.

Detailed Description

The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.

The embodiment of the invention provides a knowledge base construction method, which is applied to a server. For a better understanding of the contents of the present application, the following first explains definitions of key terms referred to in the present application:

Ontology: gives the basic terms and relations that make up the vocabulary of a subject area, together with the rules for combining those terms and relations to define extensions of the vocabulary.

Knowledge base: a system, built so that a large amount of knowledge can be used and managed conveniently and efficiently, in which human knowledge is stored in a computer in some form of representation.

Entity alignment: also called entity matching or entity resolution; the process of determining whether two entities, from the same data set or from different data sets, refer to the same thing in the real world.

Support: the frequency with which the antecedent and the consequent appear together in a data set, expressed as a percentage.

As shown in fig. 1, the knowledge base construction method according to the embodiment of the present application includes:

step 101, a server determines a data source;

specifically, the diversity of data sources ensures that the finally constructed health knowledge has both breadth and depth. Because knowledge in the health field covers a wide range, multiple data sources are all the more necessary for data acquisition.

In a specific example, the data sources in the present application include the following categories:

Portal websites of authoritative departments, such as the Chinese Center for Disease Control and Prevention and the Chinese Medical Association, from which authoritative interpretations of health specifications and ontology classes can be extracted.

Well-known health websites, whose health information is rich and reflects health ontology classes, attributes and relationships well; their data is expressed as semi-structured data.

Health-related information websites, such as online cases and online inquiries, which serve as a supplementary resource for the ontology.

Step 102, acquiring ontology data according to the data source;

specifically, target data is captured by crawler technology according to the characteristics of the target websites in the data source, and the target data is added to a temporary processing result library as ontology data.

Step 103, generating data to be processed according to the ontology data;

specifically, the ontology data is extracted from the temporary processing result library and compiled to generate the data to be processed.

In a specific example, the server writes the ontology data using the seven-step method proposed by Stanford University: it acquires research information, including the domain of the ontology, the possibility of reusing existing ontologies, the terms of the ontology, the class hierarchy, the class attributes and the constraint information of the attributes, and constructs the ontology data according to the research information.

The process of acquiring the ontology data can be performed by software. Preferably, Protégé is used. Compared with similar systems, its advantages are: a graphical interface that allows visual editing, support for multiple ontology languages, a user-friendly Chinese plug-in, support for Unicode character input, and so on.

In a specific embodiment, the present application adopts the Web Ontology Language (OWL) as the language in which the acquired ontology data is expressed. OWL is an ontology description language recommended by the W3C. It has strong expressive power and describes domain knowledge in an object-oriented manner, that is, it describes objects by classes and attributes, and describes the features and relationships of those classes and attributes by axioms.

Step 104, determining whether the data to be processed is semi-structured data;

specifically, knowledge expression is generally divided into four levels: L1: unstructured; L2: semi-structured; L3: structured, that is, the data elements are structured and carry semantics, and their expression logic can be read by a computer, for example XML or UML; L4: executable, that is, the computer can reason directly from the definitions in the knowledge base. Knowledge expressions that a computer can interpret must be at level L3 or L4.

If the data to be processed is semi-structured data, step 105 is executed. If the data to be processed is not semi-structured data, step 106 is executed.

Step 105, performing unified processing on the data to be processed;

specifically, in an example, as shown in FIG. 2, performing the unified processing on the data to be processed may include:

Step 201, determining whether an instance with the same name as the current data to be processed exists;

specifically, if such an instance exists, step 202 is executed; if not, step 203 is executed.

Step 202, performing attribute rule mapping between the current data to be processed and the current instance, and adding the current data to be processed to an instance library;

specifically, because the data to be processed may come from multiple data sources, there may be several records that describe the same thing with different attribute expressions. Treating these records as the same thing within the system is entity alignment. The attribute rule mapping between the current data to be processed and the current instance in the present application can be understood as an entity alignment process.

Step 203, determining whether the attribute vector space of the current data to be processed corresponds to the instance base;

specifically, the server sets the attribute names of the current data to be processed as keywords, and then establishes a keyword vector of the keywords, giving formula 1:

$$S = \{w_1, w_2, \ldots, w_n\} \quad \text{(formula 1)}$$

where S represents the unstructured-text keyword vector of an entity and $w_1, \ldots, w_n$ are its keywords.

Then the cosine of the angle between keyword vectors is calculated according to formula 1 and formula 2, and whether the attribute vector space of the current data to be processed corresponds to the instance library is determined from that cosine value:

$$\mathrm{sim}(S_1, S_2) = \cos(S_1, S_2) \quad \text{(formula 2)}$$

where $S_1$ represents the unstructured-text keyword vector of entity 1, $S_2$ represents the unstructured-text keyword vector of entity 2, and $\cos(S_1, S_2)$ is the cosine of the angle between the two vectors.

If the attribute vector space of the current data to be processed corresponds to the instance base, executing step 204; if the attribute vector space of the current data to be processed does not correspond to the instance base, step 205 is executed.
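As an illustration of the cosine comparison in step 203, the following sketch treats each keyword vector as a bag of attribute names. The threshold `SIMILARITY_THRESHOLD` and the sample attribute names are assumptions made for this example; the text does not state which cut-off decides that two vectors "correspond".

```python
import math
from collections import Counter
from typing import Iterable


def cosine_similarity(keywords_a: Iterable[str], keywords_b: Iterable[str]) -> float:
    """Cosine of the angle between two bag-of-keywords vectors."""
    vec_a, vec_b = Counter(keywords_a), Counter(keywords_b)
    dot = sum(vec_a[k] * vec_b[k] for k in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as matching nothing
    return dot / (norm_a * norm_b)


SIMILARITY_THRESHOLD = 0.8  # hypothetical cut-off, not given in the text

s1 = ["symptom", "cause", "treatment", "prevention"]
s2 = ["symptom", "cause", "treatment", "diet"]
similarity = cosine_similarity(s1, s2)          # 3 shared of 4 keywords each
corresponds = similarity >= SIMILARITY_THRESHOLD
```

With three of four attribute names shared, the similarity is 0.75, which falls below the assumed threshold, so this record would be treated as not corresponding to the existing instance.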

Step 204, adding the name of the current data to be processed, and adding the current data to be processed to an instance library;

specifically, if the attribute vector space of the current data to be processed corresponds to the instance base, after the name of the current data to be processed is added, attribute rule mapping is performed on the current data to be processed and the current instance, and the current data to be processed is added to the instance base. At this time, the current data to be processed is the data after the unified processing.

Step 205, adding the current data to be processed to an instance library;

specifically, at this time, the current data to be processed is the data after the unified processing.
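Steps 201 to 205 can be sketched as a single routine. This is only an illustration under stated assumptions: `merge_attributes` is a hypothetical stand-in for the attribute rule mapping, `match_by_shared_keys` is a simplified stand-in for the attribute vector space check, and the `aliases` dictionary is one possible way to record an added name.

```python
from typing import Callable, Dict, Optional

Attributes = Dict[str, str]
Library = Dict[str, Attributes]


def merge_attributes(existing: Attributes, new: Attributes) -> None:
    """Hypothetical attribute rule mapping: keep existing values, add new keys."""
    for key, value in new.items():
        existing.setdefault(key, value)


def match_by_shared_keys(attributes: Attributes, library: Library) -> Optional[str]:
    """Simplified stand-in for the attribute vector space check of step 203."""
    for name, existing in library.items():
        if len(attributes.keys() & existing.keys()) >= 2:
            return name
    return None


def unify(
    name: str,
    attributes: Attributes,
    library: Library,
    aliases: Dict[str, str],
    find_match: Callable[[Attributes, Library], Optional[str]] = match_by_shared_keys,
) -> str:
    if name in library:                        # step 201 -> step 202
        merge_attributes(library[name], attributes)
        return "mapped"
    match = find_match(attributes, library)    # step 203
    if match is not None:                      # step 204: record the new name
        aliases[name] = match
        merge_attributes(library[match], attributes)
        return "aliased"
    library[name] = dict(attributes)           # step 205: new instance
    return "added"


library: Library = {"influenza": {"symptom": "fever", "cause": "virus"}}
aliases: Dict[str, str] = {}
r1 = unify("influenza", {"treatment": "rest"}, library, aliases)
r2 = unify("flu", {"symptom": "fever", "cause": "virus", "season": "winter"},
           library, aliases)
r3 = unify("sprain", {"location": "ankle"}, library, aliases)
```

In this toy run the first record maps onto the existing instance by name, the second is recognised as another name for it, and the third becomes a new instance.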

Step 106, extracting knowledge information of the data to be processed;

specifically, knowledge information extraction targets unstructured data, since such data is mainly used to discover new knowledge from unstructured health sources such as online inquiries and electronic medical records. The entities and relations are hidden in the text, so this part of the data is processed by a machine learning method, that is, knowledge information is extracted by a learning model.

In one example, as shown in fig. 3, the extracting knowledge information from the data to be processed may include:

Step 301, performing class labeling on a training data set to obtain labeling information;

step 302, determining keywords through a log-likelihood ratio algorithm according to the labeling information;

in particular, log-likelihood ratio is a hypothesis testing method used to compare the degree of fit of two models to determine which of the models being compared is more reliable on the current data.

The server determines, in the training data set, a first sample number of samples containing the target feature, a second sample number of samples not containing the target feature, and a third sample number of samples containing both the target feature and the target word. It performs maximum likelihood calculation on the first, second and third sample numbers to obtain a first likelihood function (namely for the sample data in the training samples that contains the current keyword) and a second likelihood function (namely for the samples that do not contain the current feature), obtains a final likelihood ratio from the first likelihood function and the second likelihood function, and finally determines the keywords according to the final likelihood ratio.

In a particular embodiment of the present invention, two hypotheses are made first. Hypothesis 1 states that the probability of the word w occurring together with the feature F is the same as the probability of the word w occurring without the feature F, that is, the word w and the feature F are not correlated. Hypothesis 2 states that the probability of the word w occurring together with the feature F differs from the probability of the word w occurring without the feature F, that is, the occurrence of F affects the probability that w occurs, so the two are correlated. For example, if statistics show that 'Ma xi' tends to appear along with 'Ma ya', and the probability of 'Ma xi' and 'Ma ya' appearing together differs from the probability of 'Ma xi' appearing alone, hypothesis 2 is satisfied. As another example, if the probability of "weather" and "football" appearing together is the same as the probability of "weather" appearing alone, hypothesis 1 is satisfied: "football" does not affect the statistics of "weather", and there is no correlation between the two. Hypothesis 1 is as in formula 1:

$$H_1:\ P(w \mid F) = p = P(w \mid \neg F) \quad \text{(formula 1)}$$

Hypothesis 2 is as in formula 2:

$$H_2:\ P(w \mid F) = p_1 \neq p_2 = P(w \mid \neg F) \quad \text{(formula 2)}$$

Hypothesis 2 is the core of the keyword extraction algorithm: a word w does not occur at random with respect to the feature F. Here a feature is understood as a word or term that affects the probability of the word w, such as the feature "jazz" in the above example.

According to the analysis of the training corpus, the following results can be obtained:

$$c_1 = N(F), \quad c_2 = N(w), \quad c_{1,2} = N(w \wedge F) \quad \text{(formula 3)}$$

In the above expression, c represents the number of occurrences, c1 represents the number of occurrences of the feature F, c2 represents the number of occurrences of the word w, and c1,2 represents the number of occurrences of both the word w and the feature F, so that the probability can be calculated by counting the number of occurrences. N (F) and

the first and second sample numbers are respectively represented, and N (w ^ F) represents the third sample number.

Carrying out maximum likelihood estimation on p, p1 and p2 gives:

$$p = \frac{c_2}{N}, \quad p_1 = \frac{c_{1,2}}{c_1}, \quad p_2 = \frac{c_2 - c_{1,2}}{N - c_1} \quad \text{(formula 4)}$$

The binomial distribution is known to be:

$$b(k; n, x) = \binom{n}{k} x^k (1 - x)^{n-k} \quad \text{(formula 5)}$$

Formula 5 gives the distribution with the two parameters k and n and the probability x, from which the likelihood function of hypothesis 1, i.e. the first likelihood function, can be obtained:

$$L(H_1) = b(c_{1,2}; c_1, p)\, b(c_2 - c_{1,2}; N - c_1, p) \quad \text{(formula 6)}$$

The likelihood function of hypothesis 2, i.e. the second likelihood function, is:

$$L(H_2) = b(c_{1,2}; c_1, p_1)\, b(c_2 - c_{1,2}; N - c_1, p_2) \quad \text{(formula 7)}$$

The final likelihood ratio is defined as follows:

$$\lambda = \frac{L(H_1)}{L(H_2)} \quad \text{(formula 8)}$$

Wilks has shown that, when the sample is large enough, $-2\log\lambda$ follows a $\chi^2$ distribution. Substituting the above expressions and simplifying finally yields the log-likelihood ratio of the concept and the word:

When the word w appears more frequently outside the feature F, the feature F discriminates less, and the word is less likely to be selected as a keyword. To obtain a keyword set that is as pure as possible, a parameter ε (ε > 1) is introduced to modify the formula, giving:

The larger the value of L(w, F), the higher the weight of the word within the feature, and the more suitable the word is as a keyword of that feature.
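The log-likelihood ratio score can be sketched as follows. This example assumes the standard binomial formulation behind formulas 6 and 7, with the maximum likelihood estimates p = c2/N, p1 = c1,2/c1 and p2 = (c2 − c1,2)/(N − c1); these estimates and the function names are assumptions of the sketch. The ε modification mentioned above is not reproduced, because its exact form is not given in the text.

```python
import math


def log_binomial(k: int, n: int, x: float) -> float:
    """log of x**k * (1 - x)**(n - k); the binomial coefficient cancels in the ratio."""
    eps = 1e-12
    x = min(max(x, eps), 1 - eps)  # avoid log(0) at the boundaries
    return k * math.log(x) + (n - k) * math.log(1 - x)


def log_likelihood_ratio(c1: int, c2: int, c12: int, n: int) -> float:
    """Return -2 log(lambda) for word w and feature F.

    c1: samples containing F, c2: samples containing w,
    c12: samples containing both, n: total samples (requires 0 < c1 < n).
    """
    p = c2 / n
    p1 = c12 / c1
    p2 = (c2 - c12) / (n - c1)
    log_lambda = (
        log_binomial(c12, c1, p)
        + log_binomial(c2 - c12, n - c1, p)
        - log_binomial(c12, c1, p1)
        - log_binomial(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda


# w occurs at the same rate with and without F: no association.
independent = log_likelihood_ratio(c1=100, c2=50, c12=5, n=1000)
# w occurs far more often together with F: strong association.
associated = log_likelihood_ratio(c1=100, c2=50, c12=40, n=1000)
```

A word whose score is high for a given feature is a candidate keyword for that feature; a score near zero indicates that the word and the feature behave as in hypothesis 1.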

Step 303, determining the combination of the keywords through an FP-Growth algorithm, and generating a matching rule pattern string according to a combination result;

specifically, the server constructs an FP tree for the keywords, mines frequent item combinations according to the FP tree, calculates a Kulc value and an IR value of the frequent item combinations, screens the frequent items if the Kulc value is larger than a first threshold value and the IR value is smaller than a second threshold value, and determines the keyword combinations according to the screening results.

Kulc is defined as follows:

$$\mathrm{Kulc}(A, B) = \frac{1}{2}\left(P(A \mid B) + P(B \mid A)\right) \quad \text{(formula 11)}$$

Wherein, A and B are understood to represent two frequent terms (frequently-occurring keywords), the expression is used for representing whether the two frequently-occurring keywords are related, the parameter value range is 0-1, and the larger the value is, the larger the relevance is represented.

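The definition of IR is as follows:

$$\mathrm{IR}(A, B) = \frac{|\sup(A) - \sup(B)|}{\sup(A) + \sup(B) - \sup(A \cup B)} \quad \text{(formula 12)}$$

When the two directions of implication are the same, IR is 0; the larger the difference between the two, the larger the imbalance ratio.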

Preferably, the first threshold is 0.5; the second threshold value is 0.1.
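The screening by Kulc and IR can be illustrated with support counts. In this sketch the conditional probabilities of formula 11 are computed as P(A | B) = sup(AB)/sup(B), and the imbalance ratio is assumed to take its common form |sup(A) − sup(B)| / (sup(A) + sup(B) − sup(AB)); the keyword pairs and counts are invented for the example.

```python
from typing import Dict, List, Tuple

FIRST_THRESHOLD = 0.5   # minimum Kulc value, as stated above
SECOND_THRESHOLD = 0.1  # maximum IR value, as stated above


def kulc(sup_a: int, sup_b: int, sup_ab: int) -> float:
    """Average of the two conditional probabilities P(A|B) and P(B|A)."""
    return 0.5 * (sup_ab / sup_b + sup_ab / sup_a)


def imbalance_ratio(sup_a: int, sup_b: int, sup_ab: int) -> float:
    """Assumed common definition of the imbalance ratio."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


def screen(pairs: Dict[Tuple[str, str], Tuple[int, int, int]]) -> List[Tuple[str, str]]:
    """Keep pairs with Kulc above the first and IR below the second threshold."""
    return [
        pair
        for pair, (sup_a, sup_b, sup_ab) in pairs.items()
        if kulc(sup_a, sup_b, sup_ab) > FIRST_THRESHOLD
        and imbalance_ratio(sup_a, sup_b, sup_ab) < SECOND_THRESHOLD
    ]


# (sup(A), sup(B), sup(AB)) for two hypothetical keyword pairs.
candidates = {
    ("fever", "cough"): (100, 105, 90),
    ("fever", "rash"): (100, 20, 18),
}
kept = screen(candidates)
```

The second pair passes the Kulc test (0.54) but is rejected because its supports are strongly imbalanced, which is the case the IR threshold is meant to catch.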

In a specific example, as shown in fig. 4, constructing the FP-tree for the keyword may include:

step 401, determining keywords through a log-likelihood ratio algorithm according to the labeling information in a traversal manner to obtain a complete set of the keywords and the support degree of each keyword;

step 402, obtaining a frequent item list according to the support degree of the complete set and the keywords;

step 403, sorting each keyword according to the frequent item list to obtain frequent items;

step 404, determining whether the nodes in the constructed FP tree are the same as the frequent items and whether the prefixes of the nodes in the constructed FP tree are the same as the prefixes of the frequent items;

specifically, if the node in the constructed FP tree is the same as the frequent item, and the prefix of the node in the constructed FP tree is the same as the prefix of the frequent item, step 405 is executed; if the node in the constructed FP tree is different from the frequent item, or the prefix of the node in the constructed FP tree is different from the prefix of the frequent item, step 406 is executed.

Before this step, a FP-tree root node may also be created first, with a null value.

Step 405, adding one to the node count in the constructed FP tree;

step 406, generating a new node, and inserting the new node into the tail of the linked list with the current frequent item;

step 407, establishing an additional head pointer list;

specifically, while constructing the FP tree, an additional head pointer list needs to be established, each frequent item is recorded, and sorted according to the support degree, and the position of each item in the tree is recorded through a linked list.
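Steps 401 to 407 can be sketched as follows. This is a minimal illustration of FP tree construction, assuming each training sample has been reduced to a list of keywords; the class `FPNode`, the function `build_fp_tree` and the sample data are names and values chosen for this example, and the head pointer list is modelled as a dictionary of node lists rather than a linked list.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple


class FPNode:
    def __init__(self, item: Optional[str], parent: Optional["FPNode"]):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children: Dict[str, "FPNode"] = {}


def build_fp_tree(
    transactions: List[List[str]], min_support: int
) -> Tuple[FPNode, Dict[str, List[FPNode]], Dict[str, int]]:
    # First pass: support of every keyword, then keep the frequent ones.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: s for i, s in support.items() if s >= min_support}
    root = FPNode(None, None)  # root node with a null value
    header: Dict[str, List[FPNode]] = defaultdict(list)  # head pointer list
    for t in transactions:
        # Sort the frequent items by descending support (ties broken by name).
        items = sorted((i for i in set(t) if i in frequent),
                       key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                # No node with this item and prefix: create one and
                # append it to the item's list in the head pointer list.
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1  # existing node: count + 1; new node: starts at 1
            node = child
    return root, header, frequent


samples = [
    ["fever", "cough"],
    ["fever", "cough", "headache"],
    ["fever", "rash"],
    ["cough"],
]
root, header, frequent = build_fp_tree(samples, min_support=2)
```

Here "headache" and "rash" fall below the minimum support and never enter the tree, while "fever" ends up on two branches that the head pointer list ties together.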

In a specific example, as shown in fig. 5, mining frequent item combinations according to the FP-tree may specifically be:

step 501, acquiring a conditional pattern base of each frequent item according to the head pointer list;

specifically, the conditional pattern base can be understood as the set of paths that end at the item being sought.

Step 502, generating a conditional tree according to the conditional pattern base of the frequent item;

specifically, this process can be understood as filtering out infrequent items with low support;

step 503, recursively searching for frequent item sets through the conditional tree to obtain the frequent item combination;

specifically, each frequent item in the head pointer list, combined with its prefix path, is added to the result set; the conditional tree of the frequent item is then computed, and while the conditional tree is not empty, a new conditional tree and head pointer list are constructed and the procedure is called recursively until the constructed conditional tree is empty.
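The recursion of steps 501 to 503 can be sketched in simplified form. Instead of building a linked conditional tree at every level, this example keeps each conditional pattern base as a list of (prefix path, count) pairs, which is an assumption made for brevity. It also assumes that every path lists its items in the same global order used when building the FP tree; the function name `mine` and the sample paths are hypothetical.

```python
from collections import Counter
from typing import Dict, FrozenSet, List, Tuple

PatternBase = List[Tuple[Tuple[str, ...], int]]


def mine(
    pattern_base: PatternBase,
    min_support: int,
    suffix: FrozenSet[str] = frozenset(),
) -> Dict[FrozenSet[str], int]:
    """Recursively grow frequent item combinations from (path, count) pairs."""
    support: Counter = Counter()
    for path, count in pattern_base:
        for item in set(path):
            support[item] += count
    results: Dict[FrozenSet[str], int] = {}
    for item, sup in support.items():
        if sup < min_support:
            continue  # step 502: filter out infrequent items
        combo = suffix | {item}
        results[combo] = sup
        # Step 501: conditional pattern base of `item`, i.e. the prefixes
        # of all paths that reach it, with empty prefixes dropped.
        conditional = [
            (path[: path.index(item)], count)
            for path, count in pattern_base
            if item in path
        ]
        conditional = [(p, c) for p, c in conditional if p]
        # Step 503: recurse until the conditional base is empty.
        results.update(mine(conditional, min_support, combo))
    return results


paths: PatternBase = [
    (("cough", "fever"), 2),
    (("cough", "fever", "headache"), 2),
    (("fever",), 1),
    (("cough",), 1),
]
combinations = mine(paths, min_support=2)
```

Because every path follows the same item order, each combination is generated exactly once, from its last item backwards through that item's prefixes.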

Step 304, extracting knowledge information of the data to be processed according to the matching rule pattern string;
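One way to picture the matching rule pattern string is as a regular expression built from an ordered keyword combination. The sketch below is an assumption about how such a template might look, not the format used by the method: `build_pattern`, `extract`, the gap length and the sample sentence are all invented for illustration.

```python
import re
from typing import List, Optional


def build_pattern(keywords: List[str], max_gap: int = 20) -> str:
    """Join keywords in order, capturing up to max_gap characters between them."""
    gap = "(.{0,%d}?)" % max_gap
    return gap.join(re.escape(k) for k in keywords)


def extract(text: str, pattern: str) -> Optional[List[str]]:
    """Return the text captured between the keywords, or None if no match."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return [group.strip() for group in match.groups()]


pattern = build_pattern(["symptoms of", "include"])
hit = extract("Common symptoms of influenza include fever and cough.", pattern)
miss = extract("Influenza spreads in winter.", pattern)
```

In this toy case the text between the two keywords is captured as the extracted entity; a real template would be generated from the screened keyword combinations rather than written by hand.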

Step 107, adding the result data to the knowledge base.

Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a user terminal, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A knowledge base construction method, the method comprising:

determining a data source;

acquiring ontology data according to the data source;

generating data to be processed according to the ontology data;

determining whether the data to be processed is semi-structured data;

and adding the result data to a knowledge base.

2. The method of claim 1, wherein prior to the obtaining ontology data, the method further comprises:

acquiring research information;

and constructing the ontology data according to the research information.

3. The knowledge base construction method according to claim 1, wherein the unification process is specifically:

4. The knowledge base construction method according to claim 3, wherein the determining whether the attribute vector space of the current data to be processed corresponds to the instance base specifically comprises:

setting the attribute name of the current data to be processed as a keyword;

establishing a keyword vector of the keyword;

5. The knowledge base construction method according to claim 1, wherein the extracting knowledge information of the data to be processed specifically comprises:

and extracting knowledge information through a learning model.

6. The knowledge base construction method according to claim 5, wherein the extracting knowledge information through the learning model specifically comprises:

7. The knowledge base construction method according to claim 6, wherein the determining of the keyword by the log-likelihood ratio algorithm according to the labeling information specifically comprises:

and determining the keywords according to the final likelihood ratio.

8. The knowledge base construction method according to claim 6, wherein the determining the combination of the keywords by the FP-Growth algorithm specifically comprises:

constructing an FP tree for the keywords;

mining a frequent item combination according to the FP tree;

calculating the Kulc and IR values of the frequent term combinations,

9. The knowledge base construction method according to claim 8, wherein the construction of the FP-tree for the keyword specifically comprises:

10. The knowledge base construction method according to claim 9, wherein the mining frequent item combination according to the FP-tree is specifically: