CN111143448A - Knowledge base construction method - Google Patents

Publication number
CN111143448A
Authority
CN
China
Prior art keywords
data
processed
tree
current
frequent item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911221375.XA
Other languages
Chinese (zh)
Other versions
CN111143448B (en)
Inventor
孙晓光
刘为民
张利达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Borui Tongyun Technology Co Ltd
Original Assignee
Beijing Borui Tongyun Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Borui Tongyun Technology Co Ltd
Priority to CN201911221375.XA
Publication of CN111143448A
Application granted
Publication of CN111143448B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25: Integrating or interfacing systems involving database management systems
    • G06F16/258: Data format conversion from or to a database
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a knowledge base construction method, which comprises the following steps: determining a data source; acquiring ontology data according to the data source; generating data to be processed according to the ontology data; determining whether the data to be processed is semi-structured data; if the data to be processed is not semi-structured data, extracting knowledge information from the data to be processed to obtain result data; if the data to be processed is semi-structured data, performing unified processing on the data to be processed to obtain uniformly processed data, and extracting knowledge information from the uniformly processed data to obtain result data; and adding the result data to the knowledge base.

Description

Knowledge base construction method
Technical Field
The invention relates to the technical field of data processing, in particular to a knowledge base construction method.
Background
In addition to accurately resolving the user's question, a good voice interaction system needs a large amount of underlying data to support it. For the expression of health knowledge, a relational database is a poor fit: the various relationships among health concepts are difficult to define in a relational schema, and because those relationships may need dynamic maintenance, a fixed table structure can hardly accommodate them.
Disclosure of Invention
The invention aims to provide a knowledge base construction method aiming at the defects of the prior art, so that the extraction of knowledge information can be realized no matter whether the data is semi-structured data or not.
In order to achieve the above object, the present invention provides a knowledge base construction method, including:
determining a data source;
acquiring ontology data according to the data source;
generating data to be processed according to the ontology data;
determining whether the data to be processed is semi-structured data;
if the data to be processed is not the semi-structured data, extracting knowledge information from the data to be processed to obtain result data;
if the data to be processed is the semi-structured data, performing unified processing on the data to be processed to obtain uniformly processed data;
extracting the knowledge information from the uniformly processed data to obtain result data;
and adding the result data to a knowledge base.
Preferably, before the acquiring the ontology data, the method further comprises:
acquiring research information;
the research information comprises: the domain of the ontology, the possibility of reusing existing ontologies, the terms of the ontology, the class hierarchy, the class attributes and the constraint information of the attributes;
and constructing the ontology data according to the research information.
Preferably, the unification processing specifically includes:
determining whether an instance with the same name as the current data to be processed exists;
if an instance with the same name as the current data to be processed exists, performing attribute rule mapping between the current data to be processed and the current instance, and adding the current data to be processed to an instance library;
if no instance with the same name as the current data to be processed exists, determining whether the attribute vector space of the current data to be processed corresponds to the instance library;
if the attribute vector space of the current data to be processed corresponds to the instance library, adding the name of the current data to be processed, and adding the current data to be processed to the instance library;
and if the attribute vector space of the current data to be processed does not correspond to the instance library, adding the current data to be processed to the instance library.
Further preferably, the determining whether the attribute vector space of the current data to be processed corresponds to the instance base specifically includes:
setting the attribute name of the current data to be processed as a keyword;
establishing a keyword vector of the keyword;
and calculating to obtain the cosine value of the included angle of the keyword vector to determine whether the attribute vector space of the current data to be processed corresponds to the instance library.
Preferably, the extracting knowledge information of the data to be processed specifically includes:
and extracting knowledge information through a learning model.
Further preferably, the extracting knowledge information through the learning model specifically includes:
class labeling is carried out on the training data set to obtain labeling information;
determining keywords through a log-likelihood ratio algorithm according to the labeling information;
determining the combination of the keywords through an FP-Growth algorithm, and generating a matching rule pattern string according to a combination result;
and extracting knowledge information of the data to be processed through a regular template according to the matching rule pattern string.
Further preferably, the determining the keyword according to the labeling information by using a log-likelihood ratio algorithm specifically includes:
determining, in the training data set, a first sample number of samples containing the target feature, a second sample number of samples not containing the target feature, and a third sample number of samples containing both the target feature and the target word;
carrying out maximum likelihood calculation on the first sample number, the second sample number and the third sample number to obtain a first likelihood function and a second likelihood function;
obtaining a final likelihood ratio according to the first likelihood function and the second likelihood function;
and determining the keywords according to the final likelihood ratio.
Further preferably, the determining, by using the FP-Growth algorithm, the combination of the keywords specifically is:
constructing an FP tree for the keywords;
mining a frequent item combination according to the FP tree;
calculating the Kulc value and the IR value of each frequent item combination;
and if frequent items with Kulc values larger than a first threshold value and IR values smaller than a second threshold value exist, screening the frequent items, and determining the combination of the keywords according to the screening result.
Further preferably, the building of the FP-tree for the keyword specifically includes:
determining keywords through a log-likelihood ratio algorithm according to the labeling information in a traversal mode to obtain a complete set of the keywords and the support degree of each keyword;
obtaining a frequent item list according to the complete set and the support degree of the keywords;
sequencing each keyword according to the frequent item list to obtain frequent items;
determining whether a node in a constructed FP tree is the same as the frequent item and whether a prefix of the node in the constructed FP tree is the same as the prefix of the frequent item;
if the node in the constructed FP tree is the same as a frequent item and the prefix of the node in the constructed FP tree is the same as the prefix of the frequent item, the count of the node in the constructed FP tree is increased by one;
if the node in the constructed FP tree is different from the frequent item, or the prefix of the node in the constructed FP tree is different from the prefix of the frequent item, a new node is generated and inserted at the tail of the linked list of the current frequent item.
Further preferably, the mining of frequent item combinations according to the FP tree specifically includes:
acquiring a conditional pattern base of each frequent item according to the head pointer list;
generating a conditional tree according to the conditional pattern base of the frequent item;
and recursively searching for frequent item sets through the conditional tree to obtain the frequent item combinations.
According to the knowledge base construction method provided by the embodiment of the invention, semi-structured data and non-semi-structured data are processed differently, so that knowledge information can be extracted whether or not the data is semi-structured, and the accuracy of knowledge base construction in the health field is improved.
Drawings
FIG. 1 is a flowchart of a knowledge base construction method provided by an embodiment of the invention;
FIG. 2 is a flowchart of a method for performing unified processing on data to be processed according to an embodiment of the present invention;
FIG. 3 is a flowchart of a method for extracting knowledge information from data to be processed according to an embodiment of the present invention;
FIG. 4 is a flowchart of a method for building an FP tree for keywords according to an embodiment of the present invention;
FIG. 5 is a flowchart of a method for mining frequent item combinations according to the FP tree according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described in detail by the accompanying drawings and embodiments.
The embodiment of the invention provides a knowledge base construction method, which is applied to a server. For a better understanding of the contents of the present application, the following first explains definitions of key terms referred to in the present application:
Ontology: an ontology gives the basic terms and relations that make up the vocabulary of a subject area, together with the rules for combining those terms and relations to define extensions of the vocabulary.
Knowledge base: a system that stores human knowledge in a computer in a certain representation form, so that a large amount of knowledge can be used and managed conveniently and efficiently.
Entity alignment: also called entity matching or entity resolution; the process of determining whether two entities, in the same data set or in different data sets, refer to the same thing in the real world.
Support: the frequency with which the antecedent and the consequent appear together in a data set, expressed as a percentage.
As shown in fig. 1, the knowledge base construction method according to the embodiment of the present application includes:
Step 101, a server determines a data source;
Specifically, diverse data sources ensure that the health knowledge finally constructed has both breadth and depth. Because knowledge in the health field covers a wide range, data should be collected from multiple data sources.
In a specific example, the data sources in the present application include the following categories:
Portal websites of authoritative departments, such as the Chinese Center for Disease Control and Prevention and the Chinese Medical Association, from which health specifications and authoritative interpretations of ontology classes can be extracted.
Well-known health websites, whose rich health information reflects health ontology classes, attributes and relationships well, and whose data is expressed as semi-structured data.
Health-related information websites, such as online cases and online inquiries, which serve as a supplementary resource for the ontology.
Step 102, acquiring ontology data according to the data source;
Specifically, target data is captured by crawler technology according to the characteristics of the target websites in the data source, and the target data is added to a temporary processing result library as ontology data.
Step 103, generating data to be processed according to the ontology data;
Specifically, the ontology data is extracted from the temporary processing result library and compiled to generate the data to be processed.
In a specific example, the server compiles the ontology data using the seven-step method proposed by Stanford University: it acquires research information including the domain of the ontology, the possibility of reusing existing ontologies, the terms of the ontology, the class hierarchy, the class attributes and the constraint information of the attributes, and constructs the ontology data according to the research information.
The process of acquiring the ontology data can be performed by software, preferably Protégé. Compared with similar systems, Protégé has the following advantages: its graphical interface allows visual editing, it supports multiple ontology languages, its Chinese plug-in is user-friendly, and it supports Unicode character input.
In a specific embodiment, the present application adopts the Web Ontology Language (OWL) as the language in which the acquired ontology data is expressed. OWL is an ontology description language recommended by the W3C. It has strong expressive power and describes domain knowledge in an object-oriented manner: objects are described by classes and attributes, and the features and relationships of those classes and attributes are described by axioms.
Step 104, determining whether the data to be processed is semi-structured data;
Specifically, knowledge expression is generally divided into four levels. L1: unstructured. L2: semi-structured. L3: structured, that is, the data elements are structured and carry semantics, and their expression logic can be read by a computer, as with XML or UML. L4: executable, that is, the computer can reason directly from the definitions in the knowledge base. Knowledge that is to be interpreted by a computer must be expressed at level L3 or L4.
If the data to be processed is semi-structured data, step 105 is executed. If the data to be processed is not semi-structured data, step 106 is executed;
Step 105, performing unified processing on the data to be processed;
Specifically, in an example, as shown in fig. 2, performing unified processing on the data to be processed may include:
Step 201, determining whether an instance with the same name as the current data to be processed exists;
Specifically, if such an instance exists, step 202 is executed; if not, step 203 is executed.
Step 202, performing attribute rule mapping between the current data to be processed and the current instance, and adding the current data to be processed to an instance library;
Specifically, because the data to be processed may come from multiple data sources, several records may describe the same thing with different attribute expressions. Processing these records so that they are treated as the same thing in the system is entity alignment. The attribute rule mapping between the current data to be processed and the current instance in the present application can be understood as an entity alignment process.
Step 203, determining whether the attribute vector space of the current data to be processed corresponds to the instance base;
Specifically, the server takes the attribute names of the current data to be processed as keywords and builds a keyword vector from them, as in formula 1:
$S = \{w_1, w_2, \ldots, w_n\}$ (formula 1)
where S denotes the unstructured text keyword vector of an entity and $w_1, \ldots, w_n$ are its keywords.
The cosine of the angle between keyword vectors is then calculated according to formula 1 and formula 2, and whether the attribute vector space of the current data to be processed corresponds to the instance library is determined from that cosine value:
$\mathrm{sim}(S_1, S_2) = \cos(S_1, S_2)$ (formula 2)
where $S_1$ denotes the unstructured text keyword vector of entity 1, $S_2$ denotes the unstructured text keyword vector of entity 2, and $\cos(S_1, S_2)$ denotes the cosine of the angle between the two vectors.
If the attribute vector space of the current data to be processed corresponds to the instance base, executing step 204; if the attribute vector space of the current data to be processed does not correspond to the instance base, step 205 is executed.
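The comparison in step 203 can be illustrated with a minimal sketch. It assumes that each keyword vector is a bag of normalized attribute names and that "corresponds" means the cosine value reaches a threshold; the function names and the threshold of 0.8 are hypothetical, since the source does not state them.

```python
import math
from collections import Counter


def keyword_vector(attribute_names):
    """Build a keyword -> count vector from attribute names (formula 1)."""
    return Counter(name.strip().lower() for name in attribute_names)


def cosine_similarity(s1, s2):
    """Cosine of the angle between two keyword vectors (formula 2)."""
    dot = sum(s1[k] * s2[k] for k in s1.keys() & s2.keys())
    norm1 = math.sqrt(sum(v * v for v in s1.values()))
    norm2 = math.sqrt(sum(v * v for v in s2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm1 * norm2)


def corresponds(s1, s2, threshold=0.8):
    """Assumed decision rule: the vectors correspond if cosine >= threshold."""
    return cosine_similarity(s1, s2) >= threshold
```

Two records that share two of three attribute names score 2/3 under this sketch, which falls below the assumed threshold, so they would be treated as different instances.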
Step 204, adding the name of the current data to be processed, and adding the current data to be processed to the instance library;
Specifically, if the attribute vector space of the current data to be processed corresponds to the instance library, the name of the current data to be processed is added, attribute rule mapping is performed between the current data to be processed and the current instance, and the current data to be processed is added to the instance library. The current data to be processed is then the uniformly processed data.
Step 205, adding the current data to be processed to the instance library;
Specifically, the current data to be processed is then the uniformly processed data.
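Steps 201 to 205 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the instance library is modeled as a dictionary, "adding the name" in step 204 is interpreted as recording an alias on the matched instance, attribute rule mapping is reduced to renaming attribute keys, and the 0.8 threshold is hypothetical.

```python
import math


def cosine(names_a, names_b):
    """Cosine between two binary attribute-name vectors."""
    a, b = set(names_a), set(names_b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def find_by_name(library, name):
    """Step 201: look for an instance with the same name (or alias)."""
    for canonical, instance in library.items():
        if name == canonical or name in instance["aliases"]:
            return canonical
    return None


def unify(record, library, attribute_map=None, threshold=0.8):
    """Return the name of the instance the record was merged into or created as."""
    attribute_map = attribute_map or {}
    # Attribute rule mapping (assumed form): rename source-specific keys.
    attrs = {attribute_map.get(k, k): v for k, v in record["attributes"].items()}
    name = record["name"]

    canonical = find_by_name(library, name)
    if canonical is not None:                      # step 202
        library[canonical]["attributes"].update(attrs)
        return canonical

    best, best_sim = None, 0.0                     # step 203
    for candidate, instance in library.items():
        sim = cosine(attrs, instance["attributes"])
        if sim > best_sim:
            best, best_sim = candidate, sim

    if best is not None and best_sim >= threshold:  # step 204
        library[best]["aliases"].add(name)
        library[best]["attributes"].update(attrs)
        return best

    library[name] = {"aliases": set(), "attributes": attrs}  # step 205
    return name
```

Under these assumptions, a record named "high blood pressure" whose attribute names match those of an existing "hypertension" instance is merged into that instance and remembered as an alias, so later records with either name resolve to the same entry.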
Step 106, extracting knowledge information from the data to be processed;
Specifically, knowledge information extraction targets unstructured data, and is mainly used to discover new knowledge from unstructured health sources such as online inquiries and electronic medical records. In such data the entities and relationships are hidden in the text, so this data is processed by a machine learning method, that is, knowledge information is extracted by a learning model.
In one example, as shown in fig. 3, the extracting knowledge information from the data to be processed may include:
Step 301, performing class labeling on a training data set to obtain labeling information;
Step 302, determining keywords through a log-likelihood ratio algorithm according to the labeling information;
Specifically, the log-likelihood ratio is a hypothesis testing method that compares how well two models fit, in order to determine which of the compared models is more reliable on the current data.
The server determines, in the training data set, a first sample number of samples containing the target feature, a second sample number of samples not containing the target feature, and a third sample number of samples containing both the target feature and the target word. It performs maximum likelihood calculation on the first, second and third sample numbers to obtain a first likelihood function (for hypothesis 1 below) and a second likelihood function (for hypothesis 2 below), obtains a final likelihood ratio from the two likelihood functions, and finally determines the keywords according to the final likelihood ratio.
In a specific embodiment, two hypotheses are made first. Hypothesis 1 states that the probability of the word w occurring when the feature F occurs is the same as the probability of the word w occurring when the feature F does not occur, that is, the word w and the feature F are not correlated. Hypothesis 2 states that these two probabilities differ, that is, the occurrence of F affects the probability that w occurs, so the two are correlated. For example, if statistics show that "Ma xi" tends to appear together with "Ma ya", and the probability of the two appearing together differs from the probability of "Ma xi" appearing alone, hypothesis 2 is satisfied. As another example, if the probability of "weather" appearing together with "football" is the same as the probability of "weather" appearing alone, hypothesis 1 is satisfied: "football" does not affect the statistics of "weather", and the two are not correlated. Hypothesis 1 is as in formula 1:
$P(w \mid F) = p = P(w \mid \neg F)$ (formula 1)
Hypothesis 2 is as in formula 2:
$P(w \mid F) = p_1 \neq p_2 = P(w \mid \neg F)$ (formula 2)
Hypothesis 2 is the core of the keyword extraction algorithm: a word w does not occur at random with respect to the feature F. A feature is understood here as a word or phrase that affects the probability of the word w occurring, such as "jazz" in the above example.
According to the analysis of the training corpus, the following results can be obtained:
$c_1 = N(F),\quad c_2 = N(w),\quad c_{1,2} = N(w \wedge F)$ (formula 3)
In the above expression, c denotes a number of occurrences: $c_1$ is the number of occurrences of the feature F, $c_2$ is the number of occurrences of the word w, and $c_{1,2}$ is the number of co-occurrences of the word w and the feature F, so the probabilities can be estimated by counting occurrences. $N(F)$ and $N(\neg F)$ represent the first sample number and the second sample number respectively, and $N(w \wedge F)$ represents the third sample number.
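Performing maximum likelihood estimation on p, $p_1$ and $p_2$ gives, with N denoting the total number of samples:
$p = \frac{c_2}{N},\quad p_1 = \frac{c_{1,2}}{c_1},\quad p_2 = \frac{c_2 - c_{1,2}}{N - c_1}$ (formula 4)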
The known binomial distribution is as follows:
$b(k; n, x) = \binom{n}{k} x^k (1 - x)^{n - k}$ (formula 5)
In formula 5, k is the number of successes in n trials, each with success probability x. From it, the likelihood function of hypothesis 1, i.e. the first likelihood function, can be obtained:
$L(H_1) = b(c_{1,2}; c_1, p)\, b(c_2 - c_{1,2}; N - c_1, p)$ (formula 6)
The likelihood function of hypothesis 2, i.e. the second likelihood function, is:
$L(H_2) = b(c_{1,2}; c_1, p_1)\, b(c_2 - c_{1,2}; N - c_1, p_2)$ (formula 7)
The final likelihood ratio is defined as follows:
$\lambda = \frac{L(H_1)}{L(H_2)}$ (formula 8)
Wilks has shown that, when the sample is large enough, $-2\log\lambda$ asymptotically follows a $\chi^2$ distribution. Substituting the above expressions and simplifying finally gives the log-likelihood ratio of the concept and the word:
[Formula 9: equation image not recoverable from the source text]
When the word w appears more frequently outside the feature F, it discriminates the feature F less well and is less likely to be selected as a keyword. To obtain a keyword set of the highest possible purity, a parameter ε (ε > 1) is introduced to modify the formula, giving:
[Formula 10: equation image not recoverable from the source text]
The larger the value of L(w, F), the higher the weight of the word in the feature, and the more suitable the word is as a keyword of the feature.
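The scoring in step 302 can be sketched with the standard log-likelihood ratio statistic $-2\log\lambda$ built from the counts $c_1 = N(F)$, $c_2 = N(w)$, $c_{1,2} = N(w \wedge F)$ and the total sample count N. This sketch assumes the usual maximum likelihood estimates and does not include the ε modification, whose formula is not available in the text; the function names are hypothetical.

```python
import math


def _log_l(k, n, x):
    """log of x^k (1-x)^(n-k); the binomial coefficient cancels in the ratio."""
    tiny = 1e-12
    x = min(max(x, tiny), 1 - tiny)  # avoid log(0) at the boundaries
    return k * math.log(x) + (n - k) * math.log(1 - x)


def log_likelihood_ratio(c1, c2, c12, n):
    """Return -2 log(lambda) for word w and feature F.

    c1 = N(F), c2 = N(w), c12 = N(w and F), n = total samples.
    Requires 0 < c1 < n.
    """
    p = c2 / n                      # hypothesis 1: one shared probability
    p1 = c12 / c1                   # hypothesis 2: P(w | F)
    p2 = (c2 - c12) / (n - c1)      # hypothesis 2: P(w | not F)
    log_lambda = (
        _log_l(c12, c1, p) + _log_l(c2 - c12, n - c1, p)
        - _log_l(c12, c1, p1) - _log_l(c2 - c12, n - c1, p2)
    )
    return -2.0 * log_lambda


def rank_keywords(word_counts, c1, n):
    """Rank words by score; word_counts maps word -> (c2, c12)."""
    scores = {
        w: log_likelihood_ratio(c1, c2, c12, n)
        for w, (c2, c12) in word_counts.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

A word that is equally frequent inside and outside the feature scores close to zero, while a word concentrated inside the feature scores high. Note that the unmodified statistic is also large for words that strongly avoid the feature, which is one reason a correction such as the ε parameter described above may be useful.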
Step 303, determining the combination of the keywords through an FP-Growth algorithm, and generating a matching rule pattern string according to a combination result;
Specifically, the server constructs an FP tree for the keywords, mines frequent item combinations according to the FP tree, calculates the Kulc value and the IR value of each frequent item combination, screens out the frequent items whose Kulc value is larger than a first threshold and whose IR value is smaller than a second threshold, and determines the keyword combinations according to the screening result.
Kulc is defined as follows:
$\mathrm{Kulc}(A, B) = \frac{1}{2}\left(P(A \mid B) + P(B \mid A)\right)$ (formula 11)
where A and B represent two frequent items (frequently occurring keywords). The expression indicates whether the two frequent items are correlated; its value ranges from 0 to 1, and a larger value indicates a stronger correlation.
IR is defined as follows:
$\mathrm{IR}(A, B) = \frac{|\mathrm{sup}(A) - \mathrm{sup}(B)|}{\mathrm{sup}(A) + \mathrm{sup}(B) - \mathrm{sup}(A \cup B)}$ (formula 12)
An IR of 0 means that the two items are perfectly balanced; the larger the difference between the two, the larger the imbalance ratio.
Preferably, the first threshold is 0.5; the second threshold value is 0.1.
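The screening rule can be sketched from support counts. The sketch assumes the commonly used definitions of the Kulczynski measure and the imbalance ratio, with P(A|B) estimated as sup(A∪B)/sup(B); the function names are hypothetical, and the default thresholds are the preferred values of 0.5 and 0.1 given above.

```python
def kulc(sup_a, sup_b, sup_ab):
    """Kulczynski measure: average of P(A|B) and P(B|A)."""
    return 0.5 * (sup_ab / sup_b + sup_ab / sup_a)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """Imbalance ratio of two items (0 means perfectly balanced)."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


def keep_combination(sup_a, sup_b, sup_ab, kulc_min=0.5, ir_max=0.1):
    """Keep a pair only if it is strongly correlated and well balanced."""
    return (
        kulc(sup_a, sup_b, sup_ab) > kulc_min
        and imbalance_ratio(sup_a, sup_b, sup_ab) < ir_max
    )
```

For instance, two keywords that each occur 100 times and co-occur 80 times pass the filter, whereas a pair with supports 1000 and 100 fails it because the supports are heavily imbalanced.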
In a specific example, as shown in fig. 4, constructing the FP-tree for the keyword may include:
Step 401, traversing the keywords determined through the log-likelihood ratio algorithm according to the labeling information, to obtain the complete set of keywords and the support of each keyword;
Step 402, obtaining a frequent item list according to the complete set and the support of each keyword;
Step 403, sorting the keywords according to the frequent item list to obtain frequent items;
Step 404, determining whether a node in the constructed FP tree is the same as the frequent item and whether the prefix of that node is the same as the prefix of the frequent item;
Specifically, if a node in the constructed FP tree is the same as the frequent item and its prefix is the same as the prefix of the frequent item, step 405 is executed; if the node differs from the frequent item, or its prefix differs from the prefix of the frequent item, step 406 is executed.
Before this step, an FP-tree root node with a null value may be created first.
Step 405, increasing the count of the node in the constructed FP tree by one;
Step 406, generating a new node, and inserting the new node at the tail of the linked list of the current frequent item;
Step 407, establishing an additional head pointer list;
Specifically, while the FP tree is constructed, an additional head pointer list is established. It records each frequent item, sorted by support, and records the positions of each item in the tree through a linked list.
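Steps 401 to 407 can be sketched as below. The sketch assumes that each training sample contributes one set of keywords (a transaction), that support is an absolute count with a hypothetical `min_support` cut-off, and that ties in support are broken alphabetically; class and function names are illustrative only.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 1
        self.parent = parent
        self.children = {}
        self.next = None  # next node holding the same item (linked list)


def build_fp_tree(transactions, min_support):
    # Steps 401-402: support of each keyword and the frequent item list.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in support.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    root = FPNode(None, None)  # null-valued root node
    root.count = 0
    # Step 407: head pointer list, sorted by descending support.
    header = {i: {"support": frequent[i], "head": None, "tail": None} for i in order}

    for t in transactions:
        node = root
        # Step 403: keep frequent items only, sorted by the frequent item list.
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is not None:
                child.count += 1                 # step 405: same item, same prefix
            else:
                child = FPNode(item, node)       # step 406: new node
                node.children[item] = child
                entry = header[item]
                if entry["head"] is None:
                    entry["head"] = child
                else:
                    entry["tail"].next = child   # append at the list tail
                entry["tail"] = child
            node = child
    return root, header
```

Following `header[item]["head"]` and then each node's `next` pointer visits every occurrence of an item in the tree, which is what the mining stage relies on.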
In a specific example, as shown in fig. 5, mining frequent item combinations according to the FP tree may specifically be:
Step 501, acquiring the conditional pattern base of each frequent item according to the head pointer list;
Specifically, the conditional pattern base can be understood as the set of prefix paths that end at the item being searched for.
Step 502, generating a conditional tree according to the conditional pattern base of the frequent item;
Specifically, this process can be understood as filtering out infrequent items with low support;
Step 503, recursively searching for frequent item sets through the conditional tree to obtain the frequent item combinations;
Specifically, the combination of each frequent item in the head pointer list with its prefix path is added to a result set, and the conditional tree of the frequent item is then computed. While the conditional tree is not empty, a new conditional tree and head pointer list are constructed and the procedure is called recursively, until the constructed conditional tree is empty.
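Steps 501 to 503 can be sketched as a compact recursive FP-Growth. As a simplification, the head pointer list is modeled here as a plain list of nodes per item rather than a linked list, and support is an absolute count; all names are hypothetical.

```python
from collections import Counter, defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}


def build_tree(weighted_paths, min_support):
    """Build an FP tree from (items, count) pairs; return (header, frequent)."""
    support = Counter()
    for items, count in weighted_paths:
        for item in set(items):
            support[item] += count
    frequent = {i: c for i, c in support.items() if c >= min_support}
    root = Node(None, None)
    header = defaultdict(list)  # item -> nodes holding that item
    for items, count in weighted_paths:
        ordered = sorted((i for i in set(items) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(weighted_paths, min_support, suffix=frozenset()):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    header, frequent = build_tree(weighted_paths, min_support)
    results = {}
    for item, support in frequent.items():
        itemset = suffix | {item}
        results[itemset] = support
        # Step 501: conditional pattern base = prefix paths ending at item.
        base = []
        for node in header[item]:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                base.append((path, node.count))
        # Steps 502-503: build the conditional tree and recurse until empty.
        if base:
            results.update(fp_growth(base, min_support, itemset))
    return results


def mine(transactions, min_support):
    return fp_growth([(t, 1) for t in transactions], min_support)
```

Building the conditional tree inside the recursive call applies the `min_support` filter again, which corresponds to discarding infrequent items in step 502.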
Step 304, extracting knowledge information from the data to be processed according to the matching rule pattern string;
Step 107, adding the result data to a knowledge base;
According to the knowledge base construction method provided by the embodiment of the invention, semi-structured data and non-semi-structured data are processed differently, so that knowledge information can be extracted whether or not the data is semi-structured, and the accuracy of knowledge base construction in the health field is improved.
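The branching between steps 104 and 107 can be summarized in a short sketch. The three callables passed in are placeholders for the semi-structured check, the unified processing and the knowledge extraction described above; their names and signatures are assumptions made for illustration.

```python
def build_knowledge_base(records, is_semi_structured, unify, extract,
                         knowledge_base=None):
    """Route each record through steps 104-107 and collect the result data."""
    knowledge_base = [] if knowledge_base is None else knowledge_base
    for record in records:
        if is_semi_structured(record):        # step 104
            record = unify(record)            # step 105: unified processing
        result_data = extract(record)         # step 106: knowledge extraction
        knowledge_base.extend(result_data)    # step 107: add to knowledge base
    return knowledge_base
```

Passing the processing stages as parameters keeps the control flow visible without committing to any particular implementation of each stage.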
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied in hardware, in a software module executed by a user terminal, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A knowledge base construction method, the method comprising:
determining a data source;
acquiring ontology data according to the data source;
capturing data from the data source according to the ontology data to generate data to be processed;
determining whether the data to be processed is semi-structured data;
if the data to be processed is not the semi-structured data, extracting knowledge information of the data to be processed to obtain result data;
if the data to be processed is the semi-structured data, performing unified processing on the data to be processed to obtain uniformly processed data;
extracting the knowledge information from the uniformly processed data to obtain result data;
and adding the result data to a knowledge base.
2. The knowledge base construction method according to claim 1, wherein before the acquiring of the ontology data, the method further comprises:
acquiring research information;
the research information comprising: the domain of the ontology, the possibility of reusing an existing ontology, the terms of the ontology, the class hierarchy, the class attributes and the constraint information of the attributes;
and constructing the ontology data according to the research information.
3. The knowledge base construction method according to claim 1, wherein the unification process is specifically:
determining whether an instance with the same name as the current data to be processed exists;
if an instance with the same name as the current data to be processed exists, performing attribute rule mapping between the current data to be processed and that instance, and adding the current data to be processed to an instance library;
if no instance with the same name as the current data to be processed exists, determining whether the attribute vector space of the current data to be processed corresponds to the instance library;
if the attribute vector space of the current data to be processed corresponds to the instance library, adding the name of the current data to be processed, and adding the current data to be processed to the instance library;
and if the attribute vector space of the current data to be processed does not correspond to the instance library, adding the current data to be processed to the instance library.
4. The knowledge base construction method according to claim 3, wherein the determining whether the attribute vector space of the current data to be processed corresponds to the instance library specifically comprises:
setting the attribute name of the current data to be processed as a keyword;
establishing a keyword vector of the keyword;
and calculating to obtain the cosine value of the included angle of the keyword vector to determine whether the attribute vector space of the current data to be processed corresponds to the instance library.
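The cosine comparison recited in claim 4 can be illustrated with a minimal sketch. It assumes that a keyword vector is a simple count of attribute names and that "corresponds" means the cosine meets a threshold; the function names, the sample attribute names and the 0.8 threshold are hypothetical and are not specified by the claims.

```python
import math
from collections import Counter


def keyword_vector(attribute_names):
    """Treat each attribute name as a keyword and count its occurrences."""
    return Counter(name.strip().lower() for name in attribute_names)


def cosine(u, v):
    """Cosine of the angle between two sparse keyword vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (norm_u * norm_v)


def corresponds(candidate_attrs, instance_attrs, threshold=0.8):
    """Hypothetical decision rule: similar enough if cosine >= threshold."""
    u = keyword_vector(candidate_attrs)
    v = keyword_vector(instance_attrs)
    return cosine(u, v) >= threshold


# Illustrative attribute names only; they do not come from the patent.
candidate = ["name", "symptom", "treatment"]
existing = ["name", "symptom", "cause"]
similar = corresponds(candidate, existing)
```

With the sample data, two of three attribute names are shared, so the cosine is 2/3 and the pair falls below the assumed threshold.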
5. The knowledge base construction method according to claim 1, wherein the extracting knowledge information of the data to be processed specifically comprises:
and extracting knowledge information through a learning model.
6. The knowledge base construction method according to claim 5, wherein the extracting knowledge information through the learning model specifically comprises:
class labeling is carried out on the training data set to obtain labeling information;
determining keywords through a log-likelihood ratio algorithm according to the labeling information;
determining the combination of the keywords through an FP-Growth algorithm, and generating a matching rule pattern string according to a combination result;
and extracting knowledge information of the data to be processed through a regular template according to the matching rule pattern string.
7. The knowledge base construction method according to claim 6, wherein the determining of the keyword by the log-likelihood ratio algorithm according to the labeling information specifically comprises:
determining, in the training data set, a first sample number of target features that include the current keyword, a second sample number of features that do not include the current keyword, and a third sample number of features that include both the current keyword and the target word;
carrying out maximum likelihood calculation on the first sample number, the second sample number and the third sample number to obtain a first likelihood function and a second likelihood function;
obtaining a final likelihood ratio according to the first likelihood function and the second likelihood function;
and determining the keywords according to the final likelihood ratio.
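The claim does not give the likelihood formulas, so the following is only a sketch of one common formulation, a binomial log-likelihood ratio test. It assumes the counts are arranged as "samples of the target class containing the keyword" and "samples of other classes containing the keyword"; the argument names `k1`, `n1`, `k2`, `n2` and this arrangement are assumptions and may not match the three sample numbers of claim 7 exactly.

```python
import math


def _xlogy(x, y):
    """x * log(y), with the convention 0 * log(0) = 0."""
    return 0.0 if x == 0 else x * math.log(y)


def log_likelihood(k, n, p):
    """Binomial log-likelihood of k hits in n samples (constant term dropped)."""
    return _xlogy(k, p) + _xlogy(n - k, 1.0 - p)


def log_likelihood_ratio(k1, n1, k2, n2):
    """Log-likelihood ratio score for a candidate keyword.

    k1 of n1 target-class samples contain the keyword;
    k2 of n2 other-class samples contain the keyword.
    """
    p = (k1 + k2) / (n1 + n2)      # first hypothesis: one shared rate
    p1, p2 = k1 / n1, k2 / n2      # second hypothesis: separate rates
    log_l0 = log_likelihood(k1, n1, p) + log_likelihood(k2, n2, p)
    log_l1 = log_likelihood(k1, n1, p1) + log_likelihood(k2, n2, p2)
    return 2.0 * (log_l1 - log_l0)


def select_keywords(counts, n1, n2, top_k):
    """Rank candidate words by score; counts maps word -> (k1, k2)."""
    scored = sorted(counts, reverse=True,
                    key=lambda w: log_likelihood_ratio(counts[w][0], n1,
                                                       counts[w][1], n2))
    return scored[:top_k]
```

A word that is spread evenly across classes scores close to zero, while a word concentrated in one class scores high, which is why the ratio can serve as a keyword filter.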
8. The knowledge base construction method according to claim 6, wherein the determining the combination of the keywords by the FP-Growth algorithm specifically comprises:
constructing an FP tree for the keywords;
mining a frequent item combination according to the FP tree;
calculating the Kulc and IR values of the frequent item combinations;
and if frequent items with Kulc values larger than a first threshold value and IR values smaller than a second threshold value exist, screening the frequent items, and determining the combination of the keywords according to the screening result.
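The screening in claim 8 can be sketched with the standard pattern-evaluation definitions of the Kulczynski measure (Kulc) and the imbalance ratio (IR) for a pair of itemsets A and B. The patent text shown here does not state the formulas, so these definitions, the function names and the default thresholds of 0.5 are assumptions for illustration.

```python
def kulczynski(sup_a, sup_b, sup_ab):
    """Kulc(A, B) = (P(A|B) + P(B|A)) / 2, computed from support counts."""
    return 0.5 * (sup_ab / sup_a + sup_ab / sup_b)


def imbalance_ratio(sup_a, sup_b, sup_ab):
    """IR(A, B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A and B))."""
    return abs(sup_a - sup_b) / (sup_a + sup_b - sup_ab)


def keep_combination(sup_a, sup_b, sup_ab, kulc_min=0.5, ir_max=0.5):
    """Keep a combination whose Kulc exceeds the first threshold and whose
    IR is below the second threshold (threshold values are hypothetical)."""
    return (kulczynski(sup_a, sup_b, sup_ab) > kulc_min
            and imbalance_ratio(sup_a, sup_b, sup_ab) < ir_max)


# Balanced and strongly associated pair: kept.
balanced = keep_combination(sup_a=100, sup_b=100, sup_ab=80)
# Very unbalanced pair: Kulc is just above 0.5 but IR is 0.99, so it is dropped.
skewed = keep_combination(sup_a=1000, sup_b=10, sup_ab=10)
```

Using both measures together rejects combinations in which one keyword is so common that the apparent association says little about the rarer one.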
9. The knowledge base construction method according to claim 8, wherein the construction of the FP-tree for the keyword specifically comprises:
traversing the keywords determined through the log-likelihood ratio algorithm according to the labeling information, to obtain a complete set of the keywords and the support of each keyword;
obtaining a frequent item list according to the complete set and the support of the keywords;
sorting the keywords according to the frequent item list to obtain frequent items;
determining whether a node in a constructed FP tree is the same as the frequent item and whether a prefix of the node in the constructed FP tree is the same as the prefix of the frequent item;
if the node in the constructed FP tree is the same as a frequent item and the prefix of the node in the constructed FP tree is the same as the prefix of the frequent item, the count of the node in the constructed FP tree is increased by one;
if the node in the constructed FP tree is different from the frequent item, or the prefix of the node in the constructed FP tree is different from the prefix of the frequent item, generating a new node and inserting the new node at the tail of the linked list of the current frequent item.
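The insertion rule of claim 9 can be sketched as follows. This is a minimal illustration, not the patented implementation: `FPNode`, `build_fp_tree`, the `min_support` parameter and the sample keyword sets are hypothetical, and a Python list per item stands in for the linked list, so appending corresponds to inserting at the tail.

```python
from collections import Counter


class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}  # item -> child FPNode


def build_fp_tree(transactions, min_support):
    """Build an FP tree and its head pointer table from keyword sets."""
    # First pass: support of every keyword over the full set.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {i: s for i, s in support.items() if s >= min_support}

    root = FPNode(None, None)
    header = {item: [] for item in frequent}  # head pointer table

    # Second pass: insert each transaction in descending-support order.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                # No node with this item under the same prefix: create one
                # and append it to the tail of the item's node list.
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)
            # Same item and same prefix (or newly created): count goes up by one.
            child.count += 1
            node = child
    return root, header


# Illustrative keyword sets only.
sample = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]]
root, header = build_fp_tree(sample, min_support=2)
```

In the sample, the keyword "b" ends up with two nodes, one under "a" and one directly under the root, because its prefixes differ.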
10. The knowledge base construction method according to claim 9, wherein the mining frequent item combination according to the FP-tree is specifically:
acquiring a conditional pattern base of each frequent item according to the head pointer list;
generating a conditional tree according to the conditional pattern base of the frequent item;
and recursively searching a frequent item set through the conditional tree to obtain the frequent item combination.
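The recursive mining of claim 10 can be sketched as below. A compact tree builder is included only so that the block runs on its own; all names (`Node`, `build_tree`, `mine`, `fp_growth`) and the `min_support` parameter are hypothetical, and the sketch follows the textbook FP-Growth procedure rather than any implementation detail confirmed by the patent.

```python
from collections import defaultdict


class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}


def build_tree(weighted_paths, min_support):
    """Build an FP tree from (items, count) pairs; return its head pointer table."""
    support = defaultdict(int)
    for items, count in weighted_paths:
        for item in set(items):
            support[item] += count
    frequent = {i for i, s in support.items() if s >= min_support}
    root, header = Node(None, None), defaultdict(list)
    for items, count in weighted_paths:
        node = root
        for item in sorted((i for i in set(items) if i in frequent),
                           key=lambda i: (-support[i], i)):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += count
    return header


def mine(header, min_support, suffix=frozenset(), results=None):
    """Recursively collect frequent item combinations with their supports."""
    if results is None:
        results = {}
    for item, nodes in header.items():
        itemset = suffix | {item}
        results[itemset] = sum(n.count for n in nodes)
        # Conditional pattern base: the prefix path of every node of this item.
        base = []
        for n in nodes:
            path, p = [], n.parent
            while p is not None and p.item is not None:
                path.append(p.item)
                p = p.parent
            if path:
                base.append((path, n.count))
        # Conditional tree of this item; recurse until it is empty.
        cond_header = build_tree(base, min_support)
        if cond_header:
            mine(cond_header, min_support, itemset, results)
    return results


def fp_growth(transactions, min_support):
    header = build_tree([(t, 1) for t in transactions], min_support)
    return mine(header, min_support)


combos = fp_growth([["a", "b"], ["a", "c"], ["a", "b", "c"], ["b"]], 2)
```

Each returned key is a frozenset of keywords and each value is its support count, which is the input needed for the Kulc and IR screening of claim 8.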
CN201911221375.XA 2019-12-03 2019-12-03 Knowledge base construction method Active CN111143448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911221375.XA CN111143448B (en) 2019-12-03 2019-12-03 Knowledge base construction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911221375.XA CN111143448B (en) 2019-12-03 2019-12-03 Knowledge base construction method

Publications (2)

Publication Number Publication Date
CN111143448A true CN111143448A (en) 2020-05-12
CN111143448B CN111143448B (en) 2023-05-12

Family

ID=70517521

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911221375.XA Active CN111143448B (en) 2019-12-03 2019-12-03 Knowledge base construction method

Country Status (1)

Country Link
CN (1) CN111143448B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7542958B1 (en) * 2002-09-13 2009-06-02 Xsb, Inc. Methods for determining the similarity of content and structuring unstructured content from heterogeneous sources
US20100185689A1 (en) * 2009-01-20 2010-07-22 Microsoft Corporation Enhancing Keyword Advertising Using Wikipedia Semantics
CN104281617A (en) * 2013-07-10 2015-01-14 广州中国科学院先进技术研究所 Domain knowledge-based multilayer association rules mining method and system
CN105224631A (en) * 2015-09-24 2016-01-06 四川长虹电器股份有限公司 Build the system of the open cloud of industry and the method for establishment XBRL financial statement
CN106407208A (en) * 2015-07-29 2017-02-15 清华大学 Establishment method and system for city management ontology knowledge base
CN106844723A (en) * 2017-02-10 2017-06-13 厦门大学 medical knowledge base construction method based on question answering system
CN108399180A (en) * 2017-02-08 2018-08-14 腾讯科技(深圳)有限公司 A kind of knowledge mapping construction method, device and server
CN108595449A (en) * 2017-11-23 2018-09-28 北京科东电力控制***有限责任公司 The structure and application process of dispatch automated system knowledge mapping
CN109213863A (en) * 2018-08-21 2019-01-15 北京航空航天大学 A kind of adaptive recommended method and system based on learning style
US20190102462A1 (en) * 2017-09-29 2019-04-04 International Business Machines Corporation Identification and evaluation white space target entity for transaction operations
CN110136008A (en) * 2019-04-15 2019-08-16 深圳壹账通智能科技有限公司 Utilize product data method for pushing, device, equipment and the storage medium of big data
CN110197280A (en) * 2019-05-20 2019-09-03 中国银行股份有限公司 A kind of knowledge mapping construction method, apparatus and system
CN110390023A (en) * 2019-07-02 2019-10-29 安徽继远软件有限公司 A kind of knowledge mapping construction method based on improvement BERT model
CN110489395A (en) * 2019-07-27 2019-11-22 西南电子技术研究所(中国电子科技集团公司第十研究所) Automatically the method for multi-source heterogeneous data knowledge is obtained

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
汪洪臣: ""基于实例模型的知识推理及其在自动阅卷***中的应用"" *
王磊: ""基于实例推理的数据检索算法的研究与设计"" *
申妍;魏小鹏;王建维;: "基于本体的产品知识表示方法研究" *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11464313B2 (en) 2020-05-09 2022-10-11 Sz Zuvi Technology Co., Ltd. Apparatuses and methods for drying an object
CN111680093A (en) * 2020-06-05 2020-09-18 深圳市华云中盛科技股份有限公司 Intellectual property case analysis method and device, computer equipment and storage medium
CN111680093B (en) * 2020-06-05 2023-07-21 深圳市华云中盛科技股份有限公司 Intellectual property case analysis method, apparatus, computer device and storage medium

Also Published As

Publication number Publication date
CN111143448B (en) 2023-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant