CN117592562B - Knowledge base automatic construction method based on natural language processing - Google Patents

Knowledge base automatic construction method based on natural language processing

Info

Publication number
CN117592562B
Authority
CN
China
Prior art keywords
semantic
data set
neighbor
process knowledge
classification data
Prior art date
Legal status
Active
Application number
CN202410072571.XA
Other languages
Chinese (zh)
Other versions
CN117592562A (en)
Inventor
屠静
赵策
王亚
张玥
雷媛媛
孙岩
潘亮亮
刘岩
Current Assignee
Zhuo Shi Future Tianjin Technology Co ltd
Original Assignee
Zhuo Shi Future Tianjin Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuo Shi Future Tianjin Technology Co ltd
Priority to CN202410072571.XA
Publication of CN117592562A
Application granted
Publication of CN117592562B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of data processing and provides a method for automatically constructing a knowledge base based on natural language processing. The method comprises: acquiring a process knowledge classification data set; constructing a semantic salient contrast coefficient from the semantic features of each element in the process knowledge classification data set; obtaining a semantic salient contrast sequence from the semantic salient contrast coefficients; calculating a semantic salient neighbor coefficient from each element in the process knowledge classification data set and its corresponding semantic salient contrast sequence; obtaining a semantic neighbor analysis sample set from the semantic salient neighbor coefficients; obtaining a shared neighbor sample set from the semantic neighbor analysis sample sets; obtaining a semantic neighbor similarity distance from the shared neighbor sample set; and obtaining a clustering result of the process knowledge classification data set with a hierarchical clustering algorithm based on the semantic neighbor similarity distance, then constructing a process knowledge base from the clustering result. By clustering the data with the semantic neighbor similarity distance, the invention improves the accuracy of process knowledge base construction.

Description

Knowledge base automatic construction method based on natural language processing
Technical Field
The invention relates to the technical field of data processing, in particular to an automatic knowledge base construction method based on natural language processing.
Background
With the development of the industrial Internet, industrial manufacturing is undergoing a digital transformation, which accelerates the emergence of a new industrial ecosystem characterized by intelligent production, personalized customization, and collaborative production. Automated or intelligent process design has long been a goal of computer-aided process design. To make process design intelligent, process design software must be given the ability to understand design problems, generate design results, and learn design knowledge. The acquisition of process knowledge, that is, the construction of a process knowledge base, has always been a key problem.
Existing methods for constructing a process knowledge base can express only concept and instance information, and the information they express is limited. For example, incomplete ontology descriptions leave little information at the instance layer, which reduces the efficiency of knowledge base construction and can even impair the reasoning capability of the knowledge base. Constructing a process knowledge base involves semantic expression, cluster analysis, and process template construction. Missing ontology descriptions in the semantic expression affect the accuracy of the final knowledge base structure, and because cluster analysis is performed on the semantic expression, it directly affects the accuracy of the resulting knowledge base. For example, classification results obtained with a traditional hierarchical clustering algorithm suffer from sensitivity to outliers, over-clustering, and similar problems. As a result, traditional clustering algorithms classify process knowledge poorly, and the accuracy and efficiency of process knowledge base construction are low.
Disclosure of Invention
The invention provides a method for automatically constructing a knowledge base based on natural language processing, which aims to solve the problem of low accuracy and efficiency in knowledge base construction, and adopts the following technical scheme:
One embodiment of the invention provides a method for automatically constructing a knowledge base based on natural language processing, which comprises the following steps:
acquiring a process knowledge data set;
acquiring a process knowledge classification data set from the process knowledge data set; calculating the semantic salient contrast coefficient of each word in each element from the semantic feature relationships among the different words in each element of the process knowledge classification data set; obtaining the semantic salient contrast sequence of each element from the semantic salient contrast coefficients of the words in each element of the process knowledge classification data set; obtaining the semantic salient neighbor coefficient of each element from each element of the process knowledge classification data set and its corresponding semantic salient contrast sequence; obtaining the semantic neighbor analysis sample set of each element from the semantic salient neighbor coefficient of each element of the process knowledge classification data set;
obtaining a shared neighbor sample set between different elements of the process knowledge classification data set from the semantic neighbor analysis sample set of each element; obtaining the semantic neighbor similarity distance between different elements of the process knowledge classification data set from the shared neighbor sample set between those elements; obtaining a clustering result of the process knowledge classification data set with a hierarchical clustering algorithm based on the semantic neighbor similarity distance;
and constructing a process knowledge base from the clustering result of the process knowledge classification data set.
Preferably, the method for acquiring the process knowledge classification data set according to the process knowledge data set comprises the following steps:
a dictionary matching algorithm is used to obtain the word segmentation result of each element of the process knowledge data set; with the word segmentation results as input, an autoencoder based on a bidirectional long short-term memory (BiLSTM) network is used to obtain the vector representation of every word in each element of the process knowledge data set; the vector formed from the vector representations of all words in an element is taken as the semantic vector of that element, and the set of the semantic vectors of all elements is taken as the process knowledge classification data set.
Preferably, the method for calculating the semantic salient contrast coefficient of each word in each element from the semantic feature relationships among the different words in each element of the process knowledge classification data set comprises the following steps:
[Formula not preserved in this text.]
where $C_i$ denotes the semantic salient contrast coefficient of the $i$-th word; $x_i$ and $x_j$ denote the vector representations of the $i$-th and $j$-th words, respectively; $\cos(x_i, x_j)$ denotes the cosine similarity between $x_i$ and $x_j$; $x_j^{T}$ denotes the transpose of $x_j$; $n$ denotes the number of words in the element containing the $i$-th word; and $\varepsilon$ denotes the adjustment parameter.
Preferably, the method for obtaining the semantic salient contrast sequence of each element from the semantic salient contrast coefficients of the words in each element of the process knowledge classification data set comprises the following steps:
the semantic salient contrast coefficients of all words in each element of the process knowledge classification data set are arranged in the order in which the corresponding words appear in the element, and the resulting sequence is taken as the semantic salient contrast sequence of that element.
Preferably, the method for obtaining the semantic salient neighbor coefficient of each element from each element of the process knowledge classification data set and its corresponding semantic salient contrast sequence comprises the following steps:
[Formula not preserved in this text.]
where $N_a$ denotes the semantic salient neighbor coefficient of the $a$-th element of the process knowledge classification data set; $X_a$ and $X_b$ denote the semantic vectors of the $a$-th and $b$-th elements of the process knowledge classification data set, respectively; $\cos(X_a, X_b)$ denotes the cosine similarity between $X_a$ and $X_b$; $S_a$ and $S_b$ denote the semantic salient contrast sequences of the $a$-th and $b$-th elements, respectively; $\mathrm{DTW}(S_a, S_b)$ denotes the dynamic time warping (DTW) distance between $S_a$ and $S_b$; and $M$ denotes the number of elements in the process knowledge classification data set.
Preferably, the method for obtaining the semantic neighbor analysis sample set of each element from the semantic salient neighbor coefficient of each element of the process knowledge classification data set comprises the following steps:
for each element of the process knowledge classification data set, the difference between its semantic salient neighbor coefficient and that of each other element is taken as the semantic salient neighbor feature value of that other element; the sequence obtained by sorting all semantic salient neighbor feature values in ascending order is taken as the semantic neighbor analysis sequence of the element; and the set of elements of the process knowledge classification data set corresponding to the first preset number of values in the semantic neighbor analysis sequence is taken as the semantic neighbor analysis sample set of the element.
Preferably, the method for obtaining the shared neighbor sample set between different elements of the process knowledge classification data set from the semantic neighbor analysis sample set of each element comprises the following steps:
for any two elements of the process knowledge classification data set, the set of all elements in the intersection of their semantic neighbor analysis sample sets is taken as the shared neighbor sample set of the two elements.
Preferably, the method for obtaining the semantic neighbor similarity distance between different elements from the shared neighbor sample set between the different elements of the process knowledge classification data set comprises the following steps:
[Formula not preserved in this text.]
where $D_{a,b}$ denotes the semantic neighbor similarity distance between the $a$-th and $b$-th elements of the process knowledge classification data set; $A_a$ and $A_b$ denote the semantic neighbor analysis sample sets of the $a$-th and $b$-th elements, respectively; $J(A_a, A_b)$ denotes the Jaccard coefficient between $A_a$ and $A_b$; $N_a$ and $N_b$ denote the semantic salient neighbor coefficients of the $a$-th and $b$-th elements, respectively; $N_k$ denotes the semantic salient neighbor coefficient of the $k$-th sample in the shared neighbor sample set of the $a$-th and $b$-th elements; $m$ denotes the number of samples in that shared neighbor sample set; and $\varepsilon$ denotes the adjustment parameter.
Preferably, the method for obtaining the clustering result of the process knowledge classification data set with the hierarchical clustering algorithm based on the semantic neighbor similarity distance comprises the following steps:
the semantic neighbor similarity distance between different elements of the process knowledge classification data set is used as the measure of similarity between samples in an agglomerative hierarchical clustering algorithm, and the clustering result of the process knowledge classification data set is obtained with the agglomerative hierarchical clustering algorithm.
Preferably, the method for constructing the process knowledge base from the clustering result of the process knowledge classification data set comprises the following steps:
all elements corresponding to each cluster in the clustering result of the process knowledge classification data set are taken as a process classification sample set, a standard expression of the process method or operation method represented by each process classification sample set is drawn up, and the final process knowledge base is formed from all process classification sample sets.
The beneficial effects of the invention are as follows: a semantic encoding result of the process knowledge is obtained through natural language processing, and the process knowledge classification data set is obtained from that result. A semantic salient contrast coefficient is constructed from an analysis of the semantic features of each element in the process knowledge classification data set, and a semantic salient neighbor coefficient is calculated from the semantic salient contrast coefficient. A semantic neighbor analysis sample set is obtained from the semantic salient neighbor coefficients, a shared neighbor sample set is obtained from the semantic neighbor analysis sample sets of different classification samples in the process knowledge classification data set, and the semantic neighbor similarity distance between different classification samples is constructed from the shared neighbor sample set. Finally, the clustering result of the process knowledge classification data set is obtained with the agglomerative hierarchical clustering (AHC) algorithm based on the semantic neighbor similarity distance.
Drawings
To illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of an automatic knowledge base construction method based on natural language processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a process knowledge base construction flow according to an embodiment of the invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
FIG. 1 shows a flowchart of a method for automatically constructing a knowledge base based on natural language processing according to an embodiment of the invention. The method includes the following steps:
step S001, obtaining a process knowledge classification data set.
The process knowledge base comprises contents such as process names and process sentences. A process name is a word or phrase that briefly indicates the operation content of a process. A process sentence describes the operation content of the process and covers the specifications followed, drawings, operation objects, required parts, parameters, operations, and the like. The process names, process sentences, and similar contents are acquired from process specification files, and the set of acquired contents is taken as the process knowledge data set.
Because process names and process sentences are unstructured data described in natural language and are not directly computable, the process knowledge data set is taken as input and a dictionary matching algorithm is used to obtain the word segmentation result of each element in the process knowledge data set; the dictionary matching algorithm is a known technique and is not described further. With the word segmentation results of all elements as input, an autoencoder based on a bidirectional long short-term memory (BiLSTM) network is used to obtain the vector representation of every word in each element, and the vector formed from the vector representations of all words in an element is taken as the semantic vector of that element. The optimization algorithm is stochastic gradient descent (SGD) and the loss function is the cross-entropy loss; the training of the BiLSTM-based autoencoder is a known technique and is not described further. The set of the semantic vectors of all elements in the process knowledge data set is taken as the process knowledge classification data set.
Thus, a process knowledge classification dataset is obtained.
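The word segmentation step can be illustrated with a minimal sketch. The text only names "a dictionary matching algorithm"; forward maximum matching is assumed here as one common variant, and the function name, the vocabulary, and the sample string are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text by greedily taking the longest dictionary word at each position.

    Assumption: forward maximum matching; the source does not name the variant.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink; a single character
        # is always accepted so that unknown text still gets segmented.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                pos += size
                break
    return tokens


# Hypothetical process vocabulary and unsegmented input, for illustration only.
PROCESS_DICT = {"drill", "drillhole", "hole", "tap", "thread"}
tokens = forward_max_match("drillholetapthread", PROCESS_DICT)
print(tokens)
```

The longest match wins at each position, so "drillhole" is preferred over "drill" followed by "hole". The resulting tokens would then be fed to the BiLSTM-based autoencoder, which is not sketched here.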
Step S002, a semantic salient contrast coefficient is constructed according to the semantic features of each element in the process knowledge classification data set, a semantic salient contrast sequence is obtained according to the semantic salient contrast coefficient, and a semantic salient neighbor coefficient is obtained according to each element in the process knowledge classification data set and the corresponding semantic salient contrast sequence.
Each element in the process knowledge classification data set is the vector representation of a process name or a process sentence. Cluster analysis of the process knowledge classification data set yields classification results for different process knowledge, and the process knowledge base is constructed from those results. However, because descriptions of process knowledge are often similar, computing similarity directly from the vector representations of different elements produces poor clustering results, which lowers the accuracy of process knowledge base construction.
Further, the semantic vector of each element in the process knowledge classification data set is composed of the vector representations of all words in that element. For example, if the $a$-th element of the process knowledge classification data set contains $n$ words, the semantic vector of the $a$-th element is $X_a = (x_1, x_2, \ldots, x_n)$, where $x_i$ denotes the vector representation of the $i$-th word. Because the correlations among the words within each element differ, analyzing the relationships among the different words in each element better reveals how strongly the words are related and captures long-range dependencies within the element.
Further, the semantic salient contrast coefficient of each word in an element is calculated from a correlation analysis of the different words in each element of the process knowledge classification data set. The specific calculation formula is as follows:
[Formula not preserved in this text.]
where $C_i$ denotes the semantic salient contrast coefficient of the $i$-th word; $x_i$ and $x_j$ denote the vector representations of the $i$-th and $j$-th words, respectively; $\cos(x_i, x_j)$ denotes the cosine similarity between $x_i$ and $x_j$; $x_j^{T}$ denotes the transpose of $x_j$; $n$ denotes the number of words in the element containing the $i$-th word; and $\varepsilon$ denotes the adjustment parameter, with an empirical value of 0.01.
If the $i$-th word of an element of the process knowledge classification data set is more strongly associated with the other words in that element, one term of the formula becomes smaller and another becomes larger, so the calculated semantic salient contrast coefficient $C_i$ of the $i$-th word is larger, indicating that the $i$-th word has greater semantic relevance to the other words in the element.
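The cosine-similarity ingredient of the coefficient can be illustrated in isolation. The sketch below is not the patent's formula, which is not reproduced in this text. The function names and the mean-over-other-words aggregate are assumptions, chosen only to show how pairwise similarity among the words of one element might be computed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_similarity_to_others(word_vectors):
    """For each word, average its cosine similarity to every other word.

    Assumption: a stand-in aggregate for illustration, not the patent's coefficient.
    """
    n = len(word_vectors)
    result = []
    for i in range(n):
        others = [
            cosine_similarity(word_vectors[i], word_vectors[j])
            for j in range(n)
            if j != i
        ]
        result.append(sum(others) / len(others) if others else 0.0)
    return result


# Hypothetical 2-dimensional word vectors for one element.
element = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
scores = mean_similarity_to_others(element)
print(scores)
```

In this toy element the first two words point in the same direction and therefore score higher than the third, which is orthogonal to both.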
Further, the semantic salient contrast sequence of the $a$-th element is obtained from the semantic salient contrast coefficients of the words in the semantic vector of the $a$-th element of the process knowledge classification data set. Specifically, the semantic salient contrast coefficients of the words in the $a$-th element are arranged in the order in which the words appear in the element, and the resulting sequence is taken as the semantic salient contrast sequence of the $a$-th element, $S_a = (C_1, C_2, \ldots, C_n)$, where $C_i$ denotes the semantic salient contrast coefficient of the $i$-th word of the $a$-th element.
Further, a semantic salient neighbor coefficient is calculated from the differences in semantic features between each element of the process knowledge classification data set and the other elements. The specific calculation formula is as follows:
[Formula not preserved in this text.]
where $N_a$ denotes the semantic salient neighbor coefficient of the $a$-th element of the process knowledge classification data set; $X_a$ and $X_b$ denote the semantic vectors of the $a$-th and $b$-th elements, respectively; $\cos(X_a, X_b)$ denotes the cosine similarity between $X_a$ and $X_b$; $S_a$ and $S_b$ denote the semantic salient contrast sequences of the $a$-th and $b$-th elements, respectively; $\mathrm{DTW}(S_a, S_b)$ denotes the dynamic time warping (DTW) distance between $S_a$ and $S_b$; and $M$ denotes the number of samples in the process knowledge classification data set.
If the semantic features of the $a$-th element of the process knowledge classification data set differ more from those of the other samples, the two corresponding terms of the formula are larger, so the calculated semantic salient neighbor coefficient $N_a$ of the $a$-th element is larger, indicating that the semantic features of the $a$-th element are more distinctive.
So far, the semantic salient neighbor coefficient of each element in the process knowledge classification data set is obtained.
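The DTW distance used to compare two semantic salient contrast sequences of different lengths can be sketched with the classic dynamic-programming recurrence. This is a minimal illustration that assumes an absolute-difference local cost and no warping window, neither of which the text specifies. The function name and the sample sequences are hypothetical.

```python
def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two numeric sequences.

    Assumptions: local cost |a - b|, unconstrained warping path.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] is the best cumulative cost of aligning seq_a[:i] with seq_b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(seq_a[i - 1] - seq_b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # repeat an item of seq_b
                cost[i][j - 1],      # repeat an item of seq_a
                cost[i - 1][j - 1],  # advance both sequences
            )
    return cost[n][m]


# Hypothetical contrast sequences for two elements with different word counts.
s_a = [1, 2, 3]
s_b = [1, 2, 2, 3]
print(dtw_distance(s_a, s_b))
```

Because DTW may align one item with several items of the other sequence, elements with different numbers of words can be compared without padding or truncation.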
Step S003, a semantic neighbor analysis sample set is obtained according to the semantic salient neighbor coefficients of each element in the process knowledge classification data set, a shared neighbor sample set is obtained according to the semantic neighbor analysis sample set, and a semantic neighbor similarity distance is calculated according to the shared neighbor sample set.
For each element of the process knowledge classification data set, the difference between its semantic salient neighbor coefficient and that of each other element is taken as the semantic salient neighbor feature value of that other element. The sequence obtained by sorting all semantic salient neighbor feature values in ascending order is taken as the semantic neighbor analysis sequence of the element, and the set of elements of the process knowledge classification data set corresponding to the first preset number of values in the semantic neighbor analysis sequence is taken as the semantic neighbor analysis sample set of the element.
Specifically, for example, the differences between the semantic salient neighbor coefficient of the $a$-th element of the process knowledge classification data set and that of every other element are calculated and taken as semantic salient neighbor feature values. The sequence obtained by sorting all these feature values in ascending order is taken as the semantic neighbor analysis sequence of the $a$-th element, and the set of elements corresponding to the first $K$ values of that sequence ($K$ is taken as 10) is taken as the semantic neighbor analysis sample set $A_a$ of the $a$-th element. A shared neighbor sample set is then obtained from the semantic neighbor analysis sample sets of different elements in the process knowledge classification data set; the shared neighbor sample set reflects the similarity between samples. Specifically, the set of samples in the intersection of the semantic neighbor analysis sample sets of two different elements is taken as the shared neighbor sample set of those two elements.
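The neighbor selection and intersection described above can be sketched as follows. The sketch assumes the "difference" is an absolute difference and that ties are broken by element index. The function names and coefficient values are hypothetical, and a neighbor count of 2 replaces the text's value of 10 to keep the example small.

```python
def neighbor_sample_set(index, coefficients, k=10):
    """Indices of the k other elements whose coefficient is closest to element `index`.

    Assumption: absolute difference, sorted ascending, ties broken by index.
    """
    diffs = [
        (abs(coefficients[index] - value), other)
        for other, value in enumerate(coefficients)
        if other != index
    ]
    diffs.sort()
    return {other for _, other in diffs[:k]}


def shared_neighbors(set_a, set_b):
    """Shared neighbor sample set: intersection of two neighbor sample sets."""
    return set_a & set_b


# Hypothetical semantic salient neighbor coefficients of six elements.
coefficients = [0.10, 0.12, 0.50, 0.11, 0.52, 0.90]
set_0 = neighbor_sample_set(0, coefficients, k=2)
set_1 = neighbor_sample_set(1, coefficients, k=2)
shared = shared_neighbors(set_0, set_1)
print(set_0, set_1, shared)
```

Elements 0 and 1 each pick element 3 as a neighbor, so it is the only member of their shared neighbor sample set.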
Further, the semantic neighbor similarity distance between different elements is calculated from the shared neighbor sample set between those elements in the process knowledge classification data set. The specific calculation formula is as follows:
[Formula not preserved in this text.]
where $D_{a,b}$ denotes the semantic neighbor similarity distance between the $a$-th and $b$-th elements of the process knowledge classification data set; $A_a$ and $A_b$ denote the semantic neighbor analysis sample sets of the $a$-th and $b$-th elements, respectively; $J(A_a, A_b)$ denotes the Jaccard coefficient between $A_a$ and $A_b$; $N_a$ and $N_b$ denote the semantic salient neighbor coefficients of the $a$-th and $b$-th elements, respectively; $N_k$ denotes the semantic salient neighbor coefficient of the $k$-th sample in the shared neighbor sample set of the $a$-th and $b$-th elements; $m$ denotes the number of samples in that shared neighbor sample set; and $\varepsilon$ denotes the adjustment parameter, with an empirical value of 0.01.
If the semantic features of two elements in the process knowledge classification data set are similar, the terms computed in the formula take smaller values, so the calculated semantic neighbor similarity distance between the two elements is smaller, and the likelihood that the two elements belong to the same class of samples is greater.
At this point, the semantic neighbor similarity distance between the different elements in the process knowledge classification data set has been obtained.
Step S004, obtaining a classification result of the process knowledge classification data set by adopting a hierarchical clustering algorithm based on the semantic neighbor similarity distance, and constructing a process knowledge base according to the classification result.
The input is the process knowledge classification data set, and the AHC (Agglomerative Hierarchical Clustering) algorithm is used to classify its elements. Specifically, each element in the process knowledge classification data set is initially regarded as an independent category, the distance between different categories is calculated using the semantic neighbor similarity distance, and the clustering result of the process knowledge classification data set is obtained by the AHC algorithm.
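The agglomerative procedure can be illustrated with a small pure-Python sketch that works on a precomputed distance matrix. The source does not state the linkage criterion or the stopping rule, so average linkage and a target number of clusters are assumptions here, as are the function name `agglomerative` and the example matrix.

```python
def agglomerative(dist, n_clusters):
    """Agglomerative hierarchical clustering on a precomputed distance matrix.

    dist: symmetric n x n list of lists of pairwise distances, for example
          semantic neighbor similarity distances.
    n_clusters: number of clusters at which merging stops (assumption).
    """
    # Start with every element as its own cluster.
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average linkage between two clusters (assumption).
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        # Merge the closest pair of clusters.
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Hypothetical distances: elements 0-2 are mutually close, as are 3 and 4.
dist = [
    [0.0, 0.1, 0.2, 0.9, 0.8],
    [0.1, 0.0, 0.15, 0.85, 0.9],
    [0.2, 0.15, 0.0, 0.95, 0.7],
    [0.9, 0.85, 0.95, 0.0, 0.05],
    [0.8, 0.9, 0.7, 0.05, 0.0],
]
result = agglomerative(dist, 2)
```

The loop recomputes every linkage on each pass, which is simple to read but quadratic per merge; a library implementation would be preferable for a data set of realistic size.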
Further, a process knowledge base is constructed according to the clustering result of the process knowledge classification data set. Specifically, all classification samples corresponding to each cluster in the clustering result are taken as a process classification sample set, a standard expression of the process method or operation method represented by each process classification sample set is drawn up, and the final process knowledge base is formed from all process classification sample sets. As shown in fig. 2, the process knowledge base is constructed as follows:
Step S1, data set acquisition: acquire the data set required for building the process knowledge base, and obtain vector representations of the data in the data set using natural language processing technology;
Step S2, cluster analysis: take the vector representations of all data in the data set as input, and use the AHC hierarchical clustering algorithm based on the semantic neighbor similarity distance to obtain the clustering result of the data set;
Step S3, knowledge base construction: according to the clustering result of the data set, draw up a standard expression for the data of each cluster, and form the process knowledge base based on the standardized expressions.
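The final construction step can be sketched as grouping the clustered samples and attaching one standard expression per cluster. The source does not say how the standard expression is drawn up, so this sketch substitutes the cluster medoid (the member with the smallest total distance to the other members) as a stand-in; the function name `build_knowledge_base`, the dictionary layout, and the example texts are hypothetical.

```python
def build_knowledge_base(texts, clusters, dist):
    """Form one knowledge base entry per cluster.

    texts: original process knowledge sentences, indexed by element.
    clusters: list of lists of element indices (a clustering result).
    dist: symmetric matrix of pairwise distances between elements.
    """
    knowledge_base = []
    for members in clusters:
        # Stand-in for the standard expression (assumption): the medoid,
        # i.e. the member closest on average to the rest of its cluster.
        medoid = min(members, key=lambda i: (sum(dist[i][j] for j in members), i))
        knowledge_base.append({
            "standard_expression": texts[medoid],
            "samples": [texts[i] for i in members],
        })
    return knowledge_base


# Hypothetical data: three paraphrases of one operation and one other step.
texts = [
    "tighten bolt to 40 Nm",
    "torque the bolt to 40 Nm",
    "bolt tightening, 40 Nm",
    "apply primer coat",
]
dist = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.15, 0.8],
    [0.2, 0.15, 0.0, 0.85],
    [0.9, 0.8, 0.85, 0.0],
]
kb = build_knowledge_base(texts, [[0, 1, 2], [3]], dist)
```

In practice the standard expression might be written or reviewed by a domain expert; the medoid only gives a reproducible default.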
In this specification, the embodiments are described in a progressive manner; identical and similar parts of the embodiments may be referred to one another, and each embodiment mainly describes its differences from the others. The above description covers only the preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements, etc. made within the principles of the present invention shall be included in the scope of protection of the present invention.

Claims (7)

1. The automatic knowledge base construction method based on natural language processing is characterized by comprising the following steps:
acquiring a process knowledge data set;
acquiring a process knowledge classification data set according to the process knowledge data set; calculating the semantic prominence contrast coefficient of each word in each element according to the semantic feature relation among different words in each element of the process knowledge classification data set; acquiring a semantic salient contrast sequence of each element according to the semantic salient contrast coefficient of the word segmentation in each element of the process knowledge classification data; acquiring a semantic salient neighbor coefficient of each element according to each element of the process knowledge classification data set and a corresponding semantic salient comparison sequence thereof; acquiring a semantic neighbor analysis sample set of each element according to the semantic prominent neighbor coefficient of each element of the process knowledge classification data set;
obtaining a shared neighbor sample set among different elements of the process knowledge classification data set according to the semantic neighbor analysis sample set of each element of the process knowledge classification data set; obtaining semantic neighbor similarity distances among different elements of a process knowledge classification data set according to a shared neighbor sample set among the different elements; obtaining a clustering result of the process knowledge classification data set by adopting a hierarchical clustering algorithm based on the semantic neighbor similarity distance;
constructing a process knowledge base according to the clustering result of the process knowledge classification data set;
the method for calculating the semantic prominence contrast coefficient of each word in each element according to the semantic feature relation among different words in each element of the process knowledge classification data set comprises the following steps:
(formula image not preserved in this text) in the formula, the quantities are: the semantic prominence contrast coefficient of a given word segment; the vector representations of two word segments, and the cosine similarity between them; the transpose of a word-segment vector; the number of word segments in the element containing the given word segment; and an adjustment parameter;
the method for acquiring the semantic salient neighbor coefficients of each element according to each element of the process knowledge classification data set and the corresponding semantic salient comparison sequence comprises the following steps:
(formula image not preserved in this text) in the formula, the quantities are: the semantic salient neighbor coefficient of a given element of the process knowledge classification data set; the semantic vectors of two elements of the process knowledge classification data set, and the cosine similarity between them; the semantic salient contrast sequences of the same two elements, and the DTW (dynamic time warping) distance between them; and the number of elements in the process knowledge classification data set;
the method for acquiring the semantic neighbor similarity distance between different elements according to the shared neighbor sample set between the different elements of the process knowledge classification data set comprises the following steps:
(formula image not preserved in this text) in the formula, the quantities are: the semantic neighbor similarity distance between the two elements of the process knowledge classification data set being compared; the semantic neighbor analysis sample sets of the two elements, and the Jaccard coefficient between those two sets; the semantic salient neighbor coefficients of the two elements; the semantic salient neighbor coefficient of each sample in the shared neighbor sample set of the two elements; the number of samples in that shared neighbor sample set; and an adjustment parameter.
2. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for acquiring the process knowledge classification data set according to the process knowledge data set is as follows:
the word segmentation result of each element of the process knowledge data set is obtained using a dictionary matching algorithm; taking the word segmentation result of each element as input, the vector representations of all word segments in each element of the process knowledge data set are obtained using an autoencoder based on a bidirectional long short-term memory (BiLSTM) network; the vector formed by the vector representations of all word segments in each element is taken as the semantic vector of that element; and the set formed by the semantic vectors of all elements is taken as the process knowledge classification data set.
3. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining the semantic salient contrast sequence of each element according to the semantic salient contrast coefficient of the word in each element of the process knowledge classification data comprises the following steps:
and taking a sequence formed by sequencing the semantic prominence and contrast coefficients corresponding to all the segmentation words in each element of the process knowledge classification data set according to the sequence of the corresponding segmentation words in each element as the semantic prominence and contrast sequence of each element.
4. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining the semantic neighbor analysis sample set of each element according to the semantic significance neighbor coefficient of each element of the process knowledge classification data set is as follows:
for each element of the process knowledge classification data set, taking the difference value of the semantic salient neighbor coefficient corresponding to each element and other elements as one semantic salient neighbor characteristic value of each other element, taking a sequence formed by sequencing all the semantic salient neighbor characteristic values according to the order from small to large as a semantic neighbor analysis sequence of each element, and taking a set formed by all elements corresponding to the data of a preset quantity before the semantic neighbor analysis sequence in the process knowledge classification data set as a semantic neighbor analysis sample set of each element.
5. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining a shared neighbor sample set between different elements of the process knowledge classification data set according to the semantic neighbor analysis sample set of each element of the process knowledge classification data set comprises the following steps:
and for any two elements of the process knowledge classification data set, taking a set consisting of all elements in an intersection set of semantic neighbor analysis sample sets corresponding to the any two elements as a shared neighbor sample set of the any two elements.
6. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining the clustering result of the process knowledge classification dataset by using hierarchical clustering algorithm based on the semantic neighbor similarity distance is as follows:
the semantic neighbor similarity distance between different elements in the process knowledge classification data set is taken as the similarity between samples in an agglomerative hierarchical clustering algorithm, and the clustering result of the process knowledge classification data set is obtained using the agglomerative hierarchical clustering algorithm.
7. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for constructing a process knowledge base according to the clustering result of the process knowledge classification dataset comprises the steps of:
and taking all elements corresponding to each cluster in the clustering result of the process knowledge classification data set as a process classification sample set, drawing up a standard expression method of a process method or an operation method represented by each process classification sample set, and forming a final process knowledge base according to all process classification sample sets.
CN202410072571.XA 2024-01-18 2024-01-18 Knowledge base automatic construction method based on natural language processing Active CN117592562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410072571.XA CN117592562B (en) 2024-01-18 2024-01-18 Knowledge base automatic construction method based on natural language processing

Publications (2)

Publication Number Publication Date
CN117592562A CN117592562A (en) 2024-02-23
CN117592562B true CN117592562B (en) 2024-04-09

Family

ID=89910256

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410072571.XA Active CN117592562B (en) 2024-01-18 2024-01-18 Knowledge base automatic construction method based on natural language processing

Country Status (1)

Country Link
CN (1) CN117592562B (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101477521A (en) * 2008-12-18 2009-07-08 四川大学 Non-standard knowledge acquisition method used for constructing mechanical product design knowledge base
CN102609512A (en) * 2012-02-07 2012-07-25 北京中机科海科技发展有限公司 System and method for heterogeneous information mining and visual analysis
CN109918491A (en) * 2019-03-12 2019-06-21 焦点科技股份有限公司 A kind of intelligent customer service question matching method of knowledge based library self study
CN110362664A (en) * 2019-05-31 2019-10-22 厦门快商通信息咨询有限公司 A kind of pair of chat robots FAQ knowledge base storage and matched method and device
CN110825877A (en) * 2019-11-12 2020-02-21 中国石油大学(华东) Semantic similarity analysis method based on text clustering
CN112199961A (en) * 2020-12-07 2021-01-08 浙江万维空间信息技术有限公司 Knowledge graph acquisition method based on deep learning
CN112529187A (en) * 2021-02-18 2021-03-19 中国科学院自动化研究所 Knowledge acquisition method fusing multi-source data semantics and features
WO2021121198A1 (en) * 2020-09-08 2021-06-24 平安科技(深圳)有限公司 Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN114064901A (en) * 2021-11-26 2022-02-18 重庆邮电大学 Book comment text classification method based on knowledge graph word meaning disambiguation
CN114911951A (en) * 2022-05-18 2022-08-16 西安理工大学 Knowledge graph construction method for man-machine cooperation assembly task
CN115221280A (en) * 2022-06-28 2022-10-21 北京无线电测量研究所 Knowledge retrieval method, system and equipment based on aerospace quality knowledge base
CN115563968A (en) * 2022-10-10 2023-01-03 北京许继电气有限公司 Water and electricity transportation and inspection knowledge natural language artificial intelligence system and method
CN116166782A (en) * 2023-02-07 2023-05-26 山东浪潮科学研究院有限公司 Intelligent question-answering method based on deep learning
CN116775874A (en) * 2023-06-21 2023-09-19 六晟信息科技(杭州)有限公司 Information intelligent classification method and system based on multiple semantic information
CN117150050A (en) * 2023-10-31 2023-12-01 卓世科技(海南)有限公司 Knowledge graph construction method and system based on large language model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11580415B2 (en) * 2019-07-09 2023-02-14 Baidu Usa Llc Hierarchical multi-task term embedding learning for synonym prediction

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
《环境科学》课程思政知识库自动构建方法研究;郭胜娟 等;《武汉工程职业技术学院学报》;20231231;第35卷(第4期);全文 *
An automatic literature knowledge graph and reasoning network modeling framework based on ontology and natural language processing;Hainan Chen 等;《Advanced Engineering Informatics》;20190703;第42卷;全文 *
基于中文自然语言理解的问答***研究;孙靖;《中国优秀硕士学位论文全文数据库 信息科技辑》;20140515(第5期);全文 *
基于知识分类体系的专利检索***;梁田;胡正银;程欣;刘春江;方曙;杨志萍;;情报理论与实践;20120430(04);全文 *
基于知识图谱词义消歧的文本聚类方法;张延星 等;《华北理工大学学报(自然科学版)》;20191031;第41卷(第4期);全文 *

Also Published As

Publication number Publication date
CN117592562A (en) 2024-02-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant