CN117592562B

CN117592562B - Knowledge base automatic construction method based on natural language processing

Info

Publication number: CN117592562B
Application number: CN202410072571.XA
Authority: CN
Inventors: 屠静; 赵策; 王亚; 张玥; 雷媛媛; 孙岩; 潘亮亮; 刘岩
Original assignee: Zhuo Shi Future Tianjin Technology Co ltd
Current assignee: Zhuo Shi Future Tianjin Technology Co ltd
Priority date: 2024-01-18
Filing date: 2024-01-18
Publication date: 2024-04-09
Anticipated expiration: 2044-01-18
Also published as: CN117592562A

Abstract

The invention relates to the technical field of data processing and provides an automatic knowledge base construction method based on natural language processing, comprising the following steps: acquiring a process knowledge classification data set; constructing a semantic salient contrast coefficient according to the semantic features of each element in the process knowledge classification data set; acquiring a semantic salient contrast sequence according to the semantic salient contrast coefficient; calculating a semantic salient neighbor coefficient according to each element in the process knowledge classification data set and its corresponding semantic salient contrast sequence; acquiring a semantic neighbor analysis sample set according to the semantic salient neighbor coefficient; obtaining a shared neighbor sample set according to the semantic neighbor analysis sample set; acquiring a semantic neighbor similarity distance according to the shared neighbor sample set; and obtaining a clustering result of the process knowledge classification data set by a hierarchical clustering algorithm based on the semantic neighbor similarity distance, and constructing a process knowledge base according to the clustering result. By clustering the data on the basis of the semantic neighbor similarity distance, the invention improves the accuracy of process knowledge base construction.

Description

Knowledge base automatic construction method based on natural language processing

Technical Field

The invention relates to the technical field of data processing, in particular to an automatic knowledge base construction method based on natural language processing.

Background

With the development of the industrial Internet, industrial manufacturing is undergoing a digital revolution, which is accelerating the formation of a new industrial ecosystem characterized by intelligent production, personalized customization, and collaborative production. Automated or intelligent process design has always been a goal of computer-aided process design. To make process design intelligent, process design software must be empowered to understand design problems, generate design results, and learn design knowledge; the acquisition of process knowledge, that is, the construction of a process knowledge base, has always been a key problem.

Existing methods for constructing a process knowledge base can express only concept and instance information, and the information they express is limited. For example, an incomplete ontology description leaves little information at the instance layer, which reduces the construction efficiency of the knowledge base and can even impair its reasoning capability. The construction of a process knowledge base comprises semantic expression, cluster analysis, and process template construction. A lack of ontology description in the semantic expression affects the accuracy of the final knowledge base structure, and because cluster analysis is carried out on the basis of the semantic expression, it directly affects the accuracy of the knowledge base that is finally constructed. For example, classification results obtained with a traditional hierarchical clustering algorithm suffer from sensitivity to outliers, over-clustering, and similar problems. As a result, traditional clustering algorithms classify process knowledge poorly, and the accuracy and efficiency of constructing the process knowledge base are low.

Disclosure of Invention

The invention provides an automatic knowledge base construction method based on natural language processing, which aims to solve the problems of low knowledge base construction precision and efficiency, and adopts the following technical scheme:

One embodiment of the invention provides an automatic knowledge base construction method based on natural language processing, which comprises the following steps:

acquiring a process knowledge data set;

acquiring a process knowledge classification data set according to the process knowledge data set; calculating the semantic salient contrast coefficient of each word in each element according to the semantic feature relations among the different words in each element of the process knowledge classification data set; acquiring the semantic salient contrast sequence of each element according to the semantic salient contrast coefficients of the words in each element of the process knowledge classification data set; acquiring the semantic salient neighbor coefficient of each element according to each element of the process knowledge classification data set and its corresponding semantic salient contrast sequence; acquiring the semantic neighbor analysis sample set of each element according to the semantic salient neighbor coefficient of each element of the process knowledge classification data set;

obtaining a shared neighbor sample set among different elements of the process knowledge classification data set according to the semantic neighbor analysis sample set of each element of the process knowledge classification data set; obtaining semantic neighbor similarity distances among different elements of a process knowledge classification data set according to a shared neighbor sample set among the different elements; obtaining a clustering result of the process knowledge classification data set by adopting a hierarchical clustering algorithm based on the semantic neighbor similarity distance;

and constructing a process knowledge base according to the clustering result of the process knowledge classification data set.

Preferably, the method for acquiring the process knowledge classification data set according to the process knowledge data set comprises the following steps:

The word segmentation result of each element of the process knowledge data set is obtained with a dictionary matching algorithm. Taking the word segmentation result of each element as input, an autoencoder based on a bidirectional long short-term memory (BiLSTM) network is used to obtain the vector representation of every word in each element of the process knowledge data set. The vector formed by the vector representations of all the words in an element is taken as the semantic vector of that element, and the set formed by the semantic vectors of all elements is taken as the process knowledge classification data set.

Preferably, the method for calculating the semantic prominence contrast coefficient of each word in each element according to the semantic feature relation between different words in each element of the process knowledge classification data set comprises the following steps:

In the formula, the quantity defined is the semantic salient contrast coefficient of the i-th word; its terms are the vector representations of the i-th and j-th words, the cosine similarity between these two vector representations, the transpose of a word vector representation, the number of words in the element containing the i-th word, and an adjustment parameter.

Preferably, the method for obtaining the semantic salient contrast sequence of each element according to the semantic salient contrast coefficient of the word segmentation in each element of the process knowledge classification data comprises the following steps:

The sequence formed by arranging the semantic salient contrast coefficients of all the words in each element of the process knowledge classification data set, in the order in which the words appear in that element, is taken as the semantic salient contrast sequence of the element.

Preferably, the method for obtaining the semantic salient neighbor coefficient of each element according to each element of the process knowledge classification data set and the corresponding semantic salient comparison sequence thereof comprises the following steps:

In the formula, the quantity defined is the semantic salient neighbor coefficient of the i-th element of the process knowledge classification data set; its terms are the semantic vectors of the i-th and j-th elements, the cosine similarity between these two semantic vectors, the semantic salient contrast sequences of the i-th and j-th elements, the DTW (dynamic time warping) distance between these two sequences, and the number of elements in the process knowledge classification data set.

Preferably, the method for obtaining the semantic neighbor analysis sample set of each element according to the semantic prominent neighbor coefficient of each element of the process knowledge classification data set comprises the following steps:

For each element of the process knowledge classification data set, the difference between the semantic salient neighbor coefficient of that element and that of each other element is taken as the semantic salient neighbor feature value of the other element; the sequence formed by sorting all semantic salient neighbor feature values in ascending order is taken as the semantic neighbor analysis sequence of the element; and the set of elements in the process knowledge classification data set corresponding to the first preset number of values in the semantic neighbor analysis sequence is taken as the semantic neighbor analysis sample set of the element.

Preferably, the method for obtaining the shared neighbor sample set between different elements of the process knowledge classification data set according to the semantic neighbor analysis sample set of each element of the process knowledge classification data set comprises the following steps:

For any two elements of the process knowledge classification data set, the set formed by all elements in the intersection of the semantic neighbor analysis sample sets of the two elements is taken as the shared neighbor sample set of the two elements.

Preferably, the method for obtaining the semantic neighbor similarity distance between different elements according to the shared neighbor sample set between the different elements of the process knowledge classification data set comprises the following steps:

In the formula, the quantity defined is the semantic neighbor similarity distance between the i-th and j-th elements of the process knowledge classification data set; its terms are the semantic neighbor analysis sample sets of the i-th and j-th elements, the Jaccard coefficient between these two sample sets, the semantic salient neighbor coefficients of the i-th and j-th elements, the semantic salient neighbor coefficient of a given sample in the shared neighbor sample set of the i-th and j-th elements, the number of samples in that shared neighbor sample set, and an adjustment parameter.

Preferably, the method for acquiring the clustering result of the process knowledge classification dataset by adopting the hierarchical clustering algorithm based on the semantic neighbor similarity distance comprises the following steps:

The semantic neighbor similarity distance between different elements in the process knowledge classification data set is used as the similarity measure between samples in an agglomerative hierarchical clustering algorithm, and the clustering result of the process knowledge classification data set is obtained with the agglomerative hierarchical clustering algorithm.

Preferably, the method for constructing the process knowledge base according to the clustering result of the process knowledge classification data set comprises the following steps:

and taking all elements corresponding to each cluster in the clustering result of the process knowledge classification data set as a process classification sample set, drawing up a standard expression method of a process method or an operation method represented by each process classification sample set, and forming a final process knowledge base according to all process classification sample sets.

The beneficial effects of the invention are as follows: a semantic encoding result of the process knowledge is obtained through natural language processing, and a process knowledge classification data set is obtained from that encoding result. A semantic salient contrast coefficient is constructed by analysing the semantic features of each element in the process knowledge classification data set, and a semantic salient neighbor coefficient is calculated from it. A semantic neighbor analysis sample set is obtained from the semantic salient neighbor coefficient, and a shared neighbor sample set is obtained from the semantic neighbor analysis sample sets of different classification samples. Semantic neighbor similarity distances between different classification samples are then constructed from the shared neighbor sample set, and the clustering result of the process knowledge classification data set is obtained by the AHC hierarchical clustering algorithm based on those distances.

Drawings

In order to illustrate the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below are only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of an automatic knowledge base construction method based on natural language processing according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a process knowledge base construction flow according to an embodiment of the invention.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to fig. 1, a flowchart of a method for automatically constructing a knowledge base based on natural language processing according to an embodiment of the invention is shown, the method includes the following steps:

step S001, obtaining a process knowledge classification data set.

The process knowledge base comprises content such as process names and process sentences. A process name is a word or phrase that briefly indicates the operation content of a process; a process sentence describes the operation content of the process, including the specifications followed, drawings, operation objects, required parts, parameters, operations, and the like. The process names, process sentences, and similar content are acquired from process specification files, and the set of acquired content is used as the process knowledge data set.

Because process names and process sentences are unstructured data described in natural language, they cannot be computed on directly. Taking the process knowledge data set as input, a dictionary matching algorithm is used to obtain the word segmentation result of each element in the process knowledge data set; the specific calculation process of the dictionary matching algorithm is a known technique and is not repeated here. Taking the word segmentation results of all elements as input, an autoencoder based on a bidirectional long short-term memory (BiLSTM) network is used to obtain the vector representation of every word in each element, and the vector formed by the vector representations of all words in an element is taken as the semantic vector of that element. The optimization algorithm is SGD (Stochastic Gradient Descent) and the loss function is the cross-entropy loss; the specific training process of the BiLSTM-based autoencoder is a known technique and is not repeated here. The set consisting of the semantic vectors of all elements in the process knowledge data set is taken as the process knowledge classification data set.
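As an illustration of the word segmentation step, the following is a minimal sketch of one common dictionary matching variant, forward maximum matching. The text does not say which variant is used, so this choice, the function name `forward_max_match`, and the toy vocabulary are assumptions made only for the example; the same logic applies to Chinese strings.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy segmentation: at each position take the longest dictionary word.

    Characters that match no dictionary entry are emitted as single-character
    tokens so that the loop always makes progress.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary of process terms (illustrative only).
vocab = {"drill", "ream", "deburr", "burr"}
print(forward_max_match("drillreamdeburr", vocab))
```

The longest-match-first rule is what makes "deburr" win over the shorter entry "burr" in this toy input.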

Thus, a process knowledge classification dataset is obtained.

Step S002, a semantic salient contrast coefficient is constructed according to the semantic features of each element in the process knowledge classification data set, a semantic salient contrast sequence is obtained according to the semantic salient contrast coefficient, and a semantic salient neighbor coefficient is obtained according to each element in the process knowledge classification data set and the corresponding semantic salient contrast sequence.

Each element in the process knowledge classification data set is the vector representation of a corresponding process name or process sentence. Classification results for the different kinds of process knowledge can be obtained by cluster analysis of the process knowledge classification data set, and the process knowledge base is constructed from those results. However, because descriptions of process knowledge tend to be similar, computing similarity directly from the vector representations of different elements yields poor clustering results, which reduces the accuracy of the constructed process knowledge base.

Further, the semantic vector of each element in the process knowledge classification data set is composed of the vector representations of all the words in that element. For example, if the i-th element of the process knowledge classification data set contains n words, the semantic vector of the i-th element is composed of n word vector representations, where the j-th component is the vector representation of the j-th word. Because the correlations between the words within each element differ, analysing the relationships between different words in an element gives a better measure of how strongly its words are related and captures the long-range dependencies within the element.

Further, the semantic salient contrast coefficient of each word in each element is calculated by analysing the correlation between different words in each element of the process knowledge classification data set. In the calculation formula, the quantity defined is the semantic salient contrast coefficient of the i-th word; its terms are the vector representations of the i-th and j-th words, the cosine similarity between these two vector representations, the transpose of a word vector representation, the number of words in the element containing the i-th word, and an adjustment parameter whose empirical value is 0.01.

If the i-th word of an element of the process knowledge classification data set is more strongly associated with the other words in that element, one term of the formula becomes smaller and another becomes larger, so the calculated semantic salient contrast coefficient of the i-th word is larger. A larger coefficient therefore indicates stronger semantic relevance between the i-th word and the other words in the element.

Further, the semantic salient contrast sequence of the i-th element is obtained from the semantic salient contrast coefficients of the words in the semantic vector of the i-th element of the process knowledge classification data set. Specifically, the sequence formed by arranging the semantic salient contrast coefficients of the words in the i-th element, in the order in which the words appear in the element, is taken as the semantic salient contrast sequence of the i-th element; its j-th entry is the semantic salient contrast coefficient of the j-th word in the i-th element.
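The construction of a per-element sequence from per-word coefficients can be sketched as follows. The coefficient formula itself is not available in this text, so the sketch does not implement it: as a clearly labeled stand-in, each word is scored by its mean cosine similarity to the other words in the element, which at least agrees with the stated behaviour that stronger association gives a larger value. The function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def contrast_sequence(word_vectors):
    """One score per word, kept in word order.

    ASSUMPTION: the score is the mean cosine similarity to the other words in
    the element, a stand-in for the patent's coefficient, not its formula.
    """
    sequence = []
    for i, x_i in enumerate(word_vectors):
        others = [
            cosine_similarity(x_i, x_j)
            for j, x_j in enumerate(word_vectors)
            if j != i
        ]
        sequence.append(sum(others) / len(others) if others else 0.0)
    return sequence


# Toy element with three 2-dimensional word vectors (illustrative values).
element = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(contrast_sequence(element))
```

Because the sequence preserves word order, two elements with different word counts produce sequences of different lengths, which is why a sequence-alignment distance is used to compare them later.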

Further, the semantic salient neighbor coefficient is calculated according to how the semantic features of each element in the process knowledge classification data set differ from those of the other elements. In the calculation formula, the quantity defined is the semantic salient neighbor coefficient of the i-th element of the process knowledge classification data set; its terms are the semantic vectors of the i-th and j-th elements, the cosine similarity between these two semantic vectors, the semantic salient contrast sequences of the i-th and j-th elements, the DTW (dynamic time warping) distance between these two sequences, and the number of elements in the process knowledge classification data set.

If the semantic features of the i-th element of the process knowledge classification data set differ more from those of the other samples, the corresponding terms of the formula are larger, so the calculated semantic salient neighbor coefficient of the i-th element is larger. A larger coefficient therefore indicates that the semantic features of the i-th element are more distinctive.
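The DTW distance used to compare two semantic salient contrast sequences can be sketched with the classic dynamic-programming recurrence. This is the textbook form with an absolute-difference local cost and no window constraint; the text does not state which DTW variant is used, so those choices are assumptions.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    cost[i][j] holds the minimal accumulated cost of aligning a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, or match.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Sequences of different lengths can still align with zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

DTW suits this step because elements contain different numbers of words, so their contrast sequences generally differ in length and cannot be compared position by position.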

So far, the semantically prominent neighbor coefficients of each element in the process knowledge classification data set are obtained.

Step S003, a semantic neighbor analysis sample set is obtained according to the semantic salient neighbor coefficients of each element in the process knowledge classification data set, a shared neighbor sample set is obtained according to the semantic neighbor analysis sample set, and a semantic neighbor similarity distance is calculated according to the shared neighbor sample set.

Specifically, for the i-th element of the process knowledge classification data set, the difference between its semantic salient neighbor coefficient and that of each other element is computed and taken as a semantic salient neighbor feature value. The sequence formed by sorting all semantic salient neighbor feature values in ascending order is taken as the semantic neighbor analysis sequence of the i-th element, and the set of elements corresponding to the first preset number of values in this sequence (the preset number is taken as 10) is taken as the semantic neighbor analysis sample set of the i-th element. A shared neighbor sample set is then obtained from the semantic neighbor analysis sample sets of different elements, and the similarity between samples is reflected through it: the set formed by the samples in the intersection of the semantic neighbor analysis sample sets of two different elements is taken as the shared neighbor sample set of those two elements.
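The neighbor selection and intersection described above can be sketched as follows. The sketch assumes the "difference" is taken as an absolute difference and that ties are broken by element index; neither detail is stated in the text. The function names and the toy coefficients are hypothetical, and `k` stands for the preset number (10 in the text, 2 here to keep the example small).

```python
def neighbor_sample_set(index, coefficients, k=10):
    """Indices of the k elements whose coefficient is closest to element `index`.

    ASSUMPTION: closeness is the absolute difference of the coefficients.
    """
    diffs = [
        (abs(coefficients[index] - value), j)
        for j, value in enumerate(coefficients)
        if j != index
    ]
    diffs.sort()  # ascending by difference, then by index
    return {j for _, j in diffs[:k]}


def shared_neighbor_set(set_a, set_b):
    """Shared neighbors are the intersection of two neighbor sample sets."""
    return set_a & set_b


# Toy semantic salient neighbor coefficients for five elements.
coeffs = [0.10, 0.12, 0.50, 0.11, 0.90]
n0 = neighbor_sample_set(0, coeffs, k=2)
n1 = neighbor_sample_set(1, coeffs, k=2)
print(n0, n1, shared_neighbor_set(n0, n1))
```

Elements 0 and 1 have close coefficients and both pick element 3 as a neighbor, so their shared neighbor set is non-empty, which is the signal the next step turns into a distance.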

Further, the semantic neighbor similarity distance between different elements is calculated according to the shared neighbor sample set between the different elements in the process knowledge classification data set. In the calculation formula, the quantity defined is the semantic neighbor similarity distance between the i-th and j-th elements of the process knowledge classification data set; its terms are the semantic neighbor analysis sample sets of the i-th and j-th elements, the Jaccard coefficient between these two sample sets, the semantic salient neighbor coefficients of the i-th and j-th elements, the semantic salient neighbor coefficient of a given sample in the shared neighbor sample set of the i-th and j-th elements, the number of samples in that shared neighbor sample set, and an adjustment parameter whose empirical value is 0.01.

If the semantic features of the i-th and j-th elements of the process knowledge classification data set are similar, the corresponding terms of the formula are smaller, so the calculated semantic neighbor similarity distance between the i-th and j-th elements is smaller. A smaller distance therefore indicates a greater likelihood that the i-th and j-th elements belong to the same class of samples.
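One ingredient of the distance, the Jaccard coefficient between two semantic neighbor analysis sample sets, is easy to show in isolation. The sketch below computes only that term, not the full distance formula, which is not available in this text; returning 0.0 for two empty sets is a convention assumed here.

```python
def jaccard_coefficient(set_a, set_b):
    """Size of the intersection divided by size of the union.

    ASSUMPTION: two empty sets give 0.0 (some conventions use 1.0 instead).
    """
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


# Neighbor sets that share one of three distinct members.
print(jaccard_coefficient({1, 3}, {0, 3}))
```

A higher Jaccard coefficient means the two elements have more neighbors in common relative to all their neighbors, which corresponds to greater similarity between the elements.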

So far, the semantic neighbor similarity distances between the different elements in the process knowledge classification data set are obtained.

Step S004, obtaining a classification result of the process knowledge classification data set by adopting a hierarchical clustering algorithm based on the semantic neighbor similarity distance, and constructing a process knowledge base according to the classification result.

The input is the process knowledge classification data set, and an AHC (Agglomerative Hierarchical Clustering) algorithm is used to obtain the classification result of the elements in the process knowledge classification data set. Specifically, each element in the process knowledge classification data set is first regarded as an independent category, the distance between different categories is calculated using the semantic neighbor similarity distance, and the clustering result of the process knowledge classification data set is obtained according to the AHC hierarchical clustering algorithm.
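The clustering step can be sketched as a small agglomerative procedure over a precomputed distance matrix. The text does not specify the linkage rule or the stopping criterion, so average linkage and a target cluster count are assumptions made for this example, as are the function name and the toy matrix (which stands in for real semantic neighbor similarity distances).

```python
def agglomerative_clustering(dist, n_clusters):
    """Merge the two closest clusters until `n_clusters` remain.

    `dist` is a symmetric matrix of pairwise distances between elements.
    ASSUMPTION: average linkage between clusters.
    """
    n_clusters = max(1, n_clusters)
    clusters = [[i] for i in range(len(dist))]  # every element starts alone

    def linkage(a, b):
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage(clusters[p], clusters[q])
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


# Toy distance matrix for four elements forming two obvious groups.
toy_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
print(agglomerative_clustering(toy_dist, 2))
```

Working from a precomputed matrix is what allows a custom distance to replace the usual Euclidean or cosine measure without changing the clustering procedure itself.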

Further, a process knowledge base is constructed according to the clustering result of the process knowledge classification data set, specifically, all classification samples corresponding to each cluster in the clustering result of the process knowledge classification data set are taken as a process classification sample set, a standard expression method of a process method or an operation method represented by each process classification sample set is drawn, and a final process knowledge base is formed according to all process classification sample sets. Specifically, as shown in fig. 2, the process knowledge base is constructed as follows:

Step S1, acquiring a data set: the data set required for building the process knowledge base is acquired, and vector representations of the data in the data set are obtained using natural language processing techniques;

Step S2, cluster analysis: taking the vector representations of all data in the data set as input, the AHC hierarchical clustering algorithm is applied, based on the semantic neighbor similarity distance, to obtain the clustering result of the data set;

Step S3, constructing the knowledge base: a standard expression method is drawn up for the data of each cluster in the clustering result, and the process knowledge base is formed on the basis of these standardized expression methods.
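The final assembly of clusters into a knowledge base can be sketched as a simple grouping step. How the standard expression is drawn up is not described in the text, so the default used here (the shortest sample in the cluster) is only a hypothetical placeholder; the function name, the dictionary layout, and the sample texts are likewise assumptions.

```python
def build_knowledge_base(clusters, texts, choose_standard=None):
    """Group the original texts by cluster and attach a standard expression.

    ASSUMPTION: by default the shortest sample stands in for the standard
    expression; a caller can pass any function that maps samples to a string.
    """
    if choose_standard is None:
        choose_standard = lambda samples: min(samples, key=len)
    knowledge_base = {}
    for cluster_id, members in enumerate(clusters):
        samples = [texts[i] for i in members]
        knowledge_base[cluster_id] = {
            "standard_expression": choose_standard(samples),
            "samples": samples,
        }
    return knowledge_base


# Hypothetical process sentences and a clustering result over their indices.
sentences = ["drill hole", "drill a hole in part", "deburr", "deburr edge"]
kb = build_knowledge_base([[0, 1], [2, 3]], sentences)
print(kb[0]["standard_expression"], "|", kb[1]["samples"])
```

Passing `choose_standard` as a parameter keeps the grouping logic separate from the choice of wording, which in practice may be made by a process engineer rather than by a rule.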

In this specification, each embodiment is described in a progressive manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. The above description is only of the preferred embodiments of the present invention and is not intended to limit the invention, but any modifications, equivalent substitutions, improvements, etc. within the principles of the present invention should be included in the scope of the present invention.

Claims

1. The automatic knowledge base construction method based on natural language processing is characterized by comprising the following steps:

acquiring a process knowledge data set;

constructing a process knowledge base according to the clustering result of the process knowledge classification data set;

the method for calculating the semantic prominence contrast coefficient of each word in each element according to the semantic feature relation among different words in each element of the process knowledge classification data set comprises the following steps:

in the formula, the quantity defined is the semantic salient contrast coefficient of the i-th word; its terms are the vector representations of the i-th and j-th words, the cosine similarity between these two vector representations, the transpose of a word vector representation, the number of words in the element containing the i-th word, and an adjustment parameter;

the method for acquiring the semantic salient neighbor coefficients of each element according to each element of the process knowledge classification data set and the corresponding semantic salient comparison sequence comprises the following steps:

in the formula, the quantity defined is the semantic salient neighbor coefficient of the i-th element of the process knowledge classification data set; its terms are the semantic vectors of the i-th and j-th elements, the cosine similarity between these two semantic vectors, the semantic salient contrast sequences of the i-th and j-th elements, the DTW (dynamic time warping) distance between these two sequences, and the number of elements in the process knowledge classification data set;

the method for acquiring the semantic neighbor similarity distance between different elements according to the shared neighbor sample set between the different elements of the process knowledge classification data set comprises the following steps:

where the terms of the formula denote, in order: the semantic neighbor similarity distance between the p-th and q-th elements of the process knowledge classification data set; the semantic neighbor analysis sample sets of the p-th and q-th elements, respectively; the Jaccard coefficient between those two sample sets; the semantic salient neighbor coefficients of the p-th and q-th elements, respectively; the semantic salient neighbor coefficient of the r-th sample in the shared neighbor sample set of the p-th and q-th elements; the number of samples in that shared neighbor sample set; and an adjustment parameter.
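
The Jaccard coefficient between two semantic neighbor analysis sample sets is the size of their intersection divided by the size of their union. A minimal sketch follows, assuming the sample sets are represented as Python sets of element identifiers; the function name `jaccard_coefficient` is hypothetical, and returning 0.0 for two empty sets is a convention chosen here, not something stated in the claim.

```python
from typing import Hashable, Iterable


def jaccard_coefficient(a: Iterable[Hashable], b: Iterable[Hashable]) -> float:
    """|A intersect B| / |A union B| for two collections treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption of this sketch: two empty sets share nothing
    return len(set_a & set_b) / len(union)
```

In the claimed distance this overlap term is combined with the semantic salient neighbor coefficients of the two elements and of their shared neighbors.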

2. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for acquiring the process knowledge classification data set according to the process knowledge data set is as follows:

3. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining the semantic salient contrast sequence of each element according to the semantic salient contrast coefficients of the words in each element of the process knowledge classification data set comprises the following steps:

4. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining the semantic neighbor analysis sample set of each element according to the semantic salient neighbor coefficient of each element of the process knowledge classification data set is as follows:

5. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining a shared neighbor sample set between different elements of the process knowledge classification data set according to the semantic neighbor analysis sample set of each element of the process knowledge classification data set comprises the following steps:

for any two elements of the process knowledge classification data set, the set consisting of all elements in the intersection of the semantic neighbor analysis sample sets corresponding to those two elements is taken as the shared neighbor sample set of those two elements.
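
The intersection step of claim 5 can be sketched as follows. This is a minimal illustration, assuming each element is identified by an integer and its semantic neighbor analysis sample set is already available as a Python set; the function name `shared_neighbor_sets` and the dictionary layout are hypothetical.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def shared_neighbor_sets(
    neighbor_sets: Dict[int, Set[int]],
) -> Dict[Tuple[int, int], Set[int]]:
    """Map every unordered pair of elements to the intersection of their neighbor sets."""
    shared = {}
    # Sorting the identifiers makes each pair appear once, as (smaller, larger).
    for p, q in combinations(sorted(neighbor_sets), 2):
        shared[(p, q)] = neighbor_sets[p] & neighbor_sets[q]
    return shared
```

A pair whose neighbor sets do not overlap maps to an empty set, so any later computation that divides by the number of shared samples needs to handle that case.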

6. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for obtaining the clustering result of the process knowledge classification data set by using a hierarchical clustering algorithm based on the semantic neighbor similarity distance is as follows:

7. The automatic knowledge base construction method based on natural language processing according to claim 1, wherein the method for constructing a process knowledge base according to the clustering result of the process knowledge classification data set comprises the following steps: