CN115374222A

CN115374222A - Knowledge graph construction method and device and storage medium

Info

Publication number: CN115374222A
Application number: CN202110548067.9A
Authority: CN
Inventors: 吕笑笑; 郭宇晨; 蒋忠强; 张国宏
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Suzhou Software Technology Co Ltd
Priority date: 2021-05-19
Filing date: 2021-05-19
Publication date: 2022-11-22

Abstract

The embodiments of the present application provide a knowledge graph construction method, a knowledge graph construction device and a storage medium. The method comprises: obtaining system data to be processed; analyzing the system data to be processed by using a preset recognition model, and determining system body-text data and/or system issuing-text data; performing knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information; and performing graph construction processing on the target construction information to obtain a target knowledge graph. In this way, in the field of system formulation, target construction information (such as system relationship information, system entity information and system attribute information) can be obtained by analyzing the system data to be processed and extracting knowledge from it, and the knowledge graph is then constructed from the target construction information, so that the speed and accuracy of knowledge graph construction are improved and system information can be displayed accurately and comprehensively through the knowledge graph.

Description

Knowledge graph construction method and device and storage medium

Technical Field

The present application relates to the field of computer data processing technologies, and in particular, to a method and an apparatus for constructing a knowledge graph, and a storage medium.

Background

A system (that is, the rules and regulations of an enterprise) is the accumulation of years of management experience and wisdom of an enterprise unit, and embodies the basic ideas of the enterprise unit in the management practice of each field. At the same time, systems serve as the main guide for the direction and manner of an enterprise unit's production activities. If it is unclear on what basis a system was formulated, whether it is compliant, whether its content meets requirements, whether systems are related to one another, which systems have been abolished, which systems are in force, and which systems are relevant to a given field, then the normal production and operation activities of production and management personnel are seriously disrupted, or the cost for front-line personnel to obtain and understand systems is greatly increased, resulting in low production efficiency.

Disclosure of Invention

The present application provides a method, a device and a storage medium for constructing a knowledge graph, which can construct the knowledge graph quickly and efficiently, so that system information is displayed accurately and comprehensively and the construction efficiency of the knowledge graph is improved; they can also reduce the difficulty of system audit and system understanding and thereby improve production efficiency.

The technical scheme of the application is realized as follows:

In a first aspect, an embodiment of the present application provides a method for constructing a knowledge graph, where the method includes:

obtaining system data to be processed;

analyzing the system data to be processed by using a preset recognition model, and determining system body-text data and/or system issuing-text data;

performing knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information;

and performing graph construction processing on the target construction information to obtain a target knowledge graph.

In a second aspect, an embodiment of the present application provides a knowledge graph constructing apparatus, which includes an obtaining unit, an identifying unit, an extracting unit, and a constructing unit; wherein,

the acquisition unit is configured to acquire system data to be processed;

the identification unit is configured to analyze the system data to be processed by using a preset recognition model and determine system body-text data and/or system issuing-text data;

the extraction unit is configured to perform knowledge extraction on the system body-text data and/or the system issuing-text data and determine target construction information;

the construction unit is configured to perform graph construction processing on the target construction information to obtain a target knowledge graph.

In a third aspect, an embodiment of the present application provides a computer storage medium, where a computer program is stored, and when executed by multiple processors, the computer program implements the steps of the method according to the first aspect.

The embodiments of the present application provide a method, a device and a storage medium for constructing a knowledge graph. System data to be processed is obtained; the system data to be processed is analyzed by using a preset recognition model, and system body-text data and/or system issuing-text data is determined; knowledge extraction is performed on the system body-text data and/or the system issuing-text data to determine target construction information; and graph construction processing is performed on the target construction information to obtain a target knowledge graph. In this way, body text and issuing text can be recognized in the system data to be processed through the preset recognition model without manual labeling, which reduces labor cost; target construction information (such as system relationship information, system entity information and system attribute information) can be obtained accurately through knowledge extraction, which improves the speed and accuracy of knowledge graph construction; meanwhile, the relationships among different systems can be sorted out through the knowledge graph and system-related information can be displayed accurately and comprehensively, which reduces the difficulty of system audit and system understanding and thereby improves production efficiency.

Drawings

FIG. 1 is a schematic flow chart diagram of a knowledge graph construction method provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of a knowledge graph structure provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of a logical structure of a knowledge graph building system according to an embodiment of the present application;

FIG. 4 is a schematic flow chart diagram of another knowledge graph construction method provided in the embodiments of the present application;

FIG. 5 is a schematic flow chart diagram of another knowledge graph construction method provided in the embodiments of the present application;

FIG. 6 is a schematic flow chart diagram illustrating yet another method for constructing a knowledge graph according to an embodiment of the present application;

FIG. 7 is a schematic flow chart diagram illustrating yet another method for constructing a knowledge graph according to an embodiment of the present application;

FIG. 8 is a schematic flow chart diagram illustrating a further method for constructing a knowledge graph according to an embodiment of the present application;

FIG. 9 is a schematic flow chart diagram illustrating a further method for constructing a knowledge graph according to an embodiment of the present application;

FIG. 10 is a schematic structural diagram of a knowledge graph constructing apparatus according to an embodiment of the present application;

FIG. 11 is a schematic hardware structure diagram of an electronic device according to an embodiment of the present application;

FIG. 12 is a schematic structural diagram of another electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.

A system (that is, the rules and regulations of an enterprise) is the accumulation of years of management experience and wisdom of an enterprise unit, and embodies the basic ideas of the enterprise unit in the management practice of each field. A system is to an enterprise what law is to a state: it is precisely because rules and regulations exist that an enterprise has rules to follow, which ensures that its production proceeds in an orderly manner. At the same time, systems serve as the main guide for the direction and manner of an enterprise unit's production activities. If it is unclear on what basis a system was formulated, whether it is compliant, whether its content meets requirements, whether systems are related to one another, which systems have been abolished, which systems are in force, and which systems are relevant to a given field, then the normal production and operation activities of production and management personnel are seriously disrupted, or the cost for front-line personnel to obtain and understand systems is greatly increased, thereby reducing production efficiency. Therefore, a knowledge graph that is easy to obtain and that displays system information accurately and comprehensively is very important for enterprises.

Based on this, the embodiments of the present application provide a method for constructing a knowledge graph, the basic idea of which is as follows: obtaining system data to be processed; analyzing the system data to be processed by using a preset recognition model, and determining system body-text data and/or system issuing-text data; performing knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information; and performing graph construction processing on the target construction information to obtain a target knowledge graph. In this way, body text and issuing text can be recognized in the system data to be processed through the preset recognition model without manual labeling, which reduces labor cost; target construction information (such as system relationship information, system entity information and system attribute information) can be obtained accurately through knowledge extraction, which improves the speed and accuracy of knowledge graph construction; meanwhile, the knowledge graph gives a clear structure to a previously disordered body of systems, which facilitates accurate system search, comprehensive system understanding and system compliance audit, reduces the difficulty of system audit and system understanding, and thereby improves production efficiency.

Embodiments of the present application will be described in detail below with reference to the accompanying drawings.

In an embodiment of the present application, refer to FIG. 1, which illustrates a flowchart of a method for constructing a knowledge graph provided in an embodiment of the present application. As shown in FIG. 1, the method may include:

S101: Obtain system data to be processed.

The embodiments of the present application can be applied to enterprise system management: system files from various sources are sorted out by means of natural language processing and neural network algorithms, providing an automated and intelligent knowledge graph construction process for the system field.

An enterprise often has a wide variety of systems in place to ensure that production activities proceed in an orderly manner. The more lines of business an enterprise has, the larger its scale and the longer its history, the more system files it accumulates, and the more troublesome their management becomes. On the one hand, managers need to audit system files for compliance and to analyze and organize them, so that old systems can later be revised or new systems formulated; on the other hand, production personnel need to know which systems are in force and which are relevant to their own work. Therefore, if systems cannot be managed efficiently, the normal production and operation activities of production and management personnel are seriously disrupted, or the cost for front-line personnel to obtain and understand systems is greatly increased, so that production efficiency is reduced. Based on this, the embodiments of the present application design an automatic knowledge graph construction method and device for multi-source heterogeneous data in the system field, which have important application value for accurate system search, system compliance audit and the like.

It should be noted that, in order to construct a knowledge graph in the system field, system data to be processed needs to be acquired. It should be understood that the system data to be processed is stored in the form of specific files; that is, the system data to be processed may include a plurality of different files, all subsequent processing and analysis retains this file form, and different files are not merged or recombined. In other words, the original system data, standard system data, system body-text data and system issuing-text data mentioned in the following description may each include multiple different files.

Here, system files may come in multiple types and formats depending on their age and source, and therefore need to be preprocessed to obtain the system data to be processed for analysis. Thus, in some embodiments, the obtaining system data to be processed may include:

performing data cleaning processing on original system data to obtain standard system data; wherein the data cleaning processing includes at least one of: unified rule processing, unified type processing and unified format processing;

and performing deduplication processing on the standard system data to obtain the system data to be processed.

It should be noted that multi-source heterogeneous original system data undergoes data cleaning processing and deduplication processing in sequence. Specifically:

Data cleaning processing refers to unifying, in form, the original system data provided by system files of different formats and different types, so as to obtain standard system data. Here, the data cleaning processing includes at least one of: unified rule processing, unified type processing, and unified format processing.

In the standard system data, the same system may be represented by two copies of file data. Therefore, duplicate system data needs to be selectively deleted from the standard system data through deduplication processing, so that the constructed knowledge graph is concise and clear and excessive redundant data is avoided.

It should be noted that, in the field of system formulation, deduplication needs to take the business characteristics of system files into account. Simply comparing the text content of system files is not sufficient, because it may cause systems to be lost or affect system-related audits. Therefore, the embodiments of the present application take two dimensions as the basis, namely the issue number (the document number under which a system was issued) and the system content, and compare system files pairwise to detect duplicate files.

Therefore, taking a first system file and a second system file as an example, in some embodiments, the performing deduplication processing on the standard system data to obtain the system data to be processed may include:

determining a first system file and a second system file in the standard system data;

encoding the system body text and the issue number of the first system file to obtain a first system code, and encoding the system body text and the issue number of the second system file to obtain a second system code;

calculating the encoding distance between the first system code and the second system code, and determining according to the encoding distance whether the first system file and the second system file are duplicates;

if the first system file and the second system file are duplicates, judging whether the sources of the first system file and the second system file are the same;

if the sources of the first system file and the second system file are the same, randomly deleting one of the first system file and the second system file;

if the sources of the first system file and the second system file are different, judging whether the levels of the first system file and the second system file are the same;

in a case where the levels of the first system file and the second system file are the same, randomly deleting one of the first system file and the second system file;

in a case where the levels of the first system file and the second system file are different, deleting whichever of the first system file and the second system file has the lower level.

Here, the encoding processing may use various encoding methods, such as SimHash (a locality-sensitive hashing algorithm). When SimHash encoding is used, the encoding distance is the Hamming distance. If the Hamming distance is smaller than a preset threshold, the first system file and the second system file are judged to be duplicates; if the Hamming distance is greater than or equal to the preset threshold, the first system file and the second system file are judged not to be duplicates.
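The deduplication steps above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the `SystemFile` fields, the whitespace tokenization (Chinese text would need a real word segmenter), the MD5-based token hash, the threshold value of 3 and the convention that a larger `level` means a higher level are all assumptions made for illustration.

```python
import hashlib
import random
from dataclasses import dataclass

HASH_BITS = 64
DUP_THRESHOLD = 3  # hypothetical preset threshold on the Hamming distance


@dataclass
class SystemFile:
    issue_number: str  # document number under which the system was issued
    body: str          # system body text
    source: str        # where the file came from
    level: int         # assumed convention: larger value = higher level


def simhash(tokens):
    """Unweighted 64-bit SimHash of a list of tokens."""
    counts = [0] * HASH_BITS
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for bit in range(HASH_BITS):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(HASH_BITS) if counts[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def encode(f):
    # The issue number and the body text are encoded together,
    # so both comparison dimensions influence the code.
    return simhash([f.issue_number] + f.body.split())


def is_duplicate(a, b, threshold=DUP_THRESHOLD):
    return hamming(encode(a), encode(b)) < threshold


def file_to_delete(a, b, rng=random):
    """Return the file to delete, or None if the pair is not a duplicate."""
    if not is_duplicate(a, b):
        return None
    if a.source == b.source or a.level == b.level:
        return rng.choice([a, b])          # same source, or same level
    return a if a.level < b.level else b   # different levels: drop the lower
```

For two files with identical issue number and body text but different sources and levels, `file_to_delete` returns the lower-level file; when the sources or levels match, either file may be returned.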

Through the above processing, heterogeneous system source data can be organized uniformly to obtain the system data to be processed, from which a knowledge graph is subsequently constructed.

S102: Analyze the system data to be processed by using a preset recognition model, and determine system body-text data and/or system issuing-text data.

It should be noted that, compared with other types of documents, system files have a relatively standard format. Specifically, a system file generally includes two parts: the system issuing text and the system body text. The issuing text is generally a short text that records information such as when, on what basis and for what purpose the system was formulated; the body text is generally a very long text that records the specific clauses of the system. As can be seen, the main information of a system file can be obtained from its issuing text and body text, which are also the main basis for the subsequent knowledge extraction.

In practical applications, body text and issuing text are not distinguished when system data is uploaded, and the two cannot be told apart by a simple identification method, which greatly interferes with subsequent knowledge extraction. Moreover, a specific file may be the body text of a certain system, the issuing text of a certain system, both at the same time, or neither. Detecting body text and issuing text can therefore be converted into a multi-class classification problem and solved with a text classification method. In other words, the embodiments of the present application analyze the system data to be processed by using the preset recognition model, thereby determining the system body-text data and the system issuing-text data.

It should be further noted that the preset recognition model includes a body-text classifier and an issuing-text classifier. Therefore, in some embodiments, the analyzing the system data to be processed by using a preset recognition model and determining system body-text data and/or system issuing-text data may include:

performing word segmentation processing on the system data to be processed to obtain a dictionary to be processed;

performing statistical analysis on the system data to be processed, and determining a word weight set corresponding to the dictionary to be processed;

calculating the word weight set by using the body-text classifier to obtain a first classification result; calculating the word weight set by using the issuing-text classifier to obtain a second classification result;

and determining the system body-text data and/or the system issuing-text data according to the first classification result and the second classification result.

It should be noted that the working principle of the preset recognition model includes:

firstly, performing word segmentation processing by taking each specific file in system data to be processed as a unit to obtain one or more dictionaries to be processed; here, each specific file corresponds to a dictionary to be processed;

secondly, performing statistical analysis on each specific file in the system data to be processed, and determining a word weight set corresponding to each dictionary to be processed;

then, performing classification calculation on the word weight set of each dictionary to be processed through the body-text classifier to obtain a first classification result, and performing classification calculation on the same word weight set through the issuing-text classifier to obtain a second classification result;

and finally, according to the two classification results, determining whether each specific file is a body text, an issuing text, both, or neither, and thereby determining the system body-text data and the system issuing-text data.
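The working principle above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the preset recognition model itself: whitespace splitting stands in for word segmentation, TF-IDF is assumed as the statistical analysis that produces the word weight set, and `LinearClassifier` with hand-set word scores is a hypothetical stand-in for two trained classifiers that decide "body text or not" and "issuing text or not".

```python
import math
from collections import Counter


def tokenize(text):
    # Placeholder word segmentation; Chinese text would need a real segmenter.
    return text.lower().split()


def word_weights(docs):
    """Return one {word: TF-IDF weight} dict per file (assumed weighting)."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    result = []
    for toks in tokenized:
        tf = Counter(toks)
        total = max(len(toks), 1)
        result.append({
            w: (c / total) * (math.log((1 + n) / (1 + df[w])) + 1.0)
            for w, c in tf.items()
        })
    return result


class LinearClassifier:
    """Hypothetical binary classifier: weighted sum of word weights plus bias."""

    def __init__(self, word_scores, bias=0.0):
        self.word_scores = word_scores
        self.bias = bias

    def predict(self, weights):
        score = self.bias + sum(
            self.word_scores.get(w, 0.0) * v for w, v in weights.items()
        )
        return score > 0


def combine(is_body, is_issuing):
    """Merge the first and second classification results into one label."""
    if is_body and is_issuing:
        return "both"
    if is_body:
        return "body"
    if is_issuing:
        return "issuing"
    return "neither"


def classify_files(docs, body_clf, issuing_clf):
    return [
        combine(body_clf.predict(w), issuing_clf.predict(w))
        for w in word_weights(docs)
    ]
```

Running two independent binary classifiers over the same word weight set is what allows a single file to be labeled as both kinds of text or as neither, which a single mutually exclusive classifier could not express.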

In this way, the system body-text data and the system issuing-text data can be obtained through the preset recognition model, so that knowledge extraction can subsequently be performed on each of them.

S103: Perform knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information.

It should be noted that knowledge extraction is performed on the system body-text data and/or the system issuing-text data to obtain the corresponding target construction information.

In the embodiments of the present application, the ontology structure of the knowledge graph mainly includes three parts: system entities, system relationships and system attributes. Specifically:

(1) A system entity (or simply entity) refers to the name of a system, for example the procurement process management measures of a certain company. According to the system relationship, system entities are divided into system subjects (or subjects) and system objects (or objects), which together form the core of the ontology structure of the knowledge graph;

(2) A system relationship refers to the relationship between a system subject and a system object. System relationships and system entities together form system knowledge triples, each consisting of a system subject, a system object and a system relationship. Knowledge triples are the main basis of the system knowledge graph. Illustratively, three system relationships may be defined: dependency, replacement and revision. A dependency relationship reflects the basis on which a system was formulated: if system 1 is formulated according to the requirements of system 2, system 1 depends on system 2. A replacement relationship means that when one system is issued another system is abolished at the same time, so the two have a replacement relationship. A revision relationship means that a new version of a system is formed by modifying, adding to or deleting from the content of the original system, or by changing how it is described, so the two have a revision relationship;

(3) System attributes refer to the basic information of a system and are used to describe system-related information. Illustratively, system attributes may be divided into two parts: basic system attributes and system content clauses. Basic system attributes form the attribute graph of the system knowledge graph and present system information comprehensively, including attributes such as the issue number, issuing unit, issuing time, interpretation department, field, system label and in-force/abolished status; system content clauses split the system body text into specific clauses according to rules, so that business personnel can directly consult the corresponding clauses.

Referring to FIG. 2, a schematic diagram of a knowledge graph structure provided in an embodiment of the present application is shown. As shown in FIG. 2, N1, N2, N3 and N4 represent system entities, S1 to S7 represent system attributes, and the edges connecting system entities represent system relationships; they are directed edges that indicate the direction of each relationship. For example, system entity N2 replaces system entity N1, system entity N3 revises system entity N2, and system entity N1 depends on system entity N3.

As can be seen from the above, system entities, system relationships and system attributes make up the main content of the system knowledge graph: system entities are the core, system relationships connect them to form the basic skeleton of the knowledge graph, and system attributes form the entity attribute graph. Together they yield a system knowledge graph that is rich in knowledge and clear in semantics.
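As an illustration of this ontology structure, the following Python sketch models entities, directed relationships and attributes, and rebuilds the example relationships described for FIG. 2. The class and function names are hypothetical, and the attribute value attached to N1 is a placeholder, since the source does not state which attributes S1 to S7 belong to which entity.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    DEPENDS_ON = "depends_on"
    REPLACES = "replaces"
    REVISES = "revises"


@dataclass(frozen=True)
class Triple:
    subject: str        # system subject
    relation: Relation  # directed system relationship
    obj: str            # system object


@dataclass
class SystemGraph:
    attributes: dict = field(default_factory=dict)  # entity -> {attribute: value}
    triples: set = field(default_factory=set)

    def add_entity(self, name, **attrs):
        # An entity may exist with no attributes and no relationships.
        self.attributes.setdefault(name, {}).update(attrs)

    def add_triple(self, subject, relation, obj):
        self.add_entity(subject)
        self.add_entity(obj)
        self.triples.add(Triple(subject, relation, obj))

    def outgoing(self, name):
        """Directed edges leaving an entity, as (relation, object) pairs."""
        return sorted(
            (t.relation.value, t.obj) for t in self.triples if t.subject == name
        )


def build_fig2_example():
    g = SystemGraph()
    g.add_triple("N2", Relation.REPLACES, "N1")
    g.add_triple("N3", Relation.REVISES, "N2")
    g.add_triple("N1", Relation.DEPENDS_ON, "N3")
    g.add_entity("N4")                       # entity without any relationship
    g.add_entity("N1", status="abolished")   # placeholder attribute value
    return g
```

Storing triples separately from attributes mirrors the split between the relationship skeleton and the entity attribute graph described above.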

In practical application scenarios, system entities are necessarily present in a specific knowledge graph, but system relationships and system attributes may be absent. For a specific system, in terms of system entities, the system subject is necessarily present, but the system object may be absent (i.e., the system object is empty); in terms of system relationships, one or more specific relationships may exist, or none may exist (i.e., the system relationship is empty); in terms of system attributes, it is also possible that no specific attribute value can be extracted (i.e., the system attribute is empty).

In other words, the system entity information may include only system subject information, or may include both system subject information and system object information; the system relationship information may be preset null information, or may include at least one of: a dependency relationship, a replacement relationship and a revision relationship; the system attribute information may be preset null information, or may include at least one of: issue number, issuing time, issuing unit, interpretation department, in-force/abolished status, system label, system field and system content clauses.

It should be further noted that, for system body-text data, system entity information, system relationship information and system attribute information may exist; for system issuing-text data, system entity information, system relationship information and system attribute information may likewise exist. Furthermore, the system entity information in the system issuing-text data necessarily contains the system entity information of the system body text (the body text contains only the system subject, whereas the issuing text may contain system objects in addition to the system subject); the system attribute information in the system body-text data and the system attribute information in the system issuing-text data may differ in content.

Generally, both system body-text data and system issuing-text data exist for a system; in special cases, only body-text data or only issuing-text data may exist. The specific flow of the extraction method therefore differs from case to case:

In the case that only system body-text data exists, in some embodiments, the performing knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information may include:

performing entity extraction on the system body-text data by using a preset entity extraction model to obtain system entity information;

performing attribute extraction on the system body-text data by using a preset attribute extraction model to obtain system attribute information;

and determining the system entity information and the system attribute information as the target construction information.

It should be noted that, if only system body-text data exists, knowledge extraction is performed by using the preset entity extraction model and the preset attribute extraction model to obtain system entity information and system attribute information; that is, the final target construction information includes system entity information and system attribute information.

Further, in the case that only system issuing-text data exists, in some embodiments, the performing knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information may include:

determining the system entity information and the system attribute information as the target construction information under the condition that the system entity information indicates that one system entity exists;

and under the condition that the system entity information indicates that at least two system entities exist, performing relationship extraction on the system issuing-text data by using a preset relationship extraction model to obtain system relationship information, and determining the system entity information, the system attribute information and the system relationship information as the target construction information.

It should be noted that, if only system issuing-text data exists, system entity information and system attribute information are first determined from the system issuing-text data by using the preset entity extraction model and the preset attribute extraction model, respectively.

In addition, if only one system entity can be determined, the final target construction information includes only system entity information and system attribute information; if two or more system entities can be determined, system relationship information is further determined from the system issuing-text data by using the preset relationship extraction model, and the final target construction information includes system entity information, system attribute information and system relationship information.

Further, in the case that both system body-text data and system issuing-text data exist, the performing knowledge extraction on the system body-text data and/or the system issuing-text data to determine target construction information may include:

performing attribute extraction on the system issuing-text data and the system body-text data by using a preset attribute extraction model to obtain system attribute information;

and under the condition that the system entity information indicates that at least two system entities exist, performing relationship extraction on the system issuing-text data by using a preset relationship extraction model to obtain system relationship information, and determining the system entity information, the system relationship information and the system attribute information as the target construction information.

It should be noted that, when both system body-text data and system issuing-text data exist, the system body-text data does not contribute additional system entity information or system relationship information, so the system entity information and the system relationship information can be extracted from the system issuing-text data alone; however, the system body-text data and the system issuing-text data may contribute different attribute information, so attribute extraction needs to be performed on both to obtain the system attribute information.
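The case analysis above can be summarized as a small dispatch routine. The sketch below assumes that the two kinds of data are the system body text and the system issuing text, that entities and relationships are taken from the issuing text whenever it exists, and that the three `*_model` arguments are callables standing in for the preset extraction models; the `demo_*` functions are toy regular-expression stand-ins, not real models.

```python
import re


def extract_knowledge(body_text, issuing_text,
                      entity_model, attribute_model, relation_model):
    """Build target construction information for ONE system (hypothetical helper).

    body_text and issuing_text are strings, or None when that text is missing.
    """
    if body_text is None and issuing_text is None:
        raise ValueError("at least one of body text and issuing text is required")

    # Entities come from the issuing text when it exists, else from the body text.
    entity_source = issuing_text if issuing_text is not None else body_text
    entities = entity_model(entity_source)

    # Attributes are collected from every text that exists; values found in
    # the issuing text take priority over values found in the body text.
    attributes = {}
    for text in (issuing_text, body_text):
        if text is not None:
            for key, value in attribute_model(text).items():
                attributes.setdefault(key, value)

    # Relationships are only extracted when at least two entities were found.
    relations = []
    if issuing_text is not None and len(entities) >= 2:
        relations = relation_model(issuing_text, entities)

    return {"entities": entities, "attributes": attributes, "relations": relations}


def demo_entities(text):
    # Toy stand-in: system names are written in square brackets.
    return list(dict.fromkeys(re.findall(r"\[([^\]]+)\]", text)))


def demo_attributes(text):
    m = re.search(r"No\.\s*(\d+)", text)
    return {"issue_number": m.group(1)} if m else {}


def demo_relations(text, entities):
    # Toy stand-in: "[A] replaces [B]" yields the tuple (A, "replaces", B).
    pattern = r"\[([^\]]+)\] (replaces|revises|depends on) \[([^\]]+)\]"
    return re.findall(pattern, text)
```

With only a body text, the result carries entities and attributes and an empty relationship list; with an issuing text that names two systems, one relationship tuple is added.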

Further, in some embodiments, the extracting attributes of the institutional text data and the institutional text data by using a preset attribute extraction model to obtain institutional attribute information may include:

performing primary attribute extraction on the system text data by using the preset attribute extraction model to obtain a primary attribute extraction result;

determining attribute items to be extracted according to the primary attribute extraction result;

performing secondary attribute extraction on the system text data by using the preset attribute extraction model according to the attribute items to be extracted to obtain a secondary attribute extraction result;

and obtaining the system attribute information according to the primary attribute extraction result and the secondary attribute extraction result.

For reasons of computational efficiency, attribute extraction is first performed once on one type of system text data by using the preset attribute extraction model; then, according to this primary attribute extraction result, the attribute items that have not been extracted, namely the attribute items to be extracted, are determined; finally, extraction is performed again on the other type of system text data, for the attribute items to be extracted only. In this way, the system attribute information can be determined from the primary attribute extraction result and the secondary attribute extraction result. That is, attribute extraction is preferentially performed on one type of system text data, and extraction continues on the other type only for the attribute items that could not be extracted.
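The two-pass procedure described above can be illustrated with a minimal sketch. The function names `two_pass_attributes` and `toy_extract`, and the "item: value" line format used by the toy extractor, are assumptions made for illustration only; the toy extractor merely stands in for the preset attribute extraction model.

```python
from typing import Callable, Dict, Iterable, List, Optional

# Hypothetical extractor signature: (text, attribute items) -> {item: value or None}
Extractor = Callable[[str, Iterable[str]], Dict[str, Optional[str]]]


def two_pass_attributes(primary_text: str, secondary_text: str,
                        items: List[str], extract: Extractor) -> Dict[str, Optional[str]]:
    """Extract from the preferred document first, then fill gaps from the other one."""
    # Primary attribute extraction over the preferred document.
    first = extract(primary_text, items)
    # Attribute items to be extracted: those the primary pass did not find.
    pending = [item for item in items if first.get(item) is None]
    # Secondary attribute extraction, restricted to the pending items.
    second = extract(secondary_text, pending) if pending else {}
    merged = {item: first.get(item) for item in items}
    for item in pending:
        merged[item] = second.get(item)
    return merged


def toy_extract(text: str, items: Iterable[str]) -> Dict[str, Optional[str]]:
    """Toy stand-in for the attribute model: reads lines of the form 'item: value'."""
    found: Dict[str, Optional[str]] = {}
    for item in items:
        value = None
        for line in text.splitlines():
            key, sep, rest = line.partition(":")
            if sep and key.strip() == item:
                value = rest.strip()
                break
        found[item] = value
    return found
```

In this sketch, a value found in the primary pass is never overwritten by the secondary pass, which matches the preference order described above.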

It should be noted that, since the system data to be processed may include files of different systems, knowledge extraction must be performed system by system. Assume that one type of system text data includes a text file A of system 1 and a text file D of system 3, and the other type of system text data includes a text file B of system 2 and a text file C of system 3.

Then, for system 1, system entity information and system attribute information are determined from the text file A by using the preset entity extraction model and the preset attribute extraction model, respectively;

for system 2, system entity information is determined from the text file B by using the preset entity extraction model, system attribute information is determined from the text file B by using the preset attribute extraction model, and, when at least two system entities exist, system relationship information is determined from the text file B by using the preset relationship extraction model;

for system 3, system entity information is determined from the text file C by using the preset entity extraction model, and, when at least two system entities exist, system relationship information is determined from the text file C by using the preset relationship extraction model; in addition, part of the attribute information is obtained from the text file C by using the preset attribute extraction model, and the attribute items that remain unknown are extracted from the text file D by using the preset attribute extraction model. For example, if no system object can be extracted from the text file C, the system object is determined to be empty and the system relationship information is set to preset empty information, so that extraction does not need to be repeated on the text file D; but if no attribute information is extracted from the text file C, extraction of all the attribute items continues on the text file D.

Further, in some embodiments, the method may further comprise:

determining preset sub-models; wherein the preset sub-models comprise: a word vector BERT sub-model, a bidirectional long short-term memory artificial neural network (Bi-LSTM) sub-model, a conditional random field (CRF) sub-model, a classification Softmax sub-model, a regular matching sub-model and an expert knowledge sub-model;

establishing a preset entity extraction model according to the BERT sub-model, the Bi-LSTM sub-model and the CRF sub-model;

establishing a preset relation extraction model according to the BERT sub-model and the Softmax sub-model;

and establishing a preset attribute extraction model according to the regular matching sub-model and the expert knowledge sub-model.

It should be noted that, in the embodiment of the present application, a first neural network is constructed based on a word vector BERT (Bidirectional Encoder Representations from Transformers) sub-model (also called the BERT algorithm), a bidirectional long short-term memory artificial neural network (Bi-LSTM) sub-model and a conditional random field (CRF) sub-model (also called the CRF algorithm), and then feature learning is performed on a first sample data set (the first sample data set includes a plurality of entity sample data) by using the first neural network, so as to obtain the preset entity extraction model.

It should be further noted that, in the embodiment of the present application, a second neural network is established based on the BERT sub-model, the Softmax sub-model (also referred to as the Softmax layer) and a fully connected layer, and then the second neural network is used to perform feature learning on a second sample data set (the second sample data set includes a plurality of relationship sample data), so as to obtain the preset relationship extraction model.

It should be noted that system attribute information, for example the issuing number, the issuing date, the field and the clauses, generally has a relatively standard format, so the preset attribute extraction model may be established based on a regular matching sub-model (also referred to as a regular matching algorithm) and expert knowledge sub-models (also referred to as expert knowledge bases).
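As a rough illustration of regular matching over well-formatted attributes, the following sketch applies a small table of patterns to a piece of text. The attribute names, the two patterns (an ISO-style date and a "No. 12 [2021]" style number) and the function `extract_attributes` are hypothetical; in practice the patterns would be written from expert knowledge of the actual document formats.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern table; real rules would come from domain experts.
PATTERNS = {
    "issue_date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "document_number": re.compile(r"\bNo\.\s*\d+\s*\[\d{4}\]"),
}


def extract_attributes(text: str) -> Dict[str, Optional[str]]:
    """Return the first match of each pattern, or None when nothing matches."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(0) if match else None
    return result
```

Keeping the patterns in a table separate from the matching loop means new attribute rules can be added without touching the extraction logic.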

The knowledge extraction models provided above are all specific embodiments, and are not limited strictly, and other feasible methods may also be used to extract knowledge.

Further, in some embodiments, where the institutional entity information indicates that there are at least two institutional entities, the at least two institutional entities comprise at least one institutional subject and at least one institutional object, and there is a unique institutional relationship between the at least one institutional subject and the at least one institutional object;

correspondingly, the extracting the relationship of the institutional text data by using the preset relationship extraction model to obtain institutional relationship information may include:

cutting the system text data to determine at least one sentence to be processed; each statement to be processed in the at least one statement to be processed comprises a system subject and a system object;

performing subject-object marking on the at least one statement to be processed to obtain at least one target statement;

performing feature extraction on the at least one target sentence, and determining respective semantic features, institutional subject features and institutional object features of the at least one target sentence;

determining the entity relationship of the at least one target sentence according to the semantic feature, the institutional subject feature and the institutional object feature of the at least one target sentence, and determining the entity relationship of the at least one target sentence as institutional relationship information;

the institutional subject characteristics at least comprise subject semantic characteristics and subject position characteristics, and the institutional object characteristics at least comprise object semantic characteristics and object position characteristics.

When there are at least two system entities, system relationship extraction is required. It should be understood that among the at least two institutional entities there must be at least one institutional subject and at least one institutional object, and that there is a uniquely defined institutional relationship between each pair of institutional subject and institutional object.

The extraction of the system relationship is essentially a classification problem; because system relationships are divided into three types, namely dependence, substitution and revision, it is a three-class classification problem. Specifically, system relationship information is extracted through the following steps:

First, the system text data is cut to determine at least one sentence to be processed, and subject-object marking is performed on the at least one sentence to be processed to obtain at least one target sentence.

Here, a system text may include a plurality of system entities and imply a plurality of system relationships at the same time. For a case where there are more than three system entities, the embodiment of the present application therefore cuts the system text before performing relationship extraction. The cutting is based on the positions of the system entities: complete sentences that each include two system entities are cut out in sequence and used as the input of relationship extraction. For example, if the system text includes n system entities (n being an integer greater than 1), the system text is cut into (n-1) sentences, so that each sentence to be processed includes a system subject and a system object.

In addition, in order to improve the processing efficiency, a system subject and a system object of the sentence to be processed are marked respectively to obtain the target sentence.
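The cutting and marking step can be sketched as follows. This is only one possible reading: entities are assumed to be given as non-overlapping character spans, each segment runs from one entity to the next, the earlier entity is treated as the subject, and the `<S>`/`<O>` markers and the function name `cut_and_mark` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def cut_and_mark(text: str, entities: List[Span]) -> List[str]:
    """Cut text with n entity spans into n-1 marked segments of adjacent entity pairs."""
    spans = sorted(entities)
    segments = []
    # Pair each entity with the next one, so n entities yield n-1 segments.
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        segment = (
            "<S>" + text[s1:e1] + "</S>"     # assumed subject: the earlier entity
            + text[e1:s2]                    # text between the two entities
            + "<O>" + text[s2:e2] + "</O>"   # assumed object: the later entity
        )
        segments.append(segment)
    return segments
```

For a text with three entities this yields two marked segments, consistent with the (n-1) count stated above.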

Secondly, feature extraction is performed on the target sentences to determine the semantic features, system subject features and system object features of each target sentence. Here, the semantic features refer to the overall semantic features of the target sentence, the system subject features refer to the semantic features and the position features of the system subject, and the system object features refer to the semantic features and the position features of the system object. That is, in the preset relationship extraction model, after the features of each target sentence are extracted by using the BERT algorithm, the entity vectors are also merged (semantic features and position features are merged) at a fully connected layer to enrich the entity feature information.

Thirdly, the semantic features, system subject features and system object features of each target sentence are spliced, and the spliced features are scored by using a Softmax layer to obtain three probability values corresponding to the substitution relationship, the revision relationship and the dependency relationship, respectively; the relationship corresponding to the maximum probability value is determined as the entity relationship of that target sentence.

Finally, the system relationship information is determined according to the entity relationship of each target sentence.
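The splice-and-score step can be illustrated with a small, framework-free sketch. The feature vectors, the weight matrix and the bias are placeholders supplied by the caller; in the embodiment they would come from the BERT sub-model and the trained layers, which are not reproduced here. The names `softmax`, `classify_relation` and `RELATIONS` are assumptions for illustration.

```python
import math
from typing import List, Sequence, Tuple

RELATIONS = ["substitution", "revision", "dependency"]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_relation(semantic: Sequence[float], subject: Sequence[float],
                      obj: Sequence[float], weights: Sequence[Sequence[float]],
                      bias: Sequence[float]) -> Tuple[str, List[float]]:
    """Splice the three feature vectors, score three classes, pick the most probable."""
    features = list(semantic) + list(subject) + list(obj)  # feature splicing
    # One linear score per relation class (weights has one row per class).
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return RELATIONS[best], probs
```

The returned probabilities sum to one, and the label is simply the class with the largest probability.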

Thus, through the above processing, the target construction information can be obtained, and the knowledge graph can subsequently be constructed.

S104: performing graph construction processing on the target construction information to obtain a target knowledge graph.

It should be noted that, as shown in fig. 2, after the target construction information is obtained, the target knowledge graph is obtained by performing graph construction according to the existing method.

Therefore, the embodiment of the application aims to form a set of automatic flow giving consideration to efficiency and effect in the process of establishing the knowledge graph, and provides a feasible scheme for establishing the knowledge graph in the system field. Through the knowledge graph of the system, the complex and obscure knowledge structure of the system can be effectively and comprehensively displayed, the system can be accurately understood conveniently, and the system has important application value for various business requirements (searching, auditing and the like) of the system.

In addition, aiming at the construction problem of the knowledge graph, in a technical scheme provided by the related technology, the knowledge graph aiming at journal documents is designed, and the body structure and the entity extraction model of the knowledge graph of the journal documents are defined; however, in the technical scheme, only the knowledge graph is constructed in the journal literature field by using the natural language processing related technology, which is different from the business field of the embodiment of the application, and the structure, the property, the attribute and the processing method of the journal literature and the institutional file are completely different, the knowledge graph structure of the journal literature cannot be applied to the institutional field, and the processing method for multi-source heterogeneous data is lacked.

In another technical scheme provided by the related technology, a knowledge graph for case situations is designed, and a structured text and an unstructured text are fused, so that semantic support is provided for accurate pushing of case situations. However, the technical scheme is applied to the case situation pushing field, the emphasis is on application description of a specific natural language processing technology, the application description is different from the business field of the embodiment of the application, the knowledge graph structure and the related algorithm design cannot be applied to the institutional field, and the design of automatically constructing the whole flow from end to end for the knowledge graph is lacked.

The knowledge graph construction method provided by the embodiment of the application solves the above problems, specifically in the following respects: (1) a knowledge graph ontology structure for the system domain is constructed, which gives a clear structure to originally disordered systems and facilitates accurate system search, comprehensive system understanding, system compliance audit and the like; (2) for text data in the system field, methods for system entity identification, system relationship extraction and system attribute extraction are designed by combining prior knowledge of the system field, so that the extraction results are more accurate; (3) a unified file system module is designed, which focuses on unifying multi-source heterogeneous data and is decoupled from subsequent modules such as relationship extraction and entity extraction, so that knowledge graph construction is more extensible and more automated; (4) the whole process, from multi-source data acquisition to data preprocessing, knowledge extraction and graph construction, is automated end to end, which reduces labor cost and improves graph construction efficiency.

The embodiment of the application provides a method for constructing a knowledge graph, which comprises the steps of obtaining system data to be processed; analyzing the system data to be processed by using a preset identification model, and determining system text data and/or system text data; performing knowledge extraction on the system text data and/or the system text data, and determining target construction information (such as system relationship information, system entity information, system attribute information and the like); and carrying out map construction processing on the target construction information to obtain a target knowledge map. In this way, text/text recognition can be carried out on system data to be processed through the preset recognition model, manual marking is not needed, target construction information can be accurately obtained through knowledge extraction, and the speed and the accuracy of construction of the knowledge map are improved; besides, the relation among different systems can be combed through the knowledge graph, system related information is accurately and comprehensively displayed, difficulty in system audit and system understanding is reduced, and production efficiency is improved finally.

In another embodiment of the present application, refer to fig. 3, which shows a schematic logical structure diagram of a knowledge graph building system 20 provided by the embodiment of the present application. As shown in FIG. 3, the knowledge-graph building system 20 comprises a knowledge-graph ontology module 201, a unified document module 202 and an institutional knowledge extraction module 203.

The knowledge graph constructing system 20 designs a standardized File access form for different system sources, and for example, different system files can be read in through a HyperText Transfer Protocol (HTTP) and a File Transfer Protocol (FTP).

Knowledge graph ontology module 201

As shown in fig. 3, the knowledge-graph ontology module 201 includes institutional entities, institutional relationships, and institutional attributes; wherein, the system entity can comprise a system subject and a system object; institutional relationships may include dependencies, substitutions, and revisions; the system attribute can comprise basic attribute and content item, and the basic attribute can be further subdivided into system letter number, printing unit, printing time, explanation department, belonging field, system label, line revocation status and other attributes.

The system entities, the system relationships and the system attributes together form a system domain knowledge graph ontology structure, and as shown in fig. 2, the knowledge graph can be understood as a network schematic diagram for indicating the system entities, the system relationships and the system attributes.

Unified file system module 202

As shown in fig. 3, the unified file system module 202 includes text recognition, system deduplication, unified rules, unified formats, unified types, and unified storage; specifically, the unified rule mainly refers to rule adaptation for different data sources, such as source marking and standardization of storage directories, so that the data carries company and field classification attributes; the unified type mainly converts the file type of the source data (such as doc, html and the like) into a standard file type (such as pdf), which facilitates process automation; the unified format mainly extracts the text content of the document according to the standard file type.

In this way, the unified file system module 202 realizes standardized processing and persistent storage of multi-source system data for subsequent knowledge extraction.

System knowledge extraction module 203

As shown in fig. 3, the system knowledge extraction module 203 includes entity extraction, relationship extraction and attribute extraction so as to obtain system entity information, system relationship information and system attribute information of system data to be processed, thereby being capable of generating a knowledge graph.

It should be understood that fig. 3 is only a schematic diagram of the logical structure of the knowledge graph constructing system 20, and in an actual application process, corresponding processes may be further designed according to the execution sequence to improve the work efficiency. For example, when unified rule processing is performed, source information, text number information, and the like of a file may need to be used for source marking and unified storage, so that although the extracted source information and text number logically belong to the system knowledge extraction module 203, they can be designed before unified rule processing.

It should be understood that the knowledge graph constructing system 20 may also support operations such as deletion, modification, addition, etc. in the later period, and the principle thereof may refer to the constructing principle, which is not described in detail.

The embodiment of the application provides a knowledge graph construction method. Through the detailed explanation of the foregoing embodiment, it can be seen that the embodiment of the application aims to form an automatic flow that balances efficiency and effect in the knowledge graph construction process, and provides a feasible scheme for constructing a knowledge graph in the system field. After the knowledge graph is obtained, functions such as graph visualization, system search and system audit can be further realized, so that workers and managers can better understand the systems of an enterprise.

In yet another embodiment of the present application, refer to fig. 4, which shows a schematic flow chart of another method for constructing a knowledge graph provided in the embodiment of the present application. As shown in fig. 4, the method may include:

S301: unifying the rules.

It should be noted that, unified rule processing is performed on a plurality of original system files to obtain a plurality of original system files with consistent rules. Here, the unified rule processing means processing the original system file according to a preset rule, for example, marking the source of the original system file, and storing all materials of the same system under the same folder according to the originating number and the entity name.

S302: unifying the types.

It should be noted that unified type processing is performed on a plurality of original system files with consistent rules to obtain a plurality of original system files with consistent types.

S303: unifying the formats.

It should be noted that unified format processing is performed on multiple original system files of the same type to obtain multiple original system files of the same format.

Therefore, the unified rule, the unified type and the unified format are used as three main steps of data cleaning, so that the original system files from different sources form high-quality system data (equivalent to the standard system data) with unified type and format, and a foundation is laid for the automatic construction of the knowledge graph.

S304: system deduplication.

After the standard system data is obtained, deduplication is also required. In the system field, system deduplication needs to take the business characteristics of system files into account: if duplication is determined only by comparing the content of the system text, systems of different branch companies may be deleted by mistake, causing system loss or affecting system-related audits. Specifically, referring to fig. 5, it shows a schematic flow chart of another method for constructing a knowledge graph provided in the embodiment of the present application. As shown in fig. 5, a specific process of removing duplicate system files may include:

S401: performing Simhash coding on the first system file and the second system file to obtain two system codes.

It should be noted that whether system duplication exists is determined by performing pairwise detection on different systems. For convenience of description, the two system files to be checked for duplication are referred to as a first system file and a second system file, respectively. Here, the first system file and the second system file are both system text files.

It should be noted that the basis for system deduplication can be determined by itself according to the actual usage scenario. In the embodiment of the application, whether the two systems are repeated is detected by detecting the text and the number of the system. And carrying out Simhash coding on the system text and the text number of the first system file, and carrying out Simhash coding on the system text and the text number of the second system file, thereby obtaining two system codes.

It will be appreciated that, through the aforementioned data cleansing process, all materials of the same system have been stored under the same folder under the letter/entity name, so the system text and letter number of a file of a certain system can be easily obtained at this time.

S402: calculating the Hamming distance between the two system codes.

It should be noted that the Hamming distance between the two system codes is calculated to characterize the degree of similarity of the two system codes.

S403: judging whether the Hamming distance is smaller than a preset threshold.

Here, for step S403, if the determination result is no, step S404 is performed; if the determination result is yes, step S405 is performed.

It should be noted that the Hamming distance is compared with the preset threshold to determine whether the first system file and the second system file are duplicates.

S404: determining that the first system file and the second system file are not duplicates.

It should be noted that, if the Hamming distance between the two system codes is greater than or equal to the preset threshold, it is determined that the first system file and the second system file are not duplicates, and neither needs to be deleted.
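Steps S401 and S402 can be sketched in a few lines. This is a minimal Simhash variant, assuming the text number and the text have already been merged and tokenized; the use of MD5 as the per-token hash, the 64-bit code length, frequency-based token weights and the names `simhash` and `hamming_distance` are illustrative choices, not details given in the embodiment.

```python
import hashlib
from collections import Counter
from typing import Iterable


def simhash(tokens: Iterable[str], bits: int = 64) -> int:
    """Compute a Simhash code, weighting each token by its frequency."""
    weights = [0] * bits
    for token, count in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:bits // 8], "big")
        for i in range(bits):
            # A set bit votes the position up, a cleared bit votes it down.
            if (value >> i) & 1:
                weights[i] += count
            else:
                weights[i] -= count
    code = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            code |= 1 << i
    return code


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two codes differ."""
    return bin(a ^ b).count("1")
```

Identical token sequences always produce a distance of zero; how small the distance must be for two files to count as duplicates is the preset threshold of step S403.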

S405: judging whether the first system file and the second system file are from the same source.

Here, for step S405, if the determination result is no, step S406 is performed; if the determination result is yes, step S407 is performed.

It should be noted that, if the Hamming distance between the two system codes is smaller than the preset threshold, it is determined that the first system file and the second system file are duplicates. At this time, it is necessary to determine whether the first system file and the second system file are from the same source, so as to decide which of the two files is deleted.

S406: performing level judgment on the first system file and the second system file.

Here, for step S406, if the level judgment result is the same level, step S407 is performed; if the level judgment result is an upper level and a lower level, step S408 is performed.

It should be noted that, for two duplicate systems from different sources, their levels need to be judged: which of the two should be deleted is decided by determining whether they are files of the same level or files of different levels.

S407: randomly deleting one of the first system file and the second system file.

In the case where it is determined that the first system file and the second system file are duplicates, if the two files are from the same source or are of the same level, one of them may be deleted at random.

S408: deleting the system file corresponding to the lower level.

In the case where it is determined that the first system file and the second system file are duplicates, if the two files have different levels, the system file having the lower level is selected and deleted.

That is to say, in the embodiment of the present application, the system text number and the system content are merged and Simhash-coded, and whether systems are duplicates is determined according to the Hamming distance between the codes: systems are determined to be duplicates if the Hamming distance is smaller than a certain threshold, and the threshold needs to be set experimentally according to the data. For systems confirmed as duplicates by the Simhash codes, a deduplication strategy needs to be set from the perspective of the upper-layer application.

Because system text numbers are introduced, systems that have the same text content but are executed in a plurality of branch companies (with different text numbers) are not judged to be duplicate files and deleted. For two systems from the same source that are judged to be duplicates, either one can be deleted; for two duplicate systems from different sources, the upper and lower organization levels are judged according to the marks of the source units, and the system with the higher organization level is retained (for the same level, one is deleted at random).
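The decision flow of steps S403 to S408 can be condensed into one function. The record type `InstitutionFile`, its fields, the convention that a smaller `level` number means a higher organization level, and the function name `choose_deletion` are all assumptions introduced for this sketch.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstitutionFile:
    name: str
    source: str   # marked source unit
    level: int    # hypothetical convention: smaller number = higher organization level
    code: int     # Simhash code of text number plus text


def choose_deletion(first: InstitutionFile, second: InstitutionFile,
                    threshold: int, rng=random) -> Optional[InstitutionFile]:
    """Return the file to delete, or None when the two files are not duplicates."""
    distance = bin(first.code ^ second.code).count("1")
    if distance >= threshold:
        return None                         # S404: not duplicates
    if first.source == second.source or first.level == second.level:
        return rng.choice([first, second])  # S407: delete either one at random
    # S408: different sources and different levels, delete the lower-level file.
    return first if first.level > second.level else second
```

Passing the random source as a parameter keeps the random branch reproducible in tests, for example by supplying a seeded `random.Random` instance.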

Therefore, the system data to be processed is obtained by carrying out duplicate removal processing on the standard system data and deleting some redundant data.

S305: text recognition.

S306: issued-text recognition.

Here, step S305 and step S306 may be executed in parallel, and there is no fixed order between the two steps. Specifically, after step S304, text recognition may be performed first and then issued-text recognition, or the two may be performed simultaneously.

It should be noted that the system text and the issued text are the main components of system data and usually both exist in text form; if the text and the issued text cannot be distinguished, subsequent knowledge extraction is seriously interfered with. Because the acquired system files are not only texts and issued texts, detection must also account for files that are neither, so text and issued-text detection can be converted into a multi-class classification problem and handled by text classification methods. In addition, text classification distinguishes between long texts and short texts, and a corresponding algorithm needs to be selected according to the object to be classified.

For the multi-class problem, considering the difficulty of constructing a training set, the flexibility of updating and deploying models, and model effect, the embodiment of the application adopts an integration strategy that splits text and issued-text recognition into two binary classification problems: two classifiers solve the classification of the text and of the issued text respectively, and the results of the two classifiers are then combined as the final classification output. For the long and short text problem, the embodiment of the present application uses a Support Vector Machine (SVM), which is suitable for the text field and good at classification in high-dimensional spaces, as the base classifier.

That is to say, the embodiment of the present application provides a preset recognition model for text and issued-text classification, and the preset recognition model at least includes a text classifier and an issued-text classifier. Referring to fig. 6, a schematic flow chart of yet another method for constructing a knowledge graph provided in the embodiments of the present application is shown. As shown in fig. 6, the working process of the preset recognition model includes the following steps:

S501: reading a system file to be processed.

For convenience of description, the system file to be processed denotes a specific system file, in the system data to be processed, on which text and issued-text recognition needs to be performed.

S502: performing word segmentation on the system file to be processed.

It should be noted that, word segmentation processing is performed on the system files to be processed, so that a dictionary of the system files to be processed can be obtained, and the system files to be processed can be classified according to word frequency.

S503: Perform weight calculation using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm.

It should be noted that the TF-IDF algorithm is used to perform weight calculation, so as to obtain the weights corresponding to different words.
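The weight calculation of step S503 can be illustrated with a minimal sketch. It assumes the plain variant tf = count / document length and idf = log(N / df) with no smoothing; the embodiment does not state which TF-IDF variant is used, and the function name `tf_idf` and the sample words are hypothetical.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of word lists (already word-segmented).
    Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        if not doc:  # an empty document has no weights
            weights.append({})
            continue
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights

# Hypothetical segmented files, for illustration only.
docs = [
    ["procurement", "process", "management", "method"],
    ["procurement", "notice", "issue"],
    ["meeting", "minutes"],
]
weights = tf_idf(docs)
```

Words that occur in fewer files receive a higher weight, so "process" outweighs "procurement" in the first file. The resulting weight vectors would then be the input of the two classifiers.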

S504: Classify using the text classifier.

S505: Classify using the issue-text classifier.

Here, step S504 and step S505 may be executed in parallel, and no particular order between the two steps is required. Specifically, after step S503, the text classifier may be applied first and the issue-text classifier second; or the issue-text classifier may be applied first and the text classifier second; or both classifiers may be applied at the same time.

Specifically, according to the classification results of the text classifier and the issue-text classifier, there are the following four cases: (1) if the result of the text classifier is "text" and the result of the issue-text classifier is "other", the file to be processed is a system text file; (2) if the result of the text classifier is "text" and the result of the issue-text classifier is "issue text", the file to be processed is both a system text file and an issue text file; (3) if the result of the text classifier is "other" and the result of the issue-text classifier is "issue text", the file to be processed is an issue text file; (4) if the result of the text classifier is "other" and the result of the issue-text classifier is "other", the file to be processed is neither a system text file nor an issue text file.

Thus, through the above processing, each system file in the system data to be processed can be determined to be system text, issue text, both, or neither, thereby determining the system text data and the system issue text data.
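The combination of the two binary results can be sketched as follows. This is only an illustration of the four cases; the function name `combine_results` and the returned labels are hypothetical, and the two boolean inputs stand in for the outputs of the trained classifiers.

```python
def combine_results(is_system_text: bool, is_issue_text: bool) -> str:
    """Merge the outputs of the two binary classifiers into one final label."""
    if is_system_text and is_issue_text:
        return "system text and issue text"   # case (2)
    if is_system_text:
        return "system text"                  # case (1)
    if is_issue_text:
        return "issue text"                   # case (3)
    return "other"                            # case (4)

# The two classifiers are independent, so they can be run in either order.
label = combine_results(is_system_text=True, is_issue_text=False)
```

Keeping the two classifiers separate means that either one can be retrained or redeployed without touching the other, which matches the flexibility argument made for the ensemble strategy.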

S305: Store using a file system.

It should be noted that after the above processing, the heterogeneous system files have been given uniform rules, a uniform format and a uniform type, duplicates have been removed, and each system file has been identified as system text or issue text. The system files can then be stored uniformly in a file system, which facilitates the subsequent knowledge extraction and knowledge graph construction.

It should be understood that the embodiments of the present application show only a schematic logic flow of the unified file system, and the execution order of specific steps may be adjusted in actual application. For example, the system deduplication step requires the text number and the system text. Strictly speaking, text number extraction belongs to the system knowledge extraction module, and the system text can only be obtained after text recognition has been carried out. Corresponding adjustments can therefore be made in practical application, with the related processes carried out in the adjusted execution order.

In summary, the knowledge graph is constructed on the basis of system data, and system data is one of the biggest bottlenecks in automatically constructing the knowledge graph while also playing an important role in that construction. Because enterprises have large organizations and complex system type structures, the diversity of system data is prominent: system data sources are various and system data structures differ. As a result, data adaptation has to be carried out continuously in upper-layer construction processes such as knowledge extraction, which greatly hinders automatic construction of knowledge graphs, makes the resulting knowledge graphs hard to extend, and fails to meet the application requirements of enterprise-level knowledge graphs.

In the face of multi-source heterogeneous system data, the embodiment of the application provides that the unified file system module decouples and separates data from upper-layer modules such as knowledge extraction and the like, so that the influence of the change of the data on the upper-layer modules is reduced to the maximum extent, and the stability and the robustness of the knowledge graph construction process are improved.

That is, the unified file system module mainly provides two functions: data acquisition and data unification. The data acquisition function mainly addresses the multi-source nature of system data. Faced with complex conditions such as system data sources that span units and platforms, the embodiment of the application formulates a uniform file system access standard for data acquisition, so that the system data acquisition flow is automated. The data unification function mainly addresses the heterogeneity of system data, which includes the difference between structured and unstructured data (unstructured data is mainly text data) as well as the diversity of multi-source data formats, duplicated content and missing basic information. In this way the data provides stable support for the upper-layer construction of the knowledge graph.

It should be noted that the original system file mostly includes unstructured data, and therefore the following description is given by taking the unstructured data (specifically, text data) as an example. In addition, due to the fact that the quality of the structured data is low, the structured data can be processed similarly after the unstructured data are processed, the two processing results are aligned, and the target knowledge graph is finally constructed.

The embodiment of the application provides a knowledge graph construction method, and the detailed explanation of the embodiment shows that target construction information can be obtained by analyzing and extracting knowledge of system data to be processed, so that the knowledge graph is constructed by utilizing the target construction information, the speed and the accuracy of constructing the knowledge graph are improved, and the system information can be accurately and comprehensively displayed through the knowledge graph.

In yet another embodiment of the present application, refer to fig. 7, which shows a flowchart of yet another method for constructing a knowledge graph provided in the embodiment of the present application. As shown in fig. 7, the method may include:

S601: Perform system entity extraction on the system issue text file.

It should be noted that the embodiments of the present application take the case where both a system text file and a system issue text file exist as an example for the subsequent description, so as to explain how knowledge extraction is performed.

First, for the system issue text file, a preset entity extraction model is used to extract system entities.

It should be further noted that, in the system field, a system entity differs from common named entities such as place names, person names and organization names. A system entity is a long string that contains multiple named entities and other information. For example, "a company procurement process management method (2017 edition)" includes four components with different characteristics: "a company", "procurement process", "management method" and "(2017 edition)". For convenience of distinction, the embodiments of the present application name these four components "ORG", "ARE", "TYP" and "FIX", corresponding to four types of entities: organizational entities, domain entities, type entities and suffix entities. A system entity typically comprises all four components or the first three, and the components typically occur in this order.

In addition, when at least two system entities are identified, the subject and the object among them must also be distinguished, that is, which entity is the subject and which is the object must be found from one or more sentences so that the system relationship is clear. Compared with the recognition of common entities such as organizations, this places a higher requirement on system entity recognition, because the subject and the object must be classified on top of recognizing the entities. Based on the above requirements and characteristics, system entities can be recognized with a sequence labeling method from natural language processing, but the labeling and output stages require special design.

In view of the need for subject-object classification, the embodiments of the present application add "Z" and "K" marks on top of the BIOES labeling scheme (an existing labeling scheme), where Z marks a subject and K marks an object. Based on this labeling scheme, the labeling sequence of system entities is as follows:

“O BZ-ORG IZ-ORG B-ARE I-ARE B-TYP I-TYP B-FIX I-FIX

according to a certain company purchasing process management method (2017 edition)

O IZ-ORG EZ-ORG I-ARE E-ARE I-TYP E-TYP I-FIX E-FIX

O BK-ORG IK-ORG B-ARE I-ARE B-TYP I-TYP B-FIX I-FIX

making a certain company purchasing process management method (2018 edition)

O IK-ORG EK-ORG I-ARE E-ARE I-TYP E-TYP I-FIX E-FIX”

In the above example, the system entity after "according to" is labeled as the subject, and the system entity after "making" is labeled as the object. The embodiment of the application places the subject and object marks only on the ORG component, mainly because the whole system entity can be identified through its ORG component without marking every component. Fewer marks reduce the number of classification categories, thereby improving the efficiency of model training and prediction.

In the embodiment of the application, a first neural network is established based on a BERT + Bi-LSTM + CRF model, and a preset entity recognition model is obtained after training. The BERT algorithm is used as a feature extraction model to output word embedded vectors as the input of the Bi-LSTM, and compared with vectors formed by other word embedded algorithms, the BERT algorithm mainly has the advantage of performing word sense disambiguation according to complex context semantics; moreover, the Bi-LSTM bidirectional semantic model ensures that the semantic information of sentences is obtained to the maximum extent, and the relation between texts and labels is predicted; in addition, the CRF is used as a decoding layer to further convert the probability output of the Bi-LSTM and identify the transfer characteristics between the labels.

Referring to fig. 8, a schematic flow chart of yet another method for constructing a knowledge graph provided in the embodiments of the present application is shown. As shown in fig. 8, for input data, first, a feature vector thereof is extracted using a BERT algorithm; secondly, calculating the feature vector by a Backward LSTM (Backward LSTM) and a Forward LSTM (Forward LSTM) respectively to obtain an Output result (namely Bi-LSTM Output); and thirdly, outputting the result through CRF to obtain Entity Recognition (NER) output, namely the result of the completion of the labeling.

It should be noted that, in the input stage, the BERT algorithm supports at most 512 tokens per input, and longer input is truncated, which easily causes loss of entity and relationship information. To solve this problem, the embodiment of the present application performs the following processing on the system issue text content:

(1) Head and tail information is removed from the system issue text content according to the text format rule. For example, the head information is generally the form of address for the recipients of the issue text, and the tail information is wording such as "please follow and implement" or "hereby issued". The text between the head and the tail is extracted as input;

(2) The text between the head and the tail is split by punctuation marks, and meaningless sentences, such as background, functional introduction and other sentences with obviously descriptive characteristics, are deleted according to rules.

After these two processing steps, the system issue text is kept within the maximum input length of the BERT algorithm as far as possible, which reduces information loss.
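The two processing steps above can be sketched as follows. This is a minimal illustration, not the embodiment's actual rule set: it assumes that the head is a first line ending with a colon, and the tail markers and descriptive keywords (`TAIL_MARKERS`, `DROP_KEYWORDS`) are hypothetical placeholders for the text format rules.

```python
import re

# Hypothetical rule tables; real rules would come from the text format conventions.
TAIL_MARKERS = ("please follow and implement", "hereby issued")
DROP_KEYWORDS = ("background", "introduction")

def preprocess(issue_text: str) -> str:
    lines = [ln.strip() for ln in issue_text.splitlines() if ln.strip()]
    # Step (1): drop the head, assumed to be a salutation line ending with a colon.
    if lines and lines[0].endswith((":", "：")):
        lines = lines[1:]
    body = " ".join(lines)
    # Step (2): split by sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"[.;!?。；！？]", body) if s.strip()]
    kept = []
    for sentence in sentences:
        low = sentence.lower()
        if any(marker in low for marker in TAIL_MARKERS):
            continue  # tail wording
        if any(keyword in low for keyword in DROP_KEYWORDS):
            continue  # descriptive sentence
        kept.append(sentence)
    return ". ".join(kept)

sample = (
    "All departments:\n"
    "According to method A (2017 edition), method B (2018 edition) is formulated. "
    "The background of this work is long. "
    "It is hereby issued, please follow and implement."
)
cleaned = preprocess(sample)
```

The caller can then compare the length of `cleaned` with the model's input limit before encoding.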

In addition, in the output link, a rule combination is added based on the prediction output of the CRF, and the system entity name with the subject-object mark is directly output. Rule combinations are divided into two levels:

(1) Combination of entities such as ORG and ARE

Taking ORG as an example, the model outputs a category for each character. A run of characters that begins with BK-ORG or BZ-ORG, continues with one or more IK-ORG or IZ-ORG, and ends with EK-ORG or EZ-ORG is combined into one ORG entity, such as "a company". In the same way, ARE and TYP entities such as "procurement process" and "management method" can also be identified.

(2) Institutional entity combinations

A system entity is the sequential combination of all four component entities (ORG, ARE, TYP and FIX) or of the first three. Starting from an ORG mark, the following ARE, TYP and FIX entities are combined in order until some other entity appears. The ORG mark distinguishes BZ from BK: BZ denotes a subject and BK denotes an object. If no such sequential combination of entities occurs, the sentence is considered to contain no system entity.
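The two levels of rule combination can be sketched as below. The sketch assumes word-level tokens and one B...E run per component for readability (the embodiment labels individual characters), and the function names `merge_components` and `merge_entities` are hypothetical.

```python
import re

LABEL_RE = re.compile(r"^([BIE])([ZK]?)-(ORG|ARE|TYP|FIX)$")
ROLES = {"Z": "subject", "K": "object"}
ORDER = ["ORG", "ARE", "TYP", "FIX"]

def merge_components(tokens, labels):
    """Level 1: merge each B...E run into a (type, role, text) component."""
    components, current = [], None
    for token, label in zip(tokens, labels):
        match = LABEL_RE.match(label)
        if match is None:          # "O" or an unknown label closes any open run
            current = None
            continue
        pos, role, typ = match.groups()
        if pos == "B":
            current = {"type": typ, "role": role, "parts": [token]}
        elif current is not None and current["type"] == typ:
            current["parts"].append(token)
            if pos == "E":
                components.append((typ, current["role"], " ".join(current["parts"])))
                current = None
        else:
            current = None         # inconsistent run, discard it
    return components

def merge_entities(components):
    """Level 2: starting at ORG, append ARE, TYP, FIX in order."""
    entities, i = [], 0
    while i < len(components):
        typ, role, text = components[i]
        if typ != "ORG":
            i += 1
            continue
        parts, j = [text], i + 1
        for expected in ORDER[1:]:
            if j < len(components) and components[j][0] == expected:
                parts.append(components[j][2])
                j += 1
            else:
                break
        if len(parts) >= 3:        # at least ORG + ARE + TYP
            entities.append((ROLES.get(role, "unknown"), " ".join(parts)))
        i = j
    return entities

tokens = ["according", "to", "Company", "X", "procurement", "process",
          "management", "method", "(2017", "edition)", "making",
          "Company", "X", "procurement", "process",
          "management", "method", "(2018", "edition)"]
labels = ["O", "O", "BZ-ORG", "EZ-ORG", "B-ARE", "E-ARE", "B-TYP", "E-TYP",
          "B-FIX", "E-FIX", "O", "BK-ORG", "EK-ORG", "B-ARE", "E-ARE",
          "B-TYP", "E-TYP", "B-FIX", "E-FIX"]
entities = merge_entities(merge_components(tokens, labels))
```

Because the subject or object role is carried only by the ORG labels, the second level simply copies that role onto the combined system entity.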

S602: Determine the number of extracted entities.

Here, for step S602, if the number of extracted entities is 1, step S606 is performed; if the number of extracted entities is 2, step S604 is performed; if the number of extracted entities is greater than 2, step S603 is performed.

It should be noted that if only one system entity is extracted, no system relationship exists in reality, and system attributes can be directly extracted; if two or more system entities are extracted, a preset relationship extraction model is needed to analyze system text files, determine system relationships among different entities, and simultaneously extract system attributes.
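The branching of step S602 can be written down as a small dispatch function. This is a sketch that assumes three or more entities trigger the segmentation step; the handling of zero entities is not described in the embodiment and is an assumption here, as is the function name `next_step`.

```python
def next_step(entity_count: int) -> str:
    """Return the step that follows S602 for a given number of system entities."""
    if entity_count <= 0:
        return "none"   # assumption: nothing to extract when no entity is found
    if entity_count == 1:
        return "S606"   # no relationship exists, extract attributes directly
    if entity_count == 2:
        return "S604"   # one entity pair, go straight to relationship extraction
    return "S603"       # several entities, segment the text first

steps = [next_step(n) for n in range(5)]
```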

S603: Segment the system issue text file.

It should be noted that a system issue text file may contain a plurality of system entities and therefore imply a plurality of system relationships. For this reason, the embodiment of the application segments the text before performing relationship extraction. The segmentation is based on entity positions: complete sentences that each contain two system entities are cut out in sequence and used as the input of relationship extraction. For example, if the system issue text file contains n system entities, the text content is segmented into (n-1) sentences for input.
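A minimal sketch of position-based segmentation is given below. It assumes that segment i runs from the end of entity i-1 to the start of entity i+2, so that each segment contains exactly two adjacent entities; the exact boundary rule is not specified in the embodiment, and the function name `split_by_entities` is hypothetical.

```python
def split_by_entities(text, spans):
    """spans: (start, end) character offsets of the n entities.
    Returns n-1 segments, each containing two adjacent entities."""
    spans = sorted(spans)
    segments = []
    for i in range(len(spans) - 1):
        left = spans[i - 1][1] if i > 0 else 0
        right = spans[i + 2][0] if i + 2 < len(spans) else len(text)
        segments.append(text[left:right].strip())
    return segments

text = "According to A1, B2 is formulated and C3 is repealed."
spans = [(text.index(name), text.index(name) + len(name)) for name in ("A1", "B2", "C3")]
segments = split_by_entities(text, spans)
```

With three entities the text yields two segments, one for the pair (A1, B2) and one for the pair (B2, C3).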

S604: Extract system relationships from the system issue text file.

It should be noted that system relationship extraction is performed on the system issue text file using the preset relationship extraction model to determine the system relationships between different entities. In the embodiment of the application, the preset relationship extraction model is built on the BERT algorithm. To improve the classification effect, the embodiment of the application marks the positions of the system entities in the input, so that feature vectors are combined at the fully connected layer (semantic features together with position features) to add entity feature information.

Referring to fig. 9, a schematic flow chart of yet another method of knowledge graph construction is shown. As shown in fig. 9, in order to enable the BERT model to locate the positions of the two entities, the embodiment of the present application adds "[CLS]" at the beginning of each sentence, adds the special character "$" before and after the system entity that is the subject, and adds the special character "#" before and after the system entity that is the object.
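The marker insertion can be sketched as plain string processing. The sketch assumes character offsets for the two entities and spans that neither overlap nor touch; the function name `mark_entities` is hypothetical. In practice a BERT tokenizer usually adds "[CLS]" itself, so the explicit prefix here is only for illustration.

```python
def mark_entities(sentence, subject_span, object_span):
    """Wrap the subject in "$" and the object in "#", then prefix "[CLS]"."""
    inserts = [
        (subject_span[0], "$ "), (subject_span[1], " $"),
        (object_span[0], "# "), (object_span[1], " #"),
    ]
    # Insert from right to left so that earlier offsets stay valid.
    for pos, marker in sorted(inserts, key=lambda item: item[0], reverse=True):
        sentence = sentence[:pos] + marker + sentence[pos:]
    return "[CLS] " + sentence

sentence = "According to A1, B2 is formulated."
subject = (sentence.index("A1"), sentence.index("A1") + 2)
obj = (sentence.index("B2"), sentence.index("B2") + 2)
marked = mark_entities(sentence, subject, obj)
```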

As shown in fig. 9, in the embodiment of the present application, a BERT algorithm is used to extract three parts of features for relationship classification:

(1) The final hidden state vector of the sentence (H0 in fig. 9, corresponding to the semantic features described above), which carries the semantic features of the sentence. It serves as the first part of the input to the final fully connected layer: the first encoding vector output by the BERT algorithm is fed into an activation layer and then into a fully connected layer.

(2) The system subject hidden state vector (such as Hi and Hj in fig. 9, corresponding to the subject features described above). This part of the features includes not only the semantic features of the subject but also its position features. That is, entity feature information is added by merging the semantic features of the subject with other features (e.g., position features). The system subject hidden state vector is obtained by averaging the subject vectors output by BERT and then feeding the average into the activation layer and a fully connected layer.

(3) The system object hidden state vector (such as Hk and Hm in fig. 9, corresponding to the object features described above), which includes not only the semantic features of the object but also its position features.

The three feature vectors are fed into their respective fully connected layers for feature dimension compression, spliced together, and fed into another fully connected layer that compresses them to the classification dimension. Finally, a Softmax layer outputs the relationship classification probabilities, and the relationship category with the maximum probability is taken as the relationship between the system subject and the system object.
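The feature combination can be illustrated with a toy forward pass in plain Python. Everything numeric here is assumed: random vectors stand in for the BERT hidden states, the weights are untrained, the layer sizes are arbitrary, and tanh is chosen as the activation because the embodiment does not name one.

```python
import math
import random

random.seed(0)
HIDDEN, COMPRESSED, NUM_RELATIONS = 8, 4, 3  # toy sizes, all assumed

def init_layer(n_in, n_out):
    weights = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(vec, layer):
    weights, bias = layer
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weights, bias)]

def activate(vec):
    return [math.tanh(x) for x in vec]

def mean(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

fc_sentence = init_layer(HIDDEN, COMPRESSED)
fc_subject = init_layer(HIDDEN, COMPRESSED)
fc_object = init_layer(HIDDEN, COMPRESSED)
fc_out = init_layer(3 * COMPRESSED, NUM_RELATIONS)

def classify(hidden, subject_range, object_range):
    # Sentence feature: first hidden vector -> activation -> fully connected.
    h0 = dense(activate(hidden[0]), fc_sentence)
    # Entity features: average over the entity's positions -> activation -> fully connected.
    subj = dense(activate(mean(hidden[subject_range[0]:subject_range[1]])), fc_subject)
    obj = dense(activate(mean(hidden[object_range[0]:object_range[1]])), fc_object)
    # Splice the three compressed vectors and classify.
    return softmax(dense(h0 + subj + obj, fc_out))

hidden = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(10)]
probs = classify(hidden, (2, 4), (6, 9))
predicted = max(range(NUM_RELATIONS), key=probs.__getitem__)
```

The index `predicted` plays the role of the relationship category with the maximum probability.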

S605: Perform relationship reasoning according to the extracted system relationships to obtain part of the system attributes.

As described above, when a system text file and a system issue text file coexist, part of the attribute information is first extracted from the system issue text file, and the still unknown attribute information is then extracted from the system text file. Therefore, for the system issue text file, relationship reasoning can be performed using part of the information obtained during system relationship extraction to obtain part of the system attributes; the remaining system attributes are obtained in step S606.

S606: Extract system attributes from the system text file.

Note that the aforementioned processing of the system issue text file determines part of the system attributes. At this point, some attribute entries may still be unknown. Therefore, system attribute extraction must be performed again, on the system text file, for those unknown attribute entries. That is, the final system attribute information comes partly from the system issue text file and partly from the system text file.

Specifically, the embodiment of the present application defines 7 system attributes, such as the text number and the printing unit, together with system clauses, and designs a different extraction method for each attribute according to its characteristics, taking the efficiency of automatic knowledge graph construction into account. For the specific extraction methods, the embodiment of the application combines expert prior knowledge of the field to design a characteristic pattern for each system attribute and performs system attribute extraction based on these patterns. The specific patterns are as follows:

(1) For the 4 system attributes with obvious structural, positional or contextual characteristics, namely the text number, the printing time, the printing unit and the interpretation department, the embodiment of the application searches by regular expression matching. For example, a text number such as "album [ 2016 ] 131" typically appears at the beginning of an issue text file, after the title and before the main content, and has obvious structural and positional features.

(2) For the revocation status attribute, the embodiment of the application infers the revoked status directly from the replacement relationship obtained in relationship extraction; otherwise, the status is taken to be in effect.

(3) For system labels, the embodiment of the application segments the ARE entities extracted in entity recognition into words, ranks those words by the TF-IDF weights calculated during text recognition, and takes the top 3 as the system label attribute.

(4) Aiming at system field attributes, expert knowledge is introduced to uniformly divide the fields into 20 fields such as finance, purchasing and the like, and because the interpretation department of the system carries the field attributes, the embodiment of the application carries out many-to-one field mapping through the attributes of the interpretation department, and the system field attribute extraction efficiency is greatly improved.

(5) The system text clauses have obvious separators, and the system clause content can be extracted according to the chapters, the clause marks and the line feed symbols.
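Several of the characteristic patterns above can be sketched as small rule functions. All concrete rules here are assumptions made for illustration: the regular expression for the text number (a word, a bracketed four-digit year and a serial number), the relation name "replaces" with the replaced system as object, and the two entries of the department-to-field table.

```python
import re

# Assumed text number shape: word, bracketed 4-digit year, serial number.
TEXT_NUMBER_RE = re.compile(r"(\S+)\s*[\[〔]\s*(\d{4})\s*[\]〕]\s*(\d+)")

def extract_text_number(issue_text):
    match = TEXT_NUMBER_RE.search(issue_text)
    return match.group(0) if match else None

def infer_status(entity, relations):
    """relations: (subject, relation, object) triples from relationship extraction."""
    replaced = any(rel == "replaces" and obj == entity for _, rel, obj in relations)
    return "revoked" if replaced else "in effect"

def top_labels(are_words, tfidf_weights, k=3):
    """Rank the words of the ARE entities by TF-IDF weight and keep the top k."""
    ranked = sorted(sorted(set(are_words)),
                    key=lambda word: tfidf_weights.get(word, 0.0), reverse=True)
    return ranked[:k]

# Hypothetical many-to-one mapping from interpretation department to field.
DEPARTMENT_TO_FIELD = {
    "Finance Department": "finance",
    "Accounting Department": "finance",
    "Procurement Department": "procurement",
}

def map_field(department):
    return DEPARTMENT_TO_FIELD.get(department)

number = extract_text_number("Notice\nalbum [ 2016 ] 131\nAll departments:")
relations = [("Method (2018 edition)", "replaces", "Method (2017 edition)")]
status_old = infer_status("Method (2017 edition)", relations)
status_new = infer_status("Method (2018 edition)", relations)
labels = top_labels(["procurement", "process", "flow", "procurement"],
                    {"procurement": 0.1, "process": 0.5, "flow": 0.3})
```

Rule-based extraction of this kind needs no training data, which fits the stated goal of keeping automatic construction efficient.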

It should be noted that the above only describes the knowledge extraction flow for the case where both system text and issue text exist. If the system text or the issue text is missing, only the processing corresponding to the existing file needs to be performed, for which reference can be made to the above description.

Thus, the system entities, the system relations and the system attributes form the main content of the system knowledge graph, the system entities are used as the core, the system relations are used as the basic framework for connecting and forming the knowledge graph, and the system attributes are combined to form the system knowledge graph with rich knowledge and clear semantics.

To sum up, the embodiment of the present application provides an automatic construction method of a knowledge graph in the system field, which includes at least the following contents: (1) For the system field, an ontology structure of the system knowledge graph comprising system relations, system entities and system attributes is defined, forming the top-level framework for automatic construction of knowledge graphs in the system field. (2) A knowledge extraction module tailored to the system field is provided. A labeling scheme for system subjects and objects and a sequence combination scheme for model results are designed, and system entity recognition is performed based on BERT-Bi-LSTM-CRF to address the low recognition rate of long-string entities; system entity marks are used, and system relation extraction is performed by adding explicit feature dimensions through the splicing of BERT encoding vector features; a system attribute extraction method is designed based on the characteristic patterns of system text and issue text content. (3) A unified file system module is provided, which focuses on unifying multi-source heterogeneous data and decouples data processing from knowledge extraction, so that automatic construction of the knowledge graph is more extensible and robust. A method for recognizing system text and issue text is designed, which reduces manual labeling during knowledge graph construction and makes automatic construction more feasible. (4) System file deduplication based on Simhash fingerprint encoding is designed in combination with the business patterns of the system field, which reduces data redundancy.

In short, the embodiment of the present application provides a method for constructing a knowledge graph. From the detailed explanation of the foregoing embodiments, it can be seen that the embodiment of the present application provides an ontology structure for system field knowledge graphs, which serves as a top-level guiding framework for their automatic construction. Moreover, the embodiment of the application provides an independent unified file system module, which separates data processing from knowledge extraction, facilitates focused processing of multi-source heterogeneous data, improves data quality, shields knowledge extraction from data changes, facilitates data expansion, and lays an automated foundation for building system field knowledge graphs. Finally, the embodiment of the application targets the system field and, based on algorithms such as BERT, Bi-LSTM and CRF, designs knowledge extraction modules for system entity recognition, system relation extraction and system attribute extraction according to the special structure of system data, so that accurate elements are provided for the construction of system field knowledge graphs.

In yet another embodiment of the present application, refer to fig. 10, which shows a schematic structural diagram of a knowledge graph constructing apparatus 70 provided in the embodiment of the present application. As shown in fig. 10, the knowledge-graph constructing apparatus 70 includes an acquiring unit 701, a recognizing unit 702, an extracting unit 703 and a constructing unit 704, wherein,

an acquisition unit 701 configured to acquire system data to be processed;

the identification unit 702 is configured to analyze the system data to be processed by using a preset recognition model, and determine system text data and/or system issue text data;

an extraction unit 703 configured to perform knowledge extraction on the system text data and/or the system issue text data to determine target construction information;

the constructing unit 704 is configured to perform map constructing processing on the target constructing information to obtain a target knowledge map.

In some embodiments, the obtaining unit 701 is specifically configured to perform data cleaning processing on the original system data to obtain standard system data; wherein the data cleansing process includes at least one of: unified rule processing, unified type processing and unified format processing; and carrying out duplicate removal treatment on the standard system data to obtain the system data to be treated.

In some embodiments, the preset recognition model comprises a text classifier and an issue-text classifier; the identification unit 702 is specifically configured to perform word segmentation processing on the system data to be processed to obtain a dictionary to be processed; perform statistical analysis on the system data to be processed, and determine a word weight set corresponding to the dictionary to be processed; calculate the word weight set by using the text classifier to obtain a first classification result; calculate the word weight set by using the issue-text classifier to obtain a second classification result; and determine the system text data and/or the system issue text data according to the first classification result and the second classification result.

In some embodiments, the extraction unit 703 is specifically configured to, in the case that only system text data exists, perform entity extraction on the system text data by using a preset entity extraction model to obtain system entity information; perform attribute extraction on the system text data by using a preset attribute extraction model to obtain system attribute information; and determine the system entity information and the system attribute information as the target construction information.

In some embodiments, the extraction unit 703 is specifically configured to, in the case that only system issue text data exists, perform entity extraction on the system issue text data by using a preset entity extraction model to obtain system entity information; perform attribute extraction on the system issue text data by using a preset attribute extraction model to obtain system attribute information; determine the system entity information and the system attribute information as the target construction information under the condition that the system entity information indicates that one system entity exists; and under the condition that the system entity information indicates that at least two system entities exist, perform relationship extraction on the system issue text data by using a preset relationship extraction model to obtain system relationship information, and determine the system entity information, the system attribute information and the system relationship information as the target construction information.

In some embodiments, the extraction unit 703 is specifically configured to, in the case that both system text data and system issue text data exist, perform entity extraction on the system issue text data by using a preset entity extraction model to obtain system entity information; perform attribute extraction on the system text data and the system issue text data by using a preset attribute extraction model to obtain system attribute information; determine the system entity information and the system attribute information as the target construction information under the condition that the system entity information indicates that one system entity exists; and under the condition that the system entity information indicates that at least two system entities exist, perform relationship extraction on the system issue text data by using a preset relationship extraction model to obtain system relationship information, and determine the system entity information, the system relationship information and the system attribute information as the target construction information.

In some embodiments, the extraction unit 703 is further configured to perform primary attribute extraction on the system issue text data by using the preset attribute extraction model to obtain a primary attribute extraction result; determine attribute items to be extracted according to the primary attribute extraction result; perform secondary attribute extraction on the system text data by using the preset attribute extraction model according to the attribute items to be extracted to obtain a secondary attribute extraction result; and obtain the system attribute information according to the primary attribute extraction result and the secondary attribute extraction result.

In some embodiments, where the system entity information indicates that at least two system entities exist, the at least two system entities comprise at least one system subject and at least one system object, and a unique system relationship exists between the system subject and the system object; correspondingly, the extraction unit 703 is specifically configured to, under the condition that the system entity information indicates that at least two system entities exist, segment the system issue text data and determine at least one statement to be processed, where each statement to be processed comprises a system subject and a system object; perform subject-object marking on the at least one statement to be processed to obtain at least one target sentence; perform feature extraction on the at least one target sentence, and determine the semantic features, system subject features and system object features of each target sentence; determine the entity relationship of the at least one target sentence according to its semantic features, system subject features and system object features, and determine the entity relationship of the at least one target sentence as the system relationship information; the system subject features comprise at least subject semantic features and subject position features, and the system object features comprise at least object semantic features and object position features.

It is understood that in this embodiment, a "unit" may be part of a circuit, part of a processor, part of a program or software, and so on; it may be modular or non-modular. Moreover, the components in this embodiment may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional module.

Based on this understanding, the technical solution of the present embodiment, in essence, or the part thereof that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

Accordingly, the present embodiment provides a computer storage medium storing a computer program which, when executed by at least one processor, implements the steps of the method of any of the preceding embodiments.

Based on the composition of the knowledge graph constructing apparatus 70 and the computer storage medium described above, refer to FIG. 11, which shows a schematic diagram of a hardware structure of an electronic device 80 provided in an embodiment of the present application. As shown in FIG. 11, the electronic device 80 may include a communication interface 801, a memory 802, and a processor 803; the components are coupled together by a bus device 804. It is understood that the bus device 804 is used to enable communication among these components. In addition to a data bus, the bus device 804 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as bus device 804 in FIG. 11. The communication interface 801 is used for receiving and sending signals in the process of exchanging information with other external network elements;

a memory 802 for storing a computer program capable of running on the processor 803;

a processor 803 for executing, when running the computer program, the following:

obtaining system data to be processed;

It will be appreciated that the memory 802 in the embodiments of the present application can be either volatile memory or non-volatile memory, or can include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which serves as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), Synchlink Dynamic Random Access Memory (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 802 of the apparatus and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.

The processor 803 may be an integrated circuit chip having signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 803 or by instructions in the form of software. The processor 803 may be a general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or a register. The storage medium is located in the memory 802; the processor 803 reads the information in the memory 802 and completes the steps of the above method in combination with its hardware.

It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or a combination thereof. For a hardware implementation, the processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general-purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.

For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.

Optionally, as another embodiment, the processor 803 is further configured to perform the steps of the method of any one of the previous embodiments when running the computer program.

Based on the composition of the knowledge graph constructing apparatus 70 described above, refer to FIG. 12, which shows a schematic structural diagram of another electronic device 80 provided in an embodiment of the present application. As shown in FIG. 12, the electronic device 80 includes at least the knowledge graph constructing apparatus 70 described in any of the previous embodiments.

For the electronic device 80, the system data to be processed can be recognized by the preset recognition model without manual labeling, and the target construction information can be accurately obtained through knowledge extraction, which improves the speed and accuracy of knowledge graph construction. In addition, the relationships among different systems can be sorted out through the knowledge graph, and system-related information can be displayed accurately and comprehensively, which reduces the difficulty of system audit and system understanding and ultimately improves production efficiency.
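As an illustration only, the step of turning target construction information (entity, attribute and relationship information) into a graph can be sketched in Python as follows. The in-memory dictionary layout, the function name `build_graph`, and the rule of discarding relationships whose endpoints are not known entities are assumptions made for this sketch; the application does not prescribe a storage format.

```python
from typing import Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (subject entity, relationship, object entity)


def build_graph(entities: Iterable[str],
                attributes: Dict[str, Dict[str, str]],
                relations: Iterable[Triple]) -> Dict[str, dict]:
    # One node per entity, carrying its attributes and outgoing edges.
    graph = {
        name: {"attributes": dict(attributes.get(name, {})), "edges": []}
        for name in entities
    }
    for subject, relation, obj in relations:
        # Assumption: skip relationships that mention an unknown entity.
        if subject in graph and obj in graph:
            graph[subject]["edges"].append((relation, obj))
    return graph


if __name__ == "__main__":
    demo = build_graph(
        entities=["Safety Policy", "Fire Code"],
        attributes={"Safety Policy": {"issuer": "HR"}},
        relations=[("Safety Policy", "supersedes", "Fire Code")],
    )
    for name, node in demo.items():
        print(name, node)
```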

The above description is only a preferred embodiment of the present application, and is not intended to limit the scope of the present application.

It should be noted that, in the present application, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional like elements in a process, method, article, or apparatus comprising the element.

The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.

The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to arrive at new method embodiments.

Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.

The features disclosed in the several method or apparatus embodiments provided herein may be combined in any combination to arrive at a new method or apparatus embodiment without conflict.

The above description covers only specific embodiments of the present application, but the scope of protection of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall be covered by the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.

Claims

1. A method of knowledge graph construction, the method comprising:

obtaining system data to be processed;

2. The method for constructing a knowledge graph according to claim 1, wherein the step of acquiring institutional data to be processed comprises the following steps:

3. The knowledge graph construction method according to claim 1, wherein the preset recognition model comprises a text sending classifier and a text classifier; the analyzing the system data to be processed by utilizing a preset identification model to determine system text data and/or system text data comprises the following steps:

4. The knowledge graph construction method according to claim 1, wherein in the case where only institutional text data exists, the extracting knowledge from the institutional text data and/or institutional text data to determine target construction information includes:

5. The knowledge graph construction method according to claim 1, wherein in the case where only institutional text data exists, performing knowledge extraction on the institutional text data and/or institutional text data to determine target construction information comprises:

and under the condition that the system entity information indicates that at least two system entities exist, performing relationship extraction on the system text data by using a preset relationship extraction model to obtain system relationship information, and determining the system entity information, the system attribute information and the system relationship information as the target construction information.

6. The knowledge graph construction method according to claim 1, wherein in the case of institutional text data and institutional text data, performing knowledge extraction on the institutional text data and/or institutional text data to determine target construction information comprises:

and under the condition that the system entity information indicates that at least two system entities exist, performing relationship extraction on the system text data by using a preset relationship extraction model to obtain system relationship information, and determining the system entity information, the system relationship information and the system attribute information as the target construction information.

7. The knowledge graph construction method according to claim 6, wherein the extracting attributes of the institutional text data and the institutional text data by using a preset attribute extraction model to obtain institutional attribute information comprises:

8. The knowledge graph construction method according to claim 5 or 6, wherein, in the case that the institutional entity information indicates that at least two institutional entities exist, the at least two institutional entities comprise at least one institutional subject and at least one institutional object, and a unique institutional relationship exists between the institutional subject and the institutional object;

correspondingly, the relationship extraction is performed on the system text data by using a preset relationship extraction model to obtain system relationship information, and the method comprises the following steps:

cutting the system text data to determine at least one sentence to be processed; wherein each sentence to be processed in the at least one sentence to be processed comprises a system subject and a system object;

determining the respective entity relationship of the at least one target sentence according to the respective semantic feature, institutional subject feature and institutional object feature of the at least one target sentence, and determining the respective entity relationship of the at least one target sentence as the institutional relationship information;

9. A knowledge graph construction device, characterized by comprising an acquisition unit, an identification unit, an extraction unit and a construction unit; wherein,

the acquisition unit is configured to acquire system data to be processed;

10. A computer storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 8.