CN113569054A

CN113569054A - Knowledge graph construction method and system for multi-source Chinese financial bulletin document

Info

Publication number: CN113569054A
Application number: CN202110517049.4A
Authority: CN
Inventors: 高楠; 杜宇轩; 陈国鑫; 陈磊; 杨博威
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2021-05-12
Filing date: 2021-05-12
Publication date: 2021-10-29

Abstract

A knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following steps: structuring the hierarchical relationship of the chapters of each document and constructing a relatively complete document structure tree; labeling all title data; unifying the title length to a preset number of characters and performing character-level embedding with BERT to obtain the corresponding vector representations; dividing the processed data set into a training set and a test set and training a title classification model; classifying the document titles with the title classification model; masking the complex valid knowledge in the valid text blocks; constructing a masked semantic model, namely the M-MST model, a masked Bi-LSTM semantic model that generalizes across similar documents from multiple sources, and training it to obtain a knowledge extraction model; acquiring entity relationship triples from the knowledge extraction model in combination with an external knowledge base; and constructing a knowledge graph of the multi-source financial bulletin documents that supports incremental updating or expansion. A system implementing the knowledge graph construction method is also provided.

Description

Knowledge graph construction method and system for multi-source Chinese financial bulletin document

Technical Field

The invention relates to a knowledge extraction and knowledge graph construction method and a system, in particular to extraction of complex entities in the financial field and construction of knowledge graphs in the financial field.

The invention relates to the fields of natural language processing, knowledge graphs and deep learning, and in particular to modeling based on deep learning.

Background

Listed companies, as leaders of China's economic development, and innovative small and medium-sized private enterprises, as a supporting force for economic growth, face and will continue to face various challenges for some time. Production must not stop and development must not shift down a gear, so supplies must be secured first. The capital market is the granary from which listed companies replenish their "blood". At the beginning of the year, the new refinancing rules issued by the China Securities Regulatory Commission (CSRC) greatly relaxed the refinancing restrictions on listed companies, particularly those on the ChiNext board. At the same time, the CSRC stressed that it would continue to improve the daily supervision system for listed companies, strictly enforce issuance conditions, and strengthen risk prevention and control measures such as information disclosure requirements. This project extracts information of industrial significance from financial bulletin texts from various sources (such as shareholding increase and decrease bulletins, contract bulletins, listing bulletins, and monthly and annual performance bulletins), constructs a financial bulletin knowledge graph that can be expanded automatically and incrementally, and provides support for regulators and researchers in risk analysis and early warning, management decision-making, model research and the like.

The knowledge graph construction of the financial field bulletin document has the following problems:

(1) The financial bulletin documents contain a large amount of redundant and invalid information, which makes information extraction relatively difficult.

(2) The financial field contains a large number of entities with complex structures, so the context information of these entities is difficult to obtain and their boundaries are difficult to determine.

(3) Knowledge graphs in the financial field contain entities that appear under different names but refer to the same thing, so entity fusion is needed.

A Knowledge Graph (KG), also called a scientific knowledge graph, is a very large-scale semantic network system that describes entities or concepts in the real world and their mutual relations, built by mining, analyzing, constructing, drawing and displaying knowledge. It was first released by Google in 2012 to optimize its search functions, after which applications based on knowledge graph technology developed rapidly. A knowledge graph is composed of a data layer and a schema layer. The data layer forms a graph knowledge base from triples of the form entity-relation-entity or entity-attribute-value. As research has deepened and data has accumulated, named entity recognition and attribute judgment have developed from dictionary-based and rule-based methods to fully supervised deep learning, semi-supervised deep learning, domain transfer learning and similar methods. The schema layer provides the conceptual model and logical basis of the knowledge graph, imposes normative constraints on the data layer, and provides knowledge reasoning capability. When a knowledge graph is constructed from entities (and attributes) extracted from heterogeneous multi-source data, entity alignment and entity disambiguation are important steps. More and more companies and research institutions are dedicated to providing better services in fields such as medicine, biology, news and new media through knowledge graph technology.

An industry or domain knowledge graph is oriented to a specific field, supports knowledge reasoning, and realizes functions such as auxiliary analysis and decision support. Examples include: in the medical field, a traditional Chinese medicine medical record knowledge graph, a traditional Chinese medicine and pharmacology semantic network, a Chinese symptom bank, a breast cancer knowledge graph and Linked Life Data; in the traffic field, a city knowledge graph based on CNschema; in the humanities, the Shanghai Library linked open data set of celebrity manuscript archives, a linked data set of Chinese genealogies, and UMLS; in the tourism field, a knowledge graph of Chinese tourist attractions; and in the movie field, a bilingual movie knowledge graph and the Linked Movie Dataset. With social and economic development, the scale of enterprise data has grown rapidly and the need to use this data effectively in practical applications has gradually emerged, most visibly in enterprise risk analysis and prediction. However, knowledge graphs for the Chinese financial field are scarce, especially domain knowledge graphs applicable to small, medium and micro enterprises. This project therefore aims to realize an incrementally updated knowledge graph and knowledge reasoning system for Chinese multi-source financial bulletin documents, and to provide associated entity (enterprise, individual and the like) and event information in the field for risk decision-making and other practical applications, thereby addressing the high cost, low efficiency, high threshold and poor timeliness of enterprise risk prediction and analysis.

Disclosure of Invention

The invention provides a knowledge graph construction method and system for multi-source Chinese financial bulletin documents, aiming to overcome the relative scarcity of knowledge graphs for the Chinese financial field in the prior art.

The invention discloses a knowledge graph construction method of a multi-source Chinese financial bulletin document, which comprises the following steps:

Step 1: According to the format of the document data (xml/pdf), structure the hierarchical relationship of the chapters of each document by xml structure extraction or Optical Character Recognition (OCR), and construct a relatively complete document structure tree (sessionTree).

Step 2: Label all title data. Locate the key information by regular-expression fuzzy matching, extract the title of the valid text block in which the key information is located and mark it as a valid title, and mark the remaining titles as invalid titles.

Step 3: Unify the title length to a preset number of characters, and perform character-level embedding with BERT to obtain the corresponding vector representations.

Step 4: Divide the processed data set into a training set and a test set, feed the obtained vectors into a BiLSTM-CRF neural network for training, and perform binary classification of the titles through Softmax to obtain a title classification model.

Step 5: Classify the document titles with the title classification model, further confirm the range of the valid text blocks, and store the valid text blocks in key-value form in a MongoDB database.

Step 6: Mask the complex valid knowledge in the valid text blocks by replacing it with a short referring entity, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge to be extracted more accurately, and label the text blocks in BIO form with respect to the valid knowledge.

Step 7: Construct a masked semantic model, namely the M-MST model (mask-Multiple Sources One Topic Bi-LSTM Model), a masked Bi-LSTM semantic model that generalizes across similar documents from multiple sources. Perform embedding of the labeled data with BERT, divide the labeled data into a training set and a test set, and feed the training set into the M-MST model for training to obtain a knowledge extraction model.

Step 8: According to the knowledge extraction model and in combination with an external knowledge base, obtain word vectors of the entities and their attributes that carry domain-specific context semantic information, and complete entity fusion with the Levenshtein algorithm to obtain entity relationship triples.

Step 9: Based on the fused financial-domain entity triples, use a high-performance NoSQL database such as Neo4j for storage, display and query, use OGM to design and define the triple objects in Neo4j, construct the knowledge graph of the multi-source financial bulletin documents, and realize incremental updating or expansion.
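As an illustration of how a fused triple from Steps 8 and 9 might be written to Neo4j, the following minimal sketch builds a parameterized Cypher statement. It is not the OGM-based design the invention describes: it uses plain Cypher text instead, and the node label `Entity`, the relationship type `RELATION`, the property names and the function name `triple_to_cypher` are all hypothetical. `MERGE` is used so that ingesting the same triple twice does not create duplicates, which is one simple way to support incremental updating.

```python
from typing import Dict, Tuple


def triple_to_cypher(head: str, relation: str, tail: str) -> Tuple[str, Dict[str, str]]:
    """Return a parameterized Cypher statement and its parameters for one triple.

    The label `Entity` and relationship type `RELATION` are illustrative
    assumptions, not names taken from the patent.
    """
    query = (
        "MERGE (h:Entity {name: $head}) "
        "MERGE (t:Entity {name: $tail}) "
        "MERGE (h)-[r:RELATION {type: $relation}]->(t)"
    )
    return query, {"head": head, "tail": tail, "relation": relation}


# Hypothetical triple: a company wins the bid for a project.
query, params = triple_to_cypher("Company A", "wins_bid_for", "Project C")
```

The statement and parameters would then be passed to whatever Neo4j client is in use; that call is omitted here because it needs a running database.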

The system implementing the knowledge graph construction method for multi-source Chinese financial bulletin documents comprises a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex valid knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module and a multi-source financial bulletin document knowledge graph construction module, which are connected in sequence.

The document structure tree construction module: according to the format of the document data (xml/pdf), structures the hierarchical relationship of the chapters of each document by xml structure extraction or Optical Character Recognition (OCR), and constructs a relatively complete document structure tree (sessionTree).

The title data labeling module: labels all title data. It locates the key information by regular-expression fuzzy matching, extracts the title of the valid text block in which the key information is located and marks it as a valid title, and marks the remaining titles as invalid titles.

The vector representation construction module: unifies the title length to a preset number of characters, and performs character-level embedding with BERT to obtain the corresponding vector representations.

The title classification model construction module: divides the processed data set into a training set and a test set, feeds the obtained vectors into a BiLSTM-CRF neural network for training, and performs binary classification of the titles through Softmax to obtain a title classification model.

The document title classification module: classifies the document titles with the title classification model, further confirms the range of the valid text blocks, and stores the valid text blocks in key-value form in a MongoDB database.

The complex valid knowledge mask module: masks the complex valid knowledge in the valid text blocks by replacing it with a short referring entity, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge to be extracted more accurately, and labels the text blocks in BIO form with respect to the valid knowledge.

The knowledge extraction model construction module: constructs a masked semantic model, namely the M-MST model (mask-Multiple Sources One Topic Bi-LSTM Model), a masked Bi-LSTM semantic model that generalizes across similar documents from multiple sources; performs embedding of the labeled data with BERT, divides the labeled data into a training set and a test set, and feeds the training set into the M-MST model for training to obtain a knowledge extraction model.

The entity relationship triple construction module: according to the knowledge extraction model and in combination with an external knowledge base, obtains word vectors of the entities and their attributes that carry domain-specific context semantic information, and completes entity fusion with the Levenshtein algorithm to obtain entity relationship triples.

The multi-source financial bulletin document knowledge graph construction module: based on the fused financial-domain entity triples, uses a high-performance NoSQL database such as Neo4j for storage, display and query, uses OGM to design and define the triple objects in Neo4j, constructs the knowledge graph of the multi-source financial bulletin documents, and realizes incremental updating or expansion.

The method uses xml structure extraction or Optical Character Recognition (OCR) to construct a document structure tree, and marks the title of each valid text block containing valid information by regular-expression fuzzy matching. After the titles are padded or truncated to a uniform number of characters, BERT is used to embed each character and obtain the corresponding vectors, which are fed into a BiLSTM-CRF and classified by Softmax.

Accurate valid text blocks are obtained from the title classification model. The complex valid knowledge in each block is masked by replacing it with a short referring entity, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge to be extracted more accurately, and the text block is labeled in BIO form with respect to the valid knowledge. The masked semantic model M-MST is then constructed to extract the valid information.

According to the knowledge extraction model and in combination with an external knowledge base, word vectors of the entities and their attributes that carry domain-specific context semantic information are obtained, and entity fusion is completed with the Levenshtein algorithm to obtain entity relationship triples. A high-performance NoSQL database such as Neo4j is used for storage, display and query, OGM is used to design and define the triple objects in Neo4j, and the knowledge graph of the multi-source financial bulletin documents is constructed.

The advantages of the invention are as follows: the respective strengths and weaknesses of keyword-based methods and machine learning classification algorithms in tax code classification are considered together, and an attention-based method for classifying ultra-short commodity name texts is proposed that combines the strengths of both. The lack of context information in short texts is addressed by mining information at the keyword level with entity linking technology, and where context is lacking, anchor text is encoded in place of the keywords. A Transformer architecture then determines how much each keyword contributes to the tax code classification, and the classification is completed. Accuracy and efficiency are thereby further improved, and labor costs are greatly reduced.

Drawings

FIG. 1 is a schematic diagram of a data preprocessing process according to the present invention.

FIG. 2 is a diagram of the M-MST with mask semantic model of the present invention.

FIG. 3 is an exemplary knowledge graph of a multi-source financial bulletin document.

FIG. 4 is a schematic flow chart of the present invention.

Detailed Description

The invention will be further explained with reference to the drawings.

Step 1: Construct a complete document structure tree, obtaining the document structure, which comprises the general title, primary titles, secondary titles and so on, together with the corresponding text blocks.
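A minimal sketch of such a structure tree is given below, assuming the titles have already been extracted (from xml or OCR output) as `(level, title, text)` tuples in document order. The class name `SectionNode`, the function name `build_section_tree` and the tuple layout are illustrative assumptions, not details taken from the invention.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SectionNode:
    """One node of the tree: a title, its text block, and its sub-sections."""
    level: int
    title: str
    text: str = ""
    children: List["SectionNode"] = field(default_factory=list)


def build_section_tree(doc_title: str,
                       sections: List[Tuple[int, str, str]]) -> SectionNode:
    """Build a tree from (level, title, text) tuples given in document order.

    Level 1 is a primary title, level 2 a secondary title, and so on;
    the general title of the document becomes the root at level 0.
    """
    root = SectionNode(level=0, title=doc_title)
    stack = [root]  # path from the root to the most recently added node
    for level, title, text in sections:
        if level < 1:
            raise ValueError("section levels must start at 1")
        # Climb back up until the top of the stack is a strict ancestor.
        while stack[-1].level >= level:
            stack.pop()
        node = SectionNode(level, title, text)
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Hypothetical bulletin outline.
tree = build_section_tree("Contract bulletin", [
    (1, "I. Overview", "overview text"),
    (2, "1.1 Parties", "parties text"),
    (2, "1.2 Amount", "amount text"),
    (1, "II. Risk warning", "risk text"),
])
```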

Step 2: Locate the valid blocks by fuzzy matching against the annotated content, and extract the title corresponding to each valid block.
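One possible reading of this labeling step is sketched below. It assumes that "fuzzy" means tolerating whitespace or line breaks inserted inside the annotated key information (as often happens in PDF extraction); the invention does not specify the matching rules, so the pattern construction, the function names and the sample data are hypothetical.

```python
import re
from typing import Dict, List


def fuzzy_pattern(key_info: str) -> re.Pattern:
    """Regex matching key_info while allowing whitespace between characters."""
    chars = [re.escape(ch) for ch in key_info if not ch.isspace()]
    return re.compile(r"\s*".join(chars))


def label_titles(blocks: Dict[str, str], key_infos: List[str]) -> Dict[str, int]:
    """Map each title to 1 (valid) if its text block contains any key info, else 0."""
    patterns = [fuzzy_pattern(k) for k in key_infos if k.strip()]
    return {
        title: int(any(p.search(text) for p in patterns))
        for title, text in blocks.items()
    }


# Hypothetical text blocks keyed by title; the amount is split by a line break.
blocks = {
    "Bid award details": "The winning bid price is about 24.5661 hundred\nmillion yuan.",
    "Documents for reference": "The original contract.",
}
labels = label_titles(blocks, ["24.5661 hundred million yuan"])
```

The resulting 0/1 labels are the kind of training targets the title classifier in Step 4 would need.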

Step 3: Pad short valid titles and truncate long ones so that every title has the preset number of characters. In this example, because titles are relatively short, the preset length is chosen as long as possible so that no information is lost. Chinese BERT character vectors are used for encoding. According to statistics, most titles are within 25 characters, so the length is set to 25 characters: a title that is too short is filled by a repetition strategy, and a title that is too long is truncated to its first 25 characters.
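The length normalization can be sketched as follows. The text does not define the repetition strategy, so this sketch assumes it means repeating the title cyclically until the preset length is reached; the function name `normalize_title` is hypothetical.

```python
def normalize_title(title: str, length: int = 25) -> str:
    """Pad by cyclic repetition or truncate so the title has exactly `length` characters."""
    if not title:
        raise ValueError("title must not be empty")
    if len(title) >= length:
        return title[:length]          # keep only the first `length` characters
    repeats = -(-length // len(title))  # ceiling division
    return (title * repeats)[:length]
```

For example, `normalize_title("abc", 7)` repeats the title to give `"abcabca"`, while a 30-character title is cut to its first 25 characters.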

Step 4: Divide the processed data set into a training set (80%) and a test set (20%), input the encoded data and their labels into a BiLSTM-CRF network for training, and perform binary classification with softmax as the final activation function to obtain a title classification model.

Step 5: Classify the document titles with the title classification model, further confirm the range of the valid text blocks, and store the valid text blocks in key-value form in a MongoDB database. Steps 1 to 5 constitute the data preprocessing stage, whose flow is shown in FIG. 1.

Step 6: After the valid text blocks are obtained, mask the complex entities in them. For example, consider the text: "The consortium formed by this company and its subsidiaries 'Zhongtiejiu Ju Co., Ltd.', 'Zhongtieseventeen Ju Co., Ltd.' and others won the bid for construction section C of the national expressway network G85 Yukun expressway Gansu bay to Showa section expressway investor and cooperative contract constructor tender, with a winning bid price of about 24.5661 hundred million yuan." Here the complex entity is "construction section C of the national expressway network G85 Yukun expressway Gansu bay to Showa section expressway investor and cooperative contract constructor tender". This complex entity is masked as "project", and the masked text becomes: "The consortium formed by this company and its subsidiaries 'Zhongtiejiu Ju Co., Ltd.', 'Zhongtieseventeen Ju Co., Ltd.' and others won the bid for the project, with a winning bid price of about 24.5661 hundred million yuan."
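The masking and BIO labeling can be sketched as below. This is a simplified illustration that assumes the complex entity string is already known from annotation and that tags are assigned per character, as is common for Chinese text; the function names, the label `PROJ` and the shortened English sample sentence are hypothetical.

```python
from typing import List, Tuple


def mask_entity(text: str, entity: str, placeholder: str) -> str:
    """Replace a long, complex entity with a short referring expression."""
    return text.replace(entity, placeholder)


def bio_tags(text: str, spans: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Character-level BIO tags; `spans` holds (surface string, label) pairs."""
    tags = ["O"] * len(text)
    for surface, label in spans:
        if not surface:
            continue  # an empty surface string would match everywhere
        start = text.find(surface)
        while start != -1:
            tags[start] = f"B-{label}"
            for k in range(start + 1, start + len(surface)):
                tags[k] = f"I-{label}"
            start = text.find(surface, start + len(surface))
    return list(zip(text, tags))


# Shortened, hypothetical stand-in for the bid announcement above.
text = "A won the G85 section C contract"
masked = mask_entity(text, "the G85 section C contract", "project")
tagged = bio_tags(masked, [("project", "PROJ")])
```

After masking, the sentence is short enough that the context around the placeholder ("won", the price) sits close to the entity, which is the stated purpose of the mask.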

Step 7: Construct the masked semantic model M-MST, encode the labeled data with BERT, divide it into an 80% training set and a 20% test set, and feed the training set to the M-MST model for training to obtain a knowledge extraction model. The M-MST structure is shown in FIG. 2.

Step 8: In combination with the external Baidu encyclopedia knowledge base, obtain word vectors of the entities and their attributes that carry domain-specific context semantic information, and complete entity fusion with the Levenshtein algorithm to obtain entity relationship triples.
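The formula for this step is not reproduced in this text. Assuming it is the standard Levenshtein (edit distance) recurrence, it can be written as follows, reading a and b as the character sequences of the two entity names and i, j as prefix lengths. Under this reading a smaller value means the two names are more similar.

```latex
\mathrm{Lev}_{a,b}(i,j)=
\begin{cases}
\max(i,j) & \text{if } \min(i,j)=0,\\[4pt]
\min\begin{cases}
\mathrm{Lev}_{a,b}(i-1,j)+1\\
\mathrm{Lev}_{a,b}(i,j-1)+1\\
\mathrm{Lev}_{a,b}(i-1,j-1)+\mathbf{1}_{(a_i\neq b_j)}
\end{cases} & \text{otherwise.}
\end{cases}
```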

Where a, b represent two entity word vectors, i, j represent the vector indices, Lev_a,b(i, j) represents the similarity value between the a and b entities.
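A minimal sketch of Levenshtein-based entity fusion follows. It assumes the comparison is made on entity name strings, that the distance is normalized into a similarity in [0, 1], and that names whose similarity reaches a threshold are merged into the first name seen. The threshold value 0.8, the normalization, the greedy merging and all names below are illustrative assumptions rather than details given in the text.

```python
from typing import Dict, List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuse_entities(names: List[str], threshold: float = 0.8) -> Dict[str, str]:
    """Map every name to a canonical name, merging near-duplicates greedily."""
    canonical: List[str] = []
    mapping: Dict[str, str] = {}
    for name in names:
        match = next((c for c in canonical if similarity(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Hypothetical entity names with a punctuation variant.
mapping = fuse_entities([
    "Acme Holdings Co., Ltd.",
    "Acme Holdings Co. Ltd",
    "Zhejiang University of Technology",
])
```

Here the two spellings of the hypothetical company differ by two deletions and are merged, while the third name stays separate.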

Step 9: Construct the knowledge graph and update or expand it incrementally. Based on the fused financial-domain entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, and OGM is used to design and define the triple objects in Neo4j. The procedure is as follows:

a) Construct the knowledge graph from the entity-fused triple knowledge base;

b) Compute the word vector of a newly arrived entity or attribute in a unified semantic environment;

c) If the new item is an attribute, perform the following operations:

i. Judge, according to the distance threshold or an external knowledge base, whether it is a newly added attribute;

ii. Yes → add it to the corresponding entity-attribute triples;

iii. No → judge whether it is an attribute that needs to be updated (attribute information changes over time);

iv. Yes → update the current attribute value (nodes and edges) and record the update time and frequency;

v. No → make no modification; the repetition frequency can be recorded.

d) If the new item is an entity, perform the following operations:

i. Judge, according to the distance threshold, whether it is a newly added entity;

ii. Yes → obtain the optimal insertion position by clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the knowledge graph that best fits the application target;

iii. No → judge whether it is an entity that needs to be updated (attribute information changes over time);

iv. Yes → update the entity (nodes and edges) based on the optimization method and record the update time and frequency;

v. No → make no modification; the repetition frequency can be recorded.
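The attribute branch c) of this procedure can be sketched as a small decision function. The sketch below is an in-memory illustration only: it compares attribute names with `difflib.SequenceMatcher` as a stand-in for the word-vector distance the text describes, and the threshold 0.8, the class `AttributeRecord` and the function `upsert_attribute` are hypothetical.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, Optional


@dataclass
class AttributeRecord:
    value: str
    updates: int = 0                  # how many times the value was changed
    repeats: int = 0                  # how many times the same value was seen again
    updated_at: Optional[str] = None  # time of the last add or update


def _similar(a: str, b: str) -> float:
    # Stand-in similarity; the text describes a distance between word vectors.
    return SequenceMatcher(None, a, b).ratio()


def upsert_attribute(attrs: Dict[str, AttributeRecord], name: str, value: str,
                     timestamp: str, threshold: float = 0.8) -> str:
    """Apply steps c) i to v to one incoming (attribute name, value) pair."""
    best = max(attrs, key=lambda k: _similar(k, name), default=None)
    # i/ii: no sufficiently similar attribute exists, so this one is new.
    if best is None or _similar(best, name) < threshold:
        attrs[name] = AttributeRecord(value, updated_at=timestamp)
        return "added"
    record = attrs[best]
    # iii/iv: known attribute whose value changed over time.
    if record.value != value:
        record.value = value
        record.updates += 1
        record.updated_at = timestamp
        return "updated"
    # v: nothing to modify; only count the repetition.
    record.repeats += 1
    return "unchanged"


# Hypothetical attribute stream for one enterprise entity.
attrs: Dict[str, AttributeRecord] = {}
r1 = upsert_attribute(attrs, "registered capital", "100 million yuan", "2021-01-01")
r2 = upsert_attribute(attrs, "registered capital", "120 million yuan", "2021-06-01")
r3 = upsert_attribute(attrs, "registered capital", "120 million yuan", "2021-07-01")
r4 = upsert_attribute(attrs, "legal representative", "Zhang", "2021-07-01")
```

The entity branch d) has the same shape, except that a new entity additionally requires choosing where to attach it in the graph, which this sketch does not attempt.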

The system implementing the knowledge graph construction method for multi-source Chinese financial bulletin documents comprises a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex valid knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module and a multi-source financial bulletin document knowledge graph construction module, which are connected in sequence.

The title classification model construction module: divides the processed data set into an 80% training set and a 20% test set, feeds the obtained vectors into a BiLSTM-CRF neural network for training, and performs binary classification of the titles through Softmax to obtain a title classification model.

The knowledge extraction model construction module: constructs a masked semantic model, namely the M-MST model (mask-Multiple Sources One Topic Bi-LSTM Model), a masked Bi-LSTM semantic model that generalizes across similar documents from multiple sources; performs embedding of the labeled data with BERT, divides the labeled data into an 80% training set and a 20% test set, and feeds the training set into the M-MST model for training to obtain a knowledge extraction model.

The entity relationship triple construction module: according to the knowledge extraction model and in combination with an external knowledge base, obtains word vectors of the entities and their attributes that carry domain-specific context semantic information, and completes entity fusion with the Levenshtein algorithm to obtain entity relationship triples. Specifically:

In combination with the external Baidu encyclopedia knowledge base, word vectors of the entities and their attributes that carry domain-specific context semantic information are obtained, and entity fusion is completed with the Levenshtein algorithm to obtain entity relationship triples.

In the formula, a and b represent two entity word vectors, i and j represent vector subscripts, and Lev_a,b(i, j) represents the similarity value between the a and b entities.

The multi-source financial bulletin document knowledge graph construction module: based on the fused financial-domain entity triples, uses a high-performance NoSQL database such as Neo4j for storage, display and query, uses OGM to design and define the triple objects in Neo4j, constructs the knowledge graph of the multi-source financial bulletin documents, and realizes incremental updating or expansion. Specifically:

e) Construct the knowledge graph from the entity-fused triple knowledge base;

f) Compute the word vector of a newly arrived entity or attribute in a unified semantic environment;

g) If the new item is an attribute, perform the following operations:

vi. Judge, according to the distance threshold or an external knowledge base, whether it is a newly added attribute;

vii. Yes → add it to the corresponding entity-attribute triples;

viii. No → judge whether it is an attribute that needs to be updated (attribute information changes over time);

ix. Yes → update the current attribute value (nodes and edges) and record the update time and frequency;

x. No → make no modification; the repetition frequency can be recorded.

h) If the new item is an entity, perform the following operations:

vi. Judge, according to the distance threshold, whether it is a newly added entity;

vii. Yes → obtain the optimal insertion position by clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the knowledge graph that best fits the application target;

viii. No → judge whether it is an entity that needs to be updated (attribute information changes over time);

ix. Yes → update the entity (nodes and edges) based on the optimization method and record the update time and frequency;

x. No → make no modification; the repetition frequency can be recorded.

The invention aims to overcome the scarcity of Chinese knowledge graph resources. It integrates knowledge extraction with knowledge graph construction, and is directed in particular at the extraction of complex entities in the financial field and the construction of financial-domain knowledge graphs. A document structure tree is constructed by xml structure extraction or Optical Character Recognition (OCR), and the title of each valid text block containing valid information is marked by regular-expression fuzzy matching. After the titles are padded or truncated to a uniform number of characters, BERT is used to embed each character and obtain the corresponding vectors, which are fed into a BiLSTM-CRF and then classified by Softmax. An incrementally updatable knowledge graph and knowledge reasoning system for Chinese multi-source financial bulletin documents is thereby realized, providing associated entity (enterprise, individual and the like) and event information in the field for risk decision-making and other practical applications, and addressing the high cost, low efficiency, high threshold and poor timeliness of enterprise risk prediction and analysis.

The invention has been illustrated by the above examples, but it should be noted that the examples are for illustrative purposes only and do not limit the invention to the scope of the examples. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that the technical solutions described in the foregoing examples can be modified, or some of their technical features can be replaced by equivalents, and that such modifications or substitutions do not depart from the scope of the present invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims

1. The knowledge graph construction method of the multi-source Chinese financial bulletin document comprises the following steps:

step 1: structuring the hierarchical relationship of each chapter of the document by using an xml structure extraction or Optical Character Recognition (OCR) technology according to the format (xml/pdf) of the document data, and constructing a relatively complete document structure tree (sessionTree);

step 2: labeling all the title data; acquiring the position of the key information in a regular fuzzy matching mode, extracting the title of the effective text block where the key information is located, marking the title as an effective title, and marking the rest titles as invalid titles;

and step 3: unifying the length of the title to a preset word number, and performing word embedding coding at a character level by using BERT to obtain corresponding vector representation;

and 4, step 4: dividing the processed data set into a training set and a testing set, feeding the obtained vector into a BilSTM-CRF neural network for training, and performing secondary classification on the title through Softmax to obtain a title classification model;

and 5: classifying the document titles by using a title classification model, further confirming the range of the effective text block, and storing the range in a key-value form of a MongoDB database;

step 6: masking the complex effective knowledge of the effective text blocks, replacing the complex effective knowledge with a certain short entity to reduce the influence of the complex knowledge on context semantics, accurately acquiring and extracting knowledge context semantic information, and labeling the text blocks in a BIO form aiming at the effective knowledge;

and 7: constructing a semantic Model with a mask, constructing a Bi-LSTM semantic Model M-MST (mask-Multiple Sources One recent Bi-LSTM Model) of a multi-source similar generalization mask, performing word embedding coding on the labeled data by using BERT (binary inverse transform), dividing the labeled data into a training set and a test set, feeding the M-MST Model for training, and obtaining a knowledge extraction Model;

and 8: obtaining word vectors of the entities and the attributes thereof with professional field context semantic information by combining an external knowledge base according to a knowledge extraction model, and completing entity fusion work by utilizing a Levenshtein algorithm to obtain entity relationship triples;

where a, b represent two entity word vectors, i, j represent the vector indices, Lev_a,b(i, j) represents the similarity value between the entities a and b;

step 9: based on the fused financial-domain entity triples, using a high-performance NoSQL database such as Neo4j for storage, display and query, using OGM to design and define the triple objects in Neo4j, constructing a knowledge graph of the multi-source financial bulletin documents, and realizing incremental updating or expansion, specifically comprising:

based on the fused financial-domain entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, and OGM is used to design and define the triple objects in Neo4j;

c) if the attribute is newly added, the following operations are executed:

yes → add to the corresponding entity attribute triple;

v. no → no modification; the repetition frequency may be recorded;

d) if an entity is newly added, the following operations are executed:

v. no → no modification; the repetition frequency may be recorded.
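The incremental update branches above are only partly preserved in this text, so the following is a sketch under assumptions rather than the claimed procedure: a triple not yet stored is added, and a triple already stored is left unmodified while its repetition frequency is counted. `TripleStore` is a hypothetical in-memory stand-in and does not use the Neo4j or OGM API.

```python
from collections import Counter

class TripleStore:
    """In-memory stand-in for the graph database; not the Neo4j/OGM API."""

    def __init__(self):
        self.triples = set()
        self.repeats = Counter()

    def upsert(self, head, relation, tail):
        """Add a new triple, or count a repeat if it is already stored."""
        triple = (head, relation, tail)
        if triple in self.triples:
            self.repeats[triple] += 1   # no modification, only record frequency
            return False
        self.triples.add(triple)
        return True

store = TripleStore()
first = store.upsert("某公司", "净利润", "1.2亿元")
second = store.upsert("某公司", "净利润", "1.2亿元")
```

Recording repeats separately from the triples keeps the graph free of duplicates while still retaining how often each fact was observed across sources.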

2. The system for implementing the knowledge graph construction method of the multisource Chinese financial bulletin document of claim 1, comprising a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module, and a multisource financial bulletin document knowledge graph construction module which are connected in sequence;

the document structure tree construction module: structuring the hierarchical relationship of the chapters of each document by XML structure extraction or Optical Character Recognition (OCR), according to the format of the document data (XML/PDF), and constructing a relatively complete document structure tree (sessionTree);

the title data labeling module: labeling all the title data; locating the key information by fuzzy matching with regular expressions, extracting the title of the effective text block in which the key information is located, marking that title as a valid title, and marking the remaining titles as invalid titles;

the vector representation construction module: unifying the length of each title to a preset number of characters, and performing character-level word embedding encoding using BERT to obtain the corresponding vector representation;

the title classification model construction module: dividing the processed data set into a training set and a test set, feeding the obtained vectors into a BiLSTM-CRF neural network for training, and performing binary classification of the titles through Softmax to obtain a title classification model;

the document title classification module: classifying the document titles using the title classification model, thereby confirming the range of the effective text blocks, and storing the range in key-value form in a MongoDB database;

the complex effective knowledge mask module: masking the complex effective knowledge in the effective text blocks by replacing it with a short entity, so as to reduce the influence of the complex knowledge on the context semantics and to accurately acquire the context semantic information of the knowledge to be extracted, and labeling the text blocks in BIO form with respect to the effective knowledge;

the knowledge extraction model construction module: constructing a masked semantic model, namely the multi-source similar generalized mask Bi-LSTM semantic model M-MST (mask-Multiple Sources One recent Bi-LSTM Model), performing word embedding encoding on the labeled data using BERT, dividing the labeled data into a training set and a test set, and feeding them into the M-MST model for training to obtain a knowledge extraction model;

the entity relationship triple construction module: according to the knowledge extraction model and in combination with an external knowledge base, obtaining word vectors of the entities and their attributes that carry domain-specific context semantic information, and completing entity fusion using the Levenshtein algorithm to obtain entity relationship triples;