CN113569054A - Knowledge graph construction method and system for multi-source Chinese financial bulletin document - Google Patents

Info

Publication number
CN113569054A
Authority
CN
China
Prior art keywords
knowledge
title
document
entity
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110517049.4A
Other languages
Chinese (zh)
Inventor
高楠
杜宇轩
陈国鑫
陈磊
杨博威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202110517049.4A
Publication of CN113569054A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following steps: structuring the hierarchical relationship of the chapters of each document and constructing a relatively complete document structure tree; labeling all title data; unifying the title length to a preset number of characters and performing character-level embedding with BERT to obtain the corresponding vector representations; dividing the processed data set into a training set and a test set and training a title classification model; classifying the document titles with the title classification model; masking the complex effective knowledge in the effective text blocks; constructing a masked semantic model, namely the multi-source same-topic generalized-mask Bi-LSTM semantic model (M-MST), and training it to obtain a knowledge extraction model; obtaining entity-relationship triples from the knowledge extraction model in combination with an external knowledge base; and constructing the knowledge graph of the multi-source financial bulletin documents and updating or expanding it incrementally. A system implementing the method is also provided.

Description

Knowledge graph construction method and system for multi-source Chinese financial bulletin document
Technical Field
The invention relates to a method and a system for knowledge extraction and knowledge graph construction, and in particular to the extraction of complex entities in the financial field and the construction of knowledge graphs in the financial field.
The invention relates to the fields of natural language processing, knowledge graphs, deep learning and the like, and in particular to modeling based on deep learning.
Background
Listed companies are the leaders of China's economic development, and innovative small and medium-sized private enterprises are a pillar of economic growth; both face, and will continue to face for some time, a variety of challenges. For production to continue and development to keep its pace, funding must come first. The capital market is the granary from which listed companies replenish their capital. At the beginning of the year, the new refinancing rules issued by the China Securities Regulatory Commission (CSRC) greatly relaxed the refinancing restrictions on listed companies, particularly enterprises on the ChiNext board. At the same time, the CSRC emphasized that it will continue to improve the day-to-day supervision of listed companies, strictly control issuance conditions, and strengthen risk prevention and control measures such as information disclosure requirements. This project extracts information of industrial significance from financial bulletin texts of various sources (such as shareholding increase and reduction bulletins, contract bulletins, listing bulletins, and monthly and annual performance bulletins), constructs a financial bulletin knowledge graph that can be expanded automatically and incrementally, and provides support to regulators and researchers for risk analysis and early warning, management decisions, model research and the like.
Constructing a knowledge graph from financial bulletin documents faces the following problems:
(1) Financial bulletin documents contain a large amount of redundant and invalid information, which makes information extraction relatively difficult.
(2) The financial field contains many entities with complex structures, so the context information of an entity is difficult to obtain and its boundaries are difficult to determine.
(3) In financial-field knowledge graphs the same entity may appear under different names, so entity fusion is required.
A knowledge graph (KG), also called a scientific knowledge graph, is a very large-scale semantic network that describes real-world entities or concepts and the relationships among them through mining, analysis, construction, drawing and display. Google introduced it in 2012 to improve its search functions, after which applications based on knowledge graph technology developed rapidly. A knowledge graph consists of a data layer and a schema layer. The data layer forms a graph knowledge base from triples of the form entity-relation-entity or entity-attribute-value. As research has deepened and data has accumulated, named entity recognition and attribute determination have evolved from dictionary-based and rule-based methods to fully supervised deep learning, semi-supervised deep learning, domain transfer learning and similar methods. The schema layer provides the conceptual model and logical basis of the knowledge graph, imposes normative constraints on the data layer, and provides knowledge reasoning capability. When a knowledge graph is built from entities (and attributes) extracted from heterogeneous multi-source data, entity alignment and entity disambiguation are important steps. More and more companies and research institutions are using knowledge graph technology to provide better services in fields such as medicine, biology, news and new media.
An industry or domain knowledge graph targets a specific field, supports knowledge reasoning, and enables functions such as auxiliary analysis and decision support. Examples include, in the medical field, the traditional Chinese medicine medical record knowledge graph, the traditional Chinese medicine and pharmacology semantic network, the Chinese symptom bank, the breast cancer knowledge graph and Linked Life Data; in the traffic field, a city knowledge graph based on CNschema; in the humanities, the Shanghai Library celebrity manuscript archive linked open data set, the Chinese genealogy linked data set and UMLS; in the tourism field, the knowledge graph of Chinese tourist attractions; and in the movie field, a bilingual movie knowledge graph and the Linked Movie Dataset. With economic and social development, the volume of enterprise data is growing rapidly, and the need to use this data effectively in practical applications is becoming apparent, especially for enterprise risk analysis and prediction. However, knowledge graphs for the Chinese financial field are scarce, and domain knowledge graphs applicable to small, medium and micro enterprises are particularly lacking. This project therefore aims to realize an incrementally updatable knowledge graph and knowledge reasoning system for multi-source Chinese financial bulletin documents, which provides information on related entities (enterprises, individuals and the like) and events in the field for risk decisions and other practical applications, thereby addressing the high cost, low efficiency, high threshold and poor timeliness of enterprise risk prediction and analysis.
Disclosure of Invention
The invention provides a knowledge graph construction method and system for multi-source Chinese financial bulletin documents, in order to overcome the relative scarcity of knowledge graphs for the Chinese financial field in the prior art.
The knowledge graph construction method for multi-source Chinese financial bulletin documents disclosed by the invention comprises the following steps:
Step 1: Depending on the format of the document data (xml/pdf), structure the hierarchical relationship of the chapters of the document by XML structure extraction or optical character recognition (OCR), and construct a relatively complete document structure tree (sessionTree).
Step 2: Label all title data. Locate the key information by regular-expression fuzzy matching, extract the title of the effective text block that contains the key information and mark it as an effective title, and mark the remaining titles as invalid titles.
Step 3: Unify the title length to a preset number of characters, and perform character-level embedding with BERT to obtain the corresponding vector representations.
Step 4: Divide the processed data set into a training set and a test set, feed the vectors into a BiLSTM-CRF neural network for training, and perform binary classification of the titles through Softmax to obtain a title classification model.
Step 5: Classify the document titles with the title classification model, confirm the range of each effective text block, and store the effective text blocks in key-value form in a MongoDB database.
Step 6: Mask the complex effective knowledge in the effective text blocks by replacing it with a short referring term, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge accurately, and label the effective knowledge in the text blocks in BIO form.
Step 7: Construct a masked semantic model, namely the multi-source same-topic generalized-mask Bi-LSTM semantic model M-MST (Mask-Multiple Sources One Topic Bi-LSTM Model); encode the labeled data with BERT (Bidirectional Encoder Representations from Transformers) embeddings, divide the data into a training set and a test set, and feed the training set into the M-MST model for training to obtain a knowledge extraction model.
Step 8: Using the knowledge extraction model in combination with an external knowledge base, obtain word vectors of the entities and their attributes that carry domain-specific context semantic information, and complete entity fusion with the Levenshtein algorithm to obtain entity-relationship triples.
Step 9: Store, display and query the fused financial-field entity triples in a high-performance NoSQL database such as Neo4j, use an OGM (object graph mapper) to design and define the triple objects in Neo4j, and construct the knowledge graph of the multi-source financial bulletin documents with incremental updating or expansion.
The system implementing the knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following modules, connected in sequence: a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module, and a multi-source financial bulletin document knowledge graph construction module.
The document structure tree construction module: depending on the format of the document data (xml/pdf), structures the hierarchical relationship of the chapters of the document by XML structure extraction or optical character recognition (OCR), and constructs a relatively complete document structure tree (sessionTree).
The title data labeling module: labels all title data. It locates the key information by regular-expression fuzzy matching, extracts the title of the effective text block that contains the key information and marks it as an effective title, and marks the remaining titles as invalid titles.
The vector representation construction module: unifies the title length to a preset number of characters, and performs character-level embedding with BERT to obtain the corresponding vector representations.
The title classification model construction module: divides the processed data set into a training set and a test set, feeds the vectors into a BiLSTM-CRF neural network for training, and performs binary classification of the titles through Softmax to obtain a title classification model.
The document title classification module: classifies the document titles with the title classification model, confirms the range of each effective text block, and stores the effective text blocks in key-value form in a MongoDB database.
The complex effective knowledge mask module: masks the complex effective knowledge in the effective text blocks by replacing it with a short referring term, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge accurately, and labels the effective knowledge in the text blocks in BIO form.
The knowledge extraction model construction module: constructs a masked semantic model, namely the multi-source same-topic generalized-mask Bi-LSTM semantic model M-MST (Mask-Multiple Sources One Topic Bi-LSTM Model); encodes the labeled data with BERT (Bidirectional Encoder Representations from Transformers) embeddings, divides the data into a training set and a test set, and feeds the training set into the M-MST model for training to obtain a knowledge extraction model.
The entity relationship triple construction module: using the knowledge extraction model in combination with an external knowledge base, obtains word vectors of the entities and their attributes that carry domain-specific context semantic information, and completes entity fusion with the Levenshtein algorithm to obtain entity-relationship triples.
The multi-source financial bulletin document knowledge graph construction module: stores, displays and queries the fused financial-field entity triples in a high-performance NoSQL database such as Neo4j, uses an OGM (object graph mapper) to design and define the triple objects in Neo4j, and constructs the knowledge graph of the multi-source financial bulletin documents with incremental updating or expansion.
The method constructs a document structure tree by XML structure extraction or optical character recognition (OCR), and marks the titles of the effective text blocks that contain effective information by regular-expression fuzzy matching. After each title is padded or truncated to a uniform number of characters, BERT performs character-level embedding to obtain the corresponding vectors, which are fed into a BiLSTM-CRF and classified by Softmax.
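The title labeling by regular-expression fuzzy matching can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify the matching rule, so the sketch assumes that the key information is given as a keyword list and that "fuzzy" means tolerating a few arbitrary characters between the characters of a keyword. The names `fuzzy_pattern`, `label_titles` and `max_gap` are hypothetical.

```python
import re

def fuzzy_pattern(keyword, max_gap=2):
    """Regex matching the keyword's characters in order, tolerating up to
    max_gap arbitrary characters between neighbours (assumed fuzzy rule)."""
    gap = ".{0,%d}?" % max_gap
    return re.compile(gap.join(re.escape(ch) for ch in keyword), re.DOTALL)

def label_titles(blocks, keywords, max_gap=2):
    """blocks: list of (title, text) pairs.
    Returns {title: 1} for effective titles and {title: 0} for invalid ones."""
    patterns = [fuzzy_pattern(k, max_gap) for k in keywords]
    labels = {}
    for title, text in blocks:
        hit = any(p.search(text) for p in patterns)
        labels[title] = 1 if hit else 0
    return labels

# Hypothetical text blocks; the keyword is split by a space in the first block.
blocks = [
    ("一、中标项目情况", "本公司中标 价格约24.5661亿元"),
    ("二、备查文件", "中标通知书"),
]
labels = label_titles(blocks, ["中标价格"])
```

The resulting labels would then serve as training targets for the title classifier.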
Accurate effective text blocks are obtained from the title classification model. The complex effective knowledge in each block is masked by replacing it with a short referring term, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge accurately, and the effective knowledge in the text block is labeled in BIO form. The masked semantic model M-MST is then constructed to extract the effective information.
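As an illustration of BIO labeling at the character level, the following minimal sketch converts annotated spans into a tag sequence. The span format `(start, end, type)` with an exclusive end, and the type names `PROJ` and `PRICE`, are assumptions made for the example and do not come from the patent.

```python
def bio_tags(text, spans):
    """Return one BIO tag per character of text.
    spans: list of (start, end, entity_type), end exclusive (assumed format)."""
    tags = ["O"] * len(text)
    for start, end, etype in spans:
        tags[start] = "B-" + etype          # first character of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + etype          # remaining characters of the span
    return tags

# Hypothetical masked sentence: "项目" is the short referring term.
text = "项目中标价格约24亿元"
tags = bio_tags(text, [(0, 2, "PROJ"), (7, 11, "PRICE")])
```

Because the complex entity has already been replaced by a short term, the tagged span stays short and its surrounding context remains within the window of the sequence model.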
Using the knowledge extraction model in combination with an external knowledge base, word vectors of the entities and their attributes that carry domain-specific context semantic information are obtained, and entity fusion is completed with the Levenshtein algorithm to obtain entity-relationship triples. A high-performance NoSQL database such as Neo4j is used for storage, display and query, an OGM is used to design and define the triple objects in Neo4j, and the knowledge graph of the multi-source financial bulletin documents is constructed.
The advantages of the invention are as follows: the respective strengths and weaknesses of keywords and machine learning classification algorithms in tax code classification are considered together, and an attention-based method for classifying ultra-short commodity name texts is provided that combines the strengths of both. Information mining at the keyword level uses entity linking to compensate for the insufficient context of short texts; anchor text replaces keywords that lack context for encoding; a Transformer architecture then yields the contribution of each keyword to the tax code classification; and the classification is finally completed, which further improves accuracy and efficiency and greatly reduces labor cost.
Drawings
FIG. 1 is a schematic diagram of a data preprocessing process according to the present invention.
FIG. 2 is a diagram of the masked semantic model M-MST of the present invention.
FIG. 3 is an exemplary knowledge graph of a multi-source financial bulletin document.
FIG. 4 is a schematic flow chart of the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
The knowledge graph construction method for multi-source Chinese financial bulletin documents disclosed by the invention comprises the following steps:
Step 1: Construct a complete document structure tree, obtaining the document structure (the general title, first-level titles, second-level titles and so on) together with the corresponding text blocks.
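A document structure tree of this kind can be sketched for the XML case as follows. The patent does not give the XML schema, so the `<section title="...">` layout, the `SectionNode` class and the `build_tree` function are hypothetical; the OCR path for PDF input is not covered.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

@dataclass
class SectionNode:
    title: str                      # chapter or section title
    level: int                      # 0 = general title, 1 = first-level title, ...
    text: str = ""                  # text block directly under this title
    children: List["SectionNode"] = field(default_factory=list)

def build_tree(elem, level=0):
    """Recursively convert nested <section> elements into SectionNode objects."""
    node = SectionNode(title=elem.get("title", ""), level=level,
                       text=(elem.text or "").strip())
    for child in elem.findall("section"):
        node.children.append(build_tree(child, level + 1))
    return node

# Hypothetical bulletin in an assumed XML layout.
SAMPLE = """<section title="Announcement">
  <section title="1. Bid-winning project">Project details.
    <section title="1.1 Price">About 24.5661 hundred million yuan.</section>
  </section>
  <section title="2. Documents for reference">Notice of award.</section>
</section>"""

tree = build_tree(ET.fromstring(SAMPLE))
```

Each node keeps its title next to its text block, which is what the later title classification needs in order to decide whether a block is effective.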
Step 2: Locate the effective blocks by fuzzy matching against the labeled content, and extract the corresponding titles of the effective blocks.
Step 3: Pad short titles and truncate long titles so that every title has the preset number of characters. In this example, because titles are relatively short, the preset length is made as long as practical so that no information is lost. Chinese BERT embeddings are used for encoding. The statistics in this example show that most titles are within 25 characters, so the length is set to 25 characters: a shorter title is padded by repetition, and a longer title is truncated to its first 25 characters.
Step 4: Divide the processed data set into a training set (80%) and a test set (20%), input the encoded data and their labels into a BiLSTM-CRF network for training, and perform binary classification with softmax as the final activation function to obtain a title classification model.
Step 5: Classify the document titles with the title classification model, confirm the range of each effective text block, and store the effective text blocks in key-value form in a MongoDB database. Steps 1 to 5 form the data preprocessing stage; the flow is shown in FIG. 1.
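The length normalization of Step 3 and the 80/20 split of Step 4 can be sketched as follows. The sketch assumes that "padding by repetition" means repeating the title cyclically until the preset length is reached, which the patent does not state explicitly; the function names and the fixed random seed are likewise assumptions.

```python
import random

def normalize_title(title, length=25):
    """Truncate to the first `length` characters, or pad a shorter title by
    repeating it cyclically (assumed reading of the repetition strategy)."""
    if not title:
        raise ValueError("title must not be empty")
    if len(title) >= length:
        return title[:length]
    repeats = length // len(title) + 1
    return (title * repeats)[:length]

def split_dataset(samples, train_ratio=0.8, seed=0):
    """Shuffle deterministically and split into training and test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_ratio)
    return items[:cut], items[cut:]

padded = normalize_title("abc", length=7)      # cyclic padding of a short title
train, test = split_dataset(range(10))         # 8 training and 2 test samples
```

After normalization every title has the same length, so the character-level BERT vectors form fixed-size inputs for the BiLSTM-CRF network.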
Step 6: After the effective text blocks are obtained, mask the complex entities in them. For example, take the text "The consortium formed by this company and its subordinates 'Zhongtiejiu Ju Co., Ltd.', 'Zhongtieseventeen Ju Co.' and others won the bid for the national highway network G85 Yukun highway bay to Showa section highway investor and cooperative contract constructor bid C cooperative contract construction section, at a winning bid price of about 24.5661 hundred million yuan." The complex entity in this text is "national highway network G85 Yukun highway bay to Showa section highway investor and cooperative contract constructor bid C cooperative contract construction section". The complex entity is masked as "project", and the masked text reads: "The consortium formed by this company and its subordinates 'Zhongtiejiu Ju Co., Ltd.', 'Zhongtieseventeen Ju Co.' and others won the bid for the project, at a winning bid price of about 24.5661 hundred million yuan."
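The masking in Step 6 amounts to replacing a long entity string with a short referring term. A minimal sketch follows, assuming the complex entities have already been identified and are supplied as a mapping from entity text to mask term; how the entities are located is not specified here, and `mask_entities` is a hypothetical helper. The sample sentence is a shortened, hypothetical variant of the example above.

```python
def mask_entities(text, entity_to_mask):
    """Replace each known complex entity with its short referring term.
    Longer entities are replaced first so nested matches are not broken up."""
    masked = text
    replaced = []
    for entity in sorted(entity_to_mask, key=len, reverse=True):
        if entity in masked:
            masked = masked.replace(entity, entity_to_mask[entity])
            replaced.append(entity)
    return masked, replaced

text = ("The consortium won the bid for the G85 Yukun Expressway section C "
        "cooperative contract construction section at about 24.5661 hundred million yuan.")
entity = "the G85 Yukun Expressway section C cooperative contract construction section"
masked, replaced = mask_entities(text, {entity: "the project"})
```

Keeping the list of replaced entities makes it possible to restore the full entity text after the extraction model has tagged the short term.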
Step 7: Construct the masked semantic model M-MST, encode the labeled data with BERT, divide the data into an 80% training set and a 20% test set, and feed the training set into the M-MST model for training to obtain a knowledge extraction model. The M-MST structure is shown in FIG. 2.
Step 8: In combination with the external Baidu Encyclopedia knowledge base, obtain word vectors of the entities and their attributes that carry domain-specific context semantic information, and complete entity fusion with the Levenshtein algorithm to obtain entity-relationship triples.
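$$\mathrm{Lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0,\\ \min\begin{cases}\mathrm{Lev}_{a,b}(i-1,j)+1\\ \mathrm{Lev}_{a,b}(i,j-1)+1\\ \mathrm{Lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
where a and b denote the two entity word vectors, i and j denote indices into a and b, and Lev_{a,b}(i, j) denotes the Levenshtein distance between the first i elements of a and the first j elements of b, which serves as the similarity value between entities a and b.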
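The Levenshtein-based fusion of Step 8 can be sketched with the standard dynamic-programming edit distance. The sketch works on plain strings rather than word vectors, and the normalized similarity and the fusion threshold of 0.8 are assumptions added for illustration; the patent does not state a threshold.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))            # distances for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Distance normalized to [0, 1]; 1.0 means identical (assumed measure)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def should_fuse(a, b, threshold=0.8):
    """Treat two entity names as the same entity above a hypothetical threshold."""
    return similarity(a, b) >= threshold

distance = levenshtein("kitten", "sitting")
```

Only two rows of the distance table are kept at a time, so memory grows with the length of the shorter loop dimension rather than with the full table.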
Step 9: Knowledge graph construction and incremental updating or expansion. Based on the fused financial-field entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, and an OGM is used to design and define the triple objects in Neo4j. The procedure is as follows:
a) Construct the knowledge graph from the entity-fusion triple knowledge base;
b) Compute the word vector of each newly added entity or attribute in a unified semantic environment;
c) If the new item is an attribute, perform the following operations:
i. Judge whether it is a newly added attribute according to the distance threshold or an external knowledge base;
ii. If yes → add it to the corresponding entity-attribute triple;
iii. If not → judge whether it is an attribute that needs to be updated (attribute information changes over time);
iv. If yes → update the current attribute value (nodes and edges) and record the update time and frequency;
v. If not → make no modification; the repetition frequency may be recorded.
d) If the new item is an entity, perform the following operations:
i. Judge whether it is a newly added entity according to the distance threshold;
ii. If yes → determine the optimal insertion position by clustering and optimization analysis, for example by counting the newly added relations and the changed relations, so as to obtain the optimal knowledge graph for the application target;
iii. If not → judge whether it is an entity that needs to be updated (attribute information changes over time);
iv. If yes → update the entity (nodes and edges) based on the optimization method and record the update time and frequency;
v. If not → make no modification; the repetition frequency may be recorded.
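The attribute branch of the incremental update can be sketched as a small decision function over an in-memory store. This is a simplification under stated assumptions: the graph is a plain dictionary rather than Neo4j, and "newly added" is decided by exact key lookup instead of a distance threshold or external knowledge base. The function `upsert_attribute` and the record fields are hypothetical.

```python
from datetime import datetime, timezone

def upsert_attribute(graph, entity, attr, value, now=None):
    """Add, update or skip one entity attribute.
    graph: dict mapping entity -> attribute -> record (assumed in-memory store)."""
    now = now or datetime.now(timezone.utc)
    attrs = graph.setdefault(entity, {})      # a new entity gets an empty attribute set
    record = attrs.get(attr)
    if record is None:                        # newly added attribute
        attrs[attr] = {"value": value, "updated_at": now,
                       "updates": 0, "repeats": 0}
        return "added"
    if record["value"] != value:              # attribute changed over time
        record["value"] = value
        record["updated_at"] = now            # record the update time
        record["updates"] += 1                # record the update frequency
        return "updated"
    record["repeats"] += 1                    # unchanged: record the repetition
    return "unchanged"

graph = {}
t0 = datetime(2021, 5, 12, tzinfo=timezone.utc)
results = [
    upsert_attribute(graph, "Company A", "bid_price", "24.5661", now=t0),
    upsert_attribute(graph, "Company A", "bid_price", "24.5661", now=t0),
    upsert_attribute(graph, "Company A", "bid_price", "25.0", now=t0),
]
```

The entity branch follows the same three-way pattern, with the additional placement step for new entities that this sketch leaves out.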
The system implementing the knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following modules, connected in sequence: a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module, and a multi-source financial bulletin document knowledge graph construction module.
The document structure tree construction module: depending on the format of the document data (xml/pdf), structures the hierarchical relationship of the chapters of the document by XML structure extraction or optical character recognition (OCR), and constructs a relatively complete document structure tree (sessionTree).
The title data labeling module: labels all title data. It locates the key information by regular-expression fuzzy matching, extracts the title of the effective text block that contains the key information and marks it as an effective title, and marks the remaining titles as invalid titles.
The vector representation construction module: unifies the title length to a preset number of characters, and performs character-level embedding with BERT to obtain the corresponding vector representations.
The title classification model construction module: divides the processed data set into an 80% training set and a 20% test set, feeds the vectors into a BiLSTM-CRF neural network for training, and performs binary classification of the titles through Softmax to obtain a title classification model.
The document title classification module: classifies the document titles with the title classification model, confirms the range of each effective text block, and stores the effective text blocks in key-value form in a MongoDB database.
The complex effective knowledge mask module: masks the complex effective knowledge in the effective text blocks by replacing it with a short referring term, so as to reduce the influence of the complex knowledge on the context semantics and to capture the context semantic information of the knowledge accurately, and labels the effective knowledge in the text blocks in BIO form.
The knowledge extraction model construction module: constructs a masked semantic model, namely the multi-source same-topic generalized-mask Bi-LSTM semantic model M-MST (Mask-Multiple Sources One Topic Bi-LSTM Model); encodes the labeled data with BERT embeddings, divides the data into an 80% training set and a 20% test set, and feeds the training set into the M-MST model for training to obtain a knowledge extraction model.
The entity relationship triple construction module: using the knowledge extraction model in combination with an external knowledge base, obtains word vectors of the entities and their attributes that carry domain-specific context semantic information, and completes entity fusion with the Levenshtein algorithm to obtain entity-relationship triples. Specifically, in combination with the external Baidu Encyclopedia knowledge base, the module obtains word vectors of the entities and their attributes that carry domain-specific context semantic information, and completes entity fusion with the Levenshtein algorithm to obtain entity-relationship triples.
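$$\mathrm{Lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0,\\ \min\begin{cases}\mathrm{Lev}_{a,b}(i-1,j)+1\\ \mathrm{Lev}_{a,b}(i,j-1)+1\\ \mathrm{Lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{cases} & \text{otherwise,}\end{cases}$$
In the formula, a and b denote the two entity word vectors, i and j denote indices into a and b, and Lev_{a,b}(i, j) denotes the Levenshtein distance between the first i elements of a and the first j elements of b, which serves as the similarity value between entities a and b.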
The multi-source financial bulletin document knowledge map building module: based on the fused financial field entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, the design and definition of the triplet objects in Neo4j are realized by using OGM, and the knowledge graph of the multi-source financial bulletin document is constructed and incremental updating or expansion is realized. The method specifically comprises the following steps:
e) construct the knowledge graph from the entity-fusion triple knowledge base;
f) compute the word vector of each newly arrived entity or attribute in a unified semantic environment;
g) if the new item is an attribute, perform the following operations:
vi. judge whether it is a newly added attribute according to the distance threshold or an external knowledge base;
vii. yes → add it to the corresponding entity attribute triple;
viii. no → judge whether it is an attribute that needs to be updated (attribute information changes over time);
ix. yes → update the current attribute value (nodes and edges), and record the update time and frequency;
x. no → make no modification; the repetition frequency may be recorded.
h) if the new item is an entity, perform the following operations:
vi. judge whether it is a newly added entity according to the distance threshold;
vii. yes → determine the optimal insertion position using clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the knowledge graph that best fits the application target;
viii. no → judge whether it is an entity that needs to be updated (attribute information changes over time);
ix. yes → update the entity (nodes and edges) based on the optimization method, and record the update time and frequency;
x. no → make no modification; the repetition frequency may be recorded.
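The attribute branch of the decision flow in step g) can be sketched as follows. This is an in-memory illustration only: the `Attribute` class, the `upsert_attribute` function, the use of cosine distance, and the 0.2 threshold are all assumptions introduced for the example, and no Neo4j or OGM calls are shown. The entity branch in step h) would follow the same shape, with the additional choice of an insertion position.

```python
import math
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


@dataclass
class Attribute:
    name: str
    value: str
    vector: List[float]
    update_count: int = 0                  # how often the value was changed
    repeat_count: int = 0                  # how often the same value was seen again
    updated_at: Optional[datetime] = None  # time of the last change


def upsert_attribute(attrs: Dict[str, Attribute], name: str, value: str,
                     vector: Sequence[float], threshold: float = 0.2) -> str:
    """Apply the add / update / no-change decision to one incoming attribute."""
    nearest = min(attrs.values(),
                  key=lambda a: cosine_distance(a.vector, vector),
                  default=None)
    # Farther than the threshold from every known attribute: newly added.
    if nearest is None or cosine_distance(nearest.vector, vector) > threshold:
        attrs[name] = Attribute(name, value, list(vector))
        return "added"
    # Known attribute whose value changed over time: update and record it.
    if nearest.value != value:
        nearest.value = value
        nearest.update_count += 1
        nearest.updated_at = datetime.now(timezone.utc)
        return "updated"
    # Known attribute, same value: no modification, only count the repetition.
    nearest.repeat_count += 1
    return "unchanged"
```

Returning a status string keeps the sketch easy to check; a real implementation would presumably write the change to the graph store instead.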
The invention aims to address the scarcity of Chinese knowledge graph resources. It integrates knowledge extraction with knowledge graph construction, targeting in particular the extraction of complex entities and the construction of a knowledge graph in the financial domain. A document structure tree is built using XML structure extraction or optical character recognition (OCR), and the titles of the effective text blocks containing the effective information are labeled by regular-expression fuzzy matching. After each title is padded or truncated to a uniform number of characters, BERT is used for character-level embedding to obtain the corresponding vectors, which are fed into a BiLSTM-CRF network and then classified by Softmax. The result is an incrementally updatable knowledge graph and knowledge inference system for multi-source Chinese financial bulletin documents, which supplies associated entity (enterprise, individual, etc.) and event information in the domain for risk decision-making and other practical applications, thereby addressing the high cost, low efficiency, high threshold, and poor timeliness of enterprise risk prediction and analysis.
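The title labeling and length-unification steps mentioned above can be illustrated with a short sketch. The patterns in `KEY_PATTERNS`, the `[PAD]` token, and both function names are hypothetical choices for the example; the patent does not list its actual patterns or its preset title length, and the BERT embedding and BiLSTM-CRF training that follow are not shown.

```python
import re
from typing import List, Tuple

# Hypothetical key-information patterns; real patterns would be domain-specific.
KEY_PATTERNS = [
    re.compile(r"担保"),
    re.compile(r"关联\s*交易"),
    re.compile(r"股东.*持股"),
]


def label_titles(blocks: List[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Label each (title, text) block 1 (effective) if its text matches any pattern, else 0."""
    return [(title, int(any(p.search(text) for p in KEY_PATTERNS)))
            for title, text in blocks]


def pad_or_truncate(title: str, length: int, pad: str = "[PAD]") -> List[str]:
    """Return exactly `length` character-level tokens: cut long titles, pad short ones."""
    chars = list(title)[:length]
    return chars + [pad] * (length - len(chars))
```

For instance, `pad_or_truncate("公司简介", 6)` yields the four characters followed by two `[PAD]` tokens, so every title reaches the classifier with the same length.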
The invention has been illustrated by the above examples, but the examples are for illustrative purposes only and do not limit the invention to their scope. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that the technical solutions described in the examples can be modified, or some of their technical features can be equivalently replaced, and that such modifications or substitutions do not depart from the scope of the invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (2)

1. A knowledge graph construction method for multi-source Chinese financial bulletin documents, comprising the following steps:
step 1: according to the format of the document data (xml/pdf), structuring the hierarchical relationship of the chapters of the document using XML structure extraction or optical character recognition (OCR), and constructing a relatively complete document structure tree (sessionTree);
step 2: labeling all title data: locating the key information by regular-expression fuzzy matching, extracting the title of the effective text block in which the key information is located and marking it as an effective title, and marking the remaining titles as invalid titles;
step 3: unifying the title length to a preset number of characters, and performing character-level embedding with BERT to obtain the corresponding vector representation;
step 4: dividing the processed data set into a training set and a test set, feeding the obtained vectors into a BiLSTM-CRF neural network for training, and performing binary classification of the titles through Softmax to obtain a title classification model;
step 5: classifying the document titles using the title classification model, thereby confirming the range of the effective text blocks, and storing them in key-value form in a MongoDB database;
step 6: masking the complex effective knowledge in the effective text blocks by replacing it with a short entity, so as to reduce the influence of complex knowledge on contextual semantics and to capture accurately the contextual semantic information of the knowledge to be extracted, and labeling the text blocks in BIO form with respect to the effective knowledge;
step 7: constructing a masked semantic model, namely a Bi-LSTM semantic model with a multi-source similar generalization mask, M-MST (mask-Multiple Sources One recent Bi-LSTM Model), performing embedding on the labeled data using BERT, dividing the data into a training set and a test set, and training the M-MST model to obtain a knowledge extraction model;
step 8: according to the knowledge extraction model and in combination with an external knowledge base, obtaining word vectors of the entities and their attributes that carry domain-specific contextual semantic information, and completing entity fusion using the Levenshtein algorithm to obtain entity relationship triples;
[Formula image not reproduced in this text: definition of Lev_{a,b}(i, j)]
where a and b denote the two entity word vectors, i and j denote vector indices, and Lev_{a,b}(i, j) denotes the similarity value between entities a and b;
step 9: based on the fused financial-domain entity triples, using a high-performance NoSQL graph database such as Neo4j for storage, display, and query, designing and defining the triple objects in Neo4j using an OGM (object graph mapper), constructing the knowledge graph of the multi-source financial bulletin documents, and incrementally updating or extending it, specifically comprising:
a) constructing the knowledge graph from the entity-fusion triple knowledge base;
b) computing the word vector of each newly arrived entity or attribute in a unified semantic environment;
c) if the new item is an attribute, performing the following operations:
i. judging whether it is a newly added attribute according to the distance threshold or an external knowledge base;
ii. yes → adding it to the corresponding entity attribute triple;
iii. no → judging whether it is an attribute that needs to be updated (attribute information changes over time);
iv. yes → updating the current attribute value (nodes and edges), and recording the update time and frequency;
v. no → making no modification; the repetition frequency may be recorded;
d) if the new item is an entity, performing the following operations:
i. judging whether it is a newly added entity according to the distance threshold;
ii. yes → determining the optimal insertion position using clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the knowledge graph that best fits the application target;
iii. no → judging whether it is an entity that needs to be updated (attribute information changes over time);
iv. yes → updating the entity (nodes and edges) based on the optimization method, and recording the update time and frequency;
v. no → making no modification; the repetition frequency may be recorded.
2. A system for implementing the knowledge graph construction method for multi-source Chinese financial bulletin documents of claim 1, comprising a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module, and a multi-source financial bulletin document knowledge graph construction module, connected in sequence;
the document structure tree construction module: according to the format of the document data (xml/pdf), structuring the hierarchical relationship of the chapters of the document using XML structure extraction or optical character recognition (OCR), and constructing a relatively complete document structure tree (sessionTree);
the title data labeling module: labeling all title data: locating the key information by regular-expression fuzzy matching, extracting the title of the effective text block in which the key information is located and marking it as an effective title, and marking the remaining titles as invalid titles;
the vector representation construction module: unifying the title length to a preset number of characters, and performing character-level embedding with BERT to obtain the corresponding vector representation;
the title classification model construction module: dividing the processed data set into a training set and a test set, feeding the obtained vectors into a BiLSTM-CRF neural network for training, and performing binary classification of the titles through Softmax to obtain a title classification model;
the document title classification module: classifying the document titles using the title classification model, thereby confirming the range of the effective text blocks, and storing them in key-value form in a MongoDB database;
the complex effective knowledge mask module: masking the complex effective knowledge in the effective text blocks by replacing it with a short entity, so as to reduce the influence of complex knowledge on contextual semantics and to capture accurately the contextual semantic information of the knowledge to be extracted, and labeling the text blocks in BIO form with respect to the effective knowledge;
the knowledge extraction model construction module: constructing a masked semantic model, namely a Bi-LSTM semantic model with a multi-source similar generalization mask, M-MST (mask-Multiple Sources One recent Bi-LSTM Model), performing embedding on the labeled data using BERT, dividing the data into a training set and a test set, and training the M-MST model to obtain a knowledge extraction model;
the entity relationship triple construction module: according to the knowledge extraction model and in combination with an external knowledge base, obtaining word vectors of the entities and their attributes that carry domain-specific contextual semantic information, and completing entity fusion using the Levenshtein algorithm to obtain entity relationship triples;
the multi-source financial bulletin document knowledge graph construction module: based on the fused financial-domain entity triples, using a high-performance NoSQL graph database such as Neo4j for storage, display, and query, designing and defining the triple objects in Neo4j using an OGM (object graph mapper), constructing the knowledge graph of the multi-source financial bulletin documents, and incrementally updating or extending it.
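As an informal illustration of the masking and BIO labeling performed by the complex effective knowledge mask module, the sketch below replaces a long entity with a short placeholder and then assigns character-level BIO tags. The placeholder `甲公司`, the `ORG` tag, and both function names are assumptions made for the example and are not taken from the claims.

```python
from typing import List, Tuple


def mask_complex(text: str, complex_entities: List[str], short: str = "甲公司") -> str:
    """Replace each complex entity with a short placeholder, longest entities first."""
    masked = text
    for entity in sorted(complex_entities, key=len, reverse=True):
        masked = masked.replace(entity, short)
    return masked


def bio_tags(text: str, entities: List[Tuple[str, str]]) -> List[str]:
    """Character-level BIO tags; `entities` is a list of (surface form, type) pairs."""
    tags = ["O"] * len(text)
    for surface, etype in entities:
        if not surface:
            continue  # an empty surface form would never advance the search
        start = text.find(surface)
        while start != -1:
            end = start + len(surface)
            # Tag only spans that do not overlap an already tagged entity.
            if all(tag == "O" for tag in tags[start:end]):
                tags[start] = f"B-{etype}"
                for k in range(start + 1, end):
                    tags[k] = f"I-{etype}"
            start = text.find(surface, end)
    return tags
```

Masking first shortens the sequence the model must read, which is the stated purpose of the step: the long entity no longer dominates the context around the knowledge to be extracted.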

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110517049.4A CN113569054A (en) 2021-05-12 2021-05-12 Knowledge graph construction method and system for multi-source Chinese financial bulletin document

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110517049.4A CN113569054A (en) 2021-05-12 2021-05-12 Knowledge graph construction method and system for multi-source Chinese financial bulletin document

Publications (1)

Publication Number Publication Date
CN113569054A (en) 2021-10-29

Family

ID=78161480

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110517049.4A Pending CN113569054A (en) 2021-05-12 2021-05-12 Knowledge graph construction method and system for multi-source Chinese financial bulletin document

Country Status (1)

Country Link
CN (1) CN113569054A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114416705A (en) * 2021-11-09 2022-04-29 北京泰策科技有限公司 Multi-source heterogeneous data fusion modeling method
CN114495143A (en) * 2021-12-24 2022-05-13 北京百度网讯科技有限公司 Text object identification method and device, electronic equipment and storage medium
CN114495143B (en) * 2021-12-24 2024-03-22 北京百度网讯科技有限公司 Text object recognition method and device, electronic equipment and storage medium
CN114398464A (en) * 2021-12-28 2022-04-26 北方工业大学 Knowledge graph-based discussion data display method and system
CN114398464B (en) * 2021-12-28 2023-01-24 北方工业大学 Knowledge graph-based discussion data display method and system
CN114417015A (en) * 2022-01-26 2022-04-29 西南交通大学 Method for constructing maintainability knowledge graph of high-speed train
CN114417015B (en) * 2022-01-26 2023-05-12 西南交通大学 High-speed train maintainability knowledge graph construction method
CN115630174A (en) * 2022-12-21 2023-01-20 上海金仕达软件科技有限公司 Multi-source bulletin document processing method and device, storage medium and electronic equipment
CN115630174B (en) * 2022-12-21 2023-07-21 上海金仕达软件科技股份有限公司 Multisource bulletin document processing method and device, storage medium and electronic equipment
CN116340530A (en) * 2023-02-17 2023-06-27 江苏科技大学 Intelligent design method based on mechanical knowledge graph
CN116090560A (en) * 2023-04-06 2023-05-09 北京大学深圳研究生院 Knowledge graph establishment method, device and system based on teaching materials

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination