CN113569054A - Knowledge graph construction method and system for multi-source Chinese financial bulletin document - Google Patents
- Publication number
- CN113569054A (application CN202110517049.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/367: Information retrieval of unstructured textual data; Creation of semantic tools, e.g. ontology or thesauri; Ontology
- G06F16/35: Information retrieval of unstructured textual data; Clustering; Classification
- G06F40/295: Handling natural language data; Natural language analysis; Recognition of textual entities; Phrasal analysis, e.g. finite state techniques or chunking; Named entity recognition
- G06F40/30: Handling natural language data; Semantic analysis
Abstract
A knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following steps: structuring the hierarchical relationship among the chapters of a document and constructing a relatively complete document structure tree; labeling all title data; unifying the title length to a preset number of characters and performing character-level embedding with BERT to obtain the corresponding vector representations; dividing the processed data set into a training set and a test set and training a title classification model; classifying the document titles with the title classification model; masking the complex effective knowledge in the effective text blocks; constructing a masked semantic model, namely the multi-source, same-category generalized-mask Bi-LSTM semantic model M-MST, and training it to obtain a knowledge extraction model; obtaining entity relationship triples from the knowledge extraction model in combination with an external knowledge base; and constructing a knowledge graph of the multi-source financial bulletin documents that supports incremental updating or expansion. A system implementing the method is also provided.
Description
Technical Field
The invention relates to a method and a system for knowledge extraction and knowledge graph construction, in particular to the extraction of complex entities in the financial field and the construction of financial-domain knowledge graphs.
The invention relates to the fields of natural language processing, knowledge graphs, and deep learning, in particular to deep-learning-based modeling.
Background
Listed companies are the leaders of China's economic development, and innovative small and medium-sized private enterprises are a supporting force for economic growth; both face, and will for some time continue to face, a variety of challenges. Production cannot stop and development cannot change gear, so, as the saying goes, provisions must be in place first. The capital market is the granary from which listed companies replenish their "blood". At the beginning of the year, the new refinancing rules issued by the China Securities Regulatory Commission (CSRC) greatly relaxed the refinancing restrictions on listed companies, particularly those on the ChiNext board. At the same time, the CSRC stressed that it would continue to improve the day-to-day supervision of listed companies, strictly enforce issuance conditions, and strengthen risk prevention and control measures such as information disclosure requirements. This project extracts industrially meaningful information from financial bulletin texts of various kinds (such as shareholding increase and reduction bulletins, contract bulletins, listing bulletins, and monthly and annual performance bulletins), constructs a financial bulletin knowledge graph that can be expanded automatically and incrementally, and supports regulators and researchers in risk analysis and early warning, management decisions, and model research.
Constructing a knowledge graph from financial bulletin documents faces the following problems:
(1) Financial bulletin documents contain a large amount of redundant and invalid information, which makes information extraction relatively difficult.
(2) The financial field contains many entities with complex structures, so the context of an entity is hard to capture and its boundaries are hard to determine.
(3) In a financial knowledge graph the same entity may appear under different names, so entity fusion is required.
A Knowledge Graph (KG), also called a scientific knowledge graph, is a very large-scale semantic network that describes entities or concepts in the real world and the relations among them, built through mining, analysis, construction, drawing, and display. It was introduced by Google in 2012 to improve its search function, and applications based on knowledge graph technology have developed rapidly since. A knowledge graph consists of a data layer and a schema layer. The data layer forms a graph knowledge base from triples of the form entity-relation-entity or entity-attribute-value. As research has deepened and data has accumulated, named entity recognition and attribute identification have evolved from dictionary-based and rule-based methods to fully supervised deep learning, semi-supervised deep learning, domain transfer learning, and similar methods. The schema layer provides the conceptual model and logical basis of the knowledge graph, imposes normative constraints on the data layer, and provides the capability for knowledge reasoning. When a knowledge graph is constructed from entities (and attributes) extracted from heterogeneous multi-source data, entity alignment and entity disambiguation are important steps. More and more companies and research institutions use knowledge graph technology to provide better services in fields such as medicine, biology, news, and new media.
An industry or domain knowledge graph targets a specific field and supports knowledge reasoning, auxiliary analysis, decision support, and similar functions. Examples include, in the medical field, a traditional Chinese medicine medical record knowledge graph, a traditional Chinese medicine and pharmacology semantic network, a Chinese symptom database, a breast cancer knowledge graph, and Linked Life Data; in the transportation field, a city knowledge graph based on CNschema; in the humanities, the Shanghai Library linked open data set of celebrity manuscript archives, a Chinese genealogy linked data set, and UMLS; in tourism, a knowledge graph of Chinese tourist attractions; and in film, a bilingual movie knowledge graph and the Linked Movie Dataset. With socioeconomic development, the volume of enterprise data has grown rapidly, and the need to use that data effectively in practice has gradually become apparent, especially for enterprise risk analysis and prediction. However, knowledge graphs for the Chinese financial field are scarce, in particular domain knowledge graphs applicable to small, medium-sized, and micro enterprises. This project therefore aims to build an incrementally updated knowledge graph and knowledge reasoning system for Chinese multi-source financial bulletin documents, providing information on relevant entities (enterprises, individuals, and so on) and events for risk decision-making and other practical needs, and thereby addressing the high cost, low efficiency, high threshold, and poor timeliness of enterprise risk prediction and analysis.
Disclosure of Invention
To overcome the relative scarcity of knowledge graphs for the Chinese financial field in the prior art, the invention provides a knowledge graph construction method and system for multi-source Chinese financial bulletin documents.
The knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following steps:
Step 1: According to the format of the document data (xml/pdf), structure the hierarchical relationship among the chapters of the document by xml structure extraction or optical character recognition (OCR), and construct a relatively complete document structure tree (sessionTree).
Step 2: Label all title data. Locate the key information by regular-expression fuzzy matching, extract the title of the effective text block that contains the key information and mark it as an effective title, and mark the remaining titles as invalid titles.
Step 3: Unify the title length to a preset number of characters, and perform character-level embedding with BERT to obtain the corresponding vector representations.
Step 4: Divide the processed data set into a training set and a test set, feed the vectors into a BiLSTM-CRF neural network for training, and perform binary classification of the titles through Softmax to obtain a title classification model.
Step 5: Classify the document titles with the title classification model, thereby confirming the range of each effective text block, and store the effective text blocks in key-value form in a MongoDB database.
Step 6: Mask the complex effective knowledge in the effective text blocks by replacing it with a short referring entity. This reduces the influence of the complex knowledge on the context semantics, so that the context semantic information of the knowledge can be captured accurately. Then label the text blocks in BIO format with respect to the effective knowledge.
Step 7: Construct a masked semantic model, namely the multi-source, same-category generalized-mask Bi-LSTM semantic model M-MST (mask-Multiple Sources One Topic Bi-LSTM Model). Encode the labeled data with BERT embeddings, divide it into a training set and a test set, and feed the training set into the M-MST model to train a knowledge extraction model.
Step 8: Using the knowledge extraction model in combination with an external knowledge base, obtain word vectors of the entities and their attributes that carry domain-specific context semantic information, and complete entity fusion with the Levenshtein algorithm to obtain entity relationship triples.
Step 9: Store, display, and query the fused financial-domain entity triples in a high-performance NoSQL database such as Neo4j, use OGM to design and define the triple objects in Neo4j, construct the knowledge graph of the multi-source financial bulletin documents, and implement incremental updating or expansion.
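As an illustration of how the fused triples of the storage step might be written to a graph database, the following minimal sketch turns (head, relation, tail) triples into parameterized Cypher `MERGE` statements. It is not the OGM-based design named in the text; the node label `Entity`, the relationship type `RELATED`, and the property names are hypothetical choices made for this example. Because Cypher does not allow a relationship type to be passed as a parameter, the relation name is stored as a property.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# MERGE creates a node or relationship only when it does not exist yet,
# so replaying the same triple does not duplicate data.
# Label "Entity" and type "RELATED" are assumptions of this sketch.
MERGE_QUERY = (
    "MERGE (h:Entity {name: $head}) "
    "MERGE (t:Entity {name: $tail}) "
    "MERGE (h)-[r:RELATED {type: $relation}]->(t)"
)


def triples_to_statements(triples: List[Triple]) -> List[Tuple[str, Dict[str, str]]]:
    """Pair the MERGE query with one parameter dictionary per triple."""
    return [
        (MERGE_QUERY, {"head": head, "relation": relation, "tail": tail})
        for head, relation, tail in triples
    ]


statements = triples_to_statements([
    ("Company A", "wins bid for", "Project X"),
    ("Company A", "subsidiary", "Company B"),
])
```

Each (query, parameters) pair could then be executed through a database session. Since `MERGE` is idempotent, re-running the statements for a later batch of bulletins only adds what is new, which is one simple way to approach incremental expansion.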
The system implementing the knowledge graph construction method for multi-source Chinese financial bulletin documents comprises, connected in sequence, a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module, and a multi-source financial bulletin document knowledge graph construction module.
The document structure tree construction module: according to the format of the document data (xml/pdf), structures the hierarchical relationship among the chapters of the document by xml structure extraction or optical character recognition (OCR), and constructs a relatively complete document structure tree (sessionTree).
The title data labeling module: labels all title data. It locates the key information by regular-expression fuzzy matching, extracts the title of the effective text block that contains the key information and marks it as an effective title, and marks the remaining titles as invalid titles.
The vector representation construction module: unifies the title length to a preset number of characters, and performs character-level embedding with BERT to obtain the corresponding vector representations.
The title classification model construction module: divides the processed data set into a training set and a test set, feeds the vectors into a BiLSTM-CRF neural network for training, and performs binary classification of the titles through Softmax to obtain a title classification model.
The document title classification module: classifies the document titles with the title classification model, thereby confirming the range of each effective text block, and stores the effective text blocks in key-value form in a MongoDB database.
The complex effective knowledge mask module: masks the complex effective knowledge in the effective text blocks by replacing it with a short referring entity, which reduces the influence of the complex knowledge on the context semantics so that the context semantic information of the knowledge can be captured accurately, and labels the text blocks in BIO format with respect to the effective knowledge.
The knowledge extraction model construction module: constructs a masked semantic model, namely the multi-source, same-category generalized-mask Bi-LSTM semantic model M-MST (mask-Multiple Sources One Topic Bi-LSTM Model), encodes the labeled data with BERT embeddings, divides it into a training set and a test set, and feeds the training set into the M-MST model to train a knowledge extraction model.
The entity relationship triple construction module: using the knowledge extraction model in combination with an external knowledge base, obtains word vectors of the entities and their attributes that carry domain-specific context semantic information, and completes entity fusion with the Levenshtein algorithm to obtain entity relationship triples.
The multi-source financial bulletin document knowledge graph construction module: stores, displays, and queries the fused financial-domain entity triples in a high-performance NoSQL database such as Neo4j, uses OGM to design and define the triple objects in Neo4j, constructs the knowledge graph of the multi-source financial bulletin documents, and implements incremental updating or expansion.
The method uses xml structure extraction or optical character recognition (OCR) to construct a document structure tree, and labels the titles of the effective text blocks that contain the effective information by regular-expression fuzzy matching. After each title is padded or truncated to a uniform number of characters, BERT is used to embed each character and obtain the corresponding vectors, which are fed into a BiLSTM-CRF network and classified by Softmax.
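The padding and truncation of titles can be sketched as follows. This is a minimal illustration, assuming that padding by repetition means repeating the title text itself until the target length is reached; the function name `normalize_title` is hypothetical, and the default of 25 characters is the value used in the embodiment described later in this document.

```python
def normalize_title(title: str, target_len: int = 25) -> str:
    """Return the title padded by repetition or truncated to target_len characters."""
    if target_len <= 0:
        raise ValueError("target_len must be positive")
    if not title:
        raise ValueError("title must be non-empty")
    if len(title) >= target_len:
        # Too long (or exactly right): keep only the first target_len characters.
        return title[:target_len]
    # Too short: repeat the title enough times, then cut to the exact length.
    repeats = -(-target_len // len(title))  # ceiling division
    return (title * repeats)[:target_len]


short_title = normalize_title("abc", target_len=7)
long_title = normalize_title("x" * 30)
```

Every title then has the same length, so a batch of titles can be embedded character by character into tensors of identical shape before being passed to the classifier.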
Accurate effective text blocks are obtained from the title classification model. The complex effective knowledge in each block is masked by replacing it with a short referring entity, which reduces the influence of the complex knowledge on the context semantics so that the context semantic information of the knowledge can be captured accurately, and the text block is labeled in BIO format with respect to the effective knowledge. The masked semantic model M-MST is then constructed to extract the effective information.
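The masking and BIO labeling described above can be illustrated with a small sketch. It assumes the complex entity has already been located as an exact substring, and that the BIO tags are assigned per character to the placeholder in the masked text; the function names and the label `PROJ` are hypothetical.

```python
from typing import List, Tuple


def mask_entity(text: str, entity: str, placeholder: str) -> str:
    """Replace the first occurrence of a long entity with a short placeholder."""
    if entity not in text:
        raise ValueError("entity not found in text")
    return text.replace(entity, placeholder, 1)


def bio_tags(text: str, span: str, label: str) -> List[Tuple[str, str]]:
    """Character-level BIO tags for the first occurrence of span in text."""
    tags = ["O"] * len(text)
    start = text.find(span)
    if span and start >= 0:
        tags[start] = f"B-{label}"          # first character of the span
        for k in range(start + 1, start + len(span)):
            tags[k] = f"I-{label}"          # remaining characters of the span
    return list(zip(text, tags))


original = ("The consortium won the bid for the "
            "G85 expressway section C construction contract.")
masked = mask_entity(
    original, "G85 expressway section C construction contract", "project"
)
tagged = bio_tags(masked, "project", "PROJ")
```

After masking, the sentence is short enough that the words around the placeholder ("won the bid for") dominate the context, which is the effect the masking is meant to achieve.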
Using the knowledge extraction model in combination with an external knowledge base, word vectors of the entities and their attributes that carry domain-specific context semantic information are obtained, and entity fusion is completed with the Levenshtein algorithm to obtain entity relationship triples. A high-performance NoSQL database such as Neo4j is used for storage, display, and query, OGM is used to design and define the triple objects in Neo4j, and the knowledge graph of the multi-source financial bulletin documents is constructed.
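The entity fusion step can be sketched with a plain Levenshtein edit distance over entity names. This is only an illustration under stated assumptions: it compares surface strings rather than the word vectors mentioned in the text, the normalization into a similarity score and the threshold of 0.8 are hypothetical, and the greedy grouping is just one simple way to merge near-duplicate names.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalize the distance into [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def fuse_entities(names: List[str], threshold: float = 0.8) -> List[List[str]]:
    """Greedy grouping: a name joins the first group whose first member is similar enough."""
    groups: List[List[str]] = []
    for name in names:
        for group in groups:
            if similarity(name, group[0]) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


groups = fuse_entities(["ABC Group Co., Ltd", "ABC Group Co. Ltd", "XYZ Bank"])
```

In this example the two spellings of the same company end up in one group and the bank stays separate, so each group can be mapped to a single node in the graph.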
The advantages of the invention are as follows. It takes into account the respective strengths and weaknesses of keywords and machine learning classification algorithms in tax code classification, and proposes an attention-based method for classifying ultra-short commodity-name texts that combines the strengths of both. Keyword-level information mining with entity linking compensates for the limited context of short texts: anchor text replaces context-poor keywords for encoding, a Transformer architecture then estimates the contribution of each keyword to tax code classification, and the classification is completed. This further improves accuracy and efficiency and greatly reduces labor cost.
Drawings
FIG. 1 is a schematic diagram of a data preprocessing process according to the present invention.
FIG. 2 is a diagram of the masked semantic model M-MST of the present invention.
FIG. 3 is an exemplary knowledge graph of a multi-source financial bulletin document.
FIG. 4 is a schematic flow chart of the present invention.
Detailed Description
The invention will be further explained with reference to the drawings.
The knowledge graph construction method for multi-source Chinese financial bulletin documents comprises the following steps:
Step 1: Construct a complete document structure tree, and obtain the document structure, including the general title, first-level titles, second-level titles, and so on, together with the corresponding text blocks.
Step 2: Locate the effective blocks by fuzzy matching against the labeled content, and extract the corresponding titles of the effective blocks.
Step 3: Pad or truncate each title so that its length equals the preset number of characters. In this example, because titles are relatively short, the preset length is chosen to be as long as possible so that no information is lost. Chinese BERT word vectors are used for encoding. According to the statistics in this example, most titles are within 25 characters, so the length is set to 25 characters: shorter titles are padded using a repetition strategy, and longer titles are truncated to their first 25 characters.
Step 4: Divide the processed data set into a training set (80%) and a test set (20%), input the encoded data and their labels into a BiLSTM-CRF network for training, and perform binary classification with softmax as the final activation function to obtain a title classification model.
Step 5: Classify the document titles with the title classification model, thereby confirming the range of each effective text block, and store the effective text blocks in key-value form in a MongoDB database. Steps 1 to 5 constitute the data preprocessing stage; the flow is shown in FIG. 1.
Step 6: After the effective text blocks are obtained, mask the complex entities in them. For example, in the text "A consortium formed by the company and its subsidiaries Zhongtiejiu Ju Co., Ltd, Zhongtieseventeen Ju Co., and others won the bid for the national highway network G85 Yukun highway bay to Showa section highway investor and cooperative contract constructor bid C cooperative contract construction section, at a winning bid price of about 2.45661 billion yuan.", the complex entity is "national highway network G85 Yukun highway bay to Showa section highway investor and cooperative contract constructor bid C cooperative contract construction section". The complex entity is masked as "project", and the masked text becomes "A consortium formed by the company and its subsidiaries Zhongtiejiu Ju Co., Ltd, Zhongtieseventeen Ju Co., and others won the bid for the project, at a winning bid price of about 2.45661 billion yuan."
Step 7: Construct the masked semantic model M-MST, encode the labeled data with BERT, divide it into an 80% training set and a 20% test set, and feed the training set into the M-MST model to train a knowledge extraction model. The M-MST structure is shown in FIG. 2.
Step 8: In combination with the external Baidu Encyclopedia knowledge base, obtain word vectors of the entities and their attributes that carry domain-specific context semantic information, and complete entity fusion with the Levenshtein algorithm to obtain entity relationship triples.
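The formula itself is not reproduced in this text. Assuming it is the standard Levenshtein recurrence, which uses the same symbols, it reads:

$$\operatorname{lev}_{a,b}(i,j)=\begin{cases}\max(i,j), & \text{if } \min(i,j)=0,\\ \min\left\{\begin{array}{l}\operatorname{lev}_{a,b}(i-1,j)+1,\\ \operatorname{lev}_{a,b}(i,j-1)+1,\\ \operatorname{lev}_{a,b}(i-1,j-1)+1_{(a_i\neq b_j)}\end{array}\right. & \text{otherwise,}\end{cases}$$

where $a$ and $b$ denote the two entity word vectors, $i$ and $j$ denote the vector indices, and $\operatorname{lev}_{a,b}(i,j)$ denotes the similarity value between entities $a$ and $b$.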
Step 9: Knowledge graph construction and incremental updating or expansion. Based on the fused financial-domain entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display, and query, and OGM is used to design and define the triple objects in Neo4j. The procedure is as follows:
a) Construct a knowledge graph from the entity-fusion triple knowledge base;
b) Compute the word vector of each newly added entity or attribute in a unified semantic environment;
c) For a candidate new attribute, perform the following operations:
i. Judge whether it is a newly added attribute according to the distance threshold or an external knowledge base;
ii. If yes → add it to the corresponding entity attribute triples;
iii. If no → judge whether it is an attribute that needs to be updated (attribute information changes over time);
iv. If yes → update the current attribute value (node and edge), and record the update time and frequency;
v. If no → make no modification; the repetition frequency can be recorded.
d) For a candidate new entity, perform the following operations:
i. Judge whether it is a newly added entity according to the distance threshold;
ii. If yes → obtain the optimal insertion position by clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the knowledge graph best suited to the application target;
iii. If no → judge whether it is an entity that needs to be updated (attribute information changes over time);
iv. If yes → update the entity (node and edge) using the optimization method, and record the update time and frequency;
v. If no → make no modification; the repetition frequency can be recorded.
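The attribute branch of the incremental update procedure above can be sketched as a small decision function. This is a simplified illustration, not the patented procedure: it stands in for the word-vector distance with the string similarity ratio of `difflib.SequenceMatcher` from the Python standard library, and the record fields (`value`, `updated_at`, `updates`, `repeats`), the function names, and the threshold of 0.8 are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional


def closest_key(name: str, keys: Iterable[str], threshold: float) -> Optional[str]:
    """Return the most similar existing key, or None if nothing reaches the threshold."""
    best, best_score = None, 0.0
    for key in keys:
        score = SequenceMatcher(None, name, key).ratio()
        if score > best_score:
            best, best_score = key, score
    return best if best_score >= threshold else None


def upsert_attribute(attrs: Dict[str, dict], name: str, value: str,
                     timestamp: str, threshold: float = 0.8) -> str:
    """Add, update, or leave unchanged one attribute of an entity."""
    match = closest_key(name, attrs, threshold)
    if match is None:
        # Newly added attribute: attach it to the entity.
        attrs[name] = {"value": value, "updated_at": timestamp,
                       "updates": 0, "repeats": 0}
        return "added"
    record = attrs[match]
    if record["value"] != value:
        # Known attribute whose value changed over time: update and log it.
        record["value"] = value
        record["updated_at"] = timestamp
        record["updates"] += 1
        return "updated"
    # Known attribute with the same value: only count the repetition.
    record["repeats"] += 1
    return "unchanged"


attrs: Dict[str, dict] = {}
first = upsert_attribute(attrs, "bid price", "2.4 billion yuan", "2021-01-01")
second = upsert_attribute(attrs, "bid price", "2.5 billion yuan", "2021-02-01")
third = upsert_attribute(attrs, "bid price", "2.5 billion yuan", "2021-03-01")
```

The entity branch would follow the same three outcomes, with the additional step of choosing where a new entity is attached in the graph.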
The system for implementing the knowledge graph construction method of the multi-source Chinese financial bulletin documents comprises a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module and a multi-source financial bulletin document knowledge graph construction module which are connected in sequence;
the document structure tree construction module: aiming at the format (xml/pdf) of the document data, the hierarchical relationship of each chapter of the document is structured by xml structure extraction or Optical Character Recognition (OCR) technology, and a more complete document structure tree (sessionTree) is constructed.
Title data labeling module: all header data are labeled. And acquiring the position of the key information in a regular fuzzy matching mode, extracting the title of the effective text block where the key information is positioned, marking the title as an effective title, and marking the rest titles as invalid titles.
The vector representation construction module: unifying the length of the title to the preset number of words, and performing word embedding coding at a character level by using BERT to obtain corresponding vector representation.
The title classification model construction module: dividing the processed data set into an 80% training set and a 20% testing set, feeding the obtained vector into a BilSTM-CRF neural network for training, and performing secondary classification on the title through Softmax to obtain a title classification model.
The document title classification module: the document titles are classified by using a title classification model, the range of the effective text block is further confirmed, and the effective text block is stored in a key-value form of a MongoDB database.
Complex effective knowledge mask module: the complex effective knowledge of the effective text blocks is masked and replaced by a certain short referring entity so as to reduce the influence of the complex knowledge on the context semantics, accurately acquire and extract the context semantic information of the knowledge and label the text blocks in a BIO form aiming at the effective knowledge.
A knowledge extraction model construction module: constructing a semantic Model with a mask, constructing a Bi-LSTM semantic Model (mask-Multiple Sources One Topic Bi-LSTM Model) of a multi-source similar generalization mask, performing word embedding coding on the labeled data by using BERT, dividing the labeled data into 80% of training sets and 20% of testing sets, and feeding the M-MST Model for training to obtain a knowledge extraction Model.
The entity relationship triple construction module: according to the knowledge extraction model and in combination with an external knowledge base, word vectors of the entities and their attributes that carry domain-specific context semantic information are obtained, and entity fusion is completed with the Levenshtein algorithm to obtain entity relationship triples. Specifically:
Word vectors of the entities and their attributes that carry domain-specific context semantic information are obtained in combination with the external Baidu Encyclopedia knowledge base, and entity fusion is completed with the Levenshtein algorithm to obtain entity relationship triples.
In the formula, a and b denote the two entity word vectors, i and j denote the vector indices, and Lev_{a,b}(i, j) denotes the similarity value between entities a and b.
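The formula referred to above is not reproduced in this text. The sketch below therefore assumes the standard Levenshtein edit distance over character sequences, and adds a length-normalized similarity and a fusion threshold of 0.8; the normalization, the threshold, and the function names are assumptions and may differ from the patented formula.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (two-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map the edit distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def should_fuse(a, b, threshold=0.8):
    """Treat two entity mentions as the same entity above the threshold."""
    return similarity(a, b) >= threshold


short_name = "平安银行"
full_name = "平安银行股份有限公司"
distance = levenshtein(short_name, full_name)
```

In this example the short and full company names differ by six inserted characters, so a pure edit-distance similarity stays below the threshold. This is one reason the source combines the algorithm with an external knowledge base.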
The multi-source financial bulletin document knowledge graph construction module: based on the fused financial-field entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, OGM is used to design and define the triple objects in Neo4j, the knowledge graph of the multi-source financial bulletin documents is constructed, and incremental updating or expansion is realized. Specifically:
e) constructing a knowledge graph based on the entity-fusion triple knowledge base;
f) calculating the word vector of a newly added entity or attribute based on a unified semantic environment;
g) if an attribute is newly added, the following operations are executed:
vi. judging whether the attribute is a newly added attribute according to the distance threshold or an external knowledge base;
vii. yes → add it to the corresponding entity attribute triple;
viii. no → judge whether it is an attribute that needs to be updated (attribute information changes over time);
ix. yes → update the current attribute value (nodes and edges), and record the update time and frequency;
x. no → no modification; the repetition frequency can be recorded.
h) if an entity is newly added, the following operations are executed:
vi. judging whether the entity is a newly added entity according to the distance threshold;
vii. yes → obtain the optimal insertion position by clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the optimal knowledge graph for the application target;
viii. no → judge whether it is an entity that needs to be updated (attribute information changes over time);
ix. yes → update the entity (nodes and edges) based on the optimization method, and record the update time and frequency;
x. no → no modification; the repetition frequency can be recorded.
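The attribute branch of the decision flow above can be sketched as follows. The in-memory dictionary standing in for the graph, the cosine distance, the threshold of 0.2, and all names and sample values are assumptions for illustration; the source stores the graph in Neo4j and does not fix a particular distance measure or threshold.

```python
import math
from datetime import datetime, timezone


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def upsert_attribute(attributes, name, vector, value, threshold=0.2):
    """Apply the add / update / no-change decision to one candidate attribute.

    `attributes` maps attribute name -> record dict. Returns the action taken.
    """
    nearest = min(attributes.values(),
                  key=lambda rec: cosine_distance(rec["vector"], vector),
                  default=None)
    if nearest is None or cosine_distance(nearest["vector"], vector) > threshold:
        # Farther than the threshold from every known attribute: newly added.
        attributes[name] = {"vector": vector, "value": value,
                            "updated_at": [], "repeats": 0}
        return "added"
    if nearest["value"] != value:
        # Known attribute whose value changed: update and record the time.
        nearest["value"] = value
        nearest["updated_at"].append(datetime.now(timezone.utc))
        return "updated"
    # Known attribute with the same value: only count the repetition.
    nearest["repeats"] += 1
    return "unchanged"


attrs = {}
actions = [
    upsert_attribute(attrs, "registered capital", [1.0, 0.0], "100M"),
    upsert_attribute(attrs, "registered capital (RMB)", [0.98, 0.05], "120M"),
    upsert_attribute(attrs, "registered capital", [1.0, 0.0], "120M"),
    upsert_attribute(attrs, "legal representative", [0.0, 1.0], "Zhang"),
]
```

The length of `updated_at` gives the update frequency and `repeats` gives the repetition frequency. The entity branch would follow the same shape, with the "added" case replaced by a search for the best insertion position.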
The invention aims to address the scarcity of Chinese knowledge graph resources. It integrates knowledge extraction with knowledge graph construction, and targets in particular the extraction of complex entities and the construction of a knowledge graph in the financial field. A document structure tree is constructed using xml structure extraction or Optical Character Recognition (OCR) technology, and the titles of the effective text blocks containing effective information are marked by regular-expression fuzzy matching. After short titles are padded and long titles are truncated to a uniform number of characters, each character is embedded at the character level with BERT to obtain the corresponding vector, which is fed into a BiLSTM-CRF and then classified by Softmax. The result is an incrementally updatable knowledge graph and knowledge inference system for Chinese multi-source financial bulletin documents, which provides associated entity (enterprise, individual, and the like) and event information in the field for risk decision-making and other practical applications, thereby addressing the high cost, low efficiency, high threshold, and poor timeliness of enterprise risk prediction and analysis.
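The title labeling and length unification mentioned above can be sketched as follows. The regular expression, the `[PAD]` token, the target length, and the sample blocks are hypothetical; the patterns actually used for financial bulletins are not given in the source.

```python
import re

# Hypothetical pattern for one kind of key information (share pledges).
KEY_PATTERN = re.compile(r"shares?\s+(?:were\s+|was\s+)?pledged|pledge\s+of\s+shares",
                         re.IGNORECASE)


def label_titles(blocks, pattern=KEY_PATTERN):
    """Mark a title 1 (effective) if its text block matches the pattern, else 0."""
    return {title: int(bool(pattern.search(body))) for title, body in blocks}


def normalize_length(title, length, pad_token="[PAD]"):
    """Truncate long titles and pad short ones to exactly `length` characters."""
    chars = list(title)[:length]
    return chars + [pad_token] * (length - len(chars))


blocks = [
    ("Share pledge", "The controlling shareholder's shares were pledged to a bank."),
    ("Company profile", "The company is registered in Beijing."),
]
labels = label_titles(blocks)
padded = normalize_length("Share pledge", 16)
truncated = normalize_length("Company profile", 7)
```

The fixed-length character lists are what a character-level embedding step would consume; the labels provide the two classes for the title classifier.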
The invention has been illustrated by the above examples, but it should be noted that the examples are for illustrative purposes only and do not limit the invention to the scope of the examples. Although the invention has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that the technical solutions described in the foregoing examples can be modified, or some of their technical features can be replaced by equivalents, and that such modifications or substitutions do not depart from the scope of the present invention. The scope of the invention is defined by the appended claims and their equivalents.
Claims (2)
1. The knowledge graph construction method of the multi-source Chinese financial bulletin document comprises the following steps:
step 1: structuring the hierarchical relationship of each chapter of the document by using an xml structure extraction or Optical Character Recognition (OCR) technology according to the format (xml/pdf) of the document data, and constructing a relatively complete document structure tree (sessionTree);
step 2: labeling all the title data; acquiring the position of the key information in a regular fuzzy matching mode, extracting the title of the effective text block where the key information is located, marking the title as an effective title, and marking the rest titles as invalid titles;
step 3: unifying the length of the titles to a preset number of characters, and performing character-level word embedding with BERT to obtain the corresponding vector representation;
step 4: dividing the processed data set into a training set and a test set, feeding the obtained vectors into a BiLSTM-CRF neural network for training, and performing binary classification on the titles through Softmax to obtain a title classification model;
step 5: classifying the document titles with the title classification model, further confirming the range of the effective text blocks, and storing them in key-value form in a MongoDB database;
step 6: masking the complex effective knowledge in the effective text blocks and replacing it with a short referring entity, so as to reduce the influence of the complex knowledge on the context semantics and to accurately acquire the context semantic information of the knowledge to be extracted, and labeling the text blocks in BIO form with respect to the effective knowledge;
step 7: constructing a masked semantic model, namely the multi-source, same-topic generalized-mask Bi-LSTM semantic model M-MST (Mask-Multiple Sources One Topic Bi-LSTM Model), performing word embedding on the labeled data with BERT, dividing the labeled data into a training set and a test set, and feeding them into the M-MST model for training to obtain a knowledge extraction model;
step 8: according to the knowledge extraction model and in combination with an external knowledge base, obtaining word vectors of the entities and their attributes that carry domain-specific context semantic information, and completing entity fusion with the Levenshtein algorithm to obtain entity relationship triples;
where a and b denote the two entity word vectors, i and j denote the vector indices, and Lev_{a,b}(i, j) denotes the similarity value between entities a and b;
step 9: based on the fused financial-field entity triples, using a high-performance NoSQL database such as Neo4j for storage, display and query, using OGM to design and define the triple objects in Neo4j, constructing the knowledge graph of the multi-source financial bulletin documents, and realizing incremental updating or expansion, which specifically comprises the following:
based on the fused financial-field entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, and OGM is used to design and define the triple objects in Neo4j;
a) constructing a knowledge graph based on the entity-fusion triple knowledge base;
b) calculating the word vector of a newly added entity or attribute based on a unified semantic environment;
c) if an attribute is newly added, the following operations are executed:
i. judging whether the attribute is a newly added attribute according to the distance threshold or an external knowledge base;
ii. yes → add it to the corresponding entity attribute triple;
iii. no → judge whether it is an attribute that needs to be updated (attribute information changes over time);
iv. yes → update the current attribute value (nodes and edges), and record the update time and frequency;
v. no → no modification; the repetition frequency can be recorded;
d) if an entity is newly added, the following operations are executed:
i. judging whether the entity is a newly added entity according to the distance threshold;
ii. yes → obtain the optimal insertion position by clustering and optimization analysis, for example by calculating the number of newly added relations and the number of changed relations, so as to obtain the optimal knowledge graph for the application target;
iii. no → judge whether it is an entity that needs to be updated (attribute information changes over time);
iv. yes → update the entity (nodes and edges) based on the optimization method, and record the update time and frequency;
v. no → no modification; the repetition frequency can be recorded.
2. The system for implementing the knowledge graph construction method of the multisource Chinese financial bulletin document of claim 1, comprising a document structure tree construction module, a title data labeling module, a vector representation construction module, a title classification model construction module, a document title classification module, a complex effective knowledge mask module, a knowledge extraction model construction module, an entity relationship triple construction module, and a multisource financial bulletin document knowledge graph construction module which are connected in sequence;
the document structure tree construction module: structuring the hierarchical relationship of each chapter of the document by using an xml structure extraction or Optical Character Recognition (OCR) technology according to the format (xml/pdf) of the document data, and constructing a relatively complete document structure tree (sessionTree);
title data labeling module: labeling all the title data; acquiring the position of the key information in a regular fuzzy matching mode, extracting the title of the effective text block where the key information is located, marking the title as an effective title, and marking the rest titles as invalid titles;
the vector representation construction module: unifying the length of the title to a preset word number, and performing word embedding coding at a character level by using BERT to obtain corresponding vector representation;
the title classification model construction module: dividing the processed data set into a training set and a test set, feeding the obtained vectors into a BiLSTM-CRF neural network for training, and performing binary classification on the titles through Softmax to obtain a title classification model;
the document title classification module: classifying the document titles by using a title classification model, further confirming the range of the effective text block, and storing the range in a key-value form of a MongoDB database;
complex effective knowledge mask module: masking the complex effective knowledge of the effective text blocks, replacing the complex effective knowledge with a certain short entity to reduce the influence of the complex knowledge on context semantics, accurately acquiring and extracting knowledge context semantic information, and labeling the text blocks in a BIO form aiming at the effective knowledge;
a knowledge extraction model construction module: constructing a masked semantic model, namely the multi-source, same-topic generalized-mask Bi-LSTM semantic model M-MST (Mask-Multiple Sources One Topic Bi-LSTM Model), performing word embedding on the labeled data with BERT, dividing the labeled data into a training set and a test set, and feeding them into the M-MST model for training to obtain a knowledge extraction model;
the entity relationship triple construction module: obtaining word vectors of the entities and the attributes thereof with professional field context semantic information by combining an external knowledge base according to a knowledge extraction model, and completing entity fusion work by utilizing a Levenshtein algorithm to obtain entity relationship triples;
the multi-source financial bulletin document knowledge graph construction module: based on the fused financial-field entity triples, a high-performance NoSQL database such as Neo4j is used for storage, display and query, OGM is used to design and define the triple objects in Neo4j, the knowledge graph of the multi-source financial bulletin documents is constructed, and incremental updating or expansion is realized.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110517049.4A CN113569054A (en) | 2021-05-12 | 2021-05-12 | Knowledge graph construction method and system for multi-source Chinese financial bulletin document |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110517049.4A CN113569054A (en) | 2021-05-12 | 2021-05-12 | Knowledge graph construction method and system for multi-source Chinese financial bulletin document |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113569054A (en) | 2021-10-29 |
Family
ID=78161480
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110517049.4A Pending CN113569054A (en) | Knowledge graph construction method and system for multi-source Chinese financial bulletin document | 2021-05-12 | 2021-05-12 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113569054A (en) |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114398464A (en) * | 2021-12-28 | 2022-04-26 | 北方工业大学 | Knowledge graph-based discussion data display method and system |
CN114416705A (en) * | 2021-11-09 | 2022-04-29 | 北京泰策科技有限公司 | Multi-source heterogeneous data fusion modeling method |
CN114417015A (en) * | 2022-01-26 | 2022-04-29 | 西南交通大学 | Method for constructing maintainability knowledge graph of high-speed train |
CN114495143A (en) * | 2021-12-24 | 2022-05-13 | 北京百度网讯科技有限公司 | Text object identification method and device, electronic equipment and storage medium |
CN115630174A (en) * | 2022-12-21 | 2023-01-20 | 上海金仕达软件科技有限公司 | Multi-source bulletin document processing method and device, storage medium and electronic equipment |
CN116090560A (en) * | 2023-04-06 | 2023-05-09 | 北京大学深圳研究生院 | Knowledge graph establishment method, device and system based on teaching materials |
CN116340530A (en) * | 2023-02-17 | 2023-06-27 | 江苏科技大学 | Intelligent design method based on mechanical knowledge graph |
- 2021-05-12 CN CN202110517049.4A patent/CN113569054A/en active Pending
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114416705A (en) * | 2021-11-09 | 2022-04-29 | 北京泰策科技有限公司 | Multi-source heterogeneous data fusion modeling method |
CN114495143A (en) * | 2021-12-24 | 2022-05-13 | 北京百度网讯科技有限公司 | Text object identification method and device, electronic equipment and storage medium |
CN114495143B (en) * | 2021-12-24 | 2024-03-22 | 北京百度网讯科技有限公司 | Text object recognition method and device, electronic equipment and storage medium |
CN114398464A (en) * | 2021-12-28 | 2022-04-26 | 北方工业大学 | Knowledge graph-based discussion data display method and system |
CN114398464B (en) * | 2021-12-28 | 2023-01-24 | 北方工业大学 | Knowledge graph-based discussion data display method and system |
CN114417015A (en) * | 2022-01-26 | 2022-04-29 | 西南交通大学 | Method for constructing maintainability knowledge graph of high-speed train |
CN114417015B (en) * | 2022-01-26 | 2023-05-12 | 西南交通大学 | High-speed train maintainability knowledge graph construction method |
CN115630174A (en) * | 2022-12-21 | 2023-01-20 | 上海金仕达软件科技有限公司 | Multi-source bulletin document processing method and device, storage medium and electronic equipment |
CN115630174B (en) * | 2022-12-21 | 2023-07-21 | 上海金仕达软件科技股份有限公司 | Multisource bulletin document processing method and device, storage medium and electronic equipment |
CN116340530A (en) * | 2023-02-17 | 2023-06-27 | 江苏科技大学 | Intelligent design method based on mechanical knowledge graph |
CN116090560A (en) * | 2023-04-06 | 2023-05-09 | 北京大学深圳研究生院 | Knowledge graph establishment method, device and system based on teaching materials |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN113569054A (en) | Knowledge graph construction method and system for multi-source Chinese financial bulletin document | |
US11222052B2 (en) | Machine learning-based relationship association and related discovery and | |
US11386096B2 (en) | Entity fingerprints | |
WO2021147726A1 (en) | Information extraction method and apparatus, electronic device and storage medium | |
CN110633409B (en) | Automobile news event extraction method integrating rules and deep learning | |
US10303999B2 (en) | Machine learning-based relationship association and related discovery and search engines | |
CN113806563B (en) | Architect knowledge graph construction method for multi-source heterogeneous building humanistic historical material | |
CN112000725A (en) | Ontology fusion pretreatment method for multi-source heterogeneous resources | |
CN111597811B (en) | Financial chapter-level multi-correlation event extraction method based on graph neural network algorithm | |
CN114443855A (en) | Knowledge graph cross-language alignment method based on graph representation learning | |
CN114661914A (en) | Contract examination method, device, equipment and storage medium based on deep learning and knowledge graph | |
CN115759037A (en) | Intelligent auditing frame and auditing method for building construction scheme | |
Li et al. | Multi-task deep learning model based on hierarchical relations of address elements for semantic address matching | |
Li et al. | Abstractive financial news summarization via transformer-BiLSTM encoder and graph attention-based decoder | |
CN112257442B (en) | Policy document information extraction method based on corpus expansion neural network | |
CN115757325B (en) | Intelligent conversion method and system for XES log | |
CN116821376A (en) | Knowledge graph construction method and system in coal mine safety production field | |
CN115658919A (en) | Culture information digital storage method | |
Hovy et al. | Data Acquisition and Integration in the DGRC's Energy Data Collection Project | |
Xi et al. | Chinese named entity recognition: applications and challenges | |
Chen et al. | Prototype Network for Text Entity Relationship Recognition in Metallurgical Field Based on Integrated Multi-class Loss Functions | |
Jian et al. | An improved memory networks based product model classification method | |
Liu et al. | LeKAN: extracting long-tail relations via layer-enhanced knowledge-aggregation networks | |
Yu et al. | A knowledge-graph based text summarization scheme for mobile edge computing | |
CN118070812B (en) | Industry data analysis method based on NLP |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||