CN110990525A - Natural language processing-based public opinion information extraction and knowledge base generation method - Google Patents
Natural language processing-based public opinion information extraction and knowledge base generation method
- Publication number
- CN110990525A (application number CN201911117980.2A)
- Authority
- CN
- China
- Prior art keywords
- entity
- word
- relationship
- extraction
- words
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3335—Syntactic pre-processing, e.g. stopword elimination, stemming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a natural language processing-based method for public opinion information extraction and knowledge base generation, comprising the following steps. First, text preprocessing. Second, named entity recognition: company names, organization names, and person names are identified using a neural network-based method. Third, relation extraction: six types of relations in the financial field are extracted using a feature layer + GRU + Attention model. Fourth, entity linking: the Jaro-Winkler distance between a linked entity and a target entity is calculated to judge whether they are the same entity, thereby achieving entity disambiguation. By combining an end-to-end model with a feature-extraction input model, the method builds a one-stop pipeline from unstructured financial text to structured data storage, makes full use of the context of financial news, extracts knowledge with fewer parameters and faster training and prediction, and performs well in the field of financial public opinion information.
Description
Technical Field
The invention relates to a natural language processing-based method for public opinion information extraction and knowledge base generation. It involves entity recognition, relation extraction, entity linking, and related technologies in the field of financial information, and in particular provides a complete pipeline and method, from information extraction to knowledge generation, for enterprise public opinion news.
Background
As investors diversify and enterprises increasingly operate as conglomerates, the relationships among enterprises are becoming more complex. These relationships cut across regions and industries and are often well concealed. For financial institutions such as commercial banks, if an enterprise deliberately conceals such relationships when applying for a loan, the bank has difficulty grasping the true situation, which leads to excessive and repeated credit extension and increases the bank's credit risk. Fully identifying the association relationships between enterprises and gaining a more comprehensive view of customer information is therefore an important way to reduce credit risk.
Currently, data on enterprise associations come mainly from structured data provided by enterprises and data service providers, such as the National Enterprise Credit Information Publicity System. Because such information is updated slowly, credit officers who want to enrich the dimensions of customer information during investigation and evidence collection also draw on more timely sources, such as judicial data from China Judgments Online and public opinion data from enterprise news reports, as important supplements. However, public opinion information exists as unstructured text. Few technologies and tools are available to help credit officers mine it, so they usually rely on manual browsing and querying. This limits the depth of investigation and the efficiency of querying, and makes it hard to meet financial institutions' growing need for dynamic queries of group-customer association information. Introducing richer data such as public opinion information, together with automated mining and storage of knowledge, is therefore an urgent need in the field of credit risk.
Disclosure of Invention
In view of the above problems, the object of the present invention is to provide a natural language processing-based method for public opinion information extraction and knowledge base generation. The method recasts the mining of enterprise association information as an information extraction task in natural language processing: it identifies and models the characteristics of this information in unstructured text, and automatically extracts enterprise entities and the association relationships between enterprises as an important supplement to existing structured data, providing stronger support for credit risk management.
The method takes unstructured text on the Internet, such as enterprise news reports, as its data source, and builds an entity recognition model, a relation extraction model, and an entity linking model to extract information from the text and map it into storage. First, the unstructured text is preprocessed to remove interference such as symbols and stop words, yielding cleaned data. Next, the data are analyzed by word segmentation, part-of-speech tagging, and similar steps; a model is built to extract the named entities in the text, and a relation extraction model is built to extract the relations between those entities. Finally, entity linking maps the extracted entities and relations into the knowledge base, completing the generation and updating of the knowledge base.
To achieve the above object, the natural language processing-based method for public opinion information extraction and knowledge base generation provided by the invention comprises the following steps:
Step one, text preprocessing
Text preprocessing mainly comprises character cleaning, word segmentation, and stop-word removal.
Character cleaning: regular-expression matching is used to unify full-width and half-width characters in the input text and to match and filter out punctuation marks.
Word segmentation: the LTP toolkit is used for word segmentation, and a domain-specific dictionary is introduced to improve segmentation quality.
Stop-word removal: stop words are removed from the text using a stop-word dictionary specific to the non-performing assets field. This dictionary divides stop words into three categories: words that carry no meaning of their own, such as "and" and "are"; words that occur widely in sentences at high frequency; and words that carry no meaning within the business system.
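The cleaning and stop-word steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: Unicode NFKC normalization is swapped in for the width unification (the text describes regular-expression matching for both steps), the stop-word set `STOP_WORDS` is a hypothetical stand-in for the domain dictionary, and segmentation is assumed to have been done already by a tool such as LTP.

```python
import re
import unicodedata

# Hypothetical stop-word set; the patent's domain-specific dictionary is not published.
STOP_WORDS = {"的", "和", "是"}

# Matches any character that is neither a word character nor whitespace,
# i.e. punctuation marks and other symbols.
PUNCT_RE = re.compile(r"[^\w\s]")


def clean_text(text: str) -> str:
    """Unify full-width and half-width forms, then filter punctuation."""
    # NFKC folds full-width letters, digits and punctuation to half-width forms.
    normalized = unicodedata.normalize("NFKC", text)
    # Replace each punctuation mark with a space so neighbouring words stay apart.
    return PUNCT_RE.sub(" ", normalized)


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop empty tokens and tokens listed in the stop-word set."""
    return [t for t in tokens if t.strip() and t not in stop_words]
```

For example, `clean_text("ＡＢＣ１２３，公司。")` yields text whose whitespace-separated parts are `ABC123` and `公司`, and `remove_stop_words` would then be applied to the segmenter's output.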
Step two, named entity recognition
Named entity recognition for the financial information field mainly identifies company names, organization names, and person names. The invention uses a neural network-based method to complete named entity recognition.
In this method, each word is first mapped to a dense embedding in a low-dimensional space. The word embeddings are then used as the model input, a BiLSTM automatically extracts features, and a CRF predicts the labels for the whole sentence. Training and prediction thus become a single end-to-end process rather than a traditional pipeline, removing the dependence on feature engineering. The annotated data come from the 1998 People's Daily corpus, which contains three types of entities: person names (nr), place names (ns), and organization names (nt). B, M, and E denote the first, middle, and last word of an entity, respectively. For example, B-nr, M-nr, and E-nr mark the first word, a middle word, and the last word of a person name, and O indicates that a word is not part of any named entity.
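The B/M/E/O labelling scheme can be illustrated with a short sketch. The function name `bme_tags` and the input format (a list of `(units, label)` segments) are hypothetical. Two further assumptions: each segment is split into characters when given as a string, and a single-unit entity receives a B tag, since the text defines no separate tag for that case.

```python
# Entity labels used by the corpus: person, place, organization.
ENTITY_LABELS = {"nr", "ns", "nt"}


def bme_tags(segments):
    """Turn (units, label) segments into a flat list of (unit, tag) pairs.

    `units` is a string (split into characters) or a list of tokens.
    `label` is "nr", "ns", "nt", or anything else for non-entities.
    """
    tagged = []
    for units, label in segments:
        units = list(units)
        if label not in ENTITY_LABELS:
            # Everything outside a named entity is tagged O.
            tagged.extend((u, "O") for u in units)
            continue
        for i, u in enumerate(units):
            if i == 0:
                prefix = "B"  # first unit of the entity
            elif i == len(units) - 1:
                prefix = "E"  # last unit of the entity
            else:
                prefix = "M"  # any unit in between
            tagged.append((u, f"{prefix}-{label}"))
    return tagged


if __name__ == "__main__":
    for unit, tag in bme_tags([("京东方", "nt"), ("收购", None)]):
        print(unit, tag)
```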
Step three, relation extraction
Relation extraction is one of the important research tasks in natural language processing. Given entity 1 ($e_1$) and entity 2 ($e_2$), it is the process of obtaining the relation ($r$) between the two entities from a piece of text, which can be written as $r \rightarrow (e_1, e_2)$. The invention extracts 8 types of relation in total: job, cooperation, litigation, supplier, shareholding control, investment, debt, and Unknown. For example, given the text "In 2002, the cash-strapped Hyundai Hynix sold its entire TFT-LCD division to the BOE Group for USD 380 million", where entity 1 is "Hyundai Hynix" and entity 2 is "BOE Group", analyzing the distribution of meaning in the sentence and the logical relationships between its words yields the "acquisition" relation between the BOE Group and Hyundai Hynix.
The invention provides a Bi-GRU model to solve the problem of relation extraction in the financial field, which mainly comprises the following steps:
S31. Feature extraction. Lexical and syntactic features are extracted from the input sentence. The quality of feature extraction determines the performance of the model. The features are divided into lexical features and syntactic features. Lexical features include the word embedding (denoted $w_w$), the part of speech (pos, denoted $w_{f_1}$), and the named entity tag (ner, denoted $w_{f_2}$). Syntactic features include the dependency type (dep, denoted $w_{f_3}$), the parent node position (denoted $w_{f_4}$), and the relative position of the word (position feature, denoted $w_{f_5}$).
To obtain the lexical and syntactic features, the input sentence is processed with LTP, the natural language processing toolkit from Harbin Institute of Technology (HIT). The final feature set is:
$$\text{Feature Set} = \{w_w, w_{f_1}, w_{f_2}, w_{f_3}, w_{f_4}, w_{f_5}, w_{f_6}\}$$
S32. Feature embedding. Feature embedding converts the features from step S31 into vector representations and concatenates them. For a sentence $S = \{w_1, w_2, \dots, w_n\}$, the feature representation of word $w_i$ is the concatenation of its word vector and its feature vectors:
$$x_i = [\, w_w^{i};\; w_{f_1}^{i};\; w_{f_2}^{i};\; \dots \,]$$
where $w_w = W_{wrd} v_i$, $v_i$ is the one-hot vector that selects the column of $W_{wrd}$ corresponding to the current word, and $w_{f_j}$ is the vector representation of the $j$-th class of features.
S33. Bi-GRU model. The vectors from step S32 are passed through a Bi-GRU network with 400 hidden units and a dropout ratio of 0.5. With a batch size of 10, the best result is obtained after 10 training iterations.
S34. Attention mechanism. An attention mechanism is introduced to generate weight vectors at the word level and the sentence level. The invention uses a hot-one attention mechanism to automatically optimize the word-level weights.
S35. Finally, a softmax layer normalizes the result to obtain the probability distribution over relation labels.
Step four, entity linking
After entity and relation extraction is complete, the key problem is how to connect the extracted entities with the real information in the knowledge base. The invention uses the Jaro-Winkler distance: it calculates the distance between a linked entity and a target entity to judge whether they are the same entity, thereby achieving entity disambiguation.
Compared with the prior art, the natural language processing-based method for public opinion information extraction and knowledge base generation has the following advantages: (1) unlike traditional standalone named entity recognition and relation extraction modules, the method integrates the whole public opinion information extraction process, taking raw text as input and producing structured knowledge as output, and thus realizes an end-to-end model; (2) unlike the traditional one-hot approach to word vectors, the word vectors are trained with deep learning, which avoids the curse of dimensionality in word vector representation, fully mines the contextual information of words, and captures the relationships between words; (3) the Bi-GRU-based relation extraction model is lightweight, with fewer parameters and faster training, and performs well in the field of financial public opinion information.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Fig. 2 shows a GRU update mechanism.
Detailed Description
The technical solution of the present invention is further described below with reference to specific examples.
The invention discloses a public opinion information extraction and knowledge base generation method based on natural language processing, which comprises the following steps as shown in figure 1:
Step one, text preprocessing
Text preprocessing is a basic and necessary step in processing unstructured data. By removing characters that carry no entity semantics and filtering stop words after word segmentation, it reduces the negative effect of noisy data as far as possible and improves the generalization ability of the model. The preprocessing methods selected are therefore character cleaning, word segmentation, and stop-word removal.
Character cleaning. Characters such as commas, periods, and quotation marks mark pauses and connections in sentences. They carry no actual meaning for semantic analysis and can be treated as useless characters; the usual way to handle them is matching and filtering. The invention uses regular-expression matching to unify full-width and half-width characters in the input text and to match and filter out punctuation marks.
Word segmentation. In Chinese text, the word is the smallest semantic unit. In text processing, a sentence is usually segmented into a series of words that represent the original sentence. Commonly used word segmentation tools include jieba, the LTP natural language processing toolkit from Harbin Institute of Technology (HIT), and Stanford's toolkit. After weighing the accuracy, performance, and segmentation granularity of the different tools, the invention uses the LTP toolkit for word segmentation and introduces a domain-specific dictionary to improve segmentation quality.
Stop-word removal. Before the semantics of a sentence are represented by its words, an important step is usually to remove certain characters and words, collectively called stop words. Stop words fall into two types. The first is function words such as "的" ("of"), "呢" (a modal particle), and "在" ("in"), which have no definite meaning of their own in a sentence and serve as connectives or modal particles that support the other words. The second is words that occur so widely and so frequently in sentences that they are useless for representing sentence semantics. Removing these two types of stop words raises the density of keywords and makes it easier to capture the semantic information of a sentence.
Step two, named entity recognition
Named entity recognition for the financial information field mainly identifies company names, organization names, and person names. The invention uses a neural network-based method to complete named entity recognition.
This data-driven method first maps each word to a dense embedding in a low-dimensional space, then uses the word embeddings as the model input, extracts features automatically with a neural network, and predicts the label of each word with softmax. Model training thus becomes a single end-to-end process rather than a traditional pipeline, and does not depend on feature engineering.
The invention uses a word-based BiLSTM + CRF model. The annotated data come from the 1998 People's Daily corpus, which contains three types of entities: person names (nr), place names (ns), and organization names (nt). B, M, and E denote the first, middle, and last word of an entity, respectively. For example, B-nr, M-nr, and E-nr mark the first word, a middle word, and the last word of a person name, and O indicates that a word is not part of any named entity.
The model has three layers: an embedding layer, a BiLSTM layer, and a CRF layer.
(1) Embedding layer: also called the lookup layer. It maps each word in the input text to a vector representation in a low-dimensional space, which is the input to the next layer.
(2) BiLSTM layer: exploits the LSTM's strength in processing sequential text to classify each word, determining and outputting the tag with the highest probability. A bidirectional LSTM makes better use of sentence-level semantic features and captures regularities in how entities are composed, for example that some organization entities end in "limited company".
(3) CRF layer: models the relationships between tags to improve the accuracy of named entity recognition.
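To show why modelling tag-to-tag relationships helps, the sketch below decodes the best tag sequence with the Viterbi algorithm, which is the usual way a linear-chain CRF layer is decoded. It is an illustration only: the emission scores stand in for BiLSTM outputs, the transition scores stand in for learned CRF parameters, and all numbers and names are invented for the example.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions:   list of dicts, one per position, mapping tag -> score
    transitions: dict mapping (previous_tag, current_tag) -> score
                 (pairs that are missing score 0.0)
    """
    if not emissions:
        return []
    # best[t][tag] is the best score of any path that ends in `tag` at position t.
    best = [{tag: emissions[0][tag] for tag in tags}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = best[-1][prev] + transitions.get((prev, cur), 0.0) + emissions[t][cur]
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final tag to the start.
    path = [max(tags, key=lambda tag: best[-1][tag])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


if __name__ == "__main__":
    tags = ["B-nr", "E-nr", "O"]
    emissions = [
        {"B-nr": 2.0, "E-nr": 0.0, "O": 1.0},
        {"B-nr": 1.5, "E-nr": 1.4, "O": 0.0},
    ]
    # Penalise B followed by B and O followed by E; reward B followed by E.
    transitions = {("B-nr", "B-nr"): -10.0, ("O", "E-nr"): -10.0, ("B-nr", "E-nr"): 1.0}
    print(viterbi(emissions, transitions, tags))
```

Taken position by position, the emissions alone would pick B-nr twice; with the transition scores the decoder returns B-nr followed by E-nr, a well-formed entity.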
Step three, relation extraction
Relation extraction is one of the important research tasks in natural language processing. Given entity 1 ($e_1$) and entity 2 ($e_2$), it is the process of obtaining the relation ($r$) between the two entities from a piece of text, which can be written as $r \rightarrow (e_1, e_2)$. For example, given the text "In 2002, the cash-strapped Hyundai Hynix sold its entire TFT-LCD division to the BOE Group for USD 380 million", where entity 1 is "Hyundai Hynix" and entity 2 is "BOE Group", analyzing the distribution of meaning in the sentence and the logical relationships between its words yields the "acquisition" relation between the BOE Group and Hyundai Hynix.
The invention provides a Bi-GRU model to solve the problem of relation extraction in the financial field, which mainly comprises the following steps:
S31. Feature extraction. Lexical and syntactic features are extracted from the input sentence. The quality of feature extraction determines the performance of the model. The features are divided into lexical features and syntactic features. Lexical features include the word embedding (denoted $w_w$), the part of speech (pos, denoted $w_{f_1}$), and the named entity tag (ner, denoted $w_{f_2}$). Syntactic features include the dependency type (dep, denoted $w_{f_3}$), the parent node position (denoted $w_{f_4}$), and the relative position of the word (position feature, denoted $w_{f_5}$).
To obtain the lexical and syntactic features, the input sentence is processed with LTP, the natural language processing toolkit from Harbin Institute of Technology (HIT). The final feature set is:
$$\text{Feature Set} = \{w_w, w_{f_1}, w_{f_2}, w_{f_3}, w_{f_4}, w_{f_5}, w_{f_6}\}$$
S32. Feature embedding. Feature embedding converts the features from step S31 into vector representations and concatenates them; that is, a sentence is converted entirely into vectors that represent its features. A word embedding is a low-dimensional vector representation of a word. Given a sentence $S = \{w_1, w_2, \dots, w_n\}$, $n$ is the number of words in the sentence. $W_{wrd} \in \mathbb{R}^{d_w \times V}$ is the embedding matrix, where $d_w$ is the user-defined dimension of the word embedding and $V$ is the total number of words. Each word in the sentence is mapped to its word vector by the following formula:
$$w_w = W_{wrd} v_i$$
where $v_i$ is the one-hot vector that selects the column of $W_{wrd}$ corresponding to the current word.
In addition, $w_{f_j}$ is the vector representation of features such as the part of speech, the named entity tag, and the dependency type, where $j$ denotes the $j$-th class of features. The part-of-speech, named entity, and dependency type features are one-hot vectors generated from the LTP analysis results, while the parent node position and the relative position (PF) are randomly initialized.
Then, for a sentence $S = \{w_1, w_2, \dots, w_n\}$, the feature representation of word $w_i$ is the concatenation of its word vector and its feature vectors:
$$x_i = [\, w_w^{i};\; w_{f_1}^{i};\; w_{f_2}^{i};\; \dots \,]$$
where $w_w^{i}$ is the word vector of $w_i$ and $w_{f_j}^{i}$ is the vector representation of its $j$-th class of features.
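The embedding lookup and the concatenation of feature vectors can be sketched in a few lines. This is a toy illustration under assumptions: the matrix values, dimensions, and function names are invented, and plain Python lists stand in for tensors.

```python
def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def embed_word(W_wrd, word_index):
    """w_w = W_wrd v_i: multiplying by a one-hot vector selects one column.

    W_wrd has shape d_w x V (embedding dimension by vocabulary size).
    """
    vocab_size = len(W_wrd[0])
    return matvec(W_wrd, one_hot(word_index, vocab_size))


def word_representation(W_wrd, word_index, feature_vectors):
    """Concatenate the word vector with the vectors of each feature class."""
    rep = embed_word(W_wrd, word_index)
    for feature in feature_vectors:
        rep.extend(feature)
    return rep


if __name__ == "__main__":
    W_wrd = [[1.0, 2.0, 3.0],
             [4.0, 5.0, 6.0]]       # d_w = 2, V = 3
    pos = one_hot(0, 2)             # toy part-of-speech feature
    position = [0.5]                # toy position feature
    print(word_representation(W_wrd, 1, [pos, position]))
```

In practice the one-hot product is replaced by a direct column lookup, which gives the same result without the multiplication.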
S33. Bi-GRU model. The vectors from step S32 are passed through a Bi-GRU network to generate high-dimensional vectors. LSTM and GRU are both variants of the RNN. The long short-term memory (LSTM) model carries two states, a long-term state and a short-term state, which can be passed along stably, and adds three gates (the forget gate, the input gate, and the output gate) that let information pass selectively in order to control and protect the long-term state. With these designs, the LSTM avoids the long-term dependency problem.
The GRU streamlines the internal design of the LSTM. It merges the long-term and short-term states, and merges the forget gate and input gate into a single update gate, which determines how much previous information is retained; a reset gate determines how previous information is combined with the current input. The update mechanism is shown in Fig. 2.
In Fig. 2, $x_t$ is the current input, $h_{t-1}$ is the output at the previous time step, $h_t$ is the output at the current time step, and $\sigma$ is the activation function. $r_t$ is the reset gate, which controls how much of the state information from the previous time step is ignored; the smaller its value, the more is ignored. $z_t$ is the update gate, which controls how much of the state information from the previous time step is carried into the current state; the larger its value, the more is carried in. Together, the two gates protect and control the flow of information from the previous hidden state $h_{t-1}$ to the new hidden state $h_t$.
Equations 1-4 give the calculation of the reset gate $r_t$, the update gate $z_t$, the candidate hidden state $\tilde{h}_t$, and the hidden state $h_t$ at the current time step.
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t]) \tag{1}$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t]) \tag{2}$$
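A single GRU step can be sketched as below. Only the two gate equations appear in the text; the candidate state and the final update used here are the standard GRU forms and are assumptions. The final update is written in the convention that matches the gate description above (a larger update gate carries in more of the previous state). The weight name `W_h` is hypothetical and bias terms are omitted.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU update; each weight matrix acts on the concatenation [h, x]."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    r_t = [sigmoid(a) for a in matvec(W_r, concat)]        # reset gate, eq. (1)
    z_t = [sigmoid(a) for a in matvec(W_z, concat)]        # update gate, eq. (2)
    # Assumed candidate state: tanh(W_h . [r_t * h_{t-1}, x_t]).
    reset_concat = [r * h for r, h in zip(r_t, h_prev)] + x_t
    h_cand = [math.tanh(a) for a in matvec(W_h, reset_concat)]
    # Assumed update: h_t = z_t * h_{t-1} + (1 - z_t) * candidate.
    return [z * h + (1.0 - z) * c for z, h, c in zip(z_t, h_prev, h_cand)]


if __name__ == "__main__":
    zero = [[0.0, 0.0]]          # hidden size 1, input size 1
    # With zero weights both gates are 0.5 and the candidate is 0,
    # so the new state is half of the previous one.
    print(gru_step([1.0], [0.8], zero, zero, zero))
```

Some references write the last line with the roles of `z_t` and `1 - z_t` swapped; the two conventions are equivalent up to relabelling the gate.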
The basic unit of the Bi-GRU model consists of a forward-propagating GRU unit and a backward-propagating GRU unit; the structure of the GRU unit is shown in Fig. 2. When processing sequence information, the model can take both forward and backward information into account, and the outputs of the two units are concatenated at the output. For the $i$-th word, the output is:
$$h_i = [\overrightarrow{h_i} \oplus \overleftarrow{h_i}]$$
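The bidirectional wiring itself is independent of the recurrent cell, so it can be sketched with any step function. In the sketch below `step_fwd` and `step_bwd` are hypothetical callables of the form `step(x, h) -> h`; in a Bi-GRU each would be a GRU cell with its own parameters. A running sum is used as a toy cell so the result is easy to follow.

```python
def bidirectional(inputs, step_fwd, step_bwd, h0):
    """Run one recurrent pass in each direction and concatenate per position."""
    forward, h = [], h0
    for x in inputs:                  # left to right
        h = step_fwd(x, h)
        forward.append(h)
    backward, h = [], h0
    for x in reversed(inputs):        # right to left
        h = step_bwd(x, h)
        backward.append(h)
    backward.reverse()                # realign with the input positions
    # The output for word i is the forward state joined with the backward state.
    return [f + b for f, b in zip(forward, backward)]


if __name__ == "__main__":
    def running_sum(x, h):
        # Toy recurrent cell: the state accumulates the inputs seen so far.
        return [h[0] + x[0]]

    print(bidirectional([[1.0], [2.0], [3.0]], running_sum, running_sum, [0.0]))
```

Each output vector has twice the hidden size, since it holds what was seen to the left and to the right of the word.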
Compared with the LSTM, the GRU is simpler, has fewer parameters, and trains faster, while performing comparably; it also performs well on smaller data sets.
S34. Attention mechanism. An attention mechanism is introduced to generate weight vectors at the word level and the sentence level.
When the input sequence is long, the model has difficulty retaining all the important information, and its performance degrades. The attention mechanism exists to solve this problem. The intermediate outputs $h_i$ that the GRU produces for the input sequence are retained and fed to the attention layer, which learns to weight them selectively and associates them with the GRU output sequence when the output is produced. Although the attention mechanism increases the amount of computation, it improves the performance of the model.
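Word-level attention over the retained GRU outputs can be sketched as a weighted sum. The text does not give the scoring function, so the one used here (a dot product between a query vector and the tanh of each hidden state, followed by softmax) is an assumption borrowed from common attention-based relation classifiers; `query` stands in for a learned parameter.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Return (weights, pooled) for a list of hidden-state vectors.

    score_i  = query . tanh(h_i)        (assumed scoring function)
    weights  = softmax(scores)
    pooled   = sum_i weights_i * h_i
    """
    scores = [sum(q * math.tanh(h) for q, h in zip(query, state))
              for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(w * state[d] for w, state in zip(weights, hidden_states))
              for d in range(dim)]
    return weights, pooled


if __name__ == "__main__":
    states = [[1.0, 0.0], [3.0, 2.0]]
    # A zero query scores every word equally, so pooling is a plain average.
    print(attention_pool(states, [0.0, 0.0]))
    # A non-zero query shifts weight towards the states it aligns with.
    print(attention_pool(states, [1.0, 0.0]))
```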
S35. Finally, a softmax layer normalizes the result to obtain the probability distribution over relation labels.
Step four, entity linking
After entity and relation extraction is complete, the key problem is how to connect the extracted entities with the real information in the knowledge base. The invention uses the Jaro-Winkler distance: it calculates the distance between a linked entity and a target entity to judge whether they are the same entity, thereby achieving entity disambiguation.
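A minimal sketch of Jaro-Winkler matching for entity linking follows. It computes the Jaro-Winkler similarity (the corresponding distance is one minus this value) with the customary prefix scale of 0.1 and a prefix cap of 4. The acceptance threshold of 0.9 and the helper `link_entity` are hypothetical; the text does not state the threshold it uses.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for strings that share a common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1.0 - sim)


def link_entity(mention, kb_names, threshold=0.9):
    """Return the closest knowledge-base name, or None if none is close enough."""
    best = max(kb_names, key=lambda name: jaro_winkler(mention, name), default=None)
    if best is not None and jaro_winkler(mention, best) >= threshold:
        return best
    return None
```

With the classic pair "MARTHA" and "MARHTA" the similarity is about 0.961, so `link_entity("MARHTA", ["MARTHA", "XYZ"])` returns `"MARTHA"`, while a mention with no close match returns `None`.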
The technical solution of the invention is described below through an embodiment, which is validated on news public opinion data provided by a financial technology company. The training data comprise the 1998 People's Daily corpus and news information data from the Tushare platform.
One, text preprocessing
The Tushare data are processed by keeping only Chinese text for word segmentation. With reference to the specialized vocabulary of the non-performing assets field, the text is segmented with the LTP word segmenter; the stop-word list is then loaded, and stop words are removed from the segmented text.
Two, named entity recognition
To train the named entity recognition model, the invention uses the 1998 People's Daily corpus, which consists of text from the People's Daily from January to June 1998 with manually annotated parts of speech. It is commonly used for natural language processing work such as word segmentation, part-of-speech tagging, and named entity recognition.
The data set is annotated at the word level, so it is processed first: segments labeled ns, nr, and nt are further split into finer-grained units, and B, M, E information is added to the labels of the resulting units. All other parts of speech are labeled O, producing a new training corpus file.
The model was trained on the new corpus file and achieved F1 = 0.90 on the test set. From the recognition results, person names and organization names are selected for output; place names are discarded because they have little bearing on the present study.
Three, relation extraction
The relation extraction model uses the news information data of the Tushare platform as its data source. Tushare is a free, open-source Python financial data interface package, and news information data is one of the data structures it provides. It mainly obtains news data from news websites, including real-time information from Sina Finance and Wall Street news, among others. The provided data were labeled manually, 1,000 items in total, for training and testing, finally yielding the relation extraction model.
In the specific processing, the result of named entity recognition in the previous step determines whether the relation extraction flow is entered. When two or more entities are extracted, the relation extraction module is entered. The current sentence and one entity pair are used as input, the output is the probability distribution over the relation labels, and the label with the highest probability is selected as the relation of the entity pair. Every pair of extracted entities is sent to the relation extraction model for relation judgment. To ensure that the extracted relations are correct, the invention sets a high threshold: only when the relation probability is greater than the threshold is the relation accepted as knowledge and stored. When fewer than 2 entities are extracted, relation extraction and entity linking are skipped and the flow ends.
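The control flow of this step can be sketched as below. The classifier is passed in as a callable, since the trained model itself is outside the scope of the example. The names `extract_relations` and `classify`, the threshold value 0.9, and the use of unordered pairs are all assumptions made for illustration; the text only says that the threshold is high.

```python
from itertools import combinations


def extract_relations(sentence, entities, classify, threshold=0.9):
    """Return (head, relation, tail) triples whose top probability exceeds the threshold.

    `classify(sentence, head, tail)` is a hypothetical callable that returns a
    dict mapping each relation label to its probability.
    """
    if len(entities) < 2:
        return []                              # no pair can be formed: skip the step
    accepted = []
    for head, tail in combinations(entities, 2):
        probs = classify(sentence, head, tail)
        label = max(probs, key=probs.get)      # most probable relation label
        if probs[label] > threshold:           # keep only confident predictions
            accepted.append((head, label, tail))
    return accepted
```

If the direction of a relation matters, `itertools.permutations` could replace `combinations` so that each ordered pair is scored separately.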
Four, entity linking
The entity linking part adds one more verification pass over the information. The database contains the names of the companies involved in the public opinion information, and entity linking is performed before the data are stored, which effectively reduces erroneous information produced in the preceding steps and disambiguates entity nodes.
Finally, the method is verified on news public opinion data provided by Hua-Rong-and-Rong (Beijing) Technology Co., Ltd., 1,000 items in total. Entity recognition, relation extraction, entity linking, and so on are carried out according to the flow provided by the method; the valid relations among enterprises are extracted and successfully stored in a MySQL database.
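A minimal sketch of linking by Jaro-Winkler similarity follows (the distance is one minus the similarity). The prefix scale 0.1 and the prefix cap of 4 are the values conventionally used with this measure; the acceptance threshold 0.85 and the function `link_entity` are hypothetical, since the text gives no threshold.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def link_entity(mention, known_names, threshold=0.85):
    """Return the most similar known name, or None if nothing clears the threshold."""
    best = max(known_names, key=lambda name: jaro_winkler(mention, name), default=None)
    if best is not None and jaro_winkler(mention, best) >= threshold:
        return best
    return None
```

Here `known_names` stands for the company names held in the database; a mention that matches none of them closely enough is left unlinked.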
The above description is only a preferred embodiment of the present invention, and is not intended to limit the technical scope of the present invention, so that any minor modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the technical scope of the present invention.
Claims (2)
1. A public opinion information extraction and knowledge base generation method based on natural language processing is characterized in that: the method comprises the following steps:
step one, text preprocessing, comprising character cleaning, word segmentation, and stop-word removal;
character cleaning: a regular-expression matching method is used to unify full-width and half-width characters in the input text and to match and filter punctuation marks;
word segmentation is carried out by adopting an LTP toolkit, and a professional domain dictionary is introduced to improve the word segmentation effect;
stop-word removal: text stop words are removed using a stop-word dictionary specific to the non-performing asset field; this dictionary divides stop words into 3 types: meaningless words; words that occur widely and at high frequency in sentences; and words that carry no meaning within the business system;
step two, named entity recognition
Named entity recognition for the financial information field comprises: recognizing company and organization names and person names, with named entity recognition performed by a neural-network-based method;
in the method, each word is mapped to a dense embedding in a low-dimensional space, the word embeddings serve as the input of the model, features are extracted automatically by a BiLSTM, and the labels of the whole sentence are predicted by a CRF (conditional random field), so that training and prediction of the model form an end-to-end process;
step three, extracting relation
The invention extracts 8 types of relationship in total: an arbitrary relationship, a cooperative relationship, a litigation relationship, a supplier relationship, a stock control relationship, an investment relationship, a debt relationship, and an unknown relationship;
step four, entity linking
After entity and relation extraction is complete, the Jaro-Winkler distance method is adopted: the distance between the linking entity and the target entity is computed to judge whether the two are the same entity, thereby achieving entity disambiguation.
2. The public opinion information extraction and knowledge base generation method based on natural language processing as claimed in claim 1, wherein step three adopts a Bi-GRU model to solve the relation extraction problem in the financial field, and mainly comprises the following steps:
S31, feature extraction: lexical features and syntactic features are extracted from the input sentence; the quality of feature extraction determines the performance of the model; the constructed features are divided into lexical features and syntactic features; the lexical features include the word embedding, denoted $w^{w}$, the part of speech (pos), denoted $w^{f_1}$, and the named entity (ner), denoted $w^{f_2}$; the syntactic features include the dependency type (dep), denoted $w^{f_3}$, the parent node position (parent), denoted $w^{f_4}$, and the relative position of the word (position), denoted $w^{f_5}$;
The lexical and syntactic features are obtained by processing the input sentence with LTP, the natural language processing toolkit of Harbin Institute of Technology; the final feature set is:
Feature Set $= \{w^{w}, w^{f_1}, w^{f_2}, w^{f_3}, w^{f_4}, w^{f_5}, w^{f_6}\}$;
S32, feature embedding: the features from step S31 are converted into vector representations and concatenated; for a sentence $S = \{w_1, w_2, \ldots, w_n\}$, the feature representation of word $w_i$ is the concatenation of $w^{w}$ and the feature vectors $w^{f_j}$,
where $w^{w} = W^{wrd} v^{i}$, $v^{i}$ is the one-hot representation of the current word that selects its column in $W^{wrd}$, and $w^{f_j}$ is the vector representation of the class-$j$ feature;
S33, Bi-GRU model: the vectors from step S32 are passed through a Bi-GRU network, in which the number of hidden-layer nodes is 400 and the dropout ratio is 0.5; the model is trained with 10 samples per batch for 10 iterations to obtain the optimal result;
s34, an attention mechanism; an attention mechanism is introduced, and weight vectors are generated for word levels and sentence levels; a hot-one attention mechanism is adopted to realize automatic optimization of word-level weight;
S35, finally, the result is normalized by a softmax layer to obtain the probability distribution over the relation labels.
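Steps S34 and S35 can be illustrated with a small pure-Python sketch. The claim does not give the attention formula, so the scoring used here (the dot product of a learned query vector with the tanh of each hidden state, normalized by softmax) is an assumption borrowed from common word-level attention, and the names `attention_pool` and `query` are hypothetical. The hidden states stand for the Bi-GRU outputs of S33.

```python
import math


def softmax(scores):
    """Normalize a list of scores into a probability distribution."""
    peak = max(scores)                        # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight per-word hidden states and sum them into one sentence vector.

    hidden_states: one vector per word (e.g. Bi-GRU outputs).
    query: a learned weight vector of the same dimension (assumed).
    """
    scores = [
        sum(q * math.tanh(h) for q, h in zip(query, state))
        for state in hidden_states
    ]
    weights = softmax(scores)                 # word-level attention weights
    dim = len(hidden_states[0])
    pooled = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights
```

The same `softmax` would then be applied to the output-layer scores, one per relation label, to obtain the probability distribution described in S35.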
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911117980.2A CN110990525A (en) | 2019-11-15 | 2019-11-15 | Natural language processing-based public opinion information extraction and knowledge base generation method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110990525A (en) | 2020-04-10 |
Family
ID=70084612
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911117980.2A Pending CN110990525A (en) | 2019-11-15 | 2019-11-15 | Natural language processing-based public opinion information extraction and knowledge base generation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110990525A (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111523323A (en) * | 2020-04-26 | 2020-08-11 | 梁华智能科技(上海)有限公司 | Disambiguation processing method and system for Chinese word segmentation |
CN111539806A (en) * | 2020-04-14 | 2020-08-14 | 鼎富智能科技有限公司 | Method and related device for structuring announcement content |
CN111597804A (en) * | 2020-05-15 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Entity recognition model training method and related device |
CN111695346A (en) * | 2020-06-16 | 2020-09-22 | 广州商品清算中心股份有限公司 | Method for improving public opinion entity recognition rate in financial risk prevention and control field |
CN111723191A (en) * | 2020-05-19 | 2020-09-29 | 天闻数媒科技(北京)有限公司 | Text filtering and extracting method and system based on full-information natural language |
CN112036173A (en) * | 2020-11-09 | 2020-12-04 | 北京读我科技有限公司 | Method and system for processing telemarketing text |
CN112215006A (en) * | 2020-10-22 | 2021-01-12 | 上海交通大学 | Organization named entity normalization method and system |
CN112364002A (en) * | 2020-11-04 | 2021-02-12 | 上海新朋程数据科技发展有限公司 | Modeling method of data analysis model |
CN112380866A (en) * | 2020-11-25 | 2021-02-19 | 厦门市美亚柏科信息股份有限公司 | Text topic label generation method, terminal device and storage medium |
CN112541059A (en) * | 2020-11-05 | 2021-03-23 | 大连中河科技有限公司 | Multi-round intelligent question-answer interaction method applied to tax question-answer system |
CN112800764A (en) * | 2020-12-31 | 2021-05-14 | 江苏网进科技股份有限公司 | Entity extraction method in legal field based on Word2Vec-BiLSTM-CRF model |
CN113326700A (en) * | 2021-02-26 | 2021-08-31 | 西安理工大学 | ALBert-based complex heavy equipment entity extraction method |
CN113609298A (en) * | 2021-08-23 | 2021-11-05 | 南京擎盾信息科技有限公司 | Data processing method and device for court public opinion corpus extraction |
CN114386422A (en) * | 2022-01-14 | 2022-04-22 | 淮安市创新创业科技服务中心 | Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction |
CN114611515A (en) * | 2022-01-28 | 2022-06-10 | 江苏省联合征信有限公司 | Method and system for identifying actual control person of enterprise based on enterprise public opinion information |
CN114647734A (en) * | 2020-12-18 | 2022-06-21 | 同方威视科技江苏有限公司 | Method and device for generating event map of public opinion text, electronic equipment and medium |
WO2022134575A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Service keyword extraction method, apparatus, and device, and storage medium |
CN114897504A (en) * | 2022-05-20 | 2022-08-12 | 北京北大软件工程股份有限公司 | Method, device, storage medium and electronic equipment for processing repeated letters |
WO2022227196A1 (en) * | 2021-04-27 | 2022-11-03 | 平安科技(深圳)有限公司 | Data analysis method and apparatus, computer device, and storage medium |
CN116681065A (en) * | 2023-06-09 | 2023-09-01 | 西藏大学 | Combined extraction system and method for entity relationship in Tibetan medicine field |
CN116776886A (en) * | 2023-08-15 | 2023-09-19 | 浙江同信企业征信服务有限公司 | Information extraction method, device, equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015080561A1 (en) * | 2013-11-27 | 2015-06-04 | Mimos Berhad | A method and system for automated relation discovery from texts |
WO2016054301A1 (en) * | 2014-10-02 | 2016-04-07 | Microsoft Technology Licensing, Llc | Distant supervision relationship extractor |
CN106815293A (en) * | 2016-12-08 | 2017-06-09 | 中国电子科技集团公司第三十二研究所 | System and method for constructing knowledge graph for information analysis |
CN107247739A (en) * | 2017-05-10 | 2017-10-13 | 浙江大学 | A kind of financial publication text knowledge extracting method based on factor graph |
CN107748757A (en) * | 2017-09-21 | 2018-03-02 | 北京航空航天大学 | A kind of answering method of knowledge based collection of illustrative plates |
CN108733792A (en) * | 2018-05-14 | 2018-11-02 | 北京大学深圳研究生院 | A kind of entity relation extraction method |
CN108875051A (en) * | 2018-06-28 | 2018-11-23 | 中译语通科技股份有限公司 | Knowledge mapping method for auto constructing and system towards magnanimity non-structured text |
CN109558492A (en) * | 2018-10-16 | 2019-04-02 | 中山大学 | A kind of listed company's knowledge mapping construction method and device suitable for event attribution |
US20190158524A1 (en) * | 2017-01-30 | 2019-05-23 | Splunk Inc. | Anomaly detection based on information technology environment topology |
CN109871535A (en) * | 2019-01-16 | 2019-06-11 | 四川大学 | A kind of French name entity recognition method based on deep neural network |
Non-Patent Citations (3)
Title |
---|
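Kalyani R. Pole et al.: "Improvised fuzzy clustering using name entity recognition and natural language processing" * |
张兰霞; 胡文心: "Research on person relation extraction from Chinese text based on a bidirectional GRU neural network and a two-layer attention mechanism" * |
鄂海红; 张文静; 肖思琪; 程瑞; 胡莺夕; 周筱松; 牛佩晴: "A survey of deep-learning-based entity relation extraction" * |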
Cited By (31)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111539806A (en) * | 2020-04-14 | 2020-08-14 | 鼎富智能科技有限公司 | Method and related device for structuring announcement content |
CN111523323A (en) * | 2020-04-26 | 2020-08-11 | 梁华智能科技(上海)有限公司 | Disambiguation processing method and system for Chinese word segmentation |
CN111523323B (en) * | 2020-04-26 | 2022-08-12 | 梁华智能科技(上海)有限公司 | Disambiguation processing method and system for Chinese word segmentation |
CN111597804A (en) * | 2020-05-15 | 2020-08-28 | 腾讯科技(深圳)有限公司 | Entity recognition model training method and related device |
CN111723191A (en) * | 2020-05-19 | 2020-09-29 | 天闻数媒科技(北京)有限公司 | Text filtering and extracting method and system based on full-information natural language |
CN111723191B (en) * | 2020-05-19 | 2023-10-27 | 天闻数媒科技(北京)有限公司 | Text filtering and extracting method and system based on full-information natural language |
CN111695346B (en) * | 2020-06-16 | 2024-05-07 | 广州商品清算中心股份有限公司 | Method for improving public opinion entity recognition rate in financial risk prevention and control field |
CN111695346A (en) * | 2020-06-16 | 2020-09-22 | 广州商品清算中心股份有限公司 | Method for improving public opinion entity recognition rate in financial risk prevention and control field |
CN112215006A (en) * | 2020-10-22 | 2021-01-12 | 上海交通大学 | Organization named entity normalization method and system |
CN112215006B (en) * | 2020-10-22 | 2022-08-09 | 上海交通大学 | Organization named entity normalization method and system |
CN112364002A (en) * | 2020-11-04 | 2021-02-12 | 上海新朋程数据科技发展有限公司 | Modeling method of data analysis model |
CN112541059A (en) * | 2020-11-05 | 2021-03-23 | 大连中河科技有限公司 | Multi-round intelligent question-answer interaction method applied to tax question-answer system |
CN112036173A (en) * | 2020-11-09 | 2020-12-04 | 北京读我科技有限公司 | Method and system for processing telemarketing text |
CN112380866A (en) * | 2020-11-25 | 2021-02-19 | 厦门市美亚柏科信息股份有限公司 | Text topic label generation method, terminal device and storage medium |
CN114647734A (en) * | 2020-12-18 | 2022-06-21 | 同方威视科技江苏有限公司 | Method and device for generating event map of public opinion text, electronic equipment and medium |
WO2022134575A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Service keyword extraction method, apparatus, and device, and storage medium |
CN112800764B (en) * | 2020-12-31 | 2023-07-04 | 江苏网进科技股份有限公司 | Entity extraction method in legal field based on Word2Vec-BiLSTM-CRF model |
CN112800764A (en) * | 2020-12-31 | 2021-05-14 | 江苏网进科技股份有限公司 | Entity extraction method in legal field based on Word2Vec-BiLSTM-CRF model |
CN113326700A (en) * | 2021-02-26 | 2021-08-31 | 西安理工大学 | ALBert-based complex heavy equipment entity extraction method |
CN113326700B (en) * | 2021-02-26 | 2024-05-14 | 西安理工大学 | ALBert-based complex heavy equipment entity extraction method |
WO2022227196A1 (en) * | 2021-04-27 | 2022-11-03 | 平安科技(深圳)有限公司 | Data analysis method and apparatus, computer device, and storage medium |
CN113609298A (en) * | 2021-08-23 | 2021-11-05 | 南京擎盾信息科技有限公司 | Data processing method and device for court public opinion corpus extraction |
CN114386422B (en) * | 2022-01-14 | 2023-09-15 | 淮安市创新创业科技服务中心 | Intelligent auxiliary decision-making method and device based on enterprise pollution public opinion extraction |
CN114386422A (en) * | 2022-01-14 | 2022-04-22 | 淮安市创新创业科技服务中心 | Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction |
CN114611515A (en) * | 2022-01-28 | 2022-06-10 | 江苏省联合征信有限公司 | Method and system for identifying actual control person of enterprise based on enterprise public opinion information |
CN114611515B (en) * | 2022-01-28 | 2023-12-12 | 江苏省联合征信有限公司 | Method and system for identifying enterprise actual control person based on enterprise public opinion information |
CN114897504A (en) * | 2022-05-20 | 2022-08-12 | 北京北大软件工程股份有限公司 | Method, device, storage medium and electronic equipment for processing repeated letters |
CN116681065A (en) * | 2023-06-09 | 2023-09-01 | 西藏大学 | Combined extraction system and method for entity relationship in Tibetan medicine field |
CN116681065B (en) * | 2023-06-09 | 2024-01-23 | 西藏大学 | Combined extraction method for entity relationship in Tibetan medicine field |
CN116776886A (en) * | 2023-08-15 | 2023-09-19 | 浙江同信企业征信服务有限公司 | Information extraction method, device, equipment and storage medium |
CN116776886B (en) * | 2023-08-15 | 2023-12-05 | 浙江同信企业征信服务有限公司 | Information extraction method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110990525A (en) | Natural language processing-based public opinion information extraction and knowledge base generation method | |
CN109684440B (en) | Address similarity measurement method based on hierarchical annotation | |
Day et al. | Deep learning for financial sentiment analysis on finance news providers | |
US20200073882A1 (en) | Artificial intelligence based corpus enrichment for knowledge population and query response | |
US20210350080A1 (en) | Systems and methods for deviation detection, information extraction and obligation deviation detection | |
Demir et al. | Improving named entity recognition for morphologically rich languages using word embeddings | |
CN106557462A (en) | Name entity recognition method and system | |
CN106611375A (en) | Text analysis-based credit risk assessment method and apparatus | |
CN109886270B (en) | Case element identification method for electronic file record text | |
CN113204967B (en) | Resume named entity identification method and system | |
Yan et al. | Neural network based relation extraction of enterprises in credit risk management | |
Kapusta et al. | Comparison of fake and real news based on morphological analysis | |
CN113255321A (en) | Financial field chapter-level event extraction method based on article entity word dependency relationship | |
CN113919366A (en) | Semantic matching method and device for power transformer knowledge question answering | |
CN111462752A (en) | Client intention identification method based on attention mechanism, feature embedding and BI-LSTM | |
Li et al. | A method for resume information extraction using bert-bilstm-crf | |
CN113869054A (en) | Deep learning-based electric power field project feature identification method | |
CN117077682B (en) | Document analysis method and system based on semantic recognition | |
Chen et al. | From natural language to accounting entries using a natural language processing method | |
Sanyal et al. | Natural language processing technique for generation of SQL queries dynamically | |
Zhang et al. | Sentiment identification by incorporating syntax, semantics and context information | |
Zhu | Financial data analysis application via multi-strategy text processing | |
Jishtu et al. | Prediction of the stock market based on machine learning and sentiment analysis | |
CN113051396A (en) | Document classification identification method and device and electronic equipment | |
Sirirattanajakarin et al. | BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | ||
Application publication date: 20200410 |