CN110990525A - Natural language processing-based public opinion information extraction and knowledge base generation method

Info

Publication number
CN110990525A
Authority
CN
China
Prior art keywords
entity
word
relationship
extraction
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911117980.2A
Other languages
Chinese (zh)
Inventor
路世伦
闫晨巍
仵伟强
周金黄
钟丽莉
万谊强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huarong Rongtong Beijing Technology Co ltd
Original Assignee
Huarong Rongtong Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huarong Rongtong Beijing Technology Co ltd
Priority to CN201911117980.2A
Publication of CN110990525A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3335 Syntactic pre-processing, e.g. stopword elimination, stemming
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a natural language processing-based method for public opinion information extraction and knowledge base generation, comprising the following steps: first, text preprocessing; second, named entity recognition, which identifies company names, organization names, and person names using a neural-network-based method; third, relation extraction, which extracts six types of relations in the financial field using a feature layer + GRU + Attention model; fourth, entity linking, which uses the Jaro-Winkler distance to compute the distance between a linked entity and the target entity and judge whether they are the same entity, thereby achieving entity disambiguation. The method combines an end-to-end model with a feature-extraction input model to build a one-stop pipeline from unstructured financial text to structured data storage. It makes full use of the context in financial news, extracts knowledge with fewer parameters and faster training and prediction, and performs well on financial public opinion information.

Description

Natural language processing-based public opinion information extraction and knowledge base generation method
Technical Field
The invention relates to a natural language processing-based method for public opinion information extraction and knowledge base generation. It involves entity recognition, relation extraction, entity linking, and related techniques in the financial information field, and in particular provides a complete pipeline from information extraction to knowledge generation for enterprise public opinion news.
Background
As investors diversify and enterprises increasingly operate as conglomerates, relationships among enterprises are becoming more complex; they cross regions and industries and are often well concealed. In financial institutions such as commercial banks, if an enterprise deliberately conceals such relationships when applying for a loan, the bank has difficulty obtaining the true picture, which leads to excessive and repeated credit extension and increases the bank's credit risk. Fully identifying the association relationships between enterprises and gaining a more complete view of customer information is therefore an important way to reduce credit risk.
At present, enterprise association data comes mainly from structured data provided by enterprises and data service providers, such as the National Enterprise Credit Information Publicity System. Because this information is updated slowly, credit officers conducting due diligence also draw on more timely sources, such as judicial data from China Judgments Online and public opinion data from enterprise news reports, as an important supplement for enterprise association relationships. However, public opinion information exists as unstructured text, and few techniques and tools are available for mining it. Credit officers often rely on manual browsing and querying, which limits the depth and efficiency of investigation and makes it hard to meet financial institutions' growing need for dynamic queries of group-customer association information. Introducing richer data such as public opinion information, and mining and storing the knowledge automatically, is therefore an urgent need in the credit risk field.
Disclosure of Invention
In view of the above problems, the object of the invention is to provide a natural language processing-based method for public opinion information extraction and knowledge base generation. The method recasts the mining of enterprise association information as an information extraction task in natural language processing: it finds and models the characteristics of that information in unstructured text, and automatically extracts enterprise entities and the association relationships between enterprises as an important supplement to existing structured data, providing stronger support for credit risk management.
The method takes unstructured text on the Internet, such as enterprise news reports, as its data source, builds an entity recognition model, a relation extraction model, and an entity linking model, and completes information extraction and mapped storage of the text. First, the unstructured text data is preprocessed to remove interfering information such as symbols and stop words, yielding cleaned data. Next, the data is analyzed by word segmentation, part-of-speech tagging, and similar steps; a model is built to extract the named entities in the text, and a relation extraction model is built to extract the relations between entities. Finally, the extracted entities and relations are mapped into the knowledge base through entity linking, completing the generation and updating of the knowledge base.
To achieve the above object, the invention provides a natural language processing-based method for public opinion information extraction and knowledge base generation, comprising the following steps:
Step one, text preprocessing
Text preprocessing mainly comprises character cleaning, word segmentation, and stop-word removal.
Character cleaning: regular-expression matching is used to unify full-width and half-width characters in the input text and to match and filter punctuation marks.
Word segmentation: the LTP toolkit is used for word segmentation, and a domain-specific dictionary is introduced to improve the segmentation result.
Stop-word removal: text stop words are removed using a stop-word dictionary specific to the non-performing-asset domain. This dictionary divides stop words into 3 types: the first is words with no meaning of their own, such as "and" and "are"; the second is words that occur widely and at high frequency in sentences; the third is vocabulary that carries no meaning within the business system.
Step two, named entity recognition
Named entity recognition for the financial information field mainly identifies company names, organization names, and person names. The invention uses a neural-network-based method to perform named entity recognition.
In this method, each word is mapped to a dense embedding in a low-dimensional space; the word embeddings are used as the model input, a BiLSTM extracts features automatically, and a CRF predicts the labels for the whole sentence. Training and prediction thus become a single end-to-end process rather than a traditional pipeline, removing the dependence on feature engineering. The labeled data come from the 1998 People's Daily corpus, which covers three entity types: person name (nr), place name (ns), and organization name (nt). B, M, and E mark the first, middle, and last word of an entity, respectively. For example, B-nr, M-nr, and E-nr mark the first word, a middle word (neither first nor last), and the last word of a person name, and O marks a word that is not part of any named entity.
Step three, relation extraction
Relation extraction is one of the important research tasks in natural language processing. Given entity 1 ($e_1$) and entity 2 ($e_2$), the process of obtaining the relation ($r$) between the two entities in a piece of text can be written as $r \rightarrow (e_1, e_2)$. The invention extracts 8 types of relations in total: job (employment), cooperation, litigation, supplier, stock control (controlling shareholding), investment, debt, and Unknown. For example, given the text "In 2002, the cash-strapped Hyundai Hynix sold its TFT-LCD division in its entirety to BOE Group for 380 million US dollars", where entity 1 is "Hyundai Hynix" and entity 2 is "BOE Group", analyzing the semantic distribution of the sentence and the logical relationships between its words yields the relation "acquisition" between BOE Group and Hyundai Hynix.
The invention provides a Bi-GRU model for relation extraction in the financial field, which mainly comprises the following steps:
S31, feature extraction. Lexical features and syntactic features are extracted from the input sentence. The quality of feature extraction determines the performance of the model. The lexical features comprise the word embedding (denoted $w_w$), the part of speech (pos, denoted $w_{f1}$), and the named entity type (ner, denoted $w_{f2}$). The syntactic features comprise the dependency type (dep, denoted $w_{f3}$), the parent node position (denoted $w_{f4}$), and the relative position of the word (position feature, denoted $w_{f5}$).
The lexical and syntactic features are obtained by processing the input sentence with LTP, the natural language processing toolkit from Harbin Institute of Technology. The final feature set is:
Feature Set = $\{w_w, w_{f1}, w_{f2}, w_{f3}, w_{f4}, w_{f5}, w_{f6}\}$
S32, feature embedding. Feature embedding converts the features from step S31 into vector representations and concatenates them. For a sentence $S = \{w_1, w_2, \dots, w_n\}$, the feature representation of word $w_i$ is expressed as follows:
[Formula image not reproduced]
where $w_w = W_{wrd} v_i$, $v_i$ is the one-hot vector that selects the column of $W_{wrd}$ corresponding to the current word, and $w_{fj}$ is the vector representation of the j-th class of features.
S33, Bi-GRU model. The vectors from step S32 are passed through a Bi-GRU network with 400 hidden-layer nodes and a dropout ratio of 0.5. With a batch size of 10, the best result is obtained after 10 training iterations.
S34, attention mechanism. An attention mechanism is introduced to generate weight vectors at the word level and the sentence level. The invention adopts a hot-one attention mechanism to optimize the word-level weights automatically.
S35, finally, a softmax layer normalizes the result to obtain the probability distribution over relation labels.
Step four, entity linking
After entities and relations have been extracted, the key problem is how to connect the extracted entities with the real information in the knowledge base. The invention uses the Jaro-Winkler distance: by computing the distance between a linked entity and the target entity, it judges whether they are the same entity, thereby achieving entity disambiguation.
Compared with the prior art, the method of the invention has the following advantages: (1) compared with standalone named entity recognition and relation extraction modules, the invention integrates the whole public opinion information extraction flow, taking raw text as input and producing structured knowledge as output, and thus realizes an end-to-end model; (2) compared with obtaining word vectors from a traditional one-hot model, training word vectors with a deep learning method avoids the curse of dimensionality in word-vector representation, fully exploits the context of each word, and captures the relations between words; (3) the Bi-GRU-based relation extraction model is lightweight, with fewer parameters and faster training, and performs well on financial public opinion information.
Drawings
FIG. 1 is a block flow diagram of the method of the present invention.
Fig. 2 shows a GRU update mechanism.
Detailed Description
The technical solution of the present invention is further described below with reference to specific examples.
The invention discloses a public opinion information extraction and knowledge base generation method based on natural language processing, which comprises the following steps as shown in figure 1:
Step one, text preprocessing
Text preprocessing is a basic and necessary step in processing unstructured data. By removing characters that carry no entity semantics and filtering stop words after word segmentation, it reduces the negative effect of noisy data as far as possible and improves the generalization ability of the model. The preprocessing methods chosen are therefore character cleaning, word segmentation, and stop-word removal.
Character cleaning. Characters such as commas, periods, and quotation marks indicate pauses and connections within sentences; they carry no real meaning for semantic analysis and can be treated as useless characters, usually handled by match-and-filter. The invention uses regular-expression matching to unify full-width and half-width characters in the input text and to match and filter punctuation marks.
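The character-cleaning step can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names `to_halfwidth` and `clean` and the set of CJK punctuation marks are hypothetical, and the filter removes every punctuation mark, including decimal points.

```python
import re
import string

# Full-width ASCII variants occupy U+FF01..U+FF5E, exactly 0xFEE0 above
# their half-width counterparts; the ideographic space is U+3000.
FULLWIDTH_OFFSET = 0xFEE0

def to_halfwidth(text: str) -> str:
    """Map full-width characters to their half-width equivalents."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width ASCII block
            out.append(chr(code - FULLWIDTH_OFFSET))
        else:
            out.append(ch)
    return "".join(out)

# Hypothetical punctuation set: ASCII punctuation plus a few common CJK marks.
CJK_PUNCT = "。、“”‘’《》【】"
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + CJK_PUNCT) + "]")

def clean(text: str) -> str:
    """Unify character width, then filter punctuation by regex matching."""
    return PUNCT_RE.sub("", to_halfwidth(text))

print(clean("ＢＯＥ　Ｇｒｏｕｐ，２００２！"))
```

Width unification runs first so that full-width punctuation is converted to ASCII before the single regular expression filters it.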
Word segmentation. In Chinese text, the word is the smallest semantic unit. In text processing, a sentence is usually segmented into a sequence of words that represents the original sentence. Commonly used word segmentation tools include jieba, the LTP natural language processing toolkit from Harbin Institute of Technology, and Stanford's toolkit. After weighing the accuracy, performance, and segmentation granularity of the different tools, the invention uses the LTP toolkit for word segmentation and introduces a domain-specific dictionary to improve the result.
Stop-word removal. Before the semantics of a sentence are represented by its words, an important step is usually to remove certain characters and words, collectively called stop words. Stop words fall into two types. The first is function words such as "the" and "in" and modal particles, which have no definite meaning of their own in a sentence and serve as connectives or mood markers that support the other words. The second is words that occur so widely and frequently in sentences that they contribute nothing to representing a sentence's semantics. Removing both types raises the density of keywords and makes it easier to capture the semantic information of the sentence.
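Stop-word filtering over an already segmented sentence can be sketched as below. The dictionary entries are hypothetical placeholders grouped by category; the actual domain dictionary is not given in the text, and the segmentation itself (done with LTP in the patent) is assumed to have happened upstream.

```python
# Hypothetical stop-word dictionary; the entries are illustrative only.
STOP_WORDS = {
    "function": {"的", "和", "是", "在"},       # words with no meaning of their own
    "high_frequency": {"记者", "报道"},          # widely occurring, high-frequency words
    "business_system": {"编号", "备注"},         # words meaningless in the business system
}
ALL_STOP_WORDS = frozenset().union(*STOP_WORDS.values())

def remove_stop_words(tokens):
    """Keep only the tokens that are not in any stop-word category."""
    return [t for t in tokens if t not in ALL_STOP_WORDS]

tokens = ["京东方", "是", "现代海力士", "的", "买方"]
print(remove_stop_words(tokens))
```

Merging the categories into one frozen set keeps each lookup constant-time while the grouped dictionary stays easy to maintain.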
Step two, named entity recognition
Named entity recognition for the financial information field mainly identifies company names, organization names, and person names. The invention uses a neural-network-based method to perform named entity recognition.
This data-driven method first maps each word to a dense embedding in a low-dimensional space, then uses the word embeddings as the model input, extracts features automatically with a neural network, and predicts the label of each word with softmax. Training thus becomes a single end-to-end process rather than a traditional pipeline, and does not depend on feature engineering.
The invention adopts a word-based BiLSTM + CRF model. The labeled data come from the 1998 People's Daily corpus, which covers three entity types: person name (nr), place name (ns), and organization name (nt). B, M, and E mark the first, middle, and last word of an entity, respectively. For example, B-nr, M-nr, and E-nr mark the first word, a middle word (neither first nor last), and the last word of a person name, and O marks a word that is not part of any named entity.
The model has three layers: an embedding layer, a BiLSTM layer, and a CRF layer.
(1) Embedding layer: also called the lookup layer. It maps each word in the input text to a vector representation in a low-dimensional space, which is the input to the next layer.
(2) BiLSTM layer: exploits the LSTM's strength on sequential text to classify each word and output the tag with the highest probability. A bidirectional LSTM makes better use of sentence-level semantic features and captures regularities in how entities are composed, for example that some organization entities end in "limited company".
(3) CRF layer: models the relationships between tags, improving the accuracy of named entity recognition.
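To illustrate why modeling tag-to-tag relationships helps, the sketch below runs Viterbi decoding over per-position emission scores (as a BiLSTM layer might produce) and tag transition scores (as a CRF layer might learn). All scores are hypothetical, and the function is a generic decoder, not the patent's code.

```python
def viterbi(emissions, transitions, tags):
    """Return the highest-scoring tag sequence.

    emissions[t][j]: score of tag j at position t.
    transitions[i][j]: score of moving from tag i to tag j.
    """
    n_tags = len(tags)
    score = list(emissions[0])
    backptr = []
    for emit in emissions[1:]:
        prev_best, new_score = [], []
        for j in range(n_tags):
            # Best previous tag for arriving at tag j.
            i_best = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            prev_best.append(i_best)
            new_score.append(score[i_best] + transitions[i_best][j] + emit[j])
        backptr.append(prev_best)
        score = new_score
    # Trace the best path backwards from the best final tag.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for prev_best in reversed(backptr):
        path.append(prev_best[path[-1]])
    path.reverse()
    return [tags[j] for j in path]

tags = ["O", "B-nt", "E-nt"]
# Hypothetical transition scores: -10 marks transitions that should not occur,
# such as O -> E-nt or B-nt -> O.
transitions = [
    [0.0, 0.0, -10.0],
    [-10.0, -10.0, 0.0],
    [0.0, 0.0, -10.0],
]
emissions = [
    [2.0, 0.5, 0.0],
    [0.2, 1.5, 0.1],
    [1.0, 0.1, 0.8],
]
print(viterbi(emissions, transitions, tags))
```

Choosing the best tag at each position independently would label the last position O and leave the B-nt tag unclosed; the transition scores steer the decoder to a well-formed sequence.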
Step three, relation extraction
Relation extraction is one of the important research tasks in natural language processing. Given entity 1 ($e_1$) and entity 2 ($e_2$), the process of obtaining the relation ($r$) between the two entities in a piece of text can be written as $r \rightarrow (e_1, e_2)$. For example, given the text "In 2002, the cash-strapped Hyundai Hynix sold its TFT-LCD division in its entirety to BOE Group for 380 million US dollars", where entity 1 is "Hyundai Hynix" and entity 2 is "BOE Group", analyzing the semantic distribution of the sentence and the logical relationships between its words yields the relation "acquisition" between BOE Group and Hyundai Hynix.
The invention provides a Bi-GRU model for relation extraction in the financial field, which mainly comprises the following steps:
S31, feature extraction. Lexical features and syntactic features are extracted from the input sentence. The quality of feature extraction determines the performance of the model. The lexical features comprise the word embedding (denoted $w_w$), the part of speech (pos, denoted $w_{f1}$), and the named entity type (ner, denoted $w_{f2}$). The syntactic features comprise the dependency type (dep, denoted $w_{f3}$), the parent node position (denoted $w_{f4}$), and the relative position of the word (position feature, denoted $w_{f5}$).
The lexical and syntactic features are obtained by processing the input sentence with LTP, the natural language processing toolkit from Harbin Institute of Technology. The final feature set is:
Feature Set = $\{w_w, w_{f1}, w_{f2}, w_{f3}, w_{f4}, w_{f5}, w_{f6}\}$
S32, feature embedding. Feature embedding converts the features from step S31 into vector representations and concatenates them, so that a sentence is converted entirely into vectors built from its features. A word embedding is a low-dimensional vector representation of a word. Consider a sentence $S = \{w_1, w_2, \dots, w_n\}$, where $n$ is the number of words in the sentence.
$W_{wrd} \in \mathbb{R}^{d_w \times V}$
is the embedding matrix, where $d_w$ is the user-defined dimension of the word embeddings and $V$ is the total number of words. Each word in the sentence is converted to its word vector by the following formula:
$w_w = W_{wrd} v_i$
where $v_i$ is the one-hot vector that selects the column of $W_{wrd}$ corresponding to the current word.
In addition, $w_{fj}$ is the vector representation of features such as part of speech, named entity, and dependency type, where $j$ denotes the j-th class of features. The part-of-speech, named-entity, and dependency-type features are one-hot vectors generated from the LTP analysis results; the parent node position and relative position (PF) features are randomly initialized.
Then, for a sentence $S = \{w_1, w_2, \dots, w_n\}$, the feature representation of word $w_i$ is expressed as follows:
[Formula images not reproduced; they define the representation of $w_i$ in terms of the word vector of $w_i$ and the vector representation of the j-th class of features.]
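The lookup and concatenation described in S32 can be sketched in plain Python. The matrix values, dimensions, and function names here are hypothetical; the sketch only shows that multiplying the embedding matrix by a one-hot vector selects one column, and that the per-word representation is the word vector followed by the feature vectors.

```python
def one_hot(index, size):
    """Return a one-hot vector of the given size."""
    return [1.0 if k == index else 0.0 for k in range(size)]

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

# Hypothetical embedding matrix W_wrd with d_w = 2 rows and V = 3 columns.
W_WRD = [[0.1, 0.4, 0.7],
         [0.2, 0.5, 0.8]]
VOCAB_SIZE = 3

def word_vector(word_index):
    # w_w = W_wrd v_i: the one-hot vector v_i selects column word_index.
    return matvec(W_WRD, one_hot(word_index, VOCAB_SIZE))

def token_representation(word_index, feature_vectors):
    """Concatenate the word vector with each feature vector, in order."""
    rep = word_vector(word_index)
    for feature in feature_vectors:
        rep.extend(feature)
    return rep

pos = one_hot(2, 3)   # hypothetical part-of-speech one-hot
ner = one_hot(0, 2)   # hypothetical named-entity one-hot
print(token_representation(1, [pos, ner]))
```

In practice the multiplication is replaced by a direct column lookup, which gives the same result without the arithmetic.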
S33, Bi-GRU model. The vectors from step S32 are passed through a Bi-GRU network to generate high-dimensional vectors. Both LSTM and GRU are variants of the RNN. The long short-term memory (LSTM) model carries two states, a long-term state that can be passed along stably and a short-term state, and adds three gates (forget gate, input gate, and output gate) that let information pass selectively so as to control and protect the long-term state. With this design, the LSTM avoids the long-term dependency problem.
The GRU simplifies the internal design of the LSTM. It merges the long-term and short-term states, and merges the forget gate and input gate into an update gate, which determines how much previous information is retained; a reset gate determines how previous information is combined with the current input, as shown in Fig. 2.
In Fig. 2, $x_t$ is the current input, $h_{t-1}$ is the output at the previous time step, $h_t$ is the output at the current time step, and $\sigma$ is the activation function. $r_t$ is the reset gate, which controls how much of the previous state information is ignored; the smaller its value, the more is ignored. $z_t$ is the update gate, which controls how much of the previous state information is carried into the current state; the larger its value, the more is carried in. The two gates protect and control the flow of information from the previous hidden state $h_{t-1}$ to the new hidden state $h_t$.
Equations (1)-(4) give the reset gate $r_t$, the update gate $z_t$, the candidate hidden state $\tilde{h}_t$, and the hidden state $h_t$ at the current time step.
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$ (1)
$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$ (2)
[Equation (3), candidate hidden state $\tilde{h}_t$: formula image not reproduced]
[Equation (4), hidden state $h_t$: formula image not reproduced]
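The formulas for the candidate hidden state and the current hidden state survive only as image references in this text. For orientation, a commonly used GRU formulation is given below as an assumption, not as the patent's own equations. It uses the same bracket notation as equations (1) and (2); the weight name $W_{\tilde{h}}$ is hypothetical, and the update convention is chosen so that a larger $z_t$ carries more of the previous state forward, as the description of the update gate states.

```latex
\begin{aligned}
\tilde{h}_t &= \tanh\left(W_{\tilde{h}} \cdot [\, r_t \odot h_{t-1},\; x_t \,]\right) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t
\end{aligned}
```

Some references swap the roles of $z_t$ and $1 - z_t$ in the second line; the two conventions are equivalent up to relabeling the gate.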
The basic unit of the Bi-GRU model consists of a forward-propagating GRU unit and a backward-propagating GRU unit; the structure of the GRU unit is shown in Fig. 2. When processing a sequence, the model can take both forward and backward information into account, and the outputs of the two units are spliced together in the output part. For the i-th word, the output is given by the following formula:
[Formula images not reproduced: the output for the i-th word combines the forward and backward GRU outputs.]
Compared with the LSTM, the GRU is simpler, has fewer parameters, and trains faster, while performing comparably, and it performs well even on smaller data sets.
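A dependency-free sketch of a bidirectional GRU pass is shown below. It assumes bias-free gates acting on the concatenation $[h_{t-1}, x_t]$ as in equations (1) and (2), a standard tanh candidate state, an update convention in which a larger gate value keeps more of the previous state, and per-token concatenation of forward and backward outputs. All function names and weights are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_r, W_z, W_h):
    """One GRU update; each weight matrix acts on the concatenation [h, x]."""
    hx = h_prev + x                                   # list concatenation
    r = [sigmoid(a) for a in matvec(W_r, hx)]         # reset gate
    z = [sigmoid(a) for a in matvec(W_z, hx)]         # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x
    h_cand = [math.tanh(a) for a in matvec(W_h, gated)]
    # Assumed convention: a larger z keeps more of the previous state.
    return [zi * hp + (1.0 - zi) * hc for zi, hp, hc in zip(z, h_prev, h_cand)]

def run_gru(xs, params, hidden):
    """Run a GRU over the sequence xs, returning the state after each step."""
    h, outs = [0.0] * hidden, []
    for x in xs:
        h = gru_step(h, x, *params)
        outs.append(h)
    return outs

def bi_gru(xs, fwd_params, bwd_params, hidden):
    """Concatenate forward and backward GRU outputs for each position."""
    fwd = run_gru(xs, fwd_params, hidden)
    bwd = run_gru(xs[::-1], bwd_params, hidden)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]

# Hypothetical weights: hidden size 2, input size 1, so each matrix is 2 x 3.
W = [[0.1, 0.1, 0.1], [0.1, 0.1, 0.1]]
params = (W, W, W)
outputs = bi_gru([[1.0], [0.5], [1.0]], params, params, hidden=2)
print(len(outputs), len(outputs[0]))
```

Each position ends up with a vector of twice the hidden size, holding context from both the left and the right of the word.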
S34, attention mechanism. An attention mechanism is introduced to generate weight vectors at the word level and the sentence level.
When the input sequence is long, the model has difficulty retaining all the important information, and its performance degrades. The attention mechanism addresses this problem. It retains the GRU's intermediate outputs $h_i$ over the input sequence, learns to weight these intermediate results selectively as inputs to the attention layer, and associates them with the GRU's output sequence when producing the output. Although the attention mechanism increases the amount of computation, it improves the model's performance.
S35, finally, a softmax layer normalizes the result to obtain the probability distribution over relation labels.
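The text does not give the attention formula, so the sketch below assumes one common word-level form: each hidden state $h_i$ is scored by a learned vector applied to $\tanh(h_i)$, the scores are normalized with softmax, and the weighted sum is classified by a softmax output layer. The weights, function names, and the three labels used are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, w):
    """Weight each hidden state h_i, then sum into one sentence vector."""
    scores = [sum(wk * math.tanh(hk) for wk, hk in zip(w, h)) for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return pooled, alphas

def classify(pooled, W_out, labels):
    """Softmax output layer: probability distribution over relation labels."""
    logits = [sum(wk * pk for wk, pk in zip(row, pooled)) for row in W_out]
    return dict(zip(labels, softmax(logits)))

hidden_states = [[0.1, 0.2], [0.9, 0.8], [0.0, -0.1]]
pooled, alphas = attention_pool(hidden_states, w=[1.0, 1.0])
probs = classify(pooled, [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
                 ["investment", "litigation", "Unknown"])
print(alphas, probs)
```

The attention weights sum to one, so the pooled vector is a convex combination of the hidden states dominated by the positions with the highest scores.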
Step four, entity linking
After entities and relations have been extracted, the key problem is how to connect the extracted entities with the real information in the knowledge base. The invention uses the Jaro-Winkler distance: by computing the distance between a linked entity and the target entity, it judges whether they are the same entity, thereby achieving entity disambiguation.
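A self-contained sketch of the Jaro-Winkler measure follows. It computes the similarity (1.0 for identical strings), from which a distance can be taken as one minus the similarity. The prefix scale 0.1 and the prefix cap of 4 are the customary defaults; the decision threshold in `same_entity` is a hypothetical value, since the text does not state one.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)

def same_entity(name, candidate, threshold=0.9):
    """Hypothetical linking rule: link when the similarity clears a threshold."""
    return jaro_winkler(name, candidate) >= threshold

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost favors names that agree at the start, which suits company names whose distinguishing part usually comes first.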
The technical solution of the invention is described below with an embodiment, verified on news public opinion data provided by a financial technology company. The training data comprise the 1998 People's Daily corpus and news data from the Tushare platform.
1. Text preprocessing
The Tushare data is processed as follows: only Chinese text is kept for word segmentation; the text is segmented with the LTP segmenter, with reference to a vocabulary of the non-performing-asset domain; after the stop-word list is loaded, stop words are removed from the segmented text.
2. Named entity recognition
To train the named entity recognition model, the invention uses the 1998 People's Daily corpus, which consists of People's Daily text from January to June 1998 with manually annotated parts of speech and is commonly used for natural language processing tasks such as word segmentation, part-of-speech tagging, and named entity recognition.
The corpus is labeled at the word level, so it is processed first: the items labeled ns, nr, and nt are split one step further into finer-grained units, and B, M, E information is added to the labels of the resulting units. All other parts of speech are labeled O, producing a new training corpus file.
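The relabeling step can be sketched as below, assuming the finer-grained units are single characters, which the text does not state explicitly. The function name `to_bme` is hypothetical, and labeling a one-character entity as B is an assumption, because the tag scheme described has no separate single-unit tag.

```python
ENTITY_TAGS = {"nr", "ns", "nt"}

def to_bme(tagged_words):
    """Expand (word, tag) pairs into per-character (char, label) pairs."""
    out = []
    for word, tag in tagged_words:
        if tag not in ENTITY_TAGS:
            # Every non-entity part of speech collapses to O.
            out.extend((ch, "O") for ch in word)
            continue
        for k, ch in enumerate(word):
            if k == 0:
                prefix = "B"          # first unit (also used for 1-unit entities)
            elif k == len(word) - 1:
                prefix = "E"          # last unit
            else:
                prefix = "M"          # any middle unit
            out.append((ch, f"{prefix}-{tag}"))
    return out

print(to_bme([("京东方", "nt"), ("宣布", "v")]))
```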
The model was trained on the new corpus file and achieved F1 = 0.90 on the test set. From the recognition results, person names and organization names are output; place names are discarded because they have little bearing on the present task.
3. Relation extraction
The relation extraction model uses news data from the Tushare platform as its data source. Tushare is a free, open-source Python financial data interface package, and news data is one of the data types it provides. It mainly collects news from news websites, including real-time news from Sina Finance, Wall Street News (Wallstreetcn), and other sources. The invention annotates the provided data manually, labeling 1000 items in total for training and testing, and obtains the relation extraction model.
In processing, the result of the preceding named entity recognition step determines whether the relation extraction flow is entered. When at least 2 entities are extracted, the relation extraction module is entered. The current sentence and one entity pair are the input; the output is the probability distribution over relation labels, and the label with the highest probability is taken as the relation of the pair. Every pair of extracted entities is sent to the relation extraction model. To ensure that the extracted relations are correct, the invention sets a relatively high threshold: only when the relation probability exceeds the threshold is the relation stored as knowledge. When fewer than 2 entities are extracted, relation extraction and entity linking are skipped and the flow ends.
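The control flow of this step can be sketched as follows. The function name, the `predict` callback standing in for the trained model, and the threshold value are all hypothetical; the text only says that the threshold is set high.

```python
from itertools import combinations

THRESHOLD = 0.8  # hypothetical value; the text gives no number

def extract_relations(sentence, entities, predict, threshold=THRESHOLD):
    """Return (e1, e2, label, prob) for each entity pair that clears the threshold."""
    if len(entities) < 2:
        return []  # skip relation extraction and entity linking
    accepted = []
    for e1, e2 in combinations(entities, 2):
        probs = predict(sentence, e1, e2)      # dict: label -> probability
        label = max(probs, key=probs.get)      # most probable relation label
        if probs[label] > threshold:
            accepted.append((e1, e2, label, probs[label]))
    return accepted

def fake_predict(sentence, e1, e2):
    """Stand-in for the trained relation extraction model."""
    if {e1, e2} == {"A", "B"}:
        return {"investment": 0.9, "Unknown": 0.1}
    return {"investment": 0.4, "Unknown": 0.6}

print(extract_relations("example sentence", ["A", "B", "C"], fake_predict))
```

Only the pair whose best label exceeds the threshold is kept, so low-confidence predictions never reach the knowledge base.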
Four, entity linking
The entity linking part amounts to one additional verification of the information. The database contains the names of companies related to the public opinion information. Performing entity linking before the data are stored effectively reduces erroneous information produced in the earlier steps and disambiguates entity nodes.
Finally, the method was validated on news public opinion data provided by Huarong Rongtong (Beijing) Technology Co., Ltd., 1000 items in total. Entity recognition, relation extraction, entity linking and the other steps were carried out following the flow proposed by the method; valid relations between enterprises were extracted and successfully stored in a MySQL database.
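The claims name the Jaro-Winkler distance as the measure used for this check. A self-contained sketch of linking a recognized name against the stored company names is given below. It computes the Jaro-Winkler similarity (1.0 means identical; the distance is 1 minus this value). The acceptance cutoff `min_sim=0.9`, the function `link_entity`, and the sample company names are assumptions for illustration, not values from the source.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2  # number of transpositions
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1, s2, p=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return sim + prefix * p * (1 - sim)


def link_entity(mention, company_names, min_sim=0.9):
    """Return the closest known name, or None if none is similar enough."""
    best = max(company_names, key=lambda n: jaro_winkler(mention, n), default=None)
    if best is not None and jaro_winkler(mention, best) >= min_sim:
        return best
    return None


# Hypothetical company table and mentions.
known = ["Alpha Holdings Ltd", "Beta Capital"]
linked = link_entity("Alpha Holding Ltd", known)
unlinked = link_entity("Gamma Foods", known)
```

A mention that returns `None` would be treated as not present in the database; what the method does with such mentions is not specified in the source.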
The above description is only a preferred embodiment of the present invention, and is not intended to limit the technical scope of the present invention, so that any minor modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the technical scope of the present invention.

Claims (2)

1. A public opinion information extraction and knowledge base generation method based on natural language processing, characterized in that the method comprises the following steps:
step one, text preprocessing, comprising character cleaning, word segmentation, and stop-word removal;
character cleaning: a regular-expression matching method is used to unify full-width and half-width characters in the input text and to match and filter punctuation marks;
word segmentation: the LTP toolkit is used, and a domain-specific dictionary is introduced to improve segmentation quality;
stop-word removal: stop words are removed from the text using a stop-word dictionary dedicated to the non-performing asset field; this dictionary divides stop words into 3 types: meaningless words; words that occur widely in sentences with high frequency; and words that carry no meaning within the business system;
step two, named entity recognition
Named entity recognition for the financial information field comprises: recognizing company and organization names and person names, with named entity recognition performed by a neural-network-based method;
in this method, each word is mapped to a dense embedding in a low-dimensional space; the word embedding is then used as the input of the model, features are extracted automatically by a BiLSTM, and the labels of the whole sentence are predicted by a CRF (conditional random field), so that training and prediction of the model form a single end-to-end process;
step three, relation extraction
The invention extracts 8 types of relation in total: an arbitrary relationship, a cooperative relationship, a litigation relationship, a supplier relationship, a controlling-shareholding relationship, an investment relationship, a debt relationship, and an unknown relationship;
step four, entity linking
After entity and relation extraction is completed, the Jaro-Winkler distance method is used: the distance between the entity to be linked and the target entity is calculated to judge whether they are the same entity, thereby achieving entity disambiguation.
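The character cleaning and stop-word removal of step one can be sketched as follows. This is an illustration under stated assumptions: width unification is done here by code-point arithmetic rather than by a regular expression, the punctuation filter removes every character that is neither a word character nor whitespace (the source does not list the exact set), and the tokens and stop words in the example are made up. Word segmentation with LTP is not shown.

```python
import re

# Assumption: "punctuation" means any character that is not a word
# character or whitespace; the source does not give the exact set.
PUNCT_RE = re.compile(r"[^\w\s]")


def to_half_width(text):
    """Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic
    space (U+3000) to their half-width counterparts."""
    chars = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        chars.append(chr(code))
    return "".join(chars)


def clean(text):
    """Unify character width, then filter punctuation."""
    return PUNCT_RE.sub("", to_half_width(text))


def remove_stop_words(tokens, stop_words):
    """Drop tokens found in the stop-word dictionary."""
    return [tok for tok in tokens if tok not in stop_words]


cleaned = clean("Ｈｅｌｌｏ，　ｗｏｒｌｄ！")
kept = remove_stop_words(["公司", "的", "公告"], {"的"})
```

In a full pipeline the three stop-word categories named in the claim would simply be merged into the one `stop_words` set passed to `remove_stop_words`.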
2. The public opinion information extraction and knowledge base generation method based on natural language processing as claimed in claim 1, characterized in that step three uses a Bi-GRU model to solve the relation extraction problem in the financial field, and mainly comprises the following steps:
S31, feature extraction: lexical features and syntactic features are extracted from the input sentence; the quality of feature extraction determines the performance of the model. The constructed features are divided into lexical features and syntactic features. The lexical features include the word embedding, denoted $w^{w}$; the part of speech (pos), denoted $w^{f_1}$; and the named entity (ner), denoted $w^{f_2}$. The syntactic features include the dependency type (dep), denoted $w^{f_3}$; the parent node position (parent), denoted $w^{f_4}$; and the relative position of the word, denoted $w^{f_5}$.
The lexical and syntactic features are obtained by processing the input sentence with LTP, the natural language processing package of Harbin Institute of Technology; the final feature set is:
Feature Set = $\{w^{w}, w^{f_1}, w^{f_2}, w^{f_3}, w^{f_4}, w^{f_5}, w^{f_6}\}$;
S32, feature embedding: the features from step S31 are converted into vector representations and concatenated; then, for a sentence $S = \{w_1, w_2, \ldots, w_n\}$, the feature representation of word $w_i$ can be expressed as follows:
Figure FDA0002274614220000021
where $w^{w} = W^{wrd} v_i$, $v_i$ is the one-hot representation of the current word, which selects the corresponding column of $W^{wrd}$, and
Figure FDA0002274614220000022
is a vector representation of class j features;
S33, Bi-GRU model: the vectors from step S32 are passed through a Bi-GRU network, in which the number of hidden-layer nodes is 400 and the dropout ratio is 0.5; the model is trained with 10 items of data per batch for 10 iterations to obtain the optimal result;
S34, attention mechanism: an attention mechanism is introduced, generating weight vectors at the word level and the sentence level; a hot-one attention mechanism is adopted to optimize the word-level weights automatically;
S35, finally, a softmax layer normalizes the result to obtain the probability distribution over the relation labels.
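Two of the steps in claim 2 are easy to illustrate in isolation: concatenating per-feature embeddings into one token vector (S32) and normalizing output scores with softmax (S35). The sketch below uses plain Python lists; the lookup tables, their tiny dimensions, the feature names `word` and `pos`, and the raw scores are all made up for illustration. The Bi-GRU and attention layers that sit between these two steps are not shown.

```python
import math

# Order in which feature embeddings are concatenated (assumed for this sketch).
FEATURE_ORDER = ("word", "pos")


def embed_token(features, tables, order=FEATURE_ORDER):
    """Concatenate the embedding of each feature value into one vector.

    `features` maps a feature name to its value for one token; `tables`
    maps a feature name to a {value: vector} lookup table. Looking a value
    up in the table is equivalent to multiplying the embedding matrix by
    that value's one-hot vector.
    """
    vec = []
    for name in order:
        vec.extend(tables[name][features[name]])
    return vec


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy lookup tables with made-up 2-d and 1-d vectors (illustration only).
tables = {
    "word": {"bank": [0.1, 0.2], "loan": [0.3, 0.4]},
    "pos": {"n": [1.0], "v": [0.0]},
}
token_vec = embed_token({"word": "loan", "pos": "n"}, tables)
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three labels
```

The label with the largest entry in `probs` would be the predicted relation for the entity pair, subject to the probability threshold described in the specification.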
CN201911117980.2A 2019-11-15 2019-11-15 Natural language processing-based public opinion information extraction and knowledge base generation method Pending CN110990525A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911117980.2A CN110990525A (en) 2019-11-15 2019-11-15 Natural language processing-based public opinion information extraction and knowledge base generation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911117980.2A CN110990525A (en) 2019-11-15 2019-11-15 Natural language processing-based public opinion information extraction and knowledge base generation method

Publications (1)

Publication Number Publication Date
CN110990525A true CN110990525A (en) 2020-04-10

Family

ID=70084612

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911117980.2A Pending CN110990525A (en) 2019-11-15 2019-11-15 Natural language processing-based public opinion information extraction and knowledge base generation method

Country Status (1)

Country Link
CN (1) CN110990525A (en)

Cited By (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111523323A (en) * 2020-04-26 2020-08-11 梁华智能科技(上海)有限公司 Disambiguation processing method and system for Chinese word segmentation
CN111539806A (en) * 2020-04-14 2020-08-14 鼎富智能科技有限公司 Method and related device for structuring announcement content
CN111597804A (en) * 2020-05-15 2020-08-28 腾讯科技(深圳)有限公司 Entity recognition model training method and related device
CN111695346A (en) * 2020-06-16 2020-09-22 广州商品清算中心股份有限公司 Method for improving public opinion entity recognition rate in financial risk prevention and control field
CN111723191A (en) * 2020-05-19 2020-09-29 天闻数媒科技(北京)有限公司 Text filtering and extracting method and system based on full-information natural language
CN112036173A (en) * 2020-11-09 2020-12-04 北京读我科技有限公司 Method and system for processing telemarketing text
CN112215006A (en) * 2020-10-22 2021-01-12 上海交通大学 Organization named entity normalization method and system
CN112364002A (en) * 2020-11-04 2021-02-12 上海新朋程数据科技发展有限公司 Modeling method of data analysis model
CN112380866A (en) * 2020-11-25 2021-02-19 厦门市美亚柏科信息股份有限公司 Text topic label generation method, terminal device and storage medium
CN112541059A (en) * 2020-11-05 2021-03-23 大连中河科技有限公司 Multi-round intelligent question-answer interaction method applied to tax question-answer system
CN112800764A (en) * 2020-12-31 2021-05-14 江苏网进科技股份有限公司 Entity extraction method in legal field based on Word2Vec-BilSTM-CRF model
CN113326700A (en) * 2021-02-26 2021-08-31 西安理工大学 ALBert-based complex heavy equipment entity extraction method
CN113609298A (en) * 2021-08-23 2021-11-05 南京擎盾信息科技有限公司 Data processing method and device for court public opinion corpus extraction
CN114386422A (en) * 2022-01-14 2022-04-22 淮安市创新创业科技服务中心 Intelligent aid decision-making method and device based on enterprise pollution public opinion extraction
CN114611515A (en) * 2022-01-28 2022-06-10 江苏省联合征信有限公司 Method and system for identifying actual control person of enterprise based on enterprise public opinion information
CN114647734A (en) * 2020-12-18 2022-06-21 同方威视科技江苏有限公司 Method and device for generating event map of public opinion text, electronic equipment and medium
WO2022134575A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Service keyword extraction method, apparatus, and device, and storage medium
CN114897504A (en) * 2022-05-20 2022-08-12 北京北大软件工程股份有限公司 Method, device, storage medium and electronic equipment for processing repeated letters
WO2022227196A1 (en) * 2021-04-27 2022-11-03 平安科技(深圳)有限公司 Data analysis method and apparatus, computer device, and storage medium
CN116681065A (en) * 2023-06-09 2023-09-01 西藏大学 Combined extraction system and method for entity relationship in Tibetan medicine field
CN116776886A (en) * 2023-08-15 2023-09-19 浙江同信企业征信服务有限公司 Information extraction method, device, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015080561A1 (en) * 2013-11-27 2015-06-04 Mimos Berhad A method and system for automated relation discovery from texts
WO2016054301A1 (en) * 2014-10-02 2016-04-07 Microsoft Technology Licensing, Llc Distant supervision relationship extractor
CN106815293A (en) * 2016-12-08 2017-06-09 中国电子科技集团公司第三十二研究所 System and method for constructing knowledge graph for information analysis
CN107247739A (en) * 2017-05-10 2017-10-13 浙江大学 A kind of financial publication text knowledge extracting method based on factor graph
CN107748757A (en) * 2017-09-21 2018-03-02 北京航空航天大学 A kind of answering method of knowledge based collection of illustrative plates
CN108733792A (en) * 2018-05-14 2018-11-02 北京大学深圳研究生院 A kind of entity relation extraction method
CN108875051A (en) * 2018-06-28 2018-11-23 中译语通科技股份有限公司 Knowledge mapping method for auto constructing and system towards magnanimity non-structured text
CN109558492A (en) * 2018-10-16 2019-04-02 中山大学 A kind of listed company's knowledge mapping construction method and device suitable for event attribution
US20190158524A1 (en) * 2017-01-30 2019-05-23 Splunk Inc. Anomaly detection based on information technology environment topology
CN109871535A (en) * 2019-01-16 2019-06-11 四川大学 A kind of French name entity recognition method based on deep neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
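KALYANI R. POLE et al.: "Improvised fuzzy clustering using name entity recognition and natural language processing" *
张兰霞; 胡文心: "Research on extracting person relations from Chinese text based on a bidirectional GRU neural network and a two-layer attention mechanism" *
鄂海红; 张文静; 肖思琪; 程瑞; 胡莺夕; 周筱松; 牛佩晴: "A survey of entity relation extraction based on deep learning" *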

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200410