CN110990525A

CN110990525A - Natural language processing-based public opinion information extraction and knowledge base generation method

Info

Publication number: CN110990525A
Application number: CN201911117980.2A
Authority: CN
Inventors: 路世伦; 闫晨巍; 仵伟强; 周金黄; 钟丽莉; 万谊强
Original assignee: Huarong Rongtong Beijing Technology Co ltd
Current assignee: Huarong Rongtong Beijing Technology Co ltd
Priority date: 2019-11-15
Filing date: 2019-11-15
Publication date: 2020-04-10

Abstract

The invention discloses a natural language processing-based public opinion information extraction and knowledge base generation method comprising the following steps. Step one, text preprocessing. Step two, named entity recognition: company and institution names and person names are identified using a neural network-based method. Step three, relation extraction: six types of relationships in the financial field are extracted using a feature layer + GRU + Attention model. Step four, entity linking: the Jaro-Winkler distance between a linked entity and a target entity is calculated to judge whether they are the same entity, thereby achieving entity disambiguation. By combining an end-to-end model with a feature-extraction input model, the method builds a one-stop pipeline from unstructured financial text to structured data storage, makes full use of the context of financial news, extracts knowledge with fewer parameters and faster training and prediction, and performs well on financial public opinion information.

Description

Natural language processing-based public opinion information extraction and knowledge base generation method

Technical Field

The invention discloses a public opinion information extraction and knowledge base generation method based on natural language processing. It relates to technologies such as entity recognition, relation extraction, and entity linking in the field of financial information, and in particular to a complete pipeline and method, from information extraction to knowledge generation, for enterprise public opinion news.

Background

With the diversification of investors and the growth of enterprise conglomerates, relationships among enterprises have become increasingly complex, span regions and industries, and are often well concealed. For financial institutions such as commercial banks, if an enterprise intentionally conceals such relationships when applying for a loan, it is difficult for the bank to grasp the actual situation, which leads to excessive and repeated credit extension and increases the bank's credit risk. Therefore, fully identifying the association relationships between enterprises and gaining a more comprehensive view of customer information is an important direction for reducing credit risk.

Currently, enterprise association data mainly comes from structured data provided by enterprises and data service providers, such as the National Enterprise Credit Information Publicity System. Because such information is updated slowly, credit officers conducting due diligence also draw on more timely sources to enrich the dimensions of customer information, such as judicial data from China Judgments Online and public opinion data from enterprise news reports, as an important supplementary source of enterprise association relationships. However, public opinion information exists as unstructured text, and few technologies and tools are available to help credit officers mine useful information from it. They often rely on manual browsing and querying, which limits the depth and efficiency of investigation and makes it difficult to meet financial institutions' ever-growing demand for dynamic queries on group-customer association information. Introducing richer data such as public opinion information, together with automated mining and storage of knowledge, is therefore an urgent need in the field of credit risk.

Disclosure of Invention

In view of the above problems, the object of the present invention is to provide a method for public opinion information extraction and knowledge base generation based on natural language processing. The method recasts the mining of enterprise association information as an information extraction task in natural language processing: it identifies and models the characteristics of that information in unstructured text and automatically extracts enterprise entities and the association relationships between enterprises, serving as an important supplement to existing structured data and providing stronger support for credit risk management.

The invention relates to a public opinion information extraction and knowledge base generation method based on natural language processing, which takes unstructured texts such as enterprise news reports and the like on the Internet as data sources, constructs an entity identification model, a relation extraction model and an entity link model, and completes information extraction and mapping storage of the texts. Firstly, preprocessing unstructured text data, removing interference information such as symbols and stop words, and obtaining cleaned data; then, analyzing the data by word segmentation, part of speech tagging and the like, further constructing a model to extract named entities in the text, and constructing a relation extraction model to complete relation extraction between the entities; and finally, mapping the extracted entities and the relations to the knowledge base by an entity linking technology to complete the generation and the updating of the knowledge base.

To achieve the above object, the present invention provides a method for public opinion information extraction and knowledge base generation based on natural language processing, which comprises the following steps:

Step one, text preprocessing

Text preprocessing mainly comprises character cleaning, word segmentation, and stop-word removal.

Character cleaning: regular-expression matching is used to unify full-width and half-width characters in the input text and to match and filter out punctuation marks.

Word segmentation: the LTP toolkit is used for word segmentation, and a domain-specific dictionary is introduced to improve segmentation quality.

Stop-word removal: stop words are removed using a stop-word dictionary specific to the non-performing asset field. This dictionary divides stop words into 3 types: meaningless words such as "and" and "are"; words that occur widely and with high frequency in sentences; and meaningless vocabulary from the business system.
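The three-category stop-word dictionary can be illustrated with a minimal Python sketch. The category names and the example words are hypothetical placeholders; the actual dictionary for the non-performing asset field is not given in the source.

```python
# Hypothetical three-category stop-word dictionary (illustrative entries only).
STOP_WORDS = {
    "meaningless": {"的", "是", "和"},          # function words
    "high_frequency": {"表示", "记者"},         # words that appear widely in sentences
    "business_system": {"附件", "备注"},        # boilerplate from the business system
}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in any of the stop-word categories."""
    blocked = set().union(*stop_words.values())
    return [tok for tok in tokens if tok not in blocked]

tokens = ["华融", "的", "记者", "附件", "违约"]
print(remove_stop_words(tokens))
```

Keeping the categories separate, rather than merging them into one flat list, makes it easier to maintain the business-specific entries independently of the general ones.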

Step two, named entity recognition

Named entity recognition for the financial information field mainly identifies company and institution names and person names. The invention adopts a neural network-based method to complete named entity recognition.

Each word is first mapped to a dense embedding in a low-dimensional space; the word embeddings are then used as the model input, a BiLSTM automatically extracts features, and a CRF predicts the labels of the whole sentence. Training and prediction thus become a single end-to-end process rather than a traditional pipeline, removing the dependence on feature engineering. The annotated data comes from the 1998 People's Daily corpus, which contains three types of entities: person names (nr), place names (ns), and organization names (nt). B, M, and E mark the first, middle, and last word of an entity, respectively. For example, B-nr, M-nr, and E-nr denote the first word, a middle word (neither first nor last), and the last word of a person name, and O denotes a word that is not part of any named entity.

Step three, relation extraction

Relation extraction is one of the important research tasks in natural language processing: given entity 1 (e1) and entity 2 (e2), the process of obtaining the relation (r) between the two entities in a piece of text can be represented as r → (e1, e2). The invention extracts 8 types of relations in total: employment, cooperation, litigation, supplier, shareholding (controlling stake), investment, debt, and Unknown. For example, given the text "In 2002, the cash-strapped Hyundai Hynix sold its TFT-LCD division as a whole to BOE Group for USD 380 million", where entity 1 is "Hyundai Hynix" and entity 2 is "BOE Group", analyzing the meaning and the logical relationships between the words in the sentence yields an "acquisition" relation between BOE Group and Hyundai Hynix.

The invention provides a Bi-GRU model to solve the problem of relation extraction in the financial field, which mainly comprises the following steps:

S31, feature extraction: lexical features and syntactic features are extracted from the input sentence. The quality of feature extraction determines the performance of the model. The constructed features are divided into lexical features and syntactic features. Lexical features include the word embedding (denoted w^w), part of speech (pos, denoted w^f1), and named entity (ner, denoted w^f2). Syntactic features include the dependency type (dep, denoted w^f3), parent node position (denoted w^f4), and relative position of the word (position feature, denoted w^f5).

To obtain the lexical and syntactic features, the input sentences are processed with LTP, the natural language processing toolkit from Harbin Institute of Technology. The final feature set is:

Feature Set＝{w^w，w^f1，w^f2，w^f3，w^f4，w^f5，w^f6}

S32, feature embedding. Feature embedding converts the features from step S31 into vector representations and concatenates them. For a sentence S = {w₁, w₂, ..., w_n}, the feature representation of word w_i is the concatenation of its word vector w^w and its feature vectors w^fj, where w^w = W^wrd vⁱ, vⁱ is the one-hot representation that selects the column of W^wrd corresponding to the current word, and w^fj is the vector representation of the j-th class of features.

S33, Bi-GRU model. The vectors from step S32 are passed through a Bi-GRU network with 400 hidden-layer nodes and a dropout ratio of 0.5. Training uses batches of 10 samples, and the best result is obtained after 10 training iterations.

S34, attention mechanism. An attention mechanism is introduced to generate weight vectors at the word level and the sentence level. The invention adopts a hot-one attention mechanism to optimize the word-level weights automatically.

S35, finally, a softmax layer normalizes the result to obtain the probability distribution over the relation labels.

Step four, entity linking

After entity and relation extraction is completed, the key problem is how to connect the extracted entities with the real information in the knowledge base. The invention adopts the Jaro-Winkler distance method: the distance between the linked entity and the target entity is calculated to judge whether they are the same entity, thereby achieving entity disambiguation.
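As an illustration of this step, the following is a minimal Python sketch of Jaro-Winkler matching. The function names, the 0.9 threshold, and the flat list of candidate names are assumptions for illustration; the source specifies neither a threshold nor how candidates are retrieved from the knowledge base. The sketch computes the Jaro-Winkler similarity (1.0 for identical strings); the corresponding distance is 1 minus that value.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1, max_prefix=4):
    """Jaro similarity boosted for a shared prefix of up to max_prefix characters."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * prefix_scale * (1 - sim)

def link_entity(mention, kb_names, threshold=0.9):
    """Return the most similar knowledge-base name, or None if none clears the assumed threshold."""
    best_name, best_score = None, 0.0
    for name in kb_names:
        score = jaro_winkler(mention, name)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

Because Jaro-Winkler rewards a shared prefix, it suits company names that differ only in a trailing suffix; the threshold would need tuning on real knowledge-base data.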

Compared with the prior art, the public opinion information extraction and knowledge base generation method based on natural language processing has the following advantages: (1) unlike traditional standalone named entity recognition and relation extraction modules, it integrates the whole process of public opinion information extraction, taking raw text as input and producing structured knowledge as output, and thus realizes an end-to-end model; (2) unlike the traditional one-hot model for obtaining word vectors, it trains word vectors with a deep learning method, which avoids the curse of dimensionality in word vector representation, fully mines the contextual information of words, and captures the relations between words; (3) the Bi-GRU-based relation extraction model is lightweight, with fewer parameters and faster training, and performs well on financial public opinion information.

Drawings

FIG. 1 is a block flow diagram of the method of the present invention.

Fig. 2 shows a GRU update mechanism.

Detailed Description

The technical solution of the present invention is further described below with reference to specific examples.

The invention discloses a public opinion information extraction and knowledge base generation method based on natural language processing, which comprises the following steps as shown in figure 1:

Step one, text preprocessing

Text preprocessing is a basic and necessary step in unstructured data processing. By removing characters that carry no entity semantics and filtering stop words after word segmentation, it reduces the negative effects of noisy data as far as possible and improves the generalization ability of the model. The preprocessing methods selected are therefore character cleaning, word segmentation, and stop-word removal.

Character cleaning. Characters such as commas, periods, and quotation marks indicate pauses, connections, and the like in a sentence; they carry no actual meaning in semantic analysis and can be regarded as useless characters, which are commonly handled by matching and filtering. The invention uses regular-expression matching to unify full-width and half-width characters in the input text and to match and filter out punctuation marks.
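A minimal Python sketch of this cleaning step might look as follows. The function names and the exact punctuation set are assumptions; the source does not list which marks are filtered. The sketch also removes characters such as "." and "%", so a production version would likely need to protect numbers and amounts first.

```python
import re
import string

def to_half_width(text):
    """Map full-width ASCII variants (U+FF01 to U+FF5E) and the
    ideographic space (U+3000) to their half-width counterparts."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:
            code = 0x20
        elif 0xFF01 <= code <= 0xFF5E:
            code -= 0xFEE0
        out.append(chr(code))
    return "".join(out)

# ASCII punctuation plus common CJK marks that have no half-width form
# (illustrative set; \u2026 is the ellipsis, \u2014 the long dash).
CJK_PUNCT = "。、“”‘’《》【】\u2026\u2014"
PUNCT = re.compile("[" + re.escape(string.punctuation + CJK_PUNCT) + "]")

def clean_text(text):
    """Unify character width first, then filter punctuation by regex."""
    return PUNCT.sub("", to_half_width(text))

print(clean_text("ＢＯＥ集团，收购了“现代”！"))
```

Normalizing width before filtering means each punctuation mark only has to be listed once, in its half-width form.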

Word segmentation. In Chinese text, the word is the smallest semantic unit. In text processing, a sentence is typically segmented into a series of words that represent the original sentence. Commonly used word segmentation tools include jieba, the LTP natural language processing toolkit from Harbin Institute of Technology, and Stanford's toolkit. After weighing the accuracy, performance, and segmentation granularity of the different tools, the invention uses the LTP toolkit for word segmentation and introduces a domain-specific dictionary to improve segmentation quality.

Stop-word removal. Before the semantics of a sentence are represented by its words, an important step is usually to remove certain characters and words, collectively called stop words. Stop words fall into two types. One type is function words such as "the", the modal particle "呢", and "in", which have no definite meaning of their own in a sentence and serve as connectives or modal particles that support the other words. The other type is words that occur so widely and so frequently in sentences that they are useless for representing sentence semantics. Removing these two types of stop words increases keyword density and allows the semantic information of the sentence to be captured more effectively.

Step two, named entity recognition

Named entity recognition for the financial information field mainly identifies company and institution names and person names. The invention adopts a neural network-based method to complete named entity recognition.

This data-driven method first maps each word to a dense embedding in a low-dimensional space, then uses the word embeddings as the model input, extracts features automatically with a neural network, and predicts the label of each word with softmax. Training thus becomes a single end-to-end process rather than a traditional pipeline and does not depend on feature engineering.

The invention adopts a word-based BiLSTM + CRF model. The annotated data comes from the 1998 People's Daily corpus, which contains three types of entities: person names (nr), place names (ns), and organization names (nt). B, M, and E mark the first, middle, and last word of an entity, respectively. For example, B-nr, M-nr, and E-nr denote the first word, a middle word (neither first nor last), and the last word of a person name, and O denotes a word that is not part of any named entity.

The model has three layers: an embedding layer, a BiLSTM layer, and a CRF layer.

(1) Embedding layer: also called the lookup layer. It maps each word in the input text to a vector representation in a low-dimensional space, which is the input to the next layer.

(2) BiLSTM layer: exploits the LSTM's strength in processing sequential text to classify each word and output the tag with the highest probability. A bidirectional LSTM makes better use of sentence-level semantic features and captures regularities in how entities are composed, for example that some organization entities end with "limited company".

(3) CRF layer: models the relations between tags, improving the accuracy of named entity recognition.
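To show why modeling tag-to-tag relations helps, here is a minimal Python sketch of Viterbi decoding over per-word tag scores (standing in for BiLSTM outputs) and tag transition scores (standing in for what a CRF layer learns). All scores are hand-written for illustration and are not taken from the source; the function name `viterbi` and the penalty of -10 are likewise assumptions.

```python
def viterbi(emissions, transitions, tags):
    """emissions: one {tag: score} dict per word;
    transitions: {(prev_tag, cur_tag): score}, missing pairs score 0."""
    # best[t][tag] = (score of the best path ending in tag at position t, previous tag)
    best = [{tag: (emissions[0][tag], None) for tag in tags}]
    for scores in emissions[1:]:
        column = {}
        for cur in tags:
            prev, score = max(
                ((p, best[-1][p][0] + transitions.get((p, cur), 0.0)) for p in tags),
                key=lambda item: item[1],
            )
            column[cur] = (score + scores[cur], prev)
        best.append(column)
    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]

TAGS = ["B-nt", "M-nt", "E-nt", "O"]
# Penalize tag sequences that cannot occur, such as O followed by E-nt.
INVALID = [("O", "M-nt"), ("O", "E-nt"), ("B-nt", "O"), ("B-nt", "B-nt"),
           ("M-nt", "O"), ("M-nt", "B-nt"), ("E-nt", "M-nt"), ("E-nt", "E-nt")]
TRANSITIONS = {pair: -10.0 for pair in INVALID}

EMISSIONS = [
    {"B-nt": 2.0, "M-nt": 0.0, "E-nt": 0.0, "O": 0.0},
    {"B-nt": 0.0, "M-nt": 0.8, "E-nt": 0.0, "O": 1.0},  # O scores highest on its own
    {"B-nt": 0.0, "M-nt": 0.0, "E-nt": 2.0, "O": 0.0},
    {"B-nt": 0.0, "M-nt": 0.0, "E-nt": 0.0, "O": 2.0},
]

print(viterbi(EMISSIONS, TRANSITIONS, TAGS))
```

Picking the highest-scoring tag per word independently would label the second word O and produce the impossible sequence B-nt, O, E-nt; the transition scores steer decoding to a well-formed entity span instead.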

Step three, relation extraction

Relation extraction is one of the important research tasks in natural language processing: given entity 1 (e1) and entity 2 (e2), the process of obtaining the relation (r) between the two entities in a piece of text can be represented as r → (e1, e2). For example, given the text "In 2002, the cash-strapped Hyundai Hynix sold its TFT-LCD division as a whole to BOE Group for USD 380 million", where entity 1 is "Hyundai Hynix" and entity 2 is "BOE Group", analyzing the meaning and the logical relationships between the words in the sentence yields an "acquisition" relation between BOE Group and Hyundai Hynix.

S31, feature extraction: lexical features and syntactic features are extracted from the input sentence. The quality of feature extraction determines the performance of the model. The constructed features are divided into lexical features and syntactic features. Lexical features include the word embedding (denoted w^w), part of speech (pos, denoted w^f1), and named entity (ner, denoted w^f2). Syntactic features include the dependency type (dep, denoted w^f3), parent node position (denoted w^f4), and relative position of the word (position feature, denoted w^f5).

Feature Set＝{w^w，w^f1，w^f2，w^f3，w^f4，w^f5，w^f6}

S32, feature embedding. Feature embedding converts the features from step S31 into vector representations and concatenates them, so that a sentence is converted entirely into vectors built from its features. A word embedding is a low-dimensional vector representation of a word. Given a sentence S = {w₁, w₂, ..., w_n}, n is the number of words in the sentence.

W^wrd ∈ ℝ^(d^w × V) is the embedding matrix, where d^w is the user-defined dimension of the word embedding and V is the total number of words. Each word in the sentence is mapped to its word vector by the following formula:

w^w = W^wrd vⁱ

where vⁱ is the one-hot representation that selects the column of W^wrd corresponding to the current word.

In addition, w^fj is the vector representation of features such as part of speech, named entity, and dependency type, where j denotes the j-th class of features. The part-of-speech, named-entity, and dependency-type features are one-hot vectors generated from the LTP analysis results, while the parent node position and relative position (PF) are randomly initialized.

Then, for a sentence S = {w₁, w₂, ..., w_n}, the feature representation of word w_i is the concatenation of the word vector of w_i, w^w, and the vector representations of its features w^fj, where j denotes the j-th class of features.
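The feature embedding described in S32 can be sketched in plain Python as follows. The tag inventories, the dimensions, and the helper names are illustrative assumptions; in practice the tags come from the LTP analysis and the embedding matrix is trained. The sketch treats the relative position as a single offset, which is also an assumption.

```python
import random

random.seed(0)  # reproducible random initialization

# Illustrative inventories (assumed, not from the source).
VOCAB = ["京东方", "收购", "现代"]
POS_TAGS = ["n", "v", "ni"]
NER_TAGS = ["O", "Ni", "Nh"]
DEP_TYPES = ["SBV", "HED", "VOB"]
D_W = 4        # assumed word-embedding dimension d^w
D_POS = 2      # assumed dimension of each position embedding
MAX_LEN = 10   # assumed maximum sentence length

def random_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def one_hot(index, size):
    vec = [0.0] * size
    vec[index] = 1.0
    return vec

W_WRD = random_matrix(D_W, len(VOCAB))              # d^w x V embedding matrix
PARENT_TABLE = random_matrix(MAX_LEN + 1, D_POS)    # parent node position, randomly initialized
REL_TABLE = random_matrix(2 * MAX_LEN + 1, D_POS)   # relative position in [-MAX_LEN, MAX_LEN]

def embed_word(word):
    """w^w = W^wrd v^i: multiplying by a one-hot vector selects one column."""
    v = one_hot(VOCAB.index(word), len(VOCAB))
    return [sum(w * x for w, x in zip(row, v)) for row in W_WRD]

def feature_vector(word, pos, ner, dep, parent, rel_pos):
    """Concatenate the word vector with the five feature vectors."""
    return (embed_word(word)
            + one_hot(POS_TAGS.index(pos), len(POS_TAGS))
            + one_hot(NER_TAGS.index(ner), len(NER_TAGS))
            + one_hot(DEP_TYPES.index(dep), len(DEP_TYPES))
            + PARENT_TABLE[parent]
            + REL_TABLE[rel_pos + MAX_LEN])

x = feature_vector("京东方", "ni", "Ni", "SBV", parent=2, rel_pos=-1)
print(len(x))  # 4 + 3 + 3 + 3 + 2 + 2 = 17
```

In a real model the one-hot product would be replaced by a direct column lookup, which gives the same vector without the multiplication.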

S33, Bi-GRU model. The vectors from step S32 are passed through a Bi-GRU network to generate high-dimensional vectors. Both LSTM and GRU are variants of the RNN. The long short-term memory (LSTM) model propagates two states, a long-term state that can be transmitted stably and a short-term state, and adds three gates, namely a forget gate, an input gate, and an output gate, which let information pass selectively in order to control and protect the long-term state. With this design, the LSTM avoids the long-term dependency problem.

The GRU streamlines the internal design of the LSTM: it merges the long-term and short-term states and merges the forget gate and input gate into an update gate. The update gate determines how much previous information is retained, and the reset gate determines how much previous information is combined with the current input, as shown in FIG. 2.

In FIG. 2, x_t is the current input, h_t-1 is the output at the previous moment, h_t is the output at the current moment, and σ is the activation function. r_t is the reset gate, which controls the extent to which state information from the previous moment is ignored; the smaller the reset gate value, the more is ignored. z_t is the update gate, which controls the extent to which state information from the previous moment is carried into the current state; the larger the update gate value, the more previous state information is carried in. The two gates protect and control the flow of information from the previous hidden state h_t-1 to the new hidden state h_t.

Equations (1)-(4) give the calculation of the reset gate r_t, the update gate z_t, the candidate hidden state, and the hidden state h_t at the current moment.

r_t＝σ(W_r·[h_t-1，x_t]) (1)

z_t＝σ(W_z·[h_t-1，x_t]) (2)
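For the candidate hidden state and the current hidden state, a standard GRU formulation in the notation of equations (1) and (2) can be sketched as follows. The symbol \tilde{h}_t, the weight matrix W_{\tilde{h}}, and the element-wise product \odot are assumptions, and the roles of z_t and 1 - z_t in (4) are chosen to agree with the statement that a larger update gate carries in more of the previous state (some references swap them).

```latex
\begin{aligned}
\tilde{h}_t &= \tanh\left(W_{\tilde{h}} \cdot [\, r_t \odot h_{t-1},\ x_t \,]\right) && (3) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t && (4)
\end{aligned}
```

With r_t close to 0 the candidate state in (3) ignores h_{t-1}, and with z_t close to 1 the new state in (4) is almost a copy of h_{t-1}, which matches the gate descriptions given for FIG. 2.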

The basic unit of the Bi-GRU model consists of a forward-propagating GRU unit and a backward-propagating GRU unit; the structure of the backward-propagating GRU unit is shown in FIG. 2. When processing sequence information, the model can therefore take forward and backward information into account at the same time, and the outputs of the two units are finally concatenated. For the i-th word, the output is thus formed from the hidden states of the forward and backward units at position i.

The GRU model is simpler than the LSTM model, has fewer parameters, and trains faster, yet performs comparably and works well even on smaller data sets.

S34, attention mechanism. An attention mechanism is introduced to generate weight vectors for word-level and sentence-level.

When the input sequence of the model is long, it is difficult to retain all the important information, and model performance degrades as a result. The attention mechanism exists to solve this problem. The GRU's intermediate outputs h_i for the input sequence are retained; these intermediate results are then used as inputs to the attention layer, which learns from them selectively and associates them with the output sequence when the output is produced. Although using the attention mechanism increases the amount of computation, it improves the model's performance.
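A word-level attention layer of this kind can be sketched in plain Python. The scoring form used here (tanh of each hidden state, dotted with a query vector `w`, then softmax) is one common formulation and is an assumption; the source does not give the exact attention formula, and `w` would be learned in practice rather than fixed.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def word_attention(hidden_states, w):
    """hidden_states: one vector h_i per word; w: query vector (assumed, normally learned).
    Returns the attention weights and the weighted sentence vector."""
    scores = [sum(wi * math.tanh(hi) for wi, hi in zip(w, h)) for h in hidden_states]
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    sentence = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, sentence

H = [[0.1, 0.0], [2.0, 1.0], [0.0, -0.5]]  # toy Bi-GRU outputs for three words
alphas, sentence = word_attention(H, w=[1.0, 1.0])
print(alphas)
print(sentence)
```

The weights sum to 1, so the sentence vector is a convex combination of the per-word outputs in which the words with the highest scores dominate.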

Step four, entity linking

The technical solution of the present invention is described below with an embodiment, which is verified on news public opinion data provided by a financial technology company. The training data sets comprise the 1998 People's Daily corpus and news data from the Tushare platform.

One, text preprocessing

The Tushare data is processed by keeping only the Chinese text; with reference to the vocabulary of the non-performing asset field, the text is segmented with the LTP word segmenter, and stop words are then removed from the segmented text after the stop-word list has been loaded.

Two, named entity recognition

To train the named entity recognition model, the invention uses the 1998 People's Daily corpus, which consists of People's Daily text from January to June 1998 with manually annotated parts of speech and is commonly used for natural language processing tasks such as word segmentation, part-of-speech tagging, and named entity recognition.

The data set is annotated at the word level, so it is first preprocessed: the spans labeled ns, nr, and nt are further split into finer-grained segments, and B, M, or E information is added to the label of each segment. All other parts of speech are labeled O, producing a new training corpus file.
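The relabeling step can be sketched in Python as follows. The sketch assumes that the finer-grained segments are individual characters and that a single-character entity is tagged B; neither detail is stated in the source, and the function name `to_bme` is hypothetical.

```python
ENTITY_TAGS = {"nr", "ns", "nt"}  # person, place, and organization names

def to_bme(tagged_words):
    """Convert (word, pos) pairs into (character, label) pairs.
    Entity words are split into characters labeled B-/M-/E- plus the entity type;
    every other word is labeled O."""
    out = []
    for word, pos in tagged_words:
        if pos not in ENTITY_TAGS:
            out.extend((ch, "O") for ch in word)
            continue
        for i, ch in enumerate(word):
            if i == 0:
                prefix = "B"   # also used for single-character entities (assumption)
            elif i == len(word) - 1:
                prefix = "E"
            else:
                prefix = "M"
            out.append((ch, prefix + "-" + pos))
    return out

for ch, label in to_bme([("京东方", "nt"), ("收购", "v")]):
    print(ch, label)
```

Writing one character and label per line in this way produces the kind of training file that sequence-labeling models such as BiLSTM + CRF typically consume.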

The model was trained on the new corpus file and achieved F1 = 0.90 on the test set. From the recognition results, person names and organization names are selected for output, and place names are discarded because they have little bearing on the present work.

Three, relation extraction

The relation extraction model uses news data from the Tushare platform as its data source. Tushare is a free, open-source Python financial data interface package, and news data is one of the data types it provides. It mainly collects news from news websites, including real-time news from Sina Finance, Wall Street News (Wallstreetcn), and other news sources. This data was manually annotated, yielding 1000 labeled samples in total for training and testing, from which the relation extraction model was obtained.

During processing, the result of the preceding named entity recognition step determines whether the relation extraction flow is entered. When 2 or more entities are extracted, the relation extraction module is entered. The current sentence and one entity pair are taken as input, the output is the probability distribution over the relation labels, and the label with the highest probability is selected as the relation of that entity pair. Every pair of extracted entities is sent to the relation extraction model for relation judgment. To ensure the correctness of the extracted relations, the invention sets a relatively high threshold: only when the relation probability exceeds the threshold is the relation accepted and stored as knowledge. When fewer than 2 entities are extracted, relation extraction and entity linking are skipped and the flow ends.
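This decision flow can be sketched in Python as follows. The `predict` callable stands in for the trained relation extraction model, and the 0.8 threshold is a hypothetical value; the source only says that a relatively high threshold is set. The sketch enumerates unordered entity pairs, which is also an assumption.

```python
from itertools import combinations

THRESHOLD = 0.8  # hypothetical value; the source does not state the threshold

def extract_relations(sentence, entities, predict, threshold=THRESHOLD):
    """predict(sentence, e1, e2) -> {label: probability}; a stand-in for the trained model.
    Returns (e1, label, e2) triples whose top probability exceeds the threshold."""
    if len(entities) < 2:
        return []  # skip relation extraction and entity linking
    triples = []
    for e1, e2 in combinations(entities, 2):
        probs = predict(sentence, e1, e2)
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p > threshold:
            triples.append((e1, label, e2))
    return triples

def fake_predict(sentence, e1, e2):
    """Toy predictor used only to exercise the flow."""
    if {e1, e2} == {"A", "B"}:
        return {"investment": 0.9, "Unknown": 0.1}
    return {"investment": 0.4, "Unknown": 0.6}

print(extract_relations("...", ["A", "B", "C"], fake_predict))
```

Only the pair whose best label clears the threshold is kept, so low-confidence predictions never reach the knowledge base.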

Four, entity linking

The entity linking part amounts to one more round of verification of the information. The database contains the names of the companies involved in the public opinion information, and performing entity linking before the data is stored effectively reduces the erroneous information produced in the preceding steps and disambiguates the entity nodes.

Finally, the method was verified on 1000 items of news public opinion data provided by Huarong Rongtong (Beijing) Technology Co., Ltd. Entity recognition, relation extraction, entity linking, and so on were carried out according to the flow provided by the method, and the valid relations among enterprises were extracted and successfully stored in a MySQL database.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the technical scope of the present invention, so that any minor modifications, equivalent changes and modifications made to the above embodiment according to the technical spirit of the present invention are within the technical scope of the present invention.

Claims

1. A public opinion information extraction and knowledge base generation method based on natural language processing is characterized in that: the method comprises the following steps:

step one, text preprocessing, comprising character cleaning, word segmentation, and stop-word removal;

character cleaning: regular-expression matching is used to unify full-width and half-width characters in the input text and to match and filter out punctuation marks;

word segmentation: the LTP toolkit is used for word segmentation, and a domain-specific dictionary is introduced to improve segmentation quality;

stop-word removal: stop words are removed using a stop-word dictionary specific to the non-performing asset field; this dictionary divides stop words into 3 types: meaningless words; words that occur widely and with high frequency in sentences; and meaningless vocabulary from the business system;

step two, named entity recognition

Named entity recognition for the financial information field comprises: identifying company and institution names and person names, with named entity recognition completed by a neural network-based method;

each word is mapped to a dense embedding in a low-dimensional space, the word embeddings are used as the model input, a BiLSTM automatically extracts features, and a CRF (conditional random field) predicts the labels of the whole sentence, so that training and prediction of the model become a single end-to-end process;

step three, relation extraction

The invention extracts 8 types of relations in total: an employment relation, a cooperation relation, a litigation relation, a supplier relation, a shareholding (controlling stake) relation, an investment relation, a debt relation, and an Unknown relation;

step four, entity linking

After entity and relation extraction is completed, the Jaro-Winkler distance method is adopted: the distance between the linked entity and the target entity is calculated to judge whether they are the same entity, thereby achieving entity disambiguation.

2. The method for extracting public opinion information and generating knowledge base based on natural language processing as claimed in claim 1, wherein: the third step specifically adopts a Bi-GRU model to solve the problem of relation extraction in the financial field, and mainly comprises the following steps:

S31, feature extraction: lexical features and syntactic features are extracted from the input sentences; the quality of feature extraction determines the performance of the model; the constructed features are divided into lexical features and syntactic features; lexical features include the word embedding, denoted w^w, the part of speech pos, denoted w^f1, and the named entity ner, denoted w^f2; syntactic features include the dependency type dep, denoted w^f3, the parent node position parent, denoted w^f4, and the relative position of the word position, denoted w^f5;

to obtain the lexical and syntactic features, the input sentence is processed with LTP, the natural language processing toolkit from Harbin Institute of Technology; the final feature set is:

Feature Set＝{w^w,w^f1,w^f2,w^f3,w^f4,w^f5,w^f6}；

S32, feature embedding: the features from step S31 are converted into vector representations and concatenated; for a sentence S = {w₁, w₂, …, w_n}, the feature representation of word w_i is the concatenation of its word vector w^w and its feature vectors w^fj, where w^fj is the vector representation of the j-th class of features;

S33, Bi-GRU model: the vectors from step S32 are passed through a Bi-GRU network with 400 hidden-layer nodes and a dropout ratio of 0.5; training uses batches of 10 samples, and the best result is obtained after 10 training iterations;

S34, attention mechanism: an attention mechanism is introduced to generate weight vectors at the word level and the sentence level; a hot-one attention mechanism is adopted to optimize the word-level weights automatically;