CN112906397A - Short text entity disambiguation method - Google Patents
Short text entity disambiguation method
- Publication number
- CN112906397A (application CN202110366911.6A)
- Authority
- CN
- China
- Prior art keywords
- entity
- sentence
- model
- training
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
Abstract
The invention provides a short text entity disambiguation method based on deep learning, mainly intended to solve the problem that the same entity refers to different things in different short texts. The method comprises the following steps: step 1, segment sentences with the jieba word segmentation technology, find the entities to be disambiguated, and use listed-company entities and their abbreviations as the dictionary; step 2, cut each sentence to a 32-character window centered on the entity to be disambiguated; step 3, convert each sentence containing an entity to be disambiguated into Bidirectional Encoder Representations from Transformers (BERT) word vectors; step 4, feed the word vectors in batches into a Long Short-Term Memory (LSTM) model, compute the loss with cross entropy, and continuously optimize the parameters to obtain the final model. The invention achieves good results not only in special domains such as company entities but also in the general domain.
Description
Technical Field
The invention belongs to the field of natural language processing, and particularly relates to a short text entity disambiguation method: an effective entity disambiguation technique based on the deep learning Long Short-Term Memory (LSTM) and Bidirectional Encoder Representations from Transformers (BERT) models, mainly used for solving the problem that company entities refer to different things in different short texts.
Background
In the internet era, with its information explosion and massive news feeds, people hope that advanced AI technology can associate texts with the information of massive entities (companies, person names, etc.), improve users' reading fluency, and realize accurate content recommendation. Intelligent processing of financial news not only provides intelligent services for the financial industry, but also opens more room for innovation in financial business.
Text is the main medium through which information about company entities spreads, and accurately locating the company entity that a news item concerns directly determines how downstream financial work is carried out. In financial news, company entities (numbering in the tens of millions) often appear as abbreviated names, which causes ambiguity. For example, Apple is a listed company in the United States and also a fruit. The goal of entity disambiguation is to eliminate entity ambiguity during information processing and to purify the text. Disambiguation is generally achieved by incorporating knowledge about the entities. In recent years, the rapid development of artificial intelligence has made many previously intractable problems solvable, and people hope to apply leading-edge AI methods to the entity ambiguity problem in financial information.
The traditional entity disambiguation task is mainly based on long texts backed by a complete knowledge base: the long text offers rich context to assist disambiguation. Building an entity disambiguation system on vertical-domain (company entity) data is therefore more challenging.
The BERT model offers parallelism and the ability to extract features and model text bidirectionally, achieving good results with less data and in less time; the long short-term memory network can retain important information and forget redundant information. The invention combines the two techniques with binary classification to disambiguate entities, yielding a novel deep-learning-based entity disambiguation technique.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a short text entity disambiguation method that can effectively help natural language processing developers and related readers judge, according to their own requirements, whether a word to be disambiguated is a company name, with high accuracy and efficiency.
In order to solve the above technical problem, an embodiment of the present invention provides a short text entity disambiguation method, including the following steps:
S1, performing word segmentation on the training samples and the test samples;
S2, cutting each sample with the entity to be disambiguated at the center;
S3, converting the samples containing the entity to be disambiguated into word vectors pre-trained by a BERT model;
S4, constructing a neural network model;
S5, using cross entropy as the loss function to compute the value between the one-dimensional vector output by the neural network and the label vector of the sample, and optimizing the neural network parameter model;
S6, using Microsoft Neural Network Intelligence (NNI) to search for parameters with higher training accuracy.
The specific steps of step S1 are:
S1.1, creating a dictionary of all entity names (including company full names and abbreviations), and finding all entities to be disambiguated in the training and test samples using the jieba word segmentation technology;
S1.2, generating a prefix tree for the text to be segmented, and constructing a directed acyclic graph (DAG) of all candidate word segmentations using regular-expression matching;
S1.3, finding the maximum-probability segmentation path by dynamic programming; to adapt the segmentation to the text, an HMM (hidden Markov model) is solved with the Viterbi algorithm to discover new words.
The specific steps of step S2 are:
S2.1, segmenting sentences, keeping only 32 characters when a sentence is encoded;
S2.2, cutting the sentence with the entity name at the center: find the position of the entity name in the text and take the 13 characters before it and the 14 characters after it into one sentence, with the entity name fixed at 5 characters.
The specific steps of step S3 are:
S3.1, for each word in each sentence of the clipped training and validation samples, finding the id corresponding to the BERT pre-training model;
S3.2, recording the length of each sentence and using 0 and 1 as a mask, where 0 means no word at that position and 1 means a word at that position, so that each sentence is converted into a quadruple [I, T, L, M], where I is the BERT model id of each word, T marks whether the sample is a company name (1 for a company name, 0 for a non-company name), L is the sentence length, and M is the sentence mask;
S3.3, batching the whole training set with 32 samples per batch and optimizing the parameters.
The specific steps of step S4 are as follows. The neural network model is divided into three sub-modules:
S4.1, a BERT conversion module, which converts the ids from step S3.1 into actual pre-trained BERT vectors;
S4.2, an LSTM module, used as the first training layer to learn information across the token sequence of each sentence;
S4.3, a linear output module, which produces the final output vector.
Further, in step S4.1, for the BERT model, the corresponding gradient information is retained in the calculation, where loss is the loss function, w is the weight, and y_i is the true value;
in step S4.2, the LSTM module uses the dropout algorithm: in each layer, neurons are temporarily dropped from the network with a certain probability, and different neurons are randomly selected in each training iteration, which is equivalent to training a different neural network each time;
in step S4.3, the linear output module uses an Attention mechanism, which gives higher weight to the tokens that strongly influence each word in the sentence. The attention score over tokens is computed as
α_t = softmax(f_T(h_t)), c_T = Σ_t α_t·h_t
where f_T is a linear layer, h_t is the hidden-layer state of the t-th token, and c_T is the context vector over the tokens.
The specific steps of step S5 are:
S5.1, computing the neural network loss with cross entropy and optimizing the network parameter model;
S5.2, since an entity name is grammatically just a referring pronoun with no intrinsic meaning, the problem is simplified into binary classification: an entity name is labeled 1 and a non-entity name 0. Cross entropy is well suited to binary classification and is sensitive to small differences, and an optimal solution is found by gradient descent. The cross-entropy loss function is defined as
loss = -(1/N)·Σ_i [y_i·log(ŷ_i) + (1-y_i)·log(1-ŷ_i)]
where y_i is the label of sample i (positive class 1, negative class 0) and ŷ_i is the predicted probability that sample i is positive;
S5.3, optimizing the parameters with Adam as the gradient descent algorithm. At each training step, Adam applies exponentially weighted averaging to the gradients and then updates the weight W and bias b with the averaged values; if some direction oscillates strongly, its update speed is reduced, damping the oscillation. The exponentially weighted average is
v_t = β·v_{t-1} + (1-β)·θ_t
where β is a hyperparameter, v_t is the moving average at step t, and θ_t is the value at step t.
The specific steps of step S6 are:
Microsoft Neural Network Intelligence (NNI) is a lightweight but powerful toolkit for hyperparameter tuning; it adjusts the batch size, learning rate, per-sentence length, number of epochs, and number of convolution kernels, using the F1 value as the selection criterion:
P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2·P·R/(P+R)
where TP is the number of positive samples judged positive, FP the number of negative samples judged positive, and FN the number of positive samples judged negative.
The technical scheme of the invention has the following beneficial effects:
the invention provides an entity disambiguation method based on the combination of a Bidirectional Encoder reproduction from transformations (BERT) model and a Long-Short Term Memory RNN (LSTM) model, which can effectively help natural language processing developers and related readers to judge whether a word to be disambiguated is a company name according to the requirements of the developers and the related readers, and has higher accuracy and efficiency.
Drawings
FIG. 1 is an overall framework of the present invention;
FIG. 2 is a flow chart of the jieba word segmentation work in the present invention;
FIG. 3 is a graph of a sentence segmentation algorithm in the present invention;
FIG. 4 is a general framework diagram of a neural network in the present invention;
FIG. 5 is a graph of the value of F1 obtained using the three word vectors in the present invention;
FIG. 6 is a graph of F1 values obtained using three neural networks in the present invention;
fig. 7 shows the values of F1 obtained in the present invention using three text lengths.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a deep-learning-based short text entity disambiguation technique, mainly intended to help natural language processing developers and related readers judge, according to their own requirements, whether a word to be disambiguated is a company name. The technique first finds the entities to be disambiguated through jieba word segmentation and cuts the long text into short texts, reducing the scale of the neural network; second, it uses the BERT model as the word-vector pre-training model, converts each word in each sentence into its BERT model id, and records the length, the mask, and whether the sentence refers to a company name; finally, it builds and trains a deep neural network with the long short-term memory network, the Attention mechanism, cross entropy, and related techniques to obtain good parameters.
The invention provides a short text entity disambiguation method, which comprises the following steps:
S1, performing word segmentation on the training samples and the test samples; the specific steps are as follows:
S1.1, creating a dictionary of all entity names (including company full names and abbreviations), and finding all entities to be disambiguated in the training and test samples using the jieba word segmentation technology; FIG. 2 shows the jieba word segmentation workflow, in which the loaded dictionary consists of the entity names, so that words to be disambiguated can be found conveniently and quickly.
S1.2, generating a prefix tree for the text to be segmented, and constructing a directed acyclic graph (DAG) of all candidate word segmentations using regular-expression matching;
S1.3, finding the maximum-probability segmentation path by dynamic programming; to adapt the segmentation to the text, an HMM (hidden Markov model) is solved with the Viterbi algorithm to discover new words.
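Steps S1.2 and S1.3 are essentially jieba's core algorithm: build a DAG of candidate dictionary words over the text, then pick the maximum-probability path by dynamic programming. A minimal self-contained sketch of that idea follows; the toy frequency table and test sentence are illustrative stand-ins for jieba's real dictionary, and the HMM new-word discovery step is omitted:

```python
import math

# Toy word-frequency table; real jieba ships a large built-in dictionary
# to which the entity names (step S1.1) are added as user words.
FREQ = {"苹果": 5, "公司": 5, "苹果公司": 8, "发布": 6, "新": 4, "手机": 6}
TOTAL = sum(FREQ.values())

def build_dag(text):
    """For each start index i, list every end index j with text[i:j] in the dictionary."""
    dag = {}
    for i in range(len(text)):
        ends = [j for j in range(i + 1, len(text) + 1) if text[i:j] in FREQ]
        dag[i] = ends or [i + 1]  # unknown single character falls through
    return dag

def segment(text):
    """Maximum-probability path over the DAG, computed right-to-left by DP."""
    dag, n = build_dag(text), len(text)
    route = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(text[i:j], 1)) - math.log(TOTAL) + route[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        words.append(text[i:route[i][1]])
        i = route[i][1]
    return words
```

Under this toy dictionary, segment("苹果公司发布新手机") keeps the company name intact as the single token "苹果公司" because its path probability beats splitting it into "苹果" + "公司".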
S2, segmenting the sample by taking the entity to be disambiguated as the center; the method comprises the following specific steps:
S2.1, segmenting sentences and keeping only 32 characters when a sentence is encoded, which reduces neural network training time as much as possible while preserving accuracy;
S2.2, cutting the sentence with the entity name at the center: find the position of the entity name in the text and take the 13 characters before it and the 14 characters after it into one sentence, with the entity name fixed at 5 characters, as shown in FIG. 3.
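The windowing in step S2.2 (13 characters before the entity, 14 after) can be sketched as below; the function name and the behavior at text boundaries (truncation rather than padding) are illustrative assumptions:

```python
def clip_window(text, entity, before=13, after=14):
    """Cut a window around the first occurrence of `entity`.

    With the entity slot fixed at 5 characters, 13 + 5 + 14 = 32 characters,
    matching the sentence length used when encoding (step S2.1).
    """
    pos = text.find(entity)
    if pos < 0:
        return None                      # entity not present in this text
    start = max(0, pos - before)
    end = min(len(text), pos + len(entity) + after)
    return text[start:end]
```

Sentences shorter than the window simply come out shorter, which is why step S3.2 records each sentence's length and mask.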
S3, converting the sample containing the entity to be disambiguated into a word vector pre-trained by a BERT model; the method comprises the following specific steps:
S3.1, for each word in each sentence of the clipped training and validation samples, finding the id corresponding to the BERT pre-training model;
S3.2, since step S2 only guarantees equal length for long sentences, shorter sentences may fall below the fixed length. Therefore the length of each sentence must be recorded, with 0 and 1 used as a mask: 0 means no word at that position and 1 means a word at that position. Each sentence is thus converted into a quadruple [I, T, L, M], where I is the BERT model id of each word, T marks whether the sample is a company name (1 for a company name, 0 for a non-company name), L is the sentence length, and M is the sentence mask;
S3.3, batching the whole training set with 32 samples per batch and optimizing the parameters.
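The [I, T, L, M] quadruple and the 32-sample batching of steps S3.2 and S3.3 can be sketched as follows; the toy character-to-id vocabulary stands in for the real BERT vocabulary, whose ids would come from the pre-trained model:

```python
MAX_LEN = 32  # sentence length fixed in step S2.1

def encode(sentence, is_company, vocab):
    """Convert one clipped sentence into the quadruple [I, T, L, M].
    `vocab` is a toy char-to-id table standing in for the BERT vocabulary."""
    ids = [vocab.get(ch, vocab["[UNK]"]) for ch in sentence][:MAX_LEN]
    length = len(ids)
    mask = [1] * length + [0] * (MAX_LEN - length)  # 1 = word present, 0 = padding
    return [ids + [0] * (MAX_LEN - length), int(is_company), length, mask]

def batches(samples, batch_size=32):
    """Group encoded samples into batches of 32, as in step S3.3."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
```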
S4, constructing a neural network model, wherein the overall framework of the neural network is shown in FIG. 4, and the neural network model is divided into three sub-modules:
S4.1, a BERT conversion module, which converts the ids from step S3.1 into actual pre-trained BERT vectors;
S4.2, an LSTM module, used as the first training layer to learn information across the token sequence of each sentence;
S4.3, a linear output module, which produces the final output vector.
For the BERT model, the corresponding gradient information is retained in the calculation, where loss is the loss function, w is the weight, and y_i is the true value.
The LSTM module uses the dropout algorithm: in each layer, neurons are temporarily dropped from the network with a certain probability, and different neurons are randomly selected in each training iteration, which is equivalent to training a different neural network each time;
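The dropout behavior just described can be sketched in a few lines; this "inverted dropout" rescaling is one common variant, and the seed parameter exists only to make the sketch reproducible:

```python
import random

def dropout(activations, p=0.5, training=True, seed=0):
    """Inverted dropout: during training, zero each neuron with probability p
    and rescale the survivors by 1/(1-p); at inference, pass values through."""
    if not training:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - p
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Because different neurons are zeroed on each call with a fresh random state, every training step effectively samples a different thinned sub-network.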
Since the important part of a sentence usually lies in its key words, the linear output module uses an Attention mechanism, which gives higher weight to the tokens that strongly influence each word in the sentence. The attention score over tokens is computed as
α_t = softmax(f_T(h_t)), c_T = Σ_t α_t·h_t
where f_T is a linear layer, h_t is the hidden-layer state of the t-th token, and c_T is the context vector over the tokens.
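A minimal numpy sketch of this attention step, assuming the scoring layer f_T is a dot product with a learned vector w plus a bias b (the exact parameterization of f_T is not specified in the text):

```python
import numpy as np

def attention_context(hidden, w, b=0.0):
    """hidden: (T, d) token hidden states; w: (d,) weights of the scoring layer.
    Returns the context vector c_T = sum_t alpha_t * h_t."""
    scores = hidden @ w + b                    # f_T(h_t) for each token t
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ hidden
```

Tokens whose hidden states score highest dominate the weighted sum, which is exactly the "higher weight to important tokens" behavior described above.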
S5, calculating the value between the one-dimensional vector output by the neural network and the label vector of the sample by using the cross entropy as a loss function, and optimizing a neural network parameter model; the method comprises the following specific steps:
S5.1, computing the neural network loss with cross entropy and optimizing the network parameter model;
S5.2, since an entity name is grammatically just a referring pronoun with no intrinsic meaning, the problem is simplified into binary classification: an entity name is labeled 1 and a non-entity name 0. Cross entropy is well suited to binary classification and is sensitive to small differences, and an optimal solution is found by gradient descent. The cross-entropy loss function is defined as
loss = -(1/N)·Σ_i [y_i·log(ŷ_i) + (1-y_i)·log(1-ŷ_i)]
where y_i is the label of sample i (positive class 1, negative class 0) and ŷ_i is the predicted probability that sample i is positive;
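The binary cross-entropy above, written out directly as a plain-Python sketch (deep learning frameworks compute the same quantity in a vectorized, numerically safeguarded form):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    return -sum(
        y * math.log(p) + (1 - y) * math.log(1 - p)
        for y, p in zip(y_true, y_pred)
    ) / len(y_true)
```

The loss is small when confident predictions match the labels and grows without bound as a confident prediction turns out wrong, which is what makes it sensitive to slight differences.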
S5.3, optimizing the parameters with Adam as the gradient descent algorithm. At each training step, Adam applies exponentially weighted averaging to the gradients and then updates the weight W and bias b with the averaged values; if some direction oscillates strongly, its update speed is reduced, damping the oscillation. The exponentially weighted average is
v_t = β·v_{t-1} + (1-β)·θ_t
where β is a hyperparameter, v_t is the moving average at step t, and θ_t is the value at step t.
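The exponentially weighted average v_t = β·v_{t-1} + (1-β)·θ_t, as Adam applies it to its gradient moments, reduces to a few lines (initializing v_0 = 0 and omitting Adam's bias correction):

```python
def exp_weighted_average(values, beta=0.9):
    """v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0."""
    v, out = 0.0, []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        out.append(v)
    return out
```

A larger β averages over more past steps, which is what smooths out oscillating gradient directions.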
S6, searching parameters with higher training accuracy by using Microsoft Neural Network Intelligence (NNI); the method comprises the following specific steps:
Microsoft Neural Network Intelligence (NNI) is a lightweight but powerful toolkit for hyperparameter tuning; it adjusts the batch size, learning rate, per-sentence length, number of epochs, and number of convolution kernels, using the F1 value as the selection criterion:
P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2·P·R/(P+R)
where TP is the number of positive samples judged positive, FP the number of negative samples judged positive, and FN the number of positive samples judged negative.
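The F1 criterion used by the NNI search, computed from the confusion counts named above:

```python
def f1_score(tp, fp, fn):
    """F1 from confusion counts: the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```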
The general framework of the proposed method is shown in FIG. 1. It combines the BERT model with the LSTM model: the BERT model contributes vector parameters pre-trained on massive data, capturing relations between words, while the LSTM model captures information within a sentence through its input, output, and forget gates.
Model comparisons are performed below, with analysis being performed for the word vector model, the neural network, and the text length, respectively.
Comparison 1: the test-set F1 values obtained with the Word2vec, BERT, and ERNIE models are shown in FIG. 5, showing that BERT and ERNIE obtain the best results, with the BERT curve being smoother.
Comparison 2: comparing three neural network models, a plain feed-forward network, a convolutional neural network (CNN), and a long short-term memory network (LSTM), FIG. 6 shows that the LSTM converges more smoothly.
Comparison 3: for different text lengths, FIG. 7 shows that, over the same number of training epochs, text length has little effect.
The experimental results and analysis show that the invention uses the BERT model to effectively capture relations between words while avoiding the introduction of redundant information. For the neural network, the LSTM solves the problem of retaining information over long sequences. In addition, reasonably clipping the text length preserves enough information while improving training speed.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (7)
1. A short text entity disambiguation method, comprising the steps of:
S1, performing word segmentation on the training samples and the test samples;
S2, cutting each sample with the entity to be disambiguated at the center;
S3, converting the samples containing the entity to be disambiguated into word vectors pre-trained by a BERT model;
S4, constructing a neural network model;
S5, using cross entropy as the loss function to compute the value between the one-dimensional vector output by the neural network and the label vector of the sample, and optimizing the neural network parameter model;
S6, using Microsoft Neural Network Intelligence (NNI) to search for parameters with higher training accuracy.
2. The method for disambiguating an entity of short text as claimed in claim 1, wherein the specific steps of step S1 are:
S1.1, creating a dictionary of all entity names, and finding all entities to be disambiguated in the training and test samples using the jieba word segmentation technology;
S1.2, generating a prefix tree for the text to be segmented, and constructing a directed acyclic graph (DAG) of all candidate word segmentations using regular-expression matching;
S1.3, finding the maximum-probability segmentation path by dynamic programming; to adapt the segmentation to the text, an HMM (hidden Markov model) is solved with the Viterbi algorithm to discover new words.
3. The method for disambiguating an entity of short text as claimed in claim 1, wherein the specific steps of step S2 are:
S2.1, segmenting sentences, keeping only 32 characters when a sentence is encoded;
S2.2, cutting the sentence with the entity name at the center: find the position of the entity name in the text and take the 13 characters before it and the 14 characters after it into one sentence, with the entity name fixed at 5 characters.
4. The method for disambiguating an entity of short text as claimed in claim 1, wherein the specific steps of step S3 are:
S3.1, for each word in each sentence of the clipped training and validation samples, finding the id corresponding to the BERT pre-training model;
S3.2, recording the length of each sentence and using 0 and 1 as a mask, where 0 means no word at that position and 1 means a word at that position, so that each sentence is converted into a quadruple [I, T, L, M], where I is the BERT model id of each word, T marks whether the sample is a company name (1 for a company name, 0 for a non-company name), L is the sentence length, and M is the sentence mask;
S3.3, batching the whole training set with 32 samples per batch and optimizing the parameters;
the specific steps of step S4 are as follows. The neural network model is divided into three sub-modules:
S4.1, a BERT conversion module, which converts the ids from step S3.1 into actual pre-trained BERT vectors;
S4.2, an LSTM module, used as the first training layer to learn information across the token sequence of each sentence;
S4.3, a linear output module, which produces the final output vector.
5. The short text entity disambiguation method of claim 4, wherein in step S4.1, for the BERT model, the corresponding gradient information is retained in the calculation, where loss is the loss function, w is the weight, and y_i is the true value;
in step S4.2, the LSTM module uses a dropout algorithm, for each layer of neurons, the neurons are temporarily discarded from the network according to a certain probability, and different neurons are randomly selected during each iterative training, which is equivalent to performing training on different neural networks each time;
in step S4.3, the linear output module uses an Attention mechanism, which gives higher weight to the tokens that strongly influence each word in the sentence; the attention score over tokens is computed as α_t = softmax(f_T(h_t)), with context vector c_T = Σ_t α_t·h_t, where f_T is a linear layer and h_t is the hidden-layer state of the t-th token.
6. The method for disambiguating an entity of short text as claimed in claim 1, wherein the specific steps of step S5 are:
S5.1, computing the neural network loss with cross entropy and optimizing the network parameter model;
S5.2, since an entity name is grammatically just a referring pronoun with no intrinsic meaning, the problem is simplified into binary classification: an entity name is labeled 1 and a non-entity name 0. Cross entropy is well suited to binary classification and is sensitive to small differences, and an optimal solution is found by gradient descent. The cross-entropy loss function is defined as
loss = -(1/N)·Σ_i [y_i·log(ŷ_i) + (1-y_i)·log(1-ŷ_i)]
where y_i is the label of sample i (positive class 1, negative class 0) and ŷ_i is the predicted probability that sample i is positive;
S5.3, optimizing the parameters with Adam as the gradient descent algorithm. At each training step, Adam applies exponentially weighted averaging to the gradients and then updates the weight W and bias b with the averaged values; if some direction oscillates strongly, its update speed is reduced, damping the oscillation. The exponentially weighted average is
v_t = β·v_{t-1} + (1-β)·θ_t
where β is a hyperparameter, v_t is the moving average at step t, and θ_t is the value at step t.
7. The method for disambiguating an entity of short text as claimed in claim 1, wherein the specific steps of step S6 are:
the Microsoft Neural Network Intelligence (NNI) toolkit is used to tune the hyperparameters: batch size, learning rate, maximum length processed per sentence, number of training cycles, and number of convolution kernels, with the F1 value as the judgment basis; the F1 formula is as follows:

F1 = 2·TP / (2·TP + FP + FN)
where TP represents the number of positive samples determined to be positive, FP represents the number of negative samples determined to be positive, and FN represents the number of positive samples determined to be negative.
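With TP, FP, and FN defined as above, the F1 value is the harmonic mean of precision and recall (a minimal sketch with made-up counts, not the patent's evaluation code):

```python
def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN): the harmonic mean of
    precision TP/(TP+FP) and recall TP/(TP+FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# 8 true positives, 2 false positives, 2 false negatives:
# precision = 8/10 = 0.8, recall = 8/10 = 0.8, so F1 = 0.8
score = f1_score(tp=8, fp=2, fn=2)
```

Because it is a harmonic mean, F1 is pulled toward whichever of precision or recall is worse, which makes it a stricter tuning target than accuracy on imbalanced entity/non-entity data.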
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110366911.6A CN112906397B (en) | 2021-04-06 | 2021-04-06 | Short text entity disambiguation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110366911.6A CN112906397B (en) | 2021-04-06 | 2021-04-06 | Short text entity disambiguation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112906397A true CN112906397A (en) | 2021-06-04 |
CN112906397B CN112906397B (en) | 2021-11-19 |
Family
ID=76109966
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110366911.6A Active CN112906397B (en) | 2021-04-06 | 2021-04-06 | Short text entity disambiguation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112906397B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113449516A (en) * | 2021-06-07 | 2021-09-28 | 深延科技(北京)有限公司 | Disambiguation method, system, electronic device and storage medium for acronyms |
CN113704416A (en) * | 2021-10-26 | 2021-11-26 | 深圳市北科瑞声科技股份有限公司 | Word sense disambiguation method and device, electronic equipment and computer-readable storage medium |
CN113779959A (en) * | 2021-08-31 | 2021-12-10 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Small sample text data mixing enhancement method |
CN114818736A (en) * | 2022-05-31 | 2022-07-29 | 北京百度网讯科技有限公司 | Text processing method, chain finger method and device for short text and storage medium |
CN115238701A (en) * | 2022-09-21 | 2022-10-25 | 北京融信数联科技有限公司 | Multi-field named entity recognition method and system based on subword level adapter |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108566627A (en) * | 2017-11-27 | 2018-09-21 | 浙江鹏信信息科技股份有限公司 | A method and system for identifying fraudulent text messages using deep learning |
CN111581973A (en) * | 2020-04-24 | 2020-08-25 | 中国科学院空天信息创新研究院 | Entity disambiguation method and system |
CN112069826A (en) * | 2020-07-15 | 2020-12-11 | 浙江工业大学 | Vertical domain entity disambiguation method fusing topic model and convolutional neural network |
CN112464669A (en) * | 2020-12-07 | 2021-03-09 | 宁波深擎信息科技有限公司 | Stock entity word disambiguation method, computer device and storage medium |
Non-Patent Citations (3)
Title |
---|
DU J等: "Using bert for word sense disambiguation", 《ARXIV PREPRINT ARXIV:1909.08358》 * |
HUANG L等: "GlossBERT: BERT for word sense disambiguation with gloss knowledge", 《ARXIV PREPRINT ARXIV:1908.07245》 * |
JACOB DEVLIN等: "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", 《ARXIV:1810.04805V1》 * |
Also Published As
Publication number | Publication date |
---|---|
CN112906397B (en) | 2021-11-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992782B (en) | Legal document named entity identification method and device and computer equipment | |
CN112906397B (en) | Short text entity disambiguation method | |
CN108628823B (en) | Named entity recognition method combining attention mechanism and multi-task collaborative training | |
CN108920622B (en) | Training method, training device and recognition device for intention recognition | |
CN109284506B (en) | User comment emotion analysis system and method based on attention convolution neural network | |
CN106776581B (en) | Subjective text emotion analysis method based on deep learning | |
CN111738003B (en) | Named entity recognition model training method, named entity recognition method and medium | |
CN111209401A (en) | System and method for classifying and processing sentiment polarity of online public opinion text information | |
CN110909736B (en) | Image description method based on long-term and short-term memory model and target detection algorithm | |
CN113591483A (en) | Document-level event argument extraction method based on sequence labeling | |
CN111709242B (en) | Chinese punctuation mark adding method based on named entity recognition | |
CN114757182A (en) | BERT short text sentiment analysis method for improving training mode | |
CN111931506A (en) | Entity relationship extraction method based on graph information enhancement | |
WO2023134083A1 (en) | Text-based sentiment classification method and apparatus, and computer device and storage medium | |
CN115392259B (en) | Microblog text sentiment analysis method and system based on confrontation training fusion BERT | |
CN112818110B (en) | Text filtering method, equipment and computer storage medium | |
CN114416979A (en) | Text query method, text query equipment and storage medium | |
CN112163089A (en) | Military high-technology text classification method and system fusing named entity recognition | |
CN115831102A (en) | Speech recognition method and device based on pre-training feature representation and electronic equipment | |
CN115238693A (en) | Chinese named entity recognition method based on multi-word segmentation and multi-layer bidirectional long-short term memory | |
Chen et al. | Chinese Weibo sentiment analysis based on character embedding with dual-channel convolutional neural network | |
CN115204143A (en) | Method and system for calculating text similarity based on prompt | |
CN115169349A (en) | Chinese electronic resume named entity recognition method based on ALBERT | |
CN116522165B (en) | Public opinion text matching system and method based on twin structure | |
Sinapoy et al. | Comparison of lstm and indobert method in identifying hoax on twitter |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 2024-06-13
Address after: 518000, Room 1104, Building A, Zhiyun Industrial Park, No. 13 Huaxing Road, Henglang Community, Longhua District, Shenzhen, Guangdong Province
Patentee after: Shenzhen Hongyue Information Technology Co., Ltd. (China)
Address before: 226019, No. 9 sik Road, Chongchuan District, Nantong City, Jiangsu Province
Patentee before: NANTONG University (China)