CN114444496A - Short text entity correlation identification method, system, electronic equipment and storage medium

Info

Publication number: CN114444496A
Application number: CN202110439445.XA
Authority: CN (China)
Original language: Chinese (zh)
Inventors: 郭艳波; 刘瑞熙; 王兆元; 龚浩; 李青龙
Applicant and current assignee: Beijing Smart Starlight Information Technology Co., Ltd.
Legal status: Pending

Classifications

    • G06F40/279 Natural language analysis; recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F18/214 Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F18/2411 Classification techniques based on the proximity to a decision surface, e.g. support vector machines
    • G06N3/045 Neural networks; combinations of networks
    • G06N3/08 Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short text entity correlation identification method, system, electronic device and storage medium. The method comprises: fusing a word vector containing entity context semantic information, a position vector containing entity position coding information, and a pre-coding vector containing entity pre-coding information to obtain a fused word vector for each word in a training text; inputting the fused word vectors of each training text into a TD_LSTM model to obtain a forward vector code and a backward vector code, splicing the forward and backward vector codes to obtain an entity splicing vector, passing the entity splicing vector through a feedforward neural network to obtain a classification coding vector for each training text, normalizing the classification coding vectors, obtaining a loss function from the normalized classification coding vectors, and iteratively optimizing the loss function to obtain an optimal model; and inputting a short text to be recognized into the optimal model to obtain a recognition result for the text to be recognized. In this way, accurate identification of short text entity correlation is achieved.

Description

Short text entity correlation identification method, system, electronic equipment and storage medium
Technical Field
The invention relates to the field of text processing, in particular to a short text entity correlation identification method, a short text entity correlation identification system, electronic equipment and a storage medium.
Background
At present, the processing methods of text entity correlation are mainly divided into two types, namely machine learning-based and neural network-based.
The machine learning-based approach mainly involves manually constructing feature information from the text and then classifying it with a machine learning classifier; commonly used classifiers include the Support Vector Machine (SVM), naive Bayes, and the like. In traditional machine learning methods, feature construction is crucial and is the key factor that determines accuracy, requiring a large amount of feature engineering work; because the extraction and construction of feature information is not comprehensive, the accuracy of entity correlation identification based on the given feature information is low.
The neural network-based approach builds a deep semantic feature vector representation of the text and then predicts the relevance of a given entity. Compared with manually constructed feature information, feature extraction using a neural network can be more comprehensive, and the target feature representation can be learned from the data. The neural network learns features automatically, avoiding a feature extraction process that requires a large amount of domain knowledge. Common neural networks include the Recurrent Neural Network (RNN), the Long Short-Term Memory network (LSTM), and the like. However, the entity features automatically learned from context semantics are still incomplete, so entity relevance identification remains inaccurate.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method, a system, an electronic device, and a storage medium for identifying short text entity relevance, so as to solve the problem in the prior art that identification of short text entity relevance is not accurate.
Therefore, the embodiment of the invention provides the following technical scheme:
according to a first aspect, an embodiment of the present invention provides a short text entity relevance identification method, including: acquiring a short text training set, wherein the short text training set comprises a plurality of training texts and a named entity corresponding to each training text; acquiring an entity pre-coding matrix, wherein the entity pre-coding matrix comprises pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to context information pre-coded by the entities; converting each word in each training text into a word vector, and converting the position of each word in the training text into a position vector; obtaining a pre-coding vector corresponding to each word according to each word and the entity pre-coding matrix in each training text, wherein the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector; obtaining a fused word vector corresponding to each word in each training text according to the word vector, the position vector and the pre-coding vector corresponding to each word in each training text; respectively inputting the fusion word vector of each training text into a TD _ LSTM network to obtain a forward vector code and a backward vector code corresponding to each training text; respectively splicing the forward vector code and the backward vector code corresponding to each training text to obtain an entity splicing vector corresponding to each training text; the entity splicing vector corresponding to each training text is processed by a feedforward neural network to obtain a classification coding vector corresponding to each training text; normalizing the classified coding vectors to obtain normalized classified coding vectors; obtaining a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector; performing iterative optimization training according to the loss function to obtain an optimal model, wherein the optimal model is used for identifying the short text and the correlation strength of entities in the short text; acquiring a short text to be recognized; and inputting the short text to be recognized into an optimal model to obtain a recognition result of the text to be recognized.
Optionally, the calculation formula of the fused word vector is as follows:
w_a=w_ta+w_pa+w_da
wherein w_a is the fused word vector corresponding to word a, w_ta is the word vector corresponding to word a, w_pa is the position vector corresponding to word a, and w_da is the pre-coding vector corresponding to word a.
Optionally, after the step of inputting the fused word vectors of each training text into the TD_LSTM network to obtain the forward vector code and backward vector code corresponding to each training text, the method further includes: obtaining, from the TD_LSTM network, the entity word coding vector corresponding to the entity word in each training text, the coding vectors of a first preset number of words to the left of the entity word, and the coding vectors of a second preset number of words to the right of the entity word; performing a weighted average of the entity word coding vector, the coding vectors of the first preset number of words to the left, and the coding vectors of the second preset number of words to the right to obtain a new coding vector for the entity word; comparing the similarity between the new coding vector for the entity word and the pre-coding vector corresponding to the entity word in the entity pre-coding matrix to obtain a similarity value; judging whether the similarity value is greater than a preset similarity threshold; if the similarity value is greater than the preset similarity threshold, not updating the entity pre-coding matrix; and if the similarity value is less than or equal to the preset similarity threshold, performing a weighted average of the new coding vector and the pre-coding vector corresponding to the entity word in the entity pre-coding matrix to obtain an updated pre-coding vector, and updating the entity pre-coding matrix with the updated pre-coding vector.
Optionally, the similarity value is calculated as follows:
Threshold=cosine(w_new, w_d)
where Threshold is the similarity value, w_new is the new coding vector corresponding to the entity word, and w_d is the pre-coding vector corresponding to the entity word;
the new coding vector corresponding to the entity word is calculated as follows:
w_new = (w_t + w_t(l-1) + w_t(l-2) + … + w_t(l-s) + w_t(r+1) + w_t(r+2) + … + w_t(r+v)) / (s + v + 1)
where s is the first preset number of words to the left of the entity word and v is the second preset number of words to the right of the entity word; w_t is the coding vector of the entity word in the training text, w_t(l-1) is the coding vector of the first word to the left of the entity word, w_t(l-2) is the coding vector of the second word to the left, w_t(l-s) is the coding vector of the s-th word to the left, w_t(r+1) is the coding vector of the first word to the right, w_t(r+2) is the coding vector of the second word to the right, and w_t(r+v) is the coding vector of the v-th word to the right of the entity word in the training text.
Optionally, the updated pre-coding vector is calculated as follows:
w_z = (w_new + w_d) / 2
where w_z is the updated pre-coding vector of the entity word, w_new is the new coding vector corresponding to the entity word, and w_d is the original pre-coding vector corresponding to the entity word.
Optionally, before the step of inputting the short text to be recognized into the optimal model, the method further includes: inputting the short text to be recognized into a TextRCNN model for binary text classification to obtain a classification result; if the classification result of the short text to be recognized is noise text, removing the text to be recognized; and if the classification result of the short text to be recognized is non-noise text, retaining the text to be recognized.
Optionally, the loss function is calculated as follows:
loss = -(1/n) * Σ(i=1..n) y_i · log(a_i)
where n is the number of training samples, y_i is the actual classification vector of the i-th training sample, and a_i is the model-coded classification vector of the i-th training sample.
According to a second aspect, an embodiment of the present invention provides a short text entity relevance identification system, including:
a first obtaining module, configured to acquire a short text training set, where the short text training set includes a plurality of training texts and the named entity corresponding to each training text;
a second obtaining module, configured to obtain an entity pre-coding matrix, where the entity pre-coding matrix includes pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to context information pre-coded by the entity;
the first processing module is used for respectively converting each word in each training text into a word vector and converting the position of each word in the training text into a position vector;
the second processing module is used for obtaining a pre-coding vector corresponding to each word according to each word and the entity pre-coding matrix in each training text, wherein the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector;
the third processing module is used for obtaining a fused word vector corresponding to each word in each training text according to the word vector, the position vector and the pre-coding vector corresponding to each word in each training text;
the fourth processing module is used for respectively inputting the fused word vector of each training text into the TD_LSTM network to obtain the forward vector code and the backward vector code corresponding to each training text;
the fifth processing module is used for splicing the forward vector code and the backward vector code corresponding to each training text to obtain an entity splicing vector corresponding to each training text;
the sixth processing module is used for enabling the entity splicing vector corresponding to each training text to pass through a feedforward neural network to obtain a classified coding vector corresponding to each training text;
the seventh processing module is used for normalizing the classified coding vectors to obtain normalized classified coding vectors;
the eighth processing module is used for obtaining a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector;
the ninth processing module is used for performing iterative optimization training according to the loss function to obtain an optimal model, and the optimal model is used for identifying the short text and the correlation strength of the entities in the short text;
the third acquisition module is used for acquiring short texts to be identified;
and the tenth processing module is used for inputting the short text to be recognized into the optimal model to obtain the recognition result of the text to be recognized.
Optionally, the calculation formula of the fused word vector is as follows:
w_a=w_ta+w_pa+w_da
wherein w_a is the fused word vector corresponding to word a, w_ta is the word vector corresponding to word a, w_pa is the position vector corresponding to word a, and w_da is the pre-coding vector corresponding to word a.
Optionally, the system further includes: an eleventh processing module, configured to obtain, from the TD_LSTM network into which the fused word vectors of each training text are input, the entity word coding vector corresponding to the entity word in each training text, the coding vectors of a first preset number of words to the left of the entity word, and the coding vectors of a second preset number of words to the right of the entity word; a twelfth processing module, configured to perform a weighted average of the entity word coding vector, the coding vectors of the first preset number of words to the left of the entity word, and the coding vectors of the second preset number of words to the right of the entity word to obtain a new coding vector corresponding to the entity word; a thirteenth processing module, configured to compare the similarity between the new coding vector corresponding to the entity word and the pre-coding vector corresponding to the entity word in the entity pre-coding matrix to obtain a similarity value; a judging module, configured to judge whether the similarity value is greater than a preset similarity threshold; a fourteenth processing module, configured to not update the entity pre-coding matrix if the similarity value is greater than the preset similarity threshold; and a fifteenth processing module, configured to, if the similarity value is less than or equal to the preset similarity threshold, perform a weighted average of the new coding vector corresponding to the entity word and the pre-coding vector corresponding to the entity word in the entity pre-coding matrix to obtain an updated pre-coding vector, and update the entity pre-coding matrix with the updated pre-coding vector.
Optionally, the similarity value is calculated as follows:
Threshold=cosine(w_new, w_d)
where Threshold is the similarity value, w_new is the new coding vector corresponding to the entity word, and w_d is the pre-coding vector corresponding to the entity word;
the new coding vector corresponding to the entity word is calculated as follows:
w_new = (w_t + w_t(l-1) + w_t(l-2) + … + w_t(l-s) + w_t(r+1) + w_t(r+2) + … + w_t(r+v)) / (s + v + 1)
where s is the first preset number of words to the left of the entity word and v is the second preset number of words to the right of the entity word; w_t is the coding vector of the entity word in the training text, w_t(l-1) is the coding vector of the first word to the left of the entity word, w_t(l-2) is the coding vector of the second word to the left, w_t(l-s) is the coding vector of the s-th word to the left, w_t(r+1) is the coding vector of the first word to the right, w_t(r+2) is the coding vector of the second word to the right, and w_t(r+v) is the coding vector of the v-th word to the right of the entity word in the training text.
Optionally, the updated pre-coding vector is calculated as follows:
w_z = (w_new + w_d) / 2
where w_z is the updated pre-coding vector of the entity word, w_new is the new coding vector corresponding to the entity word, and w_d is the original pre-coding vector corresponding to the entity word.
Optionally, the system further includes: a sixteenth processing module, configured to input the short text to be recognized into a TextRCNN model for binary text classification to obtain a classification result; a seventeenth processing module, configured to remove the text to be recognized if the classification result of the short text to be recognized is noise text; and an eighteenth processing module, configured to retain the text to be recognized if the classification result of the short text to be recognized is non-noise text.
Optionally, the loss function is calculated as follows:
loss = -(1/n) * Σ(i=1..n) y_i · log(a_i)
where n is the number of training samples, y_i is the actual classification vector of the i-th training sample, and a_i is the model-coded classification vector of the i-th training sample.
According to a third aspect, embodiments of the present invention provide an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the short text entity relevance identifying method described in any of the above first aspects.
According to a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and the computer instructions are configured to cause a computer to execute the short text entity relevance identification method described in any one of the first aspect.
The technical scheme of the embodiment of the invention has the following advantages:
the embodiment of the invention provides a method, a system, electronic equipment and a storage medium for identifying short text entity correlation, wherein the method comprises the following steps: acquiring a short text training set, wherein the short text training set comprises a plurality of training texts and a named entity corresponding to each training text; acquiring an entity pre-coding matrix, wherein the entity pre-coding matrix comprises pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to context information pre-coded by the entities; converting each word in each training text into a word vector, and converting the position of each word in the training text into a position vector; obtaining a pre-coding vector corresponding to each word according to each word and the entity pre-coding matrix in each training text, wherein the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector; obtaining a fused word vector corresponding to each word in each training text according to the word vector, the position vector and the pre-coding vector corresponding to each word in each training text; respectively inputting the fusion word vector of each training text into a TD _ LSTM network to obtain a forward vector code and a backward vector code corresponding to each training text; respectively splicing the forward vector code and the backward vector code corresponding to each training text to obtain an entity splicing vector corresponding to each training text; the entity splicing vector corresponding to each training text is processed by a feedforward neural network to obtain a classification coding vector corresponding to each training text; normalizing the classified coding vectors to obtain normalized classified coding vectors; obtaining a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector; performing iterative optimization training according to the loss function to obtain an optimal model, wherein the optimal model is used for identifying the short text and the correlation strength of entities in the short text; acquiring a short text to be recognized; and inputting the short text to be recognized into an optimal model to obtain a recognition result of the text to be recognized. 
In the above steps, multi-dimensional information fusion is adopted: the first dimension is a word vector containing entity context semantic information, the second dimension is a position vector containing entity position coding information, and the third dimension is a pre-coding vector containing entity pre-coding information. The three vectors are fused to obtain the fused word vector corresponding to each word in the training text, so that not only is the relevance between the entity and its context considered from the local context, but the relevance between the context information carried by the entity and the text is also considered from the larger context environment; the entity features are enriched and expressed from multiple information dimensions, which improves the confidence of the entity-text relevance calculation. Then, the fused word vectors corresponding to the words of each training text are input into the TD_LSTM model to obtain the forward vector code and the backward vector code corresponding to each training text; the forward and backward vector codes are spliced to obtain an entity splicing vector; the entity splicing vector is passed through a feedforward neural network to obtain the classification coding vector corresponding to each training text; the classification coding vector is normalized to obtain a normalized classification coding vector; a loss function between the model-coded classification vector and the actual entity classification vector is obtained from the normalized classification coding vector; and iterative optimization training is performed according to the loss function to obtain the optimal model. Finally, the short text to be recognized is input into the optimal model to obtain the recognition result of the text to be recognized, determining the degree of correlation between the entities in the short text to be recognized and the short text itself. Through the above steps, accurate identification of short text entity correlation is achieved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.
FIG. 1 is a flowchart of a specific example of a short text entity relevance identification method according to an embodiment of the present invention;
FIG. 2 is a flowchart of another specific example of a short text entity relevance identification method according to an embodiment of the present invention;
FIG. 3 is a block diagram of a specific example of a short text entity relevance identification system according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the invention provides a short text entity relevance identification method, which includes steps S1 to S13, as shown in FIG. 1.
Step S1: and acquiring a short text training set, wherein the short text training set comprises a plurality of training texts and named entities corresponding to each training text.
As an exemplary embodiment, short text is defined relative to long text; specifically, a short text is a text whose word count is within a preset number of words. In this embodiment, the preset number of words is set to 350, that is, a text with no more than 350 words is a short text; this is only a schematic illustration and is not limiting. Of course, in other embodiments, the preset number of words may be set to other values and may be set reasonably as needed in practical applications.
The short text training set comprises a plurality of short text training texts and named entities contained in each short text. Specifically, the named entity recognition method in the prior art can be used for recognizing the named entity in the training text to obtain the named entity corresponding to the training text. In this embodiment, the named entity includes an organization name; of course, in other embodiments, the named entity may further include a name of a person, a name of a place, and the like, and the specific type may be reasonably determined according to actual needs.
Step S2: and acquiring an entity pre-coding matrix, wherein the entity pre-coding matrix comprises pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to the context information pre-coded by the entities.
As an exemplary embodiment, the entity pre-coding matrix includes the pre-coding vectors corresponding to all entities, one pre-coding vector per entity. The pre-coding vector is obtained from the context information pre-coded for the entity. The specific process is as follows: first, the entity pre-coding matrix is initialized; then, the entity vector and the vectors of a preset number of words on its left and right are obtained through TD_LSTM training, and the entity vector and the word vectors of the left and right words are weighted-averaged to obtain a new entity vector that fuses the context information. The similarity between the new entity vector and the entity vector in the entity pre-coding matrix is calculated. If the similarity is greater than a preset threshold, no operation is performed; if the similarity is less than or equal to the preset threshold, the new entity vector and the original entity vector are weighted-averaged and the result is updated into the entity pre-coding matrix. Through continuous iterative training, new context information is continuously merged into the entity pre-coding matrix, so that each entity carries rich context information. This context information is used to calculate the degree of relevance to the short text and helps determine whether the entity is the dominant entity. Specifically, a pre-coding module may initialize a vector matrix of all entities to obtain the entity pre-coding matrix.
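As an illustrative sketch of the initialization step (the mapping structure, function name, and random initialization scheme are assumptions, not taken from the patent), the entity pre-coding matrix can be held as a dictionary from each entity to a 300-dimensional vector:

```python
import numpy as np

def init_entity_precoding(entities, d_model=300, seed=0):
    """Initialize one pre-coding vector per entity; training later folds context information into them."""
    rng = np.random.default_rng(seed)
    return {entity: rng.normal(scale=0.1, size=d_model) for entity in entities}
```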
When judging entity correlation, the context information pre-coded in the entity is used to calculate the degree of relevance to the short text, helping determine whether the entity is a primary entity. The entity pre-coding is obtained from the training data, and a large amount of context information is merged into the entity pre-coding vector. For example, suppose four entities are mentioned: Vanke, Ping An of China, Kweichow Moutai, and CATL (Ningde Times).
The context of Vanke is generally related to real estate, the context of Ping An of China is generally related to insurance, and the context of Kweichow Moutai is generally related to liquor. Through rounds of iterative training, this context information is merged into the entity vector, so that the entity carries a large amount of information; this helps judge entity correlation and improves the confidence of the entity correlation judgment.
Step S3: and respectively converting each word in each training text into a word vector, and converting the position of each word in the training text into a position vector.
As an exemplary embodiment, a word2vec word vector model with a preset dimension is trained in advance on a multi-domain corpus, and word vectors are obtained from this model. In this embodiment, the preset dimension is set to 300; in other embodiments, the preset dimension may be set to other values as needed. Each word in each short text of the training set is converted into a word vector w_t, which represents the entity context semantic information.
The position of each word in the short text is converted into a position vector w_p, which represents the entity position coding information. In this embodiment, the dimension of the position vector is also the preset dimension, specifically 300.
The position vector is encoded as follows: in the embodiment, sine and cosine position coding is used, and the position information of the words in the short text is coded by utilizing the linear change characteristics of sine and cosine functions. The specific formula is as follows:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
In the above formulas, pos represents the actual position of a word in the short text, i indexes the dimensions of the word vector, and d_model represents the dimension of the word vector.
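A brief, hedged sketch of this sine/cosine position coding (NumPy; the function name is illustrative and d_model defaults to the 300 dimensions used in this embodiment):

```python
import numpy as np

def position_vector(pos, d_model=300):
    """Sine/cosine position encoding for the word at position pos."""
    w_p = np.zeros(d_model)
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        w_p[i] = np.sin(angle)              # even dimensions use sine
        if i + 1 < d_model:
            w_p[i + 1] = np.cos(angle)      # odd dimensions use cosine
    return w_p
```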
Statistical analysis of a large number of short texts shows that the primary entity associated with a short text has a higher semantic affinity with the context, while secondary entities have a lower semantic affinity with the context. Primary entities are generally located at the beginning of the short text and rarely appear at its end, so position coding can also improve the judgment of entity relevance.
Step S4: and obtaining a pre-coding vector corresponding to each word according to each word in each training text and the entity pre-coding matrix, wherein the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector.
As an exemplary embodiment, each word in a training text is matched against the entities in the entity pre-coding matrix. If the word corresponds to an entity, that is, the corresponding entity can be found in the entity pre-coding matrix, the pre-coding vector of that entity in the entity pre-coding matrix is used as the pre-coding vector of the entity word. If the word corresponds to a non-named entity, that is, no corresponding entity exists in the entity pre-coding matrix, the pre-coding vector of the word is set to a zero vector; in other words, no pre-coding vector is added for non-entity words and their original vectors remain unchanged.
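A possible lookup for this matching step, assuming the entity pre-coding matrix is stored as the dictionary sketched above (illustrative only):

```python
import numpy as np

def precoding_vector(word, entity_precoding, d_model=300):
    """Return the entity pre-coding vector for entity words, and a zero vector for non-entity words."""
    if word in entity_precoding:        # the word corresponds to a named entity
        return entity_precoding[word]
    return np.zeros(d_model)            # non-entity words get a zero vector
```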
Step S5: and obtaining a fused word vector corresponding to each word in each training text according to the word vector, the position vector and the pre-coding vector corresponding to each word in each training text.
As an exemplary embodiment, the word vector w_t converted from each word in the short text, the position vector w_p converted from the position of the word in the short text, and the pre-coding vector w_d of each word obtained from the entity pre-coding matrix are added to obtain a fused word vector, and the fused word vector w_a is input into the TD_LSTM network.
Specifically, the calculation formula of the fused word vector is as follows:
w_a=w_ta+w_pa+w_da
wherein w_a is the fused word vector corresponding to word a, w_ta is the word vector corresponding to word a, w_pa is the position vector corresponding to word a, and w_da is the pre-coding vector corresponding to word a. When word a is a named entity word, w_da is the corresponding pre-coding vector in the entity pre-coding matrix; when word a is a non-named-entity word, w_da is a zero vector.
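Using the helpers sketched above, the fusion can be illustrated as a simple element-wise sum of the three 300-dimensional vectors (word_vectors stands for the pretrained word2vec lookup and is an assumed structure):

```python
def fused_word_vector(word, pos, word_vectors, entity_precoding, d_model=300):
    """w_a = w_ta + w_pa + w_da: word vector + position vector + pre-coding vector."""
    w_t = word_vectors[word]                                 # entity context semantic information
    w_p = position_vector(pos, d_model)                      # entity position coding information
    w_d = precoding_vector(word, entity_precoding, d_model)  # entity pre-coding information
    return w_t + w_p + w_d
```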
Step S6: and respectively inputting the fused word vector of each training text into the TD _ LSTM network to obtain the forward vector code and the backward vector code corresponding to each training text.
As an exemplary embodiment, the TD_LSTM network employs two LSTM networks, an LSTM_L network and an LSTM_R network. The input of the LSTM_L network is the words from the first word of the short text to the entity word, fed into LSTM_L from left to right, and the output is a forward vector code that incorporates the preceding (above) context of the short text. The input of the LSTM_R network is the words from the last word of the short text to the entity word, fed into LSTM_R from right to left, and the output is a backward vector code that incorporates the following (below) context of the short text.
The specific output results are as follows:
w_l = LSTM_L(w0, w1, …, wt)
w_r = LSTM_R(wn, wn-1, …, wt)
where LSTM_L is the forward long short-term memory network (input words are fed into the network from left to right in sequence) and LSTM_R is the backward long short-term memory network (input words are fed into the network from right to left in sequence); w0 is the first word from the left, w1 is the second word from the left, and so on, with wt being the entity word; wn is the first word from the right, wn-1 is the second word from the right, and so on, with wt again being the entity word.
The model used to identify primary and secondary entities in this embodiment is TD-LSTM (Target-Dependent Long Short-Term Memory). The model encodes the context before and after the entity word separately, in practice using two LSTMs, namely LSTM_L and LSTM_R.
As shown in the following sentence:
Why are the employees in the mall driven around by the security guards of Suning Square!!
The entity in this sentence is "Suning Square". The above text runs from the first word of the sentence to "Suning Square", and the below text runs from "Suning Square" to the exclamation mark at the end of the sentence.
LSTM_L inputs the above text of "Suning Square" into the LSTM, that is, the words from the first word of the sentence to the entity word are fed into the network from left to right in sequence for semantic information coding, yielding a forward vector code fused with the above semantic information.
LSTM_R inputs the below text of "Suning Square" into the LSTM, that is, the words from the last word of the sentence to the entity word are fed into the network from right to left in sequence for semantic information coding, yielding a backward vector code fused with the below semantic information.
Step S7: and respectively splicing the forward vector code and the backward vector code corresponding to each training text to obtain an entity splicing vector corresponding to each training text.
As an exemplary embodiment, the forward vector code that incorporates the above information and the backward vector code that incorporates the below information are spliced to obtain a splicing vector. The splicing can specifically be performed with a concat function, so that the above and below semantic information related to the entity is merged into the entity vector, enriching the information carried by the entity. Because the entity incorporates the context semantic information, the relevance between the entity and the context can be judged.
The calculation formula of the splicing vector is as follows:
w_i=concat(w_l,w_r)
where w_i is the entity splicing vector corresponding to the training text, w_l is the forward vector code corresponding to the training text, and w_r is the backward vector code corresponding to the training text.
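A compact PyTorch sketch of steps S6 and S7 under the assumptions above (layer sizes and names are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class TDLSTMEncoder(nn.Module):
    """Encode the left context (first word .. entity word) and the reversed right context
    (last word .. entity word), then splice the two codes into the entity splicing vector."""
    def __init__(self, d_model=300, hidden=128):
        super().__init__()
        self.lstm_l = nn.LSTM(d_model, hidden, batch_first=True)  # forward LSTM, left to right
        self.lstm_r = nn.LSTM(d_model, hidden, batch_first=True)  # backward LSTM, right to left

    def forward(self, left_ctx, right_ctx):
        # left_ctx:  (batch, len_left,  d_model) fused word vectors w0 .. wt
        # right_ctx: (batch, len_right, d_model) fused word vectors wn .. wt (already reversed)
        _, (h_l, _) = self.lstm_l(left_ctx)             # w_l: forward vector code
        _, (h_r, _) = self.lstm_r(right_ctx)            # w_r: backward vector code
        w_i = torch.cat([h_l[-1], h_r[-1]], dim=-1)     # entity splicing vector, concat(w_l, w_r)
        return w_i
```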
Step S8: and (4) enabling the entity splicing vector corresponding to each training text to pass through a feedforward neural network to obtain a classified coding vector corresponding to each training text.
As an exemplary embodiment, the entity splicing vector corresponding to the training text is passed through a feedforward neural network to obtain a classification coding vector, where the classification refers to strong correlation and weak correlation. In this embodiment, the classification codes are of two types, 0 and 1, where 0 represents weak correlation and 1 represents strong correlation.
The calculation formula of the classified coding vector is as follows:
c_i=FeedForward(w_i)
where c_i is the classification coding vector corresponding to the training text, and w_i is the entity splicing vector corresponding to the training text.
Step S9: and normalizing the classified coding vectors to obtain normalized classified coding vectors.
As an exemplary embodiment, normalization maps the classification values to the interval [0, 1], so that the loss can be calculated against the actual classification values 0 and 1, followed by back propagation, parameter updating, and iterative optimization of the model. The classification coding vector c_i is output to SoftMax for normalization to obtain the normalized classification coding vector.
Step S10: and obtaining a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector.
As an exemplary embodiment, the normalized classification coding vector is calculated to obtain a loss function between the model-coded classification vector and the actual entity classification vector.
The loss function is calculated as:
loss = -(1/n) * Σ(i=1..n) y_i · log(a_i)
where n is the number of training samples, y_i is the actual classification vector of the i-th training sample, and a_i is the model-coded classification vector of the i-th training sample.
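A hedged illustration of steps S8-S10, assuming the normalization is a standard softmax and the loss is a cross-entropy between the normalized classification coding vector and the actual 0/1 label (the exact formulation in the patent may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ffn = nn.Linear(2 * 128, 2)                 # feedforward layer: splicing vector -> two classes

def relevance_loss(w_i, y):
    """w_i: (batch, 2*hidden) entity splicing vectors; y: (batch,) labels (0 = weak, 1 = strong)."""
    c_i = ffn(w_i)                          # classification coding vector
    a_i = F.softmax(c_i, dim=-1)            # normalized classification coding vector
    return F.nll_loss(torch.log(a_i), y)    # cross-entropy against the actual labels
```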
Step S11: and performing iterative optimization training according to the loss function to obtain an optimal model, wherein the optimal model is used for identifying the short text and the correlation strength of entities in the short text.
As an exemplary embodiment, the loss is back-propagated, an iteration loop is entered, the parameter weights are updated, and training continues. Through continuous training, the optimal parameters of the model are obtained, finally achieving the purpose of judging the degree of entity correlation. The parameters are continuously updated so that the model's predictions fit the true values of the samples as closely as possible, reducing the error between predicted and true values and training an optimal model capable of predicting entity correlation.
Step S12: and acquiring the short text to be recognized.
As an exemplary embodiment, a short text to be recognized is acquired according to a recognition task.
Step S13: and inputting the short text to be recognized into the optimal model to obtain the recognition result of the text to be recognized.
As an exemplary embodiment, the short text to be recognized is input into the optimal model, and the recognition result output by the optimal model is obtained, wherein the recognition result comprises 0 and 1, 0 represents that the entity is weakly related to the short text, and 1 represents that the entity is strongly related to the short text.
In the above steps, multi-dimensional information fusion is adopted: the first dimension is a word vector containing entity context semantic information, the second dimension is a position vector containing entity position coding information, and the third dimension is a pre-coding vector containing entity pre-coding information. The three vectors are fused to obtain the fused word vector corresponding to each word in the training text, so that not only is the relevance between the entity and its context considered from the local context, but the relevance between the context information carried by the entity and the text is also considered from the larger context environment; the entity features are enriched and expressed from multiple information dimensions, which improves the confidence of the entity-text relevance calculation. Then, the fused word vectors corresponding to the words of each training text are input into the TD_LSTM model to obtain the forward vector code and the backward vector code corresponding to each training text; the forward and backward vector codes are spliced to obtain an entity splicing vector; the entity splicing vector is passed through a feedforward neural network to obtain the classification coding vector corresponding to each training text; the classification coding vector is normalized to obtain a normalized classification coding vector; a loss function is obtained from the normalized classification coding vector; and iterative optimization training is performed according to the loss function to obtain the optimal model. Finally, the short text to be recognized is input into the optimal model to obtain the recognition result of the text to be recognized, determining the degree of correlation between the entities in the short text to be recognized and the short text itself. Through the above steps, accurate identification of short text entity correlation is achieved.
As an exemplary embodiment, after the step S6 of inputting the fused word vectors of each training text into the TD_LSTM network to obtain the forward vector code and the backward vector code corresponding to each training text, the method further includes steps S14-S19.
Step S14: and respectively inputting the fused word vector of each training text into an entity word coding vector corresponding to an entity word in each training text in the TD _ LSTM network, a coding vector of a word with a first preset number on the left side of the entity word and a coding vector of a word with a second preset number on the right side of the entity word.
In this embodiment, the first preset number is set to 3, and the second preset number is set to 3; of course, in other embodiments, specific values of the first preset number and the second preset number may also be reasonably set according to needs, and this is only schematically described in this embodiment, and is not limited thereto.
Specifically, the LSTM_L network and the LSTM_R network can also output the coding vector of each word. In this embodiment, the entity word coding vector w_t output by the LSTM networks and the coding vectors of the words within a three-word window on each side of the entity word are used; the coding vectors of the three words on the left are denoted w_t(l-1), w_t(l-2) and w_t(l-3), and the coding vectors of the three words on the right are denoted w_t(r+1), w_t(r+2) and w_t(r+3).
Step S15: and carrying out weighted average on the entity word encoding vector, the encoding vector of the word with the first preset number on the left side of the entity word and the encoding vector of the word with the second preset number on the right side of the entity word to obtain a new encoding vector corresponding to the entity word.
As an exemplary embodiment, the new coding vector corresponding to the entity word is calculated as follows:
w_new = (w_t + w_t(l-1) + w_t(l-2) + … + w_t(l-s) + w_t(r+1) + w_t(r+2) + … + w_t(r+v)) / (s + v + 1)
where s is the first preset number of words to the left of the entity word and v is the second preset number of words to the right of the entity word; w_t is the coding vector of the entity word in the training text, w_t(l-1) is the coding vector of the first word to the left of the entity word, w_t(l-2) is the coding vector of the second word to the left, w_t(l-s) is the coding vector of the s-th word to the left, w_t(r+1) is the coding vector of the first word to the right, w_t(r+2) is the coding vector of the second word to the right, and w_t(r+v) is the coding vector of the v-th word to the right of the entity word in the training text.
Specifically, the entity word coding vector, the coding vectors of the 3 words to its left, and the coding vectors of the 3 words to its right are weighted-averaged, merging the context information within the three-word windows into the entity word coding vector to obtain the new coding vector corresponding to the entity word.
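A minimal sketch of this averaging step, with the first and second preset numbers both set to 3 as in this embodiment (NumPy; names are illustrative and the equal-weight average is an assumption):

```python
import numpy as np

def new_entity_vector(w_t, left_vectors, right_vectors):
    """Average the entity word's coding vector with the coding vectors of its left and right context words."""
    all_vectors = [w_t] + list(left_vectors) + list(right_vectors)   # s + v + 1 vectors in total
    return np.mean(all_vectors, axis=0)                              # w_new
```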
Step S16: and carrying out similarity comparison on the new coding vector corresponding to the entity word and the precoding vector corresponding to the entity word in the entity precoding matrix to obtain a similarity calculation value.
As an exemplary embodiment, the similarity comparison uses the cosine function to compute cosine similarity. The cosine function measures the angle between two vectors; compared with distance metrics, it focuses on the difference in direction between the two vectors rather than their distance or length. The cosine function is not affected by scale, its value falls in the interval [-1, 1], and the larger the value, the smaller the difference.
In this embodiment, the calculation formula of the similarity calculation value is as follows:
Threshold=cosine(w_new,w_d)
where Threshold is the similarity value, w_new is the new coding vector corresponding to the entity word, and w_d is the pre-coding vector corresponding to the entity word.
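The cosine similarity itself can be computed directly, for example (NumPy):

```python
import numpy as np

def cosine(w_new, w_d):
    """Cosine similarity between the new coding vector and the pre-coding vector."""
    return float(np.dot(w_new, w_d) / (np.linalg.norm(w_new) * np.linalg.norm(w_d)))
```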
Step S17: and judging whether the calculated similarity value is larger than a preset similarity threshold value or not. If the calculated similarity value is greater than the preset similarity threshold, performing step S18; if the calculated similarity value is less than or equal to the predetermined similarity threshold, step S19 is executed.
As an exemplary embodiment, the preset similarity threshold may be 0.7; of course, in other embodiments, the specific value of the preset similarity threshold may also be other values, such as 0.8 or 0.6, which is only schematically illustrated in this embodiment, but not limited thereto, and may be reasonably set in practical applications as needed.
Step S18: and if the similarity calculation value is larger than the preset similarity threshold value, the entity pre-coding matrix is not updated.
As an exemplary embodiment, when the similarity value is greater than the preset similarity threshold, the similarity between the new coding vector corresponding to the entity word in the short text and the pre-coding vector corresponding to the entity word in the entity pre-coding matrix is high, indicating that the context information carried by the short text has previously been merged into the entity coding vector; therefore, no operation is required and the entity pre-coding matrix does not need to be updated.
Step S19: and if the similarity calculation value is smaller than or equal to the preset similarity threshold value, performing weighted average on a new coding vector corresponding to the entity word and a precoding vector corresponding to the entity word in the entity precoding matrix to obtain an updated precoding vector, and updating the updated precoding vector to the entity precoding matrix.
As an exemplary embodiment, when the calculated similarity value is less than or equal to the preset similarity threshold, it indicates that the similarity between the new coding vector corresponding to the entity word in the short text and the precoding vector corresponding to the entity word in the entity precoding matrix is low, which indicates that the context information carried by the short text is relatively new information, and the new context information needs to be merged into the original entity coding vector, so that the new coding vector corresponding to the entity word and the precoding vector corresponding to the entity word in the entity precoding matrix are weighted and averaged to obtain an updated precoding vector, and the updated precoding vector is updated to the entity precoding matrix.
The updated pre-coding vector is calculated as follows:
w_z = (w_new + w_d) / 2
where w_z is the updated pre-coding vector of the entity word, w_new is the new coding vector corresponding to the entity word, and w_d is the original pre-coding vector corresponding to the entity word.
The new coding vector w_new and the original pre-coding vector w_d are weighted-averaged to obtain the final pre-coding vector w_z, and w_z is updated into the entity pre-coding matrix to replace the original w_d. In this way, new context information is continuously merged into the entity and the entity pre-coding vector is updated, increasing the context coverage of the entity pre-coding vector and thereby improving the accuracy of entity correlation identification.
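Putting the comparison and the update together, a sketch of this update rule (using the cosine helper above, a threshold of 0.7 as suggested in this embodiment, and an assumed equal-weight average) might look like this:

```python
def maybe_update_precoding(entity, w_new, entity_precoding, threshold=0.7):
    """Fold new context into the entity pre-coding matrix only when the new vector diverges from it."""
    w_d = entity_precoding[entity]
    if cosine(w_new, w_d) > threshold:
        return                              # context already covered; leave the matrix unchanged
    w_z = (w_new + w_d) / 2                 # weighted average of the new and original vectors
    entity_precoding[entity] = w_z          # replace the original pre-coding vector with w_z
```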
Specifically, as shown in FIG. 2, the function of the Entity models module in FIG. 2 is to generate new entity vectors. The new entity vector is obtained by a weighted average of the entity vector itself and the word vectors of the preset number of words on its left and right sides; the generated new entity vector, denoted e2', is the new coding vector.
In this step, the entity word coding vector output by the LSTM network and the coding vectors of the preset number of words on the left and right of the entity word are taken and weighted-averaged, merging the surrounding context information into the entity vector coding to obtain a new coding vector. The similarity between the new coding vector and the original pre-coding vector in the entity pre-coding matrix is then calculated. If the similarity value is greater than the preset similarity threshold, the context information carried by the short text has already been merged into the entity pre-coding vector and no operation is needed; if the similarity value is less than or equal to the similarity threshold, the context information carried by the short text is relatively new, so the new context information needs to be merged into the original entity pre-coding vector: the new coding vector and the original pre-coding vector are weighted-averaged to obtain the final entity pre-coding vector, which is updated into the entity pre-coding matrix and replaces the original entity pre-coding vector. By continuously merging new context information into the entity, the entity pre-coding vector is updated and the discrimination of entity correlation is improved.
As an exemplary embodiment, before the short text to be recognized is input into the optimal model in step S13, the method further includes steps S20-S22.
Step S20: inputting the short text to be recognized into the TextRCNN model for binary text classification to obtain a classification result.
As an exemplary embodiment, TextRCNN first uses a bidirectional RNN to obtain the contextual semantic and syntactic information of the short text to be recognized, then automatically selects the most important features with max pooling, and finally classifies with a fully connected layer to obtain the classification result.
TextRCNN combines the advantages of RNN and CNN: the bidirectional recurrent structure captures context information while introducing less noise than a traditional window-based neural network, and it preserves word order over a wide range when learning the text representation. The max pooling layer then extracts the important parts of the text, automatically judging which features play a larger role in denoising the information.
Specifically, TextRCNN combines the left context of a word, its right context and the word itself into the word representation, and uses bidirectional RNNs to extract the contextual information of the sentence. After the convolutional layer, the representations of all words are obtained and a max pooling operation is applied first; max pooling helps find the most important latent semantic information in the sentence. The text representation is then obtained through a fully connected layer, and classification is finally performed by a softmax layer.
Specifically, the classification result has two classes, noise text and non-noise text, where a noise text is a short text containing no entity and a non-noise text is a short text containing an entity.
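The following is a compact PyTorch sketch of a TextRCNN-style binary noise filter of the kind described above. It is an illustrative approximation, not the patent's implementation: the embedding and hidden sizes, the use of a bidirectional LSTM for the recurrent structure, and the two-class output head are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class TextRCNN(nn.Module):
    """TextRCNN-style binary classifier: bidirectional RNN for context,
    max pooling to pick the most salient features, fully connected output."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Combine [left context; word; right context] into a latent representation.
        self.proj = nn.Linear(2 * hidden_dim + embed_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):                       # (batch, seq_len)
        emb = self.embedding(token_ids)                  # (batch, seq_len, embed_dim)
        ctx, _ = self.rnn(emb)                           # (batch, seq_len, 2*hidden_dim)
        rep = torch.tanh(self.proj(torch.cat([ctx, emb], dim=-1)))
        pooled = rep.max(dim=1).values                   # max pooling over the sequence
        return self.fc(pooled)                           # logits: noise vs. non-noise
```

Texts whose predicted class is the noise class (containing no entity) would then be discarded before the relevance model, as in steps S21 and S22 below.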
Step S21: if the classification result of the short text to be recognized is the noise text, remove the text to be recognized.
Specifically, since this embodiment identifies the correlation of entities in short texts, only short texts containing entities need to be recognized, and short texts containing no entity do not. Therefore, when the classification result of the text to be recognized is the noise text, the text is removed, i.e. the noise text is not passed on to the subsequent optimal model.
Step S22: if the classification result of the short text to be recognized is the non-noise text, retain the text to be recognized.
Specifically, when the classification result of the short text to be recognized is a non-noise text, that is, the short text to be recognized includes one or more entities, the text to be recognized needs to be retained, and then the text to be recognized is input into the optimal model to recognize the entity correlation.
Through these steps, texts that contain no entity are removed from the short texts to be recognized, a large amount of noise information is filtered out, and only texts containing entities are retained, which improves the accuracy of entity relevance recognition.
The technical solution in this embodiment is based on Target-Based Correlation Analysis (TBCA). Its main function is to identify the degree of correlation of the entities appearing in a short text: when one or more entities appear in the text, it can identify whether each entity is a primary entity or a secondary entity, that is, whether the entity is strongly or weakly correlated with the text (a primary entity is strongly correlated, a secondary entity weakly correlated).
For example, three entities A1, A2 and A3 appear in one text, and all of them are strongly related to it. The optimal model outputs 1 for a strong correlation and 0 for a weak correlation, so the final output is A1-1, A2-1 and A3-1.
Four entities, B1, B2, B3 and B4, appear in another text. B1 is strongly related to the text, while B2, B3 and B4 are weakly related. The final output is B1-1, B2-0, B3-0 and B4-0.
The method is built on a deep learning framework and uses a fusion scheme to identify the relevance of entities in short texts from the perspective of the information carried in multiple dimensions. Because an entity (for example an organization entity) carries a large amount of information, an entity precoding matrix is added to the network design, and entity information precoding is completed during training by continuously folding new context information into the entity vectors. At prediction time, the word features and position features of the short text are encoded into entity vectors by the deep learning network, the entity precoding vectors carrying context information are retrieved from the entity precoding matrix, and both are fed into a feed-forward neural network to compute the degree of correlation of each entity with the short text. The scheme exploits the strong encoding capability of deep learning models and incorporates rich semantic, syntactic and contextual information: not only lexical and syntactic information, but also the contextual environment of the entity. Drawing on several information dimensions enriches the entity feature representation and raises the confidence of the entity-text relevance computation; the relevance of the entity to its local context is considered, as is the relevance between the context information carried by the entity and the wider context.
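The prediction flow just described can be sketched roughly as follows. This is a schematic PyTorch rendering under assumed dimensions: the TD_LSTM encoder is approximated here by a standard bidirectional LSTM over the fused word vectors, and the way the entity position's forward and backward states are selected and spliced is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class EntityRelevanceModel(nn.Module):
    """Schematic relevance scorer: fused word vectors -> bidirectional LSTM ->
    splice the entity position's forward/backward states -> feed-forward -> softmax."""
    def __init__(self, fused_dim=256, hidden_dim=128):
        super().__init__()
        self.encoder = nn.LSTM(fused_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.ffn = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 2),   # strong (1) vs. weak (0) correlation
        )

    def forward(self, fused_vectors, entity_index):
        # fused_vectors: (batch, seq_len, fused_dim) = word + position + precoding vectors
        states, _ = self.encoder(fused_vectors)          # (batch, seq_len, 2*hidden_dim)
        # Splice the forward and backward encodings at the entity position.
        idx = entity_index.view(-1, 1, 1).expand(-1, 1, states.size(-1))
        entity_state = states.gather(1, idx).squeeze(1)  # (batch, 2*hidden_dim)
        logits = self.ffn(entity_state)
        return torch.softmax(logits, dim=-1)             # normalized classification vector
```

A softmax output close to [0, 1] for an entity would correspond to the strong-correlation label 1 in the examples above, and an output close to [1, 0] to the weak-correlation label 0.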
The method is of practical value for entity monitoring and information retrieval in the business intelligence domain: when one or more entities appear in a text, it can tell the information consumer which entities the text mainly describes and which are secondary, so that the most relevant information can be returned. This greatly improves the working efficiency of public opinion analysts.
The present embodiment further provides a system for identifying a short text entity correlation, where the system is used to implement the foregoing embodiments and preferred embodiments, and details of the description already made are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the system described in the embodiments below is preferably implemented in software, implementations in hardware, or a combination of software and hardware are also possible and contemplated.
The embodiment further provides a system for identifying short text entity relevance, as shown in fig. 3, including:
a first obtaining module 1, configured to obtain a short text training set, where the short text training set comprises a plurality of training texts and a named entity corresponding to each training text;
a second obtaining module 2, configured to obtain an entity pre-coding matrix, where the entity pre-coding matrix includes pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to context information pre-coded by the entity;
the first processing module 3 is used for converting each word in each training text into a word vector and converting the position of each word in the training text into a position vector;
the second processing module 4 is configured to obtain a pre-coding vector corresponding to each word according to each word and the entity pre-coding matrix in each training text, where the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector;
the third processing module 5 is configured to obtain a fused word vector corresponding to each word in each training text according to the word vector, the position vector, and the pre-coding vector corresponding to each word in each training text;
the fourth processing module 6 is configured to input the fused word vector of each training text into the TD_LSTM network to obtain a forward vector code and a backward vector code corresponding to each training text;
the fifth processing module 7 is configured to splice the forward vector code and the backward vector code corresponding to each training text, respectively, to obtain an entity splicing vector corresponding to each training text;
the sixth processing module 8 is configured to pass the entity splicing vector corresponding to each training text through a feed-forward neural network to obtain a classification coding vector corresponding to each training text;
a seventh processing module 9, configured to normalize the classified coding vector to obtain a normalized classified coding vector;
an eighth processing module 10, configured to obtain a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector;
a ninth processing module 11, configured to perform iterative optimization training according to a loss function to obtain an optimal model, where the optimal model is used to identify the short text and the correlation strength of entities in the short text;
a third obtaining module 12, configured to obtain a short text to be recognized;
and the tenth processing module 13 is configured to input the short text to be recognized to the optimal model, so as to obtain a recognition result of the text to be recognized.
As an exemplary embodiment, the calculation formula of the fused word vector is as follows:
w_a=w_ta+w_pa+w_da
wherein w_a is the fused word vector corresponding to word a; w_ta is the word vector corresponding to word a; w_pa is the position vector corresponding to word a, and w_da is the precoding vector corresponding to word a.
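As a small illustration of this fusion, the three vectors are simply summed element-wise; the dimension below and the use of NumPy are assumptions made for the sketch, and a zero precoding vector stands for a non-entity word as described earlier.

```python
import numpy as np

embed_dim = 256                      # assumed common dimension of all three vectors
w_ta = np.random.randn(embed_dim)    # word vector of word a
w_pa = np.random.randn(embed_dim)    # position vector of word a
w_da = np.zeros(embed_dim)           # precoding vector (zero vector for a non-entity word)

w_a = w_ta + w_pa + w_da             # fused word vector fed into the TD_LSTM network
```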
As an exemplary embodiment, the system further comprises: an eleventh processing module, configured to obtain, from the TD_LSTM network into which the fused word vector of each training text is respectively input, the entity word encoding vector corresponding to the entity word in each training text, the encoding vectors of a first preset number of words to the left of the entity word and the encoding vectors of a second preset number of words to the right of the entity word; a twelfth processing module, configured to perform a weighted average of the entity word encoding vector, the encoding vectors of the first preset number of words to the left of the entity word and the encoding vectors of the second preset number of words to the right of the entity word, to obtain a new encoding vector corresponding to the entity word; a thirteenth processing module, configured to compare the similarity between the new encoding vector corresponding to the entity word and the precoding vector corresponding to the entity word in the entity precoding matrix, to obtain a calculated similarity value; a judging module, configured to judge whether the calculated similarity value is greater than a preset similarity threshold; a fourteenth processing module, configured to leave the entity precoding matrix unchanged if the calculated similarity value is greater than the preset similarity threshold; and a fifteenth processing module, configured to, if the calculated similarity value is less than or equal to the preset similarity threshold, perform a weighted average of the new encoding vector corresponding to the entity word and the precoding vector corresponding to the entity word in the entity precoding matrix to obtain an updated precoding vector, and update the updated precoding vector into the entity precoding matrix.
As an exemplary embodiment, the calculation formula of the similarity calculation value is as follows:
Threshold=cosine(w_new,w_d)
wherein Threshold is the calculated similarity value, w_new is the new encoding vector corresponding to the entity word, and w_d is the precoding vector corresponding to the entity word;
the calculation formula of the new encoding vector corresponding to the entity word is as follows:
w_new = (w_t + w_t_{l-1} + w_t_{l-2} + ... + w_t_{l-s} + w_t_{r+1} + w_t_{r+2} + ... + w_t_{r+v}) / (s + v + 1)
wherein s is the first preset number on the left side of the entity word, and v is the second preset number on the right side of the entity word; w_t is the entity word encoding vector corresponding to the entity word in the training text, w_t_{l-1} is the encoding vector of the first word to the left of the entity word, w_t_{l-2} is the encoding vector of the second word to the left of the entity word, w_t_{l-s} is the encoding vector of the s-th word to the left of the entity word, w_t_{r+1} is the encoding vector of the first word to the right of the entity word, w_t_{r+2} is the encoding vector of the second word to the right of the entity word, and w_t_{r+v} is the encoding vector of the v-th word to the right of the entity word.
As an exemplary embodiment, the calculation formula of the updated precoding vector is as follows:
w_z = (w_new + w_d) / 2
wherein w_z is the updated precoding vector of the entity word; w_new is the new encoding vector corresponding to the entity word, and w_d is the precoding vector corresponding to the entity word.
As an exemplary embodiment, the system further comprises: a sixteenth processing module, configured to input the short text to be recognized into the TextRCNN model for binary text classification to obtain a classification result; a seventeenth processing module, configured to remove the text to be recognized if the classification result of the short text to be recognized is a noise text; and an eighteenth processing module, configured to retain the text to be recognized if the classification result of the short text to be recognized is a non-noise text.
As an exemplary embodiment, the calculation formula of the loss function is as follows:
loss = -(1/n) * Σ_{i=1}^{n} y_i · log(a_i)
where n is the number of training samples, y_i is the actual classification vector of the i-th training sample, and a_i is the model-encoded classification vector of the i-th training sample.
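A minimal sketch of this loss computation, assuming the cross-entropy form given above and one-hot actual classification vectors, could look like the following; the example batch values are purely illustrative.

```python
import numpy as np

def classification_loss(y_true, y_pred, eps=1e-9):
    """Average cross-entropy between the actual classification vectors y_true
    and the model's normalized (softmax) classification vectors y_pred.
    Both arrays have shape (n_samples, n_classes)."""
    n = y_true.shape[0]
    return float(-np.sum(y_true * np.log(y_pred + eps)) / n)

# Illustrative batch: 1 = strongly correlated, 0 = weakly correlated (one-hot labels).
y_true = np.array([[0.0, 1.0], [1.0, 0.0]])
y_pred = np.array([[0.2, 0.8], [0.7, 0.3]])
print(classification_loss(y_true, y_pred))
```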
The short text entity relevance identification system in this embodiment is presented in the form of functional units, where a unit refers to an ASIC circuit, a processor and memory executing one or more software or fixed programs, and/or other devices that may provide the above-described functionality.
Further functional descriptions of the modules are the same as those of the corresponding embodiments, and are not repeated herein.
An embodiment of the present invention further provides an electronic device, as shown in fig. 4, the electronic device includes one or more processors 71 and a memory 72, where one processor 71 is taken as an example in fig. 4.
The controller may further include: an input device 73 and an output device 74.
The processor 71, the memory 72, the input device 73 and the output device 74 may be connected by a bus or other means, as exemplified by the bus connection in fig. 4.
The processor 71 may be a Central Processing Unit (CPU). The Processor 71 may also be other general purpose Processor, Digital Signal Processor (DSP), Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component, or any combination thereof. A general purpose processor may be a microprocessor or any conventional processor or the like.
The memory 72, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the short text entity relevance identification method in the embodiments of the present application. The processor 71 executes various functional applications of the server and data processing by running non-transitory software programs, instructions and modules stored in the memory 72, namely, implements the short text entity relevance identification method of the above method embodiment.
The memory 72 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of a processing device operated by the server, and the like. Further, the memory 72 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 72 may optionally include memory located remotely from the processor 71, which may be connected to a network connection device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The input device 73 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the processing device of the server. The output device 74 may include a display device such as a display screen.
One or more modules are stored in the memory 72 and, when executed by the one or more processors 71, perform the methods shown in fig. 1-2.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by instructing relevant hardware through a computer program, and the executed program may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the short text entity relevance identifying method described above. The storage medium may be a magnetic Disk, an optical Disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a Flash Memory (Flash Memory), a Hard Disk (Hard Disk Drive, abbreviated as HDD) or a Solid State Drive (SSD), etc.; the storage medium may also comprise a combination of memories of the kind described above.
Although the embodiments of the present invention have been described in conjunction with the accompanying drawings, those skilled in the art may make various modifications and variations without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope defined by the appended claims.

Claims (10)

1. A method for identifying short text entity correlation is characterized by comprising the following steps:
acquiring a short text training set, wherein the short text training set comprises a plurality of training texts and a named entity corresponding to each training text;
acquiring an entity pre-coding matrix, wherein the entity pre-coding matrix comprises pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to context information pre-coded by the entities;
converting each word in each training text into a word vector, and converting the position of each word in the training text into a position vector;
obtaining a pre-coding vector corresponding to each word according to each word and the entity pre-coding matrix in each training text, wherein the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector;
obtaining a fused word vector corresponding to each word in each training text according to the word vector, the position vector and the pre-coding vector corresponding to each word in each training text;
respectively inputting the fused word vector of each training text into a TD_LSTM network to obtain a forward vector code and a backward vector code corresponding to each training text;
respectively splicing the forward vector code and the backward vector code corresponding to each training text to obtain an entity splicing vector corresponding to each training text;
passing the entity splicing vector corresponding to each training text through a feed-forward neural network to obtain a classified coding vector corresponding to each training text;
normalizing the classified coding vectors to obtain normalized classified coding vectors;
obtaining a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector;
performing iterative optimization training according to the loss function to obtain an optimal model, wherein the optimal model is used for identifying the short text and the correlation strength of entities in the short text;
acquiring a short text to be recognized;
and inputting the short text to be recognized into an optimal model to obtain a recognition result of the text to be recognized.
2. The short text entity relevance recognition method of claim 1,
the calculation formula of the fused word vector is as follows:
w_a=w_ta+w_pa+w_da
wherein w_a is the fused word vector corresponding to the word a; w_ta is the word vector corresponding to the word a; w_pa is the position vector corresponding to the word a, and w_da is the precoding vector corresponding to the word a.
3. The method for identifying short text entity relevance of claim 1, wherein after the step of inputting the fused word vector of each training text into the TD_LSTM network to obtain the forward vector code and the backward vector code corresponding to each training text, the method further comprises:
obtaining, from the TD_LSTM network into which the fused word vector of each training text is respectively input, an entity word coding vector corresponding to the entity word in each training text, coding vectors of a first preset number of words to the left of the entity word, and coding vectors of a second preset number of words to the right of the entity word;
carrying out a weighted average of the entity word coding vector, the coding vectors of the first preset number of words to the left of the entity word and the coding vectors of the second preset number of words to the right of the entity word to obtain a new coding vector corresponding to the entity word;
carrying out similarity comparison on a new coding vector corresponding to the entity word and a precoding vector corresponding to the entity word in the entity precoding matrix to obtain a similarity calculation value;
judging whether the calculated similarity value is larger than a preset similarity threshold value or not;
if the similarity calculation value is larger than a preset similarity threshold value, the entity pre-coding matrix is not updated;
and if the similarity calculation value is smaller than or equal to the preset similarity threshold value, performing weighted average on a new coding vector corresponding to the entity word and a precoding vector corresponding to the entity word in the entity precoding matrix to obtain an updated precoding vector, and updating the updated precoding vector to the entity precoding matrix.
4. The short text entity relevance recognition method of claim 3,
the calculation formula of the calculated similarity value is as follows:
Threshold=cosine(w_new,w_d)
wherein Threshold is the calculated similarity value, w_new is the new coding vector corresponding to the entity word, and w_d is the precoding vector corresponding to the entity word;
the calculation formula of the new coding vector corresponding to the entity word is as follows:
w_new = (w_t + w_t_{l-1} + w_t_{l-2} + ... + w_t_{l-s} + w_t_{r+1} + w_t_{r+2} + ... + w_t_{r+v}) / (s + v + 1)
wherein s is the first preset number on the left side of the entity word, and v is the second preset number on the right side of the entity word; w_t is the entity word coding vector corresponding to the entity word in the training text, w_t_{l-1} is the coding vector of the first word to the left of the entity word in the training text, w_t_{l-2} is the coding vector of the second word to the left of the entity word, w_t_{l-s} is the coding vector of the s-th word to the left of the entity word, w_t_{r+1} is the coding vector of the first word to the right of the entity word, w_t_{r+2} is the coding vector of the second word to the right of the entity word, and w_t_{r+v} is the coding vector of the v-th word to the right of the entity word.
5. The short text entity relevance recognition method of claim 3,
the calculation formula of the updated precoding vector is as follows:
w_z = (w_new + w_d) / 2
wherein w_z is the updated precoding vector of the entity word; w_new is the new coding vector corresponding to the entity word, and w_d is the precoding vector corresponding to the entity word.
6. The short text entity relevance identification method according to claim 1, wherein the step of inputting the short text to be identified into an optimal model is preceded by the further steps of:
inputting the short text to be recognized into a TextRCNN model to perform binary text classification to obtain a classification result;
if the classification result of the short text to be recognized is a noise text, removing the text to be recognized;
and if the classification result of the short text to be recognized is a non-noise text, reserving the text to be recognized.
7. The short text entity relevance recognition method of any one of claims 1-6,
the formula for the calculation of the loss function is as follows:
loss = -(1/n) * Σ_{i=1}^{n} y_i · log(a_i)
where n is the number of training samples, y_i is the actual classification vector of the i-th training sample, and a_i is the model-encoded classification vector of the i-th training sample.
8. A short text entity relevance recognition system, comprising:
the short text training set comprises a plurality of training texts and named entities corresponding to the training texts;
a second obtaining module, configured to obtain an entity pre-coding matrix, where the entity pre-coding matrix includes pre-coding vectors corresponding to all entities, and the entity pre-coding is obtained according to context information pre-coded by the entity;
the first processing module is used for respectively converting each word in each training text into a word vector and converting the position of each word in the training text into a position vector;
the second processing module is used for obtaining a pre-coding vector corresponding to each word according to each word and the entity pre-coding matrix in each training text, wherein the pre-coding vector of the word corresponding to the named entity in each training text is an entity pre-coding vector, and the pre-coding vector of the word corresponding to the non-named entity in each training text is a zero vector;
the third processing module is used for obtaining a fused word vector corresponding to each word in each training text according to the word vector, the position vector and the pre-coding vector corresponding to each word in each training text;
the fourth processing module is used for respectively inputting the fused word vector of each training text into the TD_LSTM network to obtain the forward vector code and the backward vector code corresponding to each training text;
the fifth processing module is used for splicing the forward vector code and the backward vector code corresponding to each training text to obtain an entity splicing vector corresponding to each training text;
the sixth processing module is used for enabling the entity splicing vector corresponding to each training text to pass through a feedforward neural network to obtain a classified coding vector corresponding to each training text;
the seventh processing module is used for normalizing the classified coding vectors to obtain normalized classified coding vectors;
the eighth processing module is used for obtaining a loss function between the classification vector of the model code and the actual entity classification vector according to the normalized classification coding vector;
the ninth processing module is used for performing iterative optimization training according to the loss function to obtain an optimal model, and the optimal model is used for identifying the short text and the correlation strength of the entities in the short text;
the third acquisition module is used for acquiring short texts to be identified;
and the tenth processing module is used for inputting the short text to be recognized into the optimal model to obtain the recognition result of the text to be recognized.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores a computer program executable by the at least one processor, the computer program being executable by the at least one processor to cause the at least one processor to perform the short text entity relevance identifying method of any one of claims 1-7.
10. A computer-readable storage medium storing computer instructions for causing a computer to perform the short text entity relevance identification method of any one of claims 1-7.
CN202110439445.XA 2021-04-23 2021-04-23 Short text entity correlation identification method, system, electronic equipment and storage medium Pending CN114444496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110439445.XA CN114444496A (en) 2021-04-23 2021-04-23 Short text entity correlation identification method, system, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110439445.XA CN114444496A (en) 2021-04-23 2021-04-23 Short text entity correlation identification method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114444496A true CN114444496A (en) 2022-05-06

Family

ID=81362322

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110439445.XA Pending CN114444496A (en) 2021-04-23 2021-04-23 Short text entity correlation identification method, system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114444496A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116384515A (en) * 2023-06-06 2023-07-04 之江实验室 Model training method and device, storage medium and electronic equipment
CN116384515B (en) * 2023-06-06 2023-09-01 之江实验室 Model training method and device, storage medium and electronic equipment
CN117874611A (en) * 2023-12-29 2024-04-12 汉王科技股份有限公司 Text classification method, device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112560912B (en) Classification model training method and device, electronic equipment and storage medium
CN108804512B (en) Text classification model generation device and method and computer readable storage medium
CN108363790B (en) Method, device, equipment and storage medium for evaluating comments
TW202020691A (en) Feature word determination method and device and server
WO2021056710A1 (en) Multi-round question-and-answer identification method, device, computer apparatus, and storage medium
CN111125317A (en) Model training, classification, system, device and medium for conversational text classification
KR102456535B1 (en) Medical fact verification method and apparatus, electronic device, and storage medium and program
CN111539209B (en) Method and apparatus for entity classification
CN114444496A (en) Short text entity correlation identification method, system, electronic equipment and storage medium
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN113177412A (en) Named entity identification method and system based on bert, electronic equipment and storage medium
US20200364216A1 (en) Method, apparatus and storage medium for updating model parameter
CN111382248A (en) Question reply method and device, storage medium and terminal equipment
CN114841164A (en) Entity linking method, device, equipment and storage medium
CN113569559B (en) Short text entity emotion analysis method, system, electronic equipment and storage medium
CN113626608B (en) Semantic-enhancement relationship extraction method and device, computer equipment and storage medium
CN114202443A (en) Policy classification method, device, equipment and storage medium
CN112989829B (en) Named entity recognition method, device, equipment and storage medium
CN117407507A (en) Event processing method, device, equipment and medium based on large language model
CN116432705A (en) Text generation model construction method, text generation device, equipment and medium
US10296585B2 (en) Assisted free form decision definition using rules vocabulary
WO2023040153A1 (en) Method, apparatus, and device for updating intent recognition model, and readable medium
CN112541557B (en) Training method and device for generating countermeasure network and electronic equipment
CN114491030A (en) Skill label extraction and candidate phrase classification model training method and device
CN110309285B (en) Automatic question answering method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination