CN111950264A - Text data enhancement method and knowledge element extraction method - Google Patents

Text data enhancement method and knowledge element extraction method

Info

Publication number
CN111950264A
Authority
CN
China
Prior art keywords
words
word
entity
similarity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010777706.4A
Other languages
Chinese (zh)
Other versions
CN111950264B (en)
Inventor
程良伦
牛伟才
王德培
张伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202010777706.4A
Publication of CN111950264A
Application granted
Publication of CN111950264B
Legal status: Active
Anticipated expiration


Classifications

    • G06F40/295 Named entity recognition (G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F16/3344 Query execution using natural language analysis (G06F16/00 Information retrieval; G06F16/30 Unstructured textual data; G06F16/33 Querying; G06F16/3331 Query processing; G06F16/334 Query execution)
    • G06F16/35 Clustering; Classification (of unstructured textual data)
    • G06F40/247 Thesauruses; Synonyms (G06F40/237 Lexical tools)
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks (G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text data enhancement method and a knowledge element extraction method. The text data enhancement method comprises a process of screening similar texts from a first supplementary database and a second supplementary database, where the first supplementary database is derived from a knowledge base in a field similar to that of a basic data set, and the second supplementary database is derived from synonyms of entity words in the basic data set. The data enhancement method can generate an efficient, large-scale supplement to basic data with few sources, and a knowledge element extraction model trained on a data set enhanced by this method has higher generalization capability and extraction accuracy.

Description

Text data enhancement method and knowledge element extraction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a knowledge element extraction technology.
Background
With the rapid development of internet technology, constructing an industrial domain knowledge base enables better domain-specific intelligent question answering and intelligent decision making and promotes the intelligentization of industrial manufacturing. A large amount of electronic text information is generated in the industrial production process, dispersed across workers' maintenance diagnosis tables, internet communities, and factory databases. If this unstructured and semi-structured electronic text information can be organized into a knowledge base with high knowledge density, the utilization rate of domain knowledge can be greatly improved.
How to process text information quickly and efficiently is a major concern in natural language processing, and named entity recognition is the most critical step. Recognition of domain knowledge-element entities extracts important knowledge units from structured and semi-structured text data; these knowledge units are generally the most representative words in a specific domain. After entities are correctly identified, relationship extraction, event extraction, and knowledge base construction can follow. The quality of named entity recognition therefore directly affects subsequent information extraction tasks.
Existing named entity recognition methods fall roughly into three categories: methods based on rules and dictionaries, methods based on statistical machine learning, and methods based on deep learning. Rule- and dictionary-based methods require a large number of rules and dictionaries to be formulated, which demands huge manual labeling effort; they are limited by professional knowledge, since only experts in certain fields can formulate them, so recognition cost is high and efficiency is low. Methods based on statistical machine learning mainly include hidden Markov models, maximum entropy models, support vector machines, and conditional random field models; their recognition effect depends largely on the feature combinations selected for the model, such as the part-of-speech, position, and context features of words, and entity recognition requires large-scale training corpora. Entity recognition based on deep learning is currently the most mainstream method: pre-trained word vectors are used as the input of a neural network, the neural network layers extract semantics from the text, and the extracted sentence features pass through a global normalization function (softmax) layer or a conditional random field to predict the label of each word. Although the recognition effect of deep learning far exceeds that of statistical machine learning and rule-based methods, realizing its prediction and generalization capability requires enough high-quality labeled data as support; otherwise overfitting occurs and the expected recognition accuracy is hard to reach, and the industrial field often lacks sufficient labeled data sets for optimizing the parameters of the training model.
Disclosure of Invention
The invention aims to provide a text data enhancement method that can efficiently supplement basic data with few sources at scale, overcome the model accuracy problems caused by supplementary data being too close to the basic data, and markedly improve the generalization capability and extraction accuracy of a model.
The invention further aims to provide a method for accurately extracting knowledge elements based on the enhanced text data.
The invention firstly discloses the following technical scheme:
A method of text data enhancement, comprising a process of screening similar texts from a first supplementary database and a second supplementary database, where the first supplementary database is derived from a knowledge base in a domain close to that of a basic data set, and the second supplementary database is derived from synonyms of entity words in the basic data set.
In the scheme, the entity words refer to words representing entities.
The basic data set refers to a data set that contains certain text data and needs data enhancement, preferably one for which labeling has been completed.
Similar fields refer to fields that share the same or similar entity words in terms of products, functions, technical processes, and the like.
One example is the power grid domain and the electronics domain: the three-phase transformer of the power grid field appears, under the name of a toroidal transformer, in loudspeaker electronics products in the electronics field.
Another example, within inorganic non-metallic materials, is ceramic production and refractory materials: the mullite material required in the ceramic production process is known among refractory materials as kyanite, mullite, or sillimanite. The two fields share the same series of mullite-based reactions under different names.
Expanding with corpus information that contains entities from a close domain both increases the data volume for the entity words and improves the generalization capability of the model.
Such a near-field knowledge base may come from the internet, raw material recipes, workers' manuals, and the like.
It will be appreciated that the data in the first and second supplementary databases should be presented in the form of text.
In some embodiments, the first supplementary database is obtained by web page crawling of the entity words it contains, and the second supplementary database is obtained by web page crawling of synonyms of the entity words it contains.
The web pages in this embodiment are preferably knowledge-rich pages, such as Wikipedia.
In some embodiments, the similar text is determined by:
s51: and performing word segmentation and labeling on the short text from the first supplementary database and the short text from the second supplementary database, and calculating word vector cosine similarity between separated entity words, namely entity word similarity.
S52: and calculating the cosine similarity of word vectors between other words except the separated entity words, pairing the words with the same part of speech of which the similarity is greater than a threshold value into overlapped words, and calculating the weighted similarity of the overlapped words under the part of speech characteristics, namely the similarity of the overlapped words.
Preferably, the threshold value in S52 is set to 0.5, i.e. words with a similarity greater than 0.5 are overlapping words.
S53: and carrying out weighted average on the entity word similarity and the overlapping word similarity to obtain text similarity.
The texts in the first and second supplementary databases are iteratively evaluated for text similarity, and in each iteration the two texts with the maximum text similarity are the similar texts.
In some embodiments, the synonyms are obtained by synonym fission, which comprises: acquiring from a corpus the words whose word vectors are cosine-similar to those of the entity words in the basic data set, i.e., the synonyms of the entity words.
In some embodiments, the number of synonyms per fission is set to 1-4, preferably 3.
In some embodiments, the Word vector is obtained by a Word2Vec model transformation.
In some embodiments, the synonym fission is achieved by the Word2Vec model.
The Word2Vec model used may be trained by encyclopedia, Baidu, and/or microblog corpora.
Word vectors trained by such a model carry prior knowledge: synonyms are semantically similar and are therefore expressed as close cosine distances.
The invention further discloses a knowledge element extraction method, implemented by a trained extraction model, where the training of the model is based on a labeled data set enhanced by the above data enhancement method.
In some embodiments, the extraction model is a bidirectional long short-term memory (BiLSTM) network model.
In some embodiments, the extraction model includes an input layer, a word embedding layer, a bidirectional LSTM layer, and a normalized exponential function layer.
In some embodiments, the input layer represents each word in a sentence by its index in a vocabulary, which is obtained by traversing all data.
In addition, to enrich the representation of words, in some specific embodiments the word embedding layer uses pre-trained Chinese word vectors; the training corpora of the word vectors are preferably Chinese encyclopedia and microblog data, and the word vector dimension is preferably 300.
In addition, to enrich the representation of words, in some specific embodiments the character embeddings and word embeddings of each word are concatenated, where the characters specifically refer to the individual Chinese characters within the word.
Character embeddings are preferably 100-dimensional vectors that are randomly initialized and updated during training.
In some embodiments, the hidden layer dimension of the bidirectional LSTM layer is set to 256, and the forward and backward LSTM outputs are concatenated to obtain a 512-dimensional sentence representation.
In some embodiments, the bidirectional LSTM output at each time step is fed into the normalized exponential function (softmax) layer to obtain values between 0 and 1, and the label with the largest value is taken as the entity label of that position.
The method can effectively address the lack of sufficient structured knowledge bases in the industrial field. By expanding the training data set through text similarity, it becomes possible to borrow existing knowledge bases from similar industrial scenes and to supplement the basic data through synonyms. Screening and integrating data from the two sources not only enlarges the data set significantly, but also overcomes the low model generalization caused by the overly strong relevance of entities from a single source, markedly improving model accuracy.
Using this method, manually crawled marine-industry news texts were data-enhanced with a military-industry database as the near-field knowledge base; the data set was expanded from the original 1000 samples to 1300, and the entity recognition effect improved by 3%.
Drawings
Fig. 1 is a schematic flow chart of a data enhancement method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an extraction model used in embodiment 1 of the present invention.
Detailed Description
The present invention is described in detail below with reference to the following embodiments and the attached drawings, but it should be understood that the embodiments and the attached drawings are only used for the illustrative description of the present invention and do not limit the protection scope of the present invention in any way. All reasonable variations and combinations that fall within the spirit of the invention are intended to be within the scope of the invention.
The extraction of the knowledge elements is performed by the flow shown in Fig. 1.
Specifically, firstly, text data enhancement is performed, that is, data expansion is performed on the basis of an existing text data set.
The existing text data set, i.e., the basic data set, can be obtained by collecting electronic texts generated in the industrial production process, such as electronic texts scattered in workers' maintenance diagnosis tables, internet communities, factory databases, and the like.
The method operates on the entity words in the basic data set, so the samples in the basic data set need to be labeled with entity words; that is, a labeled data set is obtained first.
Based on the annotated dataset, data enhancement is performed by the following process:
S1: selecting an entity word library as one of the supplementary data sources
The entity word library may come from an existing knowledge base in an industrial field similar to that of the basic data set, and should contain multiple entity words under different entity types. For example, an existing knowledge base is selected that contains entity words 1-0, 1-1, …, 1-n under entity type 1; entity words 2-0, 2-1, …, 2-m under entity type 2; …; and entity words k-1, k-2, …, k-l under entity type k. This entity word library is the first supplementary data source.
S2: selecting entity words to be expanded in the labeled data set
The labeled data set is a set of short texts, where each non-entity word in a text can be labeled 0 and each entity word labeled Yn, with n denoting the entity type to which the entity word belongs.
The entity types in S1 and S2 may be determined or adjusted for different situations. For example, for general-scene entity recognition the entity types may be divided into time, location, person, organization, and so on; when the labeled data set has few samples, the entity types may be further fine-tuned according to the application field.
Entity words to be expanded are then selected from the labeled data set.
S3: converting the entity words to be expanded into word vectors
The conversion may be implemented by the Word2Vec model proposed by Google in 2013.
The model, pre-trained at large scale on massive data sets, can quickly and effectively express a word in vector form and supports word clustering and similar functions.
Word vectors obtained through the word embedding operation of the Word2Vec model can be understood as distributed, low-dimensional, dense real-valued vectors, where word vectors representing similar semantics have closer cosine distances; the similarity between words can thus be calculated by comparing their word vectors.
S3: performing synonym fission on the entity words to be expanded
Synonym fission can be achieved by direct computation with the Word2Vec model, generating several synonyms similar to each entity word and forming a synonym library, i.e., the second supplementary data source.
In synonym fission, the number of synonyms per fission should not be set too large; otherwise the relevance among entity words disappears, and the semantic relevance between the fission words and the original entity words is lost.
The Word2Vec model used here was jointly trained on 256 GB of encyclopedia, Baidu, and microblog corpora.
The similarity threshold can be set to 0.5: the cosine similarity of different word vectors is calculated through the Word2Vec model, and words with cosine similarity greater than 0.5 are considered similar words.
In each fission, the first 3 similar words with the highest similarity are preferably taken.
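As a minimal sketch of this fission step, the top-3 neighbours above the 0.5 threshold can be read off a pre-trained model with gensim; the model file path below is hypothetical:

```python
# Synonym "fission": take the top-3 cosine neighbours above 0.5, as described
# above. A sketch assuming a gensim-loadable Word2Vec file; the path is
# an assumption, not taken from the patent.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("word2vec_zh.bin", binary=True)

def synonym_fission(entity_word, top_n=3, threshold=0.5):
    """Return up to top_n synonyms whose cosine similarity exceeds threshold."""
    if entity_word not in wv:
        return []
    neighbours = wv.most_similar(entity_word, topn=20)   # cosine-ranked
    return [w for w, sim in neighbours if sim > threshold][:top_n]

# e.g. synonym_fission("变压器") might return words such as
# "三相变压器", "变压器线圈", "油浸式变压器"
```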
Words obtained through synonym fission have strong relevance to the original entity words (the entity words to be expanded); directly feeding them into the model for training as a supplementary data source would reduce the model's generalization.
Therefore, after the synonyms are obtained, the method further screens the candidate supplementary data generated from the different supplementary data sources, which yields a better data supplementation effect and markedly improves the generalization capability of the model. The specific process is as follows:
S4: obtaining candidate supplementary data
It may further comprise:
S41: select one entity word from the second supplementary data source and k entity words belonging to k entity types from the first supplementary data source, and crawl web page texts for each entity word independently. The web pages are preferably knowledge-rich pages such as Wikipedia. The crawled content is formatted as short texts, and the obtained short texts form the second supplementary database and the first supplementary database respectively, according to the source of the crawled entity word.
To reduce text noise and improve the recognition effect, the length of the crawled short texts can be fine-tuned according to the field to be trained.
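A sketch of the crawl in S41, assuming Wikipedia's REST summary endpoint as the source; both the endpoint choice and the snippet length are illustrative assumptions, not fixed by the method:

```python
# Crawl a short text for each entity word (S41). The endpoint and the
# truncation length are assumptions for illustration.
import requests

def crawl_short_text(entity_word, max_len=150):
    url = f"https://zh.wikipedia.org/api/rest_v1/page/summary/{entity_word}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")[:max_len]      # keep a short snippet

first_db = [crawl_short_text(w) for w in ["环形变压器", "电压互感器", "热继电器"]]
second_db = [crawl_short_text(w) for w in ["三相变压器"]]
```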
S42: segment the short texts in the first and second supplementary databases, remove stop words, and perform part-of-speech tagging, reducing the noise in the texts.
S43: and storing the words appearing in all texts in the first supplementary database and the second supplementary database into a Word list, establishing a Word list index, and converting each Word in the Word list into a corresponding Word vector through the pre-trained Word2 Vec.
S5: obtaining the expansion data from the candidate supplementary data and adding it to the labeled data set
Specifically, the text similarity between the word vectors in the word list corresponding to the first supplementary database and those corresponding to the second supplementary database is calculated, the texts with the maximum text similarity in the two databases are retained, and these texts are added to the labeled data set, thereby expanding it.
The text similarity can be calculated as follows:
s51: the separation of the entity words from the short text a in the first supplementary database and the short text B in the second supplementary database can also be achieved by finding the vector matrix corresponding to the entity words directly from the pre-trained word vectors. And then calculating the cosine similarity of the word vectors of the separated entity words as follows:
$$\mathrm{sim}(a_k,b)=\frac{\sum_{i=1}^{t}a_{k,i}\,b_i}{\sqrt{\sum_{i=1}^{t}a_{k,i}^{2}}\,\sqrt{\sum_{i=1}^{t}b_i^{2}}}\qquad(1)$$
where $a_k$ represents the k-th candidate entity word in short text A, $b$ represents the entity word in short text B originating from the fission thesaurus, and $t$ represents the word vector dimension.
S52: a, B all words in the text are extracted except the entity word, which can be accomplished by the word segmentation tool of the LTP toolkit described above. Then, respectively calculating the cosine similarity of the word vectors of all other words except the entity through formula (1), taking the word pairs with the same part of speech and the similarity greater than a threshold value as an overlapped word list, and then performing part of speech tagging weighting calculation on the overlapped word list, wherein the calculation formula of the similarity of the words of the part is as follows:
$$\mathrm{Sim}_{\mathrm{word}}(A,B)=\frac{2\sum_{(a_i,b_i)\in W}p_{t.w}\,\mathrm{sim}(a_i,b_i)}{m+n}\qquad(2)$$
where $W$ represents the list of overlapping words extracted from the two texts, $m$ and $n$ represent the lengths of the two texts, $a_i$ and $b_i$ represent keywords of the same part of speech from the two texts obtained after the cosine similarity calculation, and $p_{t.w}$ represents the part-of-speech tagging weight.
To reduce the impact of irrelevant words on the similarity score of texts A and B, word pairs whose cosine similarity falls below a certain threshold are excluded from the weighted score.
The similarity threshold in step S52 may be set to 0.5.
S53: the entity word similarity obtained through S51 and the overlapping word similarity score obtained through S52 are weighted-averaged as follows:
$$\mathrm{Sim}(A,B)=\lambda\,\mathrm{sim}(a_k,b)+(1-\lambda)\,\mathrm{Sim}_{\mathrm{word}}(A,B)\qquad(3)$$
i.e. the text similarity.
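Formulas (1)-(3) can be sketched as follows; the part-of-speech weights $p_{t.w}$ and the mixing weight are illustrative assumptions, since their exact values are left open above:

```python
# Entity similarity (1), POS-weighted overlap similarity (2), and their
# weighted average (3). POS_WEIGHTS and lam are assumptions for illustration.
import numpy as np

def cos_sim(a, b):
    """Formula (1): cosine similarity of two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

POS_WEIGHTS = {"n": 1.0, "v": 0.8, "a": 0.6}              # hypothetical p_{t.w}

def overlap_similarity(pairs, m, n):
    """Formula (2): pairs = [(vec_a, vec_b, pos)] same-POS word pairs with
    cos_sim > 0.5; m and n are the lengths of the two texts."""
    weighted = sum(POS_WEIGHTS.get(pos, 0.5) * cos_sim(va, vb)
                   for va, vb, pos in pairs)
    return 2.0 * weighted / (m + n)

def text_similarity(entity_sim, word_sim, lam=0.5):
    """Formula (3): weighted average of entity and overlap similarity."""
    return lam * entity_sim + (1.0 - lam) * word_sim
```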
The iterative calculation specifically comprises the following steps:
and fixing one short text in the second supplementary database, converting different short texts in the first supplementary database, sequentially calculating the text similarity, and keeping the text with the maximum similarity score.
And fixing another short text in the second supplementary database, converting different short texts in the first supplementary database, sequentially calculating the text similarity, and reserving the text with the maximum similarity score.
And so on.
The pairs of first-database and second-database texts with the largest text similarity scores are given the same labeling category as the labeled data set, and all of them are added to the labeled data set.
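The iteration then reduces to a retain-the-maximum loop; `pair_similarity` below is assumed to wrap formulas (1)-(3) for two preprocessed texts:

```python
# For each text B from the synonym (second) database, keep the text A from the
# first database with the highest text similarity (the iterative step above).
def screen_similar_texts(first_db, second_db, pair_similarity):
    kept = []
    for text_b in second_db:                               # fix one text in DB 2
        best_a = max(first_db,                             # traverse DB 1
                     key=lambda text_a: pair_similarity(text_a, text_b))
        kept.append((best_a, text_b))                      # retain max-score pair
    return kept                                            # goes into labeled set
```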
S6: model training
After text data enhancement is completed, a training model is established based on the expanded data set as follows:
S61: the expanded data set is fed into a bidirectional long short-term memory network (BiLSTM) model, and the semantic information of the short texts is extracted.
LSTM is an improved variant of the recurrent neural network (RNN). It effectively alleviates the information loss caused by sequence length during RNN training, and can extract the text features of the input sequence and the implicit associations between words.
Specifically, the BiLSTM model may include an input layer, a word embedding layer, a bidirectional LSTM layer, and a normalized exponential function layer.
The input layer represents each word in a sentence by its index in a word list obtained by traversing all data. The word embedding layer uses pre-trained Chinese word vectors; the training corpora are Chinese encyclopedia and microblog data, and the word vector dimension is 300. After the input layer, the character embeddings and word embeddings of each word are concatenated, where the character embeddings are 100-dimensional vectors that are randomly initialized and updated during training; the concatenated vector matrix serves as the final input representation of the words. The hidden layer dimension of the bidirectional LSTM is set to 256, and concatenating the forward and backward LSTM outputs yields a 512-dimensional sentence representation. The bidirectional LSTM output at each time step is fed into the normalized exponential function layer, giving values between 0 and 1, and the label with the largest value is the entity label of that position.
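The described architecture can be sketched in PyTorch as follows; the vocabulary sizes, the tag set size, and the simplification of one character index per word are illustrative assumptions:

```python
# 300-d word embeddings + 100-d char embeddings -> BiLSTM (256 per direction,
# 512 concatenated) -> per-token softmax over the tag set.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=20000, char_size=5000, num_tags=5,
                 word_dim=300, char_dim=100, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # load pre-trained here
        self.char_emb = nn.Embedding(char_size, char_dim)   # random init, trained
        self.lstm = nn.LSTM(word_dim + char_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)          # 512 -> tag scores

    def forward(self, word_ids, char_ids):
        # word_ids, char_ids: (batch, seq_len); one char index per word here
        x = torch.cat([self.word_emb(word_ids), self.char_emb(char_ids)], dim=-1)
        h, _ = self.lstm(x)                                 # (batch, seq, 512)
        return torch.softmax(self.out(h), dim=-1)           # tag probabilities

model = BiLSTMTagger()
probs = model(torch.zeros(1, 12, dtype=torch.long),
              torch.zeros(1, 12, dtype=torch.long))
tags = probs.argmax(-1)                                     # predicted label ids
```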
The bidirectional LSTM network layer contains three control gates, namely a forget gate, a memory gate, and an output gate, through which the information flow is processed.
Specifically, it includes:
Controlling the forget gate:
the forgetting gate can selectively forget the incoming information in combination with the current input, i.e., forget unimportant information and leave important information. It is achieved by the following formula:
$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$,
where $h_{t-1}$ denotes the hidden state of the previous time step, $x_t$ the input of the current state, $W_f$ a weight matrix, and $b_f$ a bias term.
Controlling memory door:
the memory door can be combined withInput x at the current time steptThe information in (1) is selectively reserved, namely important information in the current input is memorized, and unimportant information is discarded. It is achieved by the following formula:
$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$
$\tilde{C}_t=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$
where $W_i$ and $b_i$ are weight parameters to be learned, and $\tilde{C}_t$ is the temporary cell state of the current time step, used to update the current cell state.
Controlling an output gate:
the output gate can determine which information before the current time step is output, firstly calculates the unit state of the current time step, and then obtains the unit state of the previous time step and the forgetting gate f of the current time steptProduct and current time step memory gate itThe sum of products with temporary cell state, i.e. cell state C at the current time steptAnd continuously adjusting in the process as follows:
$C_t=f_t*C_{t-1}+i_t*\tilde{C}_t$
$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$,
where $W_o$ and $b_o$ are weight parameters to be learned.
In the above process, the hidden state of the current step is calculated as follows:
$h_t=o_t*\tanh(C_t)$.
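Written out directly, one time step of the gate equations above looks like this in NumPy (the shapes are chosen for illustration):

```python
# One LSTM step: forget gate, memory gate, temporary cell state, cell update,
# output gate, hidden state. x_t: (d,); h_prev, C_prev: (H,); W_*: (H, H+d).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # memory gate
    C_tilde = np.tanh(W_C @ z + b_C)         # temporary cell state
    C_t = f_t * C_prev + i_t * C_tilde       # current cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden state of the current step
    return h_t, C_t
```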
and splicing the forward hidden state and the back hidden state of each word obtained by the bidirectional LSTM network to be used as the input of a normalized exponential function layer (softmax), after the softmax, performing sequence prediction on the input short text, outputting a label at a corresponding position, namely obtaining the label of each word corresponding to the input sequence, outputting an entity type if the word is an entity, and outputting 0 if the word is not the entity.
S7: extraction of knowledge elements
Knowledge element extraction is then performed with the model trained in S6.
Example 1
Data expansion is performed based on the labeled data set as follows:
table 1: training data sample
Figure BDA0002619071260000102
The entity word "transformer" is selected from the above samples and converted into a word vector by the Word2Vec tool.
The vector matrix corresponding to the entity word is located in the pre-trained word vectors, and similar words are obtained with a cosine similarity algorithm through the HIT LTP tool, realizing word fission of the entity word "transformer". Words with similarity greater than 0.5 are treated as synonyms, and the number of synonyms obtained per fission is set to 3. One fission can yield the following synonyms: three-phase transformer, transformer coil, and oil-immersed transformer.
The electronic device database is selected as the third-party entity library, and under the entity type "equipment" the entity words "toroidal transformer", "voltage transformer", and "thermal relay" are selected in turn as candidate words.
Wikipedia pages are crawled for the synonym "three-phase transformer" and for the candidate words "toroidal transformer", "voltage transformer", and "thermal relay", yielding the following short texts:
"China's power supply system mostly adopts three-phase power transformer to control the change demand of voltage in the long-distance transmission process, but often because the asymmetry of three-phase load leads to three-phase transformer to break down. "
The sony corporation used state-of-the-art toroidal transformers to process the sound sources of different wave frequencies to prevent unpredictable failures. "
In order to save the cost of the voltage transformer, the voltage level of the primary side is reduced through the primary side winding and the secondary side winding, and the strong current and the weak current can be converted. "
If the starting state of the electric appliance is changed frequently in the using process, a thermal relay with larger power is generally selected, otherwise, the fault is easily caused. "
The crawled short texts are processed with the HIT LTP toolkit as follows: segment the words, remove stop words, and assign part-of-speech labels according to the segmentation; all resulting words are stored in a word list and an index is established:
the power supply system in China mostly adopts a three-phase power transformer to control the change requirement of voltage in the process of remote transmission, but the three-phase power transformer is often in fault due to asymmetry of three-phase load. ]
[index1,index2,……,indexN]
Sony corporation uses state-of-the-art toroidal transformers to process sound sources of different wave frequencies to prevent unpredictable failures
[index1,index2,……,indexN]
If the starting state of the electric appliance is frequently changed in the use process, a thermal relay with larger power is generally selected, otherwise, faults are easily caused.
[index1,index2,……,indexN]
……。
All words in the word list are then converted into word vectors through the pre-trained Word2Vec model.
Text similarity is calculated between the text crawled for the synonym "three-phase transformer" and the texts crawled for the candidate words "toroidal transformer", "voltage transformer", and "thermal relay", and the texts with the maximum similarity are kept, for example:
"The Sony Corporation uses state-of-the-art toroidal transformers to process sound sources of different wave frequencies and prevent unpredictable failures." and "If an electric appliance changes its starting state frequently during use, a thermal relay with larger power is generally selected; otherwise faults easily occur."
These texts are labeled in the same format as the data set and added as supplementary samples, yielding an expanded data set.
Similarly, processing the synonym "transformer coil" with the candidate words "Tesla coil", "inductance coil", and "contactor coil" yields the supplementary samples:
"A transformer coil has high requirements on the insulation performance of the winding, the most important being sufficient electric strength. The principle of an inductance coil is electromagnetic induction, which places certain requirements on the frequency of the signal passing through the coil, namely low pass frequency and high stop frequency."
Similarly, crawling is completed in turn for all synonyms and all candidate words, and through the above iterative process supplementary samples such as the following are obtained:
"China's power supply system mostly adopts three-phase power transformers to control voltage changes during long-distance transmission, but the asymmetry of three-phase loads often causes three-phase transformers to break down."
"The Sony Corporation uses state-of-the-art toroidal transformers to process sound sources of different wave frequencies and prevent unpredictable failures."
"If an electric appliance changes its starting state frequently during use, a thermal relay with larger power is generally selected; otherwise faults easily occur."
"A transformer coil has high requirements on the insulation performance of the winding, the most important being sufficient electric strength."
"The principle of an inductance coil is electromagnetic induction, which places certain requirements on the frequency of the signal passing through the coil, namely 'low pass frequency and high stop frequency' for short."
The expanded data set, comprising the original text "It is summer, and the transformer fails under high temperature" and the supplementary texts, is input into the BiLSTM model shown in Fig. 2 for training. After the original text is input, the entity word "transformer" is marked Yn in the model output and the other words are marked 0; the model output conforms to the actual situation, as shown in Table 2.
Table 2: predicted output sample
(Table 2 appears as an image in the original publication; it shows the model's predicted tag sequence, with "transformer" tagged Yn and all other positions tagged 0.)
Inputting "It is summer, and the transformer fails under high temperature" into the trained model gives the result 0000Yn00000000, where Yn denotes the entity type "equipment", showing that this knowledge element extraction method is accurate and effective.
Further, manually crawled marine-industry news texts were data-enhanced with the process of this embodiment, selecting a military-industry database as the third-party entity library; the results show that the data set can be expanded from the original 1000 samples to 1300, improving the model's entity recognition effect by 3%.
The above examples are merely preferred embodiments of the present invention, and the scope of the present invention is not limited to the above examples. All technical schemes belonging to the idea of the invention belong to the protection scope of the invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention, and such modifications and embellishments should also be considered as within the scope of the invention.

Claims (10)

1. A text data enhancement method, characterized in that: it comprises a process of screening similar texts from a first supplementary database and a second supplementary database, wherein the first supplementary database is derived from a knowledge base in a domain close to that of a basic data set, and the second supplementary database is derived from synonyms of entity words in the basic data set.
2. The data enhancement method of claim 1, wherein: the first supplementary database is obtained by web page crawling of the entity words it contains, and the second supplementary database is obtained by web page crawling of synonyms of the entity words it contains.
3. The data enhancement method of claim 1, wherein the similar texts are determined through the following process:
S51: performing word segmentation and labeling on the short texts from the first supplementary database and the second supplementary database, and calculating the word vector cosine similarity between the separated entity words, i.e., the entity word similarity;
S52: calculating the word vector cosine similarity between the remaining words other than the separated entity words, pairing words of the same part of speech whose similarity is greater than a threshold as overlapping words, and calculating the weighted similarity of the overlapping words under part-of-speech features, i.e., the overlapping word similarity;
S53: taking a weighted average of the entity word similarity and the overlapping word similarity to obtain the text similarity;
wherein the texts in the first and second supplementary databases are iteratively evaluated for text similarity, and in each iteration the two texts with the maximum text similarity are the similar texts.
4. The data enhancement method of claim 1, wherein: the synonyms are obtained by synonym fission, the synonym fission comprising: acquiring from a corpus the words whose word vectors are cosine-similar to those of the entity words in the basic data set, i.e., the synonyms of the entity words.
5. The data enhancement method of claim 4, wherein: the number of synonyms per fission is set to 1-4, preferably 3.
6. The data enhancement method according to any one of claims 1 to 5, characterized by: the Word vector is obtained by Word2Vec model conversion.
7. A knowledge element extraction method, characterized in that: the extraction method is implemented by a trained extraction model, and the training of the model is based on a labeled data set enhanced by the data enhancement method of any one of claims 1-6.
8. The knowledge element extraction method according to claim 7, wherein: the extraction model is a bidirectional long short-term memory network model.
9. The knowledge element extraction method according to claim 8, wherein: the extraction model comprises an input layer, a word embedding layer, a bidirectional LSTM layer, and a normalized exponential function layer, wherein the input layer represents each word in a sentence by its index in a word list, and the word embedding layer uses pre-trained Chinese word vectors;
preferably, the word vector dimension is set to 300 and the hidden layer dimension of the bidirectional LSTM layer is set to 256.
10. The knowledge element extraction method according to claim 9, wherein: the input form of the input layer is a combination of characters and words.
CN202010777706.4A 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method Active CN111950264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010777706.4A CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010777706.4A CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Publications (2)

Publication Number Publication Date
CN111950264A true CN111950264A (en) 2020-11-17
CN111950264B CN111950264B (en) 2024-04-26

Family

ID=73339486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010777706.4A Active CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Country Status (1)

Country Link
CN (1) CN111950264B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 A kind of subject key words extracting method based on descriptor vector sum network structure
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
US20200134058A1 (en) * 2018-10-29 2020-04-30 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for building an evolving ontology from user-generated content
US20200233917A1 (en) * 2019-01-23 2020-07-23 Keeeb Inc. Data processing system for data search and retrieval augmentation and enhanced data storage
CN110502644A (en) * 2019-08-28 2019-11-26 同方知网(北京)技术有限公司 A kind of field level dictionary excavates the Active Learning Method of building
CN111143574A (en) * 2019-12-05 2020-05-12 大连民族大学 Query and visualization system construction method based on minority culture knowledge graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Yanping et al., "Sentence semantic similarity method based on the synonym lexicon Cilin and its application in question-answering systems", Computer Applications and Software, vol. 36, no. 08, 12 August 2019 (2019-08-12), pages 65-68 *
Hu Longmao et al., "Recognition of identical product features based on multi-dimensional similarity and sentiment word expansion", Journal of Shandong University (Engineering Science), vol. 50, no. 02, 23 March 2020 (2020-03-23), pages 50-59 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632993A (en) * 2020-11-27 2021-04-09 浙江工业大学 Electric power measurement entity recognition model classification method based on convolution attention network
CN113158648A (en) * 2020-12-09 2021-07-23 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device
CN113221574A (en) * 2021-05-31 2021-08-06 云南锡业集团(控股)有限责任公司研发中心 Named entity recognition method, device, equipment and computer readable storage medium
CN113779959A (en) * 2021-08-31 2021-12-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN113901207A (en) * 2021-09-15 2022-01-07 昆明理工大学 Adverse drug reaction detection method based on data enhancement and semi-supervised learning
CN113901207B (en) * 2021-09-15 2024-04-26 昆明理工大学 Adverse drug reaction detection method based on data enhancement and semi-supervised learning
CN114706975A (en) * 2022-01-19 2022-07-05 天津大学 Text classification method for power failure news by introducing data enhancement SA-LSTM
CN116541535A (en) * 2023-05-19 2023-08-04 北京理工大学 Automatic knowledge graph construction method, system, equipment and medium

Also Published As

Publication number Publication date
CN111950264B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN111950264B (en) Text data enhancement method and knowledge element extraction method
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN110222160A (en) Intelligent semantic document recommendation method, device and computer readable storage medium
CN111581350A (en) Multi-task learning, reading and understanding method based on pre-training language model
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN115495555A (en) Document retrieval method and system based on deep learning
CN113609844A (en) Electric power professional word bank construction method based on hybrid model and clustering algorithm
CN112860889A (en) BERT-based multi-label classification method
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN108536781B (en) Social network emotion focus mining method and system
CN115017425B (en) Location search method, location search device, electronic device, and storage medium
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
Dusserre et al. Bigger does not mean better! We prefer specificity
CN114997288A (en) Design resource association method
CN110298046B (en) Translation model training method, text translation method and related device
CN116797195A (en) Work order processing method, apparatus, computer device, and computer readable storage medium
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN115017912A (en) Double-target entity emotion analysis method for multi-task learning
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN116187347A (en) Question and answer method and device based on pre-training model, electronic equipment and storage medium
CN113590768B (en) Training method and device for text relevance model, question answering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant