CN111950264A - Text data enhancement method and knowledge element extraction method - Google Patents

Text data enhancement method and knowledge element extraction method

Info

Publication number
CN111950264A
Authority
CN
China
Prior art keywords
words
word
entity
similarity
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010777706.4A
Other languages
Chinese (zh)
Other versions
CN111950264B (en)
Inventor
程良伦
牛伟才
王德培
张伟文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong University of Technology
Original Assignee
Guangdong University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong University of Technology filed Critical Guangdong University of Technology
Priority to CN202010777706.4A
Publication of CN111950264A
Application granted
Publication of CN111950264B
Legal status: Active
Anticipated expiration


Classifications

    • G06F40/295 Named entity recognition (G06F40/00 Handling natural language data; G06F40/20 Natural language analysis; G06F40/279 Recognition of textual entities; G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking)
    • G06F16/3344 Query execution using natural language analysis (G06F16/00 Information retrieval; G06F16/30 Unstructured textual data; G06F16/33 Querying; G06F16/3331 Query processing; G06F16/334 Query execution)
    • G06F16/35 Clustering; Classification (of unstructured textual data)
    • G06F40/247 Thesauruses; Synonyms (G06F40/237 Lexical tools)
    • G06F40/30 Semantic analysis
    • G06N3/045 Combinations of networks (G06N3/00 Computing arrangements based on biological models; G06N3/02 Neural networks; G06N3/04 Architecture, e.g. interconnection topology)
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a text data enhancement method and a knowledge element extraction method. The text data enhancement method comprises a process of screening similar texts from a first supplementary database and a second supplementary database, where the first supplementary database is derived from a knowledge base in a field similar to that of a basic data set, and the second supplementary database is derived from synonyms of entity words in the basic data set. The data enhancement method can generate an efficient, large-scale supplement to basic data with few sources, and a knowledge element extraction model trained on a data set enhanced by this method has higher generalization capability and extraction accuracy.

Description

Text data enhancement method and knowledge element extraction method
Technical Field
The invention relates to the technical field of natural language processing, in particular to a knowledge element extraction technology.
Background
With the rapid development of internet technology, constructing an industrial domain knowledge base enables better domain-specific intelligent question answering and intelligent decision making and promotes the intelligentization of industrial manufacturing. A large amount of electronic text information is generated in the industrial production process, dispersed across workers' maintenance diagnosis tables, internet communities, and factory databases. If this unstructured and semi-structured electronic text information can be organized into a knowledge base with high knowledge density, the utilization rate of domain knowledge can be greatly improved.
How to process text information quickly and efficiently is a major concern in natural language processing, and named entity recognition is the most critical step. Recognition of domain knowledge-element entities extracts important knowledge units from structured and semi-structured text data; these knowledge units are generally the most representative words in a specific domain. After entities are correctly identified, relationship extraction, event extraction, and knowledge base construction can follow. The quality of named entity recognition therefore directly affects subsequent information extraction tasks.
Existing named entity recognition methods fall roughly into three categories: methods based on rules and dictionaries, methods based on statistical machine learning, and methods based on deep learning. Rule- and dictionary-based methods require a large number of rules and dictionaries to be formulated, which demands huge manual labeling effort; they are limited by professional knowledge, since only experts in certain fields can formulate them, so recognition cost is high and efficiency is low. Methods based on statistical machine learning mainly include hidden Markov models, maximum entropy models, support vector machines, and conditional random field models; their recognition effect depends largely on the feature combinations selected for the model, such as the part-of-speech, position, and context features of words, and entity recognition requires large-scale training corpora. Entity recognition based on deep learning is currently the most mainstream method: pre-trained word vectors are used as the input of a neural network, the neural network layers extract semantics from the text, and the extracted sentence features pass through a global normalization function (softmax) layer or a conditional random field to predict the label of each word. Although the recognition effect of deep learning far exceeds that of statistical machine learning and rule-based methods, realizing its prediction and generalization capability requires enough high-quality labeled data as support; otherwise overfitting occurs and the expected recognition accuracy is hard to reach, and the industrial field often lacks sufficient labeled data sets for optimizing the parameters of the training model.
Disclosure of Invention
The invention aims to provide a text data enhancement method that can efficiently supplement basic data with few sources at scale, overcome the model accuracy problems caused by supplementary data being too close to the basic data, and markedly improve the generalization capability and extraction accuracy of a model.
The invention further aims to provide a method for accurately extracting knowledge elements based on the enhanced text data.
The invention firstly discloses the following technical scheme:
A method of text data enhancement, comprising a process of screening similar texts from a first supplementary database and a second supplementary database, where the first supplementary database is derived from a knowledge base in a domain close to that of a basic data set, and the second supplementary database is derived from synonyms of entity words in the basic data set.
In the scheme, the entity words refer to words representing entities.
The basic data set refers to a data set that contains certain text data and needs data enhancement, preferably one for which labeling has been completed.
Similar fields refer to fields that share the same or similar entity words in terms of products, functions, technical processes, and the like.
One example is the power grid domain and the electronics domain: the three-phase transformer of the power grid field appears, under the name of a toroidal transformer, in loudspeaker electronics products in the electronics field.
Another example, within inorganic non-metallic materials, is ceramic production and refractory materials: the mullite material required in the ceramic production process is known among refractory materials as kyanite, mullite, or sillimanite. The two fields share the same series of mullite-based reactions under different names.
Expanding with corpus information that contains entities from a close domain both increases the data volume for the entity words and improves the generalization capability of the model.
Such a near-field knowledge base may come from the internet, raw material recipes, workers' manuals, and the like.
It will be appreciated that the data in the first and second supplementary databases should be presented in the form of text.
In some embodiments, the first supplementary database is obtained by web page crawling of the entity words it contains, and the second supplementary database is obtained by web page crawling of synonyms of the entity words it contains.
The web pages in this embodiment are preferably knowledge-rich pages, such as Wikipedia.
In some embodiments, the similar text is determined by:
s51: and performing word segmentation and labeling on the short text from the first supplementary database and the short text from the second supplementary database, and calculating word vector cosine similarity between separated entity words, namely entity word similarity.
S52: and calculating the cosine similarity of word vectors between other words except the separated entity words, pairing the words with the same part of speech of which the similarity is greater than a threshold value into overlapped words, and calculating the weighted similarity of the overlapped words under the part of speech characteristics, namely the similarity of the overlapped words.
Preferably, the threshold value in S52 is set to 0.5, i.e. words with a similarity greater than 0.5 are overlapping words.
S53: and carrying out weighted average on the entity word similarity and the overlapping word similarity to obtain text similarity.
The texts in the first and second supplementary databases are iteratively evaluated for text similarity, and in each iteration the two texts with the maximum text similarity are the similar texts.
In some embodiments, the synonyms are obtained by synonym fission, which comprises: acquiring from a corpus the words whose word vectors are cosine-similar to those of the entity words in the basic data set, i.e., the synonyms of the entity words.
In some embodiments, the number of synonyms per fission is set to 1-4, preferably 3.
In some embodiments, the Word vector is obtained by a Word2Vec model transformation.
In some embodiments, the synonym fission is achieved by the Word2Vec model.
The Word2Vec model used may be trained by encyclopedia, Baidu, and/or microblog corpora.
Word vectors trained by such a model carry prior knowledge: synonyms are semantically similar and are therefore expressed as close cosine distances.
The invention further discloses a knowledge element extraction method, implemented by a trained extraction model, where the training of the model is based on a labeled data set enhanced by the above data enhancement method.
In some embodiments, the extraction model is a bidirectional long short-term memory (BiLSTM) network model.
In some embodiments, the extraction model includes an input layer, a word embedding layer, a bidirectional LSTM layer, and a normalized exponential function layer.
In some embodiments, the input layer represents each word in a sentence by its index in a vocabulary, which is obtained by traversing all data.
In addition, to enrich the representation of words, in some specific embodiments the word embedding layer uses pre-trained Chinese word vectors; the training corpora of the word vectors are preferably Chinese encyclopedia and microblog data, and the word vector dimension is preferably 300.
In addition, to enrich the representation of words, in some specific embodiments the character embeddings and word embeddings of each word are concatenated, where the characters specifically refer to the individual Chinese characters within the word.
Character embeddings are preferably 100-dimensional vectors that are randomly initialized and updated during training.
In some embodiments, the hidden layer dimension of the bidirectional LSTM layer is set to 256, and the forward and backward LSTM outputs are concatenated to obtain a 512-dimensional sentence representation.
In some embodiments, the bidirectional LSTM output at each time step is fed into the normalized exponential function (softmax) layer to obtain values between 0 and 1, and the label with the largest value is taken as the entity label of that position.
The method can effectively address the lack of sufficient structured knowledge bases in the industrial field. By expanding the training data set through text similarity, it becomes possible to borrow existing knowledge bases from similar industrial scenes and to supplement the basic data through synonyms. Screening and integrating data from the two sources not only enlarges the data set significantly, but also overcomes the low model generalization caused by the overly strong relevance of entities from a single source, markedly improving model accuracy.
Using this method, manually crawled marine-industry news texts were data-enhanced with a military-industry database as the near-field knowledge base; the data set was expanded from the original 1000 samples to 1300, and the entity recognition effect improved by 3%.
Drawings
Fig. 1 is a schematic flow chart of a data enhancement method according to an embodiment of the present invention.
Fig. 2 is a schematic structural diagram of an extraction model used in embodiment 1 of the present invention.
Detailed Description
The present invention is described in detail below with reference to the following embodiments and the attached drawings, but it should be understood that the embodiments and the attached drawings are only used for the illustrative description of the present invention and do not limit the protection scope of the present invention in any way. All reasonable variations and combinations that fall within the spirit of the invention are intended to be within the scope of the invention.
The extraction of the knowledge elements is performed by the flow shown in Fig. 1.
Specifically, firstly, text data enhancement is performed, that is, data expansion is performed on the basis of an existing text data set.
The existing text data set, i.e., the basic data set, can be obtained by collecting electronic texts generated in the industrial production process, such as electronic texts scattered in workers' maintenance diagnosis tables, internet communities, factory databases, and the like.
The method operates on the entity words in the basic data set, so the samples in the basic data set need to be labeled with entity words; that is, a labeled data set is obtained first.
Based on the annotated dataset, data enhancement is performed by the following process:
S1: selecting an entity word library as one of the supplementary data sources
The entity word library may come from an existing knowledge base in an industrial field similar to that of the basic data set, and should contain multiple entity words under different entity types. For example, an existing knowledge base is selected that contains entity words 1-0, 1-1, …, 1-n under entity type 1; entity words 2-0, 2-1, …, 2-m under entity type 2; …; and entity words k-1, k-2, …, k-l under entity type k. This entity word library is the first supplementary data source.
S2: selecting entity words to be expanded in the labeled data set
The labeled data set is a set of short texts, where each non-entity word in a text can be labeled 0 and each entity word labeled Yn, with n denoting the entity type to which the entity word belongs.
The entity types in S1 and S2 may be determined or adjusted for different situations. For example, for general-scene entity recognition the entity types may be divided into time, location, person, organization, and so on; when the labeled data set has few samples, the entity types may be further fine-tuned according to the application field.
Entity words to be expanded are then selected from the labeled data set.
S3: converting the entity words to be expanded into word vectors
The conversion may be implemented by the Word2Vec model proposed by Google in 2013.
The model, pre-trained at large scale on massive data sets, can quickly and effectively express a word in vector form and supports word clustering and similar functions.
Word vectors obtained through the word embedding operation of the Word2Vec model can be understood as distributed, low-dimensional, dense real-valued vectors, where word vectors representing similar semantics have closer cosine distances; the similarity between words can thus be calculated by comparing their word vectors.
S3: performing synonym fission on the entity words to be expanded
Synonym fission can be achieved by direct computation with the Word2Vec model, generating several synonyms similar to each entity word and forming a synonym library, i.e., the second supplementary data source.
In synonym fission, the number of synonyms per fission should not be set too large; otherwise the relevance among entity words disappears, and the semantic relevance between the fission words and the original entity words is lost.
The Word2Vec model used here was jointly trained on 256 GB of encyclopedia, Baidu, and microblog corpora.
The similarity threshold can be set to 0.5: the cosine similarity of different word vectors is calculated through the Word2Vec model, and words with cosine similarity greater than 0.5 are considered similar words.
In each fission, the first 3 similar words with the highest similarity are preferably taken.
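As a minimal sketch of this fission step, the top-3 neighbours above the 0.5 threshold can be read off a pre-trained model with gensim; the model file path below is hypothetical:

```python
# Synonym "fission": take the top-3 cosine neighbours above 0.5, as described
# above. A sketch assuming a gensim-loadable Word2Vec file; the path is
# an assumption, not taken from the patent.
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format("word2vec_zh.bin", binary=True)

def synonym_fission(entity_word, top_n=3, threshold=0.5):
    """Return up to top_n synonyms whose cosine similarity exceeds threshold."""
    if entity_word not in wv:
        return []
    neighbours = wv.most_similar(entity_word, topn=20)   # cosine-ranked
    return [w for w, sim in neighbours if sim > threshold][:top_n]

# e.g. synonym_fission("变压器") might return words such as
# "三相变压器", "变压器线圈", "油浸式变压器"
```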
Words obtained through synonym fission have strong relevance to the original entity words (the entity words to be expanded); directly feeding them into the model for training as a supplementary data source would reduce the model's generalization.
Therefore, after the synonyms are obtained, the method further screens the candidate supplementary data generated from the different supplementary data sources, which yields a better data supplementation effect and markedly improves the generalization capability of the model. The specific process is as follows:
S4: obtaining candidate supplementary data
It may further comprise:
S41: select one entity word from the second supplementary data source and k entity words belonging to k entity types from the first supplementary data source, and crawl web page texts for each entity word independently. The web pages are preferably knowledge-rich pages such as Wikipedia. The crawled content is formatted as short texts, and the obtained short texts form the second supplementary database and the first supplementary database respectively, according to the source of the crawled entity word.
To reduce text noise and improve the recognition effect, the length of the crawled short texts can be fine-tuned according to the field to be trained.
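A sketch of the crawl in S41, assuming Wikipedia's REST summary endpoint as the source; both the endpoint choice and the snippet length are illustrative assumptions, not fixed by the method:

```python
# Crawl a short text for each entity word (S41). The endpoint and the
# truncation length are assumptions for illustration.
import requests

def crawl_short_text(entity_word, max_len=150):
    url = f"https://zh.wikipedia.org/api/rest_v1/page/summary/{entity_word}"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.json().get("extract", "")[:max_len]      # keep a short snippet

first_db = [crawl_short_text(w) for w in ["环形变压器", "电压互感器", "热继电器"]]
second_db = [crawl_short_text(w) for w in ["三相变压器"]]
```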
S42: segment the short texts in the first and second supplementary databases, remove stop words, and perform part-of-speech tagging, reducing the noise in the texts.
S43: and storing the words appearing in all texts in the first supplementary database and the second supplementary database into a Word list, establishing a Word list index, and converting each Word in the Word list into a corresponding Word vector through the pre-trained Word2 Vec.
S5: obtaining the expansion data from the candidate supplementary data and adding it to the labeled data set
Specifically, the text similarity between the word vectors in the word list corresponding to the first supplementary database and those corresponding to the second supplementary database is calculated, the texts with the maximum text similarity in the two databases are retained, and these texts are added to the labeled data set, thereby expanding it.
The text similarity can be calculated as follows:
s51: the separation of the entity words from the short text a in the first supplementary database and the short text B in the second supplementary database can also be achieved by finding the vector matrix corresponding to the entity words directly from the pre-trained word vectors. And then calculating the cosine similarity of the word vectors of the separated entity words as follows:
$$\mathrm{sim}(a_k,b)=\frac{\sum_{i=1}^{t}a_{k,i}\,b_i}{\sqrt{\sum_{i=1}^{t}a_{k,i}^{2}}\,\sqrt{\sum_{i=1}^{t}b_i^{2}}}\qquad(1)$$
where $a_k$ represents the k-th candidate entity word in short text A, $b$ represents the entity word in short text B originating from the fission thesaurus, and $t$ represents the word vector dimension.
S52: a, B all words in the text are extracted except the entity word, which can be accomplished by the word segmentation tool of the LTP toolkit described above. Then, respectively calculating the cosine similarity of the word vectors of all other words except the entity through formula (1), taking the word pairs with the same part of speech and the similarity greater than a threshold value as an overlapped word list, and then performing part of speech tagging weighting calculation on the overlapped word list, wherein the calculation formula of the similarity of the words of the part is as follows:
$$\mathrm{Sim}_{\mathrm{word}}(A,B)=\frac{2\sum_{(a_i,b_i)\in W}p_{t.w}\,\mathrm{sim}(a_i,b_i)}{m+n}\qquad(2)$$
where $W$ represents the list of overlapping words extracted from the two texts, $m$ and $n$ represent the lengths of the two texts, $a_i$ and $b_i$ represent keywords of the same part of speech from the two texts obtained after the cosine similarity calculation, and $p_{t.w}$ represents the part-of-speech tagging weight.
To reduce the impact of irrelevant words on the similarity score of texts A and B, word pairs whose cosine similarity falls below a certain threshold are excluded from the weighted score.
The similarity threshold in step S52 may be set to 0.5.
S53: the entity word similarity obtained through S51 and the overlapping word similarity score obtained through S52 are weighted-averaged as follows:
$$\mathrm{Sim}(A,B)=\lambda\,\mathrm{sim}(a_k,b)+(1-\lambda)\,\mathrm{Sim}_{\mathrm{word}}(A,B)\qquad(3)$$
i.e. the text similarity.
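Formulas (1)-(3) can be sketched as follows; the part-of-speech weights $p_{t.w}$ and the mixing weight are illustrative assumptions, since their exact values are left open above:

```python
# Entity similarity (1), POS-weighted overlap similarity (2), and their
# weighted average (3). POS_WEIGHTS and lam are assumptions for illustration.
import numpy as np

def cos_sim(a, b):
    """Formula (1): cosine similarity of two word vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

POS_WEIGHTS = {"n": 1.0, "v": 0.8, "a": 0.6}              # hypothetical p_{t.w}

def overlap_similarity(pairs, m, n):
    """Formula (2): pairs = [(vec_a, vec_b, pos)] same-POS word pairs with
    cos_sim > 0.5; m and n are the lengths of the two texts."""
    weighted = sum(POS_WEIGHTS.get(pos, 0.5) * cos_sim(va, vb)
                   for va, vb, pos in pairs)
    return 2.0 * weighted / (m + n)

def text_similarity(entity_sim, word_sim, lam=0.5):
    """Formula (3): weighted average of entity and overlap similarity."""
    return lam * entity_sim + (1.0 - lam) * word_sim
```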
The iterative calculation specifically comprises the following steps:
and fixing one short text in the second supplementary database, converting different short texts in the first supplementary database, sequentially calculating the text similarity, and keeping the text with the maximum similarity score.
And fixing another short text in the second supplementary database, converting different short texts in the first supplementary database, sequentially calculating the text similarity, and reserving the text with the maximum similarity score.
And so on.
The pairs of first-database and second-database texts with the largest text similarity scores are given the same labeling category as the labeled data set, and all of them are added to the labeled data set.
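The iteration then reduces to a retain-the-maximum loop; `pair_similarity` below is assumed to wrap formulas (1)-(3) for two preprocessed texts:

```python
# For each text B from the synonym (second) database, keep the text A from the
# first database with the highest text similarity (the iterative step above).
def screen_similar_texts(first_db, second_db, pair_similarity):
    kept = []
    for text_b in second_db:                               # fix one text in DB 2
        best_a = max(first_db,                             # traverse DB 1
                     key=lambda text_a: pair_similarity(text_a, text_b))
        kept.append((best_a, text_b))                      # retain max-score pair
    return kept                                            # goes into labeled set
```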
S6: model training
After text data enhancement is completed, a training model is established based on the expanded data set as follows:
S61: the expanded data set is fed into a bidirectional long short-term memory network (BiLSTM) model, and the semantic information of the short texts is extracted.
LSTM is an improved variant of the recurrent neural network (RNN). It effectively alleviates the information loss caused by sequence length during RNN training, and can extract the text features of the input sequence and the implicit associations between words.
Specifically, the BiLSTM model may include an input layer, a word embedding layer, a bidirectional LSTM layer, and a normalized exponential function layer.
The input layer represents each word in a sentence by its index in a word list obtained by traversing all data. The word embedding layer uses pre-trained Chinese word vectors; the training corpora are Chinese encyclopedia and microblog data, and the word vector dimension is 300. After the input layer, the character embeddings and word embeddings of each word are concatenated, where the character embeddings are 100-dimensional vectors that are randomly initialized and updated during training; the concatenated vector matrix serves as the final input representation of the words. The hidden layer dimension of the bidirectional LSTM is set to 256, and concatenating the forward and backward LSTM outputs yields a 512-dimensional sentence representation. The bidirectional LSTM output at each time step is fed into the normalized exponential function layer, giving values between 0 and 1, and the label with the largest value is the entity label of that position.
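The described architecture can be sketched in PyTorch as follows; the vocabulary sizes, the tag set size, and the simplification of one character index per word are illustrative assumptions:

```python
# 300-d word embeddings + 100-d char embeddings -> BiLSTM (256 per direction,
# 512 concatenated) -> per-token softmax over the tag set.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size=20000, char_size=5000, num_tags=5,
                 word_dim=300, char_dim=100, hidden=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)  # load pre-trained here
        self.char_emb = nn.Embedding(char_size, char_dim)   # random init, trained
        self.lstm = nn.LSTM(word_dim + char_dim, hidden,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_tags)          # 512 -> tag scores

    def forward(self, word_ids, char_ids):
        # word_ids, char_ids: (batch, seq_len); one char index per word here
        x = torch.cat([self.word_emb(word_ids), self.char_emb(char_ids)], dim=-1)
        h, _ = self.lstm(x)                                 # (batch, seq, 512)
        return torch.softmax(self.out(h), dim=-1)           # tag probabilities

model = BiLSTMTagger()
probs = model(torch.zeros(1, 12, dtype=torch.long),
              torch.zeros(1, 12, dtype=torch.long))
tags = probs.argmax(-1)                                     # predicted label ids
```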
The bidirectional LSTM network layer contains three control gates, namely a forget gate, a memory gate, and an output gate, through which the information flow is processed.
Specifically, it includes:
Controlling the forget gate:
the forgetting gate can selectively forget the incoming information in combination with the current input, i.e., forget unimportant information and leave important information. It is achieved by the following formula:
$f_t=\sigma(W_f\cdot[h_{t-1},x_t]+b_f)$,
where $h_{t-1}$ denotes the hidden state of the previous time step, $x_t$ the input of the current state, $W_f$ a weight matrix, and $b_f$ a bias term.
Controlling memory door:
the memory door can be combined withInput x at the current time steptThe information in (1) is selectively reserved, namely important information in the current input is memorized, and unimportant information is discarded. It is achieved by the following formula:
$i_t=\sigma(W_i\cdot[h_{t-1},x_t]+b_i)$
$\tilde{C}_t=\tanh(W_C\cdot[h_{t-1},x_t]+b_C)$
where $W_i$ and $b_i$ are weight parameters to be learned, and $\tilde{C}_t$ is the temporary cell state of the current time step, used to update the current cell state.
Controlling an output gate:
the output gate can determine which information before the current time step is output, firstly calculates the unit state of the current time step, and then obtains the unit state of the previous time step and the forgetting gate f of the current time steptProduct and current time step memory gate itThe sum of products with temporary cell state, i.e. cell state C at the current time steptAnd continuously adjusting in the process as follows:
$C_t=f_t*C_{t-1}+i_t*\tilde{C}_t$
$o_t=\sigma(W_o\cdot[h_{t-1},x_t]+b_o)$,
where $W_o$ and $b_o$ are weight parameters to be learned.
In the above process, the hidden state of the current step is calculated as follows:
$h_t=o_t*\tanh(C_t)$.
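Written out directly, one time step of the gate equations above looks like this in NumPy (the shapes are chosen for illustration):

```python
# One LSTM step: forget gate, memory gate, temporary cell state, cell update,
# output gate, hidden state. x_t: (d,); h_prev, C_prev: (H,); W_*: (H, H+d).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W_f, b_f, W_i, b_i, W_C, b_C, W_o, b_o):
    z = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # forget gate
    i_t = sigmoid(W_i @ z + b_i)             # memory gate
    C_tilde = np.tanh(W_C @ z + b_C)         # temporary cell state
    C_t = f_t * C_prev + i_t * C_tilde       # current cell state
    o_t = sigmoid(W_o @ z + b_o)             # output gate
    h_t = o_t * np.tanh(C_t)                 # hidden state of the current step
    return h_t, C_t
```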
and splicing the forward hidden state and the back hidden state of each word obtained by the bidirectional LSTM network to be used as the input of a normalized exponential function layer (softmax), after the softmax, performing sequence prediction on the input short text, outputting a label at a corresponding position, namely obtaining the label of each word corresponding to the input sequence, outputting an entity type if the word is an entity, and outputting 0 if the word is not the entity.
S7: extraction of knowledge elements
Knowledge element extraction is then performed with the model trained in S6.
Example 1
Data expansion is performed based on the labeled data set as follows:
table 1: training data sample
Figure BDA0002619071260000102
The entity word "transformer" is selected from the above samples and converted into a word vector by the Word2Vec tool.
The vector matrix corresponding to the entity word is located in the pre-trained word vectors, and similar words are obtained with a cosine similarity algorithm through the HIT LTP tool, realizing word fission of the entity word "transformer". Words with similarity greater than 0.5 are treated as synonyms, and the number of synonyms obtained per fission is set to 3. One fission can yield the following synonyms: three-phase transformer, transformer coil, and oil-immersed transformer.
The electronic device database is selected as the third-party entity library, and under the entity type "equipment" the entity words "toroidal transformer", "voltage transformer", and "thermal relay" are selected in turn as candidate words.
Wikipedia pages are crawled for the synonym "three-phase transformer" and for the candidate words "toroidal transformer", "voltage transformer", and "thermal relay", yielding the following short texts:
"China's power supply system mostly adopts three-phase power transformer to control the change demand of voltage in the long-distance transmission process, but often because the asymmetry of three-phase load leads to three-phase transformer to break down. "
The sony corporation used state-of-the-art toroidal transformers to process the sound sources of different wave frequencies to prevent unpredictable failures. "
In order to save the cost of the voltage transformer, the voltage level of the primary side is reduced through the primary side winding and the secondary side winding, and the strong current and the weak current can be converted. "
If the starting state of the electric appliance is changed frequently in the using process, a thermal relay with larger power is generally selected, otherwise, the fault is easily caused. "
The crawled short texts are processed with the HIT LTP toolkit as follows: segment the words, remove stop words, and assign part-of-speech labels according to the segmentation; all resulting words are stored in a word list and an index is established:
the power supply system in China mostly adopts a three-phase power transformer to control the change requirement of voltage in the process of remote transmission, but the three-phase power transformer is often in fault due to asymmetry of three-phase load. ]
[index1,index2,……,indexN]
Sony corporation uses state-of-the-art toroidal transformers to process sound sources of different wave frequencies to prevent unpredictable failures
[index1,index2,……,indexN]
If the starting state of the electric appliance is frequently changed in the use process, a thermal relay with larger power is generally selected, otherwise, faults are easily caused.
[index1,index2,……,indexN]
……。
All words in the word list are then converted into word vectors through the pre-trained Word2Vec model.
Text similarity is calculated between the text crawled for the synonym "three-phase transformer" and the texts crawled for the candidate words "toroidal transformer", "voltage transformer", and "thermal relay", and the texts with the maximum similarity are kept, for example:
"The Sony Corporation uses state-of-the-art toroidal transformers to process sound sources of different wave frequencies and prevent unpredictable failures." and "If an electric appliance changes its starting state frequently during use, a thermal relay with larger power is generally selected; otherwise faults easily occur."
These texts are labeled in the same format as the data set and added as supplementary samples, yielding an expanded data set.
Similarly, processing the synonym "transformer coil" with the candidate words "Tesla coil", "inductance coil", and "contactor coil" yields the supplementary samples:
"A transformer coil has high requirements on the insulation performance of the winding, the most important being sufficient electric strength. The principle of an inductance coil is electromagnetic induction, which places certain requirements on the frequency of the signal passing through the coil, namely low pass frequency and high stop frequency."
Similarly, crawling is completed in turn for all synonyms and all candidate words, and through the above iterative process supplementary samples such as the following are obtained:
"China's power supply system mostly adopts three-phase power transformers to control voltage changes during long-distance transmission, but the asymmetry of three-phase loads often causes three-phase transformers to break down."
"The Sony Corporation uses state-of-the-art toroidal transformers to process sound sources of different wave frequencies and prevent unpredictable failures."
"If an electric appliance changes its starting state frequently during use, a thermal relay with larger power is generally selected; otherwise faults easily occur."
"A transformer coil has high requirements on the insulation performance of the winding, the most important being sufficient electric strength."
"The principle of an inductance coil is electromagnetic induction, which places certain requirements on the frequency of the signal passing through the coil, namely 'low pass frequency and high stop frequency' for short."
The expanded data set, comprising the original text "It is summer, and the transformer fails under high temperature" and the supplementary texts, is input into the BiLSTM model shown in Fig. 2 for training. After the original text is input, the entity word "transformer" is marked Yn in the model output and the other words are marked 0; the model output conforms to the actual situation, as shown in Table 2.
Table 2: predicted output sample
(Table 2 appears as an image in the original publication; it shows the model's predicted tag sequence, with "transformer" tagged Yn and all other positions tagged 0.)
Inputting "It is summer, and the transformer fails under high temperature" into the trained model gives the result 0000Yn00000000, where Yn denotes the entity type "equipment", showing that this knowledge element extraction method is accurate and effective.
Further, manually crawled marine-industry news texts were data-enhanced with the process of this embodiment, selecting a military-industry database as the third-party entity library; the results show that the data set can be expanded from the original 1000 samples to 1300, improving the model's entity recognition effect by 3%.
The above examples are merely preferred embodiments of the present invention, and the scope of the present invention is not limited to the above examples. All technical schemes belonging to the idea of the invention belong to the protection scope of the invention. It should be noted that modifications and embellishments within the scope of the invention may be made by those skilled in the art without departing from the principle of the invention, and such modifications and embellishments should also be considered as within the scope of the invention.

Claims (10)

1. A text data enhancement method, characterized in that: it comprises a process of screening similar texts from a first supplementary database and a second supplementary database, wherein the first supplementary database is derived from a knowledge base in a domain close to that of a basic data set, and the second supplementary database is derived from synonyms of entity words in the basic data set.
2. The data enhancement method of claim 1, wherein: the first supplementary database is obtained by web page crawling of the entity words it contains, and the second supplementary database is obtained by web page crawling of synonyms of the entity words it contains.
3. The data enhancement method of claim 1, wherein the similar texts are determined through the following process:
S51: performing word segmentation and labeling on the short texts from the first supplementary database and the second supplementary database, and calculating the word vector cosine similarity between the separated entity words, i.e., the entity word similarity;
S52: calculating the word vector cosine similarity between the remaining words other than the separated entity words, pairing words of the same part of speech whose similarity is greater than a threshold as overlapping words, and calculating the weighted similarity of the overlapping words under part-of-speech features, i.e., the overlapping word similarity;
S53: taking a weighted average of the entity word similarity and the overlapping word similarity to obtain the text similarity;
wherein the texts in the first and second supplementary databases are iteratively evaluated for text similarity, and in each iteration the two texts with the maximum text similarity are the similar texts.
4. The data enhancement method of claim 1, wherein: the synonyms are obtained by synonym fission, the synonym fission comprising: acquiring from a corpus the words whose word vectors are cosine-similar to those of the entity words in the basic data set, i.e., the synonyms of the entity words.
5. The data enhancement method of claim 4, wherein: the number of synonyms per fission is set to 1-4, preferably 3.
6. The data enhancement method according to any one of claims 1 to 5, characterized by: the Word vector is obtained by Word2Vec model conversion.
7. A knowledge element extraction method, characterized in that: the extraction method is implemented by a trained extraction model, and the training of the model is based on a labeled data set enhanced by the data enhancement method of any one of claims 1-6.
8. The knowledge element extraction method according to claim 7, wherein: the extraction model is a bidirectional long short-term memory network model.
9. The knowledge element extraction method according to claim 8, wherein: the extraction model comprises an input layer, a word embedding layer, a bidirectional LSTM layer, and a normalized exponential function layer, wherein the input layer represents each word in a sentence by its index in a word list, and the word embedding layer uses pre-trained Chinese word vectors;
preferably, the word vector dimension is set to 300 and the hidden layer dimension of the bidirectional LSTM layer is set to 256.
10. The knowledge element extraction method according to claim 9, wherein: the input form of the input layer is a combination of characters and words.
CN202010777706.4A 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method Active CN111950264B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010777706.4A CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010777706.4A CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Publications (2)

Publication Number Publication Date
CN111950264A true CN111950264A (en) 2020-11-17
CN111950264B CN111950264B (en) 2024-04-26

Family

ID=73339486

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010777706.4A Active CN111950264B (en) 2020-08-05 2020-08-05 Text data enhancement method and knowledge element extraction method

Country Status (1)

Country Link
CN (1) CN111950264B (en)



Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156286A (en) * 2016-06-24 2016-11-23 广东工业大学 Type extraction system and method towards technical literature knowledge entity
CN108052593A (en) * 2017-12-12 2018-05-18 山东科技大学 A kind of subject key words extracting method based on descriptor vector sum network structure
CN109376352A (en) * 2018-08-28 2019-02-22 中山大学 A kind of patent text modeling method based on word2vec and semantic similarity
CN109408642A (en) * 2018-08-30 2019-03-01 昆明理工大学 A kind of domain entities relation on attributes abstracting method based on distance supervision
CN109284396A (en) * 2018-09-27 2019-01-29 北京大学深圳研究生院 Medical knowledge map construction method, apparatus, server and storage medium
US20200134058A1 (en) * 2018-10-29 2020-04-30 Beijing Jingdong Shangke Information Technology Co., Ltd. System and method for building an evolving ontology from user-generated content
US20200233917A1 (en) * 2019-01-23 2020-07-23 Keeeb Inc. Data processing system for data search and retrieval augmentation and enhanced data storage
CN110502644A (en) * 2019-08-28 2019-11-26 同方知网(北京)技术有限公司 A kind of field level dictionary excavates the Active Learning Method of building
CN111143574A (en) * 2019-12-05 2020-05-12 大连民族大学 Query and visualization system construction method based on minority culture knowledge graph

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zhou Yanping et al., "Sentence semantic similarity method based on the synonym lexicon Cilin and its application in question-answering systems", Computer Applications and Software, vol. 36, no. 08, 12 August 2019 (2019-08-12), pages 65-68 *
Hu Longmao et al., "Recognition of identical product features based on multi-dimensional similarity and sentiment word expansion", Journal of Shandong University (Engineering Science), vol. 50, no. 02, 23 March 2020 (2020-03-23), pages 50-59 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112632993A (en) * 2020-11-27 2021-04-09 浙江工业大学 Electric power measurement entity recognition model classification method based on convolution attention network
CN113158648A (en) * 2020-12-09 2021-07-23 中科讯飞互联(北京)信息科技有限公司 Text completion method, electronic device and storage device
CN113221574A (en) * 2021-05-31 2021-08-06 云南锡业集团(控股)有限责任公司研发中心 Named entity recognition method, device, equipment and computer readable storage medium
CN113779959A (en) * 2021-08-31 2021-12-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN113901207A (en) * 2021-09-15 2022-01-07 昆明理工大学 Adverse drug reaction detection method based on data enhancement and semi-supervised learning
CN113901207B (en) * 2021-09-15 2024-04-26 昆明理工大学 Adverse drug reaction detection method based on data enhancement and semi-supervised learning
CN114706975A (en) * 2022-01-19 2022-07-05 天津大学 Text classification method for power failure news by introducing data enhancement SA-LSTM
CN116541535A (en) * 2023-05-19 2023-08-04 北京理工大学 Automatic knowledge graph construction method, system, equipment and medium

Also Published As

Publication number Publication date
CN111950264B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN111950264B (en) Text data enhancement method and knowledge element extraction method
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
CN110222160A (en) Intelligent semantic document recommendation method, device and computer readable storage medium
CN111581350A (en) Multi-task learning, reading and understanding method based on pre-training language model
CN111159485B (en) Tail entity linking method, device, server and storage medium
CN115858758A (en) Intelligent customer service knowledge graph system with multiple unstructured data identification
CN113704416B (en) Word sense disambiguation method and device, electronic equipment and computer-readable storage medium
CN112507109A (en) Retrieval method and device based on semantic analysis and keyword recognition
CN115495555A (en) Document retrieval method and system based on deep learning
CN113609844A (en) Electric power professional word bank construction method based on hybrid model and clustering algorithm
CN112860889A (en) BERT-based multi-label classification method
CN113434636A (en) Semantic-based approximate text search method and device, computer equipment and medium
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN108536781B (en) Social network emotion focus mining method and system
CN115017425B (en) Location search method, location search device, electronic device, and storage medium
CN115859980A (en) Semi-supervised named entity identification method, system and electronic equipment
Dusserre et al. Bigger does not mean better! We prefer specificity
CN114997288A (en) Design resource association method
CN110298046B (en) Translation model training method, text translation method and related device
CN116797195A (en) Work order processing method, apparatus, computer device, and computer readable storage medium
Lin et al. Enhanced BERT-based ranking models for spoken document retrieval
CN115017912A (en) Double-target entity emotion analysis method for multi-task learning
CN117828024A (en) Plug-in retrieval method, device, storage medium and equipment
CN116187347A (en) Question and answer method and device based on pre-training model, electronic equipment and storage medium
CN113590768B (en) Training method and device for text relevance model, question answering method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant