CN109582787A

CN109582787A - Entity classification method and device for thermal power domain corpus data

Info

Publication number: CN109582787A
Application number: CN201811311803.3A
Authority: CN
Inventors: 唐静; 彭轩; 彭一轩; 解来甲
Original assignee: Yuanguang Software Co Ltd
Current assignee: Yuanguang Software Co Ltd
Priority date: 2018-11-05
Filing date: 2018-11-05
Publication date: 2019-04-05
Anticipated expiration: 2038-11-05
Also published as: CN109582787B

Abstract

The present invention relates to an entity classification method and device for thermal power domain corpus data, belonging to the technical field of thermal power generation. The method includes: performing an initial classification of a to-be-classified text set S containing thermal power domain corpus data, to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2; extracting the entity new words in the unsuccessfully classified text set S2 and building an entity new-word list E; and aligning the entity new words in the list, one by one, with the successfully classified text set S1 to confirm the entity category of each entity new word. Using text data from the thermal power domain, the invention combines an unsupervised domain-vocabulary discovery algorithm with a text classification algorithm to classify the entities in power generation corpus data; the thermal power domain dictionary built in the process can also serve as corpus support for text data mining in this field.

Description

Entity classification method and device for thermal power domain corpus data

Technical field

The present invention relates to the technical field of thermal power generation, and in particular to an entity classification method and device for thermal power domain corpus data.

Background art

Text is a typical form of unstructured or semi-structured data, and processing it has long been one of the central topics of data mining.

Analyzing and mining text data in the thermal power domain is of great significance for thermal power companies: it supports periodic defect reviews, the construction of an enterprise knowledge graph as part of long-term informatization, a company-wide view of the operation and health of production equipment, multidimensional data fusion, and deeper knowledge mining.

At present, the analysis and mining of thermal power text data is still in its infancy. The main reason is that the documents accumulated in the thermal power domain have not yet been organized into a complete corpus; with an insufficient corpus, many statistical machine learning methods perform poorly, and natural language processing methods struggle to extract meaningful results from the text.

The routine operating records of a power generation enterprise consist mainly of tour sheets and defect records. When the entities in such corpus data are classified, the name of a piece of equipment in a given entry may be worded differently depending on the writer's personal terminology habits, so a classification based on standard equipment names cannot assign the corresponding records correctly.

Summary of the invention

In view of the above analysis, the present invention aims to provide an entity classification method and device for thermal power domain corpus data that combine a statistics-based new-word recognition method with a classification algorithm to classify the entities in power generation text corpus data.

The object of the present invention is mainly achieved through the following technical solutions:

An entity classification method for thermal power domain corpus data, comprising the following steps:

performing an initial classification of a to-be-classified text set S containing thermal power domain corpus data, to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;

extracting, by means of an established candidate new-word dictionary, the entity new words in the unsuccessfully classified text set S2, and building an entity new-word list E;

aligning the entity new words in the entity new-word list E, one by one, with the successfully classified text set S1, to obtain an entity alignment result;

determining the entity category of each entity new word according to the entity alignment result.

Further, the method of constructing the candidate new-word dictionary comprises:

establishing a domain dictionary candidate word set;

quantifying the candidate words in the domain dictionary candidate word set;

applying threshold screening to the quantified candidate words to form a domain dictionary;

removing the general-purpose words from the domain dictionary to form the candidate new-word dictionary.

Further, establishing the domain dictionary candidate word set comprises:

preprocessing the thermal power domain corpus data;

cutting the preprocessed corpus data into substrings;

segmenting the resulting substrings into words, which form the domain dictionary candidate word set.

Further, quantifying the candidate words comprises quantifying their word frequency, internal cohesion, degree of freedom, and positional word-formation probability.

Further, the thresholds set for the threshold screening comprise a word frequency threshold, a cohesion threshold, left and right adjacent-word information entropy thresholds, and a positional word-formation probability threshold.

Further, the initial classification comprises:

establishing the to-be-classified text set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set;

establishing the registered entity device list $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity;

preprocessing the texts to be classified, including removing digits and letters and splitting records;

classifying the preprocessed text set S according to the entity device list N, to obtain the successfully classified document sample space $S1 = \{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where k is the total number of entity categories in S1 and $Sn_j$ is the document subset belonging to entity category $n_j$.

Further, aligning the entity new words with the successfully classified text set S1 comprises:

establishing the document subset Se of S2 (Se ⊆ S2) that contains the entity new words of list E;

computing the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ of the successfully classified text set S1, where e is an entity new word in the new-word list E and $n_j$ is an entity category of the successfully classified text set S1;

selecting the document subset $Sn_j$ for which the maximum value of the distance d occurs most often, and assigning the entity new word e to the entity category to which that document subset $Sn_j$ belongs.

Further, entity new words that cannot be aligned are classified by creating a new entity category, and the newly created entity category is added to the registered entity device list N.

Further, the entity new-word list E containing the entity new words e, together with the entity category assigned to each entity new word e, is submitted to the user through human-computer interaction for final confirmation.

An entity classification device for thermal power domain corpus data, comprising an initial classification module, a candidate new-word dictionary, a new-word extraction module and an entity alignment module;

the initial classification module is configured to perform an initial classification of the input to-be-classified text set S containing thermal power domain corpus data, to obtain the successfully classified text set S1 and the unsuccessfully classified text set S2;

the candidate new-word dictionary is configured to store the entity new words of the thermal power domain;

the new-word extraction module is connected to both the initial classification module and the candidate new-word dictionary, and is configured to receive the unsuccessfully classified text set S2 from the initial classification module, extract the entity new words in S2 according to the contents of the candidate new-word dictionary, and build the entity new-word list E;

the entity alignment module is connected to both the initial classification module and the new-word extraction module, and is configured to receive the successfully classified text set S1 output by the initial classification module and the entity new-word list E output by the new-word extraction module, align the entity new words in E one by one with the successfully classified text set S1 to obtain an entity alignment result, and determine the entity category of each entity new word according to that result.

The beneficial effects of the present invention are as follows:

Using text data from the thermal power domain, the invention combines an unsupervised domain-vocabulary discovery algorithm with a text classification algorithm to classify the entities in power generation corpus data; the thermal power domain dictionary built in the process can also serve as corpus support for text data mining in this field.

Brief description of the drawings

The drawings are provided only to illustrate specific embodiments and are not to be construed as limiting the invention; throughout the drawings, identical reference symbols denote identical components.

Fig. 1 is a flowchart of the entity classification method according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of the composition and connections of the entity classification device according to an embodiment of the present invention.

Detailed description of the embodiments

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form a part of this application and, together with the embodiments, serve to explain the principles of the invention.

An embodiment of the present invention discloses an entity classification method for thermal power domain corpus data which, as shown in Fig. 1, includes the following steps:

Step S1: perform an initial classification of the to-be-classified text set S containing thermal power domain corpus data.

1) Establish the input data for classification.

The input data specifically include:

the to-be-classified text set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set that corresponds to one of the equipment entities, and m is the number of text records;

the registered entity device list $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity, each category consists of one or more equipment names, and k is the total number of entries in the entity device list.

2) Preprocess the texts in the to-be-classified text set S.

To remove redundant information that is useless for classification, the texts to be classified are preprocessed, for example by removing digits and letters and by splitting records, which makes them more concise.

3) Classify the preprocessed text set S according to the entity device list N.

Classifying the text set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$ yields the successfully classified text set S1 and the unsuccessfully classified text set S2.

The document sample space of the successfully classified text set S1 is $\{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where k is the total number of entity categories in S1 and $Sn_j$ is the document subset belonging to entity category $n_j$.
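Step S1 can be illustrated with a minimal Python sketch. The text does not say how a record is matched against the entity device list N, so the substring matching used here is an assumption, as are the function names `preprocess` and `initial_classification`, the choice of record separators, and the sample records.

```python
import re

def preprocess(record):
    """Split a raw record into sub-records and strip digits and ASCII letters."""
    parts = re.split(r"[;；。\n]+", record)  # assumed record separators
    cleaned = (re.sub(r"[0-9A-Za-z]+", "", part).strip() for part in parts)
    return [part for part in cleaned if part]

def initial_classification(records, device_list):
    """device_list maps a category number n_j to its equipment names.

    Returns (s1, s2): s1 maps n_j to the records that mention one of its
    equipment names, s2 lists the records that matched no category.
    """
    s1 = {category: [] for category in device_list}
    s2 = []
    for raw in records:
        for record in preprocess(raw):
            hit = next(
                (category for category, names in device_list.items()
                 if any(name in record for name in names)),
                None,
            )
            if hit is None:
                s2.append(record)
            else:
                s1[hit].append(record)
    return s1, s2

# Hypothetical sample data: category 1 = feed-water pump, category 2 = superheater.
device_list = {1: ["给水泵"], 2: ["过热器"]}
records = ["2A给水泵振动大；磨煤机异响", "过热器管泄漏"]
s1, s2 = initial_classification(records, device_list)
```

In this sketch a record goes to the first category whose equipment name it contains; a real system would need a rule for records that mention several devices.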

Step S2: by means of an established candidate new-word dictionary, extract the entity new words in the unsuccessfully classified text set S2 and build the entity new-word list E.

The method of building the candidate new-word dictionary in this step includes:

1) Establish the domain dictionary candidate word set.

The thermal power domain corpus texts accumulated by a thermal power company can be used to establish the domain dictionary candidate word set; the corpus data mainly include tour sheets, defect reports and the like.

The accumulated thermal power domain corpus texts are preprocessed. The specific preprocessing operations include de-duplicating the data and removing characters that are clearly not part of entity words, such as letters, symbols and digits, so that the corpus data passed to subsequent processing are more concise.

In the preprocessed corpus texts, the sentences are cut into substrings at symbols such as spaces and line breaks.

The substrings are then segmented into words, which form the domain dictionary candidate word set.

In particular, an N-gram algorithm can be used to cut each substring into N-grams. The resulting fragments include thermal power equipment names, the habitual expressions of technical staff in this field, and the specialized descriptions of equipment faults in this field, and they form the domain dictionary candidate word set.

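For example, the corpus substring meaning "internal leakage at the primary valve of the atmospheric vent after the high-temperature superheater" is cut with the N-gram algorithm (N = 6). The cut operates on the characters of the original Chinese string, so the candidates listed below are literal renderings of successive character fragments, and some of them are not meaningful words. The candidate word set obtained after cutting includes: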

High temperature

High temperature mistake

Hyperthermia and superheating

High temperature superheater

After high temperature superheater

Warm mistake

Temperature overheat

Warm superheater

After warm superheater

It is right after warm superheater

...
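The cutting described above can be sketched in Python as follows. The sketch assumes that fragments are taken character by character, that the shortest fragment has two characters (`min_n=2`), and that N is the maximum fragment length; the function names and the sample string, which is only a guess at the kind of text used in the example, are hypothetical.

```python
import re

def split_substrings(text):
    """Cut text into substrings at spaces and line breaks."""
    return [s for s in re.split(r"\s+", text) if s]

def char_ngrams(substring, min_n=2, max_n=6):
    """Return every contiguous character fragment of length min_n..max_n,
    ordered by start position and then by length."""
    grams = []
    for start in range(len(substring)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(substring):
                grams.append(substring[start:start + n])
    return grams

# Hypothetical sample: seven characters starting with "high-temperature superheater".
candidates = char_ngrams("高温过热器后对")
```

Every fragment is kept at this stage, meaningful or not; the statistical measures and thresholds applied afterwards are what separate real words from accidental fragments.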

2) Quantify the candidate words in the domain dictionary candidate word set.

The quantitative criteria for candidate words include word frequency, internal cohesion, degree of freedom, and positional word-formation probability.

Internal cohesion is expressed by the formula $\mathrm{pmi}(x, y) = \log \frac{p(xy)}{p(x)\,p(y)}$, where x and y denote two different words in the corpus, $p(xy)$ is the probability that x and y appear together in the corpus, $p(x)$ is the probability that x appears on its own in the corpus, and $p(y)$ is the probability that y appears on its own in the corpus. When $\mathrm{pmi}(x, y) \gg 0$, x and y are highly correlated, that is, they frequently occur together, and the string xy is more likely to form a new word.
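As a small worked illustration of internal cohesion, the sketch below computes pointwise mutual information from raw counts. The function name `pmi`, the use of the natural logarithm, the estimate of each probability as count divided by a common total, and the counts themselves are all assumptions made for the example.

```python
import math

def pmi(x, y, counts, total):
    """log( p(xy) / (p(x) * p(y)) ), with each probability estimated as count / total."""
    p_xy = counts[x + y] / total
    p_x = counts[x] / total
    p_y = counts[y] / total
    return math.log(p_xy / (p_x * p_y))

# Hypothetical counts: the two characters co-occur five times more often than chance.
counts = {"高": 10, "温": 16, "高温": 8}
score = pmi("高", "温", counts, total=100)
```

A score well above zero suggests that the two parts stick together; a score near zero means they co-occur about as often as independent parts would.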

The degree of freedom is measured by the information entropy of the left and right adjacent words, that is, degree of freedom = min(left adjacent-word information entropy, right adjacent-word information entropy), with

$$H_L(w) = -\sum_{w_l \in s_l} p(w_l \mid w)\,\log p(w_l \mid w), \qquad H_R(w) = -\sum_{w_r \in s_r} p(w_r \mid w)\,\log p(w_r \mid w).$$

In these formulas, $s_l$ is the set of left adjacent words of the candidate word w and $s_r$ is the set of its right adjacent words; $p(w_l \mid w)$ is the conditional probability that the left adjacent word is $w_l$ given that the candidate word w occurs, and $p(w_r \mid w)$ is the conditional probability that the right adjacent word is $w_r$ given that the candidate word w occurs.
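The degree of freedom can be sketched as follows. The sketch treats the single character immediately before or after each occurrence as the adjacent word, uses the natural logarithm, and returns 0 when a candidate never has a neighbor on one side; these choices and the function names are assumptions.

```python
import math
from collections import Counter

def neighbor_entropy(neighbors):
    """Shannon entropy of the empirical distribution of adjacent characters."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def degree_of_freedom(word, corpus):
    """min(left entropy, right entropy) over all occurrences of word in corpus."""
    left, right = [], []
    for text in corpus:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            if start > 0:
                left.append(text[start - 1])
            if end < len(text):
                right.append(text[end])
            start = text.find(word, start + 1)
    if not left or not right:
        return 0.0
    return min(neighbor_entropy(left), neighbor_entropy(right))
```

A candidate that is always preceded or followed by the same character gets a degree of freedom of zero, which indicates that it is probably a fragment of a longer word.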

The positional word-formation probability is $P(c_i, i) = \frac{N(c_i, i)}{N(c_i)}$, where i is the position at which the character $c_i$ occurs within a word, $N(c_i, i)$ is the frequency with which $c_i$ occurs at position i over all words, and $N(c_i)$ is the total frequency of $c_i$ in the corpus.
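A minimal sketch of the positional word-formation probability follows. As a simplification, both counts are taken from the same word list, whereas the text counts the total frequency over the corpus; the function name and the toy word list are hypothetical.

```python
from collections import Counter

def position_probabilities(words):
    """Return {(character, position): N(c, i) / N(c)} estimated from a word list."""
    at_position = Counter()  # N(c, i): character c seen at position i of a word
    overall = Counter()      # N(c): character c seen anywhere
    for word in words:
        for i, ch in enumerate(word):
            at_position[(ch, i)] += 1
            overall[ch] += 1
    return {key: n / overall[key[0]] for key, n in at_position.items()}

probs = position_probabilities(["ab", "ac", "ba"])
```

In the toy list, `a` opens two of the three words it appears in, so its probability at position 0 is 2/3.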

3) Apply threshold screening to the quantified candidate words to form the domain dictionary.

The thresholds set for screening include a word frequency threshold, a cohesion threshold, left and right adjacent-word information entropy thresholds, and a positional word-formation probability threshold.

The degree-of-freedom threshold is determined by setting the left and right adjacent-word information entropy thresholds.

The cohesion threshold and the degree-of-freedom threshold are applied together to screen the words in the candidate word set, yielding the words used in this domain.

With the word frequency threshold set, a candidate word whose frequency exceeds the threshold is regarded as a commonly used word of this domain; the words that pass this screening form the domain dictionary.

With the positional word-formation probability threshold set, the word-formation positions in the generated domain dictionary are evaluated and judged, which improves word-formation accuracy.
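The screening can be sketched as a conjunction of threshold tests. The text gives no threshold values and does not say how the per-character positional probability is aggregated for a whole word, so the default values, the single `position_prob` field, the strict comparisons and all names below are assumptions for illustration only.

```python
from dataclasses import dataclass

@dataclass
class CandidateStats:
    frequency: int        # word frequency
    cohesion: float       # internal cohesion (PMI)
    freedom: float        # min(left entropy, right entropy)
    position_prob: float  # assumed word-level positional score

def passes(stats, min_frequency=5, min_cohesion=2.0,
           min_freedom=0.5, min_position_prob=0.1):
    """A candidate is kept only if every measure exceeds its threshold."""
    return (stats.frequency > min_frequency
            and stats.cohesion > min_cohesion
            and stats.freedom > min_freedom
            and stats.position_prob > min_position_prob)

def build_domain_dictionary(candidates, **thresholds):
    """Return the sorted candidate words that pass all thresholds."""
    return sorted(w for w, s in candidates.items() if passes(s, **thresholds))

# Hypothetical statistics for three candidates.
candidates = {
    "superheater": CandidateStats(40, 5.1, 1.2, 0.6),
    "heater after": CandidateStats(12, 0.4, 0.9, 0.5),
    "rare": CandidateStats(2, 6.0, 1.5, 0.7),
}
domain_dictionary = build_domain_dictionary(candidates)
```

In practice the thresholds would be tuned on the corpus at hand, since frequency and entropy values scale with corpus size.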

4) Compare the domain dictionary with a general dictionary and remove the general-purpose words from the domain dictionary to form the candidate new-word dictionary.

The domain dictionary built in the previous step has not yet been checked for specialized words, so it still contains general-purpose words used in this field; these words are unrelated to equipment and need no entity classification. The domain dictionary is therefore compared against a general dictionary (for example, a power plant terminology dictionary dating from the 1980s, an earlier national standard general vocabulary), and the general-purpose words are removed to form the candidate new-word dictionary.

The words in the unsuccessfully classified text set S2 are then compared against the established candidate new-word dictionary; the entity new words that occur in S2 and belong to the dictionary are extracted to build the entity new-word list E.
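The general-word removal and the extraction of entity new words from S2 can be sketched with set operations and substring tests. The function names, the longest-first ordering of the result and the sample words are assumptions.

```python
def build_candidate_dictionary(domain_words, general_words):
    """Remove general-purpose words from the domain dictionary."""
    return set(domain_words) - set(general_words)

def extract_entity_new_words(s2_records, candidate_dictionary):
    """Return the dictionary entries that occur in at least one S2 record,
    longest first and then in lexical order (an assumed ordering)."""
    found = []
    for word in sorted(candidate_dictionary, key=lambda w: (-len(w), w)):
        if any(word in record for record in s2_records):
            found.append(word)
    return found

# Hypothetical data: "检查" (inspect) is a general word, the others are equipment names.
dictionary = build_candidate_dictionary({"磨煤机", "检查", "引风机"}, {"检查"})
new_word_list = extract_entity_new_words(["磨煤机异响检查"], dictionary)
```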

In particular, to make the entity new-word list E more accurate, the extracted results are finally confirmed by the user through human-computer interaction.

Step S3: align the entity new words in the entity new-word list, one by one, with the successfully classified text set S1, and confirm the entity category of each entity new word.

The specific alignment process includes:

1) Establish the document subset Se of S2 (Se ⊆ S2) that contains the entity new words of list E.

2) Compute the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ of the successfully classified text set S1, where e is an entity new word in the new-word list E and $n_j$ is an entity category of the successfully classified text set S1.

3) Select the document subset $Sn_j$ for which the maximum value of the distance d occurs most often, and assign the entity new word e to the entity category to which that document subset $Sn_j$ belongs.

4) Update the document subset $Sn_j$ of the successfully classified text set S1 and repeat the above process until the document subset Se has been merged into a document subset $Sn_j$.

In particular, because thermal power equipment is updated over time, there may be new equipment that has not yet been entered in the entity device list N; the entity new words related to such new equipment cannot be aligned by the alignment process above.

Entity new words that cannot be aligned need to be classified by creating a new entity category, and the newly created entity category is added to the registered entity device list N.
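The alignment step can be sketched as a per-document vote. The text does not define the distance d, so this sketch substitutes a Jaccard overlap of character bigrams, a similarity in which larger means closer, and reads "the maximum value occurs most often" as a majority vote over the S2 documents containing the new word. The `min_score` cut-off that decides when alignment fails, and all names and sample records, are likewise assumptions.

```python
from collections import Counter

def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def similarity(doc, subset_docs):
    """Jaccard overlap of character bigrams; stands in for the unspecified d(e, n_j)."""
    a = bigrams(doc)
    b = set().union(*(bigrams(d) for d in subset_docs)) if subset_docs else set()
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def align(new_word, s2_records, s1, min_score=0.1):
    """Each S2 record containing new_word votes for its closest category in S1."""
    votes = Counter()
    for doc in (r for r in s2_records if new_word in r):
        context = doc.replace(new_word, "")  # compare the surrounding description
        scores = {cat: similarity(context, docs) for cat, docs in s1.items()}
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= min_score:
                votes[best] += 1
    return votes.most_common(1)[0][0] if votes else None

def classify_new_word(new_word, s2_records, s1, device_list):
    """Align new_word, or open a new category when alignment fails, then merge."""
    category = align(new_word, s2_records, s1)
    if category is None:
        category = max(device_list, default=0) + 1
        device_list[category] = []
        s1[category] = []
    device_list[category].append(new_word)
    s1[category].extend(r for r in s2_records if new_word in r)
    return category

# Hypothetical data: category 1 = feed-water pump, category 2 = superheater.
device_list = {1: ["给水泵"], 2: ["过热器"]}
s1 = {1: ["给水泵振动大", "给水泵轴承温度高"], 2: ["过热器管泄漏"]}
s2 = ["小机给泵振动大", "小机给泵轴承温度高", "脱硝喷枪堵塞"]
first = classify_new_word("小机给泵", s2, s1, device_list)
second = classify_new_word("脱硝喷枪", s2, s1, device_list)
```

Here the colloquial pump name lands in category 1 because its records share fault descriptions with that category, while the second word matches nothing and receives a new category number.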

In particular, to make the classification of the entity new word e more accurate, the classification results are finally confirmed by the user through human-computer interaction.

An embodiment of the present invention also discloses an entity classification device for thermal power domain corpus data which, as shown in Fig. 2, includes an initial classification module, a candidate new-word dictionary, a new-word extraction module and an entity alignment module.

The candidate new-word dictionary is used to store the entity new words of the thermal power domain.

Optionally, the method of constructing the candidate new-word dictionary includes:

1) Establish the domain dictionary candidate word set.

The domain dictionary built in the previous step has not yet been checked for specialized words, so it still contains general-purpose words used in this field; these words are unrelated to equipment and need no entity classification. The domain dictionary is therefore compared against a general dictionary, and the general-purpose words are removed to form the candidate new-word dictionary.

In conclusion the entity classification method and device for the field of thermal power corpus data that the embodiment of the present invention provides, It is comprehensive using unsupervised specialized vocabulary discovery algorithm and text classification algorithm, realization pair using field of thermal power text data Generate electricity corpus data entity classification, constructed by thermal power generation specialized dictionary can also be used for text data digging in the field Corpus support.

The foregoing is only a preferred embodiment of the present invention, but scope of protection of the present invention is not limited thereto, In the technical scope disclosed by the present invention, any changes or substitutions that can be easily thought of by anyone skilled in the art, It should be covered by the protection scope of the present invention.

Claims

1. An entity classification method for thermal power domain corpus data, characterized by comprising the following steps:

performing an initial classification of a to-be-classified text set S containing thermal power domain corpus data, to obtain a successfully classified text set S1 and an unsuccessfully classified text set S2;

extracting, by means of an established candidate new-word dictionary, the entity new words in the unsuccessfully classified text set S2, and building an entity new-word list E;

aligning the entity new words in the entity new-word list E, one by one, with the successfully classified text set S1, to obtain an entity alignment result;

2. The entity classification method according to claim 1, characterized in that the method of constructing the candidate new-word dictionary comprises:

establishing a domain dictionary candidate word set;

quantifying the candidate words in the domain dictionary candidate word set;

3. The entity classification method according to claim 2, characterized in that establishing the domain dictionary candidate word set comprises:

preprocessing the thermal power domain corpus data;

4. The entity classification method according to claim 2, characterized in that quantifying the candidate words comprises quantifying their word frequency, internal cohesion, degree of freedom, and positional word-formation probability.

5. The entity classification method according to claim 4, characterized in that the thresholds set for the threshold screening comprise a word frequency threshold, a cohesion threshold, left and right adjacent-word information entropy thresholds, and a positional word-formation probability threshold.

6. The entity classification method according to claim 1 or 2, characterized in that the initial classification comprises:

establishing the to-be-classified text set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set;

establishing the registered entity device list $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity;

classifying the preprocessed text set S according to the entity device list N, to obtain the successfully classified document sample space $S1 = \{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where k is the total number of entity categories in S1 and $Sn_j$ is the document subset belonging to entity category $n_j$.

7. The entity classification method according to claim 1, characterized in that aligning the entity new words with the successfully classified text set S1 comprises:

computing the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ of the successfully classified text set S1, where e is an entity new word in the new-word list E and $n_j$ is an entity category of the successfully classified text set S1;

selecting the document subset $Sn_j$ for which the maximum value of the distance d occurs most often, and assigning the entity new word e to the entity category to which that document subset $Sn_j$ belongs.

8. The entity classification method according to claim 7, characterized in that entity new words that cannot be aligned are classified by creating a new entity category, and the newly created entity category is added to the registered entity device list N.

9. The entity classification method according to claim 8, characterized in that the entity new-word list E containing the entity new words e, together with the entity category assigned to each entity new word e, is submitted to the user through human-computer interaction for final confirmation.

10. An entity classification device for thermal power domain corpus data, characterized by comprising an initial classification module, a candidate new-word dictionary, a new-word extraction module and an entity alignment module;

the initial classification module is configured to perform an initial classification of the input to-be-classified text set S containing thermal power domain corpus data, to obtain the successfully classified text set S1 and the unsuccessfully classified text set S2;

the new-word extraction module is connected to both the initial classification module and the candidate new-word dictionary, and is configured to receive the unsuccessfully classified text set S2 from the initial classification module, extract the entity new words in S2 according to the contents of the candidate new-word dictionary, and build the entity new-word list E;

the entity alignment module is connected to both the initial classification module and the new-word extraction module, and is configured to receive the successfully classified text set S1 output by the initial classification module and the entity new-word list E output by the new-word extraction module, align the entity new words in E one by one with the successfully classified text set S1 to obtain an entity alignment result, and determine the entity category of each entity new word according to that result.