CN109582787A - Entity classification method and device for thermal power generation corpus data - Google Patents

Entity classification method and device for thermal power generation corpus data

Info

Publication number
CN109582787A
Authority
CN
China
Prior art keywords
entity
word
field
neologisms
dictionary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811311803.3A
Other languages
Chinese (zh)
Other versions
CN109582787B (en
Inventor
唐静
彭轩
彭一轩
解来甲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuanguang Software Co Ltd
Original Assignee
Yuanguang Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuanguang Software Co Ltd
Priority to CN201811311803.3A
Publication of CN109582787A
Application granted
Publication of CN109582787B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to an entity classification method and device for thermal power generation corpus data, and belongs to the technical field of thermal power generation. The method includes: performing an initial classification on a text set S to be classified that contains thermal power generation corpus data, obtaining a successfully classified text set S1 and an unclassified text set S2; extracting the new entity words in the unclassified text set S2 and building a new entity word list E; and aligning the new entity words in the list one by one with the successfully classified text set S1 to confirm the entity category of each new entity word. The invention uses text data from the thermal power generation domain and combines an unsupervised domain-vocabulary discovery algorithm with a text classification algorithm to classify the entities in power generation corpus data. The resulting thermal power generation lexicon can also serve as corpus support for text data mining in this domain.

Description

Entity classification method and device for thermal power generation corpus data
Technical Field
The present invention relates to the technical field of thermal power generation, and in particular to an entity classification method and device for thermal power generation corpus data.
Background
Text is typical unstructured or semi-structured data, and processing it has long been an active topic in data mining.
Analyzing and mining text data from the thermal power generation domain is of great significance. It supports the periodic defect inventories of thermal power enterprises and the construction of enterprise knowledge graphs as part of long-term informatization, and it helps enterprises understand the operation and health status of production equipment at a global level, fuse multidimensional data, and mine deeper knowledge.
At present, the analysis and mining of text data in the thermal power generation domain is still in its infancy. The main reason is that the documents accumulated in this domain have not yet been organized into a complete corpus, and without sufficient corpus data many statistical machine learning methods perform poorly. It is therefore difficult to mine meaningful results from the text with natural language processing methods.
In power generation enterprises, routine operation records mainly consist of inspection sheets and defect records. When the entities in this corpus data are classified, the name of a device in an entry may be worded differently because of individual writing habits, so classification against standard device names fails to assign the corresponding records correctly.
Summary of the Invention
In view of the above analysis, the present invention aims to provide an entity classification method and device for thermal power generation corpus data that combine a statistics-based new-word recognition method with a classification algorithm to classify the entities in power generation text corpus data.
The purpose of the present invention is mainly achieved through the following technical solutions:
An entity classification method for thermal power generation corpus data includes the following steps:
performing an initial classification on a text set S to be classified that contains thermal power generation corpus data, obtaining a successfully classified text set S1 and an unclassified text set S2;
extracting, by means of an established candidate new-word lexicon, the new entity words in the unclassified text set S2, and building a new entity word list E;
aligning the new entity words in the new entity word list E one by one with the successfully classified text set S1 to obtain an entity alignment result;
determining the entity category of the new entity words according to the obtained entity alignment result.
Further, the method of constructing the candidate new-word lexicon includes:
establishing a domain lexicon candidate word set;
quantifying the candidate words in the domain lexicon candidate word set;
screening the quantified candidate words against thresholds to form a domain lexicon;
removing the general-purpose words from the domain lexicon to form the candidate new-word lexicon.
Further, establishing the domain lexicon candidate word set includes:
preprocessing the thermal power generation corpus data;
cutting the preprocessed corpus data into substrings;
segmenting the obtained substrings into words to form the candidate word set of the domain lexicon.
Further, the quantification of the candidate words covers word frequency, internal cohesion, degree of freedom, and positional word-formation probability.
Further, the thresholds set for the threshold screening include a word frequency threshold, a cohesion threshold, left and right adjacent-character information entropy thresholds, and a positional word-formation probability threshold.
Further, the initial classification includes:
establishing the text set to be classified, $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set;
establishing the existing entity device list $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity;
preprocessing the text to be classified, including removing digits and letters and splitting records;
classifying the preprocessed text set S according to the entity device list N to obtain the document sample space of the successfully classified set, $S1 = \{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where k is the total number of entity categories in S1 and $Sn_j$ is the document subset belonging to entity category $n_j$.
Further, aligning the new entity words with the successfully classified text set S1 includes:
establishing a document subset Se, Se ∈ S2, that contains the new entity words of list E;
calculating the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ in the successfully classified text set S1, where e is a new entity word in the list E and $n_j$ is an entity category of S1;
selecting the document subset $Sn_j$ for which the maximum value of the distance d occurs most frequently, and assigning the new entity word e to the entity category to which that document subset $Sn_j$ belongs.
Further, new entity words that cannot be aligned are classified by creating a new entity category, and the created entity category is added to the existing entity device list N.
Further, the new entity word list E, which contains each new entity word e and the entity category assigned to it, is finally confirmed by the user through human-computer interaction.
An entity classification device for thermal power generation corpus data includes an initial classification module, a candidate new-word lexicon, a new-word extraction module, and an entity alignment module;
The initial classification module performs an initial classification on the input text set S to be classified, which contains thermal power generation corpus data, and obtains a successfully classified text set S1 and an unclassified text set S2;
The candidate new-word lexicon stores the new entity words of the thermal power generation domain;
The new-word extraction module is connected to the initial classification module and to the candidate new-word lexicon. It receives the unclassified text set S2 from the initial classification module, extracts the new entity words in S2 according to the content of the candidate new-word lexicon, and builds the new entity word list E;
The entity alignment module is connected to the initial classification module and to the new-word extraction module. It receives the successfully classified text set S1 output by the initial classification module and the new entity word list E output by the new-word extraction module, aligns the new entity words in E one by one with S1 to obtain an entity alignment result, and determines the entity category of the new entity words according to that result.
The beneficial effects of the present invention are as follows:
Using text data from the thermal power generation domain, the invention combines an unsupervised domain-vocabulary discovery algorithm with a text classification algorithm to classify the entities in power generation corpus data. The resulting thermal power generation lexicon can also serve as corpus support for text data mining in this domain.
Brief Description of the Drawings
The drawings serve only to illustrate specific embodiments and are not to be construed as limiting the invention; throughout the drawings, the same reference symbols denote the same components.
Fig. 1 is a flowchart of the entity classification method of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the composition and connections of the entity classification device of an embodiment of the present invention.
Detailed Description
Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings, which form part of this application and, together with the embodiments, serve to explain the principles of the invention.
An embodiment of the present invention discloses an entity classification method for thermal power generation corpus data which, as shown in Fig. 1, includes the following steps:
Step S1: perform an initial classification on the text set S to be classified, which contains thermal power generation corpus data;
1) Establish the input data for classification;
The input data specifically includes:
the text set to be classified, $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$, where $s_i$ is a text record in the set that corresponds to one of the equipment entities, and m is the number of text records;
the existing entity device list $N = \{n_1, n_2, \ldots, n_j, \ldots, n_k\}$, where $n_j$ is the category number of an entity, each category consists of one or more device names, and k is the total number of entries in the entity device list;
2) Preprocess the text to be classified in the text set S;
To eliminate redundant information that is useless for classification, the text to be classified is preprocessed by removing digits and letters, splitting records, and similar measures, which makes it more concise;
3) Classify the preprocessed text set S according to the entity device list N;
Classifying the text set $S = \{s_1, s_2, \ldots, s_i, \ldots, s_m\}$ yields the successfully classified text set S1 and the unclassified text set S2;
The document sample space of the successfully classified text set S1 is $\{Sn_1: s_{11}, s_{12}, \ldots;\ Sn_j: s_{j1}, s_{j2}, \ldots;\ \ldots;\ Sn_k: s_{k1}, s_{k2}, \ldots\}$, where k is the total number of entity categories in S1 and $Sn_j$ is the document subset belonging to entity category $n_j$.
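The initial classification of step S1 can be sketched as follows. This is a minimal illustration, not the patented matcher: the source does not say how records are matched against the device list, so plain substring matching is assumed here, record splitting is omitted, and the device names and records are invented examples.

```python
import re

def preprocess(record):
    """Remove ASCII digits and letters, which the text treats as noise."""
    return re.sub(r"[0-9A-Za-z]+", "", record).strip()

def initial_classification(records, device_list):
    """Split records into S1 (grouped by category) and S2 (unmatched).

    device_list maps a category number n_j to the device names it covers.
    Substring matching is an assumption made for this sketch.
    """
    s1 = {}  # category number -> list of records (the subsets Sn_j)
    s2 = []  # records that matched no known device name
    for raw in records:
        text = preprocess(raw)
        hit = next((cat for cat, names in device_list.items()
                    if any(name in text for name in names)), None)
        if hit is None:
            s2.append(text)
        else:
            s1.setdefault(hit, []).append(text)
    return s1, s2

# Hypothetical device list and records (not taken from the patent).
devices = {"n1": ["给水泵"], "n2": ["引风机"]}   # feedwater pump, induced draft fan
records = ["2号给水泵轴承振动大",                 # matches n1
           "引风机出口挡板卡涩",                  # matches n2
           "A侧吸风机电机温度高"]                 # colloquial name, no match
s1, s2 = initial_classification(records, devices)
```

The third record uses a colloquial device name, so it lands in S2 and is left for the new-word extraction and alignment steps.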
Step S2: extract, by means of an established candidate new-word lexicon, the new entity words in the unclassified text set S2, and build the new entity word list E;
In this step, the method of establishing the candidate new-word lexicon includes:
1) Establish the domain lexicon candidate word set;
The domain lexicon candidate word set can be established from the thermal power generation corpus text accumulated by a thermal power generation company; this corpus data mainly includes inspection sheets, defect reports, and the like.
The accumulated corpus text is preprocessed. The preprocessing removes duplicate records and strips characters that are clearly not part of entity words, such as letters, symbols, and digits, so that the corpus data passed to later steps is more concise.
After preprocessing, the sentences in the text are cut into substrings at symbols such as spaces and newlines;
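The preprocessing and substring cutting described above might look like the sketch below. It assumes that entity words consist only of CJK ideographs in the range U+4E00 to U+9FFF, which is a simplification introduced here; the function name and sample records are hypothetical.

```python
import re

def clean_and_split(lines):
    """Deduplicate records, strip non-entity characters, cut into substrings."""
    seen = set()
    substrings = []
    for line in lines:
        if line in seen:      # skip duplicate records
            continue
        seen.add(line)
        # Keep CJK ideographs and whitespace; drop letters, digits, symbols.
        cleaned = re.sub(r"[^\u4e00-\u9fff\s]", "", line)
        # Cut at spaces and newlines.
        substrings.extend(cleaned.split())
    return substrings

sample = ["#2炉高温过热器泄漏 已处理",
          "#2炉高温过热器泄漏 已处理",   # duplicate record
          "给水泵A 振动大\n引风机"]
parts = clean_and_split(sample)
```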
The substrings are then segmented into words to form the candidate word set of the domain lexicon;
In particular, an N-gram algorithm can be used to cut each substring into N-grams, yielding words that cover thermal power generation equipment names, the habitual expressions of technicians in this domain, and technical descriptions of equipment faults; these words form the domain lexicon candidate word set.
For example, applying N-gram cutting (N = 6) to the corpus substring "高温过热器后对空排气一次门内漏" (internal leakage of the primary valve of the atmospheric vent downstream of the high-temperature superheater) yields the following candidate words:
高温
高温过
高温过热
高温过热器
高温过热器后
温过
温过热
温过热器
温过热器后
温过热器后对
...
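Character-level N-gram cutting of this kind can be sketched in a few lines. The minimum length of 2 is an assumption based on the shortest candidates listed in the example, and Latin letters stand in for the characters of a real substring.

```python
def char_ngrams(substring, max_n=6, min_n=2):
    """Return every contiguous character sequence of length min_n..max_n,
    ordered by start position and then by length."""
    grams = []
    for start in range(len(substring)):
        for n in range(min_n, max_n + 1):
            if start + n <= len(substring):
                grams.append(substring[start:start + n])
    return grams

candidates = char_ngrams("abcd", max_n=3)
```

Every start position contributes one candidate per allowed length, so the candidate set grows roughly as the substring length times the number of allowed lengths, which is why the later quantification and threshold screening are needed.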
2) Quantify the candidate words in the domain lexicon candidate word set;
The quantitative criteria for the candidate words are word frequency, internal cohesion, degree of freedom, and positional word-formation probability;
Internal cohesion is expressed by the formula $pmi(x, y) = \log \frac{p(xy)}{p(x)\,p(y)}$, where x and y denote two different words in the corpus; p(xy) is the probability that x and y occur together in the corpus; p(x) is the probability that x occurs on its own in the corpus; and p(y) is the probability that y occurs on its own in the corpus. When $pmi(x, y) \gg 0$, x and y are highly correlated, that is, they frequently occur together, and the string xy is more likely to form a new word.
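As a rough illustration, pointwise mutual information can be estimated from raw substring counts. The estimator below (occurrence count divided by total character count, logarithm base 2) is an assumption made for this sketch; the source does not state how the probabilities are estimated.

```python
import math

def count_occurrences(s, corpus):
    """Count (possibly overlapping) occurrences of s across all corpus strings."""
    return sum(1 for text in corpus
               for i in range(len(text) - len(s) + 1)
               if text[i:i + len(s)] == s)

def pmi(x, y, corpus):
    """log2( p(xy) / (p(x) * p(y)) ), with p(.) estimated from counts."""
    total = sum(len(text) for text in corpus)
    p_x = count_occurrences(x, corpus) / total
    p_y = count_occurrences(y, corpus) / total
    p_xy = count_occurrences(x + y, corpus) / total
    if p_x == 0 or p_y == 0 or p_xy == 0:
        return float("-inf")   # never observed together
    return math.log2(p_xy / (p_x * p_y))

toy_corpus = ["abab", "ab"]   # stand-in strings, not real corpus data
cohesion = pmi("a", "b", toy_corpus)
```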
The degree of freedom is measured by the left and right adjacent-character information entropies; that is, degree of freedom = min(left adjacent-character entropy, right adjacent-character entropy), where the left entropy is $-\sum_{w_l \in s_l} p(w_l \mid w) \log p(w_l \mid w)$ and the right entropy is $-\sum_{w_r \in s_r} p(w_r \mid w) \log p(w_r \mid w)$;
In these formulas, $s_l$ is the set of left-adjacent characters of candidate word w; $s_r$ is the set of right-adjacent characters of candidate word w; $p(w_l \mid w)$ is the conditional probability that the left-adjacent character is $w_l$ given that candidate word w occurs; and $p(w_r \mid w)$ is the conditional probability that the right-adjacent character is $w_r$ given that candidate word w occurs.
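The degree of freedom can be sketched as the smaller of the two neighbor entropies. In this illustration, occurrences at the very start or end of a string simply contribute no neighbor, and base-2 logarithms are used; both choices are assumptions rather than details given in the source.

```python
import math
from collections import Counter

def neighbor_entropy(word, corpus, side):
    """Shannon entropy of the characters adjacent to `word` on the given side."""
    counts = Counter()
    for text in corpus:
        start = text.find(word)
        while start != -1:
            idx = start - 1 if side == "left" else start + len(word)
            if 0 <= idx < len(text):
                counts[text[idx]] += 1
            start = text.find(word, start + 1)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def freedom(word, corpus):
    """min(left entropy, right entropy), as described in the text."""
    return min(neighbor_entropy(word, corpus, "left"),
               neighbor_entropy(word, corpus, "right"))

toy_corpus = ["xaby", "zabw", "xabw"]   # stand-in strings
score = freedom("ab", toy_corpus)
```

A word that always appears with the same neighbor on one side gets entropy 0 on that side, so taking the minimum penalizes fragments of longer words.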
The positional word-formation probability is $P(c_i, i) = \frac{N(c_i, i)}{N(c_i)}$, where i is the position at which character $c_i$ occurs; $N(c_i, i)$ is the frequency of all words in which $c_i$ occurs at position i; and $N(c_i)$ is the total frequency with which $c_i$ occurs in the corpus.
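A small sketch of this ratio, computed from a word-frequency table, is shown below. It approximates the corpus frequency of a character by its frequency inside the listed words and uses 0-based positions; both are assumptions of this illustration, and the word list is a stand-in.

```python
from collections import Counter

def position_probabilities(word_freq):
    """Map (character, position) to N(c, i) / N(c)."""
    at_position = Counter()   # N(c, i)
    total = Counter()         # N(c)
    for word, freq in word_freq.items():
        for i, ch in enumerate(word):
            at_position[(ch, i)] += freq
            total[ch] += freq
    return {key: n / total[key[0]] for key, n in at_position.items()}

probs = position_probabilities({"ab": 3, "ba": 1})
```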
3) Screen the quantified candidate words against thresholds to form the domain lexicon;
The thresholds set for screening include a word frequency threshold, a cohesion threshold, left and right adjacent-character information entropy thresholds, and a positional word-formation probability threshold;
Setting the left and right adjacent-character information entropy thresholds determines the degree-of-freedom threshold;
The cohesion threshold and the degree-of-freedom threshold are applied together to screen the words in the candidate word set, yielding words used in this domain;
A word frequency threshold is set; when the frequency of a candidate word exceeds the threshold, the word is regarded as a common word in this domain, and the words screened in this way form the domain lexicon;
A positional word-formation probability threshold is set to assess the word-formation positions of the words in the generated domain lexicon, which improves the accuracy of word formation.
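The combined screening can be sketched as a simple filter over per-word statistics. The threshold values, the strict "greater than" comparisons for cohesion, freedom, and position probability, and the reduction of position probability to a single number per word are all placeholders chosen for this illustration; the source gives no concrete values.

```python
from dataclasses import dataclass

@dataclass
class CandidateStats:
    frequency: int         # word frequency in the corpus
    cohesion: float        # internal cohesion (PMI)
    freedom: float         # min(left entropy, right entropy)
    position_prob: float   # positional word-formation probability

def screen(candidates, min_freq=5, min_cohesion=3.0,
           min_freedom=1.0, min_position_prob=0.1):
    """Keep candidates that exceed every threshold (placeholder values)."""
    return {word for word, s in candidates.items()
            if s.frequency > min_freq
            and s.cohesion > min_cohesion
            and s.freedom > min_freedom
            and s.position_prob > min_position_prob}

stats = {
    "w1": CandidateStats(12, 4.2, 1.5, 0.4),   # passes all thresholds
    "w2": CandidateStats(3, 5.0, 2.0, 0.5),    # too rare
    "w3": CandidateStats(20, 1.0, 1.8, 0.6),   # low cohesion
}
domain_lexicon = screen(stats)
```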
4) Compare the domain lexicon with a general-purpose lexicon, and remove the general-purpose words from the domain lexicon to form the candidate new-word lexicon.
The domain lexicon formed in the previous step has not been checked for domain-specific words, so it still contains general-purpose words used in this domain; these words are unrelated to equipment and need no entity classification. The domain lexicon is therefore compared with a general-purpose lexicon (for example, the power plant terminology dictionary compiled in the 1980s, an earlier national standard general vocabulary), and the general-purpose words are removed to form the candidate new-word lexicon.
The words in the unclassified text set S2 are then compared against the established candidate new-word lexicon; the new entity words contained in S2 that belong to this lexicon are extracted, and the new entity word list E is built.
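These two operations amount to a set difference followed by a lookup, as in the sketch below. Substring lookup is assumed for the comparison step, and the lexicon entries and records are invented examples.

```python
def build_candidate_lexicon(domain_lexicon, general_lexicon):
    """Remove general-purpose words from the domain lexicon."""
    return set(domain_lexicon) - set(general_lexicon)

def extract_new_entity_words(unclassified_records, candidate_lexicon):
    """Return the lexicon words found in at least one record of S2 (list E)."""
    return sorted(word for word in candidate_lexicon
                  if any(word in record for record in unclassified_records))

# Hypothetical lexicons: suction fan, coal mill, inspect, normal.
domain = {"吸风机", "磨煤机", "检查", "正常"}
general = {"检查", "正常"}
lexicon = build_candidate_lexicon(domain, general)
new_words = extract_new_entity_words(["侧吸风机电机温度高", "检查正常"], lexicon)
```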
In particular, to make the new entity word list E more accurate, the user finally confirms the result through human-computer interaction.
Step S3: align the new entity words in the new entity word list E one by one with the successfully classified text set S1, and confirm the entity category of each new entity word.
The alignment procedure includes:
1) Establish a document subset Se, Se ∈ S2, that contains the new entity words of list E;
2) Calculate the distance $d(e, n_j)$ from the document subset Se to each document subset $Sn_j$ in the successfully classified text set S1, where e is a new entity word in the list E and $n_j$ is an entity category of S1;
3) Select the document subset $Sn_j$ for which the maximum value of the distance d occurs most frequently, and assign the new entity word e to the entity category to which that document subset $Sn_j$ belongs;
4) Update the document subset $Sn_j$ of the successfully classified text set S1 and repeat the above process until the document subset Se has been merged into a document subset $Sn_j$.
In particular, as thermal power generation equipment is updated, there may be new equipment not yet recorded in the entity device list N; new entity words related to such equipment cannot be aligned by the procedure above;
New entity words that cannot be aligned are classified by creating a new entity category, and the created entity category is added to the existing entity device list N.
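One way to read the alignment procedure is as a vote: each record containing the new word picks the category whose records it resembles most, and the category chosen most often wins. The sketch below follows that reading. The measure d is not defined in the source, so Jaccard overlap of character bigrams (larger means closer) is substituted here as an assumption, and the English records are invented for readability.

```python
from collections import Counter

def bigrams(text):
    return {text[i:i + 2] for i in range(len(text) - 1)}

def score(doc, subset):
    """Best Jaccard overlap of character bigrams between doc and any record."""
    a = bigrams(doc)
    best = 0.0
    for other in subset:
        b = bigrams(other)
        union = a | b
        if union:
            best = max(best, len(a & b) / len(union))
    return best

def align(entity_word, s2, s1, min_score=0.0):
    """Return the winning category, or None if no record aligns."""
    se = [record for record in s2 if entity_word in record]   # subset Se
    votes = Counter()
    for doc in se:
        scores = {cat: score(doc, subset) for cat, subset in s1.items()}
        if not scores:
            continue
        cat, best = max(scores.items(), key=lambda kv: kv[1])
        if best > min_score:
            votes[cat] += 1
    return votes.most_common(1)[0][0] if votes else None

s1 = {"n1": ["feedwater pump bearing vibration"],
      "n2": ["induced draft fan damper stuck"]}
s2 = ["fw pump bearing vibration high", "fw pump seal leaking",
      "coal mill tripped"]
category = align("fw pump", s2, s1)
```

A return value of `None` corresponds to the case where no alignment is possible, in which a new entity category would be created and added to the device list N.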
In particular, to make the classification of each new entity word e more accurate, the user finally confirms the classification result through human-computer interaction.
An embodiment of the present invention also discloses an entity classification device for thermal power generation corpus data which, as shown in Fig. 2, includes an initial classification module, a candidate new-word lexicon, a new-word extraction module, and an entity alignment module;
The initial classification module performs an initial classification on the input text set S to be classified, which contains thermal power generation corpus data, and obtains a successfully classified text set S1 and an unclassified text set S2;
The candidate new-word lexicon stores the new entity words of the thermal power generation domain;
The new-word extraction module is connected to the initial classification module and to the candidate new-word lexicon. It receives the unclassified text set S2 from the initial classification module, extracts the new entity words in S2 according to the content of the candidate new-word lexicon, and builds the new entity word list E;
The entity alignment module is connected to the initial classification module and to the new-word extraction module. It receives the successfully classified text set S1 output by the initial classification module and the new entity word list E output by the new-word extraction module, aligns the new entity words in E one by one with S1 to obtain an entity alignment result, and determines the entity category of the new entity words according to that result.
Optionally, the method of constructing the candidate new-word lexicon includes:
1) Establish the domain lexicon candidate word set;
The domain lexicon candidate word set can be established from the thermal power generation corpus text accumulated by a thermal power generation company; this corpus data mainly includes inspection sheets, defect reports, and the like.
The accumulated corpus text is preprocessed. The preprocessing removes duplicate records and strips characters that are clearly not part of entity words, such as letters, symbols, and digits, so that the corpus data passed to later steps is more concise.
After preprocessing, the sentences in the text are cut into substrings at symbols such as spaces and newlines;
The substrings are then segmented into words to form the candidate word set of the domain lexicon;
In particular, an N-gram algorithm can be used to cut each substring into N-grams, yielding words that cover thermal power generation equipment names, the habitual expressions of technicians in this domain, and technical descriptions of equipment faults; these words form the domain lexicon candidate word set.
2) Quantify the candidate words in the domain lexicon candidate word set;
The quantitative criteria for the candidate words are word frequency, internal cohesion, degree of freedom, and positional word-formation probability;
Internal cohesion is expressed by the formula $pmi(x, y) = \log \frac{p(xy)}{p(x)\,p(y)}$, where x and y denote two different words in the corpus; p(xy) is the probability that x and y occur together in the corpus; p(x) is the probability that x occurs on its own in the corpus; and p(y) is the probability that y occurs on its own in the corpus. When $pmi(x, y) \gg 0$, x and y are highly correlated, that is, they frequently occur together, and the string xy is more likely to form a new word.
The degree of freedom is measured by the left and right adjacent-character information entropies; that is, degree of freedom = min(left adjacent-character entropy, right adjacent-character entropy), where the left entropy is $-\sum_{w_l \in s_l} p(w_l \mid w) \log p(w_l \mid w)$ and the right entropy is $-\sum_{w_r \in s_r} p(w_r \mid w) \log p(w_r \mid w)$;
In these formulas, $s_l$ is the set of left-adjacent characters of candidate word w; $s_r$ is the set of right-adjacent characters of candidate word w; $p(w_l \mid w)$ is the conditional probability that the left-adjacent character is $w_l$ given that candidate word w occurs; and $p(w_r \mid w)$ is the conditional probability that the right-adjacent character is $w_r$ given that candidate word w occurs.
The positional word-formation probability is $P(c_i, i) = \frac{N(c_i, i)}{N(c_i)}$, where i is the position at which character $c_i$ occurs; $N(c_i, i)$ is the frequency of all words in which $c_i$ occurs at position i; and $N(c_i)$ is the total frequency with which $c_i$ occurs in the corpus.
3) Screen the quantified candidate words against thresholds to form the domain lexicon;
The thresholds set for screening include a word frequency threshold, a cohesion threshold, left and right adjacent-character information entropy thresholds, and a positional word-formation probability threshold;
Setting the left and right adjacent-character information entropy thresholds determines the degree-of-freedom threshold;
The cohesion threshold and the degree-of-freedom threshold are applied together to screen the words in the candidate word set, yielding words used in this domain;
A word frequency threshold is set; when the frequency of a candidate word exceeds the threshold, the word is regarded as a common word in this domain, and the words screened in this way form the domain lexicon;
A positional word-formation probability threshold is set to assess the word-formation positions of the words in the generated domain lexicon, which improves the accuracy of word formation.
4) Compare the domain lexicon with a general-purpose lexicon, and remove the general-purpose words from the domain lexicon to form the candidate new-word lexicon.
The domain lexicon formed in the previous step has not been checked for domain-specific words, so it still contains general-purpose words used in this domain; these words are unrelated to equipment and need no entity classification. The domain lexicon is therefore compared with a general-purpose lexicon, and the general-purpose words are removed to form the candidate new-word lexicon.
In summary, the entity classification method and device for thermal power generation corpus data provided by the embodiments of the present invention use text data from the thermal power generation domain and combine an unsupervised domain-vocabulary discovery algorithm with a text classification algorithm to classify the entities in power generation corpus data. The resulting thermal power generation lexicon can also serve as corpus support for text data mining in this domain.
The foregoing describes only preferred embodiments of the present invention, and the scope of protection of the invention is not limited to them. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within its scope of protection.

Claims (10)

1. a kind of entity classification method of field of thermal power corpus data, which comprises the steps of:
Just subseries is carried out to the text collection S to be sorted comprising field of thermal power corpus data, obtains successful classification text This set S1 and failed classifying text set S2;
By the alternative new dictionary of foundation, the entity neologisms in the failed classifying text set S2 are extracted, it is new to establish entity Word list E;
Entity neologisms in the new word list E of entity are carried out entity with the classifying text set S1 that succeeded one by one to be aligned, are obtained Result is aligned to entity;
According to obtained entity alignment as a result, determining the entity class of the entity neologisms.
2. entity classification method according to claim 1, which is characterized in that the construction method of the alternative new dictionary, packet It includes:
Establish field dictionary candidate word set;
Candidate word in the field dictionary candidate word set is quantified;
Field dictionary is constituted after carrying out threshold value screening to the candidate word after quantization;
Alternative new dictionary is constituted after rejecting the general word in the field dictionary.
3. entity classification method according to claim 2, which is characterized in that described to establish field dictionary candidate word set, packet It includes:
Field of thermal power corpus data is pre-processed;
The progress substring cutting of pretreated corpus data is obtained into substring;
Word segmentation is carried out to the obtained substring, constitutes the candidate word set of field dictionary.
4. entity classification method according to claim 2, which is characterized in that it is described candidate word quantization include word frequency, Solidified inside degree, freedom degree and position at Word probability quantization.
5. entity classification method according to claim 4, which is characterized in that the threshold value being arranged in threshold value screening includes Word frequency threshold, solidification degree threshold value and left and right connection word information entropy threshold value and position are at Word probability threshold value.
6. entity classification method according to claim 1 or 2, which is characterized in that the just subseries, including,
Establish text collection S:{ s to be sorted1,s2,…,si,…sm, siFor certain text entry in set;
Establish listed entity device list N:{ n1,n2,…,nj,…nK, njFor the class number of some entity;
Classifying text is treated to carry out including removal number, alphabetical, including record fractionation pretreatment;
Classify to pretreated text collection S according to entity device list N, the document sample for obtaining successful classification is empty Between S1 { Sn1:s11,s12,…;Snj:sj1,sj2,…;…;Snk:sk1,sk2..., k is the entity class sum in S1, SnjIt is Belong to entity class njDocument subset.
7. entity classification method according to claim 1, which is characterized in that successfully divided the entity neologisms with described Class text set S1 carries out entity alignment, including;
Establish document subset Se, the Se ∈ S2 comprising substantive noun list E;
Calculate each document subset Sn in the document subset Se to the classifying text set S1 that succeededjDistance d (e, nj);E is Entity neologisms in the new word list E, njFor the entity class for the classifying text set S1 that succeeded;
The most document subset Sn of selected distance d maximum value frequency of occurrencej, entity neologisms e is referred to the document subset Snj The entity class belonged to.
8. The entity classification method according to claim 7, characterized in that new entity words that cannot be entity-aligned are classified by creating a new entity class, and the newly created entity class is added to the existing entity device list N.
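The fallback of claim 8 can be sketched in a few lines. The helper `assign_or_create`, the `n<number>` class identifiers and the convention that a failed alignment is signalled by `None` are assumptions for illustration only.

```python
def assign_or_create(new_word, aligned_class, entity_list):
    """Return the aligned class, or create a new class in entity_list when alignment failed."""
    if aligned_class is not None:
        return aligned_class
    # Pick the next unused class number (hypothetical naming scheme).
    index = len(entity_list) + 1
    while f"n{index}" in entity_list:
        index += 1
    new_id = f"n{index}"
    entity_list[new_id] = new_word  # the new class is added to list N
    return new_id
```

Because the new class is written back into the list, later classification runs can match texts against it directly.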
9. The entity classification method according to claim 8, characterized in that the new entity word list E containing the new entity word e, and the entity class to which the new entity word e belongs, are finally confirmed by the user through human-computer interaction.
10. An entity classification device for corpus data in the thermal power field, characterized in that it comprises an initial classification module, a candidate new-word dictionary, a new word extraction module, and an entity alignment module;
The initial classification module is configured to perform initial classification on an input text set S to be classified, which contains thermal power field corpus data, to obtain a successfully classified text set S1 and an unclassified text set S2;
The candidate new-word dictionary is configured to store new entity words of the thermal power field;
The new word extraction module is connected to the initial classification module and to the candidate new-word dictionary respectively, and is configured to receive the unclassified text set S2 from the initial classification module, extract the new entity words in the unclassified text set S2 according to the contents of the candidate new-word dictionary, and establish the new entity word list E;
The entity alignment module is connected to the initial classification module and to the new word extraction module respectively, and is configured to receive the successfully classified text set S1 output by the initial classification module and the new entity word list E output by the new word extraction module, to align the new entity words in the list E one by one with the successfully classified text set S1 to obtain entity alignment results, and to determine the entity classes of the new entity words according to the obtained entity alignment results.
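The way the four parts of the device are connected can be sketched as a small pipeline object. This shows only the data flow described in claim 10; the class name `EntityClassificationDevice`, its method names and the callable signatures are hypothetical, and the classification and alignment logic is passed in rather than defined here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class EntityClassificationDevice:
    # Initial classification module: texts -> (S1 grouped by class, S2 unclassified).
    classify: Callable[[List[str]], Tuple[Dict[str, List[str]], List[str]]]
    # Entity alignment module: (new word, S2, S1) -> entity class or None.
    align: Callable[[str, List[str], Dict[str, List[str]]], Optional[str]]
    # Candidate new-word dictionary.
    candidate_dictionary: Set[str] = field(default_factory=set)

    def extract_new_words(self, s2: List[str]) -> List[str]:
        """New word extraction module: dictionary entries that occur in S2."""
        return sorted(w for w in self.candidate_dictionary if any(w in t for t in s2))

    def run(self, texts: List[str]) -> Dict[str, Optional[str]]:
        s1, s2 = self.classify(texts)             # S -> S1, S2
        new_words = self.extract_new_words(s2)    # S2 + dictionary -> E
        return {w: self.align(w, s2, s1) for w in new_words}  # E + S1 -> classes
```

Passing the classifier and aligner as callables keeps the sketch independent of any particular matching or distance measure.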
CN201811311803.3A 2018-11-05 2018-11-05 Entity classification method and device for corpus data in thermal power generation field Active CN109582787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811311803.3A CN109582787B (en) 2018-11-05 2018-11-05 Entity classification method and device for corpus data in thermal power generation field

Publications (2)

Publication Number Publication Date
CN109582787A (en) 2019-04-05
CN109582787B (en) 2020-10-20

Family

ID=65921571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811311803.3A Active CN109582787B (en) 2018-11-05 2018-11-05 Entity classification method and device for corpus data in thermal power generation field

Country Status (1)

Country Link
CN (1) CN109582787B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6138087A (en) * 1994-09-30 2000-10-24 Budzinski; Robert L. Memory system for storing and retrieving experience and knowledge with natural language utilizing state representation data, word sense numbers, function codes and/or directed graphs
CN106095736A (en) * 2016-06-07 2016-11-09 华东师范大学 A kind of method of field neologisms extraction
CN106447346A (en) * 2016-08-29 2017-02-22 北京中电普华信息技术有限公司 Method and system for construction of intelligent electric power customer service system
CN107748799A (en) * 2017-11-08 2018-03-02 四川长虹电器股份有限公司 A kind of method of multi-data source movie data entity alignment
CN108363691A (en) * 2018-02-09 2018-08-03 国网江苏省电力有限公司电力科学研究院 A kind of field term identifying system and method for 95598 work order of electric power

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
邱剑: "电力中文文本数据挖掘技术及其在可靠性中的应用研究", 《中国博士学位论文全文数据库工程科技Ⅱ辑》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112948570A (en) * 2019-12-11 2021-06-11 复旦大学 Unsupervised automatic domain knowledge map construction system
CN111177403A (en) * 2019-12-16 2020-05-19 恩亿科(北京)数据科技有限公司 Sample data processing method and device
CN111177403B (en) * 2019-12-16 2023-06-23 恩亿科(北京)数据科技有限公司 Sample data processing method and device
CN112597760A (en) * 2020-12-04 2021-04-02 光大科技有限公司 Method and device for extracting domain words in document
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method
CN113468332A (en) * 2021-07-14 2021-10-01 广州华多网络科技有限公司 Classification model updating method and corresponding device, equipment and medium

Also Published As

Publication number Publication date
CN109582787B (en) 2020-10-20

Similar Documents

Publication Publication Date Title
CN109582787A (en) A kind of entity classification method and device of field of thermal power corpus data
Al-Twairesh et al. AraSenTi: Large-scale Twitter-specific Arabic sentiment lexicons
Hoffart et al. Discovering emerging entities with ambiguous names
Xia et al. Dual sentiment analysis: Considering two sides of one review
Lawrie et al. Normalizing source code vocabulary
US7672833B2 (en) Method and apparatus for automatic entity disambiguation
Kmail et al. An automatic online recruitment system based on exploiting multiple semantic resources and concept-relatedness measures
US20200151252A1 (en) Error correction for tables in document conversion
Bhargava et al. MSATS: Multilingual sentiment analysis via text summarization
Islam et al. Near-synonym choice using a 5-gram language model
Babhulgaonkar et al. Language identification for multilingual machine translation
Wu et al. Extracting summary knowledge graphs from long documents
Wities et al. A consolidated open knowledge representation for multiple texts
Yang et al. Ontology generation for large email collections.
Fu et al. Generating chinese named entity data from a parallel corpus
Li et al. Automatic extraction for product feature words from comments on the web
Firdhous Automating legal research through data mining
Moin et al. Framework for rumors detection in social media
Oliveira et al. Assessing concept weighting in integer linear programming based single-document summarization
Munir et al. A comparison of topic modelling approaches for urdu text
Wang et al. Sentiment detection and visualization of Chinese micro-blog
Mesquita et al. Extracting information networks from the blogosphere: State-of-the-art and challenges
González Pellicer et al. The talp participation at tac-kbp 2012
Kaur et al. Keyword extraction for punjabi language
Badhe et al. Synopsis Creation for Research Paper using Text Summarization Models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant