CN109522298A - Data cleaning method for CIM - Google Patents

Data cleaning method for CIM

Info

Publication number
CN109522298A
CN109522298A
Authority
CN
China
Prior art keywords
text
data
word
dictionary
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810993857.6A
Other languages
Chinese (zh)
Inventor
马文
张雪坚
张新阳
耿贞伟
辛永
黄文思
罗义旺
刘庆胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, and Information Center of Yunnan Power Grid Co Ltd
Priority to CN201810993857.6A
Publication of CN109522298A
Legal status: Pending (current)

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a data cleaning method for CIM, comprising the following steps: Step S100, obtain formatted data from a data source and convert the formatted data into a data table, in which each row is a record and each column is a field; Step S200, traverse all fields in the data table, and if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T; Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k; Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels; Step S500, fill the word-segmentation text labels of every record into the corresponding CIM data.

Description

Data cleaning method for CIM
Technical field
The present invention relates to the field of electric power big data, and more particularly to a data cleaning method for CIM.
Background art
The Common Information Model (CIM) is an abstract model that describes the main objects of a power system. The CIM model and various extended models based on CIM have by now been widely applied to many aspects of power systems. For example, Zhang Jinbo et al., in "Application of the CIM model in the power marketing system", describe the application of CIM in the marketing system of Guangdong Power Grid Corporation; CN103678790A concerns a transmission-network line-loss model based on the CIM model; and CN104182911A concerns a CIM model of a distribution network system.
In the course of applying the CIM model, owing to the historical reasons of grid systems, it is usually necessary to convert grid-system data described with other models into data described by the CIM model, or to convert CIM model data into data of other models; that is, mutual conversion between other model data and CIM model data exists. For example, CN101873008A concerns conversion from the SCD model to the CIM model, CN106292576A concerns mapping from the SCL model to the CIM model, and CN103761077A concerns mapping between the CIM model and a relational database.
In some larger CIM-based models (such as the YNCIM of Yunnan Power Grid), the number of other related models is very large, the data volume is also very large, and the format, completeness, and degree of duplication of the data can differ greatly; in particular, batch text data without a specific format may describe identical content with very different text. Therefore, before other model data (such as relational databases) are converted into CIM model data, it is necessary to clean the data to be converted (especially batch text data).
Summary of the invention
In order to solve the above technical problems, the present invention relates to a data cleaning method for CIM, comprising the following steps: Step S100, obtain formatted data from a data source and convert the formatted data into a data table, in which each row is a record and each column is a field; Step S200, traverse all fields in the data table, and if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T; Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k; Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels; Step S500, fill the word-segmentation text labels of every record into the corresponding CIM data.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The description presents, by way of example and not limitation, specific embodiments consistent with the principles of the present invention. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention; other embodiments may be used, and the structure of individual elements may be changed and/or substituted, without departing from the scope and spirit of the present invention. Therefore, the following detailed description should not be understood in a restrictive sense.
As shown in Fig. 1, the present invention provides a data cleaning method for CIM. In one exemplary embodiment, the CIM may be the YNCIM built for Yunnan Power Grid. Those skilled in the art will appreciate that the CIM is not limited to the YNCIM and also covers CIMs in other situations; that is, all scenarios to which the relevant steps of the data cleaning method of the present invention apply fall within the protection scope of the present invention.
Further, the method for the present invention includes the following steps:
Step S100, obtain formatted data from multiple data sources and convert the formatted data into a data table, in which each row is a record and each column is a field. In the present invention, a data source is used to store formatted data and is generally implemented as a database, a data storage center, a data server, a data cloud, or the like; the data sources may be at different physical locations, for example Kunming, Qujing, Dali, Guangzhou, or Shenzhen. The formatted data in a data source is data in a format that can be automatically run and parsed by a computer program, including but not limited to a relational database format, an XML format, or a CIM model. Obviously, parsing of the different formats by a computer program can be realized using any prior art and does not affect the protection scope of the present invention. Further, the unified format of the parsed data is a data table, which facilitates subsequent processing.
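For orientation only, the following is a minimal Python sketch of Step S100, assuming pandas is available; the file paths, the SQLite source, and the table name "equipment" are hypothetical placeholders rather than details from this disclosure.

import sqlite3
import pandas as pd

def load_sources_as_tables(xml_path: str, db_path: str) -> list[pd.DataFrame]:
    """Convert formatted data from several sources into uniform data tables
    in which each row is a record and each column is a field."""
    tables = []
    # XML-formatted source: every repeated element becomes one row of the table.
    tables.append(pd.read_xml(xml_path))
    # Relational-database source: every table row is already a record.
    with sqlite3.connect(db_path) as conn:
        tables.append(pd.read_sql_query("SELECT * FROM equipment", conn))
    return tables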
Under normal circumstances, after the formatted data are obtained (or after the data table is formed), the data in the data table also need to be preprocessed, e.g., code conversion, conversion between simplified and traditional Chinese characters, and text cleaning, so that the data meet the requirements of subsequent processing. The specific processing can refer to the content introduced in CN107577713A, "Text processing method based on an electric power dictionary", and is not repeated here.
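A generic preprocessing sketch is given below; it only normalizes the character encoding and whitespace and is not the method of CN107577713A, which should be consulted for the full code conversion, simplified/traditional conversion, and text cleaning.

import unicodedata

def preprocess(text: str) -> str:
    # Normalize the encoding form and collapse stray whitespace.  Conversion
    # between simplified and traditional Chinese would additionally need an
    # external character-mapping table, which is not shown here.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())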
Step S200, traverse all fields in the data table; if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T.
According to the present invention, the value of N can be specified empirically, for example N = 100, i.e., the 100 records containing the most text are extracted; however, for data tables with many records (e.g., millions of records), such sampling is insufficient. Another implementation uses a percentage of the total number of records (e.g., the top 0.1%), which handles large data volumes better but still under-samples small data volumes. It is therefore preferred that N = max(Total × 0.1%, 100), where Total is the total number of records in the data table and max is the maximum function; that is, the larger of the two is selected as the value of N.
According to the present invention, further, when the text set T is formed, the texts of different records are separated by a separator from a separator library. The separator library is generally implemented as a punctuation-mark library, or a punctuation-mark and special-character library. The preferred separator is an exclamation mark, i.e., the text contents of different records are separated by "!". Since "!" is rarely used in records, it can effectively distinguish the different records in the text set T and lays a foundation for possible subsequent processing.
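A minimal sketch of Step S200 under the preferred choices above (N = max(Total × 0.1%, 100) and "!" as the separator); the DataFrame `table` is assumed to come from Step S100, and the helper name is illustrative.

def build_text_set(table, field: str) -> str:
    """Extract the N records whose text in `field` is longest and join them
    with '!' to form the text set T."""
    texts = table[field].astype(str)
    total = len(table)
    n = max(int(total * 0.001), 100)   # N = max(Total x 0.1%, 100)
    # Keep the N records containing the most text in field F.
    longest = texts.sort_values(key=lambda s: s.str.len(), ascending=False).head(n)
    # Join the selected texts with "!" so the records stay distinguishable in T.
    return "!".join(longest)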
Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k. Each differential dictionary contains multiple indivisible segmentation terms. According to the present invention, a differential dictionary is a database containing a word-segmentation vocabulary and can also be regarded as a special kind of "dictionary". One exemplary implementation of such a dictionary is the electric power dictionary introduced in CN107577713A, "Text processing method based on an electric power dictionary". The "words" in a differential dictionary are mainly used to realize the segmentation of sentences. For example, if a differential dictionary contains the word "electromagnetic brake", the sentence "usually the electromagnetic brake is arranged near the shaft coupling of the motor" is first divided into the three parts "usually", "electromagnetic brake", and "is arranged near the shaft coupling of the motor"; the parts "usually" and "is arranged near the shaft coupling of the motor" are then further segmented if needed, for example using the prior art, while "electromagnetic brake" is not segmented further. Each differential dictionary stores the industry-specific technical terms of a certain subdivision of the power industry or the vocabulary commonly used to describe a general situation (for example, but not limited to, vocabulary reflecting the habits of the staff); therefore the number of segmentation words stored in each differential dictionary is relatively small, one to two orders of magnitude fewer than in a more general dictionary such as a patent-classification dictionary, so that segmentation can be completed quickly.
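To illustrate the dictionary-first behaviour described above, the toy sketch below keeps differential-dictionary terms such as "electromagnetic brake" whole and leaves the remaining spans for later, general segmentation; greedy longest-match is an assumption made for illustration, not the exact algorithm of this disclosure.

import re

def dictionary_first_split(sentence: str, diff_dict: set[str]) -> list[str]:
    """Split `sentence` so that differential-dictionary terms stay indivisible."""
    if not diff_dict:
        return [sentence]
    # Longer dictionary terms take priority so multi-word terms are not broken up.
    pattern = "|".join(sorted(map(re.escape, diff_dict), key=len, reverse=True))
    parts, last = [], 0
    for m in re.finditer(pattern, sentence):
        if m.start() > last:
            parts.append(sentence[last:m.start()])  # span left for later segmentation
        parts.append(m.group())                     # indivisible dictionary term
        last = m.end()
    if last < len(sentence):
        parts.append(sentence[last:])
    return parts

# dictionary_first_split(
#     "usually the electromagnetic brake is arranged near the shaft coupling of the motor",
#     {"electromagnetic brake", "motor", "shaft coupling"})
# -> ["usually the ", "electromagnetic brake", " is arranged near the ",
#     "shaft coupling", " of the ", "motor"]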
According to the present invention, the step S300 further comprises:
Step S310, using only the k differential dictionaries {D1, D2, ..., Dk} respectively (without using a separator library, a stop-word dictionary, a general dictionary, or other vocabularies or segmentation methods of the prior art), segment each text in the text set T and sort, forming k sorted segmentation vectors W1 ... Wk:
W1 = (w11, w12, ..., w1n1);
W2 = (w21, w22, ..., w2n2);
...
Wk = (wk1, wk2, ..., wknk).
Here n1 ≥ n2 ≥ ... ≥ nk, and each word wij in W1 ... Wk is a word contained in the corresponding differential dictionary, where i takes values 1 ... k and j correspondingly takes values up to n1 ... nk.
Step S320, obtain i such that
Step S330, if i = 1, this indicates that W1 shares many more identical segmented words (or vocabulary items) with the text set T than W2 ... Wk do; therefore the differential dictionary used to obtain W1 is determined as Dt.
Step S340, if i > 1, this indicates that the first i segmentation vectors W1 ... Wi do not differ much in the number of words they share with the text set T, so further processing is needed to judge which segmentation vector is closer to the text set T. Specifically, the words in the segmentation vectors W1 ... Wi are de-duplicated and sorted, forming i sorted segmentation vectors V1 ... Vi, where m1 ≥ m2 ≥ ... ≥ mk:
V1 = (v11, v12, ..., v1m1);
V2 = (v21, v22, ..., v2m2);
...
Vk = (vk1, vk2, ..., vkmk).
Step S350, the differential dictionary corresponding to V1 is determined as Dt.
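A hedged Python sketch of Steps S310 to S350 follows. The exact condition of Step S320 is given by a formula not reproduced in this text, so the tie test below (exactly equal word counts) is a placeholder assumption rather than the patent's criterion, and matching dictionary words by substring counting is likewise only illustrative.

def choose_differential_dictionary(text_set: str, dictionaries: list[set[str]]) -> int:
    """Return the index of the differential dictionary Dt chosen for text set T."""
    # S310: per dictionary, collect its words as often as they occur in T.
    matches = [[w for w in d for _ in range(text_set.count(w))] for d in dictionaries]
    order = sorted(range(len(dictionaries)), key=lambda i: len(matches[i]), reverse=True)
    best = order[0]
    # Placeholder for S320/S330: treat dictionaries with the same (maximal) match
    # count as tied; if there is no tie, the leading dictionary is Dt.
    tied = [i for i in order if len(matches[i]) == len(matches[best])]
    if len(tied) == 1:
        return best
    # S340/S350: de-duplicate the matched words and pick the dictionary that
    # contributes the most distinct words.
    return max(tied, key=lambda i: len(set(matches[i])))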
Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels. In one embodiment, the text content of the word-segmentation text labels is identical to words in the differential dictionary Dt. In another, preferred embodiment, the text content of the word-segmentation text labels comprises words in the differential dictionary Dt.
According to the present invention, specifically, step S400 further comprises:
Step S410, the text of field F in each record is segmented using the separators in a separator library to form a first word vector. Unlike in step S200, in this step the separator library is implemented as a punctuation-mark library; by using the separators, the paragraph of text in field F can be divided into specific sentences or phrases, and these sentences and/or phrases constitute the first word vector, i.e., the elements of the first word vector are sentences and/or phrases. Those skilled in the art understand that many prior-art ways of dividing a paragraph into sentences/phrases using punctuation marks exist, any of which is applicable to the technical solution of the present invention.
Step S420, each first word in the first word vector (i.e., each sentence and/or phrase) is segmented using the differential dictionary Dt to form word-segmentation text labels and a second word vector. In the earlier example, if the differential dictionary Dt contains "electromagnetic brake", "motor", and "shaft coupling", then "usually the electromagnetic brake is arranged near the shaft coupling of the motor" is first divided into a second word vector containing multiple (seven) second words ("usually", "electromagnetic brake", "arranged at", "motor", "of", "shaft coupling", "near").
In one embodiment, the text content of the word-segmentation text labels is identical to words in the differential dictionary Dt. After step S420 is completed, the second words in the second word vector that are identical to words in Dt are used as the text labels, i.e., "electromagnetic brake", "motor", and "shaft coupling" serve as the text labels.
In another embodiment, the text content of the word-segmentation text labels comprises words in the differential dictionary Dt. After step S420, the following steps are also performed:
Step S430, each second word in the second word vector is segmented using a stop-word dictionary to form a third word vector. Specifically, if a second word is identical to a stop word, that second word is deleted from the second word vector; if a second word contains a stop word, that second word is split into third words. In the present invention, the stop-word dictionary has a meaning similar to stop-word dictionaries in the prior art and generally stores function words without a specific meaning, such as "of", "at", and "usually". In the example above, if the stop-word dictionary contains "usually", "arranged", "of", "at", and "near", the third word vector contains ("electromagnetic brake", "arranged at", "motor", "shaft coupling").
Step S440, the words in the third word vector that are not in the differential dictionary Dt are segmented using a general dictionary to form a result word vector. In the present invention, the general dictionary has the same meaning as in the prior art; those skilled in the art understand that many prior-art ways of segmenting with a general dictionary exist, any of which is applicable to the technical solution of the present invention. In the example above, only "arranged at" needs to be segmented, into "arranged" and "at"; "electromagnetic brake", "motor", and "shaft coupling" need not be segmented.
In one embodiment, the result word vector ("electromagnetic brake", "arranged", "at", "motor", "shaft coupling") is used as the word-segmentation text labels. Another embodiment further includes step S450, i.e., a step similar to step S430 is repeated: the result word vector is segmented again using the stop words, yielding the optimized result word vector ("electromagnetic brake", "motor", "shaft coupling"), which is used as the word-segmentation text labels.
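A sketch of Steps S410 to S450 is given below, reusing dictionary_first_split from the earlier sketch. The punctuation set, the stop-word handling (splitting of words that merely contain stop words is folded into the general segmentation pass), and the stubbed general-dictionary segmenter are simplifications assumed for illustration.

import re

def general_segment(span: str) -> list[str]:
    """Placeholder for segmentation with a general dictionary; an off-the-shelf
    segmenter could be plugged in here."""
    return [span.strip()] if span.strip() else []

def segment_record(text: str, diff_dict: set[str], stop_words: set[str]) -> list[str]:
    labels = []
    # S410: split the record text into sentences/phrases at punctuation marks.
    first_vector = [s for s in re.split(r"[，。；！？,.;!?]", text) if s]
    for phrase in first_vector:
        # S420: segment each phrase with the differential dictionary Dt.
        for word in dictionary_first_split(phrase, diff_dict):
            if word in diff_dict:
                labels.append(word)            # kept whole as a text label
            elif word not in stop_words:
                # S430/S440: drop exact stop words; hand everything else to the
                # general segmenter.
                labels.extend(general_segment(word))
    # S450 (optional): a second stop-word pass over the result vector.
    return [w for w in labels if w not in stop_words]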
From the disclosed specification of the present invention, other implementations of the invention will be obvious to those skilled in the art. The various aspects and/or features of the embodiments can be used in the system and method of the present invention individually or in any combination. The specification and the examples therein should be regarded as exemplary only; the true scope and spirit of the present invention are indicated by the appended claims.

Claims (7)

1. A data cleaning method for CIM, characterized by comprising the following steps:
Step S100, obtain formatted data from multiple data sources and convert the formatted data into a data table, in which each row is a record and each column is a field;
Step S200, traverse all fields in the data table; if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T;
Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k; each differential dictionary contains multiple indivisible segmentation terms;
Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels, the text content of the word-segmentation text labels comprising words in the differential dictionary Dt;
Step S500, fill the word-segmentation text labels of every record into the corresponding CIM data.
2. The data cleaning method according to claim 1, characterized in that the step S300 further comprises:
Step S310, using only the k differential dictionaries {D1, D2, ..., Dk} respectively (without using a separator library, a stop-word dictionary, a general dictionary, or the like), segment each text in the text set T and sort, forming k sorted segmentation vectors W1 ... Wk:
W1 = (w11, w12, ..., w1n1);
W2 = (w21, w22, ..., w2n2);
...
Wk = (wk1, wk2, ..., wknk);
where n1 ≥ n2 ≥ ... ≥ nk, and each word wij in W1 ... Wk is a word contained in the corresponding differential dictionary, i taking values 1 ... k and j correspondingly taking values up to n1 ... nk;
Step S320, obtain i such that
Step S330, if i = 1, determine the differential dictionary corresponding to W1 as Dt.
3. The data cleaning method according to claim 2, characterized in that the step S300 further comprises:
Step S340, if i > 1, de-duplicate and sort the words in the segmentation vectors W1 ... Wi, forming i sorted segmentation vectors V1 ... Vi, where m1 ≥ m2 ≥ ... ≥ mk:
V1 = (v11, v12, ..., v1m1);
V2 = (v21, v22, ..., v2m2);
...
Vk = (vk1, vk2, ..., vkmk);
Step S350, determine the differential dictionary corresponding to V1 as Dt.
4. The data cleaning method according to claim 3, characterized in that N = max(Total × 0.1%, 100), where Total is the total number of records in the data table and max is the maximum function.
5. The data cleaning method according to claim 4, characterized in that, in the step S200, when the text set T is formed, the texts of different records are separated by a separator from a separator library (generally implemented as a punctuation-mark library, or a punctuation-mark and special-character library).
6. The data cleaning method according to claim 5, characterized in that the separator is an exclamation mark.
7. The data cleaning method according to claim 3, characterized in that the step S400 further comprises:
Step S410, segment the text of field F in each record using the separators in a separator library to form a first word vector;
Step S420, segment each first word in the first word vector using the differential dictionary Dt to form word-segmentation text labels and a second word vector;
Step S430, segment each second word in the second word vector using a stop-word dictionary to form a third word vector; if a second word is identical to a stop word, delete that second word from the second word vector; if a second word contains a stop word, split that second word into third words;
Step S440, segment the third word vector using a general dictionary to form a result word vector, which serves as the word-segmentation text labels.
CN201810993857.6A 2018-08-29 2018-08-29 Data cleaning method for CIM Pending CN109522298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810993857.6A CN109522298A (en) 2018-08-29 2018-08-29 Data cleaning method for CIM

Publications (1)

Publication Number Publication Date
CN109522298A (en) 2019-03-26

Family

ID=65770780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810993857.6A Pending CN109522298A (en) 2018-08-29 2018-08-29 Data cleaning method for CIM

Country Status (1)

Country Link
CN (1) CN109522298A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825620B1 (en) * 2011-06-13 2014-09-02 A9.Com, Inc. Behavioral word segmentation for use in processing search queries
CN106156002A (en) * 2016-06-30 2016-11-23 乐视控股(北京)有限公司 The system of selection of participle dictionary and system
CN107577713A (en) * 2017-08-03 2018-01-12 国网信通亿力科技有限责任公司 Text handling method based on electric power dictionary
CN108228825A (en) * 2018-01-02 2018-06-29 北京市燃气集团有限责任公司 A kind of station address data cleaning method based on participle

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705728A (en) * 2021-09-18 2021-11-26 全知科技(杭州)有限责任公司 Classified grading list intelligent marking method
CN113705728B (en) * 2021-09-18 2023-08-01 全知科技(杭州)有限责任公司 Classification and classification list intelligent marking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190326)