CN109522298A - Data cleaning method for CIM - Google Patents

Data cleaning method for CIM

Info

Publication number
CN109522298A
CN109522298A
Authority
CN
China
Prior art keywords
text
data
word
dictionary
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810993857.6A
Other languages
Chinese (zh)
Inventor
马文
张雪坚
张新阳
耿贞伟
辛永
黄文思
罗义旺
刘庆胜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, and Information Center of Yunnan Power Grid Co Ltd
Priority to CN201810993857.6A
Publication of CN109522298A
Legal status: Pending (current)

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention relates to a data cleaning method for CIM, comprising the following steps: Step S100, obtain formatted data from a data source and convert the formatted data into a data table, in which each row is a record and each column is a field; Step S200, traverse all fields in the data table, and if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T; Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k; Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels; Step S500, fill the word-segmentation text labels of every record into the corresponding CIM data.

Description

Data cleaning method for CIM
Technical field
The present invention relates to the field of electric power big data, and more particularly to a data cleaning method for CIM.
Background art
The Common Information Model (CIM) is an abstract model that describes the main objects of a power system. The CIM model and various extended models based on CIM have by now been widely applied to many aspects of power systems. For example, Zhang Jinbo et al., in "Application of the CIM model in the power marketing system", describe the application of CIM in the marketing system of Guangdong Power Grid Corporation; CN103678790A concerns a transmission-network line-loss model based on the CIM model; and CN104182911A concerns a CIM model of a distribution network system.
In the course of applying the CIM model, owing to the historical reasons of grid systems, it is usually necessary to convert grid-system data described with other models into data described by the CIM model, or to convert CIM model data into data of other models; that is, mutual conversion between other model data and CIM model data exists. For example, CN101873008A concerns conversion from the SCD model to the CIM model, CN106292576A concerns mapping from the SCL model to the CIM model, and CN103761077A concerns mapping between the CIM model and a relational database.
In some larger CIM-based models (such as the YNCIM of Yunnan Power Grid), the number of other related models is very large, the data volume is also very large, and the format, completeness, and degree of duplication of the data can differ greatly; in particular, batch text data without a specific format may describe identical content with very different text. Therefore, before other model data (such as relational databases) are converted into CIM model data, it is necessary to clean the data to be converted (especially batch text data).
Summary of the invention
In order to solve the above technical problems, the present invention relates to a data cleaning method for CIM, comprising the following steps: Step S100, obtain formatted data from a data source and convert the formatted data into a data table, in which each row is a record and each column is a field; Step S200, traverse all fields in the data table, and if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T; Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k; Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels; Step S500, fill the word-segmentation text labels of every record into the corresponding CIM data.
Brief description of the drawings
Fig. 1 is a flowchart of the method of the present invention.
Detailed description of the embodiments
To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The description presents, by way of example and not limitation, specific embodiments consistent with the principles of the present invention. These embodiments are described in sufficient detail to enable those skilled in the art to practice the invention; other embodiments may be used, and the structure of individual elements may be changed and/or substituted, without departing from the scope and spirit of the present invention. Therefore, the following detailed description should not be understood in a restrictive sense.
As shown in Fig. 1, the present invention provides a data cleaning method for CIM. In one exemplary embodiment, the CIM may be the YNCIM built for Yunnan Power Grid. Those skilled in the art will appreciate that the CIM is not limited to the YNCIM and also covers CIMs in other situations; that is, all scenarios to which the relevant steps of the data cleaning method of the present invention apply fall within the protection scope of the present invention.
Further, the method for the present invention includes the following steps:
Step S100, obtain formatted data from multiple data sources and convert the formatted data into a data table, in which each row is a record and each column is a field. In the present invention, a data source is used to store formatted data and is generally implemented as a database, a data storage center, a data server, a data cloud, or the like; the data sources may be at different physical locations, for example Kunming, Qujing, Dali, Guangzhou, or Shenzhen. The formatted data in a data source is data in a format that can be automatically run and parsed by a computer program, including but not limited to a relational database format, an XML format, or a CIM model. Obviously, parsing of the different formats by a computer program can be realized using any prior art and does not affect the protection scope of the present invention. Further, the unified format of the parsed data is a data table, which facilitates subsequent processing.
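For orientation only, the following is a minimal Python sketch of Step S100, assuming pandas is available; the file paths, the SQLite source, and the table name "equipment" are hypothetical placeholders rather than details from this disclosure.

import sqlite3
import pandas as pd

def load_sources_as_tables(xml_path: str, db_path: str) -> list[pd.DataFrame]:
    """Convert formatted data from several sources into uniform data tables
    in which each row is a record and each column is a field."""
    tables = []
    # XML-formatted source: every repeated element becomes one row of the table.
    tables.append(pd.read_xml(xml_path))
    # Relational-database source: every table row is already a record.
    with sqlite3.connect(db_path) as conn:
        tables.append(pd.read_sql_query("SELECT * FROM equipment", conn))
    return tables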
Under normal circumstances, after the formatted data are obtained (or after the data table is formed), the data in the data table also need to be preprocessed, e.g., code conversion, conversion between simplified and traditional Chinese characters, and text cleaning, so that the data meet the requirements of subsequent processing. The specific processing can refer to the content introduced in CN107577713A, "Text processing method based on an electric power dictionary", and is not repeated here.
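A generic preprocessing sketch is given below; it only normalizes the character encoding and whitespace and is not the method of CN107577713A, which should be consulted for the full code conversion, simplified/traditional conversion, and text cleaning.

import unicodedata

def preprocess(text: str) -> str:
    # Normalize the encoding form and collapse stray whitespace.  Conversion
    # between simplified and traditional Chinese would additionally need an
    # external character-mapping table, which is not shown here.
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())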
Step S200, traverse all fields in the data table; if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T.
According to the present invention, the value of N can be specified empirically, for example N = 100, i.e., the 100 records containing the most text are extracted; however, for data tables with many records (e.g., millions of records), such sampling is insufficient. Another implementation uses a percentage of the total number of records (e.g., the top 0.1%), which handles large data volumes better but still under-samples small data volumes. It is therefore preferred that N = max(Total × 0.1%, 100), where Total is the total number of records in the data table and max is the maximum function; that is, the larger of the two is selected as the value of N.
According to the present invention, further, when the text set T is formed, the texts of different records are separated by a separator from a separator library. The separator library is generally implemented as a punctuation-mark library, or a punctuation-mark and special-character library. The preferred separator is an exclamation mark, i.e., the text contents of different records are separated by "!". Since "!" is rarely used in records, it can effectively distinguish the different records in the text set T and lays a foundation for possible subsequent processing.
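A minimal sketch of Step S200 under the preferred choices above (N = max(Total × 0.1%, 100) and "!" as the separator); the DataFrame `table` is assumed to come from Step S100, and the helper name is illustrative.

def build_text_set(table, field: str) -> str:
    """Extract the N records whose text in `field` is longest and join them
    with '!' to form the text set T."""
    texts = table[field].astype(str)
    total = len(table)
    n = max(int(total * 0.001), 100)   # N = max(Total x 0.1%, 100)
    # Keep the N records containing the most text in field F.
    longest = texts.sort_values(key=lambda s: s.str.len(), ascending=False).head(n)
    # Join the selected texts with "!" so the records stay distinguishable in T.
    return "!".join(longest)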
Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k. Each differential dictionary contains multiple indivisible segmentation terms. According to the present invention, a differential dictionary is a database containing a word-segmentation vocabulary and can also be regarded as a special kind of "dictionary". One exemplary implementation of such a dictionary is the electric power dictionary introduced in CN107577713A, "Text processing method based on an electric power dictionary". The "words" in a differential dictionary are mainly used to realize the segmentation of sentences. For example, if a differential dictionary contains the word "electromagnetic brake", the sentence "usually the electromagnetic brake is arranged near the shaft coupling of the motor" is first divided into the three parts "usually", "electromagnetic brake", and "is arranged near the shaft coupling of the motor"; the parts "usually" and "is arranged near the shaft coupling of the motor" are then further segmented if needed, for example using the prior art, while "electromagnetic brake" is not segmented further. Each differential dictionary stores the industry-specific technical terms of a certain subdivision of the power industry or the vocabulary commonly used to describe a general situation (for example, but not limited to, vocabulary reflecting the habits of the staff); therefore the number of segmentation words stored in each differential dictionary is relatively small, one to two orders of magnitude fewer than in a more general dictionary such as a patent-classification dictionary, so that segmentation can be completed quickly.
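To illustrate the dictionary-first behaviour described above, the toy sketch below keeps differential-dictionary terms such as "electromagnetic brake" whole and leaves the remaining spans for later, general segmentation; greedy longest-match is an assumption made for illustration, not the exact algorithm of this disclosure.

import re

def dictionary_first_split(sentence: str, diff_dict: set[str]) -> list[str]:
    """Split `sentence` so that differential-dictionary terms stay indivisible."""
    if not diff_dict:
        return [sentence]
    # Longer dictionary terms take priority so multi-word terms are not broken up.
    pattern = "|".join(sorted(map(re.escape, diff_dict), key=len, reverse=True))
    parts, last = [], 0
    for m in re.finditer(pattern, sentence):
        if m.start() > last:
            parts.append(sentence[last:m.start()])  # span left for later segmentation
        parts.append(m.group())                     # indivisible dictionary term
        last = m.end()
    if last < len(sentence):
        parts.append(sentence[last:])
    return parts

# dictionary_first_split(
#     "usually the electromagnetic brake is arranged near the shaft coupling of the motor",
#     {"electromagnetic brake", "motor", "shaft coupling"})
# -> ["usually the ", "electromagnetic brake", " is arranged near the ",
#     "shaft coupling", " of the ", "motor"]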
According to the present invention, the step S300 further comprises:
Step S310, using only the k differential dictionaries {D1, D2, ..., Dk} respectively (without using a separator library, a stop-word dictionary, a general dictionary, or other vocabularies or segmentation methods of the prior art), segment each text in the text set T and sort, forming k sorted segmentation vectors W1 ... Wk:
W1 = (w11, w12, ..., w1n1);
W2 = (w21, w22, ..., w2n2);
...
Wk = (wk1, wk2, ..., wknk).
Here n1 ≥ n2 ≥ ... ≥ nk, and each word wij in W1 ... Wk is a word contained in the corresponding differential dictionary, where i takes values 1 ... k and j correspondingly takes values up to n1 ... nk.
Step S320, obtain i such that
Step S330, if i = 1, this indicates that W1 shares many more identical segmented words (or vocabulary items) with the text set T than W2 ... Wk do; therefore the differential dictionary used to obtain W1 is determined as Dt.
Step S340, if i > 1, this indicates that the first i segmentation vectors W1 ... Wi do not differ much in the number of words they share with the text set T, so further processing is needed to judge which segmentation vector is closer to the text set T. Specifically, the words in the segmentation vectors W1 ... Wi are de-duplicated and sorted, forming i sorted segmentation vectors V1 ... Vi, where m1 ≥ m2 ≥ ... ≥ mk:
V1 = (v11, v12, ..., v1m1);
V2 = (v21, v22, ..., v2m2);
...
Vk = (vk1, vk2, ..., vkmk).
Step S350, the differential dictionary corresponding to V1 is determined as Dt.
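A hedged Python sketch of Steps S310 to S350 follows. The exact condition of Step S320 is given by a formula not reproduced in this text, so the tie test below (exactly equal word counts) is a placeholder assumption rather than the patent's criterion, and matching dictionary words by substring counting is likewise only illustrative.

def choose_differential_dictionary(text_set: str, dictionaries: list[set[str]]) -> int:
    """Return the index of the differential dictionary Dt chosen for text set T."""
    # S310: per dictionary, collect its words as often as they occur in T.
    matches = [[w for w in d for _ in range(text_set.count(w))] for d in dictionaries]
    order = sorted(range(len(dictionaries)), key=lambda i: len(matches[i]), reverse=True)
    best = order[0]
    # Placeholder for S320/S330: treat dictionaries with the same (maximal) match
    # count as tied; if there is no tie, the leading dictionary is Dt.
    tied = [i for i in order if len(matches[i]) == len(matches[best])]
    if len(tied) == 1:
        return best
    # S340/S350: de-duplicate the matched words and pick the dictionary that
    # contributes the most distinct words.
    return max(tied, key=lambda i: len(set(matches[i])))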
Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels. In one embodiment, the text content of the word-segmentation text labels is identical to words in the differential dictionary Dt. In another, preferred embodiment, the text content of the word-segmentation text labels comprises words in the differential dictionary Dt.
According to the present invention, specifically, step S400 further comprises:
Step S410, the text of field F in each record is segmented using the separators in a separator library to form a first word vector. Unlike in step S200, in this step the separator library is implemented as a punctuation-mark library; by using the separators, the paragraph of text in field F can be divided into specific sentences or phrases, and these sentences and/or phrases constitute the first word vector, i.e., the elements of the first word vector are sentences and/or phrases. Those skilled in the art understand that many prior-art ways of dividing a paragraph into sentences/phrases using punctuation marks exist, any of which is applicable to the technical solution of the present invention.
Step S420, each first word in the first word vector (i.e., each sentence and/or phrase) is segmented using the differential dictionary Dt to form word-segmentation text labels and a second word vector. In the earlier example, if the differential dictionary Dt contains "electromagnetic brake", "motor", and "shaft coupling", then "usually the electromagnetic brake is arranged near the shaft coupling of the motor" is first divided into a second word vector containing multiple (seven) second words ("usually", "electromagnetic brake", "arranged at", "motor", "of", "shaft coupling", "near").
In one embodiment, the text content of the word-segmentation text labels is identical to words in the differential dictionary Dt. After step S420 is completed, the second words in the second word vector that are identical to words in Dt are used as the text labels, i.e., "electromagnetic brake", "motor", and "shaft coupling" serve as the text labels.
In another embodiment, the text content of the word-segmentation text labels comprises words in the differential dictionary Dt. After step S420, the following steps are also performed:
Step S430, each second word in the second word vector is segmented using a stop-word dictionary to form a third word vector. Specifically, if a second word is identical to a stop word, that second word is deleted from the second word vector; if a second word contains a stop word, that second word is split into third words. In the present invention, the stop-word dictionary has a meaning similar to stop-word dictionaries in the prior art and generally stores function words without a specific meaning, such as "of", "at", and "usually". In the example above, if the stop-word dictionary contains "usually", "arranged", "of", "at", and "near", the third word vector contains ("electromagnetic brake", "arranged at", "motor", "shaft coupling").
Step S440, the words in the third word vector that are not in the differential dictionary Dt are segmented using a general dictionary to form a result word vector. In the present invention, the general dictionary has the same meaning as in the prior art; those skilled in the art understand that many prior-art ways of segmenting with a general dictionary exist, any of which is applicable to the technical solution of the present invention. In the example above, only "arranged at" needs to be segmented, into "arranged" and "at"; "electromagnetic brake", "motor", and "shaft coupling" need not be segmented.
In one embodiment, the result word vector ("electromagnetic brake", "arranged", "at", "motor", "shaft coupling") is used as the word-segmentation text labels. Another embodiment further includes step S450, i.e., a step similar to step S430 is repeated: the result word vector is segmented again using the stop words, yielding the optimized result word vector ("electromagnetic brake", "motor", "shaft coupling"), which is used as the word-segmentation text labels.
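A sketch of Steps S410 to S450 is given below, reusing dictionary_first_split from the earlier sketch. The punctuation set, the stop-word handling (splitting of words that merely contain stop words is folded into the general segmentation pass), and the stubbed general-dictionary segmenter are simplifications assumed for illustration.

import re

def general_segment(span: str) -> list[str]:
    """Placeholder for segmentation with a general dictionary; an off-the-shelf
    segmenter could be plugged in here."""
    return [span.strip()] if span.strip() else []

def segment_record(text: str, diff_dict: set[str], stop_words: set[str]) -> list[str]:
    labels = []
    # S410: split the record text into sentences/phrases at punctuation marks.
    first_vector = [s for s in re.split(r"[，。；！？,.;!?]", text) if s]
    for phrase in first_vector:
        # S420: segment each phrase with the differential dictionary Dt.
        for word in dictionary_first_split(phrase, diff_dict):
            if word in diff_dict:
                labels.append(word)            # kept whole as a text label
            elif word not in stop_words:
                # S430/S440: drop exact stop words; hand everything else to the
                # general segmenter.
                labels.extend(general_segment(word))
    # S450 (optional): a second stop-word pass over the result vector.
    return [w for w in labels if w not in stop_words]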
From the disclosed specification of the present invention, other implementations of the invention will be obvious to those skilled in the art. The various aspects and/or features of the embodiments can be used in the system and method of the present invention individually or in any combination. The specification and the examples therein should be regarded as exemplary only; the true scope and spirit of the present invention are indicated by the appended claims.

Claims (7)

1. A data cleaning method for CIM, characterized by comprising the following steps:
Step S100, obtain formatted data from multiple data sources and convert the formatted data into a data table, in which each row is a record and each column is a field;
Step S200, traverse all fields in the data table; if the field type of a field F is text, extract the text of the N records containing the most text in field F to form a text set T;
Step S300, from k differential dictionaries {D1, D2, ..., Dk}, determine the differential dictionary Dt corresponding to the text set T, where t ranges over 1 ... k; each differential dictionary contains multiple indivisible segmentation terms;
Step S400, segment every record of field F using the differential dictionary Dt to form word-segmentation text labels, the text content of the word-segmentation text labels comprising words in the differential dictionary Dt;
Step S500, fill the word-segmentation text labels of every record into the corresponding CIM data.
2. The data cleaning method according to claim 1, characterized in that the step S300 further comprises:
Step S310, using only the k differential dictionaries {D1, D2, ..., Dk} respectively (without using a separator library, a stop-word dictionary, a general dictionary, or the like), segment each text in the text set T and sort, forming k sorted segmentation vectors W1 ... Wk:
W1 = (w11, w12, ..., w1n1);
W2 = (w21, w22, ..., w2n2);
...
Wk = (wk1, wk2, ..., wknk);
where n1 ≥ n2 ≥ ... ≥ nk, and each word wij in W1 ... Wk is a word contained in the corresponding differential dictionary, i taking values 1 ... k and j correspondingly taking values up to n1 ... nk;
Step S320, obtain i such that
Step S330, if i = 1, determine the differential dictionary corresponding to W1 as Dt.
3. The data cleaning method according to claim 2, characterized in that the step S300 further comprises:
Step S340, if i > 1, de-duplicate and sort the words in the segmentation vectors W1 ... Wi, forming i sorted segmentation vectors V1 ... Vi, where m1 ≥ m2 ≥ ... ≥ mk:
V1 = (v11, v12, ..., v1m1);
V2 = (v21, v22, ..., v2m2);
...
Vk = (vk1, vk2, ..., vkmk);
Step S350, determine the differential dictionary corresponding to V1 as Dt.
4. The data cleaning method according to claim 3, characterized in that N = max(Total × 0.1%, 100), where Total is the total number of records in the data table and max is the maximum function.
5. The data cleaning method according to claim 4, characterized in that, in the step S200, when the text set T is formed, the texts of different records are separated by a separator from a separator library (generally implemented as a punctuation-mark library, or a punctuation-mark and special-character library).
6. The data cleaning method according to claim 5, characterized in that the separator is an exclamation mark.
7. The data cleaning method according to claim 3, characterized in that the step S400 further comprises:
Step S410, segment the text of field F in each record using the separators in a separator library to form a first word vector;
Step S420, segment each first word in the first word vector using the differential dictionary Dt to form word-segmentation text labels and a second word vector;
Step S430, segment each second word in the second word vector using a stop-word dictionary to form a third word vector; if a second word is identical to a stop word, delete that second word from the second word vector; if a second word contains a stop word, split that second word into third words;
Step S440, segment the third word vector using a general dictionary to form a result word vector, which serves as the word-segmentation text labels.
CN201810993857.6A 2018-08-29 2018-08-29 Data cleaning method for CIM Pending CN109522298A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810993857.6A CN109522298A (en) 2018-08-29 2018-08-29 Data cleaning method for CIM

Publications (1)

Publication Number Publication Date
CN109522298A (en) 2019-03-26

Family

ID=65770780

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810993857.6A Pending CN109522298A (en) 2018-08-29 2018-08-29 Data cleaning method for CIM

Country Status (1)

Country Link
CN (1) CN109522298A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825620B1 (en) * 2011-06-13 2014-09-02 A9.Com, Inc. Behavioral word segmentation for use in processing search queries
CN106156002A (en) * 2016-06-30 2016-11-23 乐视控股(北京)有限公司 The system of selection of participle dictionary and system
CN107577713A (en) * 2017-08-03 2018-01-12 国网信通亿力科技有限责任公司 Text handling method based on electric power dictionary
CN108228825A (en) * 2018-01-02 2018-06-29 北京市燃气集团有限责任公司 A kind of station address data cleaning method based on participle

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705728A (en) * 2021-09-18 2021-11-26 全知科技(杭州)有限责任公司 Classified grading list intelligent marking method
CN113705728B (en) * 2021-09-18 2023-08-01 全知科技(杭州)有限责任公司 Classification and classification list intelligent marking method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190326)