CN109614396A

CN109614396A - A cleaning method for address data structuring and standardization

Info

Publication number: CN109614396A
Application number: CN201811543929.3A
Authority: CN
Inventors: 宋才华; 郑爱武; 蓝源娟; 王永才; 吴丽贤
Original assignee: Guangdong Power Grid Co Ltd; Foshan Power Supply Bureau of Guangdong Power Grid Corp
Current assignee: Guangdong Power Grid Co Ltd; Foshan Power Supply Bureau of Guangdong Power Grid Corp
Priority date: 2018-12-17
Filing date: 2018-12-17
Publication date: 2019-04-12

Abstract

The present invention relates to a cleaning method for address data structuring and standardization, comprising the following steps. S1: initialize the original address text. S2: parse the original address hierarchically. S3: match the hierarchically parsed address data against a base address dictionary library. S4: judge whether the matching degree meets the requirement; hierarchically parsed address data whose matching degree meets the requirement are added to the base address dictionary library as cleaning results, and data that do not meet the requirement are returned to S2 for the next cleaning cycle. S5: comprehensively assess the cleaning results with a similarity and consistency evaluation algorithm to confirm their accuracy and validity. The invention effectively improves the completeness and accuracy of customer electricity-use addresses, and plays an important role in improving the accuracy of locating user-reported fault addresses, speeding up emergency repair response, sending notification messages to users in areas affected by power outages, and understanding regional power load demand.

Description

A cleaning method for address data structuring and standardization

Technical field

The present invention relates to the field of address data structuring and standardization, and more particularly to a cleaning method for address data structuring and standardization.

Background art

With urban construction changing rapidly, many streets and residential communities are being replanned and rebuilt, so that more and more customer electricity-use address data in the marketing systems of power supply enterprises no longer match the real addresses. In addition, for historical reasons, the existing customer electricity-use address data contain many errors, inconsistent names, and incomplete entries, for example meter numbers recorded as addresses, or communities and buildings without standard names. Moreover, because the stored customer electricity-use address data are unstructured, the rules used to fill in customer addresses differ between communities, and even between development phases or buildings of the same community. These problems seriously affect the quality of customer service and emergency repair work, and also seriously affect the construction of the various analysis and decision support systems that rely on address data.

Summary of the invention

To overcome the defects of the prior art described above, namely incomplete and inaccurate customer electricity-use addresses, low accuracy in locating user-reported fault addresses, slow emergency repair response, and the inability to send notification messages to users in areas affected by power outages, the present invention provides a cleaning method for address data structuring and standardization.

The method comprises the following steps:

S1: obtain the original customer electricity-use address data stored by the power supply enterprise, and initialize them;

S2: hierarchically parse the initialized original customer electricity-use address data;

S3: match the hierarchically parsed address data against the base address dictionary library;

S4: judge whether the matching degree between the hierarchically parsed address data and the base address dictionary library meets the requirement;

Hierarchically parsed address data whose matching degree meets the requirement are added to the base address dictionary library as cleaning results;

Data whose matching degree does not meet the requirement are returned to S2 and parsed again in the next cleaning cycle, until a cleaning cycle yields no further address data that meet the matching-degree requirement.

S5: construct a similarity and consistency evaluation algorithm, and comprehensively assess the cleaning results.

The method is an iterative process: the results of each cleaning cycle are used to supplement and correct the base address dictionary library, and the supplemented and corrected library then takes part in the next round, until the whole cleaning process is complete.
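The iterative structure of steps S1 to S4 can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration and is not taken from the patent: the function names `match_degree` and `clean_addresses`, whitespace splitting as a stand-in for the hierarchical parsing of S2, the matching degree defined as the share of parsed levels already in the dictionary, and the threshold of 0.6.

```python
def match_degree(levels, dictionary):
    """Share of parsed address levels already present in the dictionary
    (a hypothetical definition of 'matching degree')."""
    if not levels:
        return 0.0
    return sum(level in dictionary for level in levels) / len(levels)


def clean_addresses(raw_addresses, dictionary, threshold=0.6):
    """Repeat parse -> match -> accept until a cycle accepts nothing new."""
    dictionary = set(dictionary)
    # S1: initialization (here only trimming and dropping empty entries)
    pending = [a.strip() for a in raw_addresses if a.strip()]
    cleaned = []
    while pending:
        accepted, rejected = [], []
        for address in pending:
            levels = address.split()  # S2: placeholder for hierarchical parsing
            # S3 + S4: compare with the dictionary and test the threshold
            if match_degree(levels, dictionary) >= threshold:
                accepted.append(levels)
            else:
                rejected.append(address)
        if not accepted:  # this cycle produced no new result: stop
            break
        for levels in accepted:  # results supplement the dictionary
            dictionary.update(levels)
        cleaned.extend(accepted)
        pending = rejected  # rejected data go into the next cycle
    return cleaned, pending, dictionary
```

In this sketch an address rejected in one cycle can be accepted in a later one, because the dictionary has grown in the meantime; the loop ends when a full cycle accepts nothing.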

The invention can structure and standardize all customer electricity-use addresses, unifying the names of administrative regions, subdistrict offices, and communities. That is, every customer electricity-use address is expressed uniformly in the form city + district + street + community + building + house number (where there is no community, the form road + road may be used). This effectively improves the completeness and accuracy of customer electricity-use addresses, and plays an important role in improving the accuracy of locating user-reported fault addresses, speeding up emergency repair response, sending notification messages to users in areas affected by power outages, and understanding regional power load demand.

Preferably, the parsing method of step S2 is a segmentation method based on text features, obtained by improving the traditional word segmentation method.

Preferably, the segmentation method based on text features is as follows: building on the statistics-based segmentation method, the algorithm is extended so that, in addition to document frequency (DF), four further methods are used: information gain (IG), mutual information (MI), the χ² statistic (CHI), and expected cross entropy (CE).

Preferably, in step S5 the similarity and consistency evaluation algorithm is constructed by combining a clustering algorithm, the K-nearest-neighbor algorithm, and the CART classification and regression tree algorithm.

Preferably, information gain predicts the class of an electricity-use address by counting the number of times a feature item does or does not occur in electricity-use addresses. The calculation formula of information gain is as follows:

where P_r(c_i) denotes the probability that class c_i occurs in the sample, and P_r(c_i | t) denotes the probability of each class given that the feature occurs.

The information gain G(t) reflects how much feature t reduces the uncertainty of classification, that is, the amount of information it contributes to classification. In practice, features are ranked by their information gain values, and a feature subset of suitable size is selected according to a set threshold.
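The formula itself does not appear in this text. Assuming it is the standard information gain used for feature selection in text classification, it would read as follows in the symbols defined above, where m is the number of classes; the terms P_r(t) and P_r(t̄), the probabilities that feature t does or does not occur, belong to that assumed standard form:

```latex
G(t) = -\sum_{i=1}^{m} P_r(c_i)\,\log P_r(c_i)
       + P_r(t)\sum_{i=1}^{m} P_r(c_i \mid t)\,\log P_r(c_i \mid t)
       + P_r(\bar{t})\sum_{i=1}^{m} P_r(c_i \mid \bar{t})\,\log P_r(c_i \mid \bar{t})
```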

Preferably, the mutual information value is obtained by calculating the correlation between feature t and class c; the calculation formula is:

where A is the number of times t and c occur together; B is the number of times t occurs and c does not; C is the number of times c occurs and t does not; and N is the total number of electricity-use addresses. If t and c are uncorrelated, I(t, c) is 0. If there are m classes, each t has m values, and taking their average yields the linear ordering needed for feature selection; the larger the average I, the more likely the feature is to be selected.
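The formula is not reproduced in this text. A standard estimate of mutual information that uses exactly the counts A, B, C, and N defined above, and that equals 0 when t and c are independent, is the following (given here as an assumption about the intended formula):

```latex
I(t, c) = \log\frac{P_r(t \wedge c)}{P_r(t)\,P_r(c)}
        \approx \log\frac{A \times N}{(A + C)\,(A + B)}
```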

Preferably, the calculation formula of the χ² statistic can be expressed as:

where t denotes the feature item and c denotes the class.
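The formula is not reproduced in this text. Assuming the usual χ² statistic for feature selection, with A, B, C, and N as defined for mutual information and with an additional count D (an assumption: the number of addresses in which neither t nor c occurs, so that N = A + B + C + D), it would read:

```latex
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A + C)\,(B + D)\,(A + B)\,(C + D)}
```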

Preferably, the calculation formula of expected cross entropy is as follows:

where P_r(c_i | t) and P(c_i) have the same meaning as in information gain. If a term is strongly correlated with an electricity-use address class, that is, P_r(c_i | t) is large while the corresponding class probability is small, then the term has a large influence on classification, its CE value is large, and it is more likely to be selected as a feature item. Expected cross entropy reflects the distance between the probability distribution of text classes and the probability distribution of text classes given that a particular term occurs; the larger the expected cross entropy of term t, the greater its influence on the distribution of text classes.
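The formula is not reproduced in this text. A common definition of expected cross entropy that matches the description above (large when P_r(c_i | t) is large and the class probability is small) is the following; the leading factor P_r(t), the probability that term t occurs, is part of that assumed definition:

```latex
CE(t) = P_r(t)\sum_{i=1}^{m} P_r(c_i \mid t)\,\log\frac{P_r(c_i \mid t)}{P_r(c_i)}
```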

Compared with the prior art, the technical solution of the present invention has the following beneficial effects: it provides power supply enterprises with a cleaning method for address data structuring and standardization that unifies the names of administrative regions, subdistrict offices, and communities, and effectively improves the completeness and accuracy of customer electricity-use addresses. It plays an important role in improving the accuracy of locating user-reported fault addresses, speeding up emergency repair response, sending notification messages to users in areas affected by power outages, and understanding regional power load demand.

Brief description of the drawings

Fig. 1 is a flow chart of the cleaning method for address data structuring and standardization according to the present embodiment.

Detailed description of the embodiments

The drawings are for illustrative purposes only and should not be construed as limiting this patent;

To better illustrate the embodiment, certain components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of an actual product;

Those skilled in the art will understand that certain well-known structures and their descriptions may be omitted from the drawings.

The technical solution of the present invention is further described below with reference to the drawings and the embodiment.

The present embodiment provides a cleaning method for address data structuring and standardization. As shown in Fig. 1, the method comprises the following steps:

S2: building on the statistics-based segmentation method, the algorithm is extended so that, in addition to document frequency (DF), four further methods are used: information gain (IG), mutual information (MI), the χ² statistic (CHI), and expected cross entropy (CE). The traditional segmentation method is thereby improved into a segmentation method based on text features, which is used to hierarchically parse the original customer electricity-use address data stored by the power supply enterprise. The individual methods are described below:

DF (document frequency): here, this can be expressed as electricity-use address frequency. DF is the number of electricity-use addresses in which a given feature item t occurs. This measure of feature importance rests on the assumption that feature items with a small DF have little influence on the classification result; the method therefore keeps feature items with a large DF and rejects those with a small DF.
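As a minimal sketch of DF-based selection, the following Python code counts, for each token, the number of addresses that contain it and keeps the tokens at or above a cutoff. The function names, the token-list input format, and the `min_df` parameter are assumptions made for illustration.

```python
from collections import Counter


def document_frequency(addresses):
    """For each token, count the addresses (token lists) that contain it."""
    df = Counter()
    for tokens in addresses:
        df.update(set(tokens))  # set(): an address counts once per token
    return df


def select_by_df(addresses, min_df):
    """Keep feature items whose DF is at least min_df."""
    df = document_frequency(addresses)
    return {token for token, count in df.items() if count >= min_df}
```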

Information gain (IG): IG predicts the class of an electricity-use address by counting the number of times a feature item does or does not occur in electricity-use addresses. The calculation formula of IG is as follows:

where P_r(c_i) denotes the probability that class c_i occurs in the sample, P_r(c_i | t) denotes the probability of each class given that the feature occurs, and m denotes the number of classes.

The information gain G(t) reflects how much feature t reduces the uncertainty of classification, that is, the amount of information it contributes to classification. In practice, features are ranked by their information gain values, and a feature subset of suitable size is selected according to a set threshold.
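A minimal Python sketch of this ranking quantity follows. It assumes the standard information gain, that is, class entropy minus the entropy remaining once the presence or absence of t is known, with logarithms to base 2; the function name and the input format (token lists plus one class label per address) are hypothetical.

```python
import math


def information_gain(docs, labels, term):
    """G(t) for one term, estimated from labelled token lists."""
    n = len(docs)
    classes = set(labels)
    present = [term in tokens for tokens in docs]
    with_t = [lab for lab, has in zip(labels, present) if has]
    without_t = [lab for lab, has in zip(labels, present) if not has]

    def sum_p_log_p(subset):
        # Sum over classes of p * log2(p) within the subset (0 if empty).
        total = 0.0
        for c in classes:
            p = subset.count(c) / len(subset) if subset else 0.0
            if p > 0:
                total += p * math.log2(p)
        return total

    return (-sum_p_log_p(labels)
            + (len(with_t) / n) * sum_p_log_p(with_t)
            + (len(without_t) / n) * sum_p_log_p(without_t))
```

A term that separates the classes perfectly receives the full class entropy as its gain, while a term spread evenly over the classes receives a gain of zero.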

Mutual information (MI): the mutual information value is obtained by calculating the correlation between feature t and class c. The calculation formula is:

where A is the number of times t and c occur together; B is the number of times t occurs and c does not; C is the number of times c occurs and t does not; and N is the total number of electricity-use addresses. If t and c are uncorrelated, I(t, c) is 0. If there are m classes, each t has m values, and taking their average yields the linear ordering needed for feature selection; a feature with a large average I is more likely to be selected.
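The following Python sketch computes the value from the counts A, B, C, and N. It assumes the common estimate I(t, c) ≈ log(A·N / ((A + C)(A + B))) with base-2 logarithms, and it reads "average" as a plain arithmetic mean over the classes; a mean weighted by class probability is another common choice. The function names are hypothetical.

```python
import math


def mutual_information(a, b, c, n):
    """I(t, c) estimated as log2(A * N / ((A + C) * (A + B)))."""
    if a == 0:
        return float("-inf")  # t and c never occur together
    return math.log2(a * n / ((a + c) * (a + b)))


def mean_mutual_information(per_class_counts, n):
    """Arithmetic mean of I(t, c_i) over classes.

    per_class_counts holds one (A, B, C) tuple per class.
    """
    values = [mutual_information(a, b, c, n) for a, b, c in per_class_counts]
    return sum(values) / len(values)
```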

χ² statistic (CHI): the CHI method is similar in spirit to the MI method; it likewise completes feature extraction by calculating the degree of dependence between feature t and class c. If feature item t and class c are negatively correlated, an electricity-use address containing t is more likely not to belong to c, which is also informative for judging whether an address does not belong to a class. To overcome this defect, CHI calculates the correlation between feature item t and class c with the following formula:
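A minimal Python sketch, assuming the usual χ² statistic N(AD − CB)² / ((A + C)(B + D)(A + B)(C + D)). The count D (addresses in which neither t nor c occurs) and the function name are assumptions; A, B, and C are as defined for mutual information.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of term t and class c from a 2x2 count table.

    a: t and c both occur      b: t occurs, c does not
    c: c occurs, t does not    d: neither occurs (assumed count)
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # a row or column of the table is empty
    return n * (a * d - c * b) ** 2 / denominator
```

Because the difference AD − CB is squared, a negative correlation scores as high as a positive correlation of the same strength, which is the behavior the paragraph above describes.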

Expected cross entropy (CE): expected cross entropy (CE) is defined as follows:

where P_r(c_i | t) and P(c_i) have the same meaning as in information gain. If a term is strongly correlated with an electricity-use address class, that is, P_r(c_i | t) is large while the corresponding class probability is small, then the term has a large influence on classification, its CE value is large, and it is likely to be selected as a feature item. Expected cross entropy reflects the distance between the probability distribution of text classes and the probability distribution of text classes given that a particular term occurs; the larger the expected cross entropy of term t, the greater its influence on the distribution of text classes.
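A minimal Python sketch, assuming the common definition CE(t) = P_r(t) Σ P_r(c_i | t) log(P_r(c_i | t) / P_r(c_i)) with base-2 logarithms; the function name and the input format (token lists plus one class label per address) are hypothetical.

```python
import math


def expected_cross_entropy(docs, labels, term):
    """CE(t) for one term, estimated from labelled token lists."""
    n = len(docs)
    with_t = [lab for tokens, lab in zip(docs, labels) if term in tokens]
    if not with_t:
        return 0.0  # the term never occurs
    p_t = len(with_t) / n
    total = 0.0
    for c in set(labels):
        p_c = labels.count(c) / n                   # P_r(c_i)
        p_c_given_t = with_t.count(c) / len(with_t)  # P_r(c_i | t)
        if p_c_given_t > 0:
            total += p_c_given_t * math.log2(p_c_given_t / p_c)
    return p_t * total
```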

S3: match the hierarchically parsed address data against the base address dictionary library; hierarchically parsed address data whose matching degree meets the requirement are taken as cleaning results and added to the base address dictionary library;

S4: data whose matching degree does not meet the requirement are put into the next cleaning cycle and parsed again, until a cleaning cycle yields no further address data that meet the matching-degree requirement;

S5: combine a clustering algorithm, the K-nearest-neighbor algorithm, and the CART classification and regression tree algorithm to construct a similarity and consistency evaluation algorithm, and comprehensively assess the cleaning results. The individual methods are described below:

Clustering algorithm: in general, electricity-use addresses of the same class are highly similar to one another, while addresses of different classes are less similar. As an unsupervised machine learning method, clustering needs no training process and no manual labeling of text classes in advance, and therefore offers a degree of flexibility and a high capacity for automated processing.

An electricity-use address is a string of characters, words, and digits. Using the vector space model (VSM), the best-known model in information retrieval, an address can be represented as a weighted feature vector D = D(T1, W1; T2, W2; …; Tn, Wn), and the class of a sample to be classified is then determined by calculating the similarity between addresses. When addresses are represented in the vector space model, their similarity can be expressed by the inner product of their feature vectors. Put simply, most addresses can be regarded as consisting of several words; once each word is converted to a weight, each weight is one component of a vector, so an address can be regarded as a vector in n-dimensional space, which is the origin of the vector space model. The weight of each word can be calculated with the TF-IDF weighting technique.
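The vector space representation can be sketched in a few lines of Python. The sketch assumes one common TF-IDF variant (term frequency divided by address length, multiplied by the natural log of N over DF) and uses cosine similarity, which is the inner product of the two vectors after length normalization; the text itself only specifies TF-IDF weights and an inner product, so both choices, like the function names, are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(addresses):
    """Turn token lists into sparse {token: weight} TF-IDF vectors."""
    n = len(addresses)
    df = Counter(token for tokens in addresses for token in set(tokens))
    vectors = []
    for tokens in addresses:
        tf = Counter(tokens)
        vectors.append({
            token: (count / len(tokens)) * math.log(n / df[token])
            for token, count in tf.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Inner product of two sparse vectors, normalized by their lengths."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

A token that occurs in every address receives weight 0 under this variant, so a shared city name alone does not make two addresses similar.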

CART classification and regression tree: a decision-tree method that uses a Gini index estimation function based on minimum distance to determine the expansion shape of the decision tree generated from a data subset. In this method, the key is to examine the Gini impurity of the classification and regression tree for an address sample set. Gini impurity is the probability that a randomly chosen address sample is misclassified within the subset (for example, a customer electricity-use address assigned to the wrong community). It is the probability of a sample being chosen multiplied by the probability of it being misclassified, summed over the classes. When all samples in a node belong to one class, the Gini impurity is zero.
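The Gini impurity of a node can be computed as follows; this is a minimal sketch of the standard definition, the sum over classes of p(1 − p), and the function name is hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """Sum over classes of p * (1 - p), where p is the class share."""
    n = len(labels)
    if n == 0:
        return 0.0
    return sum((count / n) * (1 - count / n)
               for count in Counter(labels).values())
```

For a node in which three of four addresses belong to one community and one belongs to another, the impurity is 0.75 × 0.25 + 0.25 × 0.75 = 0.375.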

K-nearest-neighbor algorithm: its core idea is that if most of the K nearest samples to a sample in feature space belong to a certain class, then the sample also belongs to that class and shares the characteristics of the samples in that class. The key factor when using the K-nearest-neighbor algorithm to assess the consistency of an address sample set is its distance function. The method uses the Minkowski distance formula:

where x_i and y_i are two-dimensional variables and p is a variable parameter.
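A minimal Python sketch of the Minkowski distance in its standard form, d(x, y) = (Σ |x_i − y_i|^p)^(1/p), together with a majority vote over the K nearest samples. The function names, the default values of k and p, and the use of numeric tuples as feature vectors are assumptions made for illustration.

```python
from collections import Counter


def minkowski(x, y, p=2):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def knn_predict(samples, labels, query, k=3, p=2):
    """Class held by most of the k samples nearest to the query."""
    ranked = sorted(zip(samples, labels),
                    key=lambda pair: minkowski(pair[0], query, p))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```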

The terms describing positional relationships in the drawings are for illustration only and should not be construed as limiting this patent;

Obviously, the above embodiment of the present invention is merely an example given to illustrate the invention clearly, and does not limit the embodiments of the invention. Those of ordinary skill in the art may make other changes or variations in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all embodiments here. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A cleaning method for address data structuring and standardization, characterized in that the method comprises the following steps:

S4: judge whether the matching degree between the hierarchically parsed address data and the base address dictionary library meets the requirement;

Hierarchically parsed address data whose matching degree meets the requirement are added to the base address dictionary library as cleaning results;

Data whose matching degree does not meet the requirement are returned to S2 and parsed again in the next cleaning cycle, until a cleaning cycle yields no further address data that meet the matching-degree requirement;

S5: evaluate similarity and consistency, and comprehensively assess the cleaning results.

2. The cleaning method for address data structuring and standardization according to claim 1, characterized in that the parsing method of step S2 is a segmentation method based on text features.

3. The cleaning method for address data structuring and standardization according to claim 2, characterized in that the segmentation method based on text features is as follows: building on the statistics-based segmentation method, the algorithm is extended so that, in addition to document frequency, four further methods are used: information gain, mutual information, the χ² statistic, and expected cross entropy.

4. The cleaning method for address data structuring and standardization according to claim 1, characterized in that in step S5 the similarity and consistency evaluation algorithm is constructed by combining the K-nearest-neighbor algorithm, a clustering algorithm, and the CART classification and regression tree algorithm.

5. The cleaning method for address data structuring and standardization according to claim 3, characterized in that information gain predicts the class of an electricity-use address by counting the number of times a feature item does or does not occur in electricity-use addresses, and the calculation formula of information gain is as follows:

where P_r(c_i) denotes the probability that class c_i occurs in the sample, P_r(c_i | t) denotes the probability of each class given that the feature occurs, and m denotes the number of classes.

6. The cleaning method for address data structuring and standardization according to claim 3, characterized in that the mutual information value is obtained by calculating the correlation between feature t and class c; the calculation formula is:

where A is the number of times t and c occur together; B is the number of times t occurs and c does not; C is the number of times c occurs and t does not; and N is the total number of electricity-use addresses. If t and c are uncorrelated, I(t, c) is 0.

7. The cleaning method for address data structuring and standardization according to claim 3, characterized in that the calculation formula of the χ² statistic can be expressed as:

where t denotes the feature item and c denotes the class; A is the number of times t and c occur together; B is the number of times t occurs and c does not; C is the number of times c occurs and t does not; and N is the total number of electricity-use addresses.

8. The cleaning method for address data structuring and standardization according to claim 3, characterized in that the calculation formula of expected cross entropy is as follows: