CN107025301A - Method for cleaning flight support data - Google Patents


Info

Publication number
CN107025301A
Authority
CN
China
Prior art keywords
data
aircraft gate
record
cleaning
flight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710273945.4A
Other languages
Chinese (zh)
Inventor
金海燕
李喻蒙
秦娟娟
王彬
王磊
黑新宏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Technology
Original Assignee
Xian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xian University of Technology
Priority to CN201710273945.4A
Publication of CN107025301A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24553 Query execution of query operations
    • G06F16/24554 Unary operations; Data partitioning operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Computational Linguistics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for cleaning flight support data, comprising the following steps. Step 1: preprocess the flight support data by first performing attribute cleaning on it to obtain aircraft gate data, and then performing outlier cleaning on the aircraft gate data. Step 2: clean the similar duplicate records in the aircraft gate data. Step 2.1: create sort keys and compute the key value of each aircraft gate record. Step 2.2: sort the aircraft gate data using a neighbor sorting method based on a clustered index. Step 2.3: slide a variable-size window over the sorted data set to detect and remove similar duplicate records. The invention improves the accuracy and completeness of flight support data and the efficiency of detecting similar duplicate records in it.

Description

Method for cleaning flight support data
Technical field
The invention belongs to the technical field of mathematical statistics and data mining, and relates to a method for cleaning flight support data.
Background technology
Domestic research on data cleaning started relatively late but covers a wide range. It has pointed out that the data sources in data cleaning are mainly single-source and multi-source, and has classified the instance-level record errors of both. From the perspective of data quality, it has analyzed the need for an extensible data cleaning framework built on a rule base and a method base. It has also studied cleaning methods and frameworks, including a knowledge-based method for removing similar duplicate records, a configurable data cleaning framework that completes cleaning tasks through a process model combining multiple rules and methods, and an open data cleaning framework that uses a semantic rule base to complete cleaning tasks in a self-learning manner.
Data cleaning applies different methods depending on the application and the data. After the data is classified, the corresponding cleaning methods fall mainly into the following four kinds:
1. Handling missing values. In most cases missing values must be filled in by hand (manual cleaning). Some missing values can be derived from the same data source or from other data sources, so a mean, maximum, minimum, or a more complex probability estimate can replace the missing value.
2. Detecting and handling erroneous values. Possible erroneous or abnormal values are identified by statistical analysis, for example by using analysis of variance to find values that do not follow the distribution or the regression equation. Data values can also be checked against a simple rule base (common-sense rules, business-specific rules, and so on), or detected and cleaned using constraints between attributes and external data.
3. Detecting and handling duplicate records. Records in a database whose attribute values are identical are regarded as duplicate records. Whether records are duplicates is detected by checking whether their attribute values are equal, and duplicates are merged or removed using basic de-duplication methods.
4. Detecting and handling inconsistency. Inconsistency appears mainly within a data source and between data sources, and data integrated from multiple sources may contain semantic conflicts. Integrity constraints can be defined to detect inconsistency, and relationships can be found by analyzing the data, so that the data becomes consistent.
In civil airport operations, flight support data needs to be cleaned and optimized. Although many general data cleaning theories and frameworks exist, the particularity and confidentiality of this business domain, together with the huge volume of flight support data and the large amount of information it contains, make detecting similar duplicate records in flight support data difficult and make the cleaning and optimization workload heavy.
Summary of the invention
The object of the invention is to provide a method for cleaning flight support data that improves the accuracy and completeness of flight support data and the efficiency of detecting similar duplicate records in it.
The technical solution adopted by the invention is a method for cleaning flight support data, comprising the following steps:
Step 1: preprocess the flight support data;
First perform attribute cleaning on the flight support data to obtain aircraft gate data, then perform outlier cleaning on the aircraft gate data;
Step 2: clean the similar duplicate records in the aircraft gate data:
Step 2.1: create sort keys and compute the key value of each aircraft gate record;
Step 2.2: sort the aircraft gate data using a neighbor sorting method based on a clustered index;
Step 2.3: slide a variable-size window over the sorted data set to detect and remove similar duplicate records in the aircraft gate data.
In step 1, attribute cleaning of the flight support data is divided into:
(1) Handling data unrelated to aircraft gate information: such data is deleted or not extracted;
(2) Handling missing values in the aircraft gate data: records with missing values are divided into those missing a primary attribute and those missing a non-primary attribute; records missing a primary attribute are discarded, and missing non-primary attributes are re-acquired from the data source or derived;
(3) Handling data in the aircraft gate data that violates business rules: the data is checked against the data source and re-acquired;
(4) Handling data in the aircraft gate data where the same attribute is expressed in different forms: a single form of expression is set.
In step 1, the box plot method is used to identify and remove abnormal values in the aircraft gate data. The detailed process is:
All aircraft gate data to be cleaned form data set A. Data set A is divided into α × n intervals, where n is the number of intervals, α is the number of aircraft gate data items in each interval, and β is the interval size:
Here, all aircraft gate data in each interval form one data set, and Dn denotes the data set numbered n;
The distribution of the aircraft gate data is analyzed to obtain the concentration domain [i-j, i+j] of data set A, where i-j is the minimum-value data set, Min{D1, D2, ..., Dn}, and i+j is the maximum-value data set, Max{D1, D2, ..., Dn}. [i-j, i+j] is taken as the initial data group. Outliers are removed from the initial data group to obtain the non-outlier data group [Q1 - 3 × IQR, Q3 + 3 × IQR]; the non-abnormal data is then taken from [Q1 - 3 × IQR, Q3 + 3 × IQR] to obtain the target data group [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], which is set as data set B. Here Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range.
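The two box plot fences can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names are hypothetical, the quartiles come from the standard library's `statistics.quantiles` with its default method (the source does not say how quartiles are computed), and the same Q1 and Q3 are assumed for both fences.

```python
from statistics import quantiles


def fences(q1, q3, k):
    """Return the interval [Q1 - k*IQR, Q3 + k*IQR] as a (low, high) tuple."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def box_plot_clean(values):
    """Keep only the values inside the 1.5*IQR fences (data set B).

    Assumption: quartiles are computed once, on the initial data group.
    """
    q1, _median, q3 = quantiles(values, n=4)
    low3, high3 = fences(q1, q3, 3.0)      # non-outlier data group
    non_outliers = [v for v in values if low3 <= v <= high3]
    low, high = fences(q1, q3, 1.5)        # target data group
    return [v for v in non_outliers if low <= v <= high]
```

As a consistency check, `fences(5.75, 11.25, 3.0)` and `fences(5.75, 11.25, 1.5)` give the near-gate intervals [-10.75, 27.75] and [-2.5, 19.5] reported in the embodiment; the quartile values 5.75 and 11.25 are back-computed from those intervals and are not stated in the source.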
The detailed process of step 2.1 is:
Different attributes of the aircraft gate data are extracted as different sort keys. According to the sort keys, a field value is computed for each field of the aircraft gate data in data set B, which yields the key value of each aircraft gate record; the key value of an aircraft gate record is the set of field values in that record.
Step 2.2 is specifically:
A clustered index is built on data set B, and the aircraft gate data in data set B is arranged according to its key values so that similar duplicate records are placed in adjacent positions, giving data set C.
The detailed process of step 2.3 is:
Each data item in data set C constitutes one record. A variable-size window is slid over data set C using a first-in, first-out strategy. If the records in the current window are records 1 to N, the next record to enter the window is record N+1. Record N+1 is matched for similarity, one by one, against records 2 to N in the window to determine whether it is a duplicate record. If it is a duplicate, it is removed; if not, the window continues to slide downward until similarity matching has been completed for all records in data set C.
In step 2.3, the detailed process of similarity matching is:
Field weights are set. Several experts independently score the weight of each field, and the mean of the scores for a field is taken as that field's weight. The weighted value of a field = field weight × field value, and the weight of a record is the sum of the weighted values of all fields in the record;
During similarity matching, the weights of the two records to be matched are computed separately and summed to obtain the similarity M of the two records. M is compared with a preset similarity threshold N. If M is greater than N, the record that entered the window later is a duplicate record; otherwise the two are treated as different records.
In step 2.3, the window size is determined by the usage frequency of the aircraft gate: the average usage frequency Mean and the maximum usage frequency Max of the aircraft gate are counted, and (Mean+Max)/2 is used as the window size.
The beneficial effects of the invention are as follows. The attribute cleaning method and the outlier detection and removal method used in the preprocessing stage improve the accuracy and completeness of the flight support data set; the marked improvement in loading speed after preprocessing results from the increased proportion of valid data in the cleaned data set. The sorting method is improved by introducing a clustered index into the neighbor sorting method, which improves sorting efficiency and places similar duplicate records in neighboring positions. A variable-size window, whose size is determined by the usage frequency of the aircraft gate, is slid over the data to detect and remove similar duplicate records. Because similar duplicate records have been placed in the same window as far as possible, unnecessary comparisons are reduced without lowering the efficiency of finding duplicate records, and the number of duplicate records detected in equal time increases, which improves detection efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of data cleaning;
Fig. 2 shows the distribution of the aircraft gate data;
Fig. 3 is the box plot of the concentration domain of the remote-gate data;
Fig. 4 is the box plot of the concentration domain of the near-gate data;
Fig. 5 is a flowchart of sorting with the neighbor sorting method based on a clustered index;
Fig. 6 is a schematic diagram of sliding the variable-size window;
Fig. 7 is a flowchart of similarity matching;
Fig. 8 compares the loading time before and after data cleaning;
Fig. 9 compares the number of similar records detected by different methods in equal time.
Embodiment
As shown in Fig. 1, the method for cleaning flight support data is built on an analysis of the characteristics of civil airport flight support data. It carries out the tests required for detecting similar duplicate records in flight support data, adjusts and refines existing data cleaning methods accordingly, and determines the data cleaning rules and methods, so as to purify and optimize the flight support data and provide high-quality data for subsequent research.
Taking the flight support data of Lanzhou Zhongchuan Airport for 2015 and 2016 as an example, the invention is described in detail below with reference to the drawings and specific embodiments:
The method for cleaning flight support data comprises the following steps:
Step 1: preprocess the flight support data;
First perform attribute cleaning on the flight support data to obtain aircraft gate data, then perform outlier cleaning on the aircraft gate data;
Step 2: clean the similar duplicate records in the aircraft gate data:
Step 2.1: create sort keys and compute the key value of each aircraft gate record;
Step 2.2: sort the aircraft gate data using a neighbor sorting method based on a clustered index;
Step 2.3: slide a variable-size window over the sorted data set to detect and remove similar duplicate records in the aircraft gate data.
In step 1, attribute cleaning of the flight support data is divided into:
(1) Handling data unrelated to aircraft gate information: for example, flight altitude, aircraft wingspan, route, waypoints, and flight time are unrelated to aircraft gate information and are deleted or not extracted;
(2) Handling missing values in the aircraft gate data: records with missing values are divided into those missing a primary attribute and those missing a non-primary attribute. A missing primary attribute seriously affects the real-time state of the aircraft gate, and the system does not allow primary attributes to be missing, so a record missing a primary attribute is regarded as erroneous and is discarded. A missing non-primary attribute has less effect on the real-time state of the aircraft gate but violates the integrity rules of the data, so it is re-acquired from the data source or derived;
(3) Handling data in the aircraft gate data that violates business rules: a violation means that an attribute value, or the relationship between attribute values, breaks the business rules of the civil airport, for example a flight that has no takeoff time at the previous station but has a landing time at this station, or that has no departure time at this station but has a landing time at the next station. Such data is checked against the data source and re-acquired;
(4) Handling data in the aircraft gate data where the same attribute is expressed in different forms: attribute values are represented differently by different units or departments. For example, the arrival state may be represented as YES/NO or as arrival/cancellation. Data in different forms is unified by setting a single form of expression.
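The four attribute cleaning rules above can be sketched as a single pass over the records. This is an illustrative sketch only: every field name (`gate`, `actual_landing`, `prev_departure`, `arrival_state`, and the unrelated fields), the choice of primary attributes, and the unified values `arrived`/`cancelled` are assumptions made for the example, and records that need re-acquisition are merely set aside because the source does not describe the data source interface.

```python
# Hypothetical field names; the source does not list the actual schema.
UNRELATED_FIELDS = {"flight_altitude", "wingspan", "route", "waypoints", "flight_time"}
PRIMARY_FIELDS = ("gate", "actual_landing")  # assumed primary attributes
ARRIVAL_FORMS = {"YES": "arrived", "arrival": "arrived",
                 "NO": "cancelled", "cancellation": "cancelled"}


def clean_attributes(records):
    """Return (clean, recheck): usable records and records to re-acquire."""
    clean, recheck = [], []
    for rec in records:
        # Rule (1): drop fields unrelated to aircraft gate information.
        row = {k: v for k, v in rec.items() if k not in UNRELATED_FIELDS}
        # Rule (2a): discard records missing a primary attribute.
        if any(row.get(name) is None for name in PRIMARY_FIELDS):
            continue
        # Rule (4): unify the different forms of the arrival state.
        state = row.get("arrival_state")
        if state is not None:
            row["arrival_state"] = ARRIVAL_FORMS.get(state, state)
        # Rule (2b): a missing non-primary attribute must be re-acquired.
        missing = any(value is None for value in row.values())
        # Rule (3): a landing time without a previous-station takeoff time.
        rule_broken = row.get("prev_departure") is None
        (recheck if missing or rule_broken else clean).append(row)
    return clean, recheck


records = [
    {"gate": "101", "actual_landing": "08:10", "prev_departure": "06:00",
     "arrival_state": "YES", "wingspan": 35.8},
    {"gate": None, "actual_landing": "09:00", "prev_departure": "07:00",
     "arrival_state": "NO"},
    {"gate": "102", "actual_landing": "10:00", "prev_departure": None,
     "arrival_state": "arrival"},
]
clean, recheck = clean_attributes(records)
```

In this sample the first record is kept with its wingspan field removed, the second is discarded for lacking a gate, and the third is set aside for checking against the data source.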
In practice, aircraft gate data is significantly affected by abnormal values. To eliminate their influence on the data as a whole, abnormal values must be identified and removed so that the resulting data set matches the actual conditions of the airport's aircraft gates.
At present, abnormal values are identified and removed mainly by two methods: physical discrimination and statistical discrimination. Physical discrimination relies on existing knowledge of the objective situation to recognize measured values that deviate from normal results because of external interference, human error, and similar causes; such values are identified and removed as they arise during the experiment. Statistical discrimination sets a confidence probability and determines a confidence limit; any error beyond this limit is considered outside the range of random error and is treated as an abnormal value and removed. When physical discrimination is difficult to apply, statistical discrimination is generally used.
Based on the distribution of flight support data, the invention uses the box plot method, a statistical discrimination method, to identify and remove abnormal values in the aircraft gate data.
In step 1, the box plot method is used to identify and remove abnormal values in the aircraft gate data. The detailed process is:
The aircraft gate data to be cleaned forms data set A, as shown in Table 1. Data set A is divided into α × n intervals, where n is the number of intervals, α is the number of aircraft gate data items in each interval, and β is the interval size:
Here, all aircraft gate data in each interval form one data set, and Dn denotes the data set numbered n;
Table 1. Aircraft gate data in the flight support data
Sequence number  1   2   3   4   .....  n-1   n
Data             D1  D2  D3  D4  .....  Dn-1  Dn
For a data source whose dispersion is not especially large, the data is generally concentrated in a specific region. The distribution of the aircraft gate data is analyzed, as shown in Fig. 2, to obtain the concentration domain [i-j, i+j] of data set A, where i-j is the minimum-value data set, Min{D1, D2, ..., Dn}, and i+j is the maximum-value data set, Max{D1, D2, ..., Dn};
Take one group of aircraft gate data as an example, as shown in Table 2. If the actual interval of the aircraft gates is computed directly, the interval of the remote-gate data set A1 is [70, 500] and the interval of the near-gate data set A2 is [-500, 60]. This result does not match actual conditions, which shows that the aircraft gate data set is significantly affected by abnormal values and that these values must be identified and removed.
Table 2. Aircraft gate data (gate numbers) in the flight support data
First, data set A is divided into 1000 intervals, which gives a concentration domain of [70, 160] for the remote-gate data set A1 and [-9, 60] for the near-gate data set A2. Box plot analysis is then applied to the aircraft gate data whose values lie in the concentration domain of A1, giving the box plot shown in Fig. 3, and to the aircraft gate data whose values lie in the concentration domain of A2, giving the box plot shown in Fig. 4.
From the box plot results, the non-outlier data group of the remote-gate data set A1 is [85, 134] and that of the near-gate data set A2 is [-10.75, 27.75]. A further calculation gives a non-abnormal data group of [95.5, 116.5] for the remote-gate data set A1 and [-2.5, 19.5] for the near-gate data set A2. These results match the actual conditions of the airport's aircraft gates. Identifying and removing abnormal values by combining the distribution of the aircraft gate data with the box plot method is therefore fast and effective, and is an important step in cleaning flight support data.
The detailed process of step 2.1 is: the working habits of airport operators and the importance of each keyword are analyzed, and different attributes of the aircraft gate data are extracted as different sort keys, which together form a sort key combination. Take the following sort key combination as an example:
Key Com = {Gate = aircraft gate, Plan LT = planned landing time at this station, Actual LT = actual landing time at this station, Plan DT = planned departure time at this station, Actual DT = actual departure time at this station};
According to the sort keys, a field value is computed for each field of the aircraft gate data in data set B, which yields the key value of each aircraft gate record; the key value of an aircraft gate record is the set of field values in that record.
Step 2.2 is specifically:
A clustered index is built on data set B, and the aircraft gate data in data set B is arranged by neighbor sorting according to its key values so that similar duplicate records are placed in adjacent positions, giving data set C. As shown in Fig. 5, sorting is performed three times in this embodiment and the three result sets are compared; any inconsistent parts are sorted again to obtain the final result set, which guards against accidental errors from a single sort.
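Steps 2.1 and 2.2 can be sketched together in Python. This sketch makes several assumptions that are not in the source: the dictionary field names are hypothetical stand-ins for the Key Com attributes, field values are normalized by trimming and upper-casing, and an in-memory `sorted` call stands in for the ordering that a database clustered index would provide.

```python
# Hypothetical field names standing in for the Key Com attributes.
SORT_KEYS = ("gate", "plan_lt", "actual_lt", "plan_dt", "actual_dt")


def key_value(record, sort_keys=SORT_KEYS):
    """Key value of a record: the tuple of its normalized sort-key fields."""
    return tuple(str(record.get(name, "")).strip().upper() for name in sort_keys)


def neighbor_sort(records, sort_keys=SORT_KEYS):
    """Order records by key value so similar records end up adjacent."""
    return sorted(records, key=lambda rec: key_value(rec, sort_keys))


data_set_b = [
    {"gate": "102", "plan_lt": "09:00", "actual_lt": "09:05",
     "plan_dt": "10:00", "actual_dt": "10:10"},
    {"gate": "101", "plan_lt": "08:00", "actual_lt": "08:10",
     "plan_dt": "09:00", "actual_dt": "09:05"},
    {"gate": " 102", "plan_lt": "09:00", "actual_lt": "09:05",
     "plan_dt": "10:00", "actual_dt": "10:10"},
]
data_set_c = neighbor_sort(data_set_b)
```

After sorting, the two gate 102 records sit next to each other, so a small window is enough to compare them.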
The detailed process of step 2.3 is:
Each data item in data set C constitutes one record. A variable-size window is slid over data set C, as shown in Fig. 6, using a first-in, first-out strategy. If the records in the current window are records 1 to N, the next record to enter the window is record N+1. Record N+1 is matched for similarity, one by one, against records 2 to N in the window to determine whether it is a duplicate record. If it is a duplicate, it is removed; if not, the window continues to slide downward until similarity matching has been completed for all records in data set C.
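The first-in, first-out window can be sketched with a bounded `collections.deque`. The function name and the `is_duplicate` callback are hypothetical. The sketch also assumes that a rejected record does not occupy a window slot, that records in the first window are compared with each other, and that a window of size 1 still compares a record with its immediate predecessor; the source does not settle these points.

```python
from collections import deque


def sliding_window_dedup(sorted_records, size, is_duplicate):
    """Drop each record that duplicates one of the previous size-1 kept records."""
    # A window of N records leaves N-1 neighbors once the oldest has slid out.
    window = deque(maxlen=max(size - 1, 1))
    kept = []
    for record in sorted_records:
        if any(is_duplicate(previous, record) for previous in window):
            continue  # the record entering the window later is rejected
        kept.append(record)
        window.append(record)  # the oldest record leaves automatically (FIFO)
    return kept


def same(a, b):
    """Toy similarity test used only for this example."""
    return a == b


wide = sliding_window_dedup(["a", "a", "b", "a", "c", "c"], 3, same)
narrow = sliding_window_dedup(["a", "a", "b", "a", "c", "c"], 2, same)
```

Here `wide` is `['a', 'b', 'c']` while `narrow` is `['a', 'b', 'a', 'c']`: with the smaller window the third "a" is no longer compared with the first one, which illustrates why too small a window misses duplicates.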
As shown in Fig. 7, the detailed process of similarity matching in step 2.3 is:
Field weights are set to describe accurately how strongly each field influences changes in aircraft gate state; different weights are assigned according to the importance of each field in the data set. Commonly used methods are: 1. the subjective experience method; 2. the primary-and-secondary index ranking method; 3. the expert scoring method. The invention sets field weights by expert scoring: several experts independently score the weight of each field, and the mean of the scores for a field is taken as that field's weight. The weighted value of a field = field weight × field value, and the weight of a record is the sum of the weighted values of all fields in the record. During similarity matching, the weights of the two records to be matched are computed separately and summed to obtain the similarity M of the two records. M is compared with a preset similarity threshold N. If M is greater than N, the record that entered the window later is a duplicate record; otherwise the two are treated as different records.
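The expert-scored weights and the threshold test can be sketched as follows. The source does not define how a field value is derived when two records are compared, so this sketch assumes one possible reading: the field value is 1 when the two records agree on a field and 0 otherwise, and the similarity M is the weighted sum of those values. All function names, field names, scores, and the threshold are hypothetical.

```python
from statistics import mean


def field_weights(expert_scores):
    """Average the per-field scores given independently by several experts."""
    fields = expert_scores[0].keys()
    return {name: mean(scores[name] for scores in expert_scores) for name in fields}


def similarity(record_a, record_b, weights):
    """Similarity M: sum of weight * field value over all weighted fields.

    Assumption: the field value is 1.0 for an exact match and 0.0 otherwise.
    """
    return sum(
        weight * (1.0 if record_a.get(name) == record_b.get(name) else 0.0)
        for name, weight in weights.items()
    )


def is_duplicate(record_a, record_b, weights, threshold):
    """The later record counts as a duplicate when M exceeds the threshold."""
    return similarity(record_a, record_b, weights) > threshold


weights = field_weights([
    {"gate": 5, "actual_lt": 3, "actual_dt": 2},   # expert 1
    {"gate": 3, "actual_lt": 5, "actual_dt": 2},   # expert 2
])
first = {"gate": "101", "actual_lt": "08:10", "actual_dt": "09:05"}
second = {"gate": "101", "actual_lt": "08:10", "actual_dt": "09:07"}
third = {"gate": "115", "actual_lt": "11:40", "actual_dt": "09:05"}
```

With these sample scores the averaged weights are 4, 4, and 2. `first` and `second` agree on gate and landing time, so M is 8 and exceeds a threshold of 7; `first` and `third` agree only on departure time, so M is 2 and they are treated as different records.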
In step 2.3, the window size is determined by the usage frequency of the aircraft gate. When the window is large, the number of comparisons increases and some of them are unnecessary; when the window is small, matches between duplicate data may be missed. As shown in Table 3, the average monthly usage frequency Mean of each aircraft gate is counted from the support data of Lanzhou Zhongchuan Airport for the second half of 2015 and the first half of 2016.
Table 3. Monthly usage frequency of each aircraft gate (n M/D: daily average in month n; -: not in use)
As shown in Table 4, the average monthly usage frequency Mean of each aircraft gate is rounded up, the maximum usage frequency Max over the 12 months is computed, and the average of the two is used as the window size.
Table 4. Basis for the sliding window size of each aircraft gate (average~maximum)
Gate   Near 101   Near 102   Near 103   Near 104   Near 105   Near 106   Near 107   Near 108   Near 109   Near 110
Range  4~6        4~6        4~6        4~6        4~5        3~5        4~6        4~6        4~7        4~6
Gate   Near 111   Near 112   Near 113   Near 114   Near 115   Remote 1   Remote 2   Remote 3   Remote 4   Remote 5
Range  4~7        4~6        4~6        4~6        4~6        1~2        1~2        2~2        1~2        2~2
Gate   Remote 6   Remote 7   Remote 8   Remote 9   Remote 10  Remote 11  Remote 12  Remote 13  Remote 14  Remote 15
Range  2~2        2~2        2~3        2~2        1~2        2~2        2~2        1~2        1~1        1~1
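The window size rule (Mean+Max)/2 can be sketched as a one-line function applied to average~maximum ranges such as those in Table 4. The function name is hypothetical, and rounding a fractional result up to the next integer is an assumption; the source does not say how a non-integer average is handled.

```python
import math


def window_size(mean_frequency, max_frequency):
    """Window size as the average of Mean and Max, rounded up (assumption)."""
    return math.ceil((mean_frequency + max_frequency) / 2)


# Ranges of the kind listed in Table 4, written as (average, maximum).
sizes = [window_size(4, 6), window_size(4, 7), window_size(1, 2), window_size(1, 1)]
```

For these ranges the sketch gives window sizes of 5, 6, 2, and 1, so busy near gates get a wider window than rarely used remote gates.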
The evaluation criteria for data cleaning quality include consistency, completeness, availability, and efficiency. The invention mainly considers the speed of data cleaning and the degree to which duplicate records are cleaned; duplicate record cleaning is measured mainly by the false recognition rate and the accuracy rate. As shown in Table 5, taking 4 groups of data as examples, loading is faster after cleaning than before, and the number of duplicate records detected in equal time increases substantially.
Table 5. Comparison of loading indicators before and after cleaning
From the actual support records of Lanzhou Zhongchuan Airport for 2015, 3000 support records per month were subjected to attribute cleaning, outlier cleaning, and detection and removal of similar duplicate records, and the data loading time was measured before and after cleaning. As shown in Fig. 8, the loading time is greatly shortened after data cleaning.
Similarity detection was performed on the support records using the neighbor sorting algorithm based on a clustered index. The number of similar records detected in equal time was compared with the result of direct retrieval; the comparison is shown in Fig. 9, and the accuracy rate is used to measure the effect of similar duplicate record detection.
In the manner described above, the method for cleaning flight support data of the invention improves the accuracy and completeness of flight support data and the efficiency of detecting similar duplicate records in it.

Claims (8)

1. flight ensures the method for cleaning of data, it is characterised in that comprise the following steps:
Step 1, to flight ensure data pre-process;
Data, which carry out attribute cleaning, to be ensured to flight first, aircraft gate data are obtained, then it is clear to aircraft gate data progress exceptional value Wash;
Step 2, the duplicated records to aircraft gate data are cleaned:
Step 2.1, the key assignments for creating sort key and calculating aircraft gate data;
Step 2.2, according to neighbour's sort method based on clustered index, aircraft gate data are ranked up;
The window of variable-size is slided in step 2.3, the data set after sequence, the duplicated records to aircraft gate data are entered Row is detected and cleaned.
2. flight according to claim 1 ensures the method for cleaning of data, it is characterised in that in the step 1, to flight Ensure that data carry out attribute cleaning and are specifically divided into:
(1) the processing pair data unrelated with aircraft gate information:Deleted or not extracted;
(2) to the processing of missing Value Data in the data of aircraft gate:Missing values data include primary attribute missing data and nonprime attribute Missing data, primary attribute missing data is abandoned, and nonprime attribute missing data is reacquired or be derived from from data source;
(3) to the processing for the data that business rule is violated in the data of aircraft gate:By being proofreaded with data source, reacquire;
(4) to the processing of the data of same attribute different expression form in the data of aircraft gate:Set unique form of expression.
3. flight according to claim 1 ensures the method for cleaning of data, it is characterised in that in the step 1, using case Type figure method judges and rejects the exceptional value in the data of aircraft gate, and detailed process is:
All aircraft gate data for clearance are set to data set A, data set A is divided into α × n interval, n is interval Number, α is the number of aircraft gate data in each interval, and β is interval size:
Wherein, all aircraft gate data in each interval constitute a data set, DnRepresent the data set that numbering is n;
The distribution characteristics of aircraft gate data is analyzed, domain [i-j, i+j] in data set A data set is obtained, wherein, i-j is minimum Value Data collection, i.e. Min { D1, D2..., Dn, i+j is maximum value data collection, i.e. Max { D1, D2..., Dn};[i-j, i+j] is set For primary data group, outlier is rejected to primary data group, non-Outlier Data group [Q is obtained1- 3 × IQR, Q3+ 3 × IQR], it is right [Q1- 3 × IQR, Q3+ 3 × IQR] negated abnormal data group, obtain target data set [Q1- 1.5 × IQR, Q3+ 1.5 × IQR], will Target data set is set to data set B, wherein Q1Represent the first quantile, Q3The 3rd quantile is represented, IQR represents quartile spacing IQR=Q3-Q1
4. flight according to claim 3 ensures the method for cleaning of data, it is characterised in that the step 2.1 it is specific Process is:
The different attributes for extracting aircraft gate data are used as different sort keys;According to sort key in data set B Aircraft gate data each field calculated field value, so as to obtain the key assignments of aircraft gate data, the key assignments of aircraft gate data, i.e., For the set of field value in the aircraft gate data.
5. The method for cleaning flight support data according to claim 4, characterized in that step 2.2 is specifically:
A clustered index is built on data set B, and the aircraft gate data in data set B are arranged by proximity according to their key values, so that duplicate records are placed in adjacent positions, yielding data set C.
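Steps 2.1 and 2.2 can be sketched together in Python as follows. This is only an illustration under stated assumptions: the field names `gate` and `flight` are hypothetical, the normalisation (trimming and upper-casing) stands in for the unspecified "field value" calculation, and an in-memory sort stands in for the clustered index.

```python
def record_key(record, sort_fields):
    """Key value of a record: the tuple of its field values in sort-key order."""
    # Assumption: a field value is the trimmed, upper-cased text of the field.
    return tuple(str(record.get(f, "")).strip().upper() for f in sort_fields)

def arrange_neighbours(records, sort_fields):
    """Sort records by key value so that likely duplicates become adjacent."""
    return sorted(records, key=lambda r: record_key(r, sort_fields))

# Hypothetical records standing in for data set B.
data_set_b = [
    {"gate": "A12", "flight": "CA123"},
    {"gate": "B03", "flight": "MU456"},
    {"gate": "a12 ", "flight": "ca123"},
]
data_set_c = arrange_neighbours(data_set_b, ["gate", "flight"])
```

After sorting, the two spellings of gate A12 sit next to each other, which is what allows a later windowed comparison to find them.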
6. The method for cleaning flight support data according to claim 5, characterized in that the specific process of step 2.3 is:
Each data item in data set C constitutes one record. A window of variable size is slid over data set C, using a first-in-first-out strategy while sliding. During sliding, if the records in the current window are records 1 to N, the next record to enter the window is record N+1; record N+1 is similarity-matched one by one against records 2 to N in the window to detect whether record N+1 is a duplicate record. If it is a duplicate record, it is rejected; if it is not, the window continues to slide downward until similarity matching has been completed for all records in data set C.
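The sliding-window pass can be sketched as below. This is a simplified illustration, not the patented procedure: the window size is fixed for the whole pass, rejected records are assumed not to occupy a window slot, and `is_duplicate` is a hypothetical callback standing in for the similarity matching.

```python
from collections import deque

def sliding_window_dedup(records, window_size, is_duplicate):
    """Drop records that match one of the previous window_size - 1 kept records."""
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    # The window of size N includes the incoming record, so only the N - 1
    # most recent kept records are compared; deque(maxlen=...) gives FIFO eviction.
    window = deque(maxlen=window_size - 1)
    kept = []
    for rec in records:
        if any(is_duplicate(prev, rec) for prev in window):
            continue  # the record that entered later is rejected
        kept.append(rec)
        window.append(rec)
    return kept

same = lambda a, b: a == b
sorted_records = ["A", "A", "B", "C", "A"]
narrow = sliding_window_dedup(sorted_records, 3, same)
wide = sliding_window_dedup(sorted_records, 4, same)
```

With a window of 3 the final "A" is no longer compared with the first one and survives, while a window of 4 still catches it. This is why the preceding sort matters: duplicates are only found if they fall inside the same window.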
7. The method for cleaning flight support data according to claim 6, characterized in that the specific process of similarity matching in step 2.3 is:
Field weights are set: several experts independently score the weight of each field, and the average of the scores given to a field is taken as the field weight of that field. Weighted field value = field weight × field value; the weighted value of a record is the sum of the weighted field values of all fields in the record;
During similarity matching, the weighted values of the two records to be matched are calculated separately and added together to obtain the similarity M of the two records. M is compared with a preset similarity threshold N; if M is greater than N, the record of the two that entered the window later is a duplicate record; otherwise the two records are regarded as different records.
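The arithmetic of this claim can be followed in a short Python sketch that takes the wording literally. Everything concrete here is an assumption for illustration: the field names, the expert scores, the numeric field values and the threshold are hypothetical, and the claim does not say how non-numeric fields are turned into field values.

```python
def field_weights(expert_scores):
    """Average the experts' scores per field to get each field weight."""
    fields = expert_scores[0].keys()
    return {f: sum(s[f] for s in expert_scores) / len(expert_scores) for f in fields}

def record_weight(field_values, weights):
    """Weighted value of a record: sum of field weight * field value."""
    return sum(weights[f] * field_values[f] for f in weights)

def is_duplicate(rec_a, rec_b, weights, threshold):
    """Similarity M is the sum of both weighted values, compared with the threshold."""
    m = record_weight(rec_a, weights) + record_weight(rec_b, weights)
    return m > threshold

# Two hypothetical experts score two hypothetical fields.
scores = [{"gate": 1, "flight": 4}, {"gate": 3, "flight": 2}]
weights = field_weights(scores)            # gate: 2.0, flight: 3.0
rec_a = {"gate": 1, "flight": 2}           # weighted value 2*1 + 3*2 = 8
rec_b = {"gate": 2, "flight": 1}           # weighted value 2*2 + 3*1 = 7
```

Here M = 8 + 7 = 15, so with a threshold of 10 the later record would be treated as a duplicate, and with a threshold of 20 the two records would be treated as different.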
8. The method for cleaning flight support data according to claim 1, characterized in that in step 2.3 the window size is driven by the usage frequency of the aircraft gates: the average usage frequency Mean of the aircraft gates and the maximum usage frequency Max of the aircraft gates are counted, and (Mean + Max)/2 is used as the window size.
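The window-size rule (Mean + Max)/2 can be worked through in a few lines of Python. The sketch assumes that the usage frequency of a gate is the number of records referring to it, and that a fractional result is rounded up; neither detail is stated in the claim, and the gate identifiers are hypothetical.

```python
import math
from collections import Counter

def window_size_from_usage(gate_ids):
    """Window size = (mean usage frequency + max usage frequency) / 2."""
    counts = Counter(gate_ids)                     # usage frequency per gate
    mean_freq = sum(counts.values()) / len(counts)
    max_freq = max(counts.values())
    # Assumption: round up so the window is a whole number of records.
    return math.ceil((mean_freq + max_freq) / 2)

# Hypothetical usage: gate A1 appears 5 times, B2 3 times, C3 once.
gate_ids = ["A1"] * 5 + ["B2"] * 3 + ["C3"]
size = window_size_from_usage(gate_ids)
```

For this sample Mean = 3 and Max = 5, giving a window size of 4.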
CN201710273945.4A 2017-04-25 2017-04-25 Flight ensures the method for cleaning of data Pending CN107025301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710273945.4A CN107025301A (en) 2017-04-25 2017-04-25 Flight ensures the method for cleaning of data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710273945.4A CN107025301A (en) 2017-04-25 2017-04-25 Flight ensures the method for cleaning of data

Publications (1)

Publication Number Publication Date
CN107025301A (en) 2017-08-08

Family

ID=59527900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710273945.4A Pending CN107025301A (en) 2017-04-25 2017-04-25 Flight ensures the method for cleaning of data

Country Status (1)

Country Link
CN (1) CN107025301A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763346A (en) * 2018-05-15 2018-11-06 中南大学 A kind of abnormal point processing method of sliding window box figure medium filtering
CN109727446A (en) * 2019-01-15 2019-05-07 华北电力大学(保定) A kind of identification and processing method of electricity consumption data exceptional value
CN109918367A (en) * 2019-03-19 2019-06-21 北京百度网讯科技有限公司 A kind of cleaning method of structural data, device, electronic equipment and storage medium
CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method
CN110737640A (en) * 2019-10-12 2020-01-31 齐鲁工业大学 data quality improving method and system based on distributed system
CN111104398A (en) * 2019-12-17 2020-05-05 智慧航海(青岛)科技有限公司 Detection method and elimination method for approximate repeated record of intelligent ship
CN112416920A (en) * 2020-12-01 2021-02-26 北京理工大学 MES-oriented data cleaning method and system
CN114999156A (en) * 2022-05-27 2022-09-02 北京汽车研究总院有限公司 Automatic identification method and device for crossing scene of pedestrian in front of vehicle, medium and vehicle

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110055252A1 (en) * 2003-03-28 2011-03-03 Dun & Bradstreet, Inc. System and method for data cleansing
CN104699796A (en) * 2015-03-18 2015-06-10 浪潮集团有限公司 Data cleaning method based on data warehouse
CN106055613A (en) * 2016-05-26 2016-10-26 华东理工大学 Cleaning method for data classification and training databases based on mixed norm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
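Yang Hongna: "Research on Data Cleaning Technology Based on Data Warehouse", China Master's Theses Full-text Database, Information Science and Technology *
Xie Wenge et al.: "Research on Duplicate Record Cleaning Algorithms in Data Cleaning", Software Engineer *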

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108763346A (en) * 2018-05-15 2018-11-06 中南大学 A kind of abnormal point processing method of sliding window box figure medium filtering
CN108763346B (en) * 2018-05-15 2022-02-01 中南大学 Abnormal point processing method for sliding window box type graph median filtering
CN109727446A (en) * 2019-01-15 2019-05-07 华北电力大学(保定) A kind of identification and processing method of electricity consumption data exceptional value
CN109918367A (en) * 2019-03-19 2019-06-21 北京百度网讯科技有限公司 A kind of cleaning method of structural data, device, electronic equipment and storage medium
CN110162519A (en) * 2019-04-17 2019-08-23 苏宁易购集团股份有限公司 Data clearing method
CN110737640A (en) * 2019-10-12 2020-01-31 齐鲁工业大学 data quality improving method and system based on distributed system
CN111104398A (en) * 2019-12-17 2020-05-05 智慧航海(青岛)科技有限公司 Detection method and elimination method for approximate repeated record of intelligent ship
CN111104398B (en) * 2019-12-17 2023-08-29 智慧航海(青岛)科技有限公司 Detection method and elimination method for intelligent ship approximate repeated record
CN112416920A (en) * 2020-12-01 2021-02-26 北京理工大学 MES-oriented data cleaning method and system
CN112416920B (en) * 2020-12-01 2023-01-24 北京理工大学 MES-oriented data cleaning method and system
CN114999156A (en) * 2022-05-27 2022-09-02 北京汽车研究总院有限公司 Automatic identification method and device for crossing scene of pedestrian in front of vehicle, medium and vehicle

Similar Documents

Publication Publication Date Title
CN107025301A (en) Flight ensures the method for cleaning of data
Zhang et al. Improving crowdsourced label quality using noise correction
CN107294993A (en) A kind of WEB abnormal flow monitoring methods based on integrated study
CN101957889B (en) Selective wear-based equipment optimal maintenance time prediction method
CN104281525B (en) A kind of defect data analysis method and the method utilizing its reduction Software Testing Project
KR20180072167A (en) System for extracting similar patents and method thereof
CN107274066B (en) LRFMD model-based shared traffic customer value analysis method
CN107832467A (en) A kind of microblog topic detecting method based on improved Single pass clustering algorithms
CN105447079B (en) A kind of data cleaning method based on functional dependence
CN113239087A (en) Anti-electricity-stealing inspection monitoring method and system
KR20190053616A (en) Data merging device and method for bia datda analysis
CN108268886A (en) For identifying the method and system of plug-in operation
Alizamini et al. Data quality improvement using fuzzy association rules
Singh et al. Performance analysis of faculty using data mining techniques
CN116756373A (en) Project review expert screening method, system and medium based on knowledge graph update
Ganjour et al. Gender inequality regarding retirement benefits in Switzerland
JansiRani et al. Computation of reducts using topology and measure of significance of attributes
Pereira et al. Traffic event detection using online social networks
Yang et al. Analysis of dishonorable behavior on railway online ticketing system based on k-means and FP-growth
Wu et al. Interval type-2 fuzzy clustering based association rule mining method
Silva et al. Detecting possible persons of interest in a physical activity program using step entries: Including a web‐based application for outlier detection and decision‐making
Samant et al. Bigram-based features for real-world event identification from microblogs
Jun et al. Research on Evaluation Method Used to Quality Performance of Missile Weapon Based on Rough Set Rule Extraction
Braune et al. Behavioral clustering for point processes
Jahanian et al. Selecting Optimal k in the k-means Clustering Algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20170808)