CN107025301A - Method for cleaning flight support data - Google Patents
- Publication number: CN107025301A
- Application number: CN201710273945.4A
- Authority
- CN
- China
- Prior art keywords
- data
- aircraft gate
- record
- cleaning
- flight
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Abstract
A method for cleaning flight support data, comprising the following steps. Step 1: preprocess the flight support data; first perform attribute cleaning on the flight support data to obtain aircraft gate data, then perform outlier cleaning on the aircraft gate data. Step 2: clean the similar duplicate records in the aircraft gate data. Step 2.1: create sort keywords and compute the key values of the aircraft gate data. Step 2.2: sort the aircraft gate data using a sorted-neighborhood method based on a clustered index. Step 2.3: slide a variable-size window over the sorted data set to detect and clean the similar duplicate records in the aircraft gate data. The invention improves the accuracy and completeness of flight support data and improves the efficiency of detecting similar duplicate records in flight support data.
Description
Technical field
The invention belongs to the technical field of mathematical statistics and data mining, and relates to a method for cleaning flight support data.
Background art
Domestic research on data cleaning started late but covers a wide range. It has been pointed out that the data sources in data cleaning are mainly single data sources and multiple data sources, and a classification of instance-level erroneous records for both has been given. From the perspective of data quality, the necessity of establishing an extensible data cleaning framework based on a rule base and a method base has been analyzed. Research on data cleaning methods and frameworks includes a knowledge-based method for eliminating similar duplicate records, a configurable data cleaning framework that completes cleaning tasks using a workflow model and combinations of multiple rules and methods, and an open data cleaning framework that completes cleaning tasks in a self-learning manner based on a semantic rule base.
Data cleaning applies different methods depending on the application and the data concerned. Once the data have been classified, the corresponding cleaning methods fall mainly into the following four kinds:
1. Handling missing values: in most cases, missing values must be filled in by hand (manual cleaning). Some missing values, however, can be derived from the same data source or from other data sources, so the mean, maximum, minimum, or a more complex probability estimate can be used in place of the missing value to achieve the purpose of cleaning.
2. Detecting and handling erroneous values: possible erroneous values or outliers are identified by statistical analysis; for example, deviation analysis identifies values that do not follow the distribution or the regression equation. Data values can also be checked against a simple rule base (common-sense rules, business-specific rules, and so on), or data can be detected and cleaned using constraints between different attributes and external data.
3. Detecting and handling duplicate records: records in a database whose attribute values are identical are regarded as duplicate records. Whether records are duplicates is detected by checking whether their attribute values are equal, and duplicates are merged or removed using basic de-duplication methods.
4. Detecting and handling inconsistency: inconsistency appears mainly within a data source and between data sources, and data integrated from multiple sources may contain semantic conflicts. Integrity constraints can be defined to detect inconsistency, and relationships can also be discovered by analyzing the data, so that the data become consistent.
In civil airport operations, flight support data need to be purified and optimized. Although many general data cleaning theories and frameworks exist, the business domain is specialized and confidential, the volume of flight support data is huge, and the data contain a great deal of information. As a result, detecting similar duplicate records in flight support data is difficult, and the workload of purification and optimization is large.
Summary of the invention
The object of the invention is to provide a method for cleaning flight support data that improves the accuracy and completeness of flight support data and improves the efficiency of detecting similar duplicate records in flight support data.
The technical solution adopted by the invention is a method for cleaning flight support data, comprising the following steps:
Step 1: preprocess the flight support data;
First perform attribute cleaning on the flight support data to obtain aircraft gate data, then perform outlier cleaning on the aircraft gate data;
Step 2: clean the similar duplicate records in the aircraft gate data:
Step 2.1: create sort keywords and compute the key values of the aircraft gate data;
Step 2.2: sort the aircraft gate data using a sorted-neighborhood method based on a clustered index;
Step 2.3: slide a variable-size window over the sorted data set to detect and clean the similar duplicate records in the aircraft gate data.
In step 1, attribute cleaning of the flight support data is divided into:
(1) Handling data unrelated to aircraft gate information: such data are deleted or not extracted;
(2) Handling missing values in the aircraft gate data: missing-value data comprise data with missing primary attributes and data with missing non-primary attributes. Data with missing primary attributes are discarded; missing non-primary attributes are reacquired or derived from the data source;
(3) Handling aircraft gate data that violate business rules: such data are checked against the data source and reacquired;
(4) Handling data in which the same attribute has different forms of expression: a single form of expression is set.
In step 1, the box plot method is used to identify and reject outliers in the aircraft gate data. The detailed process is:
All aircraft gate data to be cleaned are set as data set A. Data set A is divided into α × n intervals, where n is the number of intervals, α is the number of aircraft gate data items in each interval, and β is the interval size:
[formula not reproduced in the source text]
where all aircraft gate data in each interval constitute one data set, and Dn denotes the data set numbered n;
The distribution characteristics of the aircraft gate data are analyzed to obtain the concentration domain [i-j, i+j] of data set A, where i-j is the minimum-value data set, i.e. Min{D1, D2, ..., Dn}, and i+j is the maximum-value data set, i.e. Max{D1, D2, ..., Dn}. [i-j, i+j] is set as the initial data group. Outliers are rejected from the initial data group to obtain the non-outlier data group [Q1 - 3 × IQR, Q3 + 3 × IQR]. The non-abnormal data group is then taken from [Q1 - 3 × IQR, Q3 + 3 × IQR] to obtain the target data set [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], which is set as data set B. Here Q1 denotes the first quartile, Q3 denotes the third quartile, and IQR denotes the interquartile range, IQR = Q3 - Q1.
The detailed process of step 2.1 is:
Different attributes of the aircraft gate data are extracted as different sort keywords. According to the sort keywords, a field value is computed for each field of the aircraft gate data in data set B, giving the key value of the aircraft gate data; the key value of an aircraft gate data item is the set of field values in that item.
Step 2.2 is specifically:
A clustered index is built on data set B, and the aircraft gate data in data set B are arranged according to their key values so that similar duplicate records are placed in adjacent positions, giving data set C.
The detailed process of step 2.3 is:
Each data item in data set C constitutes one record. A variable-size window is slid over data set C, using a first-in-first-out strategy. As the window slides, if the records in the current window are records 1 to N, the next record to enter the window is record N+1. Record N+1 is matched for similarity, one by one, against records 2 to N in the window to detect whether record N+1 is a duplicate record. If it is a duplicate record, it is rejected; if it is not, the window continues to slide downward until similarity matching has been completed for all records in data set C.
In step 2.3, the detailed process of similarity matching is:
Field weights are set: several experts independently score the weight of each field, and the mean of the scores for the same field is taken as the field weight of that field. Field weighted value = field weight × field value, and the weighted value of a record is the sum of the field weighted values of all fields in the record;
During similarity matching, the weighted values of the two records to be matched are computed separately and summed to obtain the similarity M of the two records. M is compared with a preset similarity threshold N. If M is greater than N, the record that entered the window later is a duplicate record; otherwise the two are regarded as different records.
In step 2.3, the window size is driven by the usage frequency of the aircraft gate: the average usage frequency Mean and the maximum usage frequency Max of the aircraft gate are computed, and (Mean+Max)/2 is used as the window size.
The beneficial effects of the invention are as follows. In the method for cleaning flight support data, the attribute cleaning method and the outlier detection and deletion method used in the preprocessing stage improve the accuracy and completeness of the flight support data set; the marked improvement in loading speed after preprocessing results from the higher proportion of valid data in the cleaned flight support data set. The sorting method is improved by introducing a clustered index into the sorted-neighborhood method, which also improves the sorting itself, so that similar duplicate records are arranged in adjacent positions. A variable-size window is slid over the data, with the window size driven by the usage frequency of the aircraft gate, to detect and clean similar duplicate records. Because similar duplicate records have been arranged within the same window as far as possible, unnecessary comparisons are reduced without affecting the efficiency of finding duplicate records, and the number of duplicate records detected in equal time increases, which improves detection efficiency.
Brief description of the drawings
Fig. 1 is a schematic diagram of data cleaning;
Fig. 2 is a diagram of the distribution characteristics of the aircraft gate data;
Fig. 3 is the box plot of the concentration domain of the remote gate data;
Fig. 4 is the box plot of the concentration domain of the near gate data;
Fig. 5 is a flow chart of sorting using the sorted-neighborhood method based on a clustered index;
Fig. 6 is a schematic diagram of sliding the variable-size window;
Fig. 7 is a flow chart of similarity matching;
Fig. 8 is a comparison of loading times before and after data cleaning;
Fig. 9 is a comparison of the number of similar records detected in equal time by different methods.
Detailed description of the embodiments
As shown in Fig. 1, the method for cleaning flight support data aims, on the basis of analyzing the characteristics of civil airport flight support data, to carry out the tests required for detecting similar duplicate records in flight support data, to adjust and refine existing data cleaning methods accordingly, and to determine data cleaning rules and methods, so that flight support data are purified and optimized and high-quality data are provided for follow-up research.
Taking the flight support data of Lanzhou Zhongchuan Airport for 2015 and 2016 as an example, the invention is described in detail below with reference to the drawings and specific embodiments:
The method for cleaning flight support data comprises the following steps:
Step 1: preprocess the flight support data;
First perform attribute cleaning on the flight support data to obtain aircraft gate data, then perform outlier cleaning on the aircraft gate data;
Step 2: clean the similar duplicate records in the aircraft gate data:
Step 2.1: create sort keywords and compute the key values of the aircraft gate data;
Step 2.2: sort the aircraft gate data using a sorted-neighborhood method based on a clustered index;
Step 2.3: slide a variable-size window over the sorted data set to detect and clean the similar duplicate records in the aircraft gate data.
In step 1, attribute cleaning of the flight support data is divided into:
(1) Handling data unrelated to aircraft gate information: for example, flight altitude, aircraft wingspan, route, waypoints, and flight time are unrelated to aircraft gate information and are deleted or not extracted;
(2) Handling missing values in the aircraft gate data: missing-value data comprise data with missing primary attributes and data with missing non-primary attributes. A missing primary attribute seriously affects the real-time status of the aircraft gate, and the system does not allow primary attributes to be missing; therefore, when a primary attribute is missing, the data item is regarded as erroneous and is discarded. A missing non-primary attribute has less effect on the real-time status of the aircraft gate but violates the data integrity rules, so it is reacquired or derived from the data source;
(3) Handling aircraft gate data that violate business rules: violating a business rule means that an attribute value itself, or the relationship between attribute values, violates the business rules of the civil airport. For example, a flight has a landing time at this station but no departure time from the previous station, or has a landing time at the next station but no departure time from this station. Such data are checked against the data source and reacquired;
(4) Handling data in which the same attribute has different forms of expression: the form in which an attribute value is expressed varies by organization or department. For example, the arrival status may be expressed as YES/NO or as arrived/cancelled. Data in different forms are unified, and a single form of expression is set.
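The four attribute-cleaning rules above can be sketched roughly as follows. This is only an illustrative sketch: the field names (`stand`, `landing_time`, `prev_departure_time`, `arrival_status`), the choice of primary attributes, and the mapping of status values are all assumptions, since the text does not give a schema.

```python
# Hypothetical schema: none of these field names come from the patent text.
IRRELEVANT_FIELDS = {"flight_altitude", "wingspan", "route", "waypoints", "flight_time"}
PRIMARY_FIELDS = ("stand", "landing_time")  # assumed primary attributes
ARRIVAL_STATUS = {"YES": "arrived", "NO": "cancelled",
                  "ARRIVED": "arrived", "CANCELLED": "cancelled"}


def violates_business_rule(rec):
    # Example rule from the text: a landing time at this station
    # without a departure time from the previous station.
    return rec.get("landing_time") is not None and rec.get("prev_departure_time") is None


def clean_attributes(records):
    """Return (kept, recheck): cleaned records and records to reacquire from the source."""
    kept, recheck = [], []
    for raw in records:
        # (1) Do not extract fields unrelated to gate information.
        rec = {k: v for k, v in raw.items() if k not in IRRELEVANT_FIELDS}
        # (2) Discard records whose primary attributes are missing.
        if any(rec.get(f) is None for f in PRIMARY_FIELDS):
            continue
        # (2)/(3) Missing non-primary attributes or rule violations go back to the source.
        if violates_business_rule(rec) or any(v is None for v in rec.values()):
            recheck.append(rec)
            continue
        # (4) Map different forms of expression onto a single form.
        status = str(rec.get("arrival_status", "")).strip().upper()
        if status in ARRIVAL_STATUS:
            rec["arrival_status"] = ARRIVAL_STATUS[status]
        kept.append(rec)
    return kept, recheck
```

Records returned in `recheck` are not repaired here; the text only says they are reacquired or derived from the data source, so that step is left to the caller.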
In practice, aircraft gate data are significantly affected by outliers. To eliminate the influence of outliers on the data as a whole, the outliers must be identified and rejected, giving a data set that matches the actual conditions of the airport's aircraft gates.
Outliers are currently identified and rejected mainly by two methods, the physical discrimination method and the statistical discrimination method. The physical discrimination method relies on people's existing understanding of objective things: measured values that deviate from normal results because of external interference, human error, and similar causes are identified and rejected at any time during the experiment. The statistical discrimination method specifies a confidence probability and determines a confidence limit; any error exceeding this limit is considered not to belong to the range of random error and is treated as an outlier and rejected. When physical discrimination is difficult, statistical discrimination is generally used.
Based on the distribution characteristics of flight support data, the invention uses the box plot method, a statistical discrimination method, to identify and reject outliers in the aircraft gate data.
In step 1, the box plot method is used to identify and reject outliers in the aircraft gate data. The detailed process is:
The aircraft gate data to be cleaned are set as data set A, as shown in Table 1. Data set A is divided into α × n intervals, where n is the number of intervals, α is the number of aircraft gate data items in each interval, and β is the interval size:
[formula not reproduced in the source text]
where all aircraft gate data in each interval constitute one data set, and Dn denotes the data set numbered n;
Table 1. Aircraft gate data in the flight support data
Sequence number | 1 | 2 | 3 | 4 | ..... | n-1 | n |
Data | D1 | D2 | D3 | D4 | ..... | Dn-1 | Dn |
For a data source whose dispersion is not particularly large, the data are generally concentrated in a specific region. The distribution characteristics of the aircraft gate data are analyzed, as shown in Fig. 2, to obtain the concentration domain [i-j, i+j] of data set A, where i-j is the minimum-value data set, i.e. Min{D1, D2, ..., Dn}, and i+j is the maximum-value data set, i.e. Max{D1, D2, ..., Dn};
Take one group of aircraft gate data as an example, as shown in Table 2. In practice, if the actual interval of the aircraft gates is computed directly, the interval of the remote gate data set A1 is [70, 500] and the interval of the near gate data set A2 is [-500, 60]. This result does not match the actual situation, which shows that the aircraft gate data set is significantly affected by outlier data and that the outliers must be identified and rejected.
Table 2. Aircraft gate data (gate numbers) in the flight support data
First, data set A is divided into 1000 intervals. The concentration domain of the remote gate data set A1 is found to be [70, 160], and that of the near gate data set A2 is [-9, 60]. Box plot analysis is then performed on the aircraft gate data whose values lie in the concentration domain of A1, giving the box plot shown in Fig. 3, and on the aircraft gate data whose values lie in the concentration domain of A2, giving the box plot shown in Fig. 4.
From the box plot analysis, the non-outlier data group of the remote gate data set A1 is [85, 134], and the non-outlier data group of the near gate data set A2 is [-10.75, 27.75]. Further calculation gives the non-abnormal data group of the remote gate data set A1 as [95.5, 116.5] and that of the near gate data set A2 as [-2.5, 19.5]. The results match the actual conditions of the airport's aircraft gates. Identifying and rejecting outlier data by combining the distribution characteristics of the aircraft gate data with the box plot method is therefore fast and effective, and is an important step in cleaning flight support data.
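The two box plot fences described above can be sketched as follows. This is a minimal sketch under stated assumptions: the quartiles are computed once and reused for both the 3 × IQR and the 1.5 × IQR fences (the text does not say whether they are recomputed after the first rejection), the quartile convention is that of Python's `statistics.quantiles` with its default method, and the function name `boxplot_clean` is hypothetical.

```python
import statistics


def boxplot_clean(values):
    """Apply the 3*IQR fence, then the 1.5*IQR fence, to a list of numbers.

    Returns (extreme_fence, mild_fence, data_set_b).
    """
    # Quartiles Q1 and Q3 (default "exclusive" method; an assumed convention).
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1

    # Non-outlier data group: [Q1 - 3*IQR, Q3 + 3*IQR]
    extreme = (q1 - 3 * iqr, q3 + 3 * iqr)
    # Target data set B: [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
    mild = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)

    non_outliers = [v for v in values if extreme[0] <= v <= extreme[1]]
    data_set_b = [v for v in non_outliers if mild[0] <= v <= mild[1]]
    return extreme, mild, data_set_b


# Made-up sample values, not data from the patent.
sample = [5, 6, 7, 8, 9, 10, 11, 12, 60, -500]
extreme_fence, mild_fence, cleaned = boxplot_clean(sample)
```

Because the same quartiles feed both fences, the second filter alone would give the same data set B; the two stages are kept separate only to mirror the order in which the text describes them.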
The detailed process of step 2.1 is: the operating habits of airport personnel and the importance of the keywords are analyzed, and different attributes of the aircraft gate data are extracted as different sort keywords. The different sort keywords form a sort keyword combination. Take the following sort keyword combination as an example:
Key Com = {Gate = aircraft gate, Plan LT = planned landing time at this station, Actual LT = actual landing time at this station, Plan DT = planned departure time from this station, Actual DT = actual departure time from this station};
According to the sort keywords, a field value is computed for each field of the aircraft gate data in data set B, giving the key value of the aircraft gate data; the key value of an aircraft gate data item is the set of field values in that item.
Step 2.2 is specifically:
A clustered index is built on data set B, and the aircraft gate data in data set B are arranged by neighborhood according to their key values so that similar duplicate records are placed in adjacent positions, giving data set C. As shown in Fig. 5, in this embodiment the sort is performed 3 times and the 3 result sets are compared; the inconsistent parts are sorted again to obtain the final result set, which guards against accidental errors caused by a single sort.
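Steps 2.1 and 2.2 can be sketched together as building a key value per record and sorting on it. This is an illustrative sketch only: the snake_case field names stand in for the Key Com combination above, the normalization (trimming and upper-casing) is an assumption, and an in-memory sorted list is used in place of a database clustered index.

```python
# Hypothetical field names corresponding to Gate, Plan LT, Actual LT, Plan DT, Actual DT.
SORT_KEYWORDS = ("gate", "plan_lt", "actual_lt", "plan_dt", "actual_dt")


def key_value(record):
    """Key value of a record: the tuple of its normalized sort-keyword fields."""
    return tuple(str(record.get(field, "")).strip().upper() for field in SORT_KEYWORDS)


def neighbour_sort(data_set_b):
    """Order records by key value so that similar records end up adjacent (data set C)."""
    # sorted() is stable, so records with equal keys keep their input order.
    return sorted(data_set_b, key=key_value)
```

In a database setting the same effect would come from a clustered index on the key columns, since such an index stores rows physically in key order; the list sort here only imitates that ordering.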
The detailed process of step 2.3 is:
Each data item in data set C constitutes one record. A variable-size window is slid over data set C, as shown in Fig. 6, using a first-in-first-out strategy. As the window slides, if the records in the current window are records 1 to N, the next record to enter the window is record N+1. Record N+1 is matched for similarity, one by one, against records 2 to N in the window to detect whether record N+1 is a duplicate record. If it is a duplicate record, it is rejected; if it is not, the window continues to slide downward until similarity matching has been completed for all records in data set C.
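The first-in-first-out window scan described above might look like the following sketch. The function name, the `is_duplicate` callback, and the choice to pass a single `window_size` per call are assumptions; in the text the window size varies with the usage frequency of each aircraft gate, which a caller could model by running the scan per gate.

```python
from collections import deque


def detect_duplicates(sorted_records, window_size, is_duplicate):
    """Scan sorted records with a FIFO window; return (kept, removed)."""
    if window_size < 2:
        raise ValueError("window_size must be at least 2 to compare any records")
    window = deque()
    kept, removed = [], []
    for record in sorted_records:
        # When the window already holds N records, the oldest leaves first,
        # so the incoming record is compared with records 2..N only.
        if len(window) == window_size:
            window.popleft()
        if any(is_duplicate(record, other) for other in window):
            removed.append(record)   # duplicate: rejected, never enters the window
        else:
            kept.append(record)
            window.append(record)
    return kept, removed
```

A usage example with plain strings and equality as the duplicate test: `detect_duplicates(["a", "a", "b", "c", "b"], 3, lambda x, y: x == y)` keeps one copy of each value, while a window of 2 misses the second `"b"` because the first has already left the window.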
As shown in Fig. 7, the detailed process of similarity matching in step 2.3 is:
Field weights are set in order to describe accurately how strongly each field influences changes in aircraft gate status; different field weights are assigned according to the importance of each field in the data set. The methods commonly used are: 1. the subjective experience method; 2. the primary and secondary index ranking classification method; 3. the expert scoring method. In the invention, field weights are set by the expert scoring method: several experts independently score the weight of each field, and the mean of the scores for the same field is taken as the field weight of that field. Field weighted value = field weight × field value, and the weighted value of a record is the sum of the field weighted values of all fields in the record. During similarity matching, the weighted values of the two records to be matched are computed separately and summed to obtain the similarity M of the two records. M is compared with a preset similarity threshold N. If M is greater than N, the record that entered the window later is a duplicate record; otherwise the two are regarded as different records.
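The expert-scored weights and the threshold test can be sketched as follows. Note that this sketch substitutes an interpretation for one point the text leaves open: the "field value" is taken here to be a per-field agreement score between the two records (1 if the field values are equal, 0 otherwise), so that M is a weighted sum of field agreements. That reading, the function names, and the sample scores are assumptions, not the patent's stated formula.

```python
def field_weights(expert_scores):
    """Average the independent scores of several experts for each field."""
    fields = expert_scores[0].keys()
    return {f: sum(scores[f] for scores in expert_scores) / len(expert_scores)
            for f in fields}


def similarity(rec_a, rec_b, weights):
    """Similarity M: sum of field weight * field agreement (assumed 0/1 score)."""
    return sum(weight * (1.0 if rec_a.get(field) == rec_b.get(field) else 0.0)
               for field, weight in weights.items())


def is_duplicate(rec_a, rec_b, weights, threshold):
    """The later record counts as a duplicate when M exceeds the threshold N."""
    return similarity(rec_a, rec_b, weights) > threshold


# Made-up scores from two hypothetical experts on a 0-10 scale.
weights = field_weights([
    {"gate": 8, "plan_lt": 4, "actual_lt": 3},
    {"gate": 10, "plan_lt": 6, "actual_lt": 3},
])
```

With these sample weights (gate 9, plan_lt 5, actual_lt 3) and a threshold of 12, two records that agree on gate and planned landing time score 14 and are treated as duplicates, while records at different gates score at most 8 and are not.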
In step 2.3, the window size is driven by the usage frequency of the aircraft gate. When the window is large, the number of comparisons increases and some of the comparisons are unnecessary; when the window is small, matches between duplicate data may be missed. As shown in Table 3, the average usage frequency Mean of each aircraft gate in each month is computed from the support data of Lanzhou Zhongchuan Airport for the second half of 2015 and the first half of 2016.
Table 3. Monthly aircraft gate usage frequency (n M/D: daily average for the n-th month; -: not in use)
As shown in Table 4, the monthly average usage frequency Mean of each aircraft gate is rounded up, the maximum usage frequency Max over the 12 months is computed, and the average of the two is used as the window size.
Table 4. Determinants of the sliding window size for each aircraft gate (average~maximum)
Near 101 | Near 102 | Near 103 | Near 104 | Near 105 | Near 106 | Near 107 | Near 108 | Near 109 | Near 110 |
4~6 | 4~6 | 4~6 | 4~6 | 4~5 | 3~5 | 4~6 | 4~6 | 4~7 | 4~6 |
Near 111 | Near 112 | Near 113 | Near 114 | Near 115 | Remote 1 | Remote 2 | Remote 3 | Remote 4 | Remote 5 |
4~7 | 4~6 | 4~6 | 4~6 | 4~6 | 1~2 | 1~2 | 2~2 | 1~2 | 2~2 |
Remote 6 | Remote 7 | Remote 8 | Remote 9 | Remote 10 | Remote 11 | Remote 12 | Remote 13 | Remote 14 | Remote 15 |
2~2 | 2~2 | 2~3 | 2~2 | 1~2 | 2~2 | 2~2 | 1~2 | 1~1 | 1~1 |
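The window-size rule, (Mean+Max)/2 with the average rounded up, can be sketched as below. The function name is hypothetical, and rounding the final half-integer results upward is an assumption, since the text does not say how a value such as 5.5 becomes a window size.

```python
import math


def window_size(monthly_usage):
    """Window size for one gate from its monthly daily-average usage frequencies."""
    # Mean: average usage frequency over the months, rounded up.
    mean = math.ceil(sum(monthly_usage) / len(monthly_usage))
    # Max: highest monthly usage frequency.
    peak = math.ceil(max(monthly_usage))
    # Average of the two; rounding up is an assumed convention.
    return math.ceil((mean + peak) / 2)
```

For a gate whose Table 4 entry reads 4~6, this rule gives a window of 5 records.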
The evaluation criteria for data cleaning quality include consistency, completeness, availability, efficiency, and so on. The invention mainly considers the speed of data cleaning and the degree to which duplicate records are cleaned; duplicate-record cleaning is measured mainly by the false recognition rate and the accuracy rate. As shown in Table 5, taking 4 groups of data as an example, loading is faster after cleaning than before, and the number of duplicate records detected in equal time increases greatly.
Table 5. Comparison of loading indicators before and after cleaning
From the actual support records of Lanzhou Zhongchuan Airport for 2015, 3000 support records from each month undergo attribute cleaning, outlier cleaning, and detection and rejection of similar duplicate records. The data loading time before cleaning and the loading time after cleaning are computed; as shown in Fig. 8, the loading time is greatly shortened after data cleaning.
The sorted-neighborhood algorithm based on a clustered index is used to perform similarity detection on the support records. The number of similar records detected in equal time is checked and compared with the result obtained by direct retrieval; the comparison is shown in Fig. 9, and the accuracy rate is used to measure the effect of similar duplicate record detection.
In the manner described above, the method for cleaning flight support data of the invention improves the accuracy and completeness of flight support data and improves the efficiency of detecting similar duplicate records in flight support data.
Claims (8)
1. A method for cleaning flight support data, characterized by comprising the following steps:
Step 1: preprocess the flight support data;
First perform attribute cleaning on the flight support data to obtain aircraft gate data, then perform outlier cleaning on the aircraft gate data;
Step 2: clean the similar duplicate records in the aircraft gate data:
Step 2.1: create sort keywords and compute the key values of the aircraft gate data;
Step 2.2: sort the aircraft gate data using a sorted-neighborhood method based on a clustered index;
Step 2.3: slide a variable-size window over the sorted data set to detect and clean the similar duplicate records in the aircraft gate data.
2. The method for cleaning flight support data according to claim 1, characterized in that, in step 1, attribute cleaning of the flight support data is divided into:
(1) Handling data unrelated to aircraft gate information: such data are deleted or not extracted;
(2) Handling missing values in the aircraft gate data: missing-value data comprise data with missing primary attributes and data with missing non-primary attributes; data with missing primary attributes are discarded, and missing non-primary attributes are reacquired or derived from the data source;
(3) Handling aircraft gate data that violate business rules: such data are checked against the data source and reacquired;
(4) Handling data in which the same attribute has different forms of expression: a single form of expression is set.
3. The method for cleaning flight support data according to claim 1, characterized in that, in step 1, the box plot method is used to identify and reject outliers in the aircraft gate data, the specific process being:
All aircraft gate data to be cleaned are set as data set A, and data set A is divided into α × n intervals, where n is the number of intervals, α is the number of aircraft gate data items in each interval, and β is the interval size:
wherein all aircraft gate data in each interval form a data set, and Dn denotes the data set numbered n;
The distribution characteristics of the aircraft gate data are analyzed to obtain the value range [i-j, i+j] of the data in data set A, where i-j is the minimum-value data set, i.e. Min{D1, D2, ..., Dn}, and i+j is the maximum-value data set, i.e. Max{D1, D2, ..., Dn}. [i-j, i+j] is set as the initial data group; outliers are rejected from the initial data group to obtain the non-outlier data group [Q1 - 3 × IQR, Q3 + 3 × IQR]; the non-abnormal data are then taken from [Q1 - 3 × IQR, Q3 + 3 × IQR] to obtain the target data group [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR], which is set as data set B, where Q1 denotes the first quartile, Q3 denotes the third quartile, and IQR denotes the interquartile range, IQR = Q3 - Q1.
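The two box plot fences of claim 3 can be sketched as follows. This is an illustration under assumptions: the quartiles are computed once on the whole input with Python's `statistics.quantiles` (inclusive method), and the interval partition into D1 ... Dn is not modelled. The function name `boxplot_clean` is hypothetical.

```python
import statistics


def boxplot_clean(values):
    """Return (non_outlier_group, target_group) for a list of numbers.

    non_outlier_group keeps values inside [Q1 - 3*IQR, Q3 + 3*IQR];
    target_group keeps values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
    Assumption: Q1 and Q3 are computed once, on the full input.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range, IQR = Q3 - Q1

    def inside(v, k):
        return q1 - k * iqr <= v <= q3 + k * iqr

    non_outlier = [v for v in values if inside(v, 3.0)]
    target = [v for v in non_outlier if inside(v, 1.5)]
    return non_outlier, target
```

With the sample `[10, 12, 11, 13, 12, 11, 10, 13, 12, 18, 100]`, the quartiles are Q1 = 11 and Q3 = 13, so the value 100 falls outside the 3 × IQR fence and the value 18 is removed only by the 1.5 × IQR fence.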
4. The method for cleaning flight support data according to claim 3, characterized in that the specific process of step 2.1 is:
Different attributes of the aircraft gate data are extracted as different sort keys; according to the sort keys, a field value is calculated for each field of the aircraft gate data in data set B, thereby obtaining the key value of the aircraft gate data; the key value of an aircraft gate data item is the set of field values in that item.
5. The method for cleaning flight support data according to claim 4, characterized in that step 2.2 is specifically:
A clustered index is built on data set B, and the aircraft gate data in data set B are neighbor-sorted according to their key values so that duplicate records are placed in adjacent positions, yielding data set C.
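Steps 2.1 and 2.2 can be sketched in Python as below. An in-memory sort stands in for the database clustered index, and the normalisation of field values (trimming and upper-casing) is an assumption, since the claims do not say how a field value is calculated. The names `key_value` and `neighbour_sort` are hypothetical.

```python
def key_value(record, sort_keys):
    """Key value of a record: the tuple of its field values for the sort keys.

    Assumption: a field value is the trimmed, upper-cased text of the field.
    """
    return tuple(str(record[k]).strip().upper() for k in sort_keys)


def neighbour_sort(dataset_b, sort_keys):
    """Order records by key value so that likely duplicates become adjacent.

    Stands in for sorting via a clustered index; returns data set C.
    """
    return sorted(dataset_b, key=lambda r: key_value(r, sort_keys))
```

Because records with equal key values compare equal, Python's stable sort leaves them next to each other in their original relative order.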
6. The method for cleaning flight support data according to claim 5, characterized in that the specific process of step 2.3 is:
Each data item in data set C constitutes one record. A variable-size window is slid over data set C using a first-in-first-out strategy. During sliding, if the records in the current window are records 1 to N, the next record to enter the window is record N+1; record N+1 is matched for similarity one by one against records 2 to N in the window to detect whether record N+1 is a duplicate record. If it is a duplicate, the record is rejected; if not, the window continues to slide downward until similarity matching has been completed for all records in data set C.
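The first-in-first-out window of claim 6 can be sketched with a bounded `collections.deque`. For simplicity the window size is fixed here, the similarity test is passed in as a callable, and only retained records stay in the window; all three choices are assumptions, and the function name is hypothetical.

```python
from collections import deque


def sliding_window_dedup(records, window_size, is_duplicate):
    """Scan sorted records and drop those that duplicate a recent record.

    An incoming record is compared with the previous window_size - 1
    retained records (records 2..N of the claim), oldest leaving first.
    """
    if window_size < 2:
        raise ValueError("window_size must be at least 2")
    window = deque(maxlen=window_size - 1)  # FIFO: oldest record drops out
    kept = []
    for record in records:
        if any(is_duplicate(previous, record) for previous in window):
            continue  # duplicate record: reject it
        kept.append(record)
        window.append(record)
    return kept
```

A small window misses duplicates that lie further apart in the sorted data, which is why the sort in step 2.2 and the window size both affect the result.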
7. The method for cleaning flight support data according to claim 6, characterized in that the specific process of similarity matching in step 2.3 is:
Field weights are set: several experts independently score the weight of each field, and the mean of the scores given to a field is taken as that field's weight; field weighted value = field weight × field value, and the weighted value of a record is the sum of the field weighted values of all fields in the record;
During similarity matching, the weighted values of the two records to be matched are calculated separately and summed to obtain the similarity M of the two records; M is compared with a preset similarity threshold N; if M is greater than N, the record that entered the window later is a duplicate record; otherwise the two are regarded as different records.
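The weighting scheme of claim 7 can be illustrated as follows. The averaging of expert scores follows the claim; the per-field comparison is an interpretive assumption, because the claim does not define how the field values of two records are compared. Here a field contributes its weight to M only when both records hold the same non-empty value. All function names are hypothetical.

```python
from statistics import mean


def field_weights(expert_scores):
    """Field weight = mean of the scores the experts gave that field."""
    return {field: mean(scores) for field, scores in expert_scores.items()}


def similarity(rec_a, rec_b, weights):
    """Similarity M: sum of the weights of the fields on which both agree.

    Assumption: exact equality stands in for the field-level comparison.
    """
    return sum(
        w for f, w in weights.items()
        if rec_a.get(f) is not None and rec_a.get(f) == rec_b.get(f)
    )


def is_duplicate(rec_a, rec_b, weights, threshold):
    """The later record is a duplicate when M exceeds the threshold N."""
    return similarity(rec_a, rec_b, weights) > threshold
```

With weights 4, 3 and 2 for three fields and a threshold of 6, two records that agree on the first two fields reach M = 7 and are treated as duplicates.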
8. The method for cleaning flight support data according to claim 1, characterized in that, in step 2.3, the window size is driven by the usage frequency of the aircraft gates: the average usage frequency Mean and the maximum usage frequency Max of the aircraft gates are counted, and (Mean + Max)/2 is used as the window size.
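The window-size rule of claim 8 can be sketched as below, assuming that the usage frequency of a gate is the number of records that reference it and that a fractional result is rounded up; neither detail is stated in the claim. The function name `window_size` is hypothetical.

```python
import math
from collections import Counter


def window_size(gate_ids):
    """Window size = (Mean + Max) / 2 over per-gate usage counts.

    Assumptions: usage frequency is the record count per gate, and the
    result is rounded up to a whole number of records.
    """
    counts = Counter(gate_ids)
    if not counts:
        raise ValueError("no aircraft gate records")
    mean_freq = sum(counts.values()) / len(counts)
    max_freq = max(counts.values())
    return math.ceil((mean_freq + max_freq) / 2)
```

For three gates used 5, 1 and 3 times, Mean is 3 and Max is 5, giving a window of 4 records.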
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710273945.4A CN107025301A (en) | 2017-04-25 | 2017-04-25 | Flight ensures the method for cleaning of data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107025301A true CN107025301A (en) | 2017-08-08 |
Family
ID=59527900
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710273945.4A Pending CN107025301A (en) | 2017-04-25 | 2017-04-25 | Flight ensures the method for cleaning of data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107025301A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763346A (en) * | 2018-05-15 | 2018-11-06 | 中南大学 | A kind of abnormal point processing method of sliding window box figure medium filtering |
CN109727446A (en) * | 2019-01-15 | 2019-05-07 | 华北电力大学(保定) | A kind of identification and processing method of electricity consumption data exceptional value |
CN109918367A (en) * | 2019-03-19 | 2019-06-21 | 北京百度网讯科技有限公司 | A kind of cleaning method of structural data, device, electronic equipment and storage medium |
CN110162519A (en) * | 2019-04-17 | 2019-08-23 | 苏宁易购集团股份有限公司 | Data clearing method |
CN110737640A (en) * | 2019-10-12 | 2020-01-31 | 齐鲁工业大学 | data quality improving method and system based on distributed system |
CN111104398A (en) * | 2019-12-17 | 2020-05-05 | 智慧航海(青岛)科技有限公司 | Detection method and elimination method for approximate repeated record of intelligent ship |
CN112416920A (en) * | 2020-12-01 | 2021-02-26 | 北京理工大学 | MES-oriented data cleaning method and system |
CN114999156A (en) * | 2022-05-27 | 2022-09-02 | 北京汽车研究总院有限公司 | Automatic identification method and device for crossing scene of pedestrian in front of vehicle, medium and vehicle |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20110055252A1 (en) * | 2003-03-28 | 2011-03-03 | Dun & Bradstreet, Inc. | System and method for data cleansing |
CN104699796A (en) * | 2015-03-18 | 2015-06-10 | 浪潮集团有限公司 | Data cleaning method based on data warehouse |
CN106055613A (en) * | 2016-05-26 | 2016-10-26 | 华东理工大学 | Cleaning method for data classification and training databases based on mixed norm |
Non-Patent Citations (2)
Title |
---|
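Yang Hongna: "Research on data cleaning technology based on data warehouse", China Master's Theses Full-text Database, Information Science and Technology * |
Xie Wenge et al.: "Research on duplicate record cleaning algorithms in data cleaning", Software Engineer * |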
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108763346A (en) * | 2018-05-15 | 2018-11-06 | 中南大学 | A kind of abnormal point processing method of sliding window box figure medium filtering |
CN108763346B (en) * | 2018-05-15 | 2022-02-01 | 中南大学 | Abnormal point processing method for sliding window box type graph median filtering |
CN109727446A (en) * | 2019-01-15 | 2019-05-07 | 华北电力大学(保定) | A kind of identification and processing method of electricity consumption data exceptional value |
CN109918367A (en) * | 2019-03-19 | 2019-06-21 | 北京百度网讯科技有限公司 | A kind of cleaning method of structural data, device, electronic equipment and storage medium |
CN110162519A (en) * | 2019-04-17 | 2019-08-23 | 苏宁易购集团股份有限公司 | Data clearing method |
CN110737640A (en) * | 2019-10-12 | 2020-01-31 | 齐鲁工业大学 | data quality improving method and system based on distributed system |
CN111104398A (en) * | 2019-12-17 | 2020-05-05 | 智慧航海(青岛)科技有限公司 | Detection method and elimination method for approximate repeated record of intelligent ship |
CN111104398B (en) * | 2019-12-17 | 2023-08-29 | 智慧航海(青岛)科技有限公司 | Detection method and elimination method for intelligent ship approximate repeated record |
CN112416920A (en) * | 2020-12-01 | 2021-02-26 | 北京理工大学 | MES-oriented data cleaning method and system |
CN112416920B (en) * | 2020-12-01 | 2023-01-24 | 北京理工大学 | MES-oriented data cleaning method and system |
CN114999156A (en) * | 2022-05-27 | 2022-09-02 | 北京汽车研究总院有限公司 | Automatic identification method and device for crossing scene of pedestrian in front of vehicle, medium and vehicle |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107025301A (en) | Flight ensures the method for cleaning of data | |
Zhang et al. | Improving crowdsourced label quality using noise correction | |
CN107294993A (en) | A kind of WEB abnormal flow monitoring methods based on integrated study | |
CN101957889B (en) | Selective wear-based equipment optimal maintenance time prediction method | |
CN104281525B (en) | A kind of defect data analysis method and the method utilizing its reduction Software Testing Project | |
KR20180072167A (en) | System for extracting similar patents and method thereof | |
CN107274066B (en) | LRFMD model-based shared traffic customer value analysis method | |
CN107832467A (en) | A kind of microblog topic detecting method based on improved Single pass clustering algorithms | |
CN105447079B (en) | A kind of data cleaning method based on functional dependence | |
CN113239087A (en) | Anti-electricity-stealing inspection monitoring method and system | |
KR20190053616A (en) | Data merging device and method for bia datda analysis | |
CN108268886A (en) | For identifying the method and system of plug-in operation | |
Alizamini et al. | Data quality improvement using fuzzy association rules | |
Singh et al. | Performance analysis of faculty using data mining techniques | |
CN116756373A (en) | Project review expert screening method, system and medium based on knowledge graph update | |
Ganjour et al. | Gender inequality regarding retirement benefits in Switzerland | |
JansiRani et al. | Computation of reducts using topology and measure of significance of attributes | |
Pereira et al. | Traffic event detection using online social networks | |
Yang et al. | Analysis of dishonorable behavior on railway online ticketing system based on k-means and FP-growth | |
Wu et al. | Interval type-2 fuzzy clustering based association rule mining method | |
Silva et al. | Detecting possible persons of interest in a physical activity program using step entries: Including a web‐based application for outlier detection and decision‐making | |
Samant et al. | Bigram-based features for real-world event identification from microblogs | |
Jun et al. | Research on Evaluation Method Used to Quality Performance of Missile Weapon Based on Rough Set Rule Extraction | |
Braune et al. | Behavioral clustering for point processes | |
Jahanian et al. | Selecting Optimal k in the k-means Clustering Algorithm |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170808 |