CN110263230B - Data cleaning method and device based on density clustering - Google Patents

Info

Publication number
CN110263230B
CN110263230B CN201910341078.2A CN201910341078A CN110263230B CN 110263230 B CN110263230 B CN 110263230B CN 201910341078 A CN201910341078 A CN 201910341078A CN 110263230 B CN110263230 B CN 110263230B
Authority
CN
China
Prior art keywords
data
eps
samples
data set
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910341078.2A
Other languages
Chinese (zh)
Other versions
CN110263230A (en)
Inventor
许海涛
张晓鹏
周贤伟
林福宏
吕兴
安建伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology Beijing USTB
Original Assignee
University of Science and Technology Beijing USTB
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology Beijing USTB filed Critical University of Science and Technology Beijing USTB
Priority to CN201910341078.2A priority Critical patent/CN110263230B/en
Publication of CN110263230A publication Critical patent/CN110263230A/en
Application granted granted Critical
Publication of CN110263230B publication Critical patent/CN110263230B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/903: Querying
    • G06F16/90335: Query processing
    • G06F16/90344: Query processing by using string matching techniques
    • G06F16/906: Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the field of data mining and provides a data cleaning method and device based on density clustering, which can improve the accuracy of cleaning results. The method comprises the following steps: acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise numerical data and character data; determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set; taking the estimated eps and minPts values as DBSCAN parameter values, and carrying out density clustering on the data set to be cleaned; and cleaning the data in the data set to be cleaned according to the clustering result.

Description

Data cleaning method and device based on density clustering
Technical Field
The invention relates to the field of data mining, in particular to a data cleaning method and device based on density clustering.
Background
Since the onset of the information explosion era, data has become an important force driving the development of industry. Enterprises can extract a large amount of useful information from the wealth hidden in data, use it to support decisions in business management, market analysis, scientific exploration and other areas, and thereby promote their own development. However, real-world data is often complicated: it mixes data of different structures with different types of dirty data, such as erroneous data, invalid data, missing data and duplicate data, which greatly increases the difficulty of data analysis.
Machine learning methods are widely applied in the field of data cleaning, where their core purpose is to cluster the data set. Cluster analysis is a statistical method for studying the classification of samples. The purpose of clustering is to make the similarity between objects of the same class as large as possible and the similarity between objects of different classes as small as possible. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is a classic clustering algorithm. Its final clustering result depends on the choice of the eps and minPts parameter values, where eps represents the scanning radius and minPts represents the minimum number of contained points; if these values are not chosen properly, the clustering result is poor, and wrong clusters may even occur.
In the prior art, the parameters eps and minPts of the DBSCAN algorithm are generally set manually according to experience, so that the accuracy of the clustering result is low.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a data cleaning method and device based on density clustering, so as to solve the problem of low accuracy of clustering results caused by manually setting parameters eps and minPts of a DBSCAN algorithm according to experience in the prior art.
In order to solve the above technical problem, an embodiment of the present invention provides a data cleaning method based on density clustering, including:
acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;
determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set, wherein eps represents a scanning radius, minPts represents the minimum number of contained points, and DBSCAN represents density-based spatial clustering of applications with noise;
taking the estimated eps and minPts values as DBSCAN parameter values, and carrying out density clustering on the data set to be cleaned;
and cleaning the data in the data set to be cleaned according to the clustering result.
Further, the distance between samples is expressed as:
Figure BDA0002040721410000021
where $\mathrm{dist}(X, Y)$ represents the distance between samples $X$ and $Y$, and $w_k$ represents the weight of the $k$-th attribute value of the sample,
Figure BDA0002040721410000022
denotes the standardized Euclidean distance between the sample attributes when the $k$-th attribute value of the sample is numerical, $\mathrm{sim}(x_{sk}, y_{sk})$ denotes the edit-distance-based string similarity of the sample attributes when the $k$-th attribute value of the sample is character-type, $n$ denotes the number of attributes contained in the sample, $\varepsilon_k$ denotes the missing state of the $k$-th attribute, $x_{nk}$ and $y_{nk}$ denote the values of the $k$-th attribute of $X$ and $Y$ when it is numerical, $x_{sk}$ and $y_{sk}$ denote the values of the $k$-th attribute of $X$ and $Y$ when it is character-type, $X$ and $Y$ denote two samples of the data set, and $z(x_{nk})$ and $z(y_{nk})$ denote the normalized $x_{nk}$ and $y_{nk}$, respectively.
Further, $\varepsilon_k$ is expressed as:
Figure BDA0002040721410000023
Further, $z(x_{nk})$ is expressed as:
$$z(x_{nk}) = \frac{x_{nk} - u}{\sigma}$$
where $u$ is the mean of the $k$-th attribute over all samples in the data set and $\sigma$ is the standard deviation of the $k$-th attribute over all samples in the data set.
Further, the character string similarity based on the edit distance is expressed as:
Figure BDA0002040721410000032
where sim (S, T) represents the similarity between strings S and T based on the edit distance, S and T represent character-type attribute values of 2 samples in the data set, m, d represent the number of characters in strings S and T, respectively, and ld represents the minimum number of edit operations required to change string S to string T.
Further, the estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set includes:
constructing a data set sample distance matrix according to the determined distance between the samples in the data set;
sorting each row of the sample distance matrix in ascending order, wherein the K-th column of the sorted matrix represents the distance between each sample point and its K-th nearest neighbor sample point; calculating the mean value of each column of the sorted matrix to obtain a K-average nearest neighbor distance matrix; and taking the K-average nearest neighbor distance matrix as the set of candidate values of eps;
and determining the number of samples contained in the eps neighborhood of each sample point according to the obtained eps value, and calculating the average value of the number of samples contained in all the sample points as a minPts value.
Further, eps is expressed as:
Figure BDA0002040721410000033
wherein $D_{eps}$ is the set of candidate values for the eps parameter,
Figure BDA0002040721410000034
represents the mean of the $K$-th column of the sorted sample distance matrix $D_{N \times N}$, and $N$ represents the number of samples in the data set;
minPts is expressed as:
$$\mathit{minPts} = \frac{1}{N} \sum_{i=1}^{N} P_i$$
wherein $P_i$ is the number of samples contained in the eps neighborhood of the $i$-th sample.
Further, the density clustering of the data set to be cleaned by using the estimated eps and minPts values as DBSCAN parameter values includes:
sequentially selecting the eps and minPts obtained for different K values, substituting them into the DBSCAN algorithm, and performing density clustering on the data set to be cleaned; if the number of clusters remains unchanged 3 times in a row, judging that the clustering has become stable, and selecting the current eps and minPts as the optimal values of the parameters eps and minPts of the DBSCAN algorithm;
and performing density clustering on the data set to be cleaned according to the obtained optimal values of eps and minPts.
Further, the cleaning the data in the data set to be cleaned according to the clustering result includes:
if the missing attribute value is numerical data, filling the attribute mean value of the cluster where the corresponding sample is located into the missing data;
if the missing attribute value is character data, acquiring the attribute value with the highest frequency of occurrence of the cluster where the corresponding sample is located as the missing data;
if the distance between 2 samples in the cluster is larger than a preset distance threshold, judging that the 2 samples are repeated data, and merging the repeated data.
An embodiment of the present invention further provides a data cleaning apparatus based on density clustering, including:
the acquisition module is used for acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;
the determining module is used for determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
the estimation module is used for estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set, wherein eps represents a scanning radius, minPts represents the minimum number of contained points, and DBSCAN represents density-based spatial clustering of applications with noise;
the clustering module is used for performing density clustering on the data set to be cleaned by taking the estimated eps and minPts values as DBSCAN parameter values;
and the cleaning module is used for cleaning the data in the data set to be cleaned according to the clustering result.
The technical scheme of the invention has the following beneficial effects:
in the above scheme, a data set to be cleaned is acquired, wherein the attribute values of the samples in the data set include numerical data and character data; the distance between samples in the data set is determined by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; eps and minPts of the DBSCAN algorithm are estimated from the determined distances between samples, so that the eps and minPts values are obtained adaptively; density clustering is performed on the numerical data and character data in the data set to be cleaned, using the estimated eps and minPts values as the DBSCAN parameter values; and the data in the data set to be cleaned is cleaned according to the clustering result, so that dirty data of different data types is cleaned. In this way, the data set to be cleaned is clustered by a DBSCAN algorithm whose eps and minPts are obtained adaptively, dirty data of different data types is cleaned according to the clustering result, and the accuracy of the cleaning result can be improved.
Drawings
FIG. 1 is a schematic flow chart of a data cleaning method based on density clustering according to an embodiment of the present invention;
FIG. 2 is a detailed flowchart of a data cleaning method based on density clustering according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a data cleaning apparatus based on density clustering according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The invention provides a data cleaning method and device based on density clustering, aiming at the problem of low accuracy of clustering results caused by manually setting parameters eps and minPts of a DBSCAN algorithm according to experience.
Example one
As shown in FIG. 1, the data cleaning method based on density clustering provided by the embodiment of the present invention includes:
S101, acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;
S102, determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
S103, estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set, wherein eps represents a scanning radius, and minPts represents the minimum number of contained points;
S104, performing density clustering on the data set to be cleaned by taking the estimated eps and minPts values as DBSCAN parameter values;
S105, cleaning the data in the data set to be cleaned according to the clustering result.
In the data cleaning method based on density clustering in the embodiment of the invention, a data set to be cleaned is acquired, wherein the attribute values of the samples in the data set include numerical data and character data; the distance between samples in the data set is determined by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; eps and minPts of the DBSCAN algorithm are estimated from the determined distances between samples, so that the eps and minPts values are obtained adaptively; density clustering is performed on the numerical data and character data in the data set to be cleaned, using the estimated eps and minPts values as the DBSCAN parameter values; and the data in the data set to be cleaned is cleaned according to the clustering result, so that dirty data of different data types is cleaned. In this way, the data set to be cleaned is clustered by a DBSCAN algorithm whose eps and minPts are obtained adaptively, dirty data of different data types is cleaned according to the clustering result, and the accuracy of the cleaning result can be improved.
In this embodiment, after the data set to be cleaned is acquired (S101), invalid data or irrelevant data with missing attributes exceeding a preset value needs to be removed, each attribute type is analyzed, and it is determined whether an attribute value of each attribute is numeric data or character data.
In this embodiment, in S102, the standardized Euclidean distance and the edit-distance-based string similarity are respectively applied to the numerical data and character data of the samples, different weights are given to different attributes, and the 2 distance calculation methods are fused to finally obtain the distance between the samples in the data set, specifically:
assume that the data set is $D$, which includes $N$ samples, each having $n$ attribute values; a sample $X$ in the data set can then be represented as $(x_1, x_2, \ldots, x_n)$. Let $\varepsilon_k$ indicate the missing state of the $k$-th attribute; $\varepsilon_k$ is expressed as:
Figure BDA0002040721410000061
The attribute value $x_k$ of the $k$-th attribute of sample $X$ is expressed as:
Figure BDA0002040721410000062
wherein $x_{nk}$ denotes the value of the $k$-th attribute of $X$ when it is numerical, and $x_{sk}$ denotes the value of the $k$-th attribute of $X$ when it is character-type; the subscript n stands for number and the subscript s stands for string (character string).
In this embodiment, Z-Score normalization is performed on the numerical attribute data of all samples in the data set, and the normalization formula is:
$$z(x_{nk}) = \frac{x_{nk} - u}{\sigma}$$
wherein $z(x_{nk})$ represents the normalized $x_{nk}$, $u$ is the mean of the $k$-th attribute over all samples in the data set, and $\sigma$ is the standard deviation of the $k$-th attribute over all samples in the data set.
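The normalization above can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `z_score_column`, the use of `None` to mark a missing value, the choice of the population standard deviation, and the handling of a constant attribute are all assumptions that the text does not specify.

```python
import math

def z_score_column(values):
    """Z-Score normalize one numerical attribute; None marks a missing value."""
    present = [v for v in values if v is not None]
    u = sum(present) / len(present)  # mean of the attribute over present values
    # Population standard deviation (assumption: the text does not say whether
    # the population or the sample form is intended).
    sigma = math.sqrt(sum((v - u) ** 2 for v in present) / len(present))
    if sigma == 0:
        # Constant attribute: map every present value to 0 (assumption).
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - u) / sigma for v in values]
```

Missing values are passed through unchanged so that a later step can still recognize and skip them.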
Assuming that $Y$ represents another sample in the data set $D$, the distance between samples $X$ and $Y$ is expressed as:
Figure BDA0002040721410000072
where $\mathrm{dist}(X, Y)$ represents the distance between samples $X$ and $Y$, and $w_k$ represents the weight of the $k$-th attribute value of the sample,
Figure BDA0002040721410000073
denotes the standardized Euclidean distance between the sample attributes when the $k$-th attribute value of the sample is numerical, $\mathrm{sim}(x_{sk}, y_{sk})$ denotes the edit-distance-based string similarity of the sample attributes when the $k$-th attribute value of the sample is character-type, $n$ denotes the number of attributes contained in the sample, $\varepsilon_k$ denotes the missing state of the $k$-th attribute, $x_{nk}$ and $y_{nk}$ denote the values of the $k$-th attribute of $X$ and $Y$ when it is numerical, $x_{sk}$ and $y_{sk}$ denote the values of the $k$-th attribute of $X$ and $Y$ when it is character-type, $X$ and $Y$ denote two samples of the data set, and $z(x_{nk})$ and $z(y_{nk})$ denote the normalized $x_{nk}$ and $y_{nk}$, respectively.
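One possible way to fuse the two per-attribute measures is sketched below in Python. The exact combination formula is given only in the figure, so everything specific here is an assumption rather than the patented formula: a weighted average over the attributes present in both samples, `1 - sim` as the string dissimilarity, `None` as the missing marker, and the names `mixed_distance` and `string_sim`.

```python
def mixed_distance(x, y, weights, string_sim):
    """Weighted distance between two samples with mixed attribute types.

    x, y       : equal-length attribute lists; numbers are assumed to be
                 Z-Score normalized already, strings are raw, None is missing.
    weights    : one weight w_k per attribute.
    string_sim : callable returning a similarity in [0, 1] for two strings.
    """
    total, weight_sum = 0.0, 0.0
    for xk, yk, wk in zip(x, y, weights):
        if xk is None or yk is None:
            continue  # attribute missing in either sample: it contributes nothing
        if isinstance(xk, str):
            dk = 1.0 - string_sim(xk, yk)  # turn similarity into a dissimilarity
        else:
            dk = abs(xk - yk)              # one-dimensional Euclidean distance
        total += wk * dk
        weight_sum += wk
    if weight_sum == 0:
        raise ValueError("no attribute is present in both samples")
    # Dividing by the weights actually used keeps samples with many missing
    # attributes comparable to complete ones (assumption).
    return total / weight_sum
```

Passing the string similarity in as a callable keeps the sketch independent of how that similarity is computed.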
In this embodiment, because a sample contains two different kinds of attributes, namely numerical attributes and character attributes, the Euclidean distance and the edit distance are respectively adopted to calculate the distance for each kind.
In this embodiment, the similarity of character strings is determined based on the edit distance. The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to change one character string into another; the larger the distance, the more the two strings differ. The permitted edit operations are replacing one character with another, inserting one character, and deleting one character.
In this embodiment, $S$ and $T$ are assumed to represent attribute values of 2 samples in the data set, where $S = s_1 \ldots s_i \ldots s_m$ and $T = t_1 \ldots t_j \ldots t_d$; $s_i$ and $t_j$ respectively represent characters in $S$ and $T$, and $m$ and $d$ respectively represent the number of characters in $S$ and $T$. For the strings $S$ and $T$, a relationship matrix $LD$ of size $(m+1) \times (d+1)$ is established:
$$LD_{(m+1) \times (d+1)} = \{ d_{ij} \} \quad (0 \le i \le m,\ 0 \le j \le d)$$
wherein the matrix element $d_{ij}$ is expressed as:
$$d_{ij} = \begin{cases} i, & j = 0 \\ j, & i = 0 \\ \min\left(d_{i-1,j} + 1,\ d_{i,j-1} + 1,\ d_{i-1,j-1} + a_{ij}\right), & i > 0,\ j > 0 \end{cases}$$
and $a_{ij}$ indicates whether an edit operation needs to be performed on the characters $s_i$ and $t_j$:
$$a_{ij} = \begin{cases} 0, & s_i = t_j \\ 1, & s_i \ne t_j \end{cases}$$
The final element $d_{md}$ is the Levenshtein distance between the two character strings, abbreviated as the ld distance, where ld is expressed as:
$$ld = d_{md}$$
where ld represents the minimum number of edit operations required to change the character string $S$ into the character string $T$.
In this embodiment, the similarity between character strings based on the edit distance may be expressed as:
Figure BDA0002040721410000083
where sim (S, T) represents the similarity between strings S and T based on the edit distance, S and T represent character-type attribute values of 2 samples in the data set, m, d represent the number of characters in strings S and T, respectively, and ld represents the minimum number of edit operations required to change string S to string T.
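The relationship matrix and the resulting similarity can be sketched in Python as follows. The matrix filling is the standard Levenshtein dynamic program. The normalization `1 - ld / max(m, d)` is an assumption chosen for illustration, since the similarity formula itself is given only in the figure, and the function names are hypothetical.

```python
def levenshtein(s, t):
    """Minimum number of insertions, deletions and replacements turning s into t."""
    m, d = len(s), len(t)
    # (m+1) x (d+1) relationship matrix; row 0 and column 0 hold the cost of
    # building a prefix from the empty string.
    ld = [[0] * (d + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        ld[i][0] = i
    for j in range(d + 1):
        ld[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, d + 1):
            a = 0 if s[i - 1] == t[j - 1] else 1  # replacement needed?
            ld[i][j] = min(ld[i - 1][j] + 1,       # delete a character of s
                           ld[i][j - 1] + 1,       # insert a character of t
                           ld[i - 1][j - 1] + a)   # keep or replace
    return ld[m][d]

def edit_similarity(s, t):
    """Similarity in [0, 1]; normalizing by the longer length is an assumption."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(s, t) / longest
```

With this normalization, identical strings score 1 and strings that share no aligned character score 0.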
In an embodiment of the foregoing density clustering-based data cleansing method, further, the estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set includes:
constructing a data set sample distance matrix $D_{N \times N}$ according to the determined distance between the samples in the data set, wherein $D_{N \times N}$ is expressed as:
$$D_{N \times N} = \{ \mathrm{dist}(X_i, X_j) \mid 1 \le i \le N,\ 1 \le j \le N \}$$
wherein $\mathrm{dist}(X_i, X_j)$ represents the distance between the $i$-th sample and the $j$-th sample, and $N$ represents the number of samples in the data set;
sorting each row of the sample distance matrix $D_{N \times N}$ in ascending order, wherein column 0 of the sorted matrix holds the distance between each sample point and itself, which is 0, and column $K$ holds the distance between each sample point and its $K$-th nearest neighbor; and calculating the mean value of each column of the sorted matrix to obtain the final K-average nearest neighbor distance matrix, which is used as the set of candidate values of eps:
Figure BDA0002040721410000091
wherein $D_{eps}$ is the set of candidate values for the eps parameter,
Figure BDA0002040721410000092
represents the mean of the $K$-th column of the sorted sample distance matrix $D_{N \times N}$, and $N$ represents the number of samples in the data set;
determining the number of samples contained in the eps neighborhood of each sample point according to the obtained eps value, and calculating the average of these numbers over all sample points as the minPts value, wherein minPts is expressed as:
$$\mathit{minPts} = \frac{1}{N} \sum_{i=1}^{N} P_i$$
wherein $P_i$ is the number of samples contained in the eps neighborhood of the $i$-th sample.
In this embodiment, the two parameters of the DBSCAN algorithm, namely the scanning radius eps and the minimum number of contained points minPts, are estimated from a statistical characteristic (the average value) of the average nearest neighbor distance of the sample points in the data set, where eps is the distance threshold that defines the eps neighborhood, and minPts is the minimum number of samples that the eps neighborhood of a sample point must contain for that point to be a core sample point.
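The estimation of the eps candidates and of minPts can be sketched in Python on a precomputed distance matrix. The function names are hypothetical. Three further details are assumptions: K is taken from 1 to N-1, a sample counts as a member of its own eps neighborhood, and minPts is left as an unrounded average.

```python
def eps_candidates(dist_matrix):
    """Mean distance to the K-th nearest neighbor, for K = 1 .. N-1."""
    n = len(dist_matrix)
    # After sorting, column 0 of each row is the self-distance 0.
    sorted_rows = [sorted(row) for row in dist_matrix]
    return [sum(row[k] for row in sorted_rows) / n for k in range(1, n)]

def estimate_min_pts(dist_matrix, eps):
    """Average number of samples inside each sample's eps neighborhood."""
    n = len(dist_matrix)
    # A sample is counted in its own neighborhood (assumption).
    counts = [sum(1 for dij in row if dij <= eps) for row in dist_matrix]
    return sum(counts) / n
```

For example, for four one-dimensional points 0, 1, 2 and 10 with the absolute difference as distance, the candidate for K = 1 is (1 + 1 + 1 + 8) / 4 = 2.75, and the matching minPts estimate is (3 + 3 + 3 + 1) / 4 = 2.5.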
In this embodiment, clustering results on a large number of data sets indicate that eps and minPts tend to be stable around the optimal clustering result. The eps and minPts obtained for different K values are selected in sequence and substituted into the DBSCAN algorithm, density clustering is performed on the data set to be cleaned, and the number of clusters in the final clustering result is obtained; if the number of clusters remains unchanged 3 times in a row, the clustering is considered to have become stable, and the eps and minPts at that moment are selected as the optimal values of the parameters eps and minPts of the DBSCAN algorithm. Density clustering is then performed on the data set to be cleaned according to the obtained optimal values of eps and minPts, as shown in FIG. 2.
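The stopping rule can be sketched in Python independently of the clustering itself. In this sketch `count_clusters` is a hypothetical callable that runs DBSCAN with the given parameters and returns the number of clusters. Reading "unchanged 3 times" as the same count on three consecutive runs, and returning `None` when that never happens, are assumptions.

```python
def select_stable_parameters(candidates, count_clusters, patience=3):
    """Pick the first (eps, min_pts) at which the cluster count has been stable.

    candidates     : iterable of (eps, min_pts) pairs ordered by increasing K.
    count_clusters : callable (eps, min_pts) -> number of clusters found.
    patience       : number of consecutive equal counts that signals stability.
    """
    previous, run = None, 0
    for eps, min_pts in candidates:
        count = count_clusters(eps, min_pts)
        # Extend the current run of equal counts, or start a new run.
        run = run + 1 if count == previous else 1
        previous = count
        if run >= patience:
            return eps, min_pts
    return None  # the cluster count never stabilized
```

Keeping the clustering call behind a callable lets the same rule be used with any DBSCAN implementation.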
In this embodiment, DBSCAN is a classic density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as the maximal set of density-connected points, so it can partition regions of sufficiently high density into clusters and find clusters of arbitrary shape in a spatial database with noise. The DBSCAN-based clustering method is defined as follows: if the number of samples in the eps neighborhood of a sample point is not less than minPts, the sample point is a core sample point; given a data set D, if sample p is within the eps neighborhood of q and q is a core sample point, then sample p is directly density-reachable from sample q; if there is a chain of samples $p_1, p_2, \ldots, p_i, \ldots, p_n$ with $p_1 = p$ and $p_n = q$ such that each $p_i$ is directly density-reachable from $p_{i+1}$ with respect to eps and minPts, then sample p is density-reachable from sample q; if there is a sample point o such that both p and q are density-reachable from o, then sample points p and q are density-connected; if a sample is not density-reachable from any core sample, the sample is considered a noise sample; all density-connected sample points are grouped into one cluster.
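These definitions translate into a compact Python sketch that works directly on a precomputed distance matrix, which suits the mixed-type distance used here. It is a plain textbook-style DBSCAN written for illustration, not an implementation taken from the patent; labeling noise as `-1` and counting a sample in its own neighborhood are assumptions.

```python
def dbscan(dist_matrix, eps, min_pts):
    """Cluster samples from a precomputed distance matrix; -1 marks noise."""
    n = len(dist_matrix)
    # eps neighborhood of every sample (the sample itself is included).
    neighbors = [[j for j in range(n) if dist_matrix[i][j] <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_pts for nb in neighbors]
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already assigned, or not a core point to start from
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # border point: joins the cluster but does not extend it
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster
                    frontier.append(q)
        cluster += 1
    return labels
```

Samples that are never reached from a core point keep the label `-1` and correspond to the noise samples defined above.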
In this embodiment, according to the obtained optimal values of eps and minPts, density clustering is performed on the data set to be cleaned to obtain a clustering result, and according to the clustering result, the data in the data set to be cleaned is cleaned to obtain clean data, specifically:
if the missing attribute value is numerical data, filling the attribute mean value of the cluster where the corresponding sample is located into the missing data;
if the missing attribute value is character data, acquiring the attribute value with the highest frequency of occurrence of the cluster where the corresponding sample is located as the missing data;
if the distance between 2 samples in the cluster is larger than a preset distance threshold, judging that the 2 samples are repeated data, and merging the repeated data.
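The two filling rules above can be sketched in Python as follows. The function name, the use of `None` as the missing marker and `-1` as the noise label, and the decision to leave noise samples untouched are assumptions made for illustration; merging of repeated data is not covered by this sketch.

```python
from collections import Counter

def fill_missing(samples, labels):
    """Return a copy of samples with None values filled from the same cluster."""
    cleaned = [list(s) for s in samples]
    for cluster in set(labels) - {-1}:  # noise samples are left as they are
        members = [i for i, lab in enumerate(labels) if lab == cluster]
        for k in range(len(samples[0])):
            present = [samples[i][k] for i in members
                       if samples[i][k] is not None]
            if not present:
                continue  # no value in this cluster to fill from
            if isinstance(present[0], str):
                # Character attribute: most frequent value in the cluster.
                fill = Counter(present).most_common(1)[0][0]
            else:
                # Numerical attribute: mean of the cluster.
                fill = sum(present) / len(present)
            for i in members:
                if cleaned[i][k] is None:
                    cleaned[i][k] = fill
    return cleaned
```

The input list is not modified, so the original and the cleaned data can be compared afterwards.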
The data cleaning method based on density clustering can be used for solving the problem of cleaning different data types.
The data cleaning method based on density clustering can be applied to the field of network security logs, so that log data can be analyzed without prior knowledge, the pressure of manual participation in analysis can be greatly reduced, fault positions can be conveniently and quickly found out, the system load is reduced, and the operation efficiency of the system is improved. Namely: the data set to be cleaned may be: a network security log dataset; the main steps of cleaning the network security log data set comprise:
preprocessing the network security log data set, and analyzing and removing attributes that carry information irrelevant to the analysis, such as IP addresses and timestamps, according to service requirements; and determining the attribute type of each attribute in the log data set, wherein the attribute types comprise: numeric and character types;
determining the distance between samples in the log data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the log data set, wherein eps represents a scanning radius, and minPts represents the minimum contained point number;
taking the estimated eps and minPts values as DBSCAN parameter values, carrying out density clustering on a log data set to be cleaned, and gathering similar logs into the same cluster;
according to the clustering result, the repeated log data in the same cluster are merged, so that manual reference and analysis are facilitated, the fault position is quickly determined, the log missing attribute generated due to system faults is filled, and the accuracy of log analysis is improved.
Example two
The present invention also provides a specific embodiment of a data cleaning apparatus based on density clustering, which corresponds to the specific embodiment of the data cleaning method based on density clustering provided by the present invention, and the data cleaning apparatus based on density clustering can achieve the purpose of the present invention by executing the flow steps in the specific embodiment of the method, so the explanation in the specific embodiment of the data cleaning method based on density clustering is also applicable to the specific embodiment of the data cleaning apparatus based on density clustering provided by the present invention, and will not be described again in the following specific embodiment of the present invention.
As shown in fig. 3, an embodiment of the present invention further provides a data cleaning apparatus based on density clustering, including:
an obtaining module 11, configured to obtain a data set to be cleaned, where an attribute value of a sample in the data set includes: numerical data and character data;
a determining module 12, configured to determine distances between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
an estimation module 13, configured to estimate eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set, where eps represents a scanning radius, minPts represents the minimum number of contained points, and DBSCAN represents density-based spatial clustering of applications with noise;
the clustering module 14 is configured to perform density clustering on the data set to be cleaned by using the estimated eps and minPts values as DBSCAN parameter values;
and the cleaning module 15 is used for cleaning the data in the data set to be cleaned according to the clustering result.
In the data cleaning device based on density clustering in the embodiment of the invention, a data set to be cleaned is acquired, wherein the attribute values of the samples in the data set include numerical data and character data; the distance between samples in the data set is determined by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; eps and minPts of the DBSCAN algorithm are estimated from the determined distances between samples, so that the eps and minPts values are obtained adaptively; density clustering is performed on the numerical data and character data in the data set to be cleaned, using the estimated eps and minPts values as the DBSCAN parameter values; and the data in the data set to be cleaned is cleaned according to the clustering result, so that dirty data of different data types is cleaned. In this way, the data set to be cleaned is clustered by a DBSCAN algorithm whose eps and minPts are obtained adaptively, dirty data of different data types is cleaned according to the clustering result, and the accuracy of the cleaning result can be improved.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A data cleaning method based on density clustering is characterized by comprising the following steps:
acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;
determining the distances between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
estimating eps and minPts of the DBSCAN algorithm according to the determined distances between the samples in the data set, wherein eps represents the scanning radius, minPts represents the minimum number of contained points, and DBSCAN denotes density-based spatial clustering of applications with noise;
taking the estimated eps and minPts values as DBSCAN parameter values, and carrying out density clustering on the data set to be cleaned;
and cleaning the data in the data set to be cleaned according to the clustering result.
2. The data cleaning method based on density clustering of claim 1, wherein the distance between samples is expressed as:
Figure FDA0002898172810000011
where $dist(X, Y)$ represents the distance between samples X and Y, $w_k$ represents the weight of the kth attribute of a sample,
Figure FDA0002898172810000012
denotes the standardized Euclidean distance between the sample attributes when the kth attribute of the sample is numerical; $sim(x_{sk}, y_{sk})$ denotes the edit-distance-based string similarity of the sample attributes when the kth attribute of the sample is character-type; n denotes the number of attributes contained in a sample; $\varepsilon_k$ indicates the missing state of the kth attribute; $x_{nk}$ and $y_{nk}$ denote the values of the kth attribute of X and Y when that attribute is numerical; $x_{sk}$ and $y_{sk}$ denote the values of the kth attribute of X and Y when that attribute is character-type; X and Y denote two samples of the data set; and $z(x_{nk})$ and $z(y_{nk})$ denote the standardized $x_{nk}$ and $y_{nk}$, respectively.
3. The data cleaning method based on density clustering of claim 2, characterized in that $\varepsilon_k$ is expressed as:
Figure FDA0002898172810000021
4. The data cleaning method based on density clustering according to claim 2, characterized in that $z(x_{nk})$ is expressed as:
Figure FDA0002898172810000022
where u is the mean of the kth attribute over all samples in the data set, and σ is the standard deviation of the kth attribute over all samples in the data set.
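As an illustration of the standardization step, the following Python sketch assumes the usual z-score form z(x) = (x - u) / σ, computed per attribute. The formula in the figure is not reproduced in this text, so that form, the use of the population standard deviation, the treatment of a zero standard deviation, and the function name `standardize` are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Return z-scores for one numerical attribute across all samples.

    Assumes z(x) = (x - u) / sigma, with u the mean and sigma the
    population standard deviation of this attribute (an assumption; the
    exact formula is only given as a figure).
    """
    u = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values are identical, so every standardized value is 0.
        return [0.0 for _ in values]
    return [(x - u) / sigma for x in values]


# Hypothetical attribute column with mean 5 and standard deviation 2.
z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
```

Standardizing each attribute separately keeps attributes with large numeric ranges from dominating the distance between samples.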
5. The data cleaning method based on density clustering of claim 1, wherein the character string similarity based on edit distance is expressed as:
Figure FDA0002898172810000023
where sim(S, T) represents the edit-distance-based similarity between strings S and T, S and T represent the character-type attribute values of two samples in the data set, m and d represent the number of characters in strings S and T, respectively, and ld represents the minimum number of edit operations required to change string S into string T.
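The similarity described above can be sketched in Python as follows. The value ld is the Levenshtein distance, computed here by dynamic programming. The normalization 1 - ld / max(m, d) is a common choice and is an assumption, because the formula in the figure is not reproduced in this text; the function names `levenshtein` and `edit_similarity` are likewise hypothetical.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string takes i deletions
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(s: str, t: str) -> float:
    """Similarity in [0, 1], assuming sim = 1 - ld / max(m, d)."""
    m, d = len(s), len(t)
    if max(m, d) == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(s, t) / max(m, d)
```

Under this assumed normalization, identical strings score 1 and strings with no characters in common at any position score 0.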
6. The method of claim 1, wherein estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set comprises:
constructing a data set sample distance matrix according to the determined distance between the samples in the data set;
sorting each row of the sample distance matrix in ascending order, so that the K-th column of the sorted matrix holds the distance between each sample point and its K-th nearest neighbor; calculating the mean of each column of the sorted matrix to obtain the K-average nearest neighbor distances; and taking the K-average nearest neighbor distances as the candidate values of eps;
and determining, for each obtained eps value, the number of samples contained in the eps neighborhood of each sample point, and taking the average of these counts over all sample points as the minPts value.
7. The data cleaning method based on density clustering of claim 6, wherein eps is expressed as:
Figure FDA0002898172810000024
wherein $D_{eps}$ is the set of candidate values for the eps parameter,
Figure FDA0002898172810000032
represents the mean of the K-th column of the sorted sample distance matrix $D_{N \times N}$, where N represents the number of samples in the data set;
minPts is expressed as:
Figure FDA0002898172810000031
wherein $P_i$ is the number of samples contained in the eps neighborhood of the ith sample.
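The parameter estimation described in the two preceding claims can be sketched in Python as follows. The sketch takes a precomputed N×N distance matrix as a list of rows. Two points are assumptions: column 0 of each sorted row is treated as the sample's zero distance to itself, so K runs from 1 to N-1, and each sample is counted as a member of its own eps neighborhood. The function names `eps_candidates` and `min_pts_for` are hypothetical.

```python
def eps_candidates(dist):
    """Return the K-average nearest neighbor distances for K = 1 .. N-1."""
    n = len(dist)
    sorted_rows = [sorted(row) for row in dist]  # ascending order per row
    # Column k of the sorted matrix holds each sample's distance to its
    # k-th nearest neighbor; the column mean is one eps candidate.
    return [sum(row[k] for row in sorted_rows) / n for k in range(1, n)]


def min_pts_for(dist, eps):
    """Average number of samples inside the eps neighborhood of each sample."""
    counts = [sum(1 for d in row if d <= eps) for row in dist]
    return sum(counts) / len(counts)


# Hypothetical example: three points on a line at positions 0, 1, and 3.
D = [[0, 1, 3],
     [1, 0, 2],
     [3, 2, 0]]
candidates = eps_candidates(D)
pairs = [(eps, min_pts_for(D, eps)) for eps in candidates]
```

The averaged count is generally not an integer, so a caller that passes it to a DBSCAN implementation would need to round it; how the patent rounds is not stated in this text.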
8. The data cleaning method based on density clustering according to claim 1, wherein the density clustering of the data set to be cleaned by taking the estimated eps and minPts values as DBSCAN parameter values comprises:
sequentially selecting the eps and minPts values for different values of K, substituting them into the DBSCAN algorithm, and performing density clustering on the data set to be cleaned; if the number of clusters remains unchanged 3 consecutive times, judging that the clustering has stabilized, and selecting the current eps and minPts as the optimal values of the DBSCAN parameters eps and minPts;
and performing density clustering on the data set to be cleaned according to the obtained optimal values of eps and minPts.
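The stability check above can be sketched in Python as a loop over candidate parameter pairs. To keep the sketch independent of any particular DBSCAN implementation, the clustering step is passed in as a hypothetical callable `count_clusters(eps, min_pts)` that returns the number of clusters. Reading "unchanged 3 consecutive times" as three identical cluster counts in a row is an interpretation, exposed here as the `patience` parameter.

```python
from typing import Callable, Optional, Sequence, Tuple


def select_stable_parameters(
    candidates: Sequence[Tuple[float, float]],
    count_clusters: Callable[[float, float], int],
    patience: int = 3,
) -> Optional[Tuple[float, float]]:
    """Return the first (eps, min_pts) at which the cluster count has been
    identical for `patience` consecutive candidates, or None if it never is."""
    previous = None  # cluster count seen for the previous candidate
    run = 0          # length of the current run of identical counts
    for eps, min_pts in candidates:
        n_clusters = count_clusters(eps, min_pts)
        if n_clusters == previous:
            run += 1
        else:
            previous, run = n_clusters, 1
        if run >= patience:
            return eps, min_pts
    return None
```

The candidates are expected in order of increasing K, so the function returns the pair that is current at the moment stability is detected.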
9. The data cleaning method based on density clustering according to claim 1, wherein the cleaning the data in the data set to be cleaned according to the clustering result comprises:
if the missing attribute value is numerical data, filling the missing data with the mean value of that attribute over the cluster containing the corresponding sample;
if the missing attribute value is character data, filling the missing data with the attribute value that occurs most frequently in the cluster containing the corresponding sample;
if the distance between two samples in a cluster is larger than a preset distance threshold, judging the two samples to be duplicate data, and merging the duplicate data.
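The two filling rules above can be sketched in Python as follows; duplicate merging is not covered. Samples are represented as dictionaries with `None` marking a missing value, and `labels` holds one cluster label per sample. Leaving samples with the label -1 unfilled follows the common convention of marking DBSCAN noise points with -1 and is an assumption, as are the function name `fill_missing` and its parameters.

```python
from collections import Counter
from statistics import mean


def fill_missing(samples, labels, column, numeric, noise_label=-1):
    """Fill missing values of one attribute from the sample's own cluster.

    Numerical attributes receive the cluster mean; character attributes
    receive the most frequent value in the cluster.
    """
    observed_by_cluster = {}
    for row, label in zip(samples, labels):
        if row[column] is not None:
            observed_by_cluster.setdefault(label, []).append(row[column])

    filled = []
    for row, label in zip(samples, labels):
        new_row = dict(row)  # leave the input samples unmodified
        observed = observed_by_cluster.get(label)
        if new_row[column] is None and label != noise_label and observed:
            if numeric:
                new_row[column] = mean(observed)
            else:
                new_row[column] = Counter(observed).most_common(1)[0][0]
        filled.append(new_row)
    return filled


# Hypothetical data: three samples in cluster 0 and one noise sample.
rows = [
    {"age": 30, "city": "Beijing"},
    {"age": None, "city": "Beijing"},
    {"age": 40, "city": None},
    {"age": None, "city": "Shanghai"},
]
cluster_labels = [0, 0, 0, -1]
rows = fill_missing(rows, cluster_labels, "age", numeric=True)
rows = fill_missing(rows, cluster_labels, "city", numeric=False)
```

A value stays missing when its sample is noise or when no other sample in the cluster has an observed value for that attribute.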
10. A data cleaning device based on density clustering is characterized by comprising:
the acquisition module is used for acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;
the determining module is used for determining the distances between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;
the estimation module is used for estimating eps and minPts of the DBSCAN algorithm according to the determined distances between the samples in the data set, wherein eps represents the scanning radius, minPts represents the minimum number of contained points, and DBSCAN denotes density-based spatial clustering of applications with noise;
the clustering module is used for performing density clustering on the data set to be cleaned by taking the estimated eps and minPts values as DBSCAN parameter values;
and the cleaning module is used for cleaning the data in the data set to be cleaned according to the clustering result.
CN201910341078.2A 2019-04-25 2019-04-25 Data cleaning method and device based on density clustering Active CN110263230B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910341078.2A CN110263230B (en) 2019-04-25 2019-04-25 Data cleaning method and device based on density clustering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910341078.2A CN110263230B (en) 2019-04-25 2019-04-25 Data cleaning method and device based on density clustering

Publications (2)

Publication Number Publication Date
CN110263230A CN110263230A (en) 2019-09-20
CN110263230B CN110263230B (en) 2021-04-06

Family

ID=67913894

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910341078.2A Active CN110263230B (en) 2019-04-25 2019-04-25 Data cleaning method and device based on density clustering

Country Status (1)

Country Link
CN (1) CN110263230B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110988856B (en) * 2019-12-19 2021-08-03 电子科技大学 Target detection trace agglomeration algorithm based on density clustering
CN111046056A (en) * 2019-12-26 2020-04-21 成都康赛信息技术有限公司 Data consistency evaluation method based on data pattern clustering
CN111582406A (en) * 2020-05-31 2020-08-25 重庆大学 Power equipment state monitoring data clustering method and system
CN112187550B (en) * 2020-10-16 2022-09-30 温州职业技术学院 Log analysis method based on density peak value multi-attribute clustering
CN112633320B (en) * 2020-11-26 2023-04-07 西安电子科技大学 Radar radiation source data cleaning method based on phase image coefficient and DBSCAN
CN112632953B (en) * 2020-12-22 2023-07-25 云汉芯城(上海)互联网科技股份有限公司 Method for rapidly and accurately detecting that multiple uploaded bill of materials belongs to same product
CN113379454A (en) * 2021-06-09 2021-09-10 北京房江湖科技有限公司 Data processing method and device, electronic equipment and storage medium
CN117834382A (en) * 2022-09-26 2024-04-05 中兴通讯股份有限公司 Equipment group obstacle recognition method, device and computer readable storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109190653A (en) * 2018-07-09 2019-01-11 四川大学 Malicious code family homology analysis technology based on semi-supervised Density Clustering
CN109543775A (en) * 2018-12-18 2019-03-29 贵州联科卫信科技有限公司 A kind of feature selection approach towards clustering algorithm based on Density Clustering

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170184410A1 (en) * 2015-12-29 2017-06-29 Le Holdings (Beijing) Co., Ltd. Method and electronic device for personalized navigation
KR102446811B1 (en) * 2016-02-19 2022-09-23 삼성전자주식회사 Method for combining and providing colltected data from plural devices and electronic device for the same

Also Published As

Publication number Publication date
CN110263230A (en) 2019-09-20

Similar Documents

Publication Publication Date Title
CN110263230B (en) Data cleaning method and device based on density clustering
CN110335168B (en) Method and system for optimizing power utilization information acquisition terminal fault prediction model based on GRU
CN108959395B (en) Multi-source heterogeneous big data oriented hierarchical reduction combined cleaning method
CN110880019A (en) Method for adaptively training target domain classification model through unsupervised domain
CN110737805B (en) Method and device for processing graph model data and terminal equipment
CN111860981A (en) Enterprise national industry category prediction method and system based on LSTM deep learning
CN112800148B (en) Scattered pollution enterprise study and judgment method based on clustering feature tree and outlier quantification
CN116881430B (en) Industrial chain identification method and device, electronic equipment and readable storage medium
CN115878599A (en) Sewage industry data cleaning method
CN116226103A (en) Method for detecting government data quality based on FPGrow algorithm
CN117131449A (en) Data management-oriented anomaly identification method and system with propagation learning capability
CN117453764A (en) Data mining analysis method
CN113726558A (en) Network equipment flow prediction system based on random forest algorithm
CN113743453A (en) Population quantity prediction method based on random forest
CN116595543A (en) Processing system for developing application data by software based on Internet platform
CN116188834B (en) Full-slice image classification method and device based on self-adaptive training model
KR101985961B1 (en) Similarity Quantification System of National Research and Development Program and Searching Cooperative Program using same
CN110502669A (en) The unsupervised chart dendrography learning method of lightweight and device based on the side N DFS subgraph
CN115660730A (en) Loss user analysis method and system based on classification algorithm
CN112819527A (en) User grouping processing method and device
CN113033694B (en) Data cleaning method based on deep learning
CN108964951B (en) Method for acquiring alarm information and server
CN116578611B (en) Knowledge management method and system for inoculated knowledge
CN116451675A (en) Detection optimization method for similar repeated records based on density clustering algorithm DBSCAN algorithm
CN116070120B (en) Automatic identification method and system for multi-tag time sequence electrophysiological signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract
Application publication date: 20190920
Assignee: Henan Tianbo Internet of things Research Institute Co.,Ltd.
Assignor: University OF SCIENCE AND TECHNOLOGY BEIJING
Contract record no.: X2022980003571
Denomination of invention: A data cleaning method and device based on Density Clustering
Granted publication date: 20210406
License type: Common License
Record date: 20220401