CN110263230B

CN110263230B - Data cleaning method and device based on density clustering

Info

Publication number: CN110263230B
Application number: CN201910341078.2A
Authority: CN
Inventors: 许海涛; 张晓鹏; 周贤伟; 林福宏; 吕兴; 安建伟
Original assignee: University of Science and Technology Beijing USTB
Current assignee: University of Science and Technology Beijing USTB
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2021-04-06
Anticipated expiration: 2039-04-25
Also published as: CN110263230A

Abstract

The invention provides a data cleaning method and device based on density clustering, which can improve the accuracy of cleaning results. The method comprises: acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise numerical data and character data; determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; estimating eps and minPts of the DBSCAN algorithm from the determined distances between the samples; performing density clustering on the data set to be cleaned with the estimated eps and minPts values as the DBSCAN parameter values; and cleaning the data in the data set according to the clustering result. The invention relates to the field of data mining.

Description

Data cleaning method and device based on density clustering

Technical Field

The invention relates to the field of data mining, in particular to a data cleaning method and device based on density clustering.

Background

Now that the age of information explosion has arrived, data has become an important force driving the development of industry. Enterprises can extract a large amount of useful information from the wealth hidden in data, use it to support decisions in business management, market analysis, scientific exploration and other areas, and thereby promote their own development. However, real-world data is often complicated: it mixes data of different structures with different kinds of dirty data, such as erroneous data, invalid data, missing data and duplicate data, which greatly increases the difficulty of data analysis.

Machine learning methods are widely applied in the field of data cleaning, where their core purpose is to cluster data sets. Clustering analysis, also known as cluster analysis, is a statistical method for studying sample classification. The purpose of clustering is to make the similarity between objects of the same class as large as possible and the similarity between objects of different classes as small as possible. The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is a classic clustering algorithm. Its final clustering result depends on the selection of the eps and minPts parameter values, where eps represents the scanning radius and minPts represents the minimum number of contained points; if these values are not selected properly, the clustering result is poor, and wrong clusters may even occur.

In the prior art, the parameters eps and minPts of the DBSCAN algorithm are generally set manually according to experience, so that the accuracy of the clustering result is low.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a data cleaning method and device based on density clustering, so as to solve the problem of low accuracy of clustering results caused by manually setting parameters eps and minPts of a DBSCAN algorithm according to experience in the prior art.

In order to solve the above technical problem, an embodiment of the present invention provides a data cleaning method based on density clustering, including:

acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;

determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;

estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set, wherein eps represents the scanning radius, minPts represents the minimum number of contained points, and DBSCAN stands for density-based spatial clustering of applications with noise;

taking the estimated eps and minPts values as DBSCAN parameter values, and carrying out density clustering on the data set to be cleaned;

and cleaning the data in the data set to be cleaned according to the clustering result.

Further, the distance between samples is expressed as:

where dist(X, Y) represents the distance between samples X and Y; w_k represents the weight of the kth attribute of the sample; the term in z(x_nk) and z(y_nk) denotes the standardized Euclidean distance between the sample attributes when the kth attribute value is numerical; sim(x_sk, y_sk) denotes the edit-distance-based string similarity of the sample attributes when the kth attribute value is character-type; n denotes the number of attributes contained in a sample; ε_k indicates the missing state of the kth attribute; x_nk and y_nk denote the values of the kth attribute of X and Y when it is numerical; x_sk and y_sk denote the values of the kth attribute of X and Y when it is character-type; X and Y denote two samples of the data set; and z(x_nk) and z(y_nk) denote the standardized x_nk and y_nk, respectively.

Further, ε_k is expressed as:

Further, z(x_nk) is expressed as:

z(x_nk) = (x_nk − u) / σ

where u is the mean of the kth attribute over all samples in the data set and σ is the standard deviation of the kth attribute over all samples in the data set.

Further, the character string similarity based on the edit distance is expressed as:

where sim (S, T) represents the similarity between strings S and T based on the edit distance, S and T represent character-type attribute values of 2 samples in the data set, m, d represent the number of characters in strings S and T, respectively, and ld represents the minimum number of edit operations required to change string S to string T.

Further, the estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set includes:

constructing a data set sample distance matrix according to the determined distance between the samples in the data set;

sorting each row of the sample distance matrix in ascending order, so that the K-th column of the sorted matrix represents the distance between each sample point and its K-th nearest neighbor sample point; calculating the mean of each column of the sorted matrix to obtain a K-average nearest neighbor distance matrix; and taking the K-average nearest neighbor distance matrix as the candidate values of eps;

and determining the number of samples contained in the eps neighborhood of each sample point according to the obtained eps value, and calculating the average value of the number of samples contained in all the sample points as a minPts value.

Further, eps is expressed as:

wherein D_eps is the set of candidate values for the eps parameter, and each element of D_eps is the mean of the K-th column of the sorted sample distance matrix D_N×N, where N represents the number of samples in the data set;

minPts is expressed as:

minPts = (1/N) Σ_{i=1}^{N} P_i

wherein P_i is the number of samples contained in the eps neighborhood of the ith sample.

Further, the density clustering of the data set to be cleaned by using the estimated eps and minPts values as DBSCAN parameter values includes:

sequentially selecting the eps and minPts values obtained for different K values, substituting them into the DBSCAN algorithm, and performing density clustering on the data set to be cleaned; if the number of clusters remains unchanged 3 consecutive times, judging that the clustering has stabilized and selecting the current eps and minPts as the optimal values of the DBSCAN parameters eps and minPts;

and performing density clustering on the data set to be cleaned according to the obtained optimal values of eps and minPts.

Further, the cleaning the data in the data set to be cleaned according to the clustering result includes:

if the missing attribute value is numerical data, filling the attribute mean value of the cluster where the corresponding sample is located into the missing data;

if the missing attribute value is character data, acquiring the attribute value with the highest frequency of occurrence of the cluster where the corresponding sample is located as the missing data;

if the distance between 2 samples in the cluster is larger than a preset distance threshold, judging that the 2 samples are duplicate data, and merging the duplicate data.
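The two filling rules above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the function name `fill_missing`, the use of `None` to mark a missing value, and the choice to leave a gap unfilled when the cluster has no value to borrow are all assumptions made for the sketch, and the handling of noise samples is left out.

```python
from collections import Counter


def fill_missing(rows, labels, kinds):
    """Return copies of rows with None entries filled from the sample's own cluster."""
    filled = [list(r) for r in rows]
    for k, kind in enumerate(kinds):
        for i, row in enumerate(rows):
            if row[k] is not None:
                continue
            # Values of attribute k observed in the same cluster as sample i.
            peers = [r[k] for r, lab in zip(rows, labels)
                     if lab == labels[i] and r[k] is not None]
            if not peers:
                continue  # assumption: nothing to borrow from, leave the gap
            if kind == "numeric":
                filled[i][k] = sum(peers) / len(peers)  # cluster mean
            else:
                filled[i][k] = Counter(peers).most_common(1)[0][0]  # most frequent value
    return filled


rows = [
    [1.0, "ok"],
    [3.0, "ok"],
    [None, "fail"],
    [2.0, None],
    [10.0, "down"],
]
labels = [0, 0, 0, 0, 1]          # hypothetical cluster label per sample
kinds = ["numeric", "character"]  # attribute type per column
cleaned = fill_missing(rows, labels, kinds)
```

The input rows are copied rather than modified, so the original data set remains available for comparison after cleaning.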

An embodiment of the present invention further provides a data cleaning apparatus based on density clustering, including:

the acquisition module is used for acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;

the determining module is used for determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;

the estimation module is used for estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set, wherein eps represents the scanning radius, minPts represents the minimum number of contained points, and DBSCAN stands for density-based spatial clustering of applications with noise;

the clustering module is used for performing density clustering on the data set to be cleaned by taking the estimated eps and minPts values as DBSCAN parameter values;

and the cleaning module is used for cleaning the data in the data set to be cleaned according to the clustering result.

The technical scheme of the invention has the following beneficial effects:

In the above scheme, a data set to be cleaned is acquired, wherein the attribute values of the samples in the data set include numerical data and character data; the distance between samples in the data set is determined by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; eps and minPts of the DBSCAN algorithm are estimated from the determined distances between the samples, so that the eps and minPts values are obtained adaptively; density clustering is performed on the numerical data and character data in the data set to be cleaned with the estimated eps and minPts values as the DBSCAN parameter values; and the data in the data set is cleaned according to the clustering result, so that dirty data of different data types is cleaned. In this way, the data set to be cleaned is clustered by a DBSCAN algorithm whose eps and minPts are obtained adaptively, dirty data of different data types is cleaned according to the clustering result, and the accuracy of the cleaning result can be improved.

Drawings

FIG. 1 is a schematic flow chart of a data cleaning method based on density clustering according to an embodiment of the present invention;

FIG. 2 is a detailed flowchart of a data cleaning method based on density clustering according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a data cleaning apparatus based on density clustering according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The invention provides a data cleaning method and device based on density clustering, aiming at the problem of low accuracy of clustering results caused by manually setting parameters eps and minPts of a DBSCAN algorithm according to experience.

Example one

As shown in fig. 1, the data cleaning method based on density clustering provided by the embodiment of the present invention includes:

S101, acquiring a data set to be cleaned, wherein the attribute values of the samples in the data set comprise: numerical data and character data;

S102, determining the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;

S103, estimating eps and minPts of the DBSCAN algorithm according to the determined distance between the samples in the data set, wherein eps represents the scanning radius and minPts represents the minimum number of contained points;

S104, performing density clustering on the data set to be cleaned by taking the estimated eps and minPts values as the DBSCAN parameter values;

S105, cleaning the data in the data set to be cleaned according to the clustering result.

The data cleaning method based on density clustering in the embodiment of the invention acquires a data set to be cleaned, wherein the attribute values of the samples in the data set comprise numerical data and character data; determines the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; estimates eps and minPts of the DBSCAN algorithm from the determined distances between the samples, so that the eps and minPts values are obtained adaptively; performs density clustering on the numerical data and character data in the data set to be cleaned with the estimated eps and minPts values as the DBSCAN parameter values; and cleans the data in the data set according to the clustering result, so that dirty data of different data types is cleaned. In this way, the data set to be cleaned is clustered by a DBSCAN algorithm whose eps and minPts are obtained adaptively, dirty data of different data types is cleaned according to the clustering result, and the accuracy of the cleaning result can be improved.

In this embodiment, after the data set to be cleaned is acquired (S101), invalid or irrelevant data whose number of missing attributes exceeds a preset value is removed, the type of each attribute is analyzed, and it is determined whether the values of each attribute are numerical data or character data.
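A minimal Python sketch of this preprocessing step is given here for illustration. The function names, the use of `None` for a missing value, and the rule that an attribute is numerical when every present value parses as a float are assumptions of the sketch rather than details stated in the embodiment.

```python
def drop_sparse_rows(rows, max_missing):
    """Keep only samples whose number of missing attributes is at most max_missing."""
    return [r for r in rows if sum(v is None for v in r) <= max_missing]


def is_number(text):
    """Return True if the raw attribute text parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def infer_types(rows):
    """Label each attribute 'numeric' if all present values parse as floats, else 'character'."""
    n_cols = len(rows[0])
    types = []
    for k in range(n_cols):
        present = [r[k] for r in rows if r[k] is not None]
        numeric = bool(present) and all(is_number(v) for v in present)
        types.append("numeric" if numeric else "character")
    return types


rows = [
    ["3.5", "login", "ok"],
    ["4.0", None, "ok"],
    [None, None, None],       # too many missing attributes, will be dropped
    ["2.5", "logout", None],
]
kept = drop_sparse_rows(rows, max_missing=1)  # max_missing plays the role of the preset value
types = infer_types(kept)
```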

In this embodiment, in S102, a standardized Euclidean distance and an edit-distance-based string similarity calculation are applied to the numerical data and the character data of the samples, respectively; different attributes are given different weights, and the two distance calculations are fused to obtain the distance between the samples in the data set. Specifically:

Assume that the data set is D and includes N samples, each having n attribute values, so that a sample X in the data set can be represented as (x₁, x₂, …, x_n). Let ε_k indicate the missing state of the kth attribute; ε_k is expressed as:

The attribute value x_k of the kth attribute of sample X is expressed as:

wherein x_nk denotes the value of the kth attribute of X when it is numerical, x_sk denotes the value of the kth attribute of X when it is character-type, the subscript n stands for number, and the subscript s stands for string.

In this embodiment, Z-Score normalization is performed on the numerical attribute data of all samples in the data set, and the normalization formula is:

z(x_nk) = (x_nk − u) / σ

wherein z(x_nk) denotes the normalized x_nk, u is the mean of the kth attribute over all samples in the data set, and σ is the standard deviation of the kth attribute over all samples in the data set.
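As a small worked illustration of Z-Score normalization for one numerical attribute, a Python sketch follows. The use of the population standard deviation, the exclusion of missing values (`None`) from u and σ, and the mapping of a constant column to 0 are assumptions made for the sketch.

```python
import math


def z_scores(values):
    """Standardize one numerical attribute column; None (missing) entries pass through."""
    present = [v for v in values if v is not None]
    u = sum(present) / len(present)                       # mean of the attribute
    sigma = math.sqrt(sum((v - u) ** 2 for v in present) / len(present))
    if sigma == 0:
        # Assumption: a constant attribute carries no information, map it to 0.
        return [None if v is None else 0.0 for v in values]
    return [None if v is None else (v - u) / sigma for v in values]


column = [2.0, 4.0, None, 6.0]
z = z_scores(column)
```

For this column u = 4 and σ = sqrt(8/3), so the first and last values map to symmetric scores around 0 and the missing entry stays missing.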

Let Y denote another sample in the data set D; the distance between samples X and Y is expressed as:

In this embodiment, because a sample contains two different kinds of attributes, numerical and character-type, the Euclidean distance and the edit distance are used to compute the distance for each kind, respectively.
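As an illustration only, the fusion of the two distance types might be sketched in Python as follows. The combination rule used here (square root of the weighted sum of squared per-attribute distances, skipping any attribute that is missing in either sample) is an assumption made for this sketch, as are the names `mixed_distance`, `string_distance` and `exact_match_distance`. The character-type term is taken to be a distance in [0, 1], for example one minus a string similarity.

```python
import math


def mixed_distance(x, y, kinds, weights, string_distance):
    """Weighted distance over numerical and character-type attributes (hypothetical form)."""
    total = 0.0
    for xk, yk, kind, w in zip(x, y, kinds, weights):
        if xk is None or yk is None:
            continue  # missing attribute: treated as epsilon_k = 0 and skipped
        if kind == "numeric":
            d = xk - yk                   # values assumed already Z-score normalized
        else:
            d = string_distance(xk, yk)   # assumed to lie in [0, 1]
        total += w * d * d
    return math.sqrt(total)


def exact_match_distance(s, t):
    """Placeholder string distance: 0 for identical strings, 1 otherwise."""
    return 0.0 if s == t else 1.0


kinds = ["numeric", "numeric", "character"]
weights = [1.0, 1.0, 1.0]
a = [0.0, 1.0, "GET"]
b = [3.0, None, "POST"]
d_ab = mixed_distance(a, b, kinds, weights, exact_match_distance)
```

Passing the string distance in as a parameter keeps the fusion step independent of how the character-type distance is computed.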

In this embodiment, the similarity of character strings is determined from the edit distance. The edit distance, also called the Levenshtein distance, is the minimum number of edit operations required to change one character string into another; the larger the distance, the more the two strings differ. The permitted edit operations are replacing one character with another, inserting one character, and deleting one character.

In this embodiment, let S and T denote character-type attribute values of 2 samples in the data set, where S = s₁…s_i…s_m and T = t₁…t_j…t_d; s_i and t_j denote characters in S and T, and m and d denote the number of characters in S and T, respectively. For the strings S and T, a relationship matrix LD of size (m+1) × (d+1) is established:

LD_(m+1)×(d+1) = {d_ij} (0 ≤ i ≤ m, 0 ≤ j ≤ d)

wherein the matrix element d_ij is expressed as:

and a_ij indicates whether an insertion, replacement or deletion operation needs to be performed:

The final element d_md is the Levenshtein distance between the two character strings, abbreviated as the ld distance, where ld is expressed as:

ld＝d_md

where ld represents the minimum number of editing operations required to change the character string S to the character string T.

In this embodiment, the similarity between character strings based on the edit distance may be expressed as:
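A minimal Python sketch of the edit-distance matrix, and of one common way to turn it into a similarity, follows. The recurrence is the standard Levenshtein one; the normalization sim = 1 − ld / max(m, d) is an assumption chosen for illustration and may differ from the expression intended above.

```python
def levenshtein(s, t):
    """Fill the (m+1) x (d+1) matrix and return its final element d_md."""
    m, d = len(s), len(t)
    ld = [[0] * (d + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        ld[i][0] = i              # deleting i characters turns a prefix of s into ""
    for j in range(d + 1):
        ld[0][j] = j              # inserting j characters turns "" into a prefix of t
    for i in range(1, m + 1):
        for j in range(1, d + 1):
            a_ij = 0 if s[i - 1] == t[j - 1] else 1   # 1 when a replacement is needed
            ld[i][j] = min(ld[i - 1][j] + 1,          # deletion
                           ld[i][j - 1] + 1,          # insertion
                           ld[i - 1][j - 1] + a_ij)   # match or replacement
    return ld[m][d]


def similarity(s, t):
    """Assumed normalization: 1 - ld / max(m, d), with two empty strings fully similar."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


ld_value = levenshtein("kitten", "sitting")
sim_value = similarity("kitten", "sitting")
```

Under this normalization identical strings have similarity 1, and strings with no characters in common positions approach 0.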

In an embodiment of the foregoing density clustering-based data cleansing method, further, the estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set includes:

constructing a data set sample distance matrix D_N×N according to the determined distance between the samples in the data set, wherein D_N×N is expressed as:

D_N×N = {dist(X_i, X_j) | 1 ≤ i ≤ N, 1 ≤ j ≤ N}

wherein dist(X_i, X_j) denotes the distance between the ith sample and the jth sample, and N denotes the number of samples in the data set;

sorting each row of the sample distance matrix D_N×N in ascending order, wherein the 0th column of the sorted matrix holds the distance between each sample point and itself, which is 0, and the K-th column holds the distance between each sample point and its K-th nearest neighbor; and calculating the mean of each column of the sorted matrix to obtain the final K-average nearest neighbor distance matrix, which is used as the candidate values of eps,

wherein D_eps is the set of candidate values for the eps parameter;

determining the number of samples contained in the eps neighborhood of each sample point according to the obtained eps value, and calculating the average of the numbers of samples contained over all sample points as the minPts value, wherein minPts is expressed as:

minPts = (1/N) Σ_{i=1}^{N} P_i

wherein P_i is the number of samples contained in the eps neighborhood of the ith sample.
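The estimation steps above can be sketched in pure Python as follows. The function names are hypothetical, and two details are assumptions of the sketch: a sample is counted as lying in its own eps neighborhood, and the averaged minPts value is returned unrounded.

```python
def eps_candidates(dist):
    """Mean of the K-th column of the row-sorted distance matrix, for K = 1 .. N-1."""
    n = len(dist)
    sorted_rows = [sorted(row) for row in dist]   # column 0 is each point's distance to itself
    return [sum(row[k] for row in sorted_rows) / n for k in range(1, n)]


def min_pts_for(dist, eps):
    """Average number of samples inside the eps neighborhood of each sample."""
    n = len(dist)
    counts = [sum(1 for v in row if v <= eps) for row in dist]  # P_i, the point itself included
    return sum(counts) / n


# Four points on a line at positions 0, 1, 2 and 10.
positions = [0, 1, 2, 10]
dist = [[abs(p - q) for q in positions] for p in positions]
candidates = eps_candidates(dist)
min_pts = min_pts_for(dist, candidates[0])
```

For this toy matrix the first candidate is the mean distance to the nearest neighbor, (1 + 1 + 1 + 8) / 4 = 2.75, and three of the four points have three samples within that radius.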

In this embodiment, the two parameters of the DBSCAN algorithm, the scanning radius eps and the minimum number of contained points minPts, are estimated from a statistical characteristic (the mean) of the nearest neighbor distances of the sample points in the data set, where eps is the distance threshold that defines the eps neighborhood and minPts is the minimum number of samples that the eps neighborhood of a sample point must contain for that point to be a core sample point.

In this embodiment, clustering results on a large number of data sets indicate that, around the optimal clustering result, the clustering obtained from eps and minPts tends to be stable. The eps and minPts values obtained for different K values are therefore selected in sequence and substituted into the DBSCAN algorithm, density clustering is performed on the data set to be cleaned, and the number of clusters in the final clustering result is recorded. If the number of clusters remains unchanged 3 consecutive times, the clustering is considered to have stabilized, and the eps and minPts at that point are selected as the optimal values of the DBSCAN parameters eps and minPts. Density clustering is then performed on the data set to be cleaned according to the obtained optimal values of eps and minPts, as shown in FIG. 2.
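The stopping rule can be sketched in Python as follows. The sketch reads the rule as "three consecutive candidates yield the same number of clusters", which is one interpretation of the text; the function name and the `count_clusters` callback (which would run DBSCAN and return the number of clusters) are hypothetical, and a lookup table stands in for real clustering runs in the example.

```python
def select_stable_parameters(candidates, count_clusters, runs=3):
    """Return the first (eps, min_pts) at which the cluster count has repeated `runs` times in a row."""
    previous = None
    streak = 0
    for eps, min_pts in candidates:          # candidates are ordered by increasing K
        n_clusters = count_clusters(eps, min_pts)
        if n_clusters == previous:
            streak += 1
        else:
            previous, streak = n_clusters, 1
        if streak == runs:
            return eps, min_pts
    return None                              # the cluster count never stabilized


# Stand-in for real clustering results: cluster count observed for each eps.
fake_counts = {0.1: 5, 0.2: 4, 0.3: 3, 0.4: 3, 0.5: 3, 0.6: 2}
candidates = [(0.1, 2), (0.2, 3), (0.3, 4), (0.4, 5), (0.5, 6), (0.6, 7)]
chosen = select_stable_parameters(candidates, lambda eps, min_pts: fake_counts[eps])
```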

In this embodiment, DBSCAN is a classic density-based clustering algorithm. Unlike partitioning and hierarchical clustering methods, it defines a cluster as a maximal set of density-connected points, can partition regions of sufficiently high density into clusters, and can find clusters of arbitrary shape in a spatial database with noise. DBSCAN-based clustering is defined as follows: if the number of samples in the eps neighborhood of a sample point is not less than minPts, the sample point is a core sample point; given a data set D, if sample p is within the eps neighborhood of q and q is a core sample point, sample p is directly density-reachable from sample q; if there is a chain of samples p₁, p₂, …, p_i, …, p_n with p₁ = p and p_n = q such that each p_i is directly density-reachable from p_(i+1) with respect to eps and minPts, sample p is density-reachable from sample q; if there is a sample point o from which both p and q are density-reachable, sample points p and q are density-connected; if a sample is not density-reachable from any core sample point, it is regarded as a noise sample; and all density-connected sample points are grouped into one cluster.
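These definitions can be turned into a compact pure-Python sketch that works directly on a precomputed distance matrix. It is an illustrative implementation of standard DBSCAN rather than the embodiment's own code; the function name, the label convention (-1 for noise) and the inclusion of a point in its own neighborhood are assumptions of the sketch.

```python
def dbscan(dist, eps, min_pts):
    """Return one label per sample: 0, 1, ... for clusters and -1 for noise."""
    n = len(dist)
    neighbors = [[j for j in range(n) if dist[i][j] <= eps] for i in range(n)]
    core = [len(nb) >= min_pts for nb in neighbors]   # core sample points
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None or not core[i]:
            continue
        cluster += 1                  # start a new cluster from an unlabeled core point
        labels[i] = cluster
        frontier = [i]
        while frontier:
            p = frontier.pop()        # p is always a core point
            for q in neighbors[p]:    # q is directly density-reachable from p
                if labels[q] is None:
                    labels[q] = cluster
                    if core[q]:
                        frontier.append(q)   # only core points keep expanding the cluster
    return [-1 if lab is None else lab for lab in labels]


# Two dense groups on a line and one isolated point.
positions = [0, 1, 2, 10, 11, 12, 50]
dist = [[abs(p - q) for q in positions] for p in positions]
labels = dbscan(dist, eps=1.5, min_pts=2)
```

Working on a distance matrix rather than on coordinates means the same routine applies to any sample distance, including one that mixes numerical and character-type attributes.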

In this embodiment, according to the obtained optimal values of eps and minPts, density clustering is performed on the data set to be cleaned to obtain a clustering result, and according to the clustering result, the data in the data set to be cleaned is cleaned to obtain clean data, specifically:

The data cleaning method based on density clustering can be used for solving the problem of cleaning different data types.

The data cleaning method based on density clustering can be applied to the field of network security logs, so that log data can be analyzed without prior knowledge, the burden of manual analysis is greatly reduced, fault locations can be found quickly, the system load is reduced, and the operating efficiency of the system is improved. That is, the data set to be cleaned may be a network security log data set, and the main steps of cleaning the network security log data set comprise:

preprocessing the network security log data set by removing, according to service requirements, attributes that are irrelevant to the analysis, such as IP addresses and times; and determining the attribute type of each attribute in the log data set, wherein the attribute types comprise: numerical and character types;

determining the distance between samples in the log data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;

estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the log data set, wherein eps represents the scanning radius and minPts represents the minimum number of contained points;

taking the estimated eps and minPts values as the DBSCAN parameter values, performing density clustering on the log data set to be cleaned, and gathering similar logs into the same cluster;

according to the clustering result, merging the duplicate log data in the same cluster, which facilitates manual review and analysis and allows the fault location to be determined quickly, and filling the log attributes that are missing because of system faults, which improves the accuracy of log analysis.

Example two

The present invention also provides a specific embodiment of a data cleaning apparatus based on density clustering, which corresponds to the specific embodiment of the data cleaning method based on density clustering provided by the present invention, and the data cleaning apparatus based on density clustering can achieve the purpose of the present invention by executing the flow steps in the specific embodiment of the method, so the explanation in the specific embodiment of the data cleaning method based on density clustering is also applicable to the specific embodiment of the data cleaning apparatus based on density clustering provided by the present invention, and will not be described again in the following specific embodiment of the present invention.

As shown in fig. 3, an embodiment of the present invention further provides a data cleaning apparatus based on density clustering, including:

an obtaining module 11, configured to obtain a data set to be cleaned, where an attribute value of a sample in the data set includes: numerical data and character data;

a determining module 12, configured to determine the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data;

an estimation module 13, configured to estimate eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set, where eps represents the scanning radius, minPts represents the minimum number of contained points, and DBSCAN stands for density-based spatial clustering of applications with noise;

the clustering module 14 is configured to perform density clustering on the data set to be cleaned by using the estimated eps and minPts values as DBSCAN parameter values;

and the cleaning module 15 is used for cleaning the data in the data set to be cleaned according to the clustering result.

The data cleaning device based on density clustering in the embodiment of the invention acquires a data set to be cleaned, wherein the attribute values of the samples in the data set comprise numerical data and character data; determines the distance between samples in the data set by applying a standardized Euclidean distance to the numerical data and an edit-distance-based string similarity algorithm to the character data; estimates eps and minPts of the DBSCAN algorithm from the determined distances between the samples, so that the eps and minPts values are obtained adaptively; performs density clustering on the numerical data and character data in the data set to be cleaned with the estimated eps and minPts values as the DBSCAN parameter values; and cleans the data in the data set according to the clustering result, so that dirty data of different data types is cleaned. In this way, the data set to be cleaned is clustered by a DBSCAN algorithm whose eps and minPts are obtained adaptively, dirty data of different data types is cleaned according to the clustering result, and the accuracy of the cleaning result can be improved.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A data cleaning method based on density clustering is characterized by comprising the following steps:

2. The data cleaning method based on density clustering of claim 1, wherein the distance between samples is expressed as:

3. The data cleaning method based on density clustering of claim 2, wherein ε_k is expressed as:

4. The data cleaning method based on density clustering of claim 2, wherein z(x_nk) is expressed as:

5. The data cleaning method based on density clustering of claim 1, wherein the character string similarity based on edit distance is expressed as:

6. The method of claim 1, wherein estimating eps and minPts of the DBSCAN algorithm according to the determined distance between samples in the data set comprises:

wherein D_eps is the set of candidate values for the eps parameter,

minPts is represented as:

8. The data cleaning method based on density clustering according to claim 1, wherein the density clustering of the data set to be cleaned by taking the estimated eps and minPts values as DBSCAN parameter values comprises:

9. The data cleaning method based on density clustering according to claim 1, wherein the cleaning the data in the data set to be cleaned according to the clustering result comprises:

if the distance between the 2 samples in the cluster is larger than a preset distance threshold, judging the 2 samples as the repeated data, and merging the repeated data.

10. A data cleaning device based on density clustering is characterized by comprising: