CN112269854B - Large-scale data similarity characteristic detection method based on inverted index

Info

Publication number: CN112269854B
Application number: CN202011299602.3A
Authority: CN
Inventors: 钱晨; 张顾洪
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2020-11-18
Filing date: 2020-11-18
Publication date: 2022-06-10
Anticipated expiration: 2040-11-18
Also published as: CN112269854A

Abstract

The invention discloses a large-scale data similar-feature detection method based on inverted indexes. The method samples the data column of each feature according to its type, extracts the corresponding inverted index, and builds a hash table whose key-value pairs are (inverted index, feature) to generate candidate feature subsets, thereby reducing the dimensionality of the feature set. The features in each reduced subset are then combined pairwise, and the Pearson correlation coefficient and a non-duplicate counting method are applied to numerical and categorical features respectively to obtain the correlation of each feature pair; a threshold is set and the result is output. The method removes the need to combine all features of the original feature set pairwise, can reduce computation time by about an order of magnitude, and saves substantial resources, while keeping accuracy and recall at a very high level.

Description

Large-scale data similarity characteristic detection method based on inverted index

Technical Field

The invention belongs to the field of machine learning and data mining, relates to a method for detecting feature similarity in big data feature engineering, and particularly relates to a large-scale data similarity feature detection method based on inverted indexes.

Background

Feature similarity detection is a crucial step in data mining and a necessary part of training machine learning models. An original data set often contains a large number of similar features. During model training these dilute feature importance, which distorts feature selection and degrades model performance; they also add unnecessary computation and waste resources.

Most current mainstream feature similarity detection methods must traverse all features and combine them pairwise for analysis. When the original feature set is large, the set of combined feature pairs is also very large, so these methods perform poorly on large-scale data sets. Alternatively, locality-sensitive hashing (LSH) is used to reduce dimensionality before similarity is analyzed. Its drawback is that, although the dimensionality of the data set is reduced, it can currently be applied only to categorical features (including numerical features after one-hot encoding) and cannot be applied directly to numerical features.

Disclosure of Invention

The invention aims to address the deficiencies of the prior art by providing a large-scale data similar-feature detection method based on an inverted index. The method reduces the dimensionality of the feature set and thereby greatly reduces computation time, applies to both categorical and numerical features, and maintains very high accuracy and recall.

The purpose of the invention is realized by the following technical scheme:

a large-scale data similarity feature detection method based on inverted indexes comprises the following steps:

1. For a tabular data set in a relational database, the features of the data set are the fields of the table. All features of the data set constitute the original feature set. For each feature in the original feature set: first perform column sampling on the data corresponding to the feature and construct an inverted index, then use the (inverted index, feature) pair as a key-value pair of a hash table;

2. Traverse the hash table and extract all distinct features that belong to the same key as a feature subset. For each feature subset, combine the features it contains pairwise into feature pairs; because the two features in each pair have the same inverted index, they have a higher probability of being a pair of similar features. All feature pairs constitute the candidate feature set;

3. For each feature pair in the candidate feature set, apply the corresponding similarity measurement function with a set threshold, obtain the similarity measurement result, and add it to the result set.

Features can be divided into numerical features and categorical features according to their data attributes. Therefore, for a given data set, the original features are first split into a numerical original feature set and a categorical original feature set, and steps 1 to 3 above are performed on each to obtain a result set for the numerical feature set and a result set for the categorical feature set.

In the above technical solution, the column sampling and inverted index construction described in step 1 are specifically as follows:

1) perform random column sampling on the data corresponding to the feature to obtain a sampled data column;

2) construct the inverted index as follows: for numerical features, compute the mean of the sampled data column, map values greater than the mean to 1, values smaller than the mean to -1, and the remaining values to 0; the mapped sampled data column is the inverted index of the numerical feature. For categorical features, traverse the sampled data column in order, map the first category value encountered to 1, the second to 2, and so on in increasing order; the mapped sampled data column is the inverted index of the categorical feature.
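The random column sampling in 1) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `sample_positions` and `sample_column` are hypothetical, and the sketch assumes that the same randomly chosen row positions are reused for every feature so that the resulting inverted indexes are comparable, which the text implies but does not state. The sample length of 20 mirrors the default given later in the detailed description.

```python
import random

def sample_positions(n_rows, sample_len=20, seed=0):
    """Choose the row positions once (assumption: shared by all features)."""
    rng = random.Random(seed)
    k = min(sample_len, n_rows)
    return sorted(rng.sample(range(n_rows), k))

def sample_column(column, positions):
    """Return the values of one feature column at the chosen row positions."""
    return [column[i] for i in positions]

# Two hypothetical feature columns with 1000 rows each.
col_a = [i * 0.5 for i in range(1000)]
col_b = [i * 2.0 + 1.0 for i in range(1000)]

positions = sample_positions(n_rows=1000, sample_len=20, seed=42)
sampled_a = sample_column(col_a, positions)
sampled_b = sample_column(col_b, positions)
```

Fixing the positions up front keeps the per-feature work proportional to the sample length rather than to the full column length.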

Further, in step 3, the similarity measurement function and threshold applied to each feature pair, and the addition to the result set, are specifically as follows:

1) for numerical feature pairs, the Pearson correlation coefficient is applied to the full data columns to measure similarity; when the absolute value of the Pearson correlation coefficient between the data columns of the two features in a pair is greater than the set threshold, the pair is marked as a similar feature pair and added to the result set;

2) for categorical feature pairs, a non-duplicate counting method is applied to the full data columns, i.e., counting the number of values in the original data column after duplicate values are removed. Suppose the non-duplicate counts of the two features F1 and F2 in the pair under test are C1 and C2, and the joint non-duplicate count of F1 and F2 is C3; when C1 = C2 = C3, the pair is marked as a similar feature pair and added to the result set.

Compared with the prior art, the invention has the following beneficial effects:

1) the invention provides a method for building feature subsets through an inverted index, which places features that may be similar into the same subset, thereby greatly reducing the number of feature pairs that must be measured and analyzed;

2) by analyzing the relationship between feature distribution and the similarity measurement function, the invention provides a new inverted index design that maps a data distribution of millions of values to a very small inverted index, reducing data dimensionality by several orders of magnitude while preserving the uniqueness and accuracy of the mapping;

3) for numerical and categorical features, the invention applies the corresponding similarity measurement functions, namely the Pearson correlation coefficient and the non-duplicate counting method, and optimizes them in the engineering implementation, greatly improving computational efficiency.

Drawings

FIG. 1 is a schematic diagram of the overall scheme design of a large-scale data similarity feature detection method based on inverted indexes;

FIG. 2 is a schematic diagram illustrating an example of a method for generating a candidate feature set in a large-scale data similarity feature detection method based on inverted indexes.

Detailed Description

As shown in fig. 1, the method for detecting large-scale data similar features based on inverted index includes: candidate set generation, similarity measurement and result set integration processes.

First, column sampling is performed on the data corresponding to every feature in the original feature set, and an inverted index is constructed. Features with the same inverted index are put into the same feature subset, and all features in each subset are combined pairwise into feature pairs, which are added to the candidate feature set. By analyzing the relationship between feature distribution and the similarity measurement function, the invention provides a new inverted index design:

For a numerical feature pair (let the features be X and Y), the similarity of X and Y is measured using the Pearson correlation coefficient, which is defined as:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where $\mu_X$ and $\mu_Y$ are the expected values and $\sigma_X$ and $\sigma_Y$ the standard deviations of X and Y.

The Pearson correlation coefficient measures the degree of linear correlation between two features. The coefficient $\rho_{X,Y}$ between X and Y depends directly on the covariance $\operatorname{cov}(X, Y)$ between them. From the covariance formula $E[(X - \mu_X)(Y - \mu_Y)]$ it can be seen that when X and Y vary in the same or opposite directions, the absolute value of the covariance tends toward its maximum, and after normalization the Pearson correlation coefficient tends toward 1 or -1, indicating that X and Y are highly positively or negatively correlated. The variation trend of X and Y can in turn be inferred from the relation between the sample points and the expected value μ.
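The relationship between covariance and the Pearson correlation coefficient described above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function name `pearson` and the choice to return 0.0 for a constant column are assumptions made for illustration and are not taken from the patent.

```python
import math

def pearson(x, y):
    """Population Pearson correlation: cov(X, Y) / (sigma_X * sigma_Y)."""
    n = len(x)
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

same_direction = pearson([1, 2, 3, 4], [2, 4, 6, 8])      # moves together
opposite_direction = pearson([1, 2, 3, 4], [8, 6, 4, 2])  # moves oppositely
```

Columns that move together give a coefficient close to 1, and columns that move in opposite directions give a coefficient close to -1, which is why the method compares the absolute value against the threshold.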

Therefore, the inverted index for a numerical feature data column is designed as follows:

1) randomly sample the original feature column, with a default sample length of 20 sample points, to obtain a sampled sequence;

2) compute the mean μ_sampled of the sampled sequence;

3) map every sample point s in the sampled sequence as follows:

$$f(s) = \begin{cases} 1, & s > \mu_{sampled} \\ -1, & s < \mu_{sampled} \\ 0, & \text{otherwise} \end{cases}$$

4) the mapped sequence is the inverted index.
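The numerical mapping in steps 2) to 4) can be sketched as below. The function name `numeric_inverted_index` is hypothetical, and returning a tuple (so the result can serve as a hash-table key) is an implementation choice assumed here rather than something the patent prescribes.

```python
def numeric_inverted_index(sampled):
    """Map each sample to 1, -1 or 0 relative to the mean of the sample."""
    mu = sum(sampled) / len(sampled)
    return tuple(1 if s > mu else -1 if s < mu else 0 for s in sampled)

# Two hypothetical sampled sequences that are linearly related (y = 2x + 1).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2 * v + 1 for v in x]

index_x = numeric_inverted_index(x)
index_y = numeric_inverted_index(y)
```

Both sequences map to the same index, `(-1, -1, 0, 1, 1)`, so the two features would fall under the same key.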

For categorical feature pairs, the inverted index is extracted by remapping category values.

The inverted index for a categorical feature data column is designed as follows:

1) randomly sample the original feature column to obtain a sampled sequence;

2) traverse all sample points s in the sampled sequence in order and map them as follows:

if s is the 1st category value in the sampled sequence, it is mapped to 1;

if s is the 2nd category value in the sampled sequence, it is mapped to 2;

……

if s is the nth category value in the sampled sequence, it is mapped to n;

3) the mapped sequence is the inverted index.
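The category value remapping can be sketched as follows. The function name `categorical_inverted_index` is hypothetical; the sketch numbers category values by order of first appearance in the sampled sequence, which is how the steps above are read here.

```python
def categorical_inverted_index(sampled):
    """Replace each category value by the rank of its first appearance."""
    codes = {}
    index = []
    for s in sampled:
        if s not in codes:
            codes[s] = len(codes) + 1  # next unused code: 1, 2, 3, ...
        index.append(codes[s])
    return tuple(index)

# Two hypothetical sampled sequences whose values correspond one-to-one.
colors = ["red", "blue", "red", "green"]
labels = ["A", "B", "A", "C"]

index_colors = categorical_inverted_index(colors)
index_labels = categorical_inverted_index(labels)
```

Because the codes depend only on the pattern of repetition and not on the literal values, two columns that are relabelings of each other produce the same index, here `(1, 2, 1, 3)`.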

The (inverted index, feature) pairs are used as key-value pairs of the hash table. The hash table is traversed, and all distinct features belonging to the same key are extracted as a feature subset. For each feature subset, the features it contains are combined pairwise into feature pairs, and all feature pairs form the candidate feature set. The number of feature pairs in the candidate feature set obtained this way is far smaller than the number of pairwise combinations of the original feature set. The theoretical derivation is as follows:

Let the dimension of the original feature set be d. After sampling and inverted index screening, m subsets are obtained, with sizes d_1, d_2, …, d_m. Note that:

if the features in the original feature set are combined pairwise, the number of feature pairs obtained is

$$\binom{d}{2} = \frac{d(d-1)}{2};$$

if the features within each subset are combined pairwise, the number of feature pairs obtained is

$$\sum_{i=1}^{m} \binom{d_i}{2} = \sum_{i=1}^{m} \frac{d_i(d_i-1)}{2}.$$

In terms of time complexity, the number of feature pairs after decomposition into subsets by the inverted index is governed by the largest subset. Since the subsets are usually much smaller than the original feature set, $\max_i O(d_i^2) \ll O(d^2)$.
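The reduction in the number of feature pairs can be illustrated numerically. In the sketch below the function names are hypothetical, and the subset sizes (30 subsets of 10 features each, for 300 features in total) are invented purely for illustration; they are not experimental data from the patent.

```python
from math import comb

def pairs_without_index(d):
    """Pairwise combinations over the whole original feature set."""
    return comb(d, 2)

def pairs_with_index(subset_sizes):
    """Pairwise combinations taken only inside each feature subset."""
    return sum(comb(size, 2) for size in subset_sizes)

# Hypothetical split of 300 features into 30 subsets of 10 features each.
subset_sizes = [10] * 30

full_pairs = pairs_without_index(sum(subset_sizes))
candidate_pairs_count = pairs_with_index(subset_sizes)
```

Under this invented split, 44850 pairs shrink to 1350 candidate pairs; the actual saving depends on how the inverted index distributes the features in a given data set.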

Then, for every feature pair X and Y to be tested in the candidate feature set, the corresponding similarity metric function sim(X, Y) is applied. For a numerical feature pair, the Pearson correlation coefficient is used to measure the similarity of X and Y and a threshold is set (default 0.95); when |sim(X, Y)| > threshold, the pair is considered a similar feature pair and is added to the result set. For a categorical feature pair, the similarity of X and Y is measured using non-duplicate counting, i.e., the number of values remaining after duplicate values are removed from the original data. For example, the data [a, a, a, b, b, c, d] becomes [a, b, c, d] after duplicates are removed, so its non-duplicate count is 4. The specific theory is as follows:

let distinct(X) and distinct(Y) denote the non-duplicate counts of features X and Y, respectively;

let distinct(X, Y) denote the joint non-duplicate count of features X and Y (i.e., the fields corresponding to X and Y are combined and the non-duplicate count is taken over the combined data column).

When distinct(X) = distinct(Y) = distinct(X, Y), it can be deduced that the values of X and Y are in a one-to-one mapping, i.e., the feature pair under test is a similar categorical feature pair, and it is added to the result set.
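The non-duplicate counting test can be sketched as follows. The function names are hypothetical, and the sketch computes the joint count by counting distinct (x, y) value pairs row by row, which is one reading of "combining the fields" and is assumed here.

```python
def distinct_count(column):
    """Number of values left after duplicates are removed."""
    return len(set(column))

def joint_distinct_count(x, y):
    """Number of distinct (x, y) value pairs taken row by row."""
    return len(set(zip(x, y)))

def is_similar_categorical(x, y):
    """True when both counts and the joint count are all equal."""
    c1, c2 = distinct_count(x), distinct_count(y)
    c3 = joint_distinct_count(x, y)
    return c1 == c2 == c3

# Hypothetical columns: y_same relabels x one-to-one, y_other does not.
x = ["a", "a", "b", "c"]
y_same = ["u", "u", "v", "w"]
y_other = ["u", "v", "v", "w"]

similar = is_similar_categorical(x, y_same)
not_similar = is_similar_categorical(x, y_other)
```

In the second case both columns have 3 distinct values but 4 distinct value pairs, so the values are not in a one-to-one mapping and the pair is rejected.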

By applying the theoretical principle, the invention provides a large-scale data similarity characteristic detection method based on inverted indexes.

A specific example of the candidate feature set generation method in the method is shown in fig. 2:

Assume there is a relational database table with 1 million records and 5 fields (f1 to f5). Taking numerical features as an example, the feature names of the data set are the field names of the table, and the data column of each feature has 1 million values. First, each data column is sampled (assuming a sample length of 8) and its inverted index is constructed, giving the corresponding hash table. Traversing the hash table yields two feature subsets, s1 = {f1, f2, f3} and s2 = {f4, f5}. For each candidate subset, all features are combined pairwise and expressed as feature pairs: s1 → {(f1, f2), (f1, f3), (f2, f3)}, s2 → {(f4, f5)}. After merging, the candidate feature set s = {(f1, f2), (f1, f3), (f2, f3), (f4, f5)} is obtained, giving 4 distinct feature pairs in total. By contrast, combining the original feature set pairwise from the start would yield

$$\binom{5}{2} = 10$$

distinct feature pairs.
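The five-field example can be reproduced with a small sketch of the hash-table grouping. The function name `candidate_pairs` and the two length-8 inverted indexes `key_1` and `key_2` are invented for illustration; only the grouping of f1 to f5 into two subsets comes from the example.

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(index_by_feature):
    """Group features by inverted index, then pair features within a group."""
    table = defaultdict(list)  # key: inverted index, value: feature names
    for feature, inverted_index in index_by_feature.items():
        table[inverted_index].append(feature)
    pairs = []
    for features in table.values():
        pairs.extend(combinations(features, 2))
    return pairs

# Hypothetical inverted indexes with sample length 8.
key_1 = (-1, -1, 1, 0, 1, -1, 1, 1)
key_2 = (1, -1, -1, 1, 0, 1, -1, 1)
indexes = {"f1": key_1, "f2": key_1, "f3": key_1, "f4": key_2, "f5": key_2}

pairs = candidate_pairs(indexes)
all_pairs = list(combinations(indexes, 2))
```

Grouping first produces 4 candidate pairs, against the 10 pairs obtained by combining all five fields directly.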

In verifying the invention, experiments were carried out on 10 different data sets, each averaging about 2.5 million records and about 300 features. The experiments showed that with the method of the invention the detection time for similar features can be shortened to about one tenth of that of other existing methods, greatly saving time and resources.

Claims

1. A large-scale data similarity characteristic detection method based on inverted indexes is characterized by comprising the following steps:

step one, for a tabular data set in a relational database, the features of the data set are the fields of the table; all features of the data set constitute an original feature set; for each feature in the original feature set: first, column sampling is performed on the data corresponding to the feature and an inverted index is constructed; then, a hash table is constructed with (inverted index, feature) as key-value pairs;

step two, the hash table is traversed, all distinct features belonging to the same key in the hash table are extracted as a feature subset, the features contained in each feature subset are combined pairwise into feature pairs, and all feature pairs form a candidate feature set;

step three, for each feature pair in the candidate feature set, a corresponding similarity measurement function is applied, a threshold is set, and a similarity measurement result is obtained and added to a result set;

wherein the features are divided into numerical features and categorical features according to their data attributes; for a given data set, the original feature set is first split into a numerical original feature set and a categorical original feature set, and steps one to three are executed on each, so as to obtain a result set of the numerical feature set and a result set of the categorical feature set respectively;

the column sampling and inverted index construction of step one are specifically as follows:

1) random column sampling is performed on the data corresponding to the feature to obtain a sampled data column;

2) the inverted index is constructed as follows:

for numerical features, the mean of the sampled data column is computed, values in the sampled data column greater than the mean are mapped to 1, values smaller than the mean are mapped to -1, and the remaining values are mapped to 0; the mapped sampled data column is the inverted index of the numerical feature;

for categorical features, the sampled data column is traversed in order, the first category value is mapped to 1, the second category value is mapped to 2, and so on in increasing order; the mapped sampled data column is the inverted index of the categorical feature;

in step three, the application of the corresponding similarity measurement function to each feature pair, the setting of the threshold, and the addition of the similarity measurement result to the result set are specifically as follows:

1) for numerical feature pairs, the Pearson correlation coefficient is applied to the full data columns to measure similarity; when the absolute value of the Pearson correlation coefficient between the data columns of the two features in a pair is greater than the set threshold, the pair is marked as a similar feature pair and added to the result set;

2) for categorical feature pairs, a non-duplicate counting method is applied to the full data columns, i.e., counting the number of values in the original data column after duplicate values are removed; suppose the non-duplicate counts of the two features F1 and F2 in the pair under test are C1 and C2, and the joint non-duplicate count of F1 and F2 is C3; when C1 = C2 = C3, the pair is marked as a similar feature pair and added to the result set.