CN112269854B - Large-scale data similarity characteristic detection method based on inverted index - Google Patents


Info

Publication number
CN112269854B
Authority
CN
China
Prior art keywords
features
characteristic
feature
data
pair
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011299602.3A
Other languages
Chinese (zh)
Other versions
CN112269854A (en)
Inventor
钱晨
张顾洪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN202011299602.3A
Publication of CN112269854A
Application granted
Publication of CN112269854B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a large-scale data similar-feature detection method based on inverted indexes. The method samples the data column of each feature according to its type and extracts the corresponding inverted index, then builds a hash table whose key-value pairs are (inverted index, feature) to generate candidate feature subsets, thereby reducing the dimension of the feature set. The features in each reduced subset are combined pairwise; a Pearson correlation coefficient algorithm is applied to numerical features and a distinct-count (non-repeated counting) method to categorical features to obtain the correlation of each feature pair, a threshold is set, and the result is output. The method avoids the pairwise combination of the whole original feature set that earlier approaches require, can reduce the computation time by one order of magnitude, and saves a large amount of resources, while keeping accuracy and recall at an extremely high level.

Description

Large-scale data similarity characteristic detection method based on inverted index
Technical Field
The invention belongs to the field of machine learning and data mining, relates to a method for detecting feature similarity in big data feature engineering, and particularly relates to a large-scale data similarity feature detection method based on inverted indexes.
Background
Feature similarity detection is a crucial step in data mining and a necessary step in training machine learning models. An original data set often contains a large number of similar features. These dilute feature importance during model training, interfere with feature selection, and degrade model performance; they also add unnecessary computation and waste a large amount of resources.
Most current mainstream feature similarity detection methods traverse all features and combine them pairwise for analysis. When the original feature set is large, the set of combined feature pairs is also very large, so these methods perform poorly on large-scale data sets. Alternatively, a locality-sensitive hashing (LSH) method is used to reduce the dimension first and the similarity is analyzed afterwards. The drawback of that approach is that, although it reduces the dimensionality of the data set, it can currently be applied only to categorical features (including numerical features after one-hot encoding) and cannot be applied to numerical features directly.
Disclosure of Invention
The invention aims to provide a large-scale data similar-feature detection method based on an inverted index that addresses the defects of the prior art. The method reduces the dimensionality of the feature set and thereby greatly reduces the computation time, can be applied to both categorical and numerical features, and ensures extremely high accuracy and recall.
The purpose of the invention is realized by the following technical scheme:
A large-scale data similar-feature detection method based on inverted indexes comprises the following steps:
1. For a tabular data set in a relational database, the features of the data set are the fields of the table. All features of the data set constitute the original feature set. For each feature in the original feature set, first perform column sampling on the data corresponding to the feature and construct an inverted index, then take (inverted index, feature) as a key-value pair of a hash table;
2. Traverse the hash table and extract all distinct features belonging to the same key as a feature subset. For each feature subset, combine the features it contains pairwise into feature pairs. Because the two features in each pair share the same inverted index, they have a higher probability of being a pair of similar features. All feature pairs constitute the candidate feature set;
3. For each feature pair in the candidate feature set, apply the corresponding similarity measurement function with a set threshold, obtain the similarity measurement result, and add it to the result set.
Features can be classified into numerical features and categorical features according to the attributes of their data. Therefore, for a given data set, the original features are classified first, a numerical original feature set and a categorical original feature set are constructed, and steps 1 to 3 above are performed on each of them to obtain a result set for the numerical feature set and a result set for the categorical feature set.
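The split into a numerical and a categorical original feature set can be sketched as follows. This is a minimal illustration only: the patent does not say how a feature's type is decided, so the rule used here (a column is numerical when every value is an int or float) and the names `split_features` and `table` are assumptions.

```python
def split_features(table):
    """Split feature names into (numerical, categorical) lists.

    `table` maps a feature (field) name to its data column.
    Assumption: a column is numerical when all of its values are int or float.
    """
    numeric, categorical = [], []
    for name, column in table.items():
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in column
        )
        (numeric if is_numeric else categorical).append(name)
    return numeric, categorical


# Hypothetical three-field table.
table = {
    "age": [31, 45, 27],
    "income": [5.2, 7.9, 3.1],
    "city": ["HZ", "SH", "HZ"],
}
numeric, categorical = split_features(table)
print(numeric, categorical)
```

Steps 1 to 3 would then run once over the numerical names and once over the categorical names.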
In the above technical solution, the column sampling and inverted index construction described in step 1 are specifically as follows:
1) perform random column sampling on the data corresponding to the feature to obtain a sampled data column;
2) construct the inverted index: for a numerical feature, calculate the mean of the sampled data column, map the values in the sampled data column that are larger than the mean to 1, the values that are smaller than the mean to -1, and the remaining values to 0; the mapped sampled data column is the inverted index of that numerical feature. For a categorical feature, traverse the sampled data column in order, map the first category value to 1, the second category value to 2, and so on in increasing order; the mapped sampled data column is the inverted index of that categorical feature.
Further, in step 3, the corresponding similarity measurement function and threshold are applied to each feature pair and the result set is updated, specifically as follows:
1) for a numerical feature pair, the Pearson correlation coefficient is applied to the full data columns to measure similarity. When the absolute value of the Pearson correlation coefficient between the data columns of the two features in the pair is larger than the set threshold, the pair is marked as a similar feature pair and added to the result set;
2) for a categorical feature pair, the distinct-count (non-repeated counting) method is applied to the full data columns to measure similarity, that is, counting the number of values in the original data column after duplicate values are removed. Suppose the distinct counts of the two features F1 and F2 in the pair to be measured are C1 and C2 respectively, and the joint distinct count of F1 and F2 is C3. When C1 = C2 = C3, the pair is marked as a similar feature pair and added to the result set.
Compared with the prior art, the invention has the following beneficial effects:
1) the invention provides a method for building feature subsets through an inverted index, which places features that may be similar into the same subset and thereby greatly reduces the number of feature pairs that need to be measured and analyzed;
2) by analyzing the relationship between the feature distribution and the similarity measurement function, the invention provides a new inverted index design that maps a data distribution of millions of values to a very small inverted index, reducing the data dimensionality by several orders of magnitude while preserving the uniqueness and accuracy of the mapping;
3) for numerical and categorical features, the invention applies the corresponding similarity measurement functions, the Pearson correlation coefficient algorithm and the distinct-count method respectively, and optimizes them in the engineering implementation, which greatly improves computational efficiency.
Drawings
FIG. 1 is a schematic diagram of the overall scheme design of a large-scale data similarity feature detection method based on inverted indexes;
FIG. 2 is a schematic diagram illustrating an example of a method for generating a candidate feature set in a large-scale data similarity feature detection method based on inverted indexes.
Detailed Description
As shown in fig. 1, the method for detecting large-scale data similar features based on inverted index includes: candidate set generation, similarity measurement and result set integration processes.
First, column sampling is performed on the data corresponding to every feature in the original feature set, and an inverted index is constructed. Features with the same inverted index are put into the same feature subset, all features in each feature subset are combined pairwise into feature pairs, and the pairs are added to the candidate feature set. By analyzing the relationship between the feature distribution and the similarity measurement function, the invention provides a new inverted index design:
For a numerical feature pair (let the features be X and Y), the similarity of X and Y is measured using the Pearson correlation coefficient, which is formulated as follows:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$$
The Pearson correlation coefficient measures the degree of linear correlation between two features. The coefficient $\rho_{X,Y}$ between features X and Y is proportional to the covariance cov(X, Y) between them. From the formula $\operatorname{cov}(X,Y) = E[(X-\mu_X)(Y-\mu_Y)]$ it follows that when X and Y vary in the same or in opposite directions, the absolute value of the covariance tends to its maximum, and after normalization the Pearson correlation coefficient tends to 1 or -1, indicating that X and Y are highly positively or negatively correlated. The direction in which X and Y vary can in turn be inferred from the relation between each sample point and the expected value μ.
Therefore, the inverted index for a numerical feature data column is designed as follows:
1) randomly sample the original feature column, with a default sampling length of 20 sample points, to obtain the sampled sequence;
2) calculate the mean $\mu_{sampled}$ of the sampled sequence;
3) map every sample point s in the sampled sequence as follows:
$$s \mapsto \begin{cases} 1, & s > \mu_{sampled} \\ -1, & s < \mu_{sampled} \\ 0, & \text{otherwise} \end{cases}$$
4) the mapped sequence is the inverted index.
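The four steps above can be sketched in Python. This is a minimal sketch, not the patent's implementation: the function names are hypothetical, and it assumes that the sampled row positions are drawn once and shared by all features so that the resulting indexes are comparable (the text only says "random column sampling").

```python
import random
from statistics import fmean


def sample_positions(n_rows, sample_len=20, seed=0):
    """Draw the row positions once; every feature is sampled at these rows (assumption)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), min(sample_len, n_rows)))


def numeric_inverted_index(column, positions):
    """Map each sampled value to 1, -1 or 0 relative to the mean of the sample."""
    sampled = [column[i] for i in positions]
    mu = fmean(sampled)  # mean of the sampled sequence, not of the full column
    return tuple(1 if s > mu else -1 if s < mu else 0 for s in sampled)


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # y = 10 * x, perfectly correlated with x
positions = sample_positions(n_rows=len(x), sample_len=4, seed=1)

# Two positively, linearly related columns produce the same index.
same_index = numeric_inverted_index(x, positions) == numeric_inverted_index(y, positions)
print(same_index)
```

The index is returned as a tuple so that it can be used directly as a hash table key.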
For categorical feature pairs, the inverted index is extracted by remapping the category values.
The inverted index for a categorical feature data column is designed as follows:
1) randomly sample the original feature column, with a default sampling length of 20 sample points, to obtain the sampled sequence;
2) traverse all sample points s in the sampled sequence in order and map them as follows:
if s is the 1st distinct category value in the sampled sequence, it is mapped to 1;
if s is the 2nd distinct category value in the sampled sequence, it is mapped to 2;
…
if s is the n-th distinct category value in the sampled sequence, it is mapped to n;
3) the mapped sequence is the inverted index.
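The category remapping can be sketched as relabeling values by their order of first appearance. The function name and sample data below are hypothetical, and the sampled positions are again assumed to be shared by all features.

```python
def categorical_inverted_index(column, positions):
    """Relabel sampled category values by order of first appearance: 1, 2, 3, ..."""
    codes = {}
    index = []
    for i in positions:
        value = column[i]
        if value not in codes:
            codes[value] = len(codes) + 1  # next unused label
        index.append(codes[value])
    return tuple(index)


# Two hypothetical columns whose values map one-to-one onto each other.
city = ["HZ", "SH", "HZ", "BJ", "SH"]
area_code = ["0571", "021", "0571", "010", "021"]
positions = [0, 1, 2, 3, 4]

print(categorical_inverted_index(city, positions))       # (1, 2, 1, 3, 2)
print(categorical_inverted_index(area_code, positions))  # (1, 2, 1, 3, 2)
```

Because only the pattern of repeats is kept, two columns that are a relabeling of each other receive the same index regardless of their actual values.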
Each (inverted index, feature) pair is stored as a key-value pair of the hash table. The hash table is then traversed, and all distinct features belonging to the same key are extracted as a feature subset. For each feature subset, the features it contains are combined pairwise into feature pairs, and all feature pairs form the candidate feature set. The number of feature pairs in the candidate feature set obtained this way is far smaller than the number of pairwise combinations of the original feature set. The derivation is as follows:
Let the dimension of the original feature set be d. After sampling and inverted index screening, m subsets of sizes $d_1, d_2, \dots, d_m$ are obtained. Note that:
$$\sum_{i=1}^{m} d_i = d$$
If the features in the original feature set are combined pairwise, the number of feature pairs obtained is
$$\binom{d}{2} = \frac{d(d-1)}{2}$$
If the features within each subset are combined pairwise, the number of feature pairs obtained is
$$\sum_{i=1}^{m} \binom{d_i}{2} = \sum_{i=1}^{m} \frac{d_i(d_i-1)}{2}$$
The order of the time complexity of the calculation is:
$$O\left(\sum_{i=1}^{m} d_i^2\right)$$
The number of feature pairs after the inverted index splits the features into subsets is dominated by the largest subset, and the subsets are usually much smaller than the original feature set, hence $\max_i O(d_i^2) \ll O(d^2)$.
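The two pair counts in the derivation can be compared numerically. The sketch below uses `math.comb` from the standard library; the subset sizes in the second example are purely hypothetical and are not taken from the patent's experiments.

```python
from math import comb


def pairs_full(d):
    """Number of pairs when all d original features are combined pairwise."""
    return comb(d, 2)


def pairs_in_subsets(sizes):
    """Number of pairs when only features inside the same subset are combined."""
    return sum(comb(k, 2) for k in sizes)


# Five features split into subsets of sizes 3 and 2.
print(pairs_full(5), pairs_in_subsets([3, 2]))          # 10 4

# Hypothetical: 300 features that happen to fall into 60 subsets of 5 features each.
print(pairs_full(300), pairs_in_subsets([5] * 60))      # 44850 600
```

The saving depends entirely on how evenly the inverted index spreads the features over the keys; one very large subset would dominate the count.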
Then, for every feature pair (X, Y) to be tested in the candidate feature set, the corresponding similarity measurement function sim(X, Y) is applied. For a numerical feature pair, the Pearson correlation coefficient is used to measure the similarity of X and Y with a threshold (default 0.95); when |sim(X, Y)| > threshold, the pair is considered a similar feature pair and is added to the result set. For a categorical feature pair, the similarity of X and Y is measured using the distinct count, that is, the number of values left after duplicate values are removed from the original data. For example, the data [a, a, a, b, b, c, d] becomes [a, b, c, d] after duplicates are removed, so its distinct count is 4. The theory is as follows:
let distinct(X) and distinct(Y) denote the distinct counts of features X and Y, respectively;
let distinct(X, Y) denote the joint distinct count of features X and Y (that is, the fields corresponding to X and Y are combined and the distinct count is taken over the combined data column).
When distinct(X) = distinct(Y) = distinct(X, Y), it can be deduced that the values of X and Y are in a one-to-one mapping, so the pair under test is a similar categorical feature pair and is added to the result set.
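Both similarity checks can be sketched with the standard library only. The function names are hypothetical; the 0.95 default comes from the text, while returning 0.0 for a constant column (where the Pearson coefficient is undefined) is an assumption made here to keep the sketch total.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # assumption: a constant column is treated as uncorrelated
    return cov / (sx * sy)


def similar_numeric(x, y, threshold=0.95):
    """Similar when |rho| exceeds the threshold (default 0.95 as in the text)."""
    return abs(pearson(x, y)) > threshold


def similar_categorical(x, y):
    """Similar when distinct(X) == distinct(Y) == distinct(X, Y)."""
    return len(set(x)) == len(set(y)) == len(set(zip(x, y)))


print(similar_numeric([1, 2, 3, 4], [2, 4, 6, 8]))                # True
print(similar_numeric([1, 2, 3, 4], [1, -1, 1, -1]))              # False
print(similar_categorical(["a", "b", "a"], ["x", "y", "x"]))      # True
print(similar_categorical(["a", "b", "a"], ["x", "y", "y"]))      # False
```

The joint distinct count is taken over row-wise value pairs, which matches "combining the fields" of X and Y before counting.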
By applying the theoretical principle, the invention provides a large-scale data similarity characteristic detection method based on inverted indexes.
A specific example of the candidate feature set generation in the method is shown in fig. 2:
Assume there is a relational database table with 1 million records and 5 fields (f1 to f5). Taking numerical features as an example, the feature names of the data set are the field names of the table, and the data column of each feature has 1 million values. First, each data column is sampled (assume a sampling length of 8) and its inverted index is constructed, giving the corresponding hash table. Traversing the hash table yields two feature subsets, s1 = {f1, f2, f3} and s2 = {f4, f5}. Within each subset, all features are combined pairwise and expressed as feature pairs: s1 → {(f1, f2), (f1, f3), (f2, f3)}, s2 → {(f4, f5)}. After merging, the candidate feature set s = {(f1, f2), (f1, f3), (f2, f3), (f4, f5)} is obtained, which contains 4 distinct feature pairs. By contrast, if the original feature set were combined pairwise from the start, there would be
$$\binom{5}{2} = 10$$
distinct feature pairs.
To verify the invention in practice, experiments were carried out on 10 different data sets, each with on average about 2.5 million records and about 300 features. The experiments gave the following result: with the method of the invention, the time to detect similar features is reduced to about one tenth of that of other existing methods, which greatly saves time and resources.
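The f1 to f5 example can be reproduced with a dictionary as the hash table. The inverted index values below are made up for illustration (fig. 2 is not available in this text), and `candidate_pairs` is a hypothetical name.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(index_by_feature):
    """Group features by inverted index, then pair up features inside each group."""
    table = defaultdict(list)  # key: inverted index, value: features sharing it
    for feature, index in index_by_feature.items():
        table[index].append(feature)
    pairs = []
    for features in table.values():
        pairs.extend(combinations(features, 2))
    return pairs


# Hypothetical length-8 inverted indexes for the five numerical fields.
key_a = (1, -1, 1, 1, -1, 0, 1, -1)
key_b = (-1, -1, 1, 0, 1, 1, -1, 1)
index_by_feature = {"f1": key_a, "f2": key_a, "f3": key_a, "f4": key_b, "f5": key_b}

pairs = candidate_pairs(index_by_feature)
print(pairs)
# [('f1', 'f2'), ('f1', 'f3'), ('f2', 'f3'), ('f4', 'f5')]
```

Only these 4 pairs go on to the full-column similarity measurement, instead of all 10 pairwise combinations of the five fields.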

Claims (1)

1. A large-scale data similar-feature detection method based on inverted indexes, characterized by comprising the following steps:
step one, for a tabular data set in a relational database, the features of the data set are the fields of the table; all features of the data set constitute the original feature set; for each feature in the original feature set: first, column sampling is performed on the data corresponding to the feature and an inverted index is constructed; then, (inverted index, feature) is used as a key-value pair to construct a hash table;
step two, the hash table is traversed, all distinct features belonging to the same key in the hash table are extracted as a feature subset, the features contained in each feature subset are combined pairwise into feature pairs, and all feature pairs form the candidate feature set;
step three, for each feature pair in the candidate feature set, the corresponding similarity measurement function is applied with a set threshold, and the similarity measurement result is obtained and added to the result set;
wherein the features are divided into numerical features and categorical features according to the attributes of their data; for a given data set, the original feature set is first split into a numerical original feature set and a categorical original feature set, and steps one to three are executed on each, so that a result set of the numerical feature set and a result set of the categorical feature set are obtained respectively;
the column sampling and inverted index construction in step one are specifically as follows:
1) random column sampling is performed on the data corresponding to the feature to obtain a sampled data column;
2) the inverted index is constructed as follows:
for a numerical feature, the mean of the sampled data column is calculated, the values in the sampled data column that are larger than the mean are mapped to 1, the values that are smaller than the mean are mapped to -1, and the remaining values are mapped to 0; the mapped sampled data column is the inverted index of the numerical feature;
for a categorical feature, the sampled data column is traversed in order, the first category value is mapped to 1, the second category value is mapped to 2, and so on in increasing order; the mapped sampled data column is the inverted index of the categorical feature;
in step three, the corresponding similarity measurement function is applied to each feature pair with a set threshold, and the similarity measurement result is obtained and added to the result set, specifically as follows:
1) for a numerical feature pair, the similarity of the full data columns is measured with the Pearson correlation coefficient; when the absolute value of the Pearson correlation coefficient between the data columns of the two features in the pair is greater than the set threshold, the pair is marked as a similar feature pair and added to the result set;
2) for a categorical feature pair, the similarity of the full data columns is measured with the distinct-count (non-repeated counting) method, that is, counting the number of values in the original data column after duplicate values are removed; suppose the distinct counts of the two features F1 and F2 in the pair to be measured are C1 and C2 respectively, and the joint distinct count of F1 and F2 is C3; when C1 = C2 = C3, the pair is marked as a similar feature pair and added to the result set.
CN202011299602.3A 2020-11-18 2020-11-18 Large-scale data similarity characteristic detection method based on inverted index Active CN112269854B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011299602.3A CN112269854B (en) 2020-11-18 2020-11-18 Large-scale data similarity characteristic detection method based on inverted index

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011299602.3A CN112269854B (en) 2020-11-18 2020-11-18 Large-scale data similarity characteristic detection method based on inverted index

Publications (2)

Publication Number Publication Date
CN112269854A CN112269854A (en) 2021-01-26
CN112269854B true CN112269854B (en) 2022-06-10

Family

ID=74340717

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011299602.3A Active CN112269854B (en) 2020-11-18 2020-11-18 Large-scale data similarity characteristic detection method based on inverted index

Country Status (1)

Country Link
CN (1) CN112269854B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2013129729A1 (en) * 2012-02-28 2013-09-06 주식회사 케이쓰리아이 System for searching augmented reality image in real-time by using layout descriptor and image feature point
CN106503106A (en) * 2016-10-17 2017-03-15 北京工业大学 A kind of image hash index construction method based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8065293B2 (en) * 2007-10-24 2011-11-22 Microsoft Corporation Self-compacting pattern indexer: storing, indexing and accessing information in a graph-like data structure

Also Published As

Publication number Publication date
CN112269854A (en) 2021-01-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant