CN110175191A - Data filtering rule modeling method in data analysis

Info

Publication number: CN110175191A
Application number: CN201910401717.XA
Authority: CN
Inventors: 周鹏程; 荆一楠; 何震瀛; 王晓阳
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2019-05-14
Filing date: 2019-05-14
Publication date: 2019-08-27
Anticipated expiration: 2039-05-14
Also published as: CN110175191B

Abstract

The invention belongs to the technical field of data analysis and specifically relates to a data filtering rule modeling method in data analysis. The method comprises three main parts: (1) data column analysis filtering; (2) data range analysis filtering; (3) automatic visualization of the result set. By setting reasonable rules, the invention solves the problem of how to apply data filtering rules to build an analysis filtering model in data analysis, and uses the model to analyze, filter, and intuitively display data. The invention helps users quickly screen data, find data subsets of interest, and analyze and mine the relationships between data items.

Description

Data filtering rule modeling method in data analysis

Technical field

The invention belongs to the technical field of data analysis and specifically relates to a data filtering rule modeling method in data analysis.

Background art

In an era of ubiquitous data, users' decisions are increasingly data-driven. Differences in the results of data analysis often significantly affect the decision process. Choosing inappropriate data, whether intentionally or unintentionally, may lead to wrong, misleading, or "fragile" decisions. Especially for users with no data analysis experience, such poor analysis results may cause serious economic losses. Guiding users toward good data selection therefore gives them a higher-quality exploratory data analysis experience.

To allow users without data analysis experience to avoid, as far as possible, the error-prone data exploration process and the tedious setting of analysis filter conditions, and to obtain good analysis filtering results directly, a standardized process is needed. This process determines how to select data for filtering and analysis and how to model data filtering rules automatically according to the characteristics of the data.

Summary of the invention

The purpose of the invention is to provide a data filtering rule modeling method for interactive data exploration scenarios, so that the data in a data set can be analyzed and mined quickly, making it easier for users to explore and analyze the data.

For recommendation rule modeling on a data set, the desired characteristics are as follows:

1. Interpretation: how to generate suitable recommendations within a visualization system;

2. Feasibility: generated recommendations should carry enough analytical significance and be able to uncover potential associations in the data;

3. Qualitative: given the nature of user exploration, model construction must be efficient and robust.

The data filtering rule modeling method provided by the invention has the following specific steps:

(1) Data column analysis filtering. Given a data set D consisting of a large amount of data, random forest feature selection is used to calculate the importance of each data column, according to whether the user has specified key data. The detailed process is as follows:

(1.1) The variable importance measure is denoted VIM and is expressed through the Gini index GI. Assume there are m data columns X_1, X_2, X_3, ..., X_m. The Gini index score VIM_j^(Gini) of each column X_j is to be calculated, that is, the average change in node-splitting impurity caused by the j-th column over all decision trees of the random forest (RF). The Gini index is:

$$GI_m = \sum_{k=1}^{K} p_{mk}\, p'_{mk} = 1 - \sum_{k=1}^{K} p_{mk}^{2}$$

where K is the number of classes at node m in the decision trees of the RF, p_mk is the proportion of class k in node m, and p′_mk = 1 − p_mk is the complement of that proportion. Intuitively, GI_m is the probability that two samples drawn at random from node m have different class labels.
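The intuition stated above (the probability that two samples drawn from a node carry different class labels) corresponds to the standard Gini impurity, 1 minus the sum of squared class proportions. A minimal sketch follows; the function name `gini_index` and the sample labels are illustrative assumptions, not part of the patent text.

```python
from collections import Counter


def gini_index(labels):
    """Gini index of one node: 1 - sum over classes of p_mk squared."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)  # class k -> number of samples in the node
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Hypothetical node holding two balanced classes.
node_labels = ["a", "a", "b", "b"]
print(gini_index(node_labels))  # 0.5
```

A pure node (a single class) gives 0, and the value grows as the classes in the node become more evenly mixed.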

(1.2) The importance of data column X_j at node m, i.e., the change in the Gini index before and after the split at node m, is:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$

where GI_m is the Gini index of node m, and GI_l and GI_r denote the Gini indices of the two new nodes after the split.

(1.3) If the nodes at which data column X_j appears in decision tree i form the set M, then the importance of X_j in the i-th tree is:

$$VIM_{ij}^{(Gini)} = \sum_{m \in M} VIM_{jm}^{(Gini)}$$

(1.4) If the random forest contains n trees, the importance of data column X_j is:

$$VIM_{j}^{(Gini)} = \sum_{i=1}^{n} VIM_{ij}^{(Gini)}$$

(1.5) The columns are ranked by the calculated importance, and the two most important data columns are returned to the user as the analysis filtering result. They are denoted A and B, with A ranked higher in importance than B.
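Steps (1.2) to (1.5) can be sketched as follows. The sketch assumes the node importance is the Gini index of the node minus the Gini indices of its two children, summed over the nodes of each tree and over all trees; the forest representation (a list of split records per tree), the function names, and the column names are all hypothetical.

```python
from collections import defaultdict


def column_importance(forest):
    """Sum the Gini decrease GI_m - GI_l - GI_r over every split of every tree."""
    vim = defaultdict(float)
    for tree in forest:
        for column, gi_parent, gi_left, gi_right in tree:
            vim[column] += gi_parent - gi_left - gi_right
    return dict(vim)


def top_two_columns(vim):
    """Return the two columns with the highest importance, most important first."""
    ranked = sorted(vim, key=vim.get, reverse=True)
    return ranked[:2]


# Hypothetical forest: each tree is a list of (column, GI_m, GI_l, GI_r) records.
forest = [
    [("sale_date", 0.50, 0.05, 0.05), ("price", 0.40, 0.10, 0.20)],
    [("price", 0.50, 0.20, 0.10), ("region", 0.30, 0.15, 0.10)],
]
vim = column_importance(forest)
print(top_two_columns(vim))  # ['sale_date', 'price']
```

Library implementations such as scikit-learn typically weight each child's impurity by its share of the node's samples; the sketch keeps the unweighted form for brevity.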

(2) Data range analysis filtering. The invention illustrates how to carry out data range analysis filtering for the two columns A and B. The detailed process is as follows:

(2.1) The data types of columns A and B are first divided into three classes: numeric type N, discrete type X, and time-series type T. Numeric data N is first discretized: the data are binned, each bin is denoted n′, and the count of each bin is denoted CNT(n′). For discrete data X, the count of each discrete value is denoted CNT(x).
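The discretization and counting described above can be sketched as follows. The text does not say how bins are chosen, so equal-width binning is an assumption here, as are the names `bin_counts`, `value_counts`, and the sample values.

```python
from collections import Counter


def bin_counts(values, num_bins):
    """Equal-width binning (an assumption): map each bin (low, high) to its count."""
    low, high = min(values), max(values)
    width = (high - low) / num_bins or 1.0  # avoid zero width when all values are equal
    counts = Counter()
    for v in values:
        # The maximum value falls on the upper edge, so clamp it into the last bin.
        idx = min(int((v - low) / width), num_bins - 1)
        counts[idx] += 1
    return {(low + i * width, low + (i + 1) * width): counts.get(i, 0)
            for i in range(num_bins)}


def value_counts(values):
    """Count of each discrete value, i.e. CNT(x)."""
    return dict(Counter(values))


prices = [0, 5, 10, 30, 55, 90, 100]  # hypothetical numeric column
print(list(bin_counts(prices, 4).values()))  # [3, 1, 1, 2]
print(value_counts(["red", "blue", "red"]))  # {'red': 2, 'blue': 1}
```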

Because time-series data often show seasonality, the invention automatically divides time bins according to the time range of data column T; each time bin obtained by binning column T is denoted t′. For example, if the data of T span 2017 to 2019, the time bins t′ are divided by year; if T contains only data from 2019, the time bins t′ are divided by month; similarly, if T contains only data from January 2019, the time bins t′ are divided by day.
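The choice of time bin unit in the example above can be expressed as a small rule. The function name `choose_time_unit` is hypothetical, and the sketch covers only the three cases named in the text (year, month, day).

```python
from datetime import date


def choose_time_unit(start, end):
    """Pick the unit of the time bins t' from the range covered by column T."""
    if start.year != end.year:
        return "year"   # data span several years
    if start.month != end.month:
        return "month"  # data lie within a single year
    return "day"        # data lie within a single month


print(choose_time_unit(date(2017, 3, 1), date(2019, 5, 1)))   # year
print(choose_time_unit(date(2019, 1, 5), date(2019, 11, 2)))  # month
print(choose_time_unit(date(2019, 1, 2), date(2019, 1, 30)))  # day
```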

(2.2) From the three data types, two data analysis filtering combination models are formed and used to filter and analyze data set D (every "/" below means "or", not division). Specifically:

(2.2.1) A is time-series data and B is discrete or numeric data. Based on the unit of the time bins t′ obtained in (2.1), a suitable recent period is chosen for A as the first filter condition t_recent (e.g., the most recent three years, six months, or seven days; if there are not enough data, this filter is not generated). The data set after filtering on column A is denoted D*. Filtering yields from data column B either the discrete data column B* with values x*_1, x*_2, ..., x*_k, or the numeric data column B*, which is binned again to give (n*_1)′, (n*_2)′, ..., (n*_k)′, where k is the number of bins. The three discrete values x*_max or bins (n*_max)′ that hold the three largest counts CNT(x*)_top3 / CNT((n*)′)_top3 give the value range used as the second filter condition. The intersection t_recent ∩ x*_max / (n*_max)′ of the two filter conditions is used as the analysis filter condition of the combination model to filter and analyze data set D.
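A minimal sketch of combination (2.2.1) for a discrete column B follows. It assumes rows are dictionaries and that the recent period is measured back from the latest date present in the data; the function name, the column names `date` and `brand`, and the sample rows are hypothetical.

```python
from collections import Counter
from datetime import date, timedelta


def filter_time_then_top3(rows, time_col, value_col, recent_days):
    """Keep recent rows (t_recent), then rows whose value is among the top three counts."""
    latest = max(r[time_col] for r in rows)
    cutoff = latest - timedelta(days=recent_days)
    recent = [r for r in rows if r[time_col] > cutoff]  # data set D*
    counts = Counter(r[value_col] for r in recent)      # CNT(x*)
    top3 = {value for value, _ in counts.most_common(3)}
    return [r for r in recent if r[value_col] in top3]


data = [(10, "A"), (26, "A"), (27, "A"), (28, "A"), (28, "B"),
        (29, "B"), (29, "B"), (30, "C"), (30, "C"), (31, "D")]
rows = [{"date": date(2019, 1, day), "brand": brand} for day, brand in data]

result = filter_time_then_top3(rows, "date", "brand", recent_days=7)
print(len(result))                           # 8
print(sorted({r["brand"] for r in result}))  # ['A', 'B', 'C']
```

For a numeric column B, the same function could be applied to bin labels obtained by binning D* again, as the text describes.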

(2.2.2) A is discrete or numeric data and B is time-series data. For A, the count CNT(x) of each discrete value or CNT(n′) of each bin is calculated, and the value range corresponding to the five values x_top5 or bins (n_top5)′ with the largest counts is taken as the first filter condition (if there are too few discrete values or bins, this filter is not generated). The data set after filtering on column A is denoted D*. The time range t_max of data column B* corresponding to the value x_max or bin (n_max)′ with the largest count in A is taken as the second filter condition. The intersection x_top5/(n_top5)′ ∩ t_max of the two filter conditions is used as the analysis filter condition of the combination model to filter and analyze data set D.
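Combination (2.2.2) can be sketched in the same style for a discrete column A. The sketch reads the time range t_max as the span of B over the rows holding the most frequent value of A, which is an interpretation of the text; the function name, column names, and sample rows are hypothetical.

```python
from collections import Counter
from datetime import date


def filter_top5_then_time(rows, value_col, time_col):
    """Keep the five most frequent values of A, then the time span of the top value."""
    counts = Counter(r[value_col] for r in rows)  # CNT(x)
    if len(counts) < 5:
        return None  # too few distinct values: no filter is generated
    ranked = [value for value, _ in counts.most_common(5)]
    top5 = set(ranked)
    filtered = [r for r in rows if r[value_col] in top5]  # data set D*
    x_max = ranked[0]
    times = [r[time_col] for r in filtered if r[value_col] == x_max]
    t_start, t_end = min(times), max(times)  # time range t_max
    return [r for r in filtered if t_start <= r[time_col] <= t_end]


data = [("A", 10), ("A", 11), ("A", 12), ("B", 11), ("B", 20), ("C", 12),
        ("C", 5), ("D", 10), ("D", 25), ("E", 11), ("E", 1), ("F", 11)]
rows = [{"brand": brand, "date": date(2019, 1, day)} for brand, day in data]

result = filter_top5_then_time(rows, "brand", "date")
print(len(result))  # 7
```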

(3) To present the filtered data to the user, the invention automatically visualizes the result data set obtained from the analysis filtering of steps (1) and (2). The detailed process is as follows:

(3.1) To visualize the result data set, the following are obtained for each column X: the cardinality d(X); the maximum max(X) and minimum min(X); the number of records |X|; the data type type(X); the count CNT(x′) of each bin x′ of X (each discrete value of a discrete column X can be regarded as one bin); and the correlation coefficient correlation(x′, CNT(x′)) between each bin x′ and its count CNT(x′).
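The per-column statistics listed above can be gathered in one pass. The sketch below assumes a numeric column whose distinct values act as bins and uses the Pearson coefficient for the correlation, since the text does not name a specific coefficient; `column_profile` and `pearson` are hypothetical names.

```python
from collections import Counter
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def column_profile(values):
    """Statistics of step (3.1) for one numeric column X."""
    counts = Counter(values)
    bins = sorted(counts)
    return {
        "cardinality": len(counts),               # d(X)
        "max": max(values),                       # max(X)
        "min": min(values),                       # min(X)
        "records": len(values),                   # |X|
        "counts": {b: counts[b] for b in bins},   # CNT(x')
        "correlation": pearson(bins, [counts[b] for b in bins]),
    }


profile = column_profile([1, 2, 2, 3, 3, 3])
print(profile["cardinality"], profile["records"], profile["counts"])
```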

(3.2) A set of pruning rules is defined according to the column type type(X) obtained in (3.1). When the data type of column X is time-series, the visualization chart can be a bar chart or a line chart. When the data type of column X is discrete or numeric, the visualization chart can be a bar chart, a pie chart, or a scatter plot.

(3.3) The invention proposes a data analysis method, relative information entropy, to determine how to automatically visualize the result data set obtained after the analysis filtering of steps (1) and (2). The core idea is to calculate, for each data column X, the ratio of the information entropy of X visualized as each chart type to the information entropy of a standardized chart, denoted C(X)_1, C(X)_2, ..., C(X)_k. The relative information entropies are compared, and the chart type corresponding to the maximum value C(X)_max is the visualization type of data column X. The specific procedure is as follows:

(3.3.1) The bar chart is one of the charts analysts use most often; the height differences between its bars help users recognize differences in the data. The bar chart suits many scenarios and shows the details of the data well when there are many elements x′ (i.e., many bins). The relative information entropy of the bar chart is calculated from the cardinality d(X) of column X, where |d(X)| denotes the value of the cardinality d(X);

(3.3.2) The pie chart can show multiple groups of data and how each group contributes to the total. In a pie chart, the counts CNT(x′) must be well differentiated to highlight the share of each part, so Shannon entropy is introduced as part of the criterion:

$$H = -\sum_{y} P(y) \log P(y)$$

where y ranges over the values of CNT(x′) and P(y) is the proportion of y, i.e., the probability of occurrence of y within CNT(x′);
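The Shannon entropy used for the pie chart criterion can be computed as in the sketch below. It assumes that P(y) is the share of each count in the total (the slice size of the pie) and uses base-2 logarithms; both choices, and the name `shannon_entropy`, are assumptions, and the sketch covers only the entropy term, not the full criterion.

```python
from math import log2


def shannon_entropy(counts):
    """Shannon entropy (in bits) of the shares that the counts take of their total."""
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]  # P(y); empty slices contribute nothing
    return -sum(p * log2(p) for p in shares)


print(shannon_entropy([25, 25, 25, 25]))  # 2.0
print(shannon_entropy([97, 1, 1, 1]) < shannon_entropy([25, 25, 25, 25]))  # True
```

Four equal slices reach the maximum of 2 bits for four parts, while a pie dominated by one slice scores much lower.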

(3.3.3) The advantage of the line chart is that it reflects how the same thing develops and changes over time. When the data CNT(x′) and x′ follow a certain distribution (e.g., linear, exponential, logarithmic, or low-order power), the expression of the distribution is denoted distribution(x′, CNT(x′)) and the information entropy C(X) is 1; otherwise, the information entropy C(X) is 0:

C(X) = distribution(x′, CNT(x′));
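The text does not say how to decide whether CNT(x′) and x′ follow one of the named distributions. One possible sketch is to fit each candidate by least squares on suitably transformed data and accept it when the coefficient of determination exceeds a threshold; the threshold 0.95, the transforms, and the function names are all assumptions.

```python
from math import log


def r_squared(xs, ys):
    """Coefficient of determination of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0  # degenerate input: treated as no fit in this sketch
    return sxy * sxy / (sxx * syy)


def distribution(xs, counts, threshold=0.95):
    """Return 1 if the counts fit a linear, exponential, log, or power law, else 0."""
    x_pos = all(x > 0 for x in xs)
    c_pos = all(c > 0 for c in counts)
    fits = [r_squared(xs, counts)]                            # linear
    if c_pos:
        fits.append(r_squared(xs, [log(c) for c in counts]))  # exponential
    if x_pos:
        fits.append(r_squared([log(x) for x in xs], counts))  # logarithmic
    if x_pos and c_pos:
        fits.append(r_squared([log(x) for x in xs],
                              [log(c) for c in counts]))      # power law
    return 1 if max(fits) >= threshold else 0


print(distribution([1, 2, 3, 4, 5], [2, 4, 6, 8, 10]))      # 1
print(distribution([1, 2, 3, 4, 5, 6], [5, 1, 6, 2, 7, 1]))  # 0
```

The power-law check here accepts any exponent; restricting it to low orders would need an additional test on the fitted slope.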

(3.3.4) The scatter plot represents the relationship between two variables on coordinate axes. Its value is calculated using the correlation coefficient correlation(x′, CNT(x′)):

C(X) = correlation(x′, CNT(x′)).

(3.4) The relative information entropies of column X under the various visualization charts are compared and sorted to obtain the maximum value C(X)_max. The result data set obtained after the analysis filtering of steps (1) and (2) is then visualized with the chart type corresponding to C(X)_max.
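The selection in (3.4), combined with the pruning rules of (3.2), can be sketched as follows. The scores passed in are assumed to have been computed beforehand for each chart type; the names `CANDIDATES` and `pick_chart`, the type labels, and the sample scores are hypothetical.

```python
# Pruning rules of step (3.2): chart types allowed for each column type.
CANDIDATES = {
    "time": ["bar", "line"],
    "discrete": ["bar", "pie", "scatter"],
    "numeric": ["bar", "pie", "scatter"],
}


def pick_chart(column_type, scores):
    """Return the allowed chart type with the largest relative information entropy."""
    allowed = [chart for chart in CANDIDATES[column_type] if chart in scores]
    return max(allowed, key=lambda chart: scores[chart])


# Hypothetical scores C(X) for a time-series column.
scores = {"bar": 0.3, "line": 1.0}
print(pick_chart("time", scores))  # line
```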

By setting reasonable rules, the invention solves the problem of how to apply data filtering rules to build an analysis filtering model in data analysis, and uses the model to analyze, filter, and intuitively display data. The invention helps users quickly screen data, find data subsets of interest, and analyze and mine the relationships between data items.

Brief description of the drawings

Fig. 1 is an example diagram of data column analysis.

Fig. 2 is the flow of data analysis filtering.

Fig. 3 is an example of data analysis filtering, in which (a) is the sales date filtering example and (b) is the price filtering example.

Fig. 4 compares visualization methods for the result data set, in which (a) shows the result data set as a bar chart and (b) shows it as a line chart.

Fig. 5 is a flow diagram of the method of the invention.

Detailed description of embodiments

This section introduces the invention through a specific data analysis system.

The data selected contain 33 columns and 344,355 records. Following the process described above, the data columns and data ranges are analyzed, and the resulting data are then visualized and returned to the user. As shown in Fig. 1, the data column analysis method of the invention takes the profit column as the key column and analyzes all remaining data columns; the result is that the sales date and price columns have the highest importance.

Based on the scheme given in step (2), the invention builds the data filtering rule model and combines filter conditions on the target columns sales date and price. Based on this model, the data analysis system obtains the sequence of operations for analyzing the data shown in Fig. 2: the sales date is restricted to the most recent month, and the price bin with the largest count covers the range 0-57. The resulting filter result is displayed by the system as shown in Fig. 3.

The invention uses automated visualization: it analyzes the result data set on its own and displays it with a suitable visualization chart. As shown in Fig. 4, displaying the data as the bar chart on the left is not very suitable, whereas the line chart on the right shows the trend better than the bar chart. The invention therefore uses the line chart on the right to display the price column.

Claims

1. A data filtering rule modeling method in data analysis, with the following specific steps:

(1) Given a data set D consisting of a large amount of data, random forest feature selection is used to calculate the importance of each data column, according to whether the user has specified key data; the detailed process is as follows:

(1.1) The importance score is denoted VIM and the Gini index is denoted GI. Assume there are m data columns X_1, X_2, X_3, ..., X_m; the Gini index score VIM_j^(Gini) of each column X_j is to be calculated, that is, the average change in node-splitting impurity caused by the j-th column over all decision trees of the random forest (RF). The Gini index is:

$$GI_m = \sum_{k=1}^{K} p_{mk}\, p'_{mk} = 1 - \sum_{k=1}^{K} p_{mk}^{2}$$;

where K is the number of classes at node m in the decision trees of the RF, p_mk is the proportion of class k in node m, and p′_mk = 1 − p_mk is the complement of that proportion;

(1.2) The importance of data column X_j at node m, i.e., the change in the Gini index before and after the split at node m, is:

$$VIM_{jm}^{(Gini)} = GI_m - GI_l - GI_r$$;

where GI_m is the Gini index of node m, and GI_l and GI_r denote the Gini indices of the two new nodes after the split;

；

(1.5) The columns are ranked by the calculated importance, and the two most important data columns are returned to the user as the analysis filtering result; they are denoted A and B, with A ranked higher in importance than B;

(2) Data range analysis filtering; the detailed process is as follows:

(2.1) The data types of columns A and B are first divided into three classes: numeric type N, discrete type X, and time-series type T; numeric data N is first discretized: the data are binned, each bin is denoted n′, and the count of each bin is denoted CNT(n′); for discrete data X, the count of each discrete value is denoted CNT(x);

For time-series type T, time bins are divided according to the time range of data column T, and each time bin obtained by binning column T is denoted t′;

(2.2) From the three data types, two data analysis filtering combination models are formed and used to filter and analyze data set D; specifically:

(2.2.1) A is time-series data and B is discrete or numeric data; based on the unit of the time bins t′ obtained in (2.1), a suitable recent period is chosen for A as the first filter condition t_recent; the data set after filtering on column A is denoted D*; filtering yields from data column B either the discrete data column B* with values x*_1, x*_2, ..., x*_k, or the numeric data column B*, which is binned again to give (n*_1)′, (n*_2)′, ..., (n*_k)′, where k is the number of bins; the three discrete values x*_max or bins (n*_max)′ that hold the three largest counts CNT(x*)_top3 / CNT((n*)′)_top3 give the value range used as the second filter condition; the intersection t_recent ∩ x*_max / (n*_max)′ of the two filter conditions is used as the analysis filter condition of the combination model to filter and analyze data set D;

(2.2.2) A is discrete or numeric data and B is time-series data; for A, the count CNT(x) of each discrete value or CNT(n′) of each bin is calculated, and the value range corresponding to the five values x_top5 or bins (n_top5)′ with the largest counts is taken as the first filter condition; the data set after filtering on column A is denoted D*; the time range t_max of data column B* corresponding to the value x_max or bin (n_max)′ with the largest count in A is taken as the second filter condition; the intersection x_top5/(n_top5)′ ∩ t_max of the two filter conditions is used as the analysis filter condition of the combination model to filter and analyze data set D;

(3) To present the filtered data to the user, the result data set obtained from the analysis filtering of steps (1) and (2) is visualized automatically; the detailed process is as follows:

(3.1) To visualize the result data set, the following are obtained for each column X: the cardinality d(X); the maximum max(X) and minimum min(X); the number of records |X|; the data type type(X); the count CNT(x′) of each bin x′ of X; and the correlation coefficient correlation(x′, CNT(x′)) between each bin x′ and its count CNT(x′);

(3.2) A set of pruning rules is defined according to the column type type(X) obtained in (3.1); when the data type of column X is time-series, the visualization chart is a bar chart or a line chart; when the data type of column X is discrete or numeric, the visualization chart is a bar chart, a pie chart, or a scatter plot;

(3.3) A data analysis method, relative information entropy, is used to determine how to automatically visualize the result data set obtained after the analysis filtering of steps (1) and (2); the core idea is to calculate, for each data column X, the ratio of the information entropy of X visualized as each chart type to the information entropy of a standardized chart, denoted C(X)_1, C(X)_2, ..., C(X)_k; the relative information entropies are compared, and the chart type corresponding to the maximum value C(X)_max is the visualization type of data column X; specifically:

(3.3.1) In the bar chart, the height differences between bars help users recognize differences in the data; the relative information entropy of the bar chart is calculated from the cardinality d(X) of column X, where |d(X)| denotes the value of the cardinality d(X):

(3.3.2) The pie chart can show multiple groups of data and how each group contributes to the total; in a pie chart, the counts CNT(x′) must be well differentiated to highlight the share of each part, so Shannon entropy is introduced as part of the criterion:

$$H = -\sum_{y} P(y) \log P(y)$$

where y ranges over the values of CNT(x′) and P(y) is the proportion of y, i.e., the probability of occurrence of y within CNT(x′);

(3.3.3) The line chart reflects how the same thing develops and changes over time; when the data CNT(x′) and x′ follow a certain distribution (linear, exponential, logarithmic, or low-order power), the expression of the distribution is denoted distribution(x′, CNT(x′)) and the information entropy C(X) is 1; otherwise, the information entropy C(X) is 0:

C(X) = distribution(x′, CNT(x′))

(3.3.4) The scatter plot represents the relationship between two variables on coordinate axes; its value is calculated using the correlation coefficient correlation(x′, CNT(x′)):

C(X) = correlation(x′, CNT(x′))

(3.4) The relative information entropies of column X under the various visualization charts are compared and sorted to obtain the maximum value C(X)_max; the result data set obtained after the analysis filtering of steps (1) and (2) is visualized with the chart type corresponding to C(X)_max.