CN113761297A

CN113761297A - Method and device for determining field relevancy in database table

Info

Publication number: CN113761297A
Application number: CN202011248181.1A
Authority: CN
Inventors: 张蒙
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2021-12-07

Abstract

The invention discloses a method and a device for determining field relevancy in a database table, and relates to the technical field of computers. One embodiment of the method comprises: for any two fields to be analyzed in a database table, determining the field type of each field according to the elements of the field, wherein the field types include a numeric field and a categorical field, and the elements of a categorical field belong to at least two element categories; when one of the two fields is a numeric field and the other is a categorical field, determining the elements of the categorical field that belong to the same element category, and forming an analysis group from the corresponding elements of the numeric field; and determining the inter-group variance and the intra-group variance of the analysis groups, and obtaining the correlation index of the two fields from the inter-group variance and the intra-group variance. This embodiment can quantitatively calculate the degree of correlation between a numeric field and a categorical field in any database table, and helps to unify correlation analysis across fields of different types.

Description

Method and device for determining field relevancy in database table

Technical Field

The invention relates to the technical field of computers, in particular to a method and a device for determining field relevancy in a database table.

Background

Many scenarios require determining the degree of correlation between different fields in a database table. For example, in a data analysis scenario, information is often asymmetric between the data provider and the data demander, and database tables have a certain complexity, which leads to problems such as ambiguous data requirements and frequent data corrections. In such cases the degree of correlation between different fields in the database table needs to be analyzed, so as to provide a valuable reference for the data demander and significantly improve work efficiency. In the prior art, correlation can be calculated with methods such as the Pearson correlation coefficient, chosen according to whether the field to be analyzed is a numeric field or a categorical field.

In the process of implementing the invention, the inventor found that the prior art has at least the following problems. First, when dealing with a large number of fields to be analyzed whose types are unknown, the prior art cannot quickly and accurately identify the field types. Second, when calculating the correlation between a numeric field and certain categorical fields (such as gender), the prior art can only describe the correlation qualitatively, which cannot meet the requirements of some application environments. Third, the prior art lacks a uniform standard for correlation analysis of database table fields across the various situations.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for determining field relevancy in a database table, which can quantitatively calculate relevancy for a numeric field and a categorical field in any database table, and are helpful to implement unified analysis of relevancy for different types of fields.

To achieve the above objects, according to one aspect of the present invention, there is provided a method for determining the relatedness of fields in a database table.

The method for determining the field relevancy in the database table comprises the following steps: for any two fields to be analyzed in a database table, determining the field type of each field according to the elements of the field, wherein the field types include a numeric field and a categorical field, and the elements of a categorical field belong to at least two element categories; when one of the two fields is a numeric field and the other is a categorical field, determining the elements of the categorical field that belong to the same element category, and forming an analysis group from the corresponding elements of the numeric field; and determining the inter-group variance and the intra-group variance of the analysis groups, and obtaining the correlation index of the two fields from the inter-group variance and the intra-group variance.

Optionally, determining the field type of each field according to the elements of the field includes: for any field to be analyzed, judging whether the proportion of elements in the field that match a preset first regular expression is not less than a first threshold; if so, determining the field to be a numeric field, wherein the first regular expression is used for matching floating point numbers; if the proportion of elements that match the first regular expression is less than the first threshold, judging whether the number of elements in the field after de-duplication is greater than 1 and not greater than a second threshold; if so, determining the field to be a categorical field, wherein the second threshold is related to and less than the total number of elements in the field.

Optionally, determining the field type of each field according to the elements of the field includes: for any field to be analyzed, judging whether the number of elements in the field after de-duplication is greater than 1 and not greater than a second threshold; if so, determining the field to be a categorical field, wherein the second threshold is related to and less than the total number of elements in the field; if the number of elements in the field after de-duplication is 1 or is greater than the second threshold, judging whether the proportion of elements in the field that match a preset second regular expression is not less than a third threshold; if so, determining the field to be a numeric field, wherein the second regular expression is used for matching floating point numbers and integers.

Optionally, obtaining the correlation index of the two fields according to the inter-group variance and the intra-group variance includes: dividing the inter-group variance by the intra-group variance to obtain an initial correlation value of the two fields, and taking the natural logarithm of the initial correlation value as an intermediate correlation value; and transforming the intermediate correlation value to the value interval from zero to one to form the correlation index of the two fields.

Optionally, transforming the intermediate correlation value to the value interval from zero to one to form the correlation index of the two fields includes: when the intermediate correlation value is less than zero, determining the correlation index to be zero; when the intermediate correlation value is greater than a first value, determining the correlation index to be one, wherein the first value is a real number greater than one; when the intermediate correlation value is not less than zero and not greater than the first value, determining the correlation index to be the product of the intermediate correlation value and a second value, wherein the second value is the reciprocal of the first value.

Optionally, the method further comprises: when any two fields to be analyzed in the database table are both numeric fields, determining the absolute value of the Spearman correlation coefficient of the two fields as the correlation index of the two fields; when any two fields to be analyzed in the database table are both categorical fields, determining the Cramer correlation coefficient of the two fields as the correlation index of the two fields.

Optionally, the method further comprises: after obtaining the relevance indexes of any two fields to be analyzed in the database table, inputting the relevance indexes into a preset relevance matrix; the row number and the column number of the correlation matrix are both equal to the total number of the fields to be analyzed of the database table, each row and each column respectively correspond to the identifiers of the fields to be analyzed in the database table which are arranged in a preset sequence, any element in the correlation matrix is a correlation index between the field corresponding to the row where the element is located and the field corresponding to the column where the element is located, and the gray value of the element is positively correlated with the correlation index.

Optionally, the method further comprises: after obtaining the relevance indexes of any two fields to be analyzed in the database table, inputting the relevance indexes into a preset weight connection diagram; the weight connection graph comprises nodes which are arranged along the circumferential direction and used for representing fields to be analyzed in the database table, and connecting lines which are positioned between any two nodes and used for representing correlation indexes; the nodes are configured with different colors for representing different field types, the connecting lines are configured with different colors for representing different correlation index types, the width and the color depth of the connecting line are positively correlated with the correlation index represented by the connecting line, and the correlation index types comprise: a relevance indicator between two numeric fields, a relevance indicator between two categorical fields, and a relevance indicator between a numeric field and a categorical field.

To achieve the above object, according to another aspect of the present invention, there is provided an apparatus for determining relevancy of fields in a database table.

The device for determining the field relevancy in the database table of the embodiment of the invention may comprise: a field type determination unit, configured to determine, for any two fields to be analyzed in a database table, the field type of each field according to the elements of the field, wherein the field types include a numeric field and a categorical field, and the elements of a categorical field belong to at least two element categories; a grouping unit, configured to, when one of the two fields is a numeric field and the other is a categorical field, determine the elements of the categorical field that belong to the same element category and form an analysis group from the corresponding elements of the numeric field; and a correlation calculation unit, configured to determine the inter-group variance and the intra-group variance of the analysis groups and obtain the correlation index of the two fields from the inter-group variance and the intra-group variance.

To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.

An electronic device of the present invention includes: one or more processors; the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors realize the method for determining the field relevancy in the database table.

To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable storage medium.

The invention relates to a computer readable storage medium, on which a computer program is stored, which when executed by a processor implements the method for determining the relevancy of fields in a database table provided by the invention.

According to the technical scheme of the invention, the embodiment of the invention has the following advantages or beneficial effects. When correlation analysis is performed on the fields in a database table, the field types are first determined quickly and accurately according to the field elements and a preset regular expression. The correlation index is then calculated with a method suited to the field types: when the two fields to be analyzed are both numeric fields, the absolute value of their Spearman correlation coefficient is used as the correlation index, and when the two fields are both categorical fields, their Cramer correlation coefficient is used as the correlation index. In particular, when one of the two fields is a numeric field and the other is a categorical field, the embodiment of the present invention first divides the elements of the numeric field into a plurality of analysis groups according to the element categories of the categorical field, and then calculates the inter-group variance and the intra-group variance of the analysis groups and uses the quotient of the two as a correlation index, thereby implementing quantitative correlation analysis of the numeric field and the categorical field (the specific principle is described below). In summary, the embodiment of the present invention provides a unified analysis standard from field type determination to correlation calculation: whether the pair consists of two numeric fields, two categorical fields, or a numeric field and a categorical field, quantitative correlation analysis can be performed to obtain a correlation index between zero and one.

Further effects of the above optional implementations are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of the main steps of the method for determining the relevancy of the fields in the database table according to the embodiment of the present invention;

FIG. 2 is a global schematic of a saturation logarithm function of an embodiment of the present invention;

FIG. 3 is a partial schematic of a saturation logarithm function of an embodiment of the present invention;

FIG. 4 is a diagram illustrating a specific implementation of the method for determining relevancy of fields in a database table according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a correlation matrix according to an embodiment of the invention;

FIG. 6 is a diagram illustrating a weight connection according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a component of an apparatus for determining relevancy of fields in a database table according to an embodiment of the present invention;

FIG. 8 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;

FIG. 9 is a schematic structural diagram of an electronic device for implementing the method for determining the relevancy of the fields in the database table according to the embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments of the present invention and the technical features of the embodiments may be combined with each other without conflict.

Example one

FIG. 1 is a diagram illustrating the main steps of a method for determining the relevancy of fields in a database table according to an embodiment of the present invention. As shown in fig. 1, the method for determining the field relevancy in the database table according to the embodiment of the present invention may be specifically performed according to the following steps:

Step S101: For any two fields to be analyzed in the database table, determine the field type of each field according to the elements of the field.

In this step, the two fields to be analyzed may be fields of the same database table, or may be any fields in a plurality of database tables that can be associated with each other. For example, if the employee performance assessment table and the employee base information table may be associated by a common employee name field, then a relevancy analysis may be performed on any two fields in the two database tables. In embodiments of the present invention, the field types may include a numeric field, a categorical field, and other types of fields that are different from both the numeric field and the categorical field.

The elements of a numeric field (i.e., the values that the database table holds in that field) are integers, decimals, or other numeric values whose magnitudes generally have practical meaning and can be compared with each other; for example, the payroll amount field and the overtime length field in the employee performance assessment table generally belong to the numeric type. The elements of a categorical field can generally be divided into at least two element categories. For example, the gender field and the age group field in the employee basic information table are generally categorical fields: the gender field includes two element categories, "male" and "female", and the age group field may include element categories such as "10 to 20 years old", "20 to 30 years old", "30 to 40 years old", "40 to 50 years old", and the like. Categorical fields can be divided into ordered categorical fields and unordered categorical fields: the element categories of an ordered categorical field can be compared with each other (for example, by size) and sorted according to their actual meaning, whereas the element categories of an unordered categorical field cannot. For example, the gender field is an unordered categorical field, and the age group field is an ordered categorical field. The mobile phone number field belongs to the other types of fields, because a mobile phone number generally has no practical significance at the numerical level and does not reflect membership in different categories; in embodiments of the present invention, correlation analysis may be skipped for other types of fields. For the age field, since its elements have both a numerical meaning and a category meaning, the field may be treated as either a numeric field or a categorical field.

In an alternative, the field type of any field may be determined according to the following steps. For any field to be analyzed, firstly, judging whether the proportion of elements in the field, which accord with a preset first regular expression, is not less than a first threshold value, namely judging whether the following formula is satisfied:

n_num≥η·n

where n_num represents the number of elements that conform to the first regular expression and n represents the total number of elements in the field. η is a preset first threshold and may be a number greater than zero and less than one. The first regular expression may be a regular expression used to match floating point numbers (i.e., numbers with a decimal point), such as "\\ d" (the first backslash is used for character escaping; the expression matches elements that have digits after a decimal point).

If the formula is satisfied, determining the field as a numeric field; otherwise, judging whether the number of the elements in the field after the duplication removal is greater than 1 and not greater than a second threshold value, namely whether the following conditions are met:

1＜n_dedup≤β

where n_dedup represents the number of elements in the field after de-duplication, and β is a preset second threshold.

If the above condition is met, the field is determined to be a categorical field; otherwise, the field is determined to be an other-type field. The second threshold β is related to and smaller than the total number of elements n in the field, and may be, for example, n^α (α is a positive number less than 1).

In actual application scenarios most elements of a numeric field are floating point numbers. The above method therefore first treats a field in which a certain proportion of elements are floating point numbers as a numeric field, then identifies a categorical field by checking whether the number of distinct elements in the field is smaller or far smaller than the total number of elements (the rule for "far smaller" can be set flexibly), and classifies any field identified by neither test as an other-type field. In this way the field type can be determined accurately and quickly, which solves the problem that field types cannot be determined in time when a large number of unknown fields are encountered. It should be noted that, in a few cases, the above method may classify a numeric field whose values are integers as an other-type field; since most numeric fields take floating point values, this limitation does not affect the practical effect of the method.
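The numeric-first decision procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `classify_numeric_first`, the concrete pattern `FLOAT_RE`, and the default values `eta=0.8` and `alpha=0.5` are all hypothetical choices made for the example; the description only requires a regular expression that matches floating point numbers, a threshold η in (0, 1), and a second threshold β such as n^α.

```python
import re

# Assumed pattern for "floating point number": optional sign, digits,
# a decimal point, and at least one digit after it.
FLOAT_RE = re.compile(r"[+-]?\d+\.\d+")


def classify_numeric_first(elements, eta=0.8, alpha=0.5):
    """Return 'numeric', 'categorical' or 'other' for a list of raw elements.

    eta   -- first threshold (share of elements that must look like floats)
    alpha -- exponent used to derive the second threshold beta = n ** alpha
    Both defaults are illustrative assumptions.
    """
    n = len(elements)
    if n == 0:
        return "other"

    # Test 1: n_num >= eta * n  -> numeric field
    n_num = sum(1 for e in elements if FLOAT_RE.fullmatch(str(e).strip()))
    if n_num >= eta * n:
        return "numeric"

    # Test 2: 1 < n_dedup <= beta  -> categorical field
    n_dedup = len(set(elements))
    beta = n ** alpha
    if 1 < n_dedup <= beta:
        return "categorical"

    # Neither test matched -> other-type field
    return "other"


payroll = ["1223.12", "2154.56", "1896.51", "3021.55", "2136.96"]
gender = ["male", "female", "male", "female", "female"]
phones = ["13800000001", "13800000002", "13800000003",
          "13800000004", "13800000005"]

print(classify_numeric_first(payroll))  # numeric
print(classify_numeric_first(gender))   # categorical
print(classify_numeric_first(phones))   # other
```

With five elements, β = 5^0.5 ≈ 2.24, so the two distinct gender values pass the categorical test while the five distinct phone numbers do not.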

In another alternative, the field type of any field may be determined according to the following steps. For any field to be analyzed, firstly, judging whether the number of elements in the field after deduplication is greater than 1 and not greater than a second threshold: if yes, determining the field as a type-divided field; otherwise, judging whether the element proportion of the field which accords with the preset second regular expression is not less than a third threshold value, namely judging whether the following formula is met:

n₀≥τ·n

where n₀ represents the number of elements in the field that conform to the second regular expression, n represents the total number of elements in the field, and τ is a preset third threshold, which may be a number greater than zero and less than one. The second regular expression may be a regular expression used to match floating point numbers (i.e., numbers having decimal points) and integers, such as "\ d" (which matches elements that contain digits).

If the formula is satisfied, the field is determined to be a numeric field; if the formula is not satisfied, the field is determined to be an other-type field.

It can be understood that the field type judgment method firstly executes classification type judgment and then executes numerical type judgment, and can also realize accurate and rapid judgment of the field type. In a specific case, for other types of fields (e.g., mobile phone number fields) whose values are integers, the method may determine the fields as numerical fields, and since the probability of occurrence of such a case is small, the actual use of the method is not greatly affected.
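The categorical-first variant can be sketched in the same way. Again this is only an illustration: the name `classify_categorical_first`, the pattern `NUMBER_RE`, and the defaults `tau=0.8` and `alpha=0.5` are assumptions of the example. The final fallback to `"other"` is also an assumption of this sketch: a field that fails both the categorical test and the numeric test is treated here as an other-type field.

```python
import re

# Assumed pattern for "floating point number or integer":
# optional sign, digits, optional fractional part.
NUMBER_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def classify_categorical_first(elements, tau=0.8, alpha=0.5):
    """Categorical test first, numeric test second (illustrative defaults)."""
    n = len(elements)
    if n == 0:
        return "other"

    # Test 1: 1 < n_dedup <= beta, with beta = n ** alpha -> categorical
    n_dedup = len(set(elements))
    if 1 < n_dedup <= n ** alpha:
        return "categorical"

    # Test 2: n_0 >= tau * n -> numeric
    n_0 = sum(1 for e in elements if NUMBER_RE.fullmatch(str(e).strip()))
    if n_0 >= tau * n:
        return "numeric"

    # Assumed fallback for fields that pass neither test.
    return "other"


ages = ["25", "31", "47", "52", "38"]
gender = ["male", "female", "male", "female", "female"]
phones = ["13800000001", "13800000002", "13800000003",
          "13800000004", "13800000005"]

print(classify_categorical_first(ages))    # numeric
print(classify_categorical_first(gender))  # categorical
print(classify_categorical_first(phones))  # numeric (the known limitation)
```

The last call reproduces the limitation noted above: an integer-valued other-type field such as a phone number is classified as numeric by this ordering.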

Step S102: When one of the two fields is a numeric field and the other is a categorical field, determine the elements of the categorical field that belong to the same element category, and form an analysis group from the corresponding elements of the numeric field.

In the prior art, when one of the fields to be analyzed is a numeric field and the other is a categorical field, only qualitative correlation analysis can be performed; the present embodiment therefore provides a quantitative correlation analysis method. Specifically, the elements of the categorical field that belong to the same element category are determined first, and the elements of the numeric field corresponding to those elements then form an analysis group. A first element in the categorical field corresponds to a second element in the numeric field if the two elements are in the same record (when the categorical field and the numeric field belong to the same database table), or if the two elements correspond to the same element of an associated field (when the categorical field and the numeric field belong to different database tables). For example, suppose the two fields to be analyzed are the payroll amount field and the gender field, as follows (each field has five elements, from record 1 to record 5, arranged from top to bottom):

Payroll amount	Gender
1223.12	Male
2154.56	Female
1896.51	Male
3021.55	Female
2136.96	Female

1223.12, 1896.51 corresponding to the element category "male" may be formed into one analysis group and 2154.56, 3021.55, 2136.96 corresponding to the element category "female" may be formed into another analysis group.
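The grouping step can be illustrated with the example data above. The helper name `build_analysis_groups` is hypothetical, and the sketch assumes the two fields are already aligned record by record (same table, or already joined on the associated field).

```python
from collections import defaultdict


def build_analysis_groups(numeric_values, categories):
    """Group numeric elements by the category found in the same record."""
    if len(numeric_values) != len(categories):
        raise ValueError("both fields must have one element per record")
    groups = defaultdict(list)
    for value, category in zip(numeric_values, categories):
        groups[category].append(value)
    return dict(groups)


payroll = [1223.12, 2154.56, 1896.51, 3021.55, 2136.96]
gender = ["male", "female", "male", "female", "female"]

groups = build_analysis_groups(payroll, gender)
print(groups)
# {'male': [1223.12, 1896.51], 'female': [2154.56, 3021.55, 2136.96]}
```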

Step S103: Determine the inter-group variance and the intra-group variance of the analysis groups, and obtain the correlation index of the two fields from the inter-group variance and the intra-group variance.

In the present embodiment, the intra-group variance indicates the degree of data dispersion within the analysis groups, and the inter-group variance indicates the degree of data dispersion between the analysis groups. In general, the inter-group variance MSB and the intra-group variance MSE may be calculated by the following equations:

$$\mathrm{MSB} = \frac{\sum_{i=1}^{r} n_i \left(\bar{x}_i - \bar{x}\right)^2}{r - 1}, \qquad \mathrm{MSE} = \frac{\sum_{i=1}^{r} \sum_{j=1}^{n_i} \left(x_{ij} - \bar{x}_i\right)^2}{n - r}$$

where r denotes the number of element categories of the categorical field to be analyzed, n_i denotes the number of elements belonging to the i-th element category in the categorical field, x_ij denotes the j-th element of the numeric field corresponding to the i-th element category, $\bar{x}_i$ denotes the average of the elements of the numeric field corresponding to the i-th element category, $\bar{x}$ denotes the average of all elements in the numeric field, n denotes the total number of elements in the numeric field, and i and j are positive integers.
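As a worked illustration, the two variances can be computed as below. The sketch assumes the standard one-way analysis-of-variance mean squares, i.e. the between-group sum of squares divided by r − 1 and the within-group sum of squares divided by n − r; the function name `group_variances` and the toy data are hypothetical.

```python
def group_variances(groups):
    """Return (MSB, MSE) for a mapping: category -> list of numeric elements.

    Assumes the usual one-way ANOVA definitions:
      MSB = sum_i n_i * (mean_i - grand_mean)^2 / (r - 1)
      MSE = sum_i sum_j (x_ij - mean_i)^2      / (n - r)
    """
    r = len(groups)                              # number of element categories
    n = sum(len(v) for v in groups.values())     # total number of elements
    if r < 2 or n <= r:
        raise ValueError("need at least two groups and more elements than groups")

    grand_mean = sum(sum(v) for v in groups.values()) / n
    ssb = 0.0  # between-group sum of squares
    sse = 0.0  # within-group sum of squares
    for values in groups.values():
        mean_i = sum(values) / len(values)
        ssb += len(values) * (mean_i - grand_mean) ** 2
        sse += sum((x - mean_i) ** 2 for x in values)
    return ssb / (r - 1), sse / (n - r)


msb, mse = group_variances({"a": [1, 2, 3], "b": [4, 5, 6]})
print(msb, mse, msb / mse)  # 13.5 1.0 13.5
```

For the toy data the group means are 2 and 5 and the overall mean is 3.5, so MSB = 3·1.5² + 3·1.5² = 13.5 (divided by r − 1 = 1) and MSE = (2 + 2) / (6 − 2) = 1.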

After the inter-group variance and the intra-group variance are obtained, the correlation index of the two fields can be obtained from them. Preferably, the inter-group variance MSB is first divided by the intra-group variance MSE to obtain an initial correlation value F of the two fields, the natural logarithm log_e F of the initial correlation value is taken as the intermediate correlation value, and finally the intermediate correlation value log_e F is transformed to the value interval from zero to one, thus forming the correlation index of the two fields. As a preferred solution, the above transformation can be performed by the following formula:

$$R_{12} = \begin{cases} 0, & \log_e F < 0 \\ \dfrac{1}{\mu}\,\log_e F, & 0 \le \log_e F \le \mu \\ 1, & \log_e F > \mu \end{cases}$$

The above function is a saturated logarithmic function of F, where R₁₂ represents the correlation index and μ is a preset first value greater than 1 (for example, 10).

That is, when the intermediate correlation value log_e F is less than zero, the correlation index is determined to be zero; when the intermediate correlation value is greater than the first value, the correlation index is determined to be one; when the intermediate correlation value is not less than zero and not greater than the first value, the correlation index is determined to be the intermediate correlation value log_e F multiplied by a second value, the second value being the reciprocal of the first value. FIG. 2 is a global schematic diagram of the saturated logarithmic function according to an embodiment of the present invention, and FIG. 3 is a local schematic diagram of the same function; in FIG. 2 and FIG. 3 the abscissa is the F value and the ordinate is the correlation index R₁₂.
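The saturated logarithmic transformation described above is small enough to sketch directly. The function name `saturated_log_index` is hypothetical, and the handling of a non-positive F (returning zero because the logarithm is undefined) is an assumption of this sketch rather than something stated in the description.

```python
import math


def saturated_log_index(f_value, mu=10.0):
    """Map the initial correlation value F to a correlation index in [0, 1].

    mu is the preset first value (> 1); 1 / mu is the second value.
    """
    if f_value <= 0:
        # Assumption: treat an undefined logarithm as "no correlation".
        return 0.0
    intermediate = math.log(f_value)   # natural logarithm of F
    if intermediate < 0:
        return 0.0
    if intermediate > mu:
        return 1.0
    return intermediate / mu


print(saturated_log_index(0.5))    # 0.0  (log F < 0)
print(saturated_log_index(1.0))    # 0.0  (log F = 0)
print(saturated_log_index(1e9))    # 1.0  (log F > 10)
```

With μ = 10, an F value of e⁵ ≈ 148.4 maps to an index of about 0.5.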

The principle behind the above correlation index is as follows. For the analysis groups formed according to the element categories of the categorical field, the inter-group variance MSB is determined by both the individual differences E among the elements of the numeric field and the treatment-factor difference T that exists between different analysis groups, whereas the intra-group variance MSE is determined only by the individual differences E among the elements within the analysis groups. Because different analysis groups correspond to different element categories of the categorical field, the treatment-factor difference T is determined by the degree of correlation between the elements in the analysis groups and their corresponding element categories. Thus, as that correlation increases (i.e., as the correlation between the numeric field and the categorical field increases), T increases, and the ratio F of the inter-group variance MSB to the intra-group variance MSE increases; conversely, as the correlation between the numeric field and the categorical field decreases, T decreases and F decreases; when the numeric field is uncorrelated with the categorical field, T is zero and F is equal to 1. Therefore, F can be used to accurately measure the degree of correlation of the two fields.

In practical applications the F value varies over a wide range, so the range is compressed logarithmically. Since log_e 20000 ≈ 10, and numerous tests have shown that the F values of a numeric field and a categorical field fall within the interval [1, 20000] in most cases, the logarithmic function 0.1·log_e F can be used to describe the correlation index. In the rare cases where F lies outside this common interval, so that the value of the logarithmic function is not within [0, 1], the value is set to the corresponding boundary value by saturation. Thus, F is converted by the saturated logarithmic function into a correlation index R₁₂ within the ideal interval, realizing quantitative analysis of the correlation between the numeric field and the categorical field. In some alternative implementations, the initial correlation value F may be used directly as the correlation index, and other results computed from the inter-group variance MSB and the intra-group variance MSE, such as (MSB-MSE)/MSE, may be taken as the initial correlation value F; when the intermediate correlation value log_e F is transformed, any other suitable transformation method may be used instead of the above saturated logarithmic function.

In this embodiment, when any two fields X, Y to be analyzed in the database table are both numeric fields, the absolute value of the Spearman correlation coefficient of the two fields is determined as the correlation index of the two fields. Specifically, the elements in X and Y are first sorted in ascending order, and the sorted lists are denoted X₀ and Y₀; the Spearman correlation coefficient r_XY is then calculated according to the following formula:

$$r_{XY} = 1 - \frac{6 \sum_{i=1}^{n} \left(p_i - q_i\right)^2}{n\left(n^2 - 1\right)}$$

where p_i denotes the position in X₀ of the i-th element of X, q_i denotes the position in Y₀ of the i-th element of Y, and n denotes the number of elements of either field (the fields to be analyzed have the same number of elements). Finally, the absolute value of r_XY is taken as the correlation index. It will be appreciated that the correlation index lies in the interval from zero to one.
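The rank-based calculation can be sketched as follows. The sketch assumes the classic Spearman formula 1 − 6·Σd² / (n(n² − 1)), where d is the difference between an element's position in the sorted X list and the position of its counterpart in the sorted Y list, and it assumes that neither field contains tied values; the name `spearman_index` is hypothetical.

```python
def spearman_index(x, y):
    """Absolute Spearman coefficient of two equally long numeric fields.

    Assumes no ties, so positions in the sorted lists are unique ranks.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("fields must have the same length (at least 2)")

    def positions(values):
        # positions[i] = 1-based position of values[i] in the ascending sort
        order = sorted(range(len(values)), key=lambda i: values[i])
        pos = [0] * len(values)
        for rank, idx in enumerate(order, start=1):
            pos[idx] = rank
        return pos

    px, py = positions(x), positions(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(px, py))
    r_xy = 1 - 6 * d_squared / (n * (n * n - 1))
    return abs(r_xy)


print(spearman_index([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))  # 1.0
print(spearman_index([1, 2, 3, 4, 5], [50, 40, 30, 20, 10]))  # 1.0
```

A perfectly decreasing relationship gives r_XY = −1, and taking the absolute value keeps the index in the zero-to-one interval.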

When any two fields to be analyzed in the database table (e.g., field 1 and field 2) are both categorical fields, the Cramér correlation coefficient of the two fields is determined as the correlation index. Specifically, a chi-squared independence analysis is first performed on field 1 and field 2. Let the numbers of element categories of field 1 and field 2 be s and c respectively, where s and c are both not less than 2. An s × c observed frequency contingency table is built from the elements of field 1 and field 2, in which the value f_ij of the cell in the i-th row and j-th column represents the number of records that take the i-th element category in field 1 and the j-th element category in field 2. The expected frequency of each cell of the observed frequency contingency table is then calculated to generate an s × c expected frequency contingency table, in which the value of the cell in the i-th row and j-th column is:

$$e_{ij} = \frac{\left(\sum_{k=1}^{c} f_{ik}\right)\left(\sum_{k=1}^{s} f_{kj}\right)}{N}$$

where k is a positive integer, and N represents the number of elements of field 1 or field 2 (the number of elements of field 1 is equal to the number of elements of field 2).

Then the chi-square statistic χ² of field 1 and field 2 is calculated. If every $e_{ij}$ is not less than 5, then:

$$\chi^2 = \sum_{i=1}^{s}\sum_{j=1}^{c}\frac{\left(f_{ij} - e_{ij}\right)^2}{e_{ij}}$$

If there is any $e_{ij}$ less than 5, then:

$$\chi^2 = \sum_{i=1}^{s}\sum_{j=1}^{c}\frac{\left(\left|f_{ij} - e_{ij}\right| - 0.5\right)^2}{e_{ij}}$$

Finally, the Cramér correlation coefficient of field 1 and field 2, i.e., the Cramér's V correlation coefficient, is calculated as the correlation index using the following formula:

$$V = \sqrt{\frac{\chi^2}{N \cdot \min(s - 1,\; c - 1)}}$$

It will be appreciated that the correlation index calculated according to the above steps lies in the interval from zero to one.
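The steps for two categorical fields can be sketched as below. This is an illustration under stated assumptions rather than the exact procedure of the embodiment: the chi-square statistic is the usual Pearson statistic, the branch for expected frequencies below 5 is assumed to be a 0.5 continuity correction, and the function name `cramers_v` is hypothetical.

```python
import math
from collections import Counter


def cramers_v(field1, field2):
    """Cramér's V of two equal-length categorical fields."""
    n = len(field1)
    if n != len(field2):
        raise ValueError("fields must have the same number of elements")
    row_totals = Counter(field1)        # counts per category of field 1
    col_totals = Counter(field2)        # counts per category of field 2
    s, c = len(row_totals), len(col_totals)
    if s < 2 or c < 2:
        raise ValueError("each field needs at least two element categories")
    observed = Counter(zip(field1, field2))  # observed frequency table
    expected = {
        (row, col): row_totals[row] * col_totals[col] / n
        for row in row_totals
        for col in col_totals
    }
    # Assumption: apply a 0.5 continuity correction if any expected count < 5.
    corrected = any(value < 5 for value in expected.values())
    chi2 = 0.0
    for cell, e in expected.items():
        diff = abs(observed[cell] - e)
        if corrected:
            diff -= 0.5
        chi2 += diff * diff / e
    return math.sqrt(chi2 / (n * min(s - 1, c - 1)))
```

For example, two fields whose categories always occur together give a value of one, and two fields whose categories are evenly mixed give zero.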

Through the above arrangement, once the field types have been quickly distinguished, the correlation indexes between two numeric fields, between two categorical fields, and between a numeric field and a categorical field can all be calculated accurately, which provides a unified standard for correlation analysis of database table fields. After the correlation indexes of any two fields of the database table are obtained, they can be displayed through various data visualization methods.

Example two

FIG. 4 is a diagram illustrating a specific implementation of the method for determining the relevancy of the fields in the database table according to the embodiment of the present invention. As shown in FIG. 4, the method for determining the relevancy of the fields in the database table according to the embodiment of the invention may include three parts, namely preprocessing, correlation analysis and result output.

In the preprocessing part, data cleansing first needs to be performed on the database table. Illustratively, if the database table is not in csv (Comma-Separated Values) format, string splitting is required for the header row and for each row of records; the formats of the elements in each field are unified, redundant spaces, punctuation, garbled characters and the like are removed, and invalid values such as NULL and None, together with missing values, are unified into a single null representation. The field type of each field to be analyzed can then be determined according to the method in the first embodiment. Finally, the correlation matrix and the weight connection graph are initialized (the correlation matrix and the weight connection graph are used for visualizing the correlation indexes, as described later).
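The cleansing step can be illustrated with the following sketch. The separator, the set of strings treated as invalid, and the names `clean_cell` and `parse_table` are all assumptions chosen for the example; the embodiment does not fix them.

```python
# Assumption: these lowercase strings are treated as invalid or missing values.
INVALID_VALUES = {"", "null", "none", "nan"}


def clean_cell(raw):
    """Strip redundant spaces and map invalid values to a single null marker."""
    text = raw.strip()
    return None if text.lower() in INVALID_VALUES else text


def parse_table(lines, separator=","):
    """Split the header row and each record, returning one list per field."""
    header = [name.strip() for name in lines[0].split(separator)]
    columns = {name: [] for name in header}
    for line in lines[1:]:
        cells = line.split(separator)
        for name, cell in zip(header, cells):
            columns[name].append(clean_cell(cell))
    return columns
```

Returning one list per field keeps the later steps simple, because both the field type judgment and the pairwise correlation calculation operate on whole fields.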

In the correlation analysis part, correlation analysis needs to be performed on any two fields in a database table, and before analysis, it is first checked whether the number of elements of each field to be analyzed is greater than a preset threshold number: if yes, performing subsequent analysis; otherwise, the field is discarded. Thereafter, the correlation index may be calculated for each situation according to the method described in embodiment one.
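A minimal sketch of the element-count check and of choosing a calculation by field type pair follows. The type labels `"numeric"` and `"categorical"`, the returned method names, and both function names are hypothetical and serve only to show the branching.

```python
def has_enough_elements(values, min_count):
    """Keep a field only if it has more elements than the preset threshold."""
    return len(values) > min_count


def choose_method(type_a, type_b):
    """Pick the correlation calculation for a pair of field types."""
    kinds = {type_a, type_b}
    if kinds == {"numeric"}:
        return "spearman"          # two numeric fields
    if kinds == {"categorical"}:
        return "cramers_v"         # two categorical fields
    if kinds == {"numeric", "categorical"}:
        return "variance_ratio"    # one numeric and one categorical field
    raise ValueError("unsupported field types: %r, %r" % (type_a, type_b))
```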

In the result output section, the correlation index may be input into a preset correlation matrix and/or a weight connection graph. Fig. 5 is a schematic diagram of a correlation matrix according to an embodiment of the present invention, and fig. 6 is a schematic diagram of a weight connection graph according to an embodiment of the present invention. As shown in fig. 5, the number of rows and the number of columns of the correlation matrix are both equal to the total number of fields to be analyzed in the database table. The rows and the columns each correspond to the identifiers of the fields to be analyzed arranged in a preset order (that is, the columns from left to right and the rows from top to bottom each correspond to the identifiers of the fields to be analyzed arranged in that order; a field identifier may be a field name). Any element in the correlation matrix is the correlation index between the field corresponding to the row where the element is located and the field corresponding to the column where the element is located, and the gray value of the element is positively correlated with the correlation index. It is to be understood that each element in the correlation matrix of fig. 5 is shown as a percentage with the percent sign omitted.
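Filling the correlation matrix can be sketched as follows. The callable `index_of` stands for whichever pairwise calculation applies and is hypothetical, as are the function names; setting the diagonal to one, treating the matrix as symmetric, and mapping the index linearly onto 0 to 255 are assumptions of this example.

```python
def build_correlation_matrix(field_names, index_of):
    """Return a square matrix of correlation indexes in field_names order."""
    size = len(field_names)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        matrix[i][i] = 1.0  # assumption: a field is fully correlated with itself
        for j in range(i + 1, size):
            value = index_of(field_names[i], field_names[j])
            matrix[i][j] = value
            matrix[j][i] = value  # assumption: the index is symmetric
    return matrix


def gray_level(index):
    """Map a correlation index in [0, 1] to an 8-bit gray intensity."""
    return round(255 * index)
```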

As shown in fig. 6, the weight connection graph includes nodes arranged along the circumferential direction and used for representing the fields to be analyzed in the database table, and connecting lines located between any two nodes and used for representing the correlation indexes. The nodes are configured with different colors for representing different field types, and the connecting lines are configured with different colors for representing different correlation index types. The width and the color depth (i.e., the integrated gray scale of the three channels to which the color belongs) of a connecting line are positively correlated with the correlation index represented by the connecting line. The correlation index types comprise: a correlation index between two numeric fields, a correlation index between two categorical fields, and a correlation index between a numeric field and a categorical field. Since nodes of different colors and connecting lines of different colors cannot be displayed in fig. 6, only different gray scales are shown schematically. It can be seen that the database table corresponding to the correlation matrix in fig. 5 and the weight connection graph in fig. 6 has the following fields: system, level 1 department, level 2 department, whether an organization head, whether high potential, whether core talent, job level sequence, job name, department age, gender, constellation, highest education level, whether performance is excellent, type of work hours, nationality, ethnicity, political affiliation, marital status, province, city, department age, same-job duration, promotion interval, promotion speed, training duration, work saturation and performance level.
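The geometry of such a weight connection graph can be sketched without any plotting library. The names `circular_layout` and `edge_style`, the unit radius, and the linear mapping from correlation index to line width and opacity are assumptions made for illustration.

```python
import math


def circular_layout(field_names, radius=1.0):
    """Place one node per field at equal angles around a circle."""
    count = len(field_names)
    return {
        name: (
            radius * math.cos(2 * math.pi * k / count),
            radius * math.sin(2 * math.pi * k / count),
        )
        for k, name in enumerate(field_names)
    }


def edge_style(index, max_width=6.0):
    """Line width and color depth both grow with the correlation index."""
    return {"width": max_width * index, "depth": index}
```

A drawing routine could then iterate over field pairs, look up the two node positions, and draw a line styled by `edge_style` in a color chosen from the correlation index type.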

According to the technical solution of the embodiment of the present invention, when correlation analysis is performed on the fields in a database table, the field type is judged quickly and accurately according to the elements of the field and a preset regular expression. The correlation index can then be calculated by a method matched to the field types: when the two fields to be analyzed are both numeric fields, the absolute value of the Spearman correlation coefficient of the two fields is used as the correlation index, and when the two fields to be analyzed are both categorical fields, the Cramér correlation coefficient of the two fields is used as the correlation index. In particular, when one of the two fields to be analyzed is a numeric field and the other is a categorical field, the embodiment of the present invention first divides the elements in the numeric field into a plurality of analysis groups according to the element categories in the categorical field, then calculates the inter-group variance and the intra-group variance of the analysis groups and obtains the correlation index from the quotient of the two, thereby implementing quantitative correlation analysis of a numeric field and a categorical field. To sum up, the embodiment of the present invention provides a unified analysis standard from field type determination to correlation calculation: whether the pair consists of two numeric fields, two categorical fields, or a numeric field and a categorical field, quantitative correlation analysis can be performed to obtain a correlation index between zero and one.

It should be noted that, for convenience of description, the foregoing method embodiments are described as a series of acts, but those skilled in the art will appreciate that the present invention is not limited by the order of acts described, and that some steps may in fact be performed in other orders or concurrently. Moreover, those skilled in the art will appreciate that the embodiments described in the specification are preferred embodiments, and that the acts and modules involved are not necessarily required to implement the invention.

To facilitate a better implementation of the above-described aspects of embodiments of the present invention, the following also provides relevant means for implementing the above-described aspects.

Referring to fig. 7, an apparatus 700 for determining the relevance of a field in a database table according to an embodiment of the present invention may include: a field type determination unit 701, a grouping unit 702, and a correlation calculation unit 703.

The field type determining unit 701 may be configured to: for any two fields to be analyzed in a database table, judge the field type of each field according to the elements of the field; wherein the field types include: a numeric field and a categorical field, the elements in the categorical field belonging to at least two element categories. The grouping unit 702 may be configured to: when one of the two fields is a numeric field and the other field is a categorical field, determine the elements belonging to the same element category in the categorical field, and form an analysis group from the elements in the numeric field corresponding to those elements. The correlation calculation unit 703 may be configured to determine the inter-group variance and the intra-group variance of the analysis groups, from which the correlation index of the two fields is obtained.
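The grouping and the variance quotient can be sketched as below. The degrees of freedom follow the usual one-way analysis of variance definitions of the inter-group variance MSB and the intra-group variance MSE, which is an assumption here; the function name `variance_ratio` and the treatment of a zero intra-group variance as infinity are also assumptions of this example.

```python
import math
from collections import defaultdict


def variance_ratio(numeric, categorical):
    """Group numeric elements by category and return F = MSB / MSE."""
    groups = defaultdict(list)
    for value, category in zip(numeric, categorical):
        groups[category].append(value)  # one analysis group per category
    k, n = len(groups), len(numeric)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more elements than groups")
    grand_mean = sum(numeric) / n
    means = {cat: sum(vals) / len(vals) for cat, vals in groups.items()}
    ssb = sum(len(vals) * (means[cat] - grand_mean) ** 2
              for cat, vals in groups.items())
    sse = sum((v - means[cat]) ** 2
              for cat, vals in groups.items() for v in vals)
    msb = ssb / (k - 1)   # inter-group variance
    mse = sse / (n - k)   # intra-group variance
    if mse == 0:
        # Assumption: identical values inside every group count as infinite F.
        return math.inf
    return msb / mse
```

The resulting F value is the correlation initial value, which would then be passed through a transformation such as a saturated logarithm to land in the zero to one interval.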

In this embodiment of the present invention, the field type determining unit 701 may further be configured to: for any field to be analyzed, judge whether the proportion of elements in the field that match a preset first regular expression is not less than a first threshold, and if yes, determine the field as a numeric field, wherein the first regular expression is used for matching floating point numbers; if the proportion of elements in the field that match the first regular expression is smaller than the first threshold, judge whether the number of elements in the field after de-duplication is greater than 1 and not greater than a second threshold, and if yes, determine the field as a categorical field, wherein the second threshold is related to and less than the total number of elements in the field.
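This two-stage judgment can be sketched as follows. The regular expression, the threshold values, the choice of the second threshold as a fixed fraction of the element count, and the `"unknown"` label for fields that match neither rule are all assumptions made for illustration; the embodiment only requires that the first regular expression match floating point numbers and that the second threshold be related to and less than the total number of elements.

```python
import re

# Assumption: a float is an optional sign, digits, a decimal point, digits.
FLOAT_PATTERN = re.compile(r"[+-]?\d+\.\d+")


def field_type(elements, first_threshold=0.9, second_ratio=0.5):
    """Classify a field of string elements as numeric or categorical."""
    total = len(elements)
    if total == 0:
        return "unknown"
    matched = sum(1 for e in elements if FLOAT_PATTERN.fullmatch(e.strip()))
    if matched / total >= first_threshold:
        return "numeric"
    distinct = len(set(elements))            # element count after de-duplication
    second_threshold = second_ratio * total  # related to and below the total
    if 1 < distinct <= second_threshold:
        return "categorical"
    return "unknown"
```

Under these assumed settings, a field of integer codes such as "1", "2", "1", "2" is not matched by the float pattern and is instead judged by its number of distinct values.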

In an alternative, the field type determining unit 701 may be further configured to: for any field to be analyzed, judge whether the number of elements in the field after de-duplication is greater than 1 and not greater than a second threshold, and if yes, determine the field as a categorical field, wherein the second threshold is related to and less than the total number of elements in the field; if the number of elements in the field after de-duplication is 1 or is greater than the second threshold, judge whether the proportion of elements in the field that match a preset second regular expression is not less than a third threshold, and if yes, determine the field as a numeric field, wherein the second regular expression is used for matching floating point numbers and integers.

In a specific application, the correlation calculation unit 703 may be further configured to: divide the inter-group variance by the intra-group variance to obtain a correlation initial value of the two fields, and determine the natural logarithm of the correlation initial value as a correlation intermediate value; and transform the correlation intermediate value into the value interval from zero to one to form the correlation index of the two fields.

In practical applications, the correlation calculation unit 703 may be further configured to: when the correlation intermediate value is less than zero, determining the correlation index as zero; when the correlation degree intermediate value is larger than a first numerical value, determining the correlation degree index as one; wherein the first value is a real number greater than one; when the correlation degree intermediate value is not less than zero and not more than a first numerical value, determining the correlation degree index as a product of the correlation degree intermediate value and a second numerical value; wherein the second value is the inverse of the first value.

As a preferable scheme, the correlation calculation unit 703 may be further configured to: when any two fields to be analyzed in the database table are both numeric fields, determine the absolute value of the Spearman correlation coefficient of the two fields as the correlation index of the two fields; when any two fields to be analyzed in the database table are both categorical fields, determine the Cramér correlation coefficient of the two fields as the correlation index of the two fields.

Preferably, the apparatus 700 may further comprise a first visualization unit for: after obtaining the relevance indexes of any two fields to be analyzed in the database table, inputting the relevance indexes into a preset relevance matrix; the row number and the column number of the correlation matrix are both equal to the total number of the fields to be analyzed of the database table, each row and each column respectively correspond to the identifiers of the fields to be analyzed in the database table which are arranged in a preset sequence, any element in the correlation matrix is a correlation index between the field corresponding to the row where the element is located and the field corresponding to the column where the element is located, and the gray value of the element is positively correlated with the correlation index.

Furthermore, in an embodiment of the present invention, the apparatus 700 may further comprise a second visualization unit for: after obtaining the relevance indexes of any two fields to be analyzed in the database table, inputting the relevance indexes into a preset weight connection diagram; the weight connection graph comprises nodes which are arranged along the circumferential direction and used for representing fields to be analyzed in the database table, and connecting lines which are positioned between any two nodes and used for representing correlation indexes; the nodes are configured with different colors for representing different field types, the connecting lines are configured with different colors for representing different correlation index types, the width and the color depth of the connecting line are positively correlated with the correlation index represented by the connecting line, and the correlation index types comprise: a relevance indicator between two numeric fields, a relevance indicator between two categorical fields, and a relevance indicator between a numeric field and a categorical field.

FIG. 8 illustrates an exemplary system architecture 800 for a method of determining the relevance of fields in a database table or an apparatus for determining the relevance of fields in a database table to which embodiments of the present invention may be applied.

As shown in fig. 8, the system architecture 800 may include terminal devices 801, 802, 803, a network 804 and a server 805 (this architecture is merely an example, and the components included in a particular architecture may be adapted to the specific circumstances of the application). The network 804 serves as the medium providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired or wireless communication links, or fiber optic cables.

A user may use the terminal devices 801, 802, 803 to interact with the server 805 over the network 804 to receive or send messages and the like. Various client applications, such as applications that perform correlation statistics (for example only), may be installed on the terminal devices 801, 802, 803.

The terminal devices 801, 802, 803 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like.

The server 805 may be a server that provides various services, such as a calculation server (for example only) that supports the correlation statistics applications operated by users on the terminal devices 801, 802, 803. The calculation server may process a received correlation calculation request and feed back the processing result (e.g., the calculated correlation index, as an example only) to the terminal devices 801, 802, 803.

It should be noted that the method for determining the field relevancy in the database table provided by the embodiment of the present invention is generally executed by the server 805, and accordingly, the apparatus for determining the field relevancy in the database table is generally disposed in the server 805.

It should be understood that the number of terminal devices, networks, and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The invention also provides an electronic device. The electronic device of the embodiment of the invention comprises: one or more processors; and a storage device for storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for determining the field relevancy in the database table provided by the invention.

Referring now to FIG. 9, shown is a block diagram of a computer system 900 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 9 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU)901 that can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM)902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM903, various programs and data necessary for the operation of the computer system 900 are also stored. The CPU901, ROM 902, and RAM903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output section 907 including a display such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 908 including a hard disk and the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. A drive 910 is also connected to the I/O interface 905 as necessary. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 910 as necessary, so that a computer program read out therefrom is installed into the storage section 908 as necessary.

In particular, the processes described in the main step diagrams above may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the main step diagram. In the above-described embodiment, the computer program can be downloaded and installed from the network via the communication section 909, and/or installed from the removable medium 911. The computer program, when executed by the central processing unit 901, performs the above-described functions defined in the system of the present invention.

It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units described in the embodiments of the present invention may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor includes a field type determination unit, a grouping unit, and a correlation calculation unit. The names of these units do not in some cases constitute a limitation on the units themselves; for example, the field type determination unit may also be described as a "unit providing a field type to the grouping unit".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments, or may exist separately without being incorporated into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to perform steps comprising: for any two fields to be analyzed in a database table, judging the field type of each field according to the elements of the field; wherein the field types include: a numeric field and a categorical field, the elements in the categorical field belonging to at least two element categories; when one of the two fields is a numeric field and the other field is a categorical field, determining the elements belonging to the same element category in the categorical field, and forming an analysis group from the elements in the numeric field corresponding to those elements; determining an inter-group variance and an intra-group variance of the analysis groups, and obtaining the correlation index of the two fields according to the inter-group variance and the intra-group variance.

The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for determining relevance of fields in a database table, comprising:

for any two fields to be analyzed in a database table, judging the field type of each field according to the element of the field; wherein the field types include: a numeric field and a categorical field, the elements in the categorical field belonging to at least two element categories;

when one of the two fields is a numerical field and the other field is a categorical field, determining elements belonging to the same element category in the categorical field, and forming an analysis group by the elements in the numerical field corresponding to the elements;

determining an interclass variance and an intraclass variance for each analysis group, and obtaining a correlation index of the two fields according to the interclass variance and the intraclass variance.

2. The method of claim 1, wherein determining the field type of each field according to the element of the field comprises:

for any field to be analyzed, judging whether the proportion of elements in the field, which accord with a preset first regular expression, is not less than a first threshold value: if yes, determining the field as a numerical field; the first regular expression is used for matching floating point numbers;

if the proportion of the elements in the field which accord with the first regular expression is smaller than the first threshold, judging whether the number of the elements in the field after the duplication removal is larger than 1 and not larger than a second threshold: if yes, determining the field as a categorical field; wherein the second threshold is related to and less than the total number of elements in the field.

3. The method of claim 1, wherein determining the field type of each field according to the element of the field comprises:

for any field to be analyzed, judging whether the number of elements in the field after de-duplication is larger than 1 and not larger than a second threshold value: if yes, determining the field as a categorical field; wherein the second threshold is related to and less than the total number of elements in the field;

if the number of the elements in the field after the duplication removal is 1 or is greater than a second threshold, whether the proportion of the elements in the field which accord with a preset second regular expression is not less than a third threshold is judged: if yes, determining the field as a numerical field; wherein the second regular expression is used for matching floating point numbers and integers.

4. The method of claim 1, wherein obtaining the correlation indicator for the two fields according to the inter-group variance and the intra-group variance comprises:

dividing the inter-group variance by the intra-group variance to obtain a correlation initial value of the two fields, and determining the natural logarithm of the correlation initial value as a correlation intermediate value;

and transforming the correlation intermediate value to a value interval from zero to one to form a correlation index of the two fields.

5. The method of claim 4, wherein transforming the correlation intermediate value to a value range from zero to one to form a correlation indicator for the two fields comprises:

when the correlation intermediate value is less than zero, determining the correlation index as zero;

when the correlation degree intermediate value is larger than a first numerical value, determining the correlation degree index as one; wherein the first value is a real number greater than one;

when the correlation degree intermediate value is not less than zero and not more than a first numerical value, determining the correlation degree index as a product of the correlation degree intermediate value and a second numerical value; wherein the second value is the inverse of the first value.

6. The method of claim 4, further comprising:

when any two fields to be analyzed in the database table are numerical fields, determining the absolute value of the Spearman correlation coefficient of the two fields as the correlation index of the two fields;

when any two fields to be analyzed in the database table are categorical fields, determining the Cramér correlation coefficient of the two fields as the correlation index of the two fields.

7. The method of claim 6, further comprising:

after obtaining the relevance indexes of any two fields to be analyzed in the database table, inputting the relevance indexes into a preset relevance matrix; wherein

the row number and the column number of the correlation matrix are both equal to the total number of the fields to be analyzed in the database table, each row and each column respectively correspond to the identifiers of the fields to be analyzed in the database table which are arranged in a preset sequence, any element in the correlation matrix is a correlation index between the field corresponding to the row where the element is located and the field corresponding to the column where the element is located, and the gray value of the element is positively correlated with the correlation index.

8. The method of claim 6, further comprising:

after obtaining the relevance indexes of any two fields to be analyzed in the database table, inputting the relevance indexes into a preset weight connection diagram; wherein

the weight connection graph comprises nodes which are arranged along the circumferential direction and used for representing fields to be analyzed in the database table, and connecting lines which are arranged between any two nodes and used for representing correlation indexes; the nodes are configured with different colors for representing different field types, the connecting lines are configured with different colors for representing different correlation index types, the width and the color depth of the connecting line are positively correlated with the correlation index represented by the connecting line, and the correlation index types comprise: a relevance indicator between two numeric fields, a relevance indicator between two categorical fields, and a relevance indicator between a numeric field and a categorical field.

9. An apparatus for determining relevance of fields in a database table, comprising:

a field type determination unit to: for any two fields to be analyzed in a database table, judging the field type of each field according to the element of the field; wherein the field types include: a numeric field and a categorical field, the elements in the categorical field belonging to at least two element categories;

a grouping unit for: when one of the two fields is a numerical field and the other field is a categorical field, determining elements belonging to the same element category in the categorical field, and forming an analysis group by the elements in the numerical field corresponding to the elements;

and the correlation calculation unit is used for determining the interclass variance and the intraclass variance of each analysis group and obtaining the correlation indexes of the two fields according to the interclass variance and the intraclass variance.

10. An electronic device, comprising:

one or more processors;

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.

11. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-8.