CN113761297B

CN113761297B - Method and device for determining field relatedness in database table

Info

Publication number: CN113761297B
Application number: CN202011248181.1A
Authority: CN
Inventors: 张蒙
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Wodong Tianjun Information Technology Co Ltd
Priority date: 2020-11-10
Filing date: 2020-11-10
Publication date: 2024-06-18
Anticipated expiration: 2040-11-10
Also published as: CN113761297A

Abstract

The invention discloses a method and a device for determining field relatedness in a database table, and relates to the technical field of computers. One embodiment of the method comprises the following steps: for any two fields to be analyzed in the database table, judging the field type of each field according to the elements of that field; the field types include a numeric field and a classification field, the elements in a classification field belonging to at least two element categories; when one of the two fields is a numeric field and the other is a classification field, determining the elements belonging to the same element category in the classification field, and forming an analysis group from the elements in the numeric field corresponding to those elements; determining an inter-group variance and an intra-group variance for each analysis group, and obtaining a correlation index for the two fields from the inter-group variance and the intra-group variance. According to the method and the device, the correlation can be calculated quantitatively for a numeric field and a classification field in any database table, which facilitates unified analysis of the correlation of different types of fields.

Description

Method and device for determining field relatedness in database table

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for determining field relevance in a database table.

Background

There are many scenarios in which the correlation between different fields in a database table needs to be determined. For example, in a data analysis scenario, the information held by the data provider and the data demander is often asymmetric, and the database table itself has a certain complexity, which leads to problems such as unclear data requirements and frequent data corrections. In such cases, the degree of correlation between different fields in the database table needs to be analyzed to give the data demander a valuable reference and significantly improve working efficiency. In the prior art, the correlation can be calculated by methods such as the Pearson correlation coefficient, chosen according to whether the fields to be analyzed are numeric fields or classification fields.

In carrying out the invention, the inventors have found that the prior art has at least the following problems. First, when faced with a large number of fields to be analyzed whose types are unknown, the prior art cannot identify the field types quickly and accurately. Second, when calculating the correlation between numeric fields and certain classification fields (e.g., gender), the prior art describes the degree of correlation only qualitatively, which cannot meet the needs of certain application environments. Third, the prior art lacks a unified standard for correlation analysis of database table fields across these different situations.

Disclosure of Invention

In view of this, the embodiments of the present invention provide a method and an apparatus for determining field correlation in a database table, which can quantitatively calculate the correlation between a numeric field and a classification field in any database table, and help to achieve unified analysis of the correlation of fields of different types.

To achieve the above object, according to one aspect of the present invention, there is provided a method of determining a degree of correlation of fields in a database table.

The method for determining the field correlation degree in the database table comprises the following steps: for any two fields to be analyzed in the database table, judging the field type of each field according to the elements of that field; wherein the field types include a numeric field and a classification field, the elements in a classification field belonging to at least two element categories; when one of the two fields is a numeric field and the other is a classification field, determining the elements belonging to the same element category in the classification field, and forming an analysis group from the elements in the numeric field corresponding to those elements; and determining an inter-group variance and an intra-group variance for each analysis group, and obtaining a correlation index for the two fields from the inter-group variance and the intra-group variance.

Optionally, the determining, according to the elements of each field, the field type to which the field belongs includes: for any field to be analyzed, judging whether the proportion of elements in the field that conform to a preset first regular expression is not smaller than a first threshold: if yes, determining the field as a numeric field, wherein the first regular expression is used for matching floating point numbers; if the proportion of elements in the field conforming to the first regular expression is smaller than the first threshold, judging whether the number of elements in the field after deduplication is greater than 1 and not greater than a second threshold: if yes, determining the field as a classification field; wherein the second threshold is related to and less than the total number of elements in the field.

Optionally, the determining, according to the elements of each field, the field type to which the field belongs includes: for any field to be analyzed, judging whether the number of elements in the field after deduplication is greater than 1 and not greater than a second threshold: if yes, determining the field as a classification field; wherein the second threshold is related to and less than the total number of elements in the field; if the number of elements in the field after deduplication is 1 or greater than the second threshold, judging whether the proportion of elements in the field that conform to a preset second regular expression is not less than a third threshold: if yes, determining the field as a numeric field; wherein the second regular expression is used to match floating point numbers and integers.

Optionally, the obtaining the correlation index of the two fields according to the inter-group variance and the intra-group variance includes: dividing the inter-group variance by the intra-group variance to obtain a correlation initial value of the two fields, and determining a natural logarithm of the correlation initial value as a correlation intermediate value; and transforming the correlation intermediate value into a numerical value interval from zero to one to form correlation indexes of the two fields.

Optionally, the transforming the correlation intermediate value into a value interval from zero to one to form the correlation index of the two fields includes: when the correlation intermediate value is smaller than zero, determining the correlation index as zero; when the correlation intermediate value is larger than a first value, determining the correlation index as one, wherein the first value is a real number greater than one; and when the correlation intermediate value is not less than zero and not greater than the first value, determining the correlation index as the product of the correlation intermediate value and a second value, wherein the second value is the reciprocal of the first value.

Optionally, the method further comprises: when any two fields to be analyzed in the database table are both numeric fields, determining the absolute value of the Spearman correlation coefficient of the two fields as the correlation index of the two fields; and when any two fields to be analyzed in the database table are both classification fields, determining the Cramér's V correlation coefficient of the two fields as the correlation index of the two fields.

Optionally, the method further comprises: after obtaining the correlation indexes of any two fields to be analyzed in the database table, inputting the correlation indexes into a preset correlation matrix; the number of rows and the number of columns of the correlation matrix are equal to the total number of fields to be analyzed in the database table, each row and each column respectively correspond to the identification of the fields to be analyzed in the database table arranged in a preset sequence, any element in the correlation matrix is a correlation index between a field corresponding to the row where the element is located and a field corresponding to the column where the element is located, and the gray value of the element is positively correlated with the correlation index.

Optionally, the method further comprises: after obtaining the correlation index of any two fields to be analyzed in the database table, inputting the correlation index into a preset weight connection diagram; the weight connection diagram comprises nodes arranged along a circumference and used for representing the fields to be analyzed in the database table, and connecting lines located between any two nodes and used for representing correlation indexes; the nodes are configured with different colors for representing different field types, the connecting lines are configured with different colors for representing different correlation index types, the width and the color depth of a connecting line are positively correlated with the correlation index it represents, and the correlation index types comprise: a correlation index between two numeric fields, a correlation index between two classification fields, and a correlation index between a numeric field and a classification field.

To achieve the above object, according to another aspect of the present invention, there is provided an apparatus for determining a degree of correlation of fields in a database table.

The device for determining the field correlation degree in the database table according to the embodiment of the invention can comprise: a field type determining unit configured to judge, for any two fields to be analyzed in the database table, the field type of each field according to the elements of that field, wherein the field types include a numeric field and a classification field, the elements in a classification field belonging to at least two element categories; a grouping unit configured to, when one of the two fields is a numeric field and the other is a classification field, determine the elements belonging to the same element category in the classification field, and form an analysis group from the elements in the numeric field corresponding to those elements; and a correlation calculation unit configured to determine an inter-group variance and an intra-group variance for each analysis group, and obtain the correlation index of the two fields according to the inter-group variance and the intra-group variance.

To achieve the above object, according to still another aspect of the present invention, there is provided an electronic apparatus.

An electronic apparatus of the present invention includes: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the method for determining the field relatedness in the database table.

To achieve the above object, according to still another aspect of the present invention, there is provided a computer-readable storage medium.

A computer readable storage medium of the present invention has stored thereon a computer program which when executed by a processor implements the method of determining field relevance in a database table provided by the present invention.

According to the technical scheme of the invention, the embodiments of the invention have the following advantages or beneficial effects. When performing correlation analysis on fields in a database table, the field types are first judged quickly and accurately according to the field elements and preset regular expressions. The correlation index is then calculated by a method suited to the field types: when the two fields to be analyzed are both numeric fields, the absolute value of their Spearman correlation coefficient is used as the correlation index, and when the two fields to be analyzed are both classification fields, their Cramér's V correlation coefficient is used as the correlation index. In particular, when one of the two fields to be analyzed is a numeric field and the other is a classification field, the embodiment of the present invention first divides the elements in the numeric field into a plurality of analysis groups according to the element categories in the classification field, then calculates the inter-group variance and the intra-group variance of the analysis groups and derives the correlation index from the quotient of the two, thereby realizing quantitative correlation analysis of the numeric field and the classification field (the specific principle is described below). In summary, the embodiments of the invention provide a unified analysis standard from field type judgment to correlation calculation: whether facing two numeric fields, two classification fields, or one numeric field and one classification field, quantitative correlation analysis can be performed and a correlation index between zero and one obtained. Finally, the embodiments of the invention can generate a correlation matrix or a weight connection diagram based on the correlation indexes, thereby realizing visual output of the correlation analysis.

Further effects of the above-described non-conventional alternatives are described below in connection with the embodiments.

Drawings

The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:

FIG. 1 is a schematic diagram of main steps of a method for determining field relatedness in a database table according to an embodiment of the present invention;

FIG. 2 is a global schematic of a saturated logarithmic function of an embodiment of the invention;

FIG. 3 is a partial schematic diagram of a saturated logarithmic function of an embodiment of the invention;

FIG. 4 is a schematic diagram showing a specific implementation of a method for determining field relatedness in a database table according to an embodiment of the present invention;

FIG. 5 is a schematic diagram of a correlation matrix according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of a weight connection diagram in an embodiment of the invention;

FIG. 7 is a schematic diagram of components of an apparatus for determining field relatedness in database tables according to an embodiment of the present invention;

FIG. 8 is an exemplary system architecture diagram to which embodiments in accordance with the present invention may be applied;

Fig. 9 is a schematic diagram of an electronic device for implementing a method for determining field relevance in a database table according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present invention are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that the embodiments of the present invention and the technical features in the embodiments may be combined with each other provided that they do not conflict.

Example 1

Fig. 1 is a schematic diagram of main steps of a method for determining field relatedness in a database table according to an embodiment of the present invention. As shown in fig. 1, the method for determining the field correlation degree in the database table according to the embodiment of the present invention may specifically be performed according to the following steps:

Step S101: for any two fields to be analyzed in the database table, judging the field type of each field according to the element of the field.

In this step, the two fields to be analyzed may be fields of the same database table, or may be any fields in a plurality of database tables that can be associated with each other. For example, if an employee performance assessment table and an employee base table may be associated via a common employee name field, a relevance analysis may be performed on any two fields in the two database tables. In embodiments of the present invention, the field types may include a numeric type field, a category type field, and other types of fields that are different from both numeric and category type fields.

The elements in a numeric field (i.e., the values of the database table in that field) are represented as integers, decimals, and the like; their magnitudes generally have practical meaning and can be compared with each other. For example, the payroll amount field and the overtime duration field in the employee performance assessment table generally belong to numeric fields. The elements in a classification field can generally be divided into at least two element categories. For example, the gender field and the age group field in the employee base table generally belong to classification fields: the gender field includes two element categories, "male" and "female", and the age group field may include element categories such as "10 to 20 years old", "20 to 30 years old", "30 to 40 years old", "40 to 50 years old", etc. Classification fields can be divided into ordered classification fields and unordered classification fields: the element categories of an ordered classification field can be compared with each other (for example, by size) and ordered according to their actual meaning, whereas the element categories of an unordered classification field cannot be ordered according to their actual meaning. For example, the gender field is an unordered classification field, and the age group field is an ordered classification field. A mobile phone number field belongs to the other types of fields, because a mobile phone number generally has no practical meaning at the numerical level and does not show the property of belonging to different categories; in embodiments of the present invention, correlation analysis may be omitted for other types of fields. For an age field, since its elements have both numerical and categorical meaning, the field may be treated as either a numeric field or a classification field.

In one alternative, the field type of any field may be determined according to the following steps. For any field to be analyzed, it is first judged whether the proportion of elements in the field that conform to a preset first regular expression is not smaller than a first threshold, that is, whether the following formula is satisfied:

n_num≥η·n

Where n_num represents the number of elements conforming to the first regular expression and n represents the total number of elements in the field. η is a preset first threshold, which may be a number greater than zero and less than one. The first regular expression may be a regular expression for matching floating point numbers (i.e., numbers with decimal points), such as "\\d" (where the first "\" is a character escape), used to match elements that have digits after the decimal point.

If the above formula is satisfied, determining the field as a numeric field; otherwise, judging whether the number of the elements after the duplication removal in the field is greater than 1 and not greater than a second threshold, namely meeting the following conditions:

1 < n_dedup ≤ β

Wherein n_dedup represents the number of elements in the field after deduplication, and β is a preset second threshold.

If the above condition is satisfied, the field is determined as a classification field; otherwise, the field is determined as an other-type field. The second threshold β is related to and less than the total number of elements n in the field, and may be, for example, n^α (α is a positive number less than 1).

In practical application scenarios, most elements of a numeric field are floating point numbers. The above field type judging method therefore first judges a field in which a certain proportion of the elements are floating point numbers to be a numeric field, then identifies a classification field by judging whether the number of distinct elements in the field is smaller than, or far smaller than, the total number of elements (the rule for "far smaller" can be set flexibly), and treats the fields matched by neither judgment as other types of fields. This achieves accurate and rapid judgment of field types and solves the problem that field types cannot be determined in time when facing a large number of unknown fields. In a few cases, the above method may judge a numeric field whose values are all integers to be an other-type field, but since most numeric fields contain floating point numbers, this limitation does not affect the practical effect of the method.
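The first judging order described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions: the function name `infer_field_type`, the pattern `\.\d` standing in for the first regular expression, and the default values η = 0.8 and α = 0.5 are hypothetical choices, not values given in this document.

```python
import re

# Hypothetical stand-in for the first regular expression:
# a digit directly after a decimal point marks a floating point number.
FLOAT_RE = re.compile(r"\.\d")


def infer_field_type(elements, eta=0.8, alpha=0.5):
    """Return 'numeric', 'classification' or 'other' for a list of raw field values."""
    n = len(elements)
    if n == 0:
        return "other"  # assumption: an empty field is not analyzed
    # Step 1: n_num >= eta * n means the field is numeric.
    n_num = sum(1 for e in elements if FLOAT_RE.search(str(e)))
    if n_num >= eta * n:
        return "numeric"
    # Step 2: 1 < n_dedup <= beta, with beta = n ** alpha, means classification.
    n_dedup = len(set(elements))
    beta = n ** alpha
    if 1 < n_dedup <= beta:
        return "classification"
    # Fields matched by neither judgment are other-type fields.
    return "other"
```

With these assumed thresholds, a column of decimal payroll amounts is judged numeric, a gender column is judged a classification field, and a column of distinct mobile phone numbers falls through to the other type.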

In another alternative, the field type of any field may be determined according to the following steps. For any field to be analyzed, it is first judged whether the number of elements in the field after deduplication is greater than 1 and not greater than a second threshold: if yes, the field is determined as a classification field; otherwise, it is judged whether the proportion of elements in the field that conform to a preset second regular expression is not smaller than a third threshold, that is, whether the following formula is satisfied:

n₀≥τ·n

Where n₀ represents the number of elements conforming to the second regular expression, and τ is a preset third threshold, which may be a number greater than zero and less than one. The second regular expression may be a regular expression for matching floating point numbers and integers (i.e., numbers with or without decimal points), such as "\d" (this regular expression matches elements containing digits).

If the above formula is satisfied, the field is determined as a numeric field; if the above formula is not satisfied, the field is determined as an other-type field, since the classification judgment has already failed at this point.

It can be understood that the field type judging method can execute the classification type judgment first and then execute the numerical type judgment, and can also realize the accurate and rapid judgment of the field type. In certain cases, for other types of fields (e.g., a cell phone number field) that are integer valued, the method may determine it as a numeric field, which will not have a significant impact on the actual use of the method because of the low probability of occurrence.
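The second judging order (classification first, then numeric) can be sketched in the same way. The function name `infer_field_type_v2` and the defaults τ = 0.8 and α = 0.5 are hypothetical, and treating the fall-through case as an other-type field is an assumption of this sketch.

```python
import re

# Second regular expression as quoted in the text: matches elements containing a digit.
NUMBER_RE = re.compile(r"\d")


def infer_field_type_v2(elements, tau=0.8, alpha=0.5):
    """Classification judgment first, numeric judgment second."""
    n = len(elements)
    if n == 0:
        return "other"  # assumption: an empty field is not analyzed
    # Step 1: 1 < n_dedup <= beta, with beta = n ** alpha, means classification.
    n_dedup = len(set(elements))
    if 1 < n_dedup <= n ** alpha:
        return "classification"
    # Step 2: n0 >= tau * n means the field is numeric.
    n0 = sum(1 for e in elements if NUMBER_RE.search(str(e)))
    if n0 >= tau * n:
        return "numeric"
    return "other"  # assumption: neither judgment matched
```

As the text notes, this ordering judges an integer-valued mobile phone number column to be numeric, which is the known limitation of this variant.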

Step S102: when one of the two fields is a numerical field and the other is a classification field, determining the elements belonging to the same element category in the classification field, and forming an analysis group by the elements in the numerical field corresponding to the elements.

In the prior art, when one of the fields to be analyzed is a numeric field and the other is a classification field, generally only qualitative correlation analysis can be performed; the present embodiment provides a quantitative correlation analysis method for this case. Specifically, the elements belonging to the same element category in the classification field are first determined, and the elements in the numeric field corresponding to those elements then form an analysis group. It will be appreciated that if a first element in the classification field corresponds to a second element in the numeric field, then either the first element and the second element are in the same record (when the classification field and the numeric field belong to the same database table), or the first element and the second element correspond to the same element of an association field (when the classification field and the numeric field belong to different database tables). For example, suppose the two fields to be analyzed are a payroll amount field and a gender field, as follows (each field has five elements, from record 1 to record 5, arranged from top to bottom):

Payroll amount	Gender
1223.12	Male
2154.56	Female
1896.51	Male
3021.55	Female
2136.96	Female

Then 1223.12, 1896.51 corresponding to the element category "male" may be grouped into one analysis group and 2154.56, 3021.55, 2136.96 corresponding to the element category "female" into another analysis group.
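The grouping step can be sketched as follows, using the payroll and gender values from the example. The function name `build_analysis_groups` is hypothetical, and the sketch assumes both fields come from the same table so that elements correspond by record position.

```python
from collections import defaultdict


def build_analysis_groups(numeric_values, categories):
    """Group numeric elements by the element category found in the same record."""
    groups = defaultdict(list)
    for value, category in zip(numeric_values, categories):
        groups[category].append(value)
    return dict(groups)


payroll = [1223.12, 2154.56, 1896.51, 3021.55, 2136.96]
gender = ["male", "female", "male", "female", "female"]
groups = build_analysis_groups(payroll, gender)
```

For fields from different tables, the two lists would first be aligned through the association field before being passed in.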

Step S103: inter-group variance and intra-group variance for each analysis group are determined, and a correlation index for the two fields is obtained from the inter-group variance and intra-group variance.

In the present embodiment, the intra-group variance is used to represent the degree of data dispersion inside the analysis groups, and the inter-group variance is used to represent the degree of data dispersion between the analysis groups. In general, the inter-group variance MSB and the intra-group variance MSE may be calculated by the following formulas:

$$\mathrm{MSB} = \frac{1}{r-1}\sum_{i=1}^{r} n_i\,(\bar{x}_i - \bar{x})^2, \qquad \mathrm{MSE} = \frac{1}{n-r}\sum_{i=1}^{r}\sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^2$$

Wherein r represents the number of element categories of the classification field to be analyzed, n_i represents the number of elements of the i-th element category in the classification field, n represents the total number of elements (the sum of all n_i), x_ij represents the j-th element of the numeric field to be analyzed corresponding to the i-th element category in the classification field, x̄_i represents the average value of the elements in the numeric field corresponding to the i-th element category in the classification field, x̄ represents the average value of all elements in the numeric field, and i and j are positive integers.
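A minimal sketch of the variance computation follows, assuming the standard one-way analysis-of-variance mean squares (sum of squares between groups divided by r − 1, sum of squares within groups divided by n − r). The function name `group_variances` and the sample data are hypothetical; the sketch requires at least two groups and more elements than groups.

```python
def group_variances(groups):
    """groups maps an element category to its list of numeric values; returns (MSB, MSE)."""
    r = len(groups)                                  # number of element categories
    n = sum(len(v) for v in groups.values())         # total number of elements
    grand_mean = sum(sum(v) for v in groups.values()) / n
    ssb = 0.0  # sum of squares between groups
    sse = 0.0  # sum of squares within groups
    for values in groups.values():
        mean_i = sum(values) / len(values)
        ssb += len(values) * (mean_i - grand_mean) ** 2
        sse += sum((x - mean_i) ** 2 for x in values)
    return ssb / (r - 1), sse / (n - r)


msb, mse = group_variances({"a": [1, 2, 3], "b": [4, 5, 6]})
f_value = msb / mse  # correlation initial value F
```

Here the two group means (2 and 5) lie far apart relative to the spread inside each group, so F is well above 1.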

After the inter-group variance and the intra-group variance are obtained, a correlation index of the two fields may be obtained from them. Preferably, the correlation index of the two fields may be formed by dividing the inter-group variance MSB by the intra-group variance MSE to obtain the correlation initial value F of the two fields, determining the natural logarithm log_e F of the correlation initial value as the correlation intermediate value, and finally transforming the correlation intermediate value log_e F into the value interval from zero to one. As a preferred embodiment, the above transformation can be performed by the following formula:

$$R_{12} = \begin{cases} 0, & \log_e F < 0 \\ \dfrac{1}{\mu}\,\log_e F, & 0 \le \log_e F \le \mu \\ 1, & \log_e F > \mu \end{cases}$$

The above function is a saturated logarithmic function of F. Wherein R₁₂ represents the correlation index, μ is a preset first value, and μ is greater than 1 (e.g., optionally 10).

That is, when the correlation intermediate value log_e F is smaller than zero, the correlation index is determined as zero; when the correlation intermediate value is larger than the first value, the correlation index is determined as one; when the correlation intermediate value is not less than zero and not greater than the first value, the correlation index is determined as the product of the correlation intermediate value log_e F and a second value, which is the reciprocal of the first value. Fig. 2 is a global schematic diagram of the saturated logarithmic function according to an embodiment of the present invention, and Fig. 3 is a partial schematic diagram of the same function; in Fig. 2 and Fig. 3, the abscissa indicates the F value and the ordinate indicates the correlation index R₁₂.
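The saturated logarithmic function can be sketched as below. The function name `saturated_log_index` is hypothetical, μ = 10 follows the optional value mentioned in the text, and returning zero for a non-positive F (where the logarithm is undefined) is an assumption of this sketch.

```python
import math


def saturated_log_index(f, mu=10.0):
    """Map the correlation initial value F to a correlation index in [0, 1]."""
    if f <= 0:
        return 0.0  # assumption: logarithm undefined, treat as no correlation
    intermediate = math.log(f)  # correlation intermediate value log_e F
    if intermediate < 0:
        return 0.0
    if intermediate > mu:
        return 1.0
    return intermediate / mu  # product with the second value 1/mu
```

An F value of 1 maps to zero, an F value of e^5 maps to about 0.5, and any F above e^10 saturates at one.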

The calculation principle of the above correlation index is as follows. For the analysis groups formed according to the element categories in the classification field, the inter-group variance MSB is determined both by the individual differences E of the elements in the numeric field and by the processing factor differences T existing between the different analysis groups, whereas the intra-group variance MSE is determined only by the individual differences E of the elements within the analysis groups. Since the different analysis groups correspond to different element categories of the classification field, the processing factor differences T are determined by the degree of correlation between the elements in the analysis groups and the corresponding element categories of the classification field. Thus, as the degree of correlation between the elements within the analysis groups and the corresponding element categories increases (i.e., the degree of correlation between the numeric field and the classification field increases), the processing factor difference T between the different analysis groups increases, resulting in an increase in the ratio F of the inter-group variance MSB to the intra-group variance MSE; conversely, when the degree of correlation between the numeric field and the classification field decreases, the processing factor difference T between the different analysis groups decreases, resulting in a decrease in F; when the numeric field is uncorrelated with the classification field, the processing factor difference T between the different analysis groups is zero and F is approximately equal to 1, because MSB and MSE then both reflect only the individual differences E. Therefore, F can be used to measure the degree of correlation of the two fields.

In practical application, the F value varies over a wide range, so a logarithm is applied to F to compress this range. Since log_e 20000 ≈ 10, and a number of tests indicate that in most cases the F values of a numeric field and a classification field fall in the interval [1, 20000], the correlation index can be described by the logarithmic function 0.1·log_e F. For the rare F values that fall outside this common interval, so that the logarithmic function leaves the range [0, 1], saturation is used to clamp the result to the boundary value. In this way, F is converted by the saturated logarithmic function into a correlation index R₁₂ in the ideal interval, realizing quantitative analysis of the correlation between the numeric field and the classification field. In some alternative implementations, the correlation initial value F may be used directly as the correlation index, and other calculation results of the inter-group variance MSB and the intra-group variance MSE, such as (MSB-MSE)/MSE, may be determined as the correlation initial value F; when transforming the correlation intermediate value log_e F, any other applicable transformation method may be used, not limited to the above saturated logarithmic function.

In this embodiment, when any two fields X, Y to be analyzed in the database table are both numeric fields, the absolute value of the Spearman correlation coefficient of the two fields is determined as the correlation index of the two fields. Specifically, the elements in X and Y are first sorted in ascending order, the sorted lists are denoted X_o and Y_o, and the Spearman correlation coefficient r_XY is then calculated according to the following formula:

$$r_{XY} = 1 - \frac{6\sum_{i=1}^{N} (p_i - q_i)^2}{N\,(N^2 - 1)}$$

Wherein p_i represents the position in X_o of the i-th element of X, q_i represents the position in Y_o of the i-th element of Y, and N represents the number of elements of either field (the two fields to be analyzed have the same number of elements). Finally, the absolute value of r_XY can be used as the correlation index. It will be appreciated that this correlation index lies in the value interval from zero to one.
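The Spearman-based index for two numeric fields can be sketched as follows. The sketch assumes the common rank-difference form of the coefficient, 1 − 6·Σd² / (N·(N² − 1)), and assumes there are no tied values (ties would need averaged ranks, which this sketch does not implement). The function name `spearman_index` is hypothetical.

```python
def spearman_index(x, y):
    """Absolute Spearman coefficient of two equally long numeric lists without ties."""
    n = len(x)

    def positions(values):
        # Position (1-based) of each original element in the ascending sorted list.
        order = sorted(range(n), key=lambda i: values[i])
        pos = [0] * n
        for rank, i in enumerate(order, start=1):
            pos[i] = rank
        return pos

    px, py = positions(x), positions(y)
    d2 = sum((a - b) ** 2 for a, b in zip(px, py))
    r_xy = 1 - 6 * d2 / (n * (n * n - 1))
    return abs(r_xy)
```

Because the absolute value is taken, a perfectly increasing and a perfectly decreasing relationship both give an index of one.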

When any two fields to be analyzed in the database table (e.g., field 1 and field 2) are both classification fields, the Cramér's V correlation coefficient of the two fields is determined as the correlation index. Specifically, a chi-square independence analysis is first performed on field 1 and field 2. Let the numbers of element categories of field 1 and field 2 be s and c respectively, where s and c are each not less than 2, and build an s × c observed frequency contingency table from the elements in field 1 and field 2, in which the value f_ij of the cell in row i, column j is the number of elements whose value belongs to the i-th element category in field 1 and to the j-th element category in field 2. The expected frequency of each cell of the observed frequency contingency table is then calculated to generate an s × c expected frequency contingency table, in which the value of the cell in row i, column j is:

$$ e_{ij} = \frac{\left(\sum_{k=1}^{c} f_{ik}\right)\left(\sum_{k=1}^{s} f_{kj}\right)}{N} $$

where k is a positive integer and N represents the number of elements of field 1 or field 2 (field 1 and field 2 have the same number of elements).

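Next, the chi-square statistic χ² of field 1 and field 2 is calculated. If every expected frequency e_ij (the cell in row i, column j of the expected frequency contingency table) is not less than 5, then:

$$ \chi^2 = \sum_{i=1}^{s}\sum_{j=1}^{c}\frac{\left(f_{ij} - e_{ij}\right)^2}{e_{ij}} $$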

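If any expected frequency e_ij is less than 5, then:

$$ \chi^2 = \sum_{i=1}^{s}\sum_{j=1}^{c}\frac{\left(\left|f_{ij} - e_{ij}\right| - 0.5\right)^2}{e_{ij}} $$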

Finally, the Cramér's V correlation coefficient of field 1 and field 2 is calculated as the correlation index using the following formula:

$$ V = \sqrt{\frac{\chi^2}{N \cdot \min\left(s - 1,\; c - 1\right)}} $$

It will be appreciated that the correlation index calculated according to the above steps is in the value interval of zero to one.
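The steps for two classification fields may be sketched as follows. The function name is hypothetical, and the sketch assumes that the small-expected-frequency case uses the usual 0.5 continuity correction; it is an illustration under those assumptions rather than the definitive implementation.

```python
import math
from collections import Counter


def cramers_v(field1, field2):
    """Cramér's V of two categorical fields, from the chi-square statistic."""
    n = len(field1)
    if n != len(field2) or n == 0:
        raise ValueError("both fields need the same, non-zero number of elements")
    row_totals = Counter(field1)            # sum over k of f_ik
    col_totals = Counter(field2)            # sum over k of f_kj
    s, c = len(row_totals), len(col_totals)
    if s < 2 or c < 2:
        raise ValueError("each field needs at least two element categories")
    observed = Counter(zip(field1, field2))  # f_ij; missing cells count as 0

    expected = {
        (r, q): row_totals[r] * col_totals[q] / n
        for r in row_totals for q in col_totals
    }
    # Assumed rule: apply the 0.5 correction as soon as any expected frequency is below 5.
    correct = any(e < 5 for e in expected.values())

    chi2 = 0.0
    for cell, e in expected.items():
        diff = abs(observed[cell] - e)
        if correct:
            diff -= 0.5
        chi2 += diff ** 2 / e
    return math.sqrt(chi2 / (n * min(s - 1, c - 1)))
```

For example, two fields whose categories coincide element by element give V = 1, while two fields whose observed frequencies equal the expected frequencies in every cell give V = 0.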

With the above arrangement, once the field types have been quickly determined, the correlation index between two numeric fields, between two classification fields, or between a numeric field and a classification field can be accurately calculated, which provides a unified standard for correlation analysis of database table fields. After the correlation index of any two fields of the database table is obtained, it can be displayed by various data visualization methods.

Example two

Fig. 4 is a schematic diagram showing a specific implementation of a method for determining field relatedness in a database table according to an embodiment of the present invention. As shown in fig. 4, the method for determining field relatedness in a database table according to the embodiment of the present invention may comprise three parts: preprocessing, correlation analysis, and result output.

In the preprocessing section, data cleansing is first performed on the database table. Illustratively, if the database table is not in CSV (comma-separated values) format, the header line and each record line need to be split as strings; the formats of the elements in each field are unified, redundant spaces, punctuation, garbled characters and the like are removed, and invalid values such as NULL and None, together with missing values, are unified into empty (null) strings. The field type of each field to be analyzed can then be determined according to the method in the first embodiment. Finally, the correlation matrix and the weight connection graph are initialized (both are used for visualizing the correlation index and are described later).
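A minimal sketch of the cleansing step is given below. The set of invalid tokens, the separator, and the function names are assumptions made for illustration; removal of garbled characters is not attempted here.

```python
# Tokens treated as invalid values (assumed list; compared case-insensitively).
INVALID_TOKENS = {"null", "none", "nan", "n/a", ""}


def clean_element(raw):
    """Collapse whitespace, trim stray punctuation, and map invalid values to ''."""
    if raw is None:
        return ""
    text = " ".join(str(raw).split()).strip(" ,;")
    return "" if text.lower() in INVALID_TOKENS else text


def clean_table(header_line, record_lines, sep=","):
    """Split the header and each record, returning {field name: list of cleaned elements}."""
    header = [clean_element(h) for h in header_line.split(sep)]
    columns = {name: [] for name in header}
    for line in record_lines:
        cells = line.split(sep)
        cells += [""] * (len(header) - len(cells))   # pad missing trailing values
        for name, cell in zip(header, cells):
            columns[name].append(clean_element(cell))
    return columns


table = clean_table("age, dept", ["31 ,sales", "NULL, none", "28"])
```

Returning the data column by column keeps each field's elements together, which is the shape needed for the later field type judgment.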

In the correlation analysis part, correlation analysis is performed on any two fields in the database table. Before the analysis, it is checked whether the number of elements of each field to be analyzed is greater than a preset threshold number: if so, the subsequent analysis is executed; otherwise, the field is discarded. Thereafter, the correlation index can be calculated for each case according to the method described in the first embodiment.
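The check and the pairwise dispatch described above may be sketched as follows. Counting only non-empty elements, the type labels "numeric" and "categorical", and the method labels are assumptions of this sketch.

```python
from itertools import combinations


def select_method(type_a, type_b):
    """Pick the calculation that applies to a pair of field types."""
    kinds = {type_a, type_b}
    if kinds == {"numeric"}:
        return "spearman_abs"
    if kinds == {"categorical"}:
        return "cramers_v"
    return "saturated_log_f"   # one numeric field and one classification field


def plan_analysis(columns, field_types, min_elements):
    """Discard fields with too few elements, then list every remaining pair and its method."""
    kept = [
        name for name, values in columns.items()
        if sum(1 for v in values if v != "") > min_elements
    ]
    return [
        (a, b, select_method(field_types[a], field_types[b]))
        for a, b in combinations(kept, 2)
    ]


columns = {"age": ["31", "28", "40"], "dept": ["a", "b", "a"], "note": ["", "", "x"]}
types = {"age": "numeric", "dept": "categorical", "note": "categorical"}
plan = plan_analysis(columns, types, min_elements=2)
```

Here the field "note" has a single non-empty element and is dropped before any pair is formed.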

In the result output section, the correlation index may be entered into a preset correlation matrix and/or weight connection graph. Fig. 5 is a schematic diagram of a correlation matrix according to an embodiment of the present invention, and fig. 6 is a schematic diagram of a weight connection graph according to an embodiment of the present invention. As shown in fig. 5, the number of rows and the number of columns of the correlation matrix are both equal to the total number of fields to be analyzed in the database table, and the rows and the columns each correspond to the identifiers of the fields to be analyzed arranged in a preset order (that is, the columns from left to right and the rows from top to bottom each follow that order; a field identifier may be a field name). Any element in the correlation matrix is the correlation index between the field corresponding to the row of the element and the field corresponding to the column of the element, and the gray value of the element is positively correlated with the correlation index. It will be appreciated that the elements in the correlation matrix of fig. 5 are shown as percentages with the percent sign omitted.
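One possible way to fill such a correlation matrix is sketched below. The function names, the value 1.0 on the diagonal, the default of 0.0 for pairs that were not analyzed, and the 0 to 255 gray scale are assumptions of this sketch.

```python
def build_correlation_matrix(fields, pair_index):
    """Square matrix whose rows and columns follow the preset order given by `fields`."""
    n = len(fields)
    matrix = [[0.0] * n for _ in range(n)]
    for i, row_field in enumerate(fields):
        for j, col_field in enumerate(fields):
            if i == j:
                matrix[i][j] = 1.0   # a field compared with itself (assumed value)
            else:
                # The index of a pair does not depend on the order of its two fields.
                matrix[i][j] = pair_index.get(
                    (row_field, col_field),
                    pair_index.get((col_field, row_field), 0.0),
                )
    return matrix


def gray_level(index):
    """Gray depth that increases with the correlation index (0 = lightest, 255 = darkest)."""
    return round(255 * min(1.0, max(0.0, index)))


fields = ["age", "dept", "level"]
matrix = build_correlation_matrix(fields, {("age", "dept"): 0.4, ("dept", "level"): 0.7})
```

Because each pair is stored once and looked up in both orders, the resulting matrix is symmetric about its diagonal.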

As shown in fig. 6, the weight connection graph includes nodes arranged along the circumferential direction, which represent the fields to be analyzed in the database table, and connecting lines between any two nodes, which represent correlation indexes. The nodes are configured with different colors to represent different field types, and the connecting lines are configured with different colors to represent different correlation index types; the width and the color depth (i.e., the combined gray level of the three color channels) of a connecting line are positively correlated with the correlation index it represents. The correlation index types include: a correlation index between two numeric fields, a correlation index between two classification fields, and a correlation index between a numeric field and a classification field. Because fig. 6 cannot display nodes and connecting lines in different colors, they are shown schematically in different gray levels only. The database table to which the correlation matrix in fig. 5 and the weight connection graph in fig. 6 correspond has the following fields: system, level 1 department, level 2 department, whether institution responsible person, whether high potential, whether core talents, job level sequence, job name, span of span, age, gender, constellation, highest school, whether good performance, man-hour type, nationality, ethnicity, political aspect, marital status, province, city, span of span, age, on-duty duration, promotion interval, promotion speed, training duration, work saturation, performance level.
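The geometry and styling of such a weight connection graph may be sketched as follows. The function names, the unit radius, the maximum line width, and the linear mapping from correlation index to width and depth are assumptions of this sketch; any drawing library could consume the result.

```python
import math


def circular_layout(fields, radius=1.0):
    """Place one node per field at equal angles along a circle."""
    n = len(fields)
    return {
        name: (radius * math.cos(2 * math.pi * k / n),
               radius * math.sin(2 * math.pi * k / n))
        for k, name in enumerate(fields)
    }


def edge_style(type_a, type_b, index, max_width=6.0):
    """Line kind (which selects the color) plus width and depth that grow with the index."""
    kinds = {type_a, type_b}
    if kinds == {"numeric"}:
        kind = "numeric-numeric"
    elif kinds == {"categorical"}:
        kind = "categorical-categorical"
    else:
        kind = "numeric-categorical"
    level = min(1.0, max(0.0, index))
    return {"kind": kind, "width": max_width * level, "depth": level}


layout = circular_layout(["age", "dept", "level", "city"])
style = edge_style("numeric", "categorical", 0.5)
```

A linear mapping is the simplest choice that keeps width and depth positively correlated with the index; other monotonic mappings would serve equally well.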

According to the technical scheme of the embodiment of the present invention, when correlation analysis is performed on the fields in a database table, the field types are first judged quickly and accurately according to the field elements and preset regular expressions. The correlation index is then calculated by a method suited to the field types: when the two fields to be analyzed are both numeric fields, the absolute value of the Spearman correlation coefficient of the two fields is used as the correlation index, and when the two fields to be analyzed are both classification fields, the Cramér's V correlation coefficient of the two fields is used as the correlation index. In particular, when one of the two fields to be analyzed is a numeric field and the other is a classification field, the embodiment of the present invention first divides the elements in the numeric field into a plurality of analysis groups according to the element categories in the classification field, then calculates the inter-group variance and the intra-group variance of the analysis groups and obtains the correlation index from the quotient of the inter-group variance and the intra-group variance, thereby realizing quantitative correlation analysis of a numeric field and a classification field. In summary, the embodiment of the present invention provides a unified analysis standard from field type judgment to correlation calculation: whether facing two numeric fields, two classification fields, or one numeric field and one classification field, quantitative correlation analysis can be performed and a correlation index between zero and one is obtained. Finally, the embodiment of the present invention can generate a correlation matrix or a weight connection graph based on the correlation index, thereby realizing visual output of the correlation analysis.

It should be noted that, for the convenience of description, the foregoing method embodiments are expressed as a series of combinations of actions, but it should be understood by those skilled in the art that the present invention is not limited by the described order of actions, and some steps may actually be performed in other order or simultaneously. Moreover, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts and modules referred to are not necessarily required to practice the invention.

In order to facilitate better implementation of the above-described aspects of embodiments of the present invention, the following provides related devices for implementing the above-described aspects.

Referring to fig. 7, an apparatus 700 for determining field relatedness in a database table according to an embodiment of the present invention may include: a field type determination unit 701, a grouping unit 702, and a correlation calculation unit 703.

The field type determining unit 701 may be configured to: judge the field type of any two fields to be analyzed in the database table according to the elements of each field, wherein the field types include a numeric field and a classification field, and the elements in a classification field belong to at least two element categories. The grouping unit 702 may be configured to: when one of the two fields is a numeric field and the other is a classification field, determine the elements belonging to the same element category in the classification field, and form an analysis group from the elements in the numeric field that correspond to those elements. The correlation calculation unit 703 may be configured to: determine an inter-group variance and an intra-group variance for the analysis groups, and obtain the correlation index of the two fields according to the inter-group variance and the intra-group variance.
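The work of the grouping unit and the first step of the correlation calculation unit may be sketched as follows. The sketch assumes the degrees of freedom of a standard one-way analysis of variance (k - 1 between groups, N - k within groups); the function name and the treatment of a zero intra-group variance are likewise assumptions.

```python
import math
from collections import defaultdict


def variance_ratio(numeric_values, categories):
    """Group numeric elements by category and return F = MSB / MSE."""
    groups = defaultdict(list)
    for value, category in zip(numeric_values, categories):
        groups[category].append(value)     # one analysis group per element category

    n, k = len(numeric_values), len(groups)
    if k < 2 or n <= k:
        raise ValueError("need at least two groups and more elements than groups")

    grand_mean = sum(numeric_values) / n
    means = {c: sum(g) / len(g) for c, g in groups.items()}
    ssb = sum(len(g) * (means[c] - grand_mean) ** 2 for c, g in groups.items())
    sse = sum((v - means[c]) ** 2 for c, g in groups.items() for v in g)

    msb = ssb / (k - 1)    # inter-group variance
    mse = sse / (n - k)    # intra-group variance
    if mse == 0:
        return math.inf    # identical values inside every group (assumed handling)
    return msb / mse


f_value = variance_ratio([1, 2, 3, 7, 8, 9], ["a", "a", "a", "b", "b", "b"])
```

In this example the group means are 2 and 8 around a grand mean of 5, so MSB is 54, MSE is 1, and the ratio is 54.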

In an embodiment of the present invention, the field type determining unit 701 may further be configured to: for any field to be analyzed, judge whether the proportion of elements in the field that match a preset first regular expression is not smaller than a first threshold; if so, determine the field as a numeric field, wherein the first regular expression is used for matching floating point numbers; if the proportion of elements in the field that match the first regular expression is smaller than the first threshold, judge whether the number of distinct elements in the field after deduplication is greater than 1 and not greater than a second threshold; if so, determine the field as a classification field, wherein the second threshold is related to, and less than, the total number of elements in the field.

In an alternative, the field type determining unit 701 may further be configured to: for any field to be analyzed, judge whether the number of distinct elements in the field after deduplication is greater than 1 and not greater than a second threshold; if so, determine the field as a classification field, wherein the second threshold is related to, and less than, the total number of elements in the field; if the number of distinct elements in the field after deduplication is 1 or is greater than the second threshold, judge whether the proportion of elements in the field that match a preset second regular expression is not less than a third threshold; if so, determine the field as a numeric field, wherein the second regular expression is used for matching floating point numbers and integers.
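The two judgment orders described above may be sketched as follows. The regular expression patterns, the threshold values, the choice of the second threshold as a fixed fraction of the element count, and the "unknown" result for fields that fit neither type are all assumptions of this sketch.

```python
import re

# Assumed patterns: the first matches numbers with a decimal point,
# the second matches both floating point numbers and integers.
FIRST_RE = re.compile(r"[+-]?\d+\.\d+")
SECOND_RE = re.compile(r"[+-]?\d+(\.\d+)?")


def _match_ratio(elements, pattern):
    return sum(1 for e in elements if pattern.fullmatch(e)) / len(elements)


def _is_categorical(elements, second_ratio):
    distinct = len(set(elements))                  # number of elements after deduplication
    return 1 < distinct <= second_ratio * len(elements)


def field_type_regex_first(elements, first_threshold=0.9, second_ratio=0.5):
    """First order: test the regular expression, then the distinct-element count."""
    if not elements:
        return "unknown"
    if _match_ratio(elements, FIRST_RE) >= first_threshold:
        return "numeric"
    if _is_categorical(elements, second_ratio):
        return "categorical"
    return "unknown"


def field_type_distinct_first(elements, third_threshold=0.9, second_ratio=0.5):
    """Alternative order: test the distinct-element count, then the regular expression."""
    if not elements:
        return "unknown"
    if _is_categorical(elements, second_ratio):
        return "categorical"
    if _match_ratio(elements, SECOND_RE) >= third_threshold:
        return "numeric"
    return "unknown"
```

Note that the two orders can disagree: a field of a few repeated integer codes is a classification field under the alternative order, because the distinct-element test runs before any numeric matching.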

In a specific application, the correlation calculation unit 703 may be further configured to: dividing the inter-group variance by the intra-group variance to obtain a correlation initial value of the two fields, and determining a natural logarithm of the correlation initial value as a correlation intermediate value; and transforming the correlation intermediate value into a numerical value interval from zero to one to form correlation indexes of the two fields.

In practical applications, the correlation calculation unit 703 may further be configured to: when the correlation intermediate value is smaller than zero, determine the correlation index as zero; when the correlation intermediate value is larger than a first value, determine the correlation index as one, wherein the first value is a real number greater than one; and when the correlation intermediate value is not less than zero and not greater than the first value, determine the correlation index as the product of the correlation intermediate value and a second value, wherein the second value is the reciprocal of the first value.
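Written as a piecewise function, with m denoting the correlation intermediate value, a the first value, and R the correlation index (symbols introduced here only for illustration), the transformation is:

$$ R = \begin{cases} 0, & m < 0 \\ \dfrac{m}{a}, & 0 \le m \le a \\ 1, & m > a \end{cases} \qquad m = \log_e F,\quad a > 1 $$

For instance, a = 10 gives a second value of 0.1, which corresponds to the 0.1·log_e F function of the first embodiment.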

As a preferred embodiment, the correlation calculation unit 703 may further be configured to: when any two fields to be analyzed in the database table are both numeric fields, determine the absolute value of the Spearman correlation coefficient of the two fields as the correlation index of the two fields; and when any two fields to be analyzed in the database table are both classification fields, determine the Cramér's V correlation coefficient of the two fields as the correlation index of the two fields.

Preferably, the device 700 may further comprise a first visualization unit for: after obtaining the correlation indexes of any two fields to be analyzed in the database table, inputting the correlation indexes into a preset correlation matrix; the number of rows and the number of columns of the correlation matrix are equal to the total number of fields to be analyzed in the database table, each row and each column respectively correspond to the identification of the fields to be analyzed in the database table arranged in a preset sequence, any element in the correlation matrix is a correlation index between a field corresponding to the row where the element is located and a field corresponding to the column where the element is located, and the gray value of the element is positively correlated with the correlation index.

Furthermore, in an embodiment of the present invention, the apparatus 700 may further comprise a second visualization unit configured to: after the correlation index of any two fields to be analyzed in the database table is obtained, enter the correlation index into a preset weight connection graph. The weight connection graph comprises nodes arranged along the circumferential direction, which represent the fields to be analyzed in the database table, and connecting lines between any two nodes, which represent correlation indexes; the nodes are configured with different colors to represent different field types, the connecting lines are configured with different colors to represent different correlation index types, and the width and the color depth of a connecting line are positively correlated with the correlation index it represents. The correlation index types include: a correlation index between two numeric fields, a correlation index between two classification fields, and a correlation index between a numeric field and a classification field.

Fig. 8 illustrates an exemplary system architecture 800 of a method of determining field relatedness in a database table or an apparatus of determining field relatedness in a database table, to which embodiments of the present invention may be applied.

As shown in fig. 8, a system architecture 800 may include terminal devices 801, 802, 803, a network 804, and a server 805 (this architecture is merely an example, and the components contained in a particular architecture may be tailored to the application specific case). The network 804 serves as a medium for providing communication links between the terminal devices 801, 802, 803 and the server 805. The network 804 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 805 through the network 804 using the terminal devices 801, 802, 803 to receive or send messages or the like. Various client applications, such as applications (by way of example only) that perform relevance statistics, may be installed on the terminal devices 801, 802, 803.

The terminal devices 801, 802, 803 may be a variety of electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.

The server 805 may be a server providing various services, such as a computing server (merely an example) that supports the correlation statistics applications operated by the user on the terminal devices 801, 802, 803. The computing server may process a received correlation calculation request and feed back the processing result (e.g., the calculated correlation index, by way of example only) to the terminal devices 801, 802, 803.

It should be noted that, the method for determining the field relevance in the database table provided in the embodiment of the present invention is generally executed by the server 805, and accordingly, the device for determining the field relevance in the database table is generally disposed in the server 805.

It should be understood that the number of terminal devices, networks and servers in fig. 8 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

The invention also provides electronic equipment. The electronic equipment of the embodiment of the invention comprises: one or more processors; and the storage device is used for storing one or more programs, and when the one or more programs are executed by the one or more processors, the one or more processors are enabled to realize the method for determining the field relatedness in the database table.

Referring now to FIG. 9, there is illustrated a schematic diagram of a computer system 900 suitable for use in implementing an electronic device of an embodiment of the present invention. The electronic device shown in fig. 9 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present invention.

As shown in fig. 9, the computer system 900 includes a Central Processing Unit (CPU) 901, which can execute various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 902 or a program loaded from a storage section 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data necessary for the operation of the computer system 900 are also stored. The CPU 901, ROM 902, and RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

The following components are connected to the I/O interface 905: an input section 906 including a keyboard, a mouse, and the like; an output portion 907 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage portion 908 including a hard disk or the like; and a communication section 909 including a network interface card such as a LAN card, a modem, or the like. The communication section 909 performs communication processing via a network such as the internet. The drive 910 is also connected to the I/O interface 905 as needed. A removable medium 911 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed on the drive 910 as needed, so that a computer program read out therefrom is installed into the storage section 908 as needed.

In particular, the processes described in the main step diagrams above may be implemented as computer software programs according to the disclosed embodiments of the invention. For example, embodiments of the present invention include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the main step diagrams. In the above-described embodiment, the computer program can be downloaded and installed from the network through the communication section 909 and/or installed from the removable medium 911. The above-described functions defined in the system of the present invention are performed when the computer program is executed by the central processing unit 901.

The computer readable medium shown in the present invention may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave, with computer readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The units involved in the embodiments of the present invention may be implemented in software or in hardware. The described units may also be provided in a processor, for example, described as: a processor includes a field type determining unit, a grouping unit, and a correlation calculating unit. The names of these units do not constitute a limitation on the unit itself in some cases, and for example, the field type determining unit may also be described as "a unit that provides a field type to a grouping unit".

As another aspect, the present invention also provides a computer-readable medium, which may be contained in the apparatus described in the above embodiments or may exist alone without being assembled into the apparatus. The computer-readable medium carries one or more programs which, when executed by the apparatus, cause the apparatus to perform steps comprising: judging the field type of any two fields to be analyzed in the database table according to the elements of each field, wherein the field types include a numeric field and a classification field, and the elements in a classification field belong to at least two element categories; when one of the two fields is a numeric field and the other is a classification field, determining the elements belonging to the same element category in the classification field, and forming an analysis group from the elements in the numeric field that correspond to those elements; and determining an inter-group variance and an intra-group variance for the analysis groups, and obtaining the correlation index of the two fields from the inter-group variance and the intra-group variance.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives can occur depending upon design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for determining field relevance in a database table, comprising:

judging the field type of any two fields to be analyzed in the database table according to the elements of each field; wherein the field types include: a numerical field and a classification field, elements in the classification field belonging to at least two element categories;

When one of the two fields is a numerical field and the other is a classification field, determining elements belonging to the same element category in the classification field, and forming an analysis group by the elements in the numerical field corresponding to the elements;

determining an inter-group variance and an intra-group variance for each analysis group, dividing the inter-group variance by the intra-group variance to obtain a correlation initial value of the two fields, and obtaining a correlation index of the two fields according to the correlation initial value.

2. The method of claim 1, wherein determining the field type to which each field belongs based on the element of the field comprises:

for any field to be analyzed, judging whether the proportion of elements in the field that match a preset first regular expression is not smaller than a first threshold: if yes, determining the field as a numerical field; wherein the first regular expression is used for matching floating point numbers;

if the proportion of elements in the field that match the first regular expression is smaller than the first threshold, judging whether the number of elements in the field after deduplication is larger than 1 and not larger than a second threshold: if yes, determining the field as a classification field; wherein the second threshold is related to and less than the total number of elements in the field.

3. The method of claim 1, wherein determining the field type to which each field belongs based on the element of the field comprises:

for any field to be analyzed, judging whether the number of elements after the duplication removal in the field is greater than 1 and not greater than a second threshold value: if yes, determining the field as a classification field; wherein the second threshold is related to and less than the total number of elements in the field;

if the number of elements in the field after deduplication is 1 or greater than the second threshold, judging whether the proportion of elements in the field that match a preset second regular expression is not less than a third threshold: if yes, determining the field as a numerical field; wherein the second regular expression is used to match floating point numbers and integers.

4. The method according to claim 1, wherein the obtaining the correlation index of the two fields from the correlation initial value includes:

Determining the natural logarithm of the correlation initial value as a correlation intermediate value;

and transforming the correlation intermediate value into a numerical value interval from zero to one to form correlation indexes of the two fields.

5. The method of claim 4, wherein transforming the correlation intermediate value to a value interval from zero to one forms a correlation index for the two fields, comprising:

When the correlation intermediate value is smaller than zero, determining the correlation index as zero;

when the correlation intermediate value is larger than a first numerical value, determining the correlation index as one; wherein the first value is a real number greater than one;

when the correlation intermediate value is not less than zero and not greater than the first value, determining the correlation index as the product of the correlation intermediate value and a second value; wherein the second value is the reciprocal of the first value.

6. The method according to claim 4, wherein the method further comprises:

when any two fields to be analyzed in the database table are both numerical fields, determining the absolute value of the Spearman correlation coefficient of the two fields as the correlation index of the two fields;

and when any two fields to be analyzed in the database table are both classification fields, determining the Cramér's V correlation coefficient of the two fields as the correlation index of the two fields.

7. The method according to claim 6, wherein the method further comprises:

after obtaining the correlation indexes of any two fields to be analyzed in the database table, inputting the correlation indexes into a preset correlation matrix; wherein,

The number of rows and columns of the correlation matrix are equal to the total number of fields to be analyzed in the database table, each row and each column respectively correspond to the identifications of the fields to be analyzed in the database table arranged in a preset sequence, any element in the correlation matrix is a correlation index between the field corresponding to the row of the element and the field corresponding to the column of the element, and the gray value of the element is positively correlated with the correlation index.

8. The method according to claim 6, wherein the method further comprises:

after obtaining the correlation index of any two fields to be analyzed in the database table, inputting the correlation index into a preset weight connection diagram; wherein,

the weight connection diagram comprises nodes which are arranged along the circumferential direction and used for representing fields to be analyzed in the database table, and connecting lines which are positioned between any two nodes and used for representing correlation indexes; the nodes are configured with different colors for representing different field types, the connecting lines are configured with different colors for representing different correlation index types, the width and the color depth of the connecting lines are positively correlated with the correlation indexes represented by the connecting lines, and the correlation index types comprise: a correlation index between two numerical fields, a correlation index between two classification fields, and a correlation index between a numerical field and a classification field.

9. An apparatus for determining field relevance in a database table, comprising:

a field type determining unit configured to: judge the field type of any two fields to be analyzed in the database table according to the elements of each field; wherein the field types include: a numerical field and a classification field, elements in the classification field belonging to at least two element categories;

a grouping unit for: when one of the two fields is a numerical field and the other is a classification field, determining elements belonging to the same element category in the classification field, and forming an analysis group by the elements in the numerical field corresponding to the elements;

And the correlation calculation unit is used for determining an inter-group variance and an intra-group variance for each analysis group, dividing the inter-group variance by the intra-group variance to obtain correlation initial values of the two fields, and obtaining correlation indexes of the two fields according to the correlation initial values.

10. An electronic device, comprising:

one or more processors; and

a storage device for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-8.

11. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-8.