CN111104466A

CN111104466A - Method for rapidly classifying massive database tables

Info

Publication number: CN111104466A
Application number: CN201911357917.6A
Authority: CN
Inventors: 王衍祺; 王楠; 孟庆磊; 毛俐旻
Original assignee: Aerospace Science And Technology Network Information Development Co ltd
Current assignee: China Changfeng Electromechanical Technology Research And Design Institute
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-05
Anticipated expiration: 2039-12-25
Also published as: CN111104466B

Abstract

The invention relates to a method for rapidly classifying massive database tables. The method comprises: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata such as the attribute field type and from a summary of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm on it; applying the trained classification algorithm to the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of all database table attribute fields. The method combines the metadata and the content of database fields to construct field feature vectors, clusters the key fields of the database under analysis and assigns them data domains (labeling), builds a training set, and trains a classification algorithm adapted to the characteristics of the industry, thereby reducing the manual workload.

Description

Method for rapidly classifying massive database tables

Technical Field

The invention relates to a database technology, in particular to a method for quickly classifying massive database tables.

Background

During data warehouse construction, data cataloging and cleaning consume a large amount of manpower and material resources, and one important task is classifying the database tables. Classifying and labeling the database tables identifies the data domain of each field (for example, whether a field represents customers, products, quantities or amounts), establishes a data catalog, supplements missing metadata, and helps formulate data quality rules for finding data quality problems, so that subsequent data governance and improvement can be carried out in a targeted manner.

Existing data classification methods require implementation personnel to classify the data according to database design documents, table structure remarks and the like. They depend heavily on the experience of the personnel, and every piece of metadata must be confirmed one by one, which is time-consuming and laborious. When the number of data types and the volume of data are huge, the labor cost is enormous. Machine learning methods have therefore been introduced into the field of data governance: based on existing data classification labels, data classification and labeling are carried out with computer assistance through methods such as clustering and classification, which reduces repetitive work and improves efficiency.

One of the key technologies for realizing this is extracting feature vectors for the database fields (character type and numeric type) and identifying their data domains by training a machine learning classification algorithm.

The features of a database table field should contain two parts: the metadata of the field and the data content of the field. The metadata of a field is its profile and description, including the field type, the length, the distribution characteristics of the field content, the pattern, and so on. For example, for an email address field the metadata includes: the field is a character string, its length does not exceed 256 characters, and its pattern matches the regular expression of an email address. For a sales amount field the metadata includes: the field is numeric, its precision is two digits after the decimal point, its data approximately follow a normal distribution, and its maximum, minimum, mean, variance and the like fall within specific ranges. Adding this metadata when constructing the features of database fields allows the data domains of the fields to be distinguished more accurately and quickly.

Disclosure of Invention

The invention aims to provide a method for rapidly classifying massive database tables, so as to solve the above problems in the prior art.

The invention discloses a method for rapidly classifying massive database tables, comprising: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata such as the attribute field type and from a summary of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm on it; applying the trained classification algorithm to the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of all database table attribute fields.

According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating the mutual information entropy to obtain the key attributes of each table comprises: based on sampled data from the database, obtaining the dependency relationships among fields by calculating the mutual information entropy between the fields of the database table, and selecting, according to a threshold, the key fields that have the greatest influence on the other fields.

According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating the mutual information entropy to obtain the key attributes of each table specifically comprises:

randomly sampling the data, comprising: when the data volume of the database table is small, using the full data; when the data volume is large, using random sampling without replacement;

calculating the information entropy and the mutual information entropy of the fields, comprising: calculating the information entropy H(X) of a field X, where p(x) is the probability of the value x over the whole value range of X; and calculating the mutual information entropy I(X, Y) of the field X and each remaining field Y, where p(x, y) is the probability of the data pair <x, y> over the whole value range of <X, Y>;

sequentially calculating the information entropy H(X) of all fields and the mutual information entropy I(X, Y) of each field X with the remaining fields Y to form a dependency graph among the fields, in which the weight of a node vi is A(vi) and the field weight is Wi;

where A(vi) is the weight of node vi of the field dependency graph and represents the strength of the correlation between node vi and the other nodes in the dependency graph, and Wi is the weight of node vi relative to the other nodes; and sorting the fields of the database table in descending order of weight Wi, selecting the set of fields whose cumulative weight exceeds a given threshold, and recording it as the key field set of the table.

According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating the feature vectors of the fields comprises: performing stratified sampling again according to the key attributes of the database table; and calculating the feature vector of each attribute field from the metadata of the field, the statistical features of the data content and the like, which is divided into numeric feature extraction and character feature extraction.

According to an embodiment of the method for rapidly classifying massive database tables, extracting the numeric field features comprises: (1) calculating the statistical features of the field; (2) normalizing the data, comprising: standardizing the raw data to zero mean and unit standard deviation (the scale of the standard normal distribution) using the z-score normalization method, with the z-score formula x' = (x - avg(x)) / std(x), where x is the raw data, x' is the normalized data, avg(x) is the mean of x, and std(x) is the standard deviation of x; (3) constructing a probability distribution histogram, comprising: first removing outliers by keeping only the data within [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range, IQR = Q3 - Q1; then calculating the probability distribution histogram of the normalized data according to the number of buckets, where the number of buckets determines the length of the feature vector; (4) determining with the KS method whether the data follow a common distribution; (5) combining the feature vector: feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.

According to an embodiment of the method for rapidly classifying massive database tables, extracting the character field features comprises: (1) reading the data of the attribute field and extracting the length distribution features of the character strings; (2) extracting the character pattern distribution features of the character strings; (3) matching the extracted character patterns against preset regular expressions to determine whether the character strings conform to them; (4) extracting character string features using natural language processing methods; (5) performing named entity recognition; (6) combining the feature vector.

According to an embodiment of the method for rapidly classifying massive database tables, clustering the key attributes with a machine learning clustering algorithm and labeling the cluster centers comprises: extracting the feature vectors of the key attributes according to step 2, clustering the feature vectors of the key data with an unsupervised clustering algorithm, labeling the cluster centers, and extending each label to the other attributes in the same cluster, so as to form the training samples for training a machine learning algorithm.

According to an embodiment of the method for rapidly classifying massive database tables, forming the training set and training the classification algorithm comprises: training a supervised classification model on the extracted key attribute features and the assigned labels; classifying the remaining attributes using their extracted feature vectors and the trained classification model; randomly spot-checking the classification results; increasing the weight of the misclassified data and adding them to the training set; retraining the classification algorithm; and iterating in this way until the trained classification algorithm is finally output.

The invention provides an accurate and rapid database field classification method and processing flow, which enriches the extracted field features, builds a classification algorithm from scratch, and assists the user in classifying fields. It combines the metadata and the content of database fields to construct field feature vectors, clusters the key fields of the database under analysis and assigns them data domains (labeling), builds a training set, and trains a classification algorithm adapted to the characteristics of the industry, thereby reducing the manual workload.

Drawings

FIG. 1 is an overall process flow diagram;

FIG. 2 is a flow chart of a process for extracting key fields of a database table;

FIG. 3 is a field dependency diagram;

FIG. 4 is a flow diagram of a process for extracting field features;

FIG. 5 is a diagram of extracting numeric field features: statistical features and probability distribution histograms;

FIG. 6 is a flow chart of extracting character type field features.

Detailed Description

In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.

FIG. 1 is a flow chart of the method for rapidly classifying massive database tables according to the present invention. As shown in FIG. 1, the invention first obtains the key attributes of each table by calculating mutual information entropy; constructs feature vectors for the selected attributes from metadata such as the attribute field type (currently mainly character and numeric types) and from a summary of the data content; clusters the key attributes with a machine learning clustering algorithm; labels the cluster centers to form a training set and trains a classification algorithm on it; applies the trained classification algorithm to the remaining attributes; spot-checks the classification results by sampling and feeds the errors back to optimize the classification algorithm; and finally outputs the categories of all database table attribute fields.

1. The step of identifying the key attributes of a database table comprises:

Based on sampled data from the database, the dependency relationships among fields are obtained by calculating the mutual information entropy between the fields of the database table, and the key fields that have the greatest influence on the other fields are selected according to a threshold. The process flow is shown in FIG. 2.

(1) Random data sampling, comprising:

When the data volume of the database table is small (for example, fewer than 1,000 records), the full data are used; when the data volume is large, random sampling without replacement is used, so that the sample represents the full data as well as possible.
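The sampling rule above can be sketched in Python as follows. This is a minimal illustration, assuming the table has already been read into a list of rows; the function name `sample_rows` and the `max_rows` and `seed` parameters are hypothetical and not part of the original method.

```python
import random

def sample_rows(rows, max_rows=1000, seed=None):
    """Return every row of a small table, else a random sample without replacement."""
    if len(rows) < max_rows:
        return list(rows)  # small table: use the full data
    rng = random.Random(seed)
    # random.sample draws distinct positions, i.e. sampling without replacement
    return rng.sample(rows, max_rows)
```

Fixing `seed` makes the sample reproducible, which can help when the same table is analyzed more than once.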

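(2) Calculating the information entropy and the mutual information entropy of the fields, comprising:

calculating the information entropy H(X) of a field X, where p(x) is the probability of the value x over the whole value range of X; and calculating the mutual information entropy I(X, Y) of the field X and each remaining field Y, where p(x, y) is the probability of the data pair <x, y> over the whole value range of <X, Y>.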
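The formulas themselves are not reproduced in this text. Assuming the standard Shannon definitions, which match the descriptions of p(x) and p(x, y) given here (the base of the logarithm is an assumption), the two quantities can be written as:

```latex
\[
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
\]
\[
I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log_2 \frac{p(x, y)}{p(x)\, p(y)}
\]
```

Under these definitions I(X, Y) is zero when the two fields are independent and equals H(X) when Y fully determines X and vice versa.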

FIG. 3 shows a field dependency graph. As shown in FIG. 3, the information entropy H(X) of all fields and the mutual information entropy I(X, Y) of each field X with the remaining fields Y are calculated in turn to form a dependency graph among the fields, in which the weight of a node vi is A(vi) and the field weight is Wi.

A(vi) is the weight of node vi of the field dependency graph and represents the strength of the correlation between node vi and the other nodes in the dependency graph; Wi is the weight of node vi relative to the other nodes.

The fields of the database table are sorted in descending order of weight Wi, the set of fields whose cumulative weight exceeds a given threshold (for example, 0.8) is selected, and it is recorded as the key field set of the table.
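A minimal Python sketch of the entropy and weight calculation follows, assuming the standard Shannon definitions with base-2 logarithms and probabilities estimated as relative frequencies in the sample. The node weight formula is not reproduced in this text, so `field_weights` is a hypothetical stand-in: it takes the weight of a field to be the sum of its mutual information with every other field, normalized so that all weights sum to 1. All function names are illustrative.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of one field, from value frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Mutual information (bits) between two fields sampled row by row."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

def field_weights(table):
    """table maps field name -> list of sampled values (all the same length)."""
    names = list(table)
    # ASSUMPTION: weight of a field = sum of its mutual information with the others
    raw = {
        a: sum(mutual_information(table[a], table[b]) for b in names if b != a)
        for a in names
    }
    total = sum(raw.values())
    if total == 0:  # no dependencies found: fall back to equal weights
        return {a: 1 / len(names) for a in names}
    return {a: w / total for a, w in raw.items()}
```

Normalizing the weights to sum to 1 is what makes a cumulative threshold such as 0.8 meaningful in this sketch.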

Pseudocode for selecting key attributes based on threshold inversion selection is as follows:
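The original pseudocode is not reproduced in this text. A minimal Python sketch of the selection step, assuming the field weights are normalized to sum to 1 and that the key field set is the shortest descending-order prefix whose cumulative weight exceeds the threshold, might look like this (the name `select_key_fields` is hypothetical):

```python
def select_key_fields(weights, threshold=0.8):
    """Pick the highest-weight fields until their cumulative weight exceeds threshold."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    selected, cumulative = [], 0.0
    for name, weight in ranked:
        selected.append(name)
        cumulative += weight
        if cumulative > threshold:
            break  # the prefix collected so far already carries enough weight
    return selected
```

For example, with weights 0.5, 0.35, 0.1 and 0.05 and a threshold of 0.8, the first two fields are selected because their cumulative weight of 0.85 exceeds the threshold.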

2. The step of calculating the feature vectors of the fields comprises:

Stratified sampling is performed again according to the key attributes of the database table, and the feature vector of each attribute field is calculated from the metadata of the field, the statistical features of the data content and the like. The calculation is mainly divided into numeric feature extraction and character feature extraction. The process flow is shown in FIG. 4; FIG. 5 and FIG. 6 refine the extraction processes for numeric and character fields in FIG. 4, respectively.

2.1 Reading table metadata

The field metadata of the database table, including field names, types, lengths, remarks and other information, are acquired via JDBC or similar interfaces.

2.2 Stratified sampling based on key fields

The key fields are identified by the method of "1. The step of identifying the key attributes of a database table", the data distribution of the key attributes is read, and stratified sampling is performed according to that distribution.
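Stratified sampling on a key field can be sketched as below. This is a minimal illustration, assuming each row is a dictionary and each distinct value of the key field forms one stratum; the function name `stratified_sample` and the `fraction` parameter (expected to lie in (0, 1]) are hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction=0.1, seed=None):
    """Sample the same fraction from every group of rows sharing a key-field value."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling each stratum in proportion to its size keeps the distribution of the key field in the sample close to that of the full table, while rare values are still represented by at least one row.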

2.3 Traversing the fields and extracting key feature vectors for numeric and character fields respectively

2.3.1 numerical field feature extraction

FIG. 5 illustrates the extraction of numeric field features. As shown in FIG. 5, for numeric data, basic statistics such as the maximum, minimum, mean, variance, standard deviation, median and mode are calculated, the data are normalized, and a probability distribution histogram is computed; the basic statistics and the probability distribution histogram together form the feature vector of the numeric data.

(1) Calculating statistical features of the field

The maximum, minimum, mean, variance, standard deviation, quartiles (the first quartile Q1 and the third quartile Q3) and mode are calculated. These statistics directly reflect the rough distribution of the data set and are helpful for the subsequent classification.
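The statistics listed in step (1) can be computed with the Python standard library, as in the sketch below. The function name `numeric_stats` and the choice of population (rather than sample) variance are assumptions; the source does not specify either.

```python
import statistics

def numeric_stats(values):
    """Basic statistical features of one numeric field."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # the three quartile cut points
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),  # ASSUMPTION: population variance
        "std": statistics.pstdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "mode": statistics.mode(values),
    }
```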

(2) Data normalization processing

The raw data are standardized to zero mean and unit standard deviation (the scale of the standard normal distribution) using the z-score normalization method. The normal distribution is a very general mathematical model, and working on this scale makes the data easier to process.

The z-score formula is as follows: x' = (x - avg(x)) / std(x), where x is the raw data, x' is the normalized data, avg(x) is the mean of x, and std(x) is the standard deviation of x.
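The z-score formula translates directly into code. The sketch below assumes the population standard deviation and maps a constant field to all zeros to avoid division by zero; both choices, and the name `z_score`, are illustrative assumptions.

```python
import statistics

def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:  # constant field: every value equals the mean
        return [0.0 for _ in values]
    return [(x - mean) / std for x in values]
```

Note that z-score normalization shifts and rescales the data but does not change the shape of their distribution.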

(3) Constructing a probability distribution histogram

Outliers are removed first: only the data within [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] are kept for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range, IQR = Q3 - Q1. The probability distribution histogram of the normalized data is then calculated according to the number of buckets, where the number of buckets determines the length of the feature vector.
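A minimal Python sketch of step (3) follows, assuming equal-width buckets spanning the range of the values that remain after outlier removal. The function name `iqr_histogram` and the default bucket count are hypothetical; the test described later in this document uses 512 buckets.

```python
import statistics

def iqr_histogram(values, buckets=8):
    """Drop IQR outliers, then return bucket probabilities that sum to 1."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [x for x in values if lo <= x <= hi]
    low, high = min(kept), max(kept)
    width = (high - low) / buckets or 1.0  # guard against a zero-width range
    counts = [0] * buckets
    for x in kept:
        idx = min(int((x - low) / width), buckets - 1)  # the maximum goes in the last bucket
        counts[idx] += 1
    return [c / len(kept) for c in counts]
```

The function can be applied to the z-scored values from step (2), and the length of the returned list is always `buckets`, which is what fixes the length of this part of the feature vector.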

(4) Determining relationships between data and common distributions

The KS (Kolmogorov-Smirnov) method is used to determine whether the data follow a common distribution such as the 0-1 (Bernoulli) distribution, binomial distribution, normal distribution, Poisson distribution, uniform distribution or exponential distribution; if so, the parameters of the corresponding distribution function are calculated.
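As an illustration of step (4), the sketch below checks a single candidate, the normal distribution, with a one-sample KS statistic computed in plain Python. The constant 1.36 divided by the square root of the sample size is the usual large-sample critical value at the 5% level; because the mean and standard deviation are estimated from the same data, this check is more lenient than a strict test, so it should be read as a rough screen. The function names are hypothetical.

```python
import math
import statistics

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def ks_statistic(values, cdf):
    """Largest gap between the empirical CDF of values and a reference CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def looks_normal(values, critical=1.36):
    """Return (decision, (mu, sigma)) for a rough KS screen against a fitted normal."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        return False, (mu, sigma)
    d = ks_statistic(values, lambda x: normal_cdf(x, mu, sigma))
    return d < critical / math.sqrt(len(values)), (mu, sigma)
```

The returned pair (mu, sigma) corresponds to the distribution function parameters that enter the feature vector; other candidate distributions would need their own CDF and parameter estimates.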

(5) Combined feature vector

Feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.

The final feature vector is one large array composed of the statistical features obtained in step (1), the normalized probability distribution histogram array obtained in step (3), and the distribution function parameter array [distribution type, parameter 1, parameter 2] obtained in step (4) (common distribution functions have at most two parameters). Its composition is as follows:

2.3.2 Character field feature extraction

FIG. 6 is a flow chart of extracting character field features. As shown in FIG. 6, for character data, the length distribution features of the character strings (the maximum, minimum, mean, median and so on of the string length) and the character distribution features (the number of occurrences, positions and so on of letters, digits, special symbols and the like) are calculated first. The strings are then matched against common regular expressions (such as email address, postcode, mobile phone number and identity card number). Word vectors of the character strings are extracted using natural language processing methods, and named entity recognition is performed to determine whether the strings are names of people, places, objects and the like. Finally, this information is combined to form the feature vector of the character data.

(1) Reading the data of the attribute field and extracting the length distribution features of the character strings

The data content of the attribute field is acquired first, and the maximum, minimum, mean, variance, standard deviation, mode and so on of the string length are computed, so that the length distribution of the character strings is known from a statistical perspective.

(2) Extracting the character pattern distribution features of the character strings

The character pattern is the sequence of character types that make up a string. The string is split into individual characters: letters map to \w (or A), digits map to \d (or #), and special symbols (comma, semicolon, quotation mark, dash, ellipsis, dot, asterisk, full stop, space, the four arithmetic operators, and so on) map to themselves, so that the string is converted into a pattern such as "\w\w\d\d-\d". The character patterns are then aggregated and counted, and the five most frequent patterns (top 5) are selected.
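The pattern extraction in step (2) can be sketched as follows, assuming that special symbols are kept as themselves in the pattern. The function names `char_pattern` and `top_patterns` are hypothetical.

```python
from collections import Counter

def char_pattern(text):
    """Map each character to its type: letter -> \\w, digit -> \\d, others unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append(r"\w")
        elif ch.isdigit():
            out.append(r"\d")
        else:
            out.append(ch)  # ASSUMPTION: special symbols stand for themselves
    return "".join(out)

def top_patterns(strings, k=5):
    """Return the k most frequent patterns with their share of all strings."""
    counts = Counter(char_pattern(s) for s in strings)
    total = sum(counts.values())
    return [(pattern, count / total) for pattern, count in counts.most_common(k)]
```

The shares of the top patterns form a fixed-length part of the feature vector, and a field dominated by a single pattern is a strong hint of a code-like domain such as an order number.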

(3) Common regular expression matching

The extracted character strings are matched against preset regular expressions, such as those for an email address, a mobile phone number or an identity card number, to determine whether they conform to these expressions.
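A minimal sketch of step (3) is shown below. The three regular expressions are simplified, hypothetical examples and not the preset expressions of the original method; real validation rules for email addresses or phone numbers are considerably stricter.

```python
import re

# ASSUMPTION: simplified illustrative patterns, not the original presets
PRESET_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "cn_mobile": re.compile(r"1\d{10}"),
    "cn_postcode": re.compile(r"\d{6}"),
}

def regex_match_rates(strings):
    """Fraction of the strings that fully match each preset regular expression."""
    n = len(strings)
    return {
        name: (sum(1 for s in strings if rx.fullmatch(s)) / n if n else 0.0)
        for name, rx in PRESET_PATTERNS.items()
    }
```

Using match rates rather than a yes/no flag tolerates a small share of dirty values in an otherwise well-formed field.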

(4) Extracting character string features using natural language processing techniques

The character strings are first segmented into words, and word vectors are extracted using techniques such as TF-IDF, CBOW and Word2Vec. Alternatively, the entire content of the field is treated as one document, and a text feature vector of the field is constructed using methods such as Doc2Vec.
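Of the techniques named above, TF-IDF is simple enough to sketch in plain Python. The example assumes that word segmentation has already been done, so each document is a list of tokens, and it uses the unsmoothed form tf × log(N / df); library implementations often apply smoothing and normalization, so their numbers will differ. The function name `tfidf` is hypothetical.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns one {token: weight} dict per document."""
    n = len(documents)
    # document frequency: in how many documents each token appears
    df = Counter(tok for doc in documents for tok in set(doc))
    result = []
    for doc in documents:
        tf = Counter(doc)
        result.append(
            {tok: (cnt / len(doc)) * math.log(n / df[tok]) for tok, cnt in tf.items()}
        )
    return result
```

A token that appears in every document gets weight zero, which is why TF-IDF highlights the tokens that distinguish one field from another.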

(5) Named entity recognition

Named entity recognition is used to identify whether the field content is a personal name, an institution name, a place name, an object name and so on, which provides an important basis for assigning the data domain. By analyzing parts of speech, word vectors and the like, the content is matched against preset corpora such as personal names and place names, and the category whose similarity exceeds a certain threshold is taken as the result.

(6) Combined feature vector

The final feature vector = string length statistical features + pattern distribution features + word vector + matched regular expressions + named entity recognition result. Its composition is as follows:

2.4 Constructing the feature vectors of the attributes

The feature vectors of numeric and character fields are unified to the same length, and the attribute feature vector is composed as follows:

field type (character or numeric) + field feature vector obtained in step 2.3.1 or 2.3.2

3. Data clustering and labeling

The feature vectors of the key attributes are extracted according to step 2, the key attributes are clustered using a machine learning method, and they are labeled to form the training samples for training a machine learning algorithm. Specifically, the feature vectors of the key data are clustered using an unsupervised clustering algorithm (such as a density-based clustering algorithm), the cluster centers are labeled (a label system is established in advance), and the system automatically extends each label to the other attributes in the same cluster.
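The clustering and label propagation can be sketched as below with a small pure-Python DBSCAN, the density-based algorithm named in the test section of this document. Density-based clusters have no explicit center, so the sketch assumes that the "cluster center" is the member closest to the cluster mean; that choice, the `label_center` callback standing in for the human annotator, and all function names are hypothetical. Points are expected to be tuples of numbers.

```python
import math

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # core point: keep expanding the cluster
                queue.extend(reachable)
    return labels

def propagate_labels(points, cluster_ids, label_center):
    """Ask for a label on each cluster's central member and copy it to all members."""
    labels = {}
    for cid in set(cluster_ids) - {-1}:
        members = [i for i, c in enumerate(cluster_ids) if c == cid]
        dims = range(len(points[0]))
        centroid = tuple(sum(points[i][d] for i in members) / len(members) for d in dims)
        center = min(members, key=lambda i: math.dist(points[i], centroid))
        tag = label_center(center)  # only this one field is labeled by hand
        for i in members:
            labels[i] = tag
    return labels
```

Fields marked as noise receive no label in this sketch and would be left for the classifier or for manual review.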

4. Data classification algorithm training and optimization

(1) Training the classification algorithm

A supervised classification model (such as an SVM or a random forest) is trained on the key attribute features extracted in step 2 and the labels assigned in step 3.

(2) Verifying the classification results and optimizing the algorithm

The feature vectors of the remaining attributes are extracted using step 2 and classified with the trained classification model. The classification results are randomly spot-checked; the misclassified data are given an increased weight and added to the training set, and the classification algorithm is retrained. This is iterated step by step, and the trained classification algorithm is finally output.
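The check-and-retrain loop can be illustrated with the sketch below. To stay self-contained it swaps in a weighted nearest-centroid classifier for the SVM or random forest mentioned above, so it shows only the feedback mechanism, not the original model. The `review` callback stands in for the human spot check, and the parameter names `rounds`, `check` and `boost` are hypothetical. Vectors are expected to be tuples of numbers.

```python
import math
import random

def train_centroids(samples):
    """samples: list of (vector, label, weight). Returns label -> weighted mean vector."""
    sums, totals = {}, {}
    for vec, label, weight in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for d, v in enumerate(vec):
            acc[d] += weight * v
        totals[label] = totals.get(label, 0.0) + weight
    return {lab: tuple(v / totals[lab] for v in acc) for lab, acc in sums.items()}

def predict(model, vec):
    """Label of the nearest centroid."""
    return min(model, key=lambda lab: math.dist(model[lab], vec))

def refine(train, unlabeled, review, rounds=3, check=5, boost=2.0, seed=0):
    """Spot-check predictions; add misclassified vectors with a higher weight; retrain."""
    rng = random.Random(seed)
    train = list(train)
    model = train_centroids(train)
    for _ in range(rounds):
        picked = rng.sample(unlabeled, min(check, len(unlabeled)))
        errors = []
        for vec in picked:
            truth = review(vec)  # the reviewer supplies the correct label
            if predict(model, vec) != truth:
                errors.append((vec, truth, boost))
        if not errors:
            break  # the spot check found no mistakes
        train.extend(errors)
        model = train_centroids(train)
    return model
```

With a classifier that accepts per-sample weights, the same loop applies unchanged: only `train_centroids` and `predict` would be replaced.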

The invention discloses a method for constructing field feature vectors: on the basis of the data content, the metadata of the fields are added, and the feature vector is constructed in combination with prior knowledge of the industry. For numeric data, the statistical features are combined with a probability distribution histogram; for character data, features such as the character pattern distribution are added to the existing word vectors, together with matching against common regular expressions. Key fields are extracted according to information entropy, clustered and labeled, and then used as a training set for quickly building a classification algorithm adapted to the characteristics of the industry.

The method of the invention was tested on several real business databases (backup copies). The sample size was 1,000 records, the key field threshold was 0.8, DBSCAN was used as the clustering algorithm, the number of buckets for numeric fields was 512, and the character string word vector length was 512. The method was applied to the databases of in-house software and of some commercial software, and the test results were within an acceptable range. The tests show that the method can quickly extract the key fields of all tables in a database and extract features suited to the characteristics of the industry, which accumulates industry knowledge and greatly reduces the manual labeling workload.

The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make modifications and variations without departing from the technical principle of the present invention, and such modifications and variations shall also fall within the protection scope of the present invention.

Claims

1. A method for rapidly classifying massive database tables, characterized by comprising: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata such as the attribute field type and from a summary of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm on it; applying the trained classification algorithm to the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of all database table attribute fields.

2. The method for rapidly classifying massive database tables according to claim 1, wherein calculating mutual information entropy to obtain the key attributes of each table comprises:

based on sampled data from the database, obtaining the dependency relationships among fields by calculating the mutual information entropy between the fields of the database table, and selecting, according to a threshold, the key fields that have the greatest influence on the other fields.

3. The method for rapidly classifying massive database tables according to claim 2, wherein the step of calculating mutual information entropy to obtain key attributes of each table specifically comprises the steps of:

randomly sampling the data, comprising:

when the data volume of the database table is small, using the full data; when the data volume is large, using random sampling without replacement;

calculating the information entropy and the mutual information entropy of the fields, comprising:

A(vi) is the weight of node vi of the field dependency graph and represents the strength of the correlation between node vi and the other nodes in the dependency graph; Wi is the weight of node vi relative to the other nodes;

and sorting the fields of the database table in descending order of weight Wi, selecting the set of fields whose cumulative weight exceeds a given threshold, and recording it as the key field set of the table.

4. The method for rapidly classifying massive database tables according to claim 1, wherein the step of calculating the feature vectors of the fields comprises: performing stratified sampling again according to the key attributes of the database table; and calculating the feature vector of each attribute field from the metadata of the field, the statistical features of the data content and the like, which is divided into numeric feature extraction and character feature extraction.

5. The method of rapidly classifying a large number of database tables according to claim 4, wherein the extracting numerical field features comprises:

(1) calculating the statistical characteristics of the fields;

(2) the data normalization processing comprises the following steps:

standardizing the raw data to zero mean and unit standard deviation (the scale of the standard normal distribution) using the z-score normalization method;

the z-score formula is as follows: x' = (x - avg(x)) / std(x), where x is the raw data, x' is the normalized data, avg(x) is the mean of x, and std(x) is the standard deviation of x; (3) constructing a probability distribution histogram, comprising: first removing outliers by keeping only the data within [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range, IQR = Q3 - Q1; then calculating the probability distribution histogram of the normalized data according to the number of buckets, where the number of buckets determines the length of the feature vector; (4) determining with the KS method whether the data follow a common distribution; (5) combining the feature vector: feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.

6. The method for rapid classification of mass database tables as claimed in claim 4, wherein the character type field feature extraction comprises:

(1) reading data of the attribute field, and extracting character length distribution characteristics of the character string;

(2) extracting character pattern distribution characteristics of the character string;

(3) for the character pattern distribution of the extracted character strings, a preset regular expression is used for matching whether the character strings conform to the regular expression or not;

(4) extracting character string features by using a natural language processing method;

(5) carrying out named entity recognition;

(6) the feature vectors are combined.

7. The method for rapidly classifying massive database tables according to claim 1, wherein clustering the key attributes with a machine learning clustering algorithm and labeling the cluster centers comprises: extracting the feature vectors of the key attributes according to step 2, clustering the feature vectors of the key data with an unsupervised clustering algorithm, labeling the cluster centers, and extending each label to the other attributes in the same cluster, so as to form the training samples for training a machine learning algorithm.

8. The method for rapidly classifying massive database tables according to claim 1, wherein forming the training set and training the classification algorithm comprises:

training a supervised classification model on the extracted key attribute features and the assigned labels; classifying the remaining attributes using their extracted feature vectors and the trained classification model; randomly spot-checking the classification results; increasing the weight of the misclassified data and adding them to the training set; retraining the classification algorithm; and iterating in this way until the trained classification algorithm is finally output.