CN111104466A - Method for rapidly classifying massive database tables - Google Patents

Publication number
CN111104466A
Authority
CN
China
Prior art keywords
data
field
fields
classification
clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911357917.6A
Other languages
Chinese (zh)
Other versions
CN111104466B (en)
Inventor
王衍祺
王楠
孟庆磊
毛俐旻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Changfeng Electromechanical Technology Research And Design Institute
Original Assignee
Aerospace Science And Technology Network Information Development Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aerospace Science And Technology Network Information Development Co ltd
Priority to CN201911357917.6A
Publication of CN111104466A
Application granted
Publication of CN111104466B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for rapidly classifying massive database tables. The method comprises: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata information, such as the attribute field type, and from summaries of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm on it; applying the trained classification algorithm to the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of the attribute fields of all database tables. By combining the metadata information and the content of database fields into field feature vectors, clustering the key fields of the database under analysis, assigning them data domains (labeling), and building a training set from them, the method trains a classification algorithm tailored to the industry and reduces the manual workload.

Description

Method for rapidly classifying massive database tables
Technical Field
The invention relates to database technology, and in particular to a method for rapidly classifying massive database tables.
Background
In data warehouse construction, data cataloging and cleaning consume a large amount of manpower and material resources, and one important task is classifying the database tables. Classifying and labeling the database tables identifies the data domain of each field (for example, fields representing customers, products, quantities, or amounts), establishes a data catalog, supplements missing metadata information, and helps formulate data quality rules for finding data quality problems, so that subsequent data governance and quality improvement can be targeted.
Existing data classification methods require implementation personnel to work from database design documents, table structure remarks, and the like. They depend heavily on personal experience, and every piece of metadata information has to be confirmed one by one, which is time-consuming and labor-intensive. When the number of data types and the data volume are huge, the labor cost is enormous. Machine learning methods have therefore been introduced into data governance: starting from existing classification labels, clustering and classification methods let a computer assist with classifying and labeling data, which reduces repetitive work and improves efficiency.
One of the key technologies for this approach is extracting feature vectors from database fields (character type and numerical type) and identifying the data domain of each field by training a machine learning classification algorithm.
The features of a database table field should contain two parts: the metadata information of the field and the data content of the field. The metadata information is a profile and description of the field, including its type, length, content distribution characteristics, pattern, and so on. For example, for an e-mail address field the metadata information includes: the field is a character string, its length does not exceed 256 characters, and its pattern matches a regular expression for e-mail addresses. For a sales amount field the metadata information includes: the field is numerical, its precision is two digits after the decimal point, its data approximately follow a normal distribution, and its maximum, minimum, mean, and variance fall within specific ranges. Adding this metadata information when constructing field features allows the data domain of a field to be identified more accurately and quickly.
Disclosure of Invention
The invention aims to provide a method for rapidly classifying massive database tables that solves the above problems of the prior art.
The invention discloses a method for rapidly classifying massive database tables, comprising: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata information, such as the attribute field type, and from summaries of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm on it; applying the trained classification algorithm to the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of the attribute fields of all database tables.
According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating mutual information entropy to obtain the key attributes of each table comprises: based on data sampled from the database, obtaining the dependency relationships among fields by calculating the mutual information entropy between the fields of a database table, and selecting, according to a threshold, the key fields that have the greatest influence on the other fields.
According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating mutual information entropy to obtain the key attributes of each table specifically comprises the following. Random data sampling: when the database table is small, the full data are used; when it is large, random sampling without replacement is used. Calculating the information entropy and the mutual information entropy of the fields: the information entropy H(X) of a field X is given by the following formula, where p(x) is the probability of the value x over the whole value range of X:
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
The mutual information entropy I(X, Y) of the field X and each remaining field Y is given by the following formula, where p(x, y) is the probability of the data pair (x, y) over the whole value range of (X, Y):
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
The information entropy H(X) of every field and the mutual information entropy I(X, Y) between each field X and every remaining field Y are calculated in turn to form a dependency graph among the fields, in which each node vi has a correlation strength A(vi) and a weight Wi;
[Formula image not reproduced: definition of A(vi)]
[Formula image not reproduced: definition of Wi]
A(vi) represents the strength of the correlation between node vi and the other nodes in the dependency graph; Wi is the weight of node vi relative to the other nodes. The fields of the database table are sorted in descending order of weight Wi, the leading fields whose cumulative weight exceeds a given threshold are selected, and this set is recorded as the key field set of the table.
According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating the feature vector of a field comprises: performing stratified sampling again according to the key attributes of the database table; and calculating the feature vector of each attribute field from the metadata information of the data field, the statistical features of the data content, and the like, the calculation being divided into numerical feature extraction and character feature extraction.
According to an embodiment of the method for rapidly classifying massive database tables, extracting the features of a numerical field comprises: (1) calculating the statistical features of the field; (2) normalizing the data: the raw data are standardized to mean 0 and standard deviation 1 with the z-score method, X' = (X - avg(X)) / std(X), where X is the raw data, X' is the normalized data, avg(X) is the mean of X, and std(X) is the standard deviation of X; (3) constructing a probability distribution histogram: outliers are removed first by keeping only the data within [Q1 - 1.5 IQR, Q3 + 1.5 IQR] for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range; the probability distribution histogram of the normalized data is then calculated for the chosen number of buckets, which determines the length of the feature vector; (4) using the Kolmogorov-Smirnov (KS) test to judge whether the data follow a common distribution; (5) combining the feature vector: feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.
According to an embodiment of the method for rapidly classifying massive database tables, extracting the features of a character field comprises: (1) reading the data of the attribute field and extracting the length distribution features of the character strings; (2) extracting the character pattern distribution features of the character strings; (3) matching the character strings against preset regular expressions, on the basis of the extracted character pattern distribution; (4) extracting character string features with natural language processing methods; (5) performing named entity recognition; (6) combining the feature vector.
According to an embodiment of the method for rapidly classifying massive database tables, clustering the key attributes with a machine learning clustering algorithm and labeling the cluster centers comprises: extracting the feature vectors of the key attributes according to step 2; clustering the feature vectors of the key data with an unsupervised clustering algorithm; labeling the cluster centers; and extending the label of each center to the other attributes in its cluster, thereby forming labeled training samples for training a machine learning algorithm.
According to an embodiment of the method for rapidly classifying massive database tables, forming the training set and training the classification algorithm comprises: training a supervised classification model from the extracted key attribute features and the assigned labels; classifying the remaining attributes by applying the trained model to their extracted feature vectors; spot-checking the classification results by random sampling; increasing the weight of the misclassified data and adding them to the training set; retraining the classification algorithm; iterating in this way; and finally outputting the trained classification algorithm.
The invention provides an accurate and rapid database field classification method and processing flow that enrich the extracted field features, build a classification algorithm from scratch, and assist the user in classifying fields. It combines the metadata information and the content of database fields into field feature vectors, clusters the key fields of the database under analysis and assigns them data domains (labeling), builds a training set, trains a classification algorithm tailored to the industry, and reduces the manual workload.
Drawings
FIG. 1 is an overall process flow diagram;
FIG. 2 is a flow chart of a process for extracting key fields of a database table;
FIG. 3 is a field dependency diagram;
FIG. 4 is a flow diagram of a process for extracting field features;
FIG. 5 is a diagram of extracting numeric field features: statistical features and probability distribution histograms;
FIG. 6 is a flow chart of extracting character type field features.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
FIG. 1 is a flow chart of the method for rapidly classifying massive database tables according to the invention. As shown in FIG. 1, the method first obtains the key attributes of each table by calculating mutual information entropy; constructs feature vectors for the selected attributes from metadata information such as the attribute field type (currently character type and numerical type) and from summaries of the data content; clusters the key attributes with a machine learning clustering algorithm; labels the cluster centers to form a training set and trains a classification algorithm; applies the trained classification algorithm to the remaining attributes; spot-checks the classification results by sampling and uses the errors to optimize the classification algorithm; and finally outputs the categories of the attribute fields of all database tables.
1. As shown in FIG. 1, the step of identifying the key attributes of a database table comprises:
Based on data sampled from the database, the dependency relationships among fields are obtained by calculating the mutual information entropy between the fields of a database table, and the key fields that have the greatest influence on the other fields are selected according to a threshold. The processing flow is shown in FIG. 2.
(1) Random data sampling:
When the database table is small (for example, fewer than 1000 rows), the full data are used; when it is large, random sampling without replacement is used so that the sample represents the full data as faithfully as possible.
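The sampling rule above can be sketched in a few lines of Python. This is only an illustration: the function name `sample_rows` and the default sample size of 1000 are assumptions (the 1000-row limit follows the example in the text), and a real implementation would push the sampling down into the database query.

```python
import random

def sample_rows(rows, small_table_limit=1000, sample_size=1000, seed=None):
    """Return all rows of a small table, otherwise a random sample without replacement."""
    if len(rows) < small_table_limit:
        return list(rows)                      # small table: use the full data
    rng = random.Random(seed)
    # random.sample never returns the same element twice (no replacement).
    return rng.sample(rows, min(sample_size, len(rows)))
```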
(2) Calculating the information entropy and the mutual information entropy of the fields:
The information entropy H(X) of a field X is given by the following formula, where p(x) is the probability of the value x over the whole value range of X:
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
The mutual information entropy I(X, Y) of the field X and each remaining field Y is given by the following formula, where p(x, y) is the probability of the data pair (x, y) over the whole value range of (X, Y):
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
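As a hedged illustration of step (2), the sketch below estimates both quantities from the empirical value frequencies of two sampled columns. The function names and the choice of base-2 logarithms are assumptions; the text does not state the logarithm base.

```python
import math
from collections import Counter

def entropy(values):
    """Empirical information entropy H(X) of one column, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """Empirical mutual information I(X, Y) of two equally long columns, in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

A column that fully determines another has mutual information equal to its own entropy, while two independent columns give a value of zero.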
FIG. 3 shows a field dependency graph. As shown in FIG. 3, the information entropy H(X) of every field and the mutual information entropy I(X, Y) between each field X and every remaining field Y are calculated in turn to form a dependency graph among the fields, in which each node vi has a correlation strength A(vi) and a weight Wi.
[Formula image not reproduced: definition of A(vi)]
[Formula image not reproduced: definition of Wi]
A(vi) represents the strength of the correlation between node vi and the other nodes in the dependency graph; Wi is the weight of node vi relative to the other nodes.
The fields of the database table are sorted in descending order of weight Wi, the leading fields whose cumulative weight exceeds a given threshold (such as 0.8) are selected, and this set is recorded as the key field set of the table.
Pseudocode for selecting the key attributes by threshold-based selection over the descending order is as follows:
[Pseudocode image not reproduced]
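Because the formula and pseudocode images are not available in this text, the following Python sketch is only one plausible reading of the selection step, not the patented pseudocode. It assumes that A(vi) is the sum of the mutual information between field vi and every other field, and that Wi is A(vi) normalized so that all weights sum to 1; both definitions, and the name `select_key_fields`, are assumptions.

```python
def select_key_fields(mi, threshold=0.8):
    """mi maps each field name to {other field name: mutual information}."""
    # Assumed A(vi): total mutual information between vi and all other fields.
    strength = {f: sum(v for g, v in row.items() if g != f) for f, row in mi.items()}
    total = sum(strength.values())
    if total == 0:
        return []
    # Assumed Wi: A(vi) normalized so that the weights sum to 1.
    weights = {f: a / total for f, a in strength.items()}
    selected, cumulative = [], 0.0
    # Walk the fields in descending weight until the cumulative weight passes the threshold.
    for field, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        selected.append(field)
        cumulative += w
        if cumulative > threshold:
            break
    return selected
```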
2. The step of calculating the feature vector of a field comprises:
Stratified sampling is performed again according to the key attributes of the database table, and the feature vector of each attribute field is calculated from the metadata information of the data field, the statistical features of the data content, and the like. The calculation is divided mainly into numerical feature extraction and character feature extraction. The processing flow is shown in FIG. 4; FIG. 5 and FIG. 6 refine the extraction processes for the numerical and character fields in FIG. 4, respectively.
2.1 Reading the table metadata
The field metadata of the database table, including field names, types, lengths, remarks, and other information, are acquired through JDBC or similar means.
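Step 2.1 can be illustrated with a minimal Python sketch. It uses SQLite's `PRAGMA table_info` from the standard library purely as a stand-in for the JDBC metadata interface named above; the function name `read_field_metadata` is an assumption, and SQLite exposes only the declared field name and type, not lengths or remarks as separate items.

```python
import sqlite3

def read_field_metadata(conn, table):
    """Return the declared name and type of every field of a table.

    The table name is interpolated into the statement, so it must come
    from a trusted source such as the database catalog.
    """
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [{"name": r[1], "type": r[2]} for r in rows]
```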
2.2 Stratified sampling based on key fields
The key attributes are identified by the method of step 1 (identifying the key attributes of a database table), the data distribution of the key attributes is read, and stratified sampling is performed according to that distribution.
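A minimal sketch of stratified sampling on one key field follows. It assumes that rows are dictionaries, that each distinct value of the key field forms one stratum, and that every stratum contributes the same fraction of its rows (at least one); the text does not fix these details, and the name `stratified_sample` is hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=None):
    """Sample the same fraction from every stratum defined by rows[i][key]."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))   # keep at least one row per stratum
        sample.extend(rng.sample(group, k))
    return sample
```

Sampling per stratum keeps rare values of the key field represented, which plain random sampling does not guarantee.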
2.3 Traversing the fields and extracting key feature vectors for numerical and character fields respectively
2.3.1 Numerical field feature extraction
FIG. 5 illustrates the extraction of numerical field features. As shown in FIG. 5, for numerical data, basic statistics such as the maximum, minimum, mean, variance, standard deviation, median, and mode are calculated, the data are normalized, and a probability distribution histogram is computed; the basic statistics and the probability distribution histogram together form the feature vector of the numerical data.
(1) Calculating the statistical features of the field
The maximum, minimum, mean, variance, standard deviation, quartiles (the 1/4 quantile Q1 and the 3/4 quantile Q3), and mode are calculated. These statistics directly reflect the rough distribution of the data set and benefit the subsequent classification.
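The statistics listed in step (1) map directly onto Python's standard `statistics` module, as the sketch below shows. The function name, the dictionary layout, and the use of population (rather than sample) variance are assumptions made for illustration.

```python
import statistics

def numeric_stats(values):
    """Basic statistical features of one numerical column."""
    q1, _median, q3 = statistics.quantiles(values, n=4)   # quartile cut points
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),          # population variance
        "std": statistics.pstdev(values),
        "q1": q1,
        "q3": q3,
        "mode": statistics.mode(values),
    }
```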
(2) Data normalization
The raw data are standardized with the z-score method to mean 0 and standard deviation 1, the scale of the standard normal distribution. The normal distribution is a very general mathematical model, and working on its scale simplifies subsequent processing.
The z-score formula is X' = (X - avg(X)) / std(X), where X is the raw data, X' is the normalized data, avg(X) is the mean of X, and std(X) is the standard deviation of X.
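The z-score formula can be written as a short Python function. The handling of a constant column (standard deviation 0), which the formula leaves undefined, is an assumption: the sketch returns all zeros in that case.

```python
import statistics

def z_score(values):
    """Standardize a column: X' = (X - avg(X)) / std(X)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]   # assumption: constant columns map to zeros
    return [(v - mean) / std for v in values]
```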
(3) Constructing a probability distribution histogram
Outliers are removed first: only the data within [Q1 - 1.5 IQR, Q3 + 1.5 IQR] are kept for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range. The probability distribution histogram of the normalized data is then calculated for the chosen number of buckets, which determines the length of the feature vector.
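Step (3) might be sketched as follows. The equal-width buckets spanning the retained range and the function name `iqr_histogram` are assumptions. The sketch works on the values as given; because the z-score is a linear rescaling, applying it beforehand leaves the bucket proportions unchanged.

```python
import statistics

def iqr_histogram(values, buckets=8):
    """Drop IQR outliers, then return the share of values per equal-width bucket."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if lo <= v <= hi]
    vmin, vmax = min(kept), max(kept)
    width = (vmax - vmin) / buckets or 1.0      # guard against a constant column
    counts = [0] * buckets
    for v in kept:
        idx = min(int((v - vmin) / width), buckets - 1)   # the maximum goes in the last bucket
        counts[idx] += 1
    return [c / len(kept) for c in counts]
```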
(4) Determining the relationship between the data and common distributions
The Kolmogorov-Smirnov (KS) test is used to judge whether the data follow a common distribution, such as the 0-1 (Bernoulli) distribution, binomial distribution, normal distribution, Poisson distribution, uniform distribution, or exponential distribution; if so, the parameters of the corresponding distribution function are calculated.
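As an illustration of step (4) for the normal distribution only, the sketch below computes the one-sample KS statistic against a normal distribution whose mean and standard deviation are estimated from the data, and compares it with the asymptotic 5% critical value 1.36 / sqrt(n). Several points are assumptions: the function names, the 5% level, and the use of the plain KS critical value even though the parameters are estimated (which makes the check conservative). The KS test is strictly defined for continuous distributions, so the discrete distributions in the list above would need a different treatment.

```python
import math
import statistics

def ks_statistic_normal(values):
    """KS distance between the empirical CDF and a fitted normal CDF."""
    n = len(values)
    mu, sigma = statistics.fmean(values), statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("constant column: normal fit is undefined")

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

    d = 0.0
    for i, x in enumerate(sorted(values)):
        f = cdf(x)
        # Compare the fitted CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d, mu, sigma

def looks_normal(values, coefficient=1.36):
    """Return (decision, (mu, sigma)); 1.36 is the asymptotic 5% KS coefficient."""
    d, mu, sigma = ks_statistic_normal(values)
    return d < coefficient / math.sqrt(len(values)), (mu, sigma)
```

When the check passes, the fitted (mu, sigma) pair is what would be stored as the distribution function parameters.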
(5) Combining the feature vector
Feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.
The final feature vector is one large array composed of the statistical features obtained in step (1), the normalized probability distribution histogram array obtained in step (3), and the distribution function parameter array [distribution type, parameter 1, parameter 2] obtained in step (4) (common distribution functions have at most two parameters). Its composition is as follows:
[Figure not reproduced: layout of the numerical feature vector]
2.3.2 Character field feature extraction
For character data, the length distribution features of the character strings (maximum, minimum, mean, median, and so on of the string length) and the character distribution features (number of occurrences and positions of letters, digits, special symbols, and so on) are calculated first. The strings are then matched against common regular expressions (such as those for e-mail addresses, postcodes, mobile phone numbers, and identity card numbers). Next, word vectors of the character strings are extracted with natural language processing methods, and named entity recognition is performed to judge whether the strings are names of people, places, objects, and so on. Finally, this information is combined to form the feature vector of the character data. FIG. 6 is a flow chart of the extraction of character field features; as shown in FIG. 6, the steps are as follows:
(1) Reading the data of the attribute field and extracting the length distribution features of the character strings
The data content of the attribute field is acquired first, and the maximum, minimum, mean, variance, standard deviation, mode, and so on of the string length are computed to describe the length distribution of the character strings statistically.
(2) Extracting the character pattern distribution features of the character strings
The character pattern is the sequence of character types that make up a character string. The string is split into individual characters: letters map to \w (or A), digits map to \d (or #), and special symbols (comma, semicolon, quotation mark, dash, ellipsis, dot, asterisk, full stop, space, the addition, subtraction, multiplication, and division signs, and so on) map to themselves, so that the string is converted into a pattern such as "\w\w\d\d-\d". The character patterns are aggregated and counted, and the five most frequent patterns (TOP5) are selected.
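The character pattern extraction in step (2) can be sketched in Python as follows. The function names are hypothetical, and the sketch relies on `str.isalpha` and `str.isdigit`, which also treat non-Latin letters and digits as letters and digits; whether that matches the intended behavior is an assumption.

```python
from collections import Counter

def char_pattern(text):
    """Map letters to \\w, digits to \\d, and keep every other character unchanged."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append(r"\w")
        elif ch.isdigit():
            out.append(r"\d")
        else:
            out.append(ch)
    return "".join(out)

def top_patterns(values, k=5):
    """Return the k most frequent character patterns with their relative frequency."""
    counts = Counter(char_pattern(v) for v in values)
    total = sum(counts.values())
    return [(p, c / total) for p, c in counts.most_common(k)]
```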
(3) Matching against common regular expressions
On the basis of the extracted character pattern distribution, the character strings are matched against preset regular expressions, such as those for e-mail addresses, mobile phone numbers, and identity card numbers, to determine whether they conform.
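A hedged sketch of step (3) follows. The three regular expressions are deliberately simplified assumptions (a loose e-mail shape, an 11-digit mainland China mobile number, and a 6-digit postcode), not the preset expressions of the invention; the match rate per expression is one possible way to turn the result into features.

```python
import re

# Simplified, illustrative patterns; production rules would be stricter.
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "cn_mobile": re.compile(r"1[3-9]\d{9}"),
    "cn_postcode": re.compile(r"\d{6}"),
}

def regex_match_rates(values):
    """Share of values that fully match each preset regular expression."""
    n = len(values) or 1                      # avoid division by zero on empty input
    return {
        name: sum(bool(rx.fullmatch(v)) for v in values) / n
        for name, rx in PATTERNS.items()
    }
```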
(4) Extracting character string features with natural language processing techniques
The character strings are first segmented into words, and word vectors are extracted with techniques such as TF-IDF, CBOW, and Word2Vec. All the content of the field is then treated as one document, and the text feature vector of the field is constructed with a method such as Doc2Vec.
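Of the techniques named in step (4), TF-IDF is simple enough to sketch with the standard library alone. The sketch assumes whitespace tokenization and the unsmoothed weight tf * log(N / df); Chinese text would need a word segmenter first, and CBOW, Word2Vec, and Doc2Vec require a trained model and are not shown.

```python
import math
from collections import Counter

def tfidf(documents):
    """Return one {term: tf-idf weight} dictionary per document."""
    tokenized = [doc.lower().split() for doc in documents]
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(tokenized)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors
```

A term that occurs in every document receives weight 0, so only terms that distinguish field values contribute to the vector.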
(5) Named entity recognition
Named entity recognition identifies whether the field content is the name of a person, an institution, a place, an object, and so on, which provides an important basis for assigning the data domain. The parts of speech, word vectors, and so on are analyzed and matched against preset corpora such as lists of personal names and place names, and the category whose similarity exceeds a given threshold is taken as the result.
(6) Combining the feature vector
The final feature vector = string length statistical features + pattern distribution features + word vector + matched regular expressions + named entity recognition result. Its composition is as follows:
[Figure not reproduced: layout of the character feature vector]
2.4 Constructing the feature vector of an attribute
The feature vectors of numerical fields and character fields are brought to a uniform length, and the feature vector is composed as follows:
[field type (character type or numerical type)] + [field feature vector obtained in step 2.3.1 or 2.3.2]
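One way to read step 2.4 is sketched below: the type-specific features are truncated or zero-padded to a fixed length and prefixed with a numeric type flag. The padding strategy, the flag values 0.0 and 1.0, and the name `unify` are all assumptions; the text only states that the lengths are unified and that the field type is part of the vector.

```python
def unify(field_type, features, length):
    """Prefix a type flag and pad or truncate the features to a fixed length."""
    type_flag = {"numeric": 0.0, "character": 1.0}[field_type]
    body = list(features)[:length]             # truncate if too long
    body += [0.0] * (length - len(body))       # zero-pad if too short
    return [type_flag] + body
```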
3. Data clustering and labeling
The feature vectors of the key attributes are extracted according to step 2, and the key attributes are clustered with a machine learning method and labeled to form training samples for training a machine learning algorithm. Specifically, the feature vectors of the key data are clustered with an unsupervised clustering algorithm (such as a density-based clustering algorithm), the cluster centers are labeled (the label system is established in advance), and the system automatically extends each center's label to the other attributes in its cluster.
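The clustering and label propagation of step 3 can be illustrated with a compact, unoptimized DBSCAN written against the standard library. It is a teaching sketch under assumptions (Euclidean distance, brute-force neighbor search, hypothetical function names), not the implementation used by the invention; the labels passed to `propagate_labels` stand for the manual labeling of each cluster.

```python
import math

def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def dbscan(points, eps, min_pts):
    """Return one cluster id per point; -1 marks noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if _dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1                     # provisional noise
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster            # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:             # j is a core point: keep expanding
                queue.extend(nb)
    return labels

def propagate_labels(cluster_ids, cluster_labels):
    """Extend the label given to each cluster to all of its members; noise gets None."""
    return [cluster_labels.get(c) for c in cluster_ids]
```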
4. Data classification algorithm training and optimization
(1) Training the classification algorithm
A supervised classification model (such as an SVM or a random forest) is trained from the key attribute features extracted in step 2 and the labels assigned in step 3.
(2) Verifying the classification results and optimizing the algorithm
The feature vectors of the remaining attributes are extracted according to step 2 and classified with the trained classification model. The classification results are spot-checked by random sampling; the misclassified data are given an increased weight and added to the training set, and the classification algorithm is retrained. This is iterated step by step, and the trained classification algorithm is finally output.
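The feedback loop of step (2) is sketched below with a weighted nearest-centroid classifier standing in for the SVM or random forest named in the text, so that the example stays self-contained. The `review` callback is a hypothetical placeholder for the manual spot check: it returns the true label of a sampled item, or None for items that were not checked. The boost factor of 2.0 and the fixed number of rounds are assumptions.

```python
import math
from collections import defaultdict

def train_centroids(samples):
    """samples: list of (vector, label, weight). Returns {label: weighted mean vector}."""
    sums, totals = {}, defaultdict(float)
    for vec, label, w in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += w * x
        totals[label] += w
    return {lab: [s / totals[lab] for s in acc] for lab, acc in sums.items()}

def predict(model, vec):
    """Assign the label of the nearest centroid (Euclidean distance)."""
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(c, vec)))
    return min(model, key=lambda lab: dist(model[lab]))

def refine(train_samples, unlabeled, review, rounds=3, boost=2.0):
    """Retrain with spot-checked errors added to the training set at a higher weight."""
    samples = list(train_samples)
    model = train_centroids(samples)
    for _ in range(rounds):
        errors = []
        for vec in unlabeled:
            truth = review(vec)                # None: item was not spot-checked
            if truth is not None and predict(model, vec) != truth:
                errors.append((vec, truth, boost))
        if not errors:
            break
        samples.extend(errors)
        model = train_centroids(samples)
    return model
```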
The invention discloses a method for constructing field feature vectors: on the basis of the data content, the metadata information of the fields is added, and the feature vector is constructed in combination with prior knowledge of the industry. For numerical data, the statistical features are combined with a probability distribution histogram; for character data, features such as the character pattern distribution are added to the existing word vectors, and the strings are matched against common regular expressions. Key fields are extracted according to the information entropy, then clustered and labeled to serve as a training set for quickly building a classification algorithm tailored to the industry.
The method was tested on several real business databases (backup copies) with a sample size of 1000, a key field threshold of 0.8, DBSCAN as the clustering algorithm, 512 histogram buckets for numerical fields, and 512-dimensional character string word vectors. It was applied to the databases of in-house software and of some commercial software, and the test results were within an acceptable range. The tests show that the method can rapidly extract the key fields of all tables in a database, extract features suited to the industry, accumulate industry knowledge, and greatly reduce the workload of manual labeling.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A method for rapidly classifying massive database tables, characterized by comprising the following steps: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata information, such as the attribute field type, and from summaries of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm; applying the trained classification algorithm to the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of the attribute fields of all database tables.
2. The method for rapidly classifying massive database tables according to claim 1, wherein calculating mutual information entropy to obtain the key attributes of each table comprises:
based on data sampled from the database, obtaining the dependency relationships among fields by calculating the mutual information entropy between the fields of a database table, and selecting, according to a threshold, the key fields that have the greatest influence on the other fields.
3. The method for rapidly classifying massive database tables according to claim 2, wherein the step of calculating mutual information entropy to obtain key attributes of each table specifically comprises the steps of:
randomly sampling the data, comprising:
when the database table is small, using the full data; when the table is large, using random sampling without replacement;
calculating the information entropy and the mutual information entropy of the fields, comprising:
the information entropy H(X) of a field X, where p(x) is the probability of the value x over the whole value range of X:
Figure FDA0002336435650000011
the mutual information entropy I(X, Y) of the field X and each remaining field Y, where p(x, y) is the probability of the data pair <x, y> over the whole value range of <X, Y>:
Figure FDA0002336435650000021
sequentially calculating the information entropy H(X) of every field and the mutual information entropy I(X, Y) between each field X and each remaining field Y to form a dependency graph among the fields, wherein a node vi in the dependency graph has correlation strength A(vi) and weight Wi:
Figure FDA0002336435650000022
Figure FDA0002336435650000023
A(vi) is the weight of the node vi in the field dependency graph and represents the strength of the correlation between vi and the other nodes in the graph; Wi is the weight of the node vi with respect to the other nodes;
and sorting the fields of the database table in descending order of field weight Wi, selecting the set of fields whose weights sum to more than a given threshold, and recording that set as the key field set of the table.
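The key-field selection of claim 3 can be sketched as follows. The formulas for H(X), I(X, Y), A(vi) and Wi appear in this record only as figure references, so the sketch assumes the standard Shannon definitions, H(X) = -sum over x of p(x) log2 p(x) and I(X, Y) = sum over (x, y) of p(x, y) log2(p(x, y) / (p(x) p(y))), and assumes that a field's weight is its summed mutual information with all other fields, normalized so the weights sum to 1. The function names, the `max_rows` limit and the 0.8 threshold are hypothetical.

```python
import math
import random
from collections import Counter


def sample_rows(rows, max_rows, seed=0):
    """Use every row of a small table; otherwise sample without replacement."""
    if len(rows) <= max_rows:
        return list(rows)
    return random.Random(seed).sample(rows, max_rows)


def entropy(values):
    """Shannon entropy H(X) in bits, from empirical value frequencies."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """I(X, Y) in bits, from empirical joint and marginal frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y).
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def key_fields(table, threshold=0.8):
    """table maps field name -> list of values (all lists the same length).

    Assumption (not from the patent text): weight = normalized sum of the
    field's mutual information with every other field.
    """
    names = list(table)
    strength = {
        a: sum(mutual_information(table[a], table[b]) for b in names if b != a)
        for a in names
    }
    total = sum(strength.values()) or 1.0
    weights = {a: s / total for a, s in strength.items()}
    # Descending by weight; keep fields until the cumulative weight
    # exceeds the threshold.
    selected, cumulative = [], 0.0
    for name in sorted(names, key=weights.get, reverse=True):
        selected.append(name)
        cumulative += weights[name]
        if cumulative > threshold:
            break
    return selected, weights
```

For a table in which field `a` fully determines field `b` while field `c` is independent of both, `key_fields` gives `a` and `b` equal weight and `c` a weight of zero, so only `a` and `b` are selected.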
4. The method for rapidly classifying massive database tables according to claim 1, wherein the step of computing the feature vectors of the fields comprises: performing stratified sampling again according to the key attributes of the database table; and calculating the feature vector of each attribute field from information such as the metadata of the data field and the statistical features of the data content, the feature extraction being divided into numerical feature extraction and character feature extraction.
5. The method for rapidly classifying massive database tables according to claim 4, wherein the numerical field feature extraction comprises:
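As an illustration of the re-sampling step in claim 4, the sketch below draws a stratified sample in which each distinct value of a key field forms one stratum. The claim does not specify the allocation rule, so proportional allocation with at least one row per stratum is an assumption, as are the function name and the dict-per-row layout.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each stratum of `key`.

    rows: list of dicts; key: name of a key field used to form the strata.
    Assumption: proportional allocation, with at least one row per stratum.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[key]].append(row)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))  # without replacement
    return sample
```

Compared with plain random sampling, this keeps rare values of the key field represented in the sample used for feature extraction.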
(1) calculating the statistical characteristics of the fields;
(2) normalizing the data, comprising:
standardizing the raw data to a distribution with mean 0 and standard deviation 1 using the z-score method;
the z-score formula is: X' = (X - avg(X)) / std(X), where X is the raw data, X' is the standardized data, avg(X) is the mean of X, and std(X) is the standard deviation of X;
(3) constructing a probability distribution histogram, comprising: first removing outliers by selecting only the data within [Q1 - 1.5 × IQR, Q3 + 1.5 × IQR] for binning, where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range, IQR = Q3 - Q1; and calculating the probability distribution histogram of the normalized data according to the number of bins, the number of bins determining the length of the feature vector;
(4) using the Kolmogorov-Smirnov (KS) test to judge whether the data follow a common distribution;
(5) forming the feature vector as: statistical features + normalized probability distribution histogram + distribution function parameters.
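A minimal sketch of the five numerical steps in claim 5, using only the Python standard library. Several choices are assumptions rather than details from the patent: the statistical features are min, max, mean and standard deviation; the only candidate distribution tested is the standard normal; and the "distribution function parameters" are represented by the KS statistic and a pass/fail flag. The 1.36 / sqrt(n) cutoff is the usual large-sample critical value at the 5% level, and it is conservative when the mean and standard deviation are estimated from the same data.

```python
import math
import statistics


def numeric_features(values, bins=10):
    # (1) Statistical features (this particular choice is an assumption).
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    stats = [min(values), max(values), mean, std]

    # (2) z-score standardization: X' = (X - avg(X)) / std(X).
    z = [(v - mean) / std for v in values] if std > 0 else [0.0] * len(values)

    # (3) Drop outliers outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], then bin.
    q1, _, q3 = statistics.quantiles(z, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in z if low <= v <= high]
    left, right = min(kept), max(kept)
    width = (right - left) / bins or 1.0  # guard against a constant column
    counts = [0] * bins
    for v in kept:
        counts[min(int((v - left) / width), bins - 1)] += 1
    hist = [c / len(kept) for c in counts]

    # (4) One-sample KS statistic against the standard normal CDF.
    ordered = sorted(z)
    n = len(ordered)
    cdf = statistics.NormalDist().cdf
    d = max(
        max(cdf(v) - i / n, (i + 1) / n - cdf(v))
        for i, v in enumerate(ordered)
    )
    looks_normal = d < 1.36 / math.sqrt(n)

    # (5) Concatenate: statistics + histogram + distribution-fit values.
    return stats + hist + [d, float(looks_normal)]
```

The vector length is 4 + `bins` + 2, so the bin count fixes the feature length, as the claim states.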
6. The method for rapid classification of mass database tables as claimed in claim 4, wherein the character type field feature extraction comprises:
(1) reading data of the attribute field, and extracting character length distribution characteristics of the character string;
(2) extracting character mode distribution characteristics of the character string;
(3) matching the extracted character patterns of the strings against preset regular expressions to determine whether the strings conform to them;
(4) extracting character string features by using a natural language processing method;
(5) carrying out named entity recognition;
(6) combining the extracted features into a feature vector.
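Steps (1) to (3) and (6) of claim 6 can be sketched with the standard library alone. Steps (4) and (5), natural language processing and named entity recognition, depend on external models and are left out. The preset patterns, the `max_len` scaling constant and the digit/letter pattern encoding are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical preset patterns; a real deployment would supply its own.
PRESET_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "digits": re.compile(r"\d+"),
}


def char_pattern(s):
    """Map digits to '9' and letters to 'A', keep other characters,
    and collapse consecutive repeats of the same symbol."""
    out = []
    for ch in s:
        sym = "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        if not out or out[-1] != sym:
            out.append(sym)
    return "".join(out)


def string_features(values, max_len=32, top_patterns=3):
    n = len(values)
    # (1) Length distribution: min, max and mean length, scaled by max_len.
    lengths = [len(v) for v in values]
    length_feats = [
        min(lengths) / max_len,
        max(lengths) / max_len,
        sum(lengths) / n / max_len,
    ]
    # (2) Character pattern distribution: share of the most common patterns.
    patterns = Counter(char_pattern(v) for v in values)
    pattern_feats = [c / n for _, c in patterns.most_common(top_patterns)]
    pattern_feats += [0.0] * (top_patterns - len(pattern_feats))
    # (3) Share of values that fully match each preset regular expression.
    regex_feats = [
        sum(bool(p.fullmatch(v)) for v in values) / n
        for p in PRESET_PATTERNS.values()
    ]
    # (6) Combine everything into one fixed-length vector.
    return length_feats + pattern_feats + regex_feats
```

For a column of ISO dates, every value collapses to the pattern `9-9-9` and the `date` expression matches all values, which is the kind of signal the later clustering step can separate from free text.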
7. The method for rapidly classifying massive database tables according to claim 1, wherein clustering the key attributes using a machine learning clustering algorithm and labeling the cluster centers comprises: extracting the feature vectors of the key attributes according to step 2; clustering the feature vectors of the key attributes using an unsupervised clustering algorithm; labeling the cluster centers to form training samples for training a machine learning algorithm; and extending the label of each cluster center to the other attributes in its cluster.
8. The method for rapidly classifying massive database tables according to claim 1, wherein forming the training set and training the classification algorithm comprises:
training a supervised classification model on the extracted key attribute features and their assigned labels; classifying the remaining attributes using their extracted feature vectors and the trained classification model; randomly spot-checking the classification results; increasing the weight of the misclassified data and adding it to the training set; retraining the classification algorithm; iterating step by step; and finally outputting the trained classification algorithm.
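The cluster-then-label step of claim 7 might look like the sketch below. The claim does not name a clustering algorithm, so a small k-means is used as a stand-in. The `label_center` callback is hypothetical and represents the human annotator, who is asked about only one representative per cluster, namely the member nearest the cluster center.

```python
import math
import random


def kmeans(vectors, k, iters=20, seed=0):
    """Plain k-means (a stand-in; the claim names no specific algorithm)."""
    centers = random.Random(seed).sample(vectors, k)

    def nearest(v):
        return min(range(k), key=lambda c: math.dist(v, centers[c]))

    for _ in range(iters):
        assign = [nearest(v) for v in vectors]
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, [nearest(v) for v in vectors]


def propagate_labels(vectors, assign, centers, label_center):
    """Label one representative per cluster and extend it to the cluster.

    label_center(i) stands for the manual labeling of vectors[i].
    """
    labels = {}
    for c, center in enumerate(centers):
        members = [i for i, a in enumerate(assign) if a == c]
        if members:
            rep = min(members, key=lambda i: math.dist(vectors[i], center))
            labels[c] = label_center(rep)
    return [labels[a] for a in assign]
```

With k clusters, the annotator labels at most k attributes, and every other key attribute inherits the label of its cluster, which is what keeps the manual effort small on very large schemas.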
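The train, spot-check and retrain loop of claim 8 can be illustrated as follows. The classifier here is a weighted nearest-centroid model chosen only to keep the sketch self-contained; the claim does not specify the model. The `oracle` callback is hypothetical and stands for the manual spot check, and `rounds`, `check` and `boost` are illustrative parameters.

```python
import math
import random
from collections import defaultdict


def train_centroids(samples):
    """samples: list of (vector, label, weight) -> label -> weighted mean."""
    sums, totals = {}, defaultdict(float)
    for vec, label, w in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += w * x
        totals[label] += w
    return {lab: [s / totals[lab] for s in acc] for lab, acc in sums.items()}


def predict(model, vec):
    """Return the label whose centroid is nearest to vec."""
    return min(model, key=lambda lab: math.dist(vec, model[lab]))


def refine(train, pool, oracle, rounds=3, check=5, boost=2.0, seed=0):
    """Spot-check predictions on `pool`; re-add errors with a higher weight.

    oracle(vec) stands for the manual check and returns the true label.
    """
    rng = random.Random(seed)
    train = list(train)
    model = train_centroids(train)
    for _ in range(rounds):
        errors = []
        for vec in rng.sample(pool, min(check, len(pool))):
            truth = oracle(vec)
            if predict(model, vec) != truth:
                errors.append((vec, truth, boost))
        if not errors:
            break  # the sampled results were all correct
        train.extend(errors)
        model = train_centroids(train)
    return model
```

Each misclassified sample pulls its class centroid toward itself with weight `boost`, so repeated rounds shift the decision boundary toward the cases the spot check found wrong.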
CN201911357917.6A 2019-12-25 2019-12-25 Method for quickly classifying massive database tables Active CN111104466B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911357917.6A CN111104466B (en) 2019-12-25 2019-12-25 Method for quickly classifying massive database tables

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911357917.6A CN111104466B (en) 2019-12-25 2019-12-25 Method for quickly classifying massive database tables

Publications (2)

Publication Number Publication Date
CN111104466A true CN111104466A (en) 2020-05-05
CN111104466B CN111104466B (en) 2023-07-28

Family

ID=70425147

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911357917.6A Active CN111104466B (en) 2019-12-25 2019-12-25 Method for quickly classifying massive database tables

Country Status (1)

Country Link
CN (1) CN111104466B (en)

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860575A (en) * 2020-06-05 2020-10-30 百度在线网络技术(北京)有限公司 Method and device for processing article attribute information, electronic equipment and storage medium
CN111913954A (en) * 2020-06-20 2020-11-10 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN112380205A (en) * 2020-11-17 2021-02-19 北京融七牛信息技术有限公司 Method and system for automatically generating characteristics of distributed architecture
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method
CN112530597A (en) * 2020-11-26 2021-03-19 山东健康医疗大数据有限公司 Data table classification method, device and medium based on Bert character model
CN112614005A (en) * 2020-11-30 2021-04-06 国网北京市电力公司 Enterprise rework state processing method and device
CN113094567A (en) * 2021-03-31 2021-07-09 四川新网银行股份有限公司 Malicious complaint identification method and system based on text clustering
CN113435199A (en) * 2021-07-18 2021-09-24 谢勇 Storage and reading interference method and system for character corresponding culture
CN113761297A (en) * 2020-11-10 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for determining field relevancy in database table
CN114528288A (en) * 2021-08-31 2022-05-24 天津工业大学 Design method of multi-type organ chip database
CN115168345A (en) * 2022-06-27 2022-10-11 天翼爱音乐文化科技有限公司 Database classification method, system, device and storage medium
WO2023098034A1 (en) * 2021-11-30 2023-06-08 深圳前海微众银行股份有限公司 Business data report classification method and apparatus
US11720533B2 (en) 2021-11-29 2023-08-08 International Business Machines Corporation Automated classification of data types for databases

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060074953A1 (en) * 2004-09-30 2006-04-06 International Business Machines Corporation Metadata management for a data abstraction model
US20160188710A1 (en) * 2014-12-29 2016-06-30 Wipro Limited METHOD AND SYSTEM FOR MIGRATING DATA TO NOT ONLY STRUCTURED QUERY LANGUAGE (NoSOL) DATABASE
CN105844398A (en) * 2016-03-22 2016-08-10 武汉大学 PLM (product life-cycle management) database-based mining algorithm for DPIPP (distributed parameterized intelligent product platform) product families
CN107103025A (en) * 2017-01-05 2017-08-29 北京亚信智慧数据科技有限公司 A kind of data processing method and data processing platform (DPP)
CN107294993A (en) * 2017-07-05 2017-10-24 重庆邮电大学 A kind of WEB abnormal flow monitoring methods based on integrated study
CN109408555A (en) * 2018-09-19 2019-03-01 智器云南京信息科技有限公司 Data type recognition methods and device, data storage method and device
US20190311149A1 (en) * 2018-04-08 2019-10-10 Imperva, Inc. Detecting attacks on databases based on transaction characteristics determined from analyzing database logs
CN110377754A (en) * 2019-07-01 2019-10-25 北京信息科技大学 A kind of database body learning optimization method based on decision tree
CN110427992A (en) * 2019-07-23 2019-11-08 杭州城市大数据运营有限公司 Data matching method, device, computer equipment and storage medium
CN110597816A (en) * 2019-09-17 2019-12-20 深圳追一科技有限公司 Data processing method, data processing device, computer equipment and computer readable storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
NILESHKUMAR D ET AL: "Proposed efficient approach for classification for multi-relational data mining using Bayesian Belief Network" *
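丁国辉 et al.: "Multi-schema matching based on the DBSCAN clustering algorithm" (基于DBSCAN聚类算法的多模式匹配), vol. 33, no. 33 *
马军; 宋玲; 韩晓晖; 闫泼: "Deep Web database classification based on web page context" (基于网页上下文的Deep Web数据库分类), vol. 19, no. 02 *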

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111860575B (en) * 2020-06-05 2023-06-16 百度在线网络技术(北京)有限公司 Method and device for processing object attribute information, electronic equipment and storage medium
CN111860575A (en) * 2020-06-05 2020-10-30 百度在线网络技术(北京)有限公司 Method and device for processing article attribute information, electronic equipment and storage medium
CN111913954A (en) * 2020-06-20 2020-11-10 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN111913954B (en) * 2020-06-20 2023-08-04 杭州城市大数据运营有限公司 Intelligent data standard catalog generation method and device
CN113761297A (en) * 2020-11-10 2021-12-07 北京沃东天骏信息技术有限公司 Method and device for determining field relevancy in database table
CN112434032B (en) * 2020-11-17 2024-04-05 北京融七牛信息技术有限公司 Automatic feature generation system and method
CN112434032A (en) * 2020-11-17 2021-03-02 北京融七牛信息技术有限公司 Automatic feature generation system and method
CN112380205A (en) * 2020-11-17 2021-02-19 北京融七牛信息技术有限公司 Method and system for automatically generating characteristics of distributed architecture
CN112380205B (en) * 2020-11-17 2024-04-02 北京融七牛信息技术有限公司 Automatic feature generation method and system of distributed architecture
CN112380348B (en) * 2020-11-25 2024-03-26 中信百信银行股份有限公司 Metadata processing method, apparatus, electronic device and computer readable storage medium
CN112380348A (en) * 2020-11-25 2021-02-19 中信百信银行股份有限公司 Metadata processing method and device, electronic equipment and computer-readable storage medium
CN112530597A (en) * 2020-11-26 2021-03-19 山东健康医疗大数据有限公司 Data table classification method, device and medium based on Bert character model
CN112614005A (en) * 2020-11-30 2021-04-06 国网北京市电力公司 Enterprise rework state processing method and device
CN112614005B (en) * 2020-11-30 2024-04-30 国网北京市电力公司 Method and device for processing reworking state of enterprise
CN113094567A (en) * 2021-03-31 2021-07-09 四川新网银行股份有限公司 Malicious complaint identification method and system based on text clustering
CN113435199A (en) * 2021-07-18 2021-09-24 谢勇 Storage and reading interference method and system for character corresponding culture
CN114528288A (en) * 2021-08-31 2022-05-24 天津工业大学 Design method of multi-type organ chip database
US11720533B2 (en) 2021-11-29 2023-08-08 International Business Machines Corporation Automated classification of data types for databases
WO2023098034A1 (en) * 2021-11-30 2023-06-08 深圳前海微众银行股份有限公司 Business data report classification method and apparatus
CN115168345A (en) * 2022-06-27 2022-10-11 天翼爱音乐文化科技有限公司 Database classification method, system, device and storage medium

Also Published As

Publication number Publication date
CN111104466B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
CN111104466B (en) Method for quickly classifying massive database tables
CN107633007B (en) Commodity comment data tagging system and method based on hierarchical AP clustering
CN109165294B (en) Short text classification method based on Bayesian classification
WO2017166912A1 (en) Method and device for extracting core words from commodity short text
CN110826320B (en) Sensitive data discovery method and system based on text recognition
US10089581B2 (en) Data driven classification and data quality checking system
CN110825877A (en) Semantic similarity analysis method based on text clustering
CN112632228A (en) Text mining-based auxiliary bid evaluation method and system
CN108519971B (en) Cross-language news topic similarity comparison method based on parallel corpus
CN104834651B (en) Method and device for providing high-frequency question answers
CN111158641B (en) Automatic recognition method for transaction function points based on semantic analysis and text mining
US10083403B2 (en) Data driven classification and data quality checking method
CN113360647B (en) 5G mobile service complaint source-tracing analysis method based on clustering
CN110928981A (en) Method, system and storage medium for establishing and perfecting iteration of text label system
CN112836509A (en) Expert system knowledge base construction method and system
CN112580332B (en) Enterprise portrait method based on label layering and deepening modeling
CN110222192A (en) Corpus method for building up and device
CN112395881A (en) Material label construction method and device, readable storage medium and electronic equipment
CN111753067A (en) Innovative assessment method, device and equipment for technical background text
CN114722810A (en) Real estate customer portrait method and system based on information extraction and multi-attribute decision
CN111626331B (en) Automatic industry classification device and working method thereof
CN112989053A (en) Periodical recommendation method and device
CN107609921A (en) A kind of data processing method and server
CN114511027B (en) Method for extracting English remote data through big data network
CN114049165B (en) Commodity price comparison method, device, equipment and medium for purchasing system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210916

Address after: 100854 east gate, 52 Yongding Road, Haidian District, Beijing

Applicant after: China Changfeng Electromechanical Technology Research and Design Institute

Address before: 100854 east gate, 52 Yongding Road, Haidian District, Beijing

Applicant before: Aerospace Science and Technology Network Information Development Co., Ltd.

GR01 Patent grant