CN111104466A - Method for rapidly classifying massive database tables - Google Patents
- Publication number
- CN111104466A (application number CN201911357917.6A)
- Authority
- CN
- China
- Prior art keywords
- data
- field
- fields
- classification
- clustering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a method for rapidly classifying massive database tables. The method comprises: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata such as the attribute field type and from summaries of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm; applying the trained classification algorithm to classify the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of all database table attribute fields. By combining the metadata and the content of database fields into field feature vectors, clustering the key fields of the database under analysis and assigning data domains to them (labeling), and building a training set to train an industry-specific classification algorithm, the method reduces the manual processing workload.
Description
Technical Field
The invention relates to a database technology, in particular to a method for quickly classifying massive database tables.
Background
In the process of data warehouse construction, data cataloging and cleaning consume a large amount of manpower and material resources, and one important task is classifying the database tables. Classifying and labeling the database tables identifies their data domains (for example, fields representing customers, products, quantities, or amounts), establishes a data catalog, supplements missing metadata, and helps formulate data quality rules for finding data quality problems, so that subsequent data governance and improvement can be targeted.
Existing data classification methods require implementation staff to work from database design documents, table structure remarks, and the like. They depend heavily on the experience of those staff, and every piece of metadata must be confirmed one by one, which is time-consuming and laborious. When the number of data types and the data volume are huge, the labor cost is enormous. Machine learning methods have therefore been introduced into data governance: starting from existing data classification labels, clustering and classification methods let a computer assist with classifying and labeling the data, which reduces repetitive work and improves efficiency.
One of the key technologies for realizing this approach is extracting feature vectors of the database fields (character and numeric types) and identifying their data domains by training a machine learning classification algorithm.
The features of a database table field should contain two parts: the metadata of the field and the data content of the field. The metadata of a field is its profile and description, including the field type, length, distribution characteristics of the field content, pattern, and so on. For example, for an email address field, the metadata includes: the field is a character string, its length does not exceed 256 characters, and its pattern matches the regular expression of an email address. For a sales amount field, the metadata includes: the field is numeric, its precision is two digits after the decimal point, its data distribution is approximately normal, and its maximum, minimum, mean, variance, and so on lie within specific ranges. Adding this metadata when constructing the features of database fields allows the data domain of a field to be distinguished more accurately and quickly.
Disclosure of Invention
The invention aims to provide a method for quickly classifying mass database tables, which is used for solving the problems in the prior art.
The invention discloses a method for rapidly classifying massive database tables, comprising: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from metadata such as the attribute field type and from summaries of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm; applying the trained classification algorithm to classify the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of all database table attribute fields.
According to an embodiment of the method for rapidly classifying the mass database tables, the step of calculating the mutual information entropy to obtain the key attributes of each table comprises the following steps: according to the sampling data of the database, the dependency relationship among different fields is obtained by calculating the mutual information entropy among the fields of the database table, and the key field with the largest influence on other fields is selected according to the threshold value.
According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating the mutual information entropy to obtain the key attributes of each table specifically comprises: random data sampling, wherein the full data is used when the database table is small, and random sampling without replacement is used when the table is large; and calculating the information entropy and the mutual information entropy of the fields, wherein the information entropy H(X) of a field X is
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
where p(x) is the probability of the value x over the whole value range of X;
the mutual information entropy I(X, Y) of the field X and another field Y is
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x, y) is the probability of the data pair <x, y> over the whole value range of <X, Y>;
the information entropy H(X) of every field and the mutual information entropy I(X, Y) of each field X with every other field Y are calculated in turn to form a dependency graph among the fields, in which each node vi has a correlation strength A(vi) and a weight Wi;
A(vi) represents the strength of the correlation between the node vi and the other nodes in the dependency graph, and Wi is the weight of the node vi relative to the other nodes; the fields of the database table are sorted in descending order of weight Wi, and the set of fields whose cumulative weight exceeds a given threshold is selected and recorded as the key field set of the table.
According to an embodiment of the method for rapidly classifying massive database tables, the step of calculating the feature vectors of the fields comprises: performing stratified sampling again according to the key attributes of the database table; and calculating the feature vector of each attribute field from the metadata of the data field, the statistical features of the data content, and the like, the calculation being divided into numeric feature extraction and character feature extraction.
According to an embodiment of the method for rapidly classifying massive database tables, extracting the numeric field features comprises: (1) calculating the statistical features of the field; (2) normalizing the data, comprising: standardizing the raw data to zero mean and unit standard deviation using z-score normalization, where the z-score formula is X' = (X - avg(X)) / std(X), X is the raw data, X' is the normalized data, avg(X) is the mean of X, and std(X) is the standard deviation of X; (3) constructing a probability distribution histogram, comprising: first removing the outliers and selecting the data within [Q1 - 1.5 IQR, Q3 + 1.5 IQR] for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range; and calculating the probability distribution histogram of the normalized data according to the number of buckets, wherein the number of buckets determines the length of the feature vector; (4) judging whether the data fits a common data distribution using the Kolmogorov-Smirnov (KS) test; (5) feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.
According to an embodiment of the method for rapidly classifying massive database tables, extracting the character field features comprises: (1) reading the data of the attribute field and extracting the length distribution features of the character strings; (2) extracting the character pattern distribution features of the character strings; (3) matching the character strings against preset regular expressions to determine whether they conform; (4) extracting character string features using a natural language processing method; (5) performing named entity recognition; (6) combining the feature vectors.
According to an embodiment of the method for rapidly classifying massive database tables, clustering the key attributes with a machine learning clustering algorithm and labeling the cluster centers comprises: extracting the feature vectors of the key attributes as in step 2; clustering the feature vectors of the key attributes with an unsupervised clustering algorithm; labeling the cluster centers; and extending each label to the other attributes in the same cluster, thereby forming training samples for training a machine learning algorithm.
According to an embodiment of the method for rapidly classifying massive database tables, forming the training set and training the classification algorithm comprises: training a supervised classification model from the extracted key attribute features and the assigned labels; extracting the feature vectors of the remaining attributes and classifying them with the trained model; spot-checking the classification results by random sampling; increasing the weight of the misclassified data and adding it to the training set; retraining the classification algorithm; and iterating in this way until the trained classification algorithm is output.
The invention provides an accurate and rapid database field classification method and processing flow that enriches the extracted field features, builds a classification algorithm from scratch, and assists the user in classifying fields. It combines the metadata and the content of database fields into field feature vectors, clusters the key fields of the database under analysis and assigns data domains to them (labeling), builds a training set, and trains an industry-specific classification algorithm, thereby reducing the manual processing workload.
Drawings
FIG. 1 is an overall process flow diagram;
FIG. 2 is a flow chart of a process for extracting key fields of a database table;
FIG. 3 is a field dependency diagram;
FIG. 4 is a flow diagram of a process for extracting field features;
FIG. 5 is a diagram of extracting numeric field features: statistical features and probability distribution histograms;
FIG. 6 is a flow chart of extracting character type field features.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
FIG. 1 is a flow chart of the method for rapidly classifying massive database tables according to the present invention. As shown in FIG. 1, the invention first obtains the key attributes of each table by calculating mutual information entropy; constructs feature vectors for the selected attributes from metadata such as the attribute field type (currently character and numeric types) and from summaries of the data content; clusters the key attributes with a machine learning clustering algorithm; labels the cluster centers to form a training set and trains a classification algorithm; applies the trained classification algorithm to classify the remaining attributes; spot-checks the classification results by sampling and feeds the errors back to optimize the classification algorithm; and finally outputs the categories of all database table attribute fields.
1. The step of identifying the key attributes of a database table comprises:
according to the sampling data of the database, the dependency relationship among different fields is obtained by calculating the mutual information entropy among the fields of the database table, and the key field with the largest influence on other fields is selected according to the threshold value. The process flow is shown in fig. 2.
(1) Random data sampling:
When the database table is small (for example, fewer than 1000 rows), the full data is used; when the table is large, random sampling without replacement is used, so that the sample represents the full data as closely as possible.
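As a minimal sketch of this sampling rule, the following Python function returns every row of a small table and otherwise draws a random sample without replacement. The function name, the `sample_size` default (taken from the sample scale of 1000 mentioned in the test section), and the `seed` parameter are illustrative assumptions, not part of the disclosed method.

```python
import random


def sample_rows(rows, small_table_limit=1000, sample_size=1000, seed=None):
    """Return all rows of a small table, else a random sample without replacement."""
    if len(rows) < small_table_limit:
        return list(rows)  # small table: use the full data
    rng = random.Random(seed)
    # random.sample draws distinct positions, i.e. sampling without replacement
    return rng.sample(rows, min(sample_size, len(rows)))
```

In practice the rows would usually be sampled inside the database rather than after loading the whole table; the in-memory list here only keeps the sketch self-contained.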
(2) Calculating the information entropy and the mutual information entropy of the fields:
The information entropy H(X) of a field X is
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
where p(x) is the probability of the value x over the whole value range of X.
The mutual information entropy I(X, Y) of the field X and another field Y is
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x, y) is the probability of the data pair <x, y> over the whole value range of <X, Y>.
FIG. 3 shows a field dependency graph. As shown in FIG. 3, the information entropy H(X) of every field and the mutual information entropy I(X, Y) of each field X with every other field Y are calculated in turn to form a dependency graph among the fields, in which each node vi has a correlation strength A(vi) and a weight Wi.
A(vi) represents the strength of the correlation between the node vi and the other nodes in the dependency graph; Wi is the weight of the node vi relative to the other nodes.
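The two quantities can be estimated from sampled column values by counting frequencies. The sketch below uses the standard Shannon definitions with empirical probabilities; the base-2 logarithm and the function names are assumptions made for illustration, since the text does not fix a logarithm base.

```python
import math
from collections import Counter


def entropy(values):
    """Empirical information entropy H(X) of one column, in bits."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X, Y) of two aligned columns, in bits."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )
```

Two identical columns give a mutual information equal to their entropy, and two independent columns give a value of zero, which is what makes the measure usable as an edge weight in the dependency graph.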
(3) The fields of the database table are sorted in descending order of their weights Wi, and the set of fields whose cumulative weight exceeds a given threshold (such as 0.8) is selected and recorded as the key field set of the table.
Pseudocode for selecting key attributes based on threshold inversion selection is as follows:
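The pseudocode is not reproduced in this text, and neither are the formulas for A(vi) and Wi. The Python sketch below is therefore only one plausible reading: it assumes that A(vi) is the sum of the mutual information between vi and every other field, and that Wi is A(vi) normalized so that all weights sum to 1. Both assumptions, and the function and parameter names, are illustrative.

```python
def select_key_fields(mi, threshold=0.8):
    """Pick key fields from pairwise mutual information.

    mi maps frozenset({field_a, field_b}) -> I(field_a, field_b).
    """
    fields = sorted({f for pair in mi for f in pair})
    # ASSUMPTION: A(v_i) = sum of I(v_i, v_j) over all other fields v_j.
    strength = {f: sum(v for pair, v in mi.items() if f in pair) for f in fields}
    total = sum(strength.values())
    if total == 0:
        return fields  # no dependencies found: keep every field
    # ASSUMPTION: W_i = A(v_i) / sum_j A(v_j), so the weights sum to 1.
    weights = {f: s / total for f, s in strength.items()}
    selected, cumulative = [], 0.0
    for f in sorted(fields, key=lambda name: weights[name], reverse=True):
        selected.append(f)
        cumulative += weights[f]
        if cumulative > threshold:  # stop once the cumulative weight passes the threshold
            break
    return selected
```

With normalized weights, a threshold of 0.8 keeps the smallest set of highest-weight fields that together carry more than 80% of the total dependency strength.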
2. The step of calculating the feature vectors of the fields comprises:
Stratified sampling is performed again according to the key attributes of the database table, and the feature vector of each attribute field is calculated from the metadata of the data field, the statistical features of the data content, and the like. The calculation is divided mainly into numeric feature extraction and character feature extraction. The process flow is shown in FIG. 4; FIG. 5 and FIG. 6 refine the extraction process for the numeric and character fields in FIG. 4, respectively.
2.1 Reading data table metadata
The field metadata of the database table, including field names, types, lengths, remarks, and other information, is acquired via JDBC or similar means.
2.2 Stratified sampling based on key fields
The key attributes are identified by the method of "1. Identifying the key attributes of a database table", the data distribution of the key attributes is read, and stratified sampling is performed according to that distribution.
2.3 Traversing the fields and extracting feature vectors for numeric and character fields respectively
2.3.1 Numeric field feature extraction
FIG. 5 illustrates the extraction of numeric field features. As shown in FIG. 5, for numeric data, basic statistics such as the maximum, minimum, mean, variance, standard deviation, median, and mode are calculated, the data is normalized, and a probability distribution histogram is computed; the basic statistics and the probability distribution histogram together form the feature vector of the numeric data.
(1) Calculating statistical characteristics of fields
The maximum, minimum, mean, variance, standard deviation, quartiles (the 1/4 quantile Q1 and the 3/4 quantile Q3), and mode are calculated. These statistical features give a direct picture of the rough distribution of the data set and help the subsequent classification.
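These statistics can be computed with Python's standard `statistics` module, as in the sketch below. The choice of population (rather than sample) variance and the dictionary layout are assumptions made for illustration.

```python
import statistics


def numeric_stats(values):
    """Basic statistical features of one numeric column."""
    q1, median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),  # population variance (assumption)
        "std": statistics.pstdev(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "mode": statistics.mode(values),
    }
```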
(2) Data normalization processing
The raw data is standardized to zero mean and unit standard deviation using z-score normalization, which puts fields with different scales on a common footing and simplifies subsequent processing.
The z-score formula is as follows: X' = (X - avg(X)) / std(X), where X is the raw data, X' is the normalized data, avg(X) is the mean of X, and std(X) is the standard deviation of X.
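A direct transcription of the z-score formula is sketched below. Returning zeros for a constant column, where std(X) is 0, is an assumption added to avoid division by zero.

```python
import statistics


def z_score(values):
    """Standardize values to zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]
```

For example, the column `[2, 4, 4, 4, 5, 5, 7, 9]` has mean 5 and standard deviation 2, so the value 9 maps to 2.0 and the value 2 maps to -1.5.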
(3) Constructing a probability distribution histogram
The outliers are first removed, and the data within [Q1 - 1.5 IQR, Q3 + 1.5 IQR] is selected for bucketing, where Q1 is the first quartile, Q3 is the third quartile, and IQR = Q3 - Q1 is the interquartile range. The probability distribution histogram of the normalized data is then calculated according to the number of buckets, where the number of buckets determines the length of the feature vector.
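The outlier filter and the bucketing can be sketched as follows. Equal-width buckets spanning the retained data are an assumption, since the text only specifies the number of buckets; the function name and the default bucket count are illustrative.

```python
import statistics


def iqr_histogram(values, buckets=8):
    """Drop IQR outliers, then return bucket frequencies that sum to 1."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if low <= v <= high]
    lo, hi = min(kept), max(kept)
    width = (hi - lo) / buckets or 1.0  # fall back to 1.0 when all kept values are equal
    counts = [0] * buckets
    for v in kept:
        idx = min(int((v - lo) / width), buckets - 1)  # the maximum falls in the last bucket
        counts[idx] += 1
    return [c / len(kept) for c in counts]
```

The returned list has exactly `buckets` entries, so it contributes a fixed-length segment to the feature vector regardless of how many rows were sampled.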
(4) Determining the relationship between the data and common distributions
The Kolmogorov-Smirnov (KS) test is used to judge whether the data fits a common distribution such as the 0-1 (Bernoulli) distribution, binomial distribution, normal distribution, Poisson distribution, uniform distribution, or exponential distribution; if so, the parameters of the corresponding distribution function are calculated.
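For the normal distribution, the check can be sketched with the standard library alone: fit a normal distribution to the data, compute the KS distance between the empirical and fitted cumulative distribution functions, and compare it with a critical value. The coefficient 1.36 is the usual large-sample critical value at the 5% level for a fully specified distribution; because the parameters here are estimated from the same data, the check is conservative. The function names are illustrative, and the other distributions in the list would need their own fitted CDFs.

```python
import math
import statistics


def ks_statistic_normal(values):
    """KS distance between the empirical CDF and a normal fitted to the data."""
    data = sorted(values)
    n = len(data)
    fitted = statistics.NormalDist(statistics.fmean(data), statistics.pstdev(data))
    d = 0.0
    for i, v in enumerate(data):
        cdf = fitted.cdf(v)
        # compare the fitted CDF with the empirical CDF just after and just before v
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d, (fitted.mean, fitted.stdev)


def looks_normal(values, coefficient=1.36):
    """Return (fits, (mean, stdev)) using an approximate 5% critical value."""
    d, params = ks_statistic_normal(values)
    return d < coefficient / math.sqrt(len(values)), params
```

The returned `(mean, stdev)` pair corresponds to the distribution function parameters that enter the feature vector. A constant column has zero standard deviation and would need to be handled before calling this sketch.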
(5) Combined feature vector
Feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.
The final feature vector is one long array composed of the statistical features obtained in step (1), the normalized probability distribution histogram array obtained in step (3), and the distribution function parameter array [distribution type, parameter 1, parameter 2] obtained in step (4) (common distribution functions have at most two parameters). Its composition is as follows:
2.3.2 character type field feature extraction
For character data, the length distribution features of the character strings (maximum, minimum, mean, median, and so on of the string length) and the character distribution features (number of occurrences and positions of letters, digits, special symbols, and so on) are calculated first. The strings are then matched against common regular expressions (such as email address, postcode, mobile phone number, and identity card number). Word vectors of the strings are extracted using natural language processing methods, and named entity recognition is performed to judge whether the strings are names of people, places, objects, and so on. Finally, this information is combined to form the feature vector of the character data. FIG. 6 is a flow chart of extracting character field features; the steps shown in FIG. 6 are as follows:
(1) reading data of attribute field, and extracting character length distribution characteristic of character string
First, the data content of the attribute field is acquired, and the maximum, minimum, mean, variance, standard deviation, mode, and so on of the string length are calculated to characterize the length distribution of the strings statistically.
(2) Extracting character pattern distribution characteristics of character string
The character pattern refers to the sequence of character types that make up a string. The string is split into individual characters: letters map to \w (or A), digits map to \d (or #), and special symbols (comma, semicolon, quotation mark, dash, ellipsis, dot, asterisk, full stop, space, the four arithmetic operators, and so on) map to themselves, so that the string is converted into a pattern in a format like "\\ \w \d \d- \". The character patterns are aggregated and counted, and the TOP5 patterns are selected.
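The pattern conversion and the TOP5 selection can be sketched as follows, using the alternative notation from the text (A for letters, # for digits). Keeping every other character unchanged is an assumption about how special symbols are mapped; the function names are illustrative.

```python
from collections import Counter


def char_pattern(text):
    """Map letters to 'A' and digits to '#'; keep other characters unchanged (assumption)."""
    out = []
    for ch in text:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("#")
        else:
            out.append(ch)
    return "".join(out)


def top_patterns(strings, k=5):
    """Return the k most frequent character patterns with their counts."""
    return Counter(char_pattern(s) for s in strings).most_common(k)
```

A column of order codes such as `AB12-3` and `CD34-5` collapses to the single pattern `AA##-#`, which is the kind of regularity that separates codes and identifiers from free text.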
(3) Common regular expression matching
The character strings are matched against preset regular expressions, such as those for email addresses, mobile phone numbers, and identity card numbers, to determine whether they conform.
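One way to turn the matching into features is to record, for each preset expression, the fraction of sampled strings that match it completely. The expressions below are simplified illustrations chosen for this sketch (mainland-China style mobile numbers, postcodes, and 18-character identity card numbers); they are assumptions, not the presets used by the invention.

```python
import re

# ASSUMPTION: simplified illustrative patterns, not the patent's actual presets.
PRESET_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "cn_mobile": re.compile(r"1[3-9]\d{9}"),
    "cn_postcode": re.compile(r"\d{6}"),
    "cn_id_card": re.compile(r"\d{17}[\dXx]"),
}


def regex_match_ratios(strings):
    """Fraction of strings that fully match each preset pattern."""
    total = len(strings) or 1  # avoid division by zero on an empty sample
    return {
        name: sum(1 for s in strings if pattern.fullmatch(s)) / total
        for name, pattern in PRESET_PATTERNS.items()
    }
```

Using ratios rather than a yes/no flag tolerates a small share of dirty values in an otherwise well-formed column.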
(4) Extracting character string features using natural language processing techniques
First, the character strings are segmented into words, and word vectors are extracted using techniques such as TF-IDF, CBOW, and Word2Vec; the whole content of the field is then treated as one document, and a text feature vector of the field is constructed using methods such as Doc2Vec.
(5) Named entity recognition
Named entity recognition identifies whether the field content is a person name, an institution name, a place name, an object name, and so on, which provides an important basis for assigning a data domain. The parts of speech, word vectors, and so on are analyzed and matched against preset corpora such as person names and place names, and the category whose similarity exceeds a given threshold is taken as the result.
(6) Combined feature vector
The finally formed feature vector is the character string length statistical feature + the pattern distribution feature + the word vector + the matched regular expression + the named entity recognition result, and the composition is as follows:
2.4 constructing feature vectors for attributes
The feature vectors of numeric and character fields are brought to a uniform length, and the attribute feature vector is formed as follows:
| Field type (character or numeric) | Field feature vector obtained in step 2.3.1 or 2.3.2 |
3. Data clustering and labeling
The feature vectors of the key attributes are extracted as in step 2, the key attributes are clustered using a machine learning method, and labels are assigned to form training samples for training a machine learning algorithm. Specifically, the feature vectors of the key attributes are clustered using an unsupervised clustering algorithm (such as a density-based clustering algorithm), the cluster centers are labeled (a label system is established in advance), and the system automatically extends each label to the other attributes in the same cluster.
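The clustering and label extension can be sketched with a small density-based clusterer. The code below is a compact pure-Python DBSCAN (the algorithm named in the test section), followed by a helper that copies the label given to each cluster onto all of its members. The `eps` and `min_pts` values, the function names, and the idea of passing labels as a mapping from cluster id to label are assumptions for illustration; DBSCAN itself does not produce explicit centers, so one representative per cluster would be labeled in practice.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point (tuples of floats); NOISE marks outliers."""
    def neighbours(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb = neighbours(j)
            if len(nb) >= min_pts:  # j is a core point: keep expanding
                queue.extend(nb)
    return labels


def propagate_labels(cluster_ids, centre_labels):
    """Extend the label assigned to each cluster to every member of that cluster."""
    return [centre_labels.get(c) for c in cluster_ids]
```

Attributes marked as noise receive no label here and would be left for the classifier or for manual review.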
4. Data classification algorithm training and optimization
(1) Training classification algorithm
A supervised classification model (such as an SVM or a random forest) is trained from the key attribute features extracted in step 2 and the labels assigned in step 3.
(2) Data classification result verification and algorithm optimization
The feature vectors of the remaining attributes are extracted as in step 2 and classified with the trained classification model. The classification results are spot-checked by random sampling; the misclassified data is given a higher weight and added to the training set, and the classification algorithm is retrained. This is iterated step by step, and the trained classification algorithm is finally output.
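The feedback loop can be sketched as follows. To keep the example self-contained, a weighted nearest-centroid classifier stands in for the SVM or random forest mentioned above; this substitution, the `oracle` callback representing the manual spot check, and the `rounds`, `check`, and `boost` parameters are all assumptions for illustration.

```python
import math
import random
from collections import defaultdict


def train_centroids(samples):
    """samples: list of (vector, label, weight). Returns label -> weighted centroid."""
    sums, totals = {}, defaultdict(float)
    for vec, label, w in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += w * x
        totals[label] += w
    return {lab: tuple(x / totals[lab] for x in acc) for lab, acc in sums.items()}


def predict(centroids, vec):
    """Assign the label of the nearest centroid."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], tuple(vec)))


def refine(train, unlabeled, oracle, rounds=3, check=5, boost=2.0, seed=0):
    """Spot-check predictions and retrain with misclassified samples up-weighted.

    oracle(vec) returns the true label; it is a hypothetical stand-in for
    the manual check of sampled classification results.
    """
    rng = random.Random(seed)
    train = list(train)
    model = train_centroids(train)
    for _ in range(rounds):
        for vec in rng.sample(unlabeled, min(check, len(unlabeled))):
            truth = oracle(vec)
            if predict(model, vec) != truth:
                train.append((vec, truth, boost))  # misclassified: add with higher weight
        model = train_centroids(train)
    return model
```

With a real classifier the same loop applies: the up-weighting would be passed through whatever per-sample weight mechanism the chosen library provides.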
The invention discloses a method for constructing field feature vectors in which the metadata of the fields is added to the data content and the feature vector is constructed in combination with industry prior knowledge. For numeric data, the statistical features are combined with a probability distribution histogram; for character data, features such as the character pattern distribution are added to the existing word vectors, together with matching against common regular expressions. Key fields are extracted according to the information entropy, then clustered and labeled to serve as a training set for quickly building an industry-specific classification algorithm.
The method was tested on several real business databases (backup copies) with a sample size of 1000, a key field threshold of 0.8, DBSCAN as the clustering algorithm, 512 buckets for numeric fields, and 512-dimensional string word vectors. It was applied to the databases of in-house software and of some commercial software, and the test results were within an acceptable range. The results show that the method can rapidly extract the key fields of all tables in a database, extract features suited to the characteristics of the industry, accumulate industry knowledge, and greatly reduce the manual labeling workload.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.
Claims (8)
1. A method for rapidly classifying massive database tables, characterized by comprising: calculating mutual information entropy to obtain the key attributes of each table; constructing feature vectors for the selected attributes from the metadata of the attribute field types and summaries of the data content; clustering the key attributes with a machine learning clustering algorithm; labeling the cluster centers to form a training set and training a classification algorithm; applying the trained classification algorithm to classify the remaining attributes; spot-checking the classification results by sampling and feeding the errors back to optimize the classification algorithm; and outputting the categories of all database table attribute fields.
2. The method for rapidly classifying massive database tables according to claim 1, wherein calculating mutual entropy to obtain key attributes of each table comprises:
according to the sampling data of the database, the dependency relationship among different fields is obtained by calculating the mutual information entropy among the fields of the database table, and the key field with the largest influence on other fields is selected according to the threshold value.
3. The method for rapidly classifying massive database tables according to claim 2, wherein the step of calculating mutual information entropy to obtain key attributes of each table specifically comprises the steps of:
random data sampling, comprising:
using the full data when the database table is small, and using random sampling without replacement when the table is large;
calculating the information entropy and the mutual information entropy of the fields, comprising:
the information entropy H(X) of a field X,
$$H(X) = -\sum_{x \in X} p(x) \log p(x)$$
where p(x) is the probability of the value x over the whole value range of X;
the mutual information entropy I(X, Y) of the field X and another field Y,
$$I(X, Y) = \sum_{x \in X} \sum_{y \in Y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}$$
where p(x, y) is the probability of the data pair <x, y> over the whole value range of <X, Y>;
calculating in turn the information entropy H(X) of every field and the mutual information entropy I(X, Y) of each field X with every other field Y to form a dependency graph among the fields, in which each node vi has a correlation strength A(vi) and a weight Wi;
A(vi) represents the strength of the correlation between the node vi and the other nodes in the dependency graph; Wi is the weight of the node vi relative to the other nodes;
and sorting the fields of the database table in descending order of weight Wi, selecting the set of fields whose cumulative weight exceeds a given threshold, and recording it as the key field set of the table.
4. The method for rapidly classifying massive database tables according to claim 1, wherein the step of calculating the feature vectors of the fields comprises: performing stratified sampling again according to the key attributes of the database table; and calculating the feature vector of each attribute field from the metadata of the data field, the statistical features of the data content, and the like, the calculation being divided into numeric feature extraction and character feature extraction.
5. The method for rapidly classifying massive database tables according to claim 4, wherein the numerical field feature extraction comprises:
(1) calculating the statistical features of the field;
(2) normalizing the data, comprising:
converting the raw data to a distribution with mean 0 and standard deviation 1 by using the z-score normalization method;
the z-score formula is: $X' = (X - \mathrm{avg}(X)) / \mathrm{std}(X)$, where X is the raw data, X' is the normalized data, avg(X) is the mean of X, and std(X) is the standard deviation of X;
(3) constructing a probability distribution histogram, comprising: first removing the outliers and selecting the data within $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$ for binning, where Q1 is the first quartile, Q3 is the third quartile, and IQR is the interquartile range, $\mathrm{IQR} = Q_3 - Q_1$; and calculating the normalized probability distribution histogram of the data according to the number of bins, wherein the number of bins determines the length of the feature vector;
(4) judging whether the data follow a common data distribution by using the Kolmogorov-Smirnov (KS) test;
(5) feature vector = statistical features + normalized probability distribution histogram + distribution function parameters.
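The numerical feature extraction of claim 5 can be sketched with the Python standard library alone. The sketch makes several assumptions that the claim does not state: the field has at least two values, the statistical features are the mean, standard deviation, minimum, and maximum, the only candidate distribution is the standard normal, and the last two vector slots hold the KS statistic and a normality flag. The cutoff 1.36/sqrt(n) is the usual large-sample 5% critical value and is only approximate here, because the mean and standard deviation are estimated from the same data. The names `normal_cdf` and `numeric_features` are hypothetical.

```python
import math
import statistics


def normal_cdf(x):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def numeric_features(values, bins=10):
    """Build an illustrative feature vector for one numeric field."""
    # (1) Basic statistical features.
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)

    # (2) z-score normalization; a constant column maps to all zeros.
    z = [(v - mean) / std for v in values] if std > 0 else [0.0] * len(values)

    # (3) Keep values inside [Q1 - 1.5*IQR, Q3 + 1.5*IQR], then bin them.
    q1, _, q3 = statistics.quantiles(z, n=4)
    iqr = q3 - q1
    kept = sorted(v for v in z if q1 - 1.5 * iqr <= v <= q3 + 1.5 * iqr)
    lo, hi = kept[0], kept[-1]
    width = (hi - lo) / bins or 1.0
    counts = [0] * bins
    for v in kept:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    histogram = [c / len(kept) for c in counts]

    # (4) One-sample KS statistic of the kept values against N(0, 1).
    n = len(kept)
    ks = max(
        max((i + 1) / n - normal_cdf(v), normal_cdf(v) - i / n)
        for i, v in enumerate(kept)
    )
    looks_normal = ks < 1.36 / math.sqrt(n)  # approximate 5% cutoff (assumption)

    # (5) Statistics + histogram + distribution-fit information.
    vector = [mean, std, min(values), max(values)] + histogram + [ks, float(looks_normal)]
    return {"vector": vector, "histogram": histogram, "n_kept": n, "ks": ks}
```

The vector length is fixed by `bins`, so fields from different tables produce vectors of equal length and can be compared directly.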
6. The method for rapidly classifying massive database tables according to claim 4, wherein the character field feature extraction comprises:
(1) reading the data of the attribute field and extracting the character length distribution features of the character strings;
(2) extracting the character pattern distribution features of the character strings;
(3) matching the extracted character patterns against preset regular expressions to determine whether the character strings conform to them;
(4) extracting character string features by using a natural language processing method;
(5) performing named entity recognition;
(6) combining the extracted features into the feature vector.
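Steps (1) to (3) of claim 6 can be sketched as follows; the natural language processing and named entity recognition steps are left out because they depend on an external model. The pattern alphabet (digits become `9`, letters become `A`, runs are collapsed), the entries of `PRESET_PATTERNS`, and the function names are illustrative assumptions, not part of the claim.

```python
import re
from collections import Counter

# Hypothetical preset expressions; a real deployment would maintain its own list.
PRESET_PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date_iso": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "digits_only": re.compile(r"\d+"),
}


def char_pattern(s):
    """Collapse a string into a shape: digit runs -> 9, letter runs -> A, other characters kept."""
    shape = []
    for ch in s:
        sym = "9" if ch.isdigit() else "A" if ch.isalpha() else ch
        if not shape or shape[-1] != sym:
            shape.append(sym)
    return "".join(shape)


def string_features(values, max_len=32):
    """Length distribution, dominant character pattern, and regex match rates."""
    n = len(values)

    # (1) Share of strings per length; lengths above max_len share the last slot.
    lengths = Counter(min(len(v), max_len) for v in values)
    length_hist = [lengths.get(i, 0) / n for i in range(max_len + 1)]

    # (2) Most frequent character pattern and the share of strings that have it.
    patterns = Counter(char_pattern(v) for v in values)
    top_pattern, top_count = patterns.most_common(1)[0]

    # (3) Share of strings that fully match each preset regular expression.
    regex_rates = {
        name: sum(bool(p.fullmatch(v)) for v in values) / n
        for name, p in PRESET_PATTERNS.items()
    }
    return {
        "length_hist": length_hist,
        "top_pattern": top_pattern,
        "top_pattern_share": top_count / n,
        "regex_rates": regex_rates,
    }
```

Concatenating `length_hist`, `top_pattern_share`, and the values of `regex_rates` gives a fixed-length numeric vector for step (6).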
7. The method for rapidly classifying massive database tables according to claim 1, wherein clustering the key attributes by using a machine learning clustering algorithm and labeling the cluster centers comprises: extracting the feature vectors of the key attributes according to step 2; clustering the feature vectors of the key attributes by using an unsupervised learning clustering algorithm; labeling the cluster centers; and extending the label of each cluster center to the other attributes in its cluster, thereby forming labeled training samples for training a machine learning algorithm.
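The cluster-then-label step of claim 7 can be sketched as follows. The claim does not name a clustering algorithm, so this sketch substitutes a small k-means with a deterministic farthest-point initialization; `center_labels` stands in for the labels a person would assign to each cluster center, and both function names are hypothetical.

```python
import math


def kmeans(vectors, k, iters=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [list(vectors[0])]
    while len(centers) < k:
        # Next center: the vector farthest from all centers chosen so far.
        farthest = max(vectors, key=lambda v: min(math.dist(v, c) for c in centers))
        centers.append(list(farthest))

    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign every vector to its nearest center.
        assign = [min(range(k), key=lambda j: math.dist(v, centers[j])) for v in vectors]
        # Move each center to the mean of its members.
        for j in range(k):
            members = [v for v, a in zip(vectors, assign) if a == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centers


def propagate_labels(assign, center_labels):
    """Extend the label given to each cluster center to every member of the cluster."""
    return [center_labels[a] for a in assign]
```

With this approach only `k` labeling decisions are needed, however many key fields are clustered, which is the labor saving the claim aims at.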
8. The method for rapidly classifying massive database tables according to claim 1, wherein forming a training set and training a classification algorithm comprises:
training a supervised learning classification model according to the extracted key attribute features and the assigned labels; classifying the other attributes by using their extracted feature vectors and the trained classification model; performing random spot checks on the classification results; increasing the weight of the misclassified data and adding it to the training set; retraining the classification algorithm; iterating step by step; and finally outputting the trained classification algorithm.
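The iterative loop of claim 8 can be sketched as follows. The claim does not fix a classifier, so this sketch uses a weighted nearest-centroid model as a stand-in; `oracle` is a hypothetical callable that plays the role of the manual spot check, and `boost` is an assumed weight for misclassified samples.

```python
import math
import random


def train_centroids(samples):
    """Fit a weighted mean vector per label from (vector, label, weight) samples."""
    sums, totals = {}, {}
    for vec, label, weight in samples:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += weight * x
        totals[label] = totals.get(label, 0.0) + weight
    return {label: [s / totals[label] for s in acc] for label, acc in sums.items()}


def predict(model, vec):
    """Return the label whose centroid is nearest to vec."""
    return min(model, key=lambda label: math.dist(vec, model[label]))


def iterate_training(train, unlabeled, oracle, rounds=3, sample_size=5, boost=2.0, seed=0):
    """Train, spot-check a random sample, re-add errors with higher weight, retrain."""
    rng = random.Random(seed)
    train = list(train)
    model = train_centroids(train)
    for _ in range(rounds):
        checked = rng.sample(unlabeled, min(sample_size, len(unlabeled)))
        errors = [(v, oracle(v)) for v in checked if predict(model, v) != oracle(v)]
        if not errors:
            break  # the spot check found nothing to correct
        train.extend((v, label, boost) for v, label in errors)
        model = train_centroids(train)
    return model
```

Any classifier that accepts per-sample weights could replace `train_centroids` and `predict` without changing the loop.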
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911357917.6A CN111104466B (en) | 2019-12-25 | 2019-12-25 | Method for quickly classifying massive database tables |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111104466A (en) | 2020-05-05 |
CN111104466B (en) | 2023-07-28 |
Family ID: 70425147
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911357917.6A Active CN111104466B (en) | 2019-12-25 | 2019-12-25 | Method for quickly classifying massive database tables |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111104466B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20060074953A1 (en) * | 2004-09-30 | 2006-04-06 | International Business Machines Corporation | Metadata management for a data abstraction model |
US20160188710A1 (en) * | 2014-12-29 | 2016-06-30 | Wipro Limited | METHOD AND SYSTEM FOR MIGRATING DATA TO NOT ONLY STRUCTURED QUERY LANGUAGE (NoSOL) DATABASE |
CN105844398A (en) * | 2016-03-22 | 2016-08-10 | 武汉大学 | PLM (product life-cycle management) database-based mining algorithm for DPIPP (distributed parameterized intelligent product platform) product families |
CN107103025A (en) * | 2017-01-05 | 2017-08-29 | 北京亚信智慧数据科技有限公司 | A kind of data processing method and data processing platform (DPP) |
CN107294993A (en) * | 2017-07-05 | 2017-10-24 | 重庆邮电大学 | A kind of WEB abnormal flow monitoring methods based on integrated study |
CN109408555A (en) * | 2018-09-19 | 2019-03-01 | 智器云南京信息科技有限公司 | Data type recognition methods and device, data storage method and device |
US20190311149A1 (en) * | 2018-04-08 | 2019-10-10 | Imperva, Inc. | Detecting attacks on databases based on transaction characteristics determined from analyzing database logs |
CN110377754A (en) * | 2019-07-01 | 2019-10-25 | 北京信息科技大学 | A kind of database body learning optimization method based on decision tree |
CN110427992A (en) * | 2019-07-23 | 2019-11-08 | 杭州城市大数据运营有限公司 | Data matching method, device, computer equipment and storage medium |
CN110597816A (en) * | 2019-09-17 | 2019-12-20 | 深圳追一科技有限公司 | Data processing method, data processing device, computer equipment and computer readable storage medium |
Non-Patent Citations (3)
Title |
---|
NILESHKUMAR D ET AL: "Proposed efficient approach for classification for multi-relational data mining using Bayesian Belief Network" * |
丁国辉 等: "基于DBSCAN聚类算法的多模式匹配", vol. 33, no. 33 * |
马军;宋玲;韩晓晖;闫泼;: "基于网页上下文的Deep Web数据库分类", vol. 19, no. 02 * |
Cited By (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111860575B (en) * | 2020-06-05 | 2023-06-16 | 百度在线网络技术(北京)有限公司 | Method and device for processing object attribute information, electronic equipment and storage medium |
CN111860575A (en) * | 2020-06-05 | 2020-10-30 | 百度在线网络技术(北京)有限公司 | Method and device for processing article attribute information, electronic equipment and storage medium |
CN111913954A (en) * | 2020-06-20 | 2020-11-10 | 杭州城市大数据运营有限公司 | Intelligent data standard catalog generation method and device |
CN111913954B (en) * | 2020-06-20 | 2023-08-04 | 杭州城市大数据运营有限公司 | Intelligent data standard catalog generation method and device |
CN113761297A (en) * | 2020-11-10 | 2021-12-07 | 北京沃东天骏信息技术有限公司 | Method and device for determining field relevancy in database table |
CN112434032B (en) * | 2020-11-17 | 2024-04-05 | 北京融七牛信息技术有限公司 | Automatic feature generation system and method |
CN112434032A (en) * | 2020-11-17 | 2021-03-02 | 北京融七牛信息技术有限公司 | Automatic feature generation system and method |
CN112380205A (en) * | 2020-11-17 | 2021-02-19 | 北京融七牛信息技术有限公司 | Method and system for automatically generating characteristics of distributed architecture |
CN112380205B (en) * | 2020-11-17 | 2024-04-02 | 北京融七牛信息技术有限公司 | Automatic feature generation method and system of distributed architecture |
CN112380348B (en) * | 2020-11-25 | 2024-03-26 | 中信百信银行股份有限公司 | Metadata processing method, apparatus, electronic device and computer readable storage medium |
CN112380348A (en) * | 2020-11-25 | 2021-02-19 | 中信百信银行股份有限公司 | Metadata processing method and device, electronic equipment and computer-readable storage medium |
CN112530597A (en) * | 2020-11-26 | 2021-03-19 | 山东健康医疗大数据有限公司 | Data table classification method, device and medium based on Bert character model |
CN112614005A (en) * | 2020-11-30 | 2021-04-06 | 国网北京市电力公司 | Enterprise rework state processing method and device |
CN112614005B (en) * | 2020-11-30 | 2024-04-30 | 国网北京市电力公司 | Method and device for processing reworking state of enterprise |
CN113094567A (en) * | 2021-03-31 | 2021-07-09 | 四川新网银行股份有限公司 | Malicious complaint identification method and system based on text clustering |
CN113435199A (en) * | 2021-07-18 | 2021-09-24 | 谢勇 | Storage and reading interference method and system for character corresponding culture |
CN114528288A (en) * | 2021-08-31 | 2022-05-24 | 天津工业大学 | Design method of multi-type organ chip database |
US11720533B2 (en) | 2021-11-29 | 2023-08-08 | International Business Machines Corporation | Automated classification of data types for databases |
WO2023098034A1 (en) * | 2021-11-30 | 2023-06-08 | 深圳前海微众银行股份有限公司 | Business data report classification method and apparatus |
CN115168345A (en) * | 2022-06-27 | 2022-10-11 | 天翼爱音乐文化科技有限公司 | Database classification method, system, device and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111104466B (en) | 2023-07-28 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN111104466B (en) | Method for quickly classifying massive database tables | |
CN107633007B (en) | Commodity comment data tagging system and method based on hierarchical AP clustering | |
CN109165294B (en) | Short text classification method based on Bayesian classification | |
WO2017166912A1 (en) | Method and device for extracting core words from commodity short text | |
CN110826320B (en) | Sensitive data discovery method and system based on text recognition | |
US10089581B2 (en) | Data driven classification and data quality checking system | |
CN110825877A (en) | Semantic similarity analysis method based on text clustering | |
CN112632228A (en) | Text mining-based auxiliary bid evaluation method and system | |
CN108519971B (en) | Cross-language news topic similarity comparison method based on parallel corpus | |
CN104834651B (en) | Method and device for providing high-frequency question answers | |
CN111158641B (en) | Automatic recognition method for transaction function points based on semantic analysis and text mining | |
US10083403B2 (en) | Data driven classification and data quality checking method | |
CN113360647B (en) | 5G mobile service complaint source-tracing analysis method based on clustering | |
CN110928981A (en) | Method, system and storage medium for establishing and perfecting iteration of text label system | |
CN112836509A (en) | Expert system knowledge base construction method and system | |
CN112580332B (en) | Enterprise portrait method based on label layering and deepening modeling | |
CN110222192A (en) | Corpus method for building up and device | |
CN112395881A (en) | Material label construction method and device, readable storage medium and electronic equipment | |
CN111753067A (en) | Innovative assessment method, device and equipment for technical background text | |
CN114722810A (en) | Real estate customer portrait method and system based on information extraction and multi-attribute decision | |
CN111626331B (en) | Automatic industry classification device and working method thereof | |
CN112989053A (en) | Periodical recommendation method and device | |
CN107609921A (en) | A kind of data processing method and server | |
CN114511027B (en) | Method for extracting English remote data through big data network | |
CN114049165B (en) | Commodity price comparison method, device, equipment and medium for purchasing system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20210916. Address after: 100854 east gate, 52 Yongding Road, Haidian District, Beijing. Applicant after: China Changfeng electromechanical technology research and Design Institute. Address before: 100854 east gate, 52 Yongding Road, Haidian District, Beijing. Applicant before: Aerospace Science and Technology Network Information Development Co.,Ltd. |
| | GR01 | Patent grant | |