CN110427992A - Data matching method, device, computer equipment and storage medium - Google Patents

Data matching method, device, computer equipment and storage medium

Info

Publication number
CN110427992A
Authority
CN
China
Prior art keywords
column
data
sample
label
type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910664541.7A
Other languages
Chinese (zh)
Inventor
姜琳
孟庆丰
李敏
袁晓晓
吴林强
许琮浩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Zhongyun Data Technology Co Ltd
Huzhou Big Data Operation Co Ltd
Hangzhou City Big Data Operation Co Ltd
Original Assignee
Hangzhou Zhongyun Data Technology Co Ltd
Huzhou Big Data Operation Co Ltd
Hangzhou City Big Data Operation Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Zhongyun Data Technology Co Ltd, Huzhou Big Data Operation Co Ltd, Hangzhou City Big Data Operation Co Ltd filed Critical Hangzhou Zhongyun Data Technology Co Ltd
Priority to CN201910664541.7A priority Critical patent/CN110427992A/en
Publication of CN110427992A publication Critical patent/CN110427992A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention is applicable to the field of computer technology and provides a data matching method, device, computer equipment and storage medium. The method comprises: obtaining data tables; performing code table code value matching on each data column; performing regular-expression identification on each data column; determining the column type of each data column; extracting the column feature vector of each column, the column feature vector including statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the column; identifying the column feature vector of each column to determine the column label of each column; and matching the columns based on their labels. In the data matching method provided by the embodiments of the present invention, after preprocessing with code table code values and regular-expression identification, the column feature vector of each column is extracted using a preset feature extraction model. Compared with existing methods, the extracted column feature vector captures the features of the data in multiple dimensions as fully as possible with a smaller data volume, effectively reducing the amount of computation while maintaining accuracy.

Description

Data matching method, device, computer equipment and storage medium
Technical field
The invention belongs to the field of computer technology, and in particular relates to a data matching method, device, computer equipment and storage medium.
Background art
In the course of delivering government services, a large amount of government data is usually generated. Although these data belong to different government services, they often contain large amounts of repeated data of similar types. Therefore, when processing government data, it is usually necessary to integrate the similar-type data generated by different government services, using data identification to find correlated data across databases.
There are many existing methods for finding correlated data across databases, and different methods perform differently. For example, manual data matching has relatively high accuracy, but the workload increases sharply as the databases grow, so it is clearly unsuitable for data matching on large databases. There are two main methods of program-based data matching. One uses the field descriptions present in the database and finds similar data by fuzzy search, but this method easily suffers from a low matching rate when field descriptions are missing. The other matches on the data content in the database; different matching methods must be used for different types of data content, so the amount of computation is large and the computation is slow.
It can be seen that existing data identification technology, particularly when matching government data with large data volumes, still suffers from the technical problems of a large amount of computation and inaccurate results.
Summary of the invention
The purpose of the embodiments of the present invention is to provide a data matching method, device, computer equipment and storage medium, aiming to solve the technical problems of a large amount of computation and inaccurate results in existing data identification technology.
The embodiments of the present invention are implemented as follows. A data matching method comprises:
obtaining multiple data tables to be matched, each data table containing column name information and/or column annotation information of multiple data columns, and the column data of each data column;
matching the column data of each data column using code table code values;
identifying, using regular expressions, the parts of each data column that satisfy preset matching rules;
determining the column type of each data column from its column data using preset rules, the column types including a numeric type and a text type;
extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data; and the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
identifying the column feature vector of each data column using a pre-trained data identification model corresponding to the column type of that data column, and determining the column label of each data column; and
matching the data columns based on the column label of each data column.
Another purpose of the embodiments of the present invention is to provide a data matching device, comprising:
a data table acquiring unit, configured to obtain multiple data tables to be matched, each data table containing column name information and/or column annotation information of multiple data columns, and the column data of each data column;
a code table code value matching unit, configured to match the column data of each data column using code table code values;
a regular-expression identification unit, configured to identify, using regular expressions, the parts of each data column that satisfy preset matching rules;
a column type determining unit, configured to determine the column type of each data column from its column data using preset rules, the column types including a numeric type and a text type;
a column feature vector extraction unit, configured to extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data; and the basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
a column label determining unit, configured to identify the column feature vector of each data column using a pre-trained data identification model corresponding to the column type of that data column, and to determine the column label of each data column; and
a data matching unit, configured to match the data columns based on the column label of each data column.
Another purpose of the embodiments of the present invention is to provide computer equipment, including a memory and a processor, wherein a computer program is stored in the memory and, when executed by the processor, causes the processor to perform the steps of the data matching method described above.
Another purpose of the embodiments of the present invention is to provide a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, causes the processor to perform the steps of the data matching method described above.
In the data matching method provided by the embodiments of the present invention, after multiple data tables to be matched are obtained, code table code value matching and regular-expression identification are performed on each data column, and the column type of each data column is then determined from its column data. A preset feature extraction model then extracts the column feature vector of each data column according to the column name information and/or column annotation information, the column data, and the column type of each data column. The column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data, and the basic attribute features include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules. Next, based on the column type of each data column, a pre-trained data identification model corresponding to that column type identifies the column feature vector and determines the column label of each data column, and finally the data columns are matched based on their column labels. The method makes full use of the column name information and/or column annotation information and the statistical information of the column data of each column, and combines them with conventional basic attribute features of the column, such as usage frequency and the importance of the data marked in advance, to determine the column feature vector. Compared with existing data matching methods, the feature vector extracted from this information captures the features of each column's data in multiple dimensions as fully as possible with a smaller data volume, so the amount of computation is greatly reduced. The column feature vector of each column is subsequently identified by the pre-trained data identification model corresponding to the data type of the column data; because the data identification model is trained on a large number of data samples, the finally determined data labels are more accurate. Compared with existing data identification methods, the amount of computation is greatly reduced while accuracy is maintained, and the efficiency of data matching is greatly improved, especially for government data with large data volumes.
Brief description of the drawings
Fig. 1 is a flow chart of the steps of a data matching method provided by an embodiment of the present invention;
Fig. 2 is a flow chart of the steps of another data matching method provided by an embodiment of the present invention;
Fig. 3 is a flow chart of the steps of yet another data matching method provided by an embodiment of the present invention;
Fig. 4 is a flow chart of the steps of a method, provided by an embodiment of the present invention, for identifying the column feature vector of each column based on the data type of the column data;
Fig. 5 is a flow chart of the steps of a method, provided by an embodiment of the present invention, for training the numeric data identification model based on the random forest algorithm;
Fig. 6 is a structural schematic diagram of a data matching device provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of another data matching device provided by an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of yet another data matching device provided by an embodiment of the present invention.
Detailed description of the embodiments
In order to make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.
As shown in Fig. 1, in one embodiment, a data matching method is proposed, which specifically includes the following steps:
Step S102: obtain multiple data tables to be matched.
In the embodiments of the present invention, the data tables to be matched may come from different databases, such as the common Oracle, SQL, Alibaba Cloud (Aliyun) and Hadoop. Data can be obtained by entering the data path, and the formats of the data obtained from the different databases are unified.
In the embodiments of the present invention, each data table contains column name information and/or column annotation information of multiple data columns, and the column data of each data column.
Step S104: match the column data of each data column using code table code values.
In the embodiments of the present invention, for certain special symbols present in a data table, such as currency symbols, the code table of code values can be used to match such special characters, making it easy to determine the business type of the data column.
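The code table matching of step S104 can be illustrated with a minimal sketch. The patent does not specify the structure of the code table, so the `CODE_TABLE` dictionary, the `match_code_table` function and the `min_ratio` threshold below are all hypothetical names and values chosen only for illustration.

```python
# Hypothetical code table: business type -> set of code values that identify it.
CODE_TABLE = {
    "currency": {"¥", "$", "€", "CNY", "USD"},
    "gender": {"M", "F", "男", "女"},
}

def match_code_table(values, code_table=CODE_TABLE, min_ratio=0.8):
    """Return the business type whose code values cover at least
    min_ratio of the non-empty cells, or None if no type qualifies."""
    cells = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cells:
        return None
    best_type, best_ratio = None, 0.0
    for business_type, codes in code_table.items():
        ratio = sum(1 for c in cells if c in codes) / len(cells)
        if ratio >= min_ratio and ratio > best_ratio:
            best_type, best_ratio = business_type, ratio
    return best_type
```

Under these assumptions, a column whose cells are almost all drawn from one code set is assigned that business type, and a column with free-form content is left unassigned.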
Step S106: identify, using regular expressions, the parts of each data column that satisfy preset matching rules.
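As a rough illustration of step S106, the sketch below measures what fraction of a column's cells fully match each preset rule. The rule names and patterns in `PATTERNS` are assumptions made for this example; the patent does not list its actual matching rules.

```python
import re

# Hypothetical preset matching rules; the real rules are not given in the source.
PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "mobile_phone": re.compile(r"1\d{10}"),
    "id_number": re.compile(r"\d{17}[\dXx]"),
}

def regex_identify(values, patterns=PATTERNS):
    """Return {rule name: fraction of non-empty cells that fully match the rule}."""
    cells = [str(v).strip() for v in values if v is not None and str(v).strip()]
    result = {}
    for name, pattern in patterns.items():
        hits = sum(1 for c in cells if pattern.fullmatch(c))
        result[name] = hits / len(cells) if cells else 0.0
    return result
```

A column could then be treated as satisfying a rule when its fraction exceeds some threshold, which is again a design choice left open here.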
Step S108: determine the column type of each data column from its column data using preset rules.
In the embodiments of the present invention, the column data type of each column can be identified from the column data of each data column; the column data types include a text type and a numeric type.
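One simple way such a preset rule could work is sketched below: a column is treated as numeric when most of its non-empty cells parse as numbers. The function name `infer_column_type` and the `numeric_ratio` threshold are assumptions for illustration, not the rule used in the patent.

```python
def infer_column_type(values, numeric_ratio=0.9):
    """Classify a column as "numeric" or "text" by the share of cells
    that parse as numbers (thousands separators are tolerated)."""
    cells = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not cells:
        return "text"
    numeric = 0
    for cell in cells:
        try:
            float(cell.replace(",", ""))
            numeric += 1
        except ValueError:
            pass
    return "numeric" if numeric / len(cells) >= numeric_ratio else "text"
```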
Step S110: extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column.
In the embodiments of the present invention, the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column. The statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data. The basic attribute features of the data column include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules.
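The statistical part of the column feature vector can be computed with the Python standard library alone, as the sketch below shows. It assumes population (not sample) moments, quartiles as the quantiles, excess kurtosis, and entropy in bits over the distinct values; the patent does not state which definitions it uses, and the function name is hypothetical.

```python
import math
import statistics
from collections import Counter

def statistical_features(values):
    """Compute the statistical features named in the text for a numeric
    column with at least two values."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)
    std = math.sqrt(variance)
    q1, median, q3 = statistics.quantiles(values, n=4)
    # Third and fourth central moments for skewness and kurtosis.
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "range": (min(values), max(values)),
        "mean": mean,
        "variance": variance,
        "quantiles": (q1, median, q3),
        "cv": std / mean if mean else 0.0,
        "skewness": m3 / std ** 3 if std else 0.0,
        "kurtosis": m4 / variance ** 2 - 3.0 if variance else 0.0,
        "entropy": entropy,
    }
```

Because each column is reduced to a fixed handful of numbers regardless of its row count, downstream models work on a small vector instead of the raw column, which is the source of the reduced computation the text describes.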
Step S112: identify the column feature vector of each data column using a pre-trained data identification model corresponding to the column type of that data column, and determine the column label of each data column.
In the embodiments of the present invention, the column labels that the data identification model determines by identifying the column feature vector of each column are set in advance according to actual needs; for example, the labels may be population, area, GDP and so on.
Step S114: match the data columns based on the column label of each data column.
In the embodiments of the present invention, columns with the same label describe content that can be matched. For example, if the label of column A is population and the label of column B is also population, then columns A and B are likely to be population data for different regions, and column A and column B can be combined.
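Label-based matching amounts to grouping columns that share a label. The sketch below assumes columns are identified by hypothetical `(table, column)` pairs; the table and column names in the usage note are invented for illustration.

```python
from collections import defaultdict

def match_columns_by_label(column_labels):
    """column_labels maps (table, column) -> label.
    Return {label: [columns]} for every label shared by two or more columns."""
    groups = defaultdict(list)
    for column, label in column_labels.items():
        groups[label].append(column)
    return {label: cols for label, cols in groups.items() if len(cols) > 1}
```

For instance, if `("table_a", "pop")` and `("table_b", "rk")` are both labeled `population`, they are returned together as candidates for combination, while a column with a unique label is left out.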
In the data matching method provided by the embodiments of the present invention, after multiple data tables to be matched are obtained, code table code value matching and regular-expression identification are performed on each data column, and the column type of each data column is then determined from its column data. A preset feature extraction model then extracts the column feature vector of each data column according to the column name information and/or column annotation information, the column data, and the column type of each data column. The column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness and information entropy of the column data, and the basic attribute features include the usage frequency of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules. Next, based on the column type of each data column, a pre-trained data identification model corresponding to that column type identifies the column feature vector and determines the column label of each data column, and finally the data columns are matched based on their column labels. The method makes full use of the column name information and/or column annotation information and the statistical information of the column data of each column, and combines them with conventional basic attribute features of the column, such as usage frequency and the importance of the data marked in advance, to determine the column feature vector. Compared with existing data matching methods, the feature vector extracted from this information captures the features of each column's data in multiple dimensions as fully as possible with a smaller data volume, so the amount of computation is greatly reduced. The column feature vector of each column is subsequently identified by the pre-trained data identification model corresponding to the data type of the column data; because the data identification model is trained on a large number of data samples, the finally determined data labels are more accurate. Compared with existing data identification methods, the amount of computation is greatly reduced while accuracy is maintained, and the efficiency of data matching is greatly improved, especially for government data with large data volumes.
As shown in Fig. 2, in one embodiment, another data matching method is proposed. It differs from the data matching method shown in Fig. 1 in that, before step S110, it further includes:
Step S202: preprocess the column data of each data column based on a preset data preprocessing model.
In the embodiments of the present invention, the preprocessing includes completing missing data and extracting important data.
In the embodiments of the present invention, considering that data may be missing or erroneous when a database is built, which in serious cases affects the final matching accuracy, a preset data preprocessing model can clean the data in terms of both data quality and content, for example by completing missing data, correcting outlying values, deleting data with format errors, smoothing the column data, and extracting important data. This reduces the impact of poor data quality on the identification result and improves the overall identification accuracy.
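Two of the cleaning operations mentioned above, completing missing data and correcting outlying values, might look like the following sketch. Filling with the median and clipping to `k` standard deviations are assumptions chosen for illustration; the patent does not describe how its preprocessing model performs these operations.

```python
import statistics

def preprocess_numeric_column(values, k=3.0):
    """Fill missing cells (None) with the median of the present cells,
    then clip values lying more than k standard deviations from the mean."""
    present = [v for v in values if v is not None]
    if not present:
        return []
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]
    mean = statistics.fmean(filled)
    std = statistics.pstdev(filled)
    low, high = mean - k * std, mean + k * std
    return [min(max(v, low), high) for v in filled]
```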
Compared with the data matching method provided in Fig. 1, the other data matching method provided by the embodiments of the present invention preprocesses the column data of each column with a preset data preprocessing model before the feature vectors are extracted, which effectively improves data quality, reduces the impact of poor data quality on the identification result, and improves the accuracy of data matching.
As shown in Fig. 3, in one embodiment, yet another data matching method is proposed. It differs from the data matching method shown in Fig. 1 in that, before step S114, it further includes:
Step S302: extract, according to preset rules, at least one column label determined by the data identification model.
In the embodiments of the present invention, the label results are displayed using visualization technology, which makes it convenient for business personnel to check the determined column labels.
Step S304: judge whether the column label determined by identification is accurate. When the column label is judged to be inaccurate, execute step S306; when the column label is judged to be accurate, execute other steps.
In the embodiments of the present invention, when the column label determined by the data identification model is accurate, the classification accuracy of the data identification model is high, and the columns can be matched directly based on the label of each column. The other steps to be executed are generally the matching of the columns based on the label of each column.
Step S306: modify the column label, and optimize the data identification model according to the modified column label.
In the embodiments of the present invention, when the column label determined by the data identification model is inaccurate, the data identification model has not yet been fully optimized and still contains some error, so it needs to be optimized. By correcting the column label, the data identification model can be optimized through feedback, further improving its accuracy.
Compared with the data matching method provided in Fig. 1, the further data matching method provided by the embodiments of the present invention displays the labels using visualization technology after label identification, to help business personnel check the label results, and further judges whether the identified labels are accurate. When a label is judged to be inaccurate, the data identification model can be optimized through feedback, further improving its accuracy.
As shown in Fig. 4, in one embodiment, a method is proposed for identifying the column feature vector of each column based on the data type of the column data, which specifically includes the following steps:
Step S402: identify the column feature vectors of columns whose data type is numeric using a numeric data identification model trained in advance based on the random forest algorithm, and determine the column labels.
In the embodiments of the present invention, the random forest algorithm is a method that uses multiple decision trees to train on and predict samples. Each decision tree contains multiple split nodes and can predict and output a label from the feature vector of a sample; the label predicted by the largest number of decision trees in the random forest is taken as the final label.
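The majority vote that turns the individual tree predictions into a final label can be sketched as follows. The three `tree_by_*` functions are hypothetical stand-ins for trained decision trees, each reduced to a single split with invented thresholds; only the voting logic in `forest_predict` reflects the description above.

```python
from collections import Counter

def forest_predict(trees, feature_vector):
    """Let every tree predict a label, then return the most common label.
    On a tie, the label that was predicted first wins."""
    votes = Counter(tree(feature_vector) for tree in trees)
    return votes.most_common(1)[0][0]

# Hypothetical stand-ins for trained decision trees (one split each).
def tree_by_mean(f):
    return "population" if f["mean"] > 10_000 else "area"

def tree_by_cv(f):
    return "population" if f["cv"] > 0.5 else "area"

def tree_by_max(f):
    return "population" if f["max"] > 1_000_000 else "area"
```

With a feature vector such as `{"mean": 50_000, "cv": 0.3, "max": 2_000_000}`, two of the three stand-in trees vote for `population`, so that label is returned.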
In the embodiments of the present invention, for the steps of training the numeric data identification model based on the random forest algorithm, please refer to Fig. 5 and its explanatory paragraphs.
Step S404: identify the column feature vectors of columns whose data type is text using a text data identification model trained in advance based on the naive Bayes algorithm, and determine the column labels.
In the embodiments of the present invention, the naive Bayes algorithm is an algorithm based on Bayes' theorem. It classifies well when the data are relatively independent of one another and is commonly used for text data classification.
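A minimal multinomial naive Bayes classifier over tokens, such as words from a column name or column annotation, is sketched below. The class name, the token-list input format and the use of Laplace (add-one) smoothing are assumptions for illustration; the patent only names the algorithm.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Multinomial naive Bayes with Laplace smoothing over token lists."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(label) + sum of log P(token | label), tokens assumed independent
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for token in tokens:
                score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The independence assumption appears in the inner loop, where each token contributes its own log-probability term without regard to the other tokens.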
As shown in Fig. 5, in one embodiment, a method is proposed for training the numeric data identification model based on the random forest algorithm, which specifically includes the following steps:
Step S502: obtain multiple data sample tables.
In the embodiments of the present invention, each data sample table contains multiple items of sample column name information and/or sample column annotation information, and the sample column data of each sample column.
Step S504: obtain the target label of each sample column.
In the embodiments of the present invention, the target labels of the sample columns are known in advance.
Step S506: determine the sample column data type of each sample column by regular-expression identification based on the sample column data of each sample column.
In the embodiments of the present invention, the process of training with data samples needs to remain exactly the same as the process of identifying data. Therefore, the regular-expression identification used in step S506 is identical to the regular-expression identification used in the foregoing step S106.
Step S508: extract the sample column feature vector of each sample column using the preset feature extraction model, according to the sample column name information and/or sample column annotation information of each sample column, the sample column data of each sample column, and the sample column data type of each sample column.
In the embodiments of the present invention, the sample column feature vector includes statistical features of the sample column data, descriptive features of the sample column name and/or sample column annotation information, and basic attribute features of the sample column. The statistical features of the sample column data include the value range, mean and variance of the sample column data. The basic attribute features of the sample column include the usage frequency of the sample column data, the data type of the sample column data, and the importance of the sample column data determined in advance according to preset rules.
In the embodiments of the present invention, likewise, the process of training with data samples needs to remain exactly the same as the process of identifying data. Therefore, the feature extraction model used in step S508 is identical to the feature extraction model used in the foregoing step S110.
Step S510: establish, based on the random forest algorithm, an initialized numeric data identification model containing variable parameters.
Step S512: determine the response label of each sample column according to the sample column feature vector of each sample column and the numeric data identification model.
In the embodiments of the present invention, the numeric data identification model can be understood as a functional relationship between the independent variable (the column feature vector) and the dependent variable (the label): inputting the column feature vector into the function determines the label.
Step S514: judge whether the response labels and the target labels satisfy a preset training success condition. When the response labels and the target labels are judged not to satisfy the preset training success condition, execute step S516; when they are judged to satisfy it, execute step S518.
In the embodiments of the present invention, the preset training success condition may be that the number of training iterations reaches a preset value, or that the difference between the response labels and the target labels is smaller than a certain threshold.
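The success check of step S514 can be expressed as a small predicate. Measuring the difference between response labels and target labels as a mismatch rate, and the default values of `max_iterations` and `max_error_rate`, are assumptions made for this sketch; the patent leaves both the measure and the thresholds open.

```python
def training_succeeded(response_labels, target_labels, iteration,
                       max_iterations=100, max_error_rate=0.05):
    """Return True when either stopping condition holds: the iteration
    count has reached its preset value, or the share of sample columns
    whose response label differs from the target label is small enough."""
    mismatches = sum(r != t for r, t in zip(response_labels, target_labels))
    error_rate = mismatches / len(target_labels)
    return iteration >= max_iterations or error_rate < max_error_rate
```

A training loop would call this after each round, adjusting the model's variable parameters and predicting again whenever it returns `False`.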
Step S516: adjust the variable parameters in the numeric data identification model, and return to step S510.
Step S518: determine the current numeric data identification model as the numeric data identification model trained based on the random forest algorithm.
In the embodiments of the present invention, when the response labels and the target labels are judged to satisfy the preset training success condition, the numeric data identification model can be regarded as preliminarily complete and can output highly accurate column labels from column feature vectors.
As shown in Fig. 6, in one embodiment, a data matching device is provided, detailed as follows.
In embodiments of the present invention, the data matching device includes:
A data table acquiring unit 610, configured to acquire multiple data tables to be matched.
In embodiments of the present invention, the data tables to be matched may come from different databases, such as the common Oracle, SQL, Alibaba Cloud, and Hadoop; data can be acquired by inputting a data path, and the formats of the data obtained from the different databases are unified.
In embodiments of the present invention, each data table contains the column name information and/or column annotation information of multiple data columns, together with the column data of each data column.
A code table code value matching unit 620, configured to match the column data of each data column using code table code values.
In embodiments of the present invention, some special symbols present in a data table, such as currency symbols, can be matched using the code value code table, so that the business type of the data column can be determined conveniently.
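As a rough illustration of code table matching, the sketch below looks up special symbols in a small table and reports the business type that most cells in a column point to. The table contents, the type names, and the function `match_business_type` are hypothetical; the source does not specify the code table.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical code table: special symbol -> business type.
CODE_TABLE = {"$": "currency", "¥": "currency", "€": "currency", "%": "percentage"}


def match_business_type(values: Iterable[str]) -> Optional[str]:
    """Return the business type suggested by most cells, or None if no symbol matches."""
    votes: Counter = Counter()
    for cell in values:
        for symbol, business_type in CODE_TABLE.items():
            if symbol in cell:
                votes[business_type] += 1
                break  # count each cell at most once
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```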
A regular expression recognition unit 630, configured to identify, using regular expressions, the parts of each data column that satisfy preset matching rules.
A column type determining unit 640, configured to identify and determine the column type of each data column from the column data of each data column using preset rules.
In embodiments of the present invention, the column data type of each column can be identified from the column data of each data column; the column data types include a text type and a numeric type.
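One possible way to decide between the numeric type and the text type is sketched below. The pattern, the 90% ratio, and the name `infer_column_type` are assumptions made for illustration; the source only states that preset rules and regular expressions are used.

```python
import re
from typing import Iterable, Optional

# Hypothetical rule: optional sign, digits with optional thousands separators,
# optional decimal part, optional trailing percent sign.
NUMERIC_PATTERN = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?%?")


def infer_column_type(values: Iterable[Optional[str]], min_ratio: float = 0.9) -> str:
    """Return "numeric" if enough non-empty cells look like numbers, else "text"."""
    cells = [v.strip() for v in values if v is not None and v.strip()]
    if not cells:
        return "text"  # an empty column carries no numeric evidence
    numeric = sum(1 for cell in cells if NUMERIC_PATTERN.fullmatch(cell))
    return "numeric" if numeric / len(cells) >= min_ratio else "text"
```

A ratio below 1.0 tolerates a few malformed cells, which is one way to keep a mostly numeric column from being classified as text.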
A column feature vector extraction unit 650, configured to extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column.
In embodiments of the present invention, the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column. The statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data; the basic attribute features of the data column include the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules.
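The statistical features listed above can be computed with the Python standard library, as in the following sketch. The function name `column_statistics`, the use of population moments, quartiles as the quantiles, and excess kurtosis are choices made here for illustration; the source does not fix these definitions.

```python
import math
import statistics
from collections import Counter
from typing import Dict, List


def column_statistics(values: List[float]) -> Dict[str, float]:
    """Statistical features of a numeric column (needs at least two values)."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values, mu=mean)  # population variance
    std = math.sqrt(variance)
    q1, median, q3 = statistics.quantiles(values, n=4)  # quartiles
    # Standardised central moments; set to 0.0 for a constant column.
    if std > 0:
        skewness = sum(((v - mean) / std) ** 3 for v in values) / n
        kurtosis = sum(((v - mean) / std) ** 4 for v in values) / n - 3.0
    else:
        skewness = kurtosis = 0.0
    # Shannon entropy (in bits) of the empirical value distribution.
    counts = Counter(values)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean,
        "variance": variance,
        "q1": q1,
        "median": median,
        "q3": q3,
        "coefficient_of_variation": std / mean if mean != 0 else math.nan,
        "skewness": skewness,
        "kurtosis": kurtosis,
        "entropy": entropy,
    }
```

The descriptive features of the column name and the basic attribute features would be appended to these values to form the full column feature vector.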
A column label determination unit 660, configured to identify, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and to determine the column label of each data column.
In embodiments of the present invention, the column labels that the data identification model assigns to each column from its column feature vector are set in advance according to actual needs; for example, the labels may be population, area, GDP, and so on.
A data matching unit 670, configured to match the data columns based on the column label of each data column.
In embodiments of the present invention, columns with the same label describe content that can be matched. For example, if the label of column A is population and the label of column B is also population, columns A and B are likely to be population data for different regions, and column A can be merged with column B.
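Matching by label amounts to grouping columns that share a label. The sketch below assumes each column is identified by a hypothetical `(table name, column name)` pair; the function name `match_columns_by_label` is likewise an assumption.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ColumnRef = Tuple[str, str]  # (table name, column name), hypothetical identifier


def match_columns_by_label(labels: Dict[ColumnRef, str]) -> Dict[str, List[ColumnRef]]:
    """Group columns by label and keep only labels shared by two or more columns."""
    groups: Dict[str, List[ColumnRef]] = defaultdict(list)
    for column, label in labels.items():
        groups[label].append(column)
    return {label: sorted(cols) for label, cols in groups.items() if len(cols) > 1}
```

For instance, a `population` column in a Hangzhou table and one in a Huzhou table would fall into the same group and become candidates for merging.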
The data matching device provided in the embodiment of the present invention, after acquiring multiple data tables to be matched, performs code table code value matching and regular expression recognition on each data column, and then determines the column type of each data column from its column data. According to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, it extracts the column feature vector of each data column using a preset feature extraction model. The column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column; the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data, and the basic attribute features of the data column include the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules. Then, based on the column type of each data column, the device identifies the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, determines the column label of each data column, and finally matches the data columns based on their column labels.
The data matching method provided in the embodiment of the present invention makes full use of the column name information and/or column annotation information of each column and the statistical information of the column data, combined with conventional basic attribute features of the column, such as the frequency of use and the importance of the data marked in advance, to determine the column feature vector. Compared with existing data matching methods, the feature vector extracted from the above information captures the features of each column's data in multiple dimensions with a smaller data volume, so that the amount of computation is greatly reduced. Subsequently, based on the data type of the column data, the column feature vector of each column is identified using the pre-trained data identification model corresponding to that data type; because the data identification model is generated by training on a large number of data samples, the finally determined data labels are more accurate. Compared with existing data identification methods, the amount of computation is greatly reduced while accuracy is maintained, and the efficiency of data matching is greatly improved, especially for government data, whose volume is large.
As shown in Fig. 7, in one embodiment, another data matching device is provided, which differs from the data matching device shown in Fig. 6 in that it further includes:
A data preprocessing unit 710, configured to preprocess the column data of each data column based on a preset data preprocessing model.
In embodiments of the present invention, the preprocessing includes the completion of missing data and the extraction of important data.
In embodiments of the present invention, data may be missing or erroneous when a database is built, which in serious cases affects the final matching accuracy. Therefore, a preset data preprocessing model can clean the data in terms of both data quality and content, for example by completing missing data, correcting outlier values, deleting data with format errors, smoothing the column data, and extracting important data. This reduces the influence of poor data quality on the recognition result and improves the overall recognition accuracy.
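Two of the cleaning operations mentioned above, completing missing values and correcting outliers, might look like the following sketch. Filling with the median and clipping to the interquartile-range fences are swapped-in techniques chosen for illustration, and `preprocess_numeric_column` and `k` are hypothetical names; the source does not say how its preprocessing model performs these steps.

```python
import statistics
from typing import List, Optional


def preprocess_numeric_column(values: List[Optional[float]], k: float = 1.5) -> List[float]:
    """Fill missing cells with the median, then clip outliers to the IQR fences."""
    present = [v for v in values if v is not None]
    if len(present) < 2:
        raise ValueError("need at least two non-missing values")
    median = statistics.median(present)
    q1, _, q3 = statistics.quantiles(present, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    filled = [median if v is None else v for v in values]   # complete missing data
    return [min(max(v, low), high) for v in filled]         # correct outlier values
```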
Compared with the data matching device provided in Fig. 6, this other data matching device provided in the embodiment of the present invention preprocesses the column data of each column with a preset data preprocessing model before the feature vectors are extracted, which effectively improves data quality, reduces the influence of poor data quality on the recognition result, and improves the accuracy of data matching.
As shown in Fig. 8, in one embodiment, a further data matching device is provided, which differs from the data matching device shown in Fig. 6 in that it further includes:
A column label extracting unit 810, configured to extract, according to preset rules, at least one column label identified and determined by the data identification model.
In embodiments of the present invention, the label results are displayed using visualization techniques, which makes it convenient for business personnel to check the determined column labels.
A column label judging unit 820, configured to judge whether the identified column label is accurate.
In embodiments of the present invention, when the column labels identified by the data identification model are accurate, the classification accuracy of the model is high, and the data columns can be matched directly based on their labels; the subsequent step executed in that case is generally the matching of the data columns based on their labels.
A data identification model optimization unit 830, configured to modify the column label when the identified column label is judged to be inaccurate, and to optimize the data identification model according to the modified column label.
In embodiments of the present invention, when a column label identified by the data identification model is inaccurate, the model is not yet fully optimized and still contains some error, so it needs to be optimized. By modifying the column label to the correct label, the data identification model can be optimized in reverse, which further improves its accuracy.
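The correction step can be pictured as turning each reviewer correction into a new training sample, as in the sketch below. The function `apply_reviewer_corrections` and its arguments are hypothetical, and the retraining itself is left to the caller because the source does not describe how the model is re-optimized.

```python
from typing import Dict, List, Tuple

FeatureVector = Tuple[float, ...]


def apply_reviewer_corrections(
    predicted: Dict[str, str],
    vectors: Dict[str, FeatureVector],
    corrections: Dict[str, str],
    training_set: List[Tuple[FeatureVector, str]],
) -> Dict[str, str]:
    """Return the corrected labels and append corrected samples to training_set."""
    final = dict(predicted)
    for column, right_label in corrections.items():
        if predicted.get(column) != right_label:
            final[column] = right_label
            # The corrected pair becomes a new training sample; retraining the
            # identification model on training_set is left to the caller.
            training_set.append((vectors[column], right_label))
    return final
```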
Compared with the data matching device provided in Fig. 6, this further data matching device provided in the embodiment of the present invention displays the labels using visualization techniques after label recognition, to help business personnel check the label results, and further judges whether the identified labels are accurate. When a result is judged to be inaccurate, the data identification model can be optimized in reverse, further improving its accuracy.
In one embodiment, a computer device is provided. The computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor performs the following steps:
acquiring multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
matching the column data of each data column using code table code values;
identifying, using regular expressions, the parts of each data column that satisfy preset matching rules;
identifying and determining the column type of each data column from the column data of each data column using preset rules, the column types including a numeric type and a text type;
extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column, the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data, and the basic attribute features of the data column include the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
based on the column type of each data column, identifying the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column;
matching the data columns based on the column label of each data column.
In one embodiment, a computer-readable storage medium is provided. A computer program is stored on the computer-readable storage medium; when the computer program is executed by a processor, it causes the processor to perform the following steps:
acquiring multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
matching the column data of each data column using code table code values;
identifying, using regular expressions, the parts of each data column that satisfy preset matching rules;
identifying and determining the column type of each data column from the column data of each data column using preset rules, the column types including a numeric type and a text type;
extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column, the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data, and the basic attribute features of the data column include the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
based on the column type of each data column, identifying the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column;
matching the data columns based on the column label of each data column.
It should be understood that although the steps in the flowcharts of the embodiments of the present invention are shown in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless expressly stated otherwise herein, there is no strict limit on the order in which these steps are executed, and they may be executed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily executed one after another but may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A data matching method, characterized in that the method comprises:
acquiring multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
matching the column data of each data column using code table code values;
identifying, using regular expressions, the parts of each data column that satisfy preset matching rules;
identifying and determining the column type of each data column from the column data of each data column using preset rules, the column types including a numeric type and a text type;
extracting the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column, the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data, and the basic attribute features of the data column include the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
based on the column type of each data column, identifying the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column;
matching the data columns based on the column label of each data column.
2. The data matching method according to claim 1, characterized in that, before the step of extracting the column feature vector of each data column using a preset feature extraction model according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, the method further comprises:
preprocessing the column data of each data column based on a preset data preprocessing model, the preprocessing including the completion of missing data and the extraction of important data.
3. The data matching method according to claim 1, characterized in that, before the step of matching the data columns based on the column label of each data column, the method further comprises:
extracting, according to preset rules, at least one column label identified and determined by the data identification model;
judging whether the identified column label is accurate;
when the identified column label is judged to be inaccurate, modifying the column label, and optimizing the data identification model according to the modified column label.
4. The data matching method according to claim 1, characterized in that the step of identifying, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and determining the column label of each data column, specifically comprises:
identifying the column feature vectors of columns whose data type is the numeric type using a numeric data identification model generated in advance by training based on the random forest algorithm, and determining the column labels;
identifying the column feature vectors of columns whose data type is the text type using a text data identification model generated in advance by training based on the naive Bayes algorithm, and determining the column labels.
5. The data matching method according to claim 4, characterized in that the step of generating the numeric data identification model by training based on the random forest algorithm specifically comprises:
acquiring multiple data sample tables, each data sample table containing multiple pieces of sample column name information and/or sample column annotation information and the sample column data of each sample column;
acquiring the target label of each sample column;
identifying and determining the sample column data type of each sample column from the sample column data of each sample column using regular expressions;
extracting the sample column feature vector of each sample column using a preset feature extraction model, according to the sample column name information and/or sample column annotation information of each sample column, the sample column data of each sample column, and the sample column data type of each sample column, wherein the sample column feature vector includes statistical features of the sample column data, descriptive features of the sample column name and/or sample column annotation information, and basic attribute features of the sample column, the statistical features of the sample column data include the value range, mean, and variance of the sample column data, and the basic attribute features of the sample column include the frequency of use of the sample column data, the data type of the sample column data, and the importance of the sample column data determined in advance according to preset rules;
establishing an initialized numeric data identification model containing variable parameters, based on the random forest algorithm;
determining the response label of each sample column according to the sample column feature vector of each sample column and the numeric data identification model;
judging whether the response labels and the target labels satisfy a preset training success condition;
when the response labels and the target labels are judged not to satisfy the preset training success condition, adjusting the variable parameters in the numeric data identification model, and returning to the step of determining the response label of each sample column according to the sample column feature vector of each sample column and the numeric data identification model;
when the response labels and the target labels are judged to satisfy the preset training success condition, determining the current numeric data identification model as the numeric data identification model generated by training based on the random forest algorithm.
6. A data matching device, characterized by comprising:
a data table acquiring unit, configured to acquire multiple data tables to be matched, each data table containing the column name information and/or column annotation information of multiple data columns and the column data of each data column;
a code table code value matching unit, configured to match the column data of each data column using code table code values;
a regular expression recognition unit, configured to identify, using regular expressions, the parts of each data column that satisfy preset matching rules;
a column type determining unit, configured to identify and determine the column type of each data column from the column data of each data column using preset rules, the column types including a numeric type and a text type;
a column feature vector extraction unit, configured to extract the column feature vector of each data column using a preset feature extraction model, according to the column name information and/or column annotation information of each data column, the column data of each data column, and the column type of each data column, wherein the column feature vector includes statistical features of the column data, descriptive features of the column name and/or column annotation information, and basic attribute features of the data column, the statistical features of the column data include the value range, mean, variance, quantiles, coefficient of variation, kurtosis, skewness, and information entropy of the column data, and the basic attribute features of the data column include the frequency of use of the column data, the column type of the data column, and the importance of the data column determined in advance according to preset rules;
a column label determination unit, configured to identify, based on the column type of each data column, the column feature vector of each data column using the pre-trained data identification model corresponding to that column type, and to determine the column label of each data column;
a data matching unit, configured to match the data columns based on the column label of each data column.
7. The data matching device according to claim 6, characterized by further comprising:
a data preprocessing unit, configured to preprocess the column data of each data column based on a preset data preprocessing model, the preprocessing including the completion of missing data and the extraction of important data.
8. The data matching device according to claim 6, characterized by further comprising:
a column label extracting unit, configured to extract, according to preset rules, at least one column label identified and determined by the data identification model;
a column label judging unit, configured to judge whether the identified column label is accurate;
a data identification model optimization unit, configured to modify the column label when the identified column label is judged to be inaccurate, and to optimize the data identification model according to the modified column label.
9. A computer device, characterized by comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the data matching method according to any one of claims 1 to 5.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, causes the processor to perform the steps of the data matching method according to any one of claims 1 to 5.
CN201910664541.7A 2019-07-23 2019-07-23 Data matching method, device, computer equipment and storage medium Pending CN110427992A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910664541.7A CN110427992A (en) 2019-07-23 2019-07-23 Data matching method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN110427992A true CN110427992A (en) 2019-11-08

Family

ID=68411857

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910664541.7A Pending CN110427992A (en) 2019-07-23 2019-07-23 Data matching method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110427992A (en)

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吴家碚等: "《C语言程序设计与应用(高职)》", 31 January 2015 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110929285A (en) * 2019-12-10 2020-03-27 支付宝(杭州)信息技术有限公司 Method and device for processing private data
CN110929285B (en) * 2019-12-10 2022-01-25 支付宝(杭州)信息技术有限公司 Method and device for processing private data
CN111104466A (en) * 2019-12-25 2020-05-05 航天科工网络信息发展有限公司 Method for rapidly classifying massive database tables
CN113127509A (en) * 2019-12-31 2021-07-16 ***通信集团重庆有限公司 Method and device for adapting SQL execution engine in PaaS platform
CN113127509B (en) * 2019-12-31 2023-08-15 ***通信集团重庆有限公司 Method and device for adapting SQL execution engine in PaaS platform
WO2022123370A1 (en) * 2020-12-11 2022-06-16 International Business Machines Corporation Finding locations of tabular data across systems
US11500886B2 (en) 2020-12-11 2022-11-15 International Business Machines Corporation Finding locations of tabular data across systems
GB2616577A (en) * 2020-12-11 2023-09-13 Ibm Finding locations of tabular data across systems
CN113157788A (en) * 2021-04-13 2021-07-23 福州外语外贸学院 Big data mining method and system
CN113157788B (en) * 2021-04-13 2024-02-13 福州外语外贸学院 Big data mining method and system
CN113076379A (en) * 2021-04-27 2021-07-06 上海德衡数据科技有限公司 Method and system for distinguishing element number areas based on digital ICD
CN113312354A (en) * 2021-06-10 2021-08-27 北京百度网讯科技有限公司 Data table identification method, device, equipment and storage medium
CN113312354B (en) * 2021-06-10 2023-07-28 北京百度网讯科技有限公司 Data table identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110427992A (en) Data matching method, device, computer equipment and storage medium
CN111160017B (en) Keyword extraction method, phonetics scoring method and phonetics recommendation method
CN110704633B (en) Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium
CN109635117B (en) Method and device for recognizing user intention based on knowledge graph
CN111444723B (en) Information extraction method, computer device, and storage medium
CN108959242B (en) Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN109992664B (en) Dispute focus label classification method and device, computer equipment and storage medium
US20140351228A1 (en) Dialog system, redundant message removal method and redundant message removal program
CN110795919A (en) Method, device, equipment and medium for extracting table in PDF document
CN111368049A (en) Information acquisition method and device, electronic equipment and computer readable storage medium
CN108573707B (en) Method, device, equipment and medium for processing voice recognition result
CN110427612B (en) Entity disambiguation method, device, equipment and storage medium based on multiple languages
CN112860841A (en) Text emotion analysis method, device and equipment and storage medium
CN108920677A (en) Questionnaire method, investigating system and electronic equipment
JP2019503541A (en) An annotation system for extracting attributes from electronic data structures
CN112287095A (en) Method and device for determining answers to questions, computer equipment and storage medium
CN112699923A (en) Document classification prediction method and device, computer equipment and storage medium
CN111814482A (en) Text key data extraction method and system and computer equipment
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN112347997A (en) Test question detection and identification method and device, electronic equipment and medium
CN110532229B (en) Evidence file retrieval method, device, computer equipment and storage medium
KR102185733B1 (en) Server and method for automatically generating profile
CN111581346A (en) Event extraction method and device
CN113420116B (en) Medical document analysis method, device, equipment and medium
CN109660621A (en) Content delivery method and service equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191108