CN109634949A - Mixed data cleaning method based on multiple data versions - Google Patents

Publication number
CN109634949A
Authority
CN
China
Prior art keywords
data
cleaning
versions
rule
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811628044.3A
Other languages
Chinese (zh)
Other versions
CN109634949B (en)
Inventor
高云君
陈刚
陈纯
葛丛丛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU
Priority to CN201811628044.3A
Publication of CN109634949A
Application granted
Publication of CN109634949B
Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a mixed data cleaning method based on multiple data versions. Using the Markov logic network probabilistic graphical model and the minimal-repair principle, the invention integrates qualitative and quantitative techniques into an efficient data cleaning method that detects and corrects erroneous structured data. The cleaning result repairs dirty data that violates the rule constraints, keeps the cost of changes to the data set minimal, and conforms to the statistical properties of the data. The invention first divides the entire data set into blocks and groups using a Markov logic index, and then performs two-stage data cleaning. The first stage introduces the reliability score as an evaluation criterion and cleans the data in each group to obtain multi-version cleaning results. The second stage introduces the fusion score as an evaluation criterion and merges the multi-version results produced by the first stage to generate a final, unified cleaning result.

Description

Mixed data cleaning method based on multiple data versions
Technical field
The present invention relates to techniques for cleaning erroneous data in the field of computer databases, and in particular to a mixed data cleaning method based on multiple data versions.
Background
The purpose of data cleaning is to find the content in a data set that is most likely to be erroneous and to provide a reliable method for correcting it. Dirty data is data in a data set that contains errors.
Today, with the continued emergence of new ways of publishing information, typified by social networks and e-commerce, and with the rise of cloud computing and Internet of Things technologies, data is growing and accumulating at an unprecedented rate. In data analysis, dirty data not only leads to wrong decisions and unreliable analyses but can also cause economic losses for enterprises. Data cleaning has therefore attracted great interest in both industry and academia. Data cleaning is the process of detecting and repairing erroneous data; its purpose is to delete redundant information, correct existing errors, and keep the data consistent.
Researchers in China and abroad have already done some work on data cleaning methods. Current mainstream methods fall roughly into two classes, qualitative and quantitative. (1) Qualitative methods mainly clean erroneous data that violates integrity constraint rules. Their evaluation criterion is the minimum cost principle, which requires that cleaning change the data set as little as possible. Their drawback is that they cannot clean erroneous data whose repair does not satisfy the minimum cost principle, even though that data still violates the integrity constraints. (2) Quantitative methods build a suitable model from the probability distribution of the data to determine the cleaning strategy. Their drawback is that they depend heavily on the training set: enough clean, known data must be provided as a training set to build a reliable model, which is impractical in today's big data environments. Most current quantitative methods also clean less accurately than qualitative methods, and existing methods have long running times.
Summary of the invention
In view of the above deficiencies, the present invention provides a mixed data cleaning method based on multiple data versions. By combining qualitative and quantitative methods, it both cleans data that violates the integrity constraints (ICs) and makes the cleaning result conform to the statistical properties of the data. The method is based on Markov logic networks. It first divides the entire data set into blocks and groups using a Markov logic index, and then performs two-stage data cleaning. In the first stage, each block is cleaned independently, yielding multi-version cleaning results. In the second stage, conflicts among the versions are eliminated to obtain a final, globally unified cleaning result. The Markov logic index narrows the search range for dirty data, so that data cleaning can be performed efficiently.
To achieve the above object, the present invention adopts the following technical solution: a mixed data cleaning method based on multiple data versions, comprising the following steps:
(1) obtain a data set containing dirty data and the associated integrity constraint rules (ICs);
(2) convert the different types of integrity constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice;
(3) build a Markov logic index structure over the dirty data set: first divide the dirty data set into data blocks according to the rules, with each rule corresponding to one data block and the data slice being the smallest unit within a block, and then divide each data block into data groups;
(4) on the basis of step (3), perform the first cleaning stage: introduce the reliability score as an evaluation criterion and clean each data group independently to obtain multiple data versions of preliminary cleaning results;
(5) perform the second cleaning stage: introduce the fusion score as an evaluation criterion, merge the data versions produced by the first stage, and eliminate conflicts among the versions to generate a final, unified cleaning result;
(6) mark the duplicate entries in the data set and delete the duplicates that remain after the two cleaning stages;
(7) output the cleaned data set.
Further, step (2) is specifically:
(2.1) normalize the different types of input integrity constraints into Markov logic network rules using conjunctive normal form conversion rules;
(2.2) replace every variable in the normalized rules with the corresponding constants from the data set.
Further, step (3) is specifically:
(3.1) divide the entire dirty data set into multiple data blocks according to the integrity constraint rules that apply to it; each rule corresponds to one data block, and each data block contains several data slices;
(3.2) within each data block, place the entries whose attributes contain the same keyword in the same group; the keyword is the reason part of the rule, so data slices with the same reason part fall into one group.
Further, step (4) is specifically:
(4.1) handle abnormal data: a data slice is called "abnormal" when a data error in its reason part causes it to be assigned to the wrong group; such data slices are reassigned to their correct groups;
(4.2) compute the reliability score (r-score) of the abnormal data in each group using a similarity distance measure and Markov logic network weight learning;
(4.3) clean each data group independently: the unit of cleaning is each group in a data block. Select the data slice γ with the highest reliability score as the replacement baseline, and use it to replace the other questionable data in the same group. Once every data group in the data block has been cleaned, the independent cleaning of that data block is complete;
The other data blocks are cleaned in the same way. The preliminary cleaning results produced by this stage are regarded as multiple data versions, with each data block constituting one data version.
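The reliability score of step (4.2) can be pictured with the minimal sketch below. The text names only the two ingredients, a similarity distance and a learned Markov logic network weight; the way they are combined here (weight times the smallest distance to any other data slice in the group), the choice of Levenshtein distance, and the hand-picked weights are all assumptions made for illustration, not the patented formula.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def r_score(piece, others, weight):
    """ASSUMPTION: weight times the smallest distance to any other slice."""
    if not others:
        return weight
    return weight * min(levenshtein(piece, other) for other in others)


# Hypothetical weights for two data slices that share the reason value DOTHAN.
weights = {"DOTHAN|AL": 0.67, "DOTHAN|AK": 0.33}
scores = {
    piece: r_score(piece, [o for o in weights if o != piece], w)
    for piece, w in weights.items()
}
print(max(scores, key=scores.get))
```

In a real system the weights would come from Markov logic network weight learning rather than being typed in by hand.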
Further, step (5) is specifically:
(5.1) first, take each of the different data versions at a conflicting position as a baseline; then, starting from each baseline, find in every data block other than the baseline's own block the data slice that does not conflict with the baseline and has the largest Markov weight, and merge it with the baseline;
(5.2) repeat the merge operation until all data blocks have been traversed; then compute the fusion score of the merge result under this baseline, f-score(t) = w1 × … × wm, where wi denotes the Markov weight of the data slice merged from the i-th data block;
(5.3) select another baseline as the starting point, perform the merge operation again, and compute and record its fusion score, until the fusion scores of the merge results under all baselines have been obtained; then select the merge result with the largest fusion score as the final, globally unified cleaning result for the tuple.
Further, step (6) is specifically: after the two cleaning stages are complete, scan the entire data set and build a hash table over its tuples; when a duplicate is found during the scan, remove it.
The beneficial effects of the invention are as follows. The invention is a mixed data cleaning method based on qualitative and quantitative techniques. Through Markov logic network rules, it brings multiple types of integrity constraints together, and it uses Markov logic network weight learning together with a similarity distance measure as the basis for cleaning, so that the cleaning result satisfies both the minimum cost principle followed by qualitative techniques and the statistical properties expected by quantitative techniques. In addition, the optimization designed in the invention, the Markov logic index, narrows the search range for dirty data and shortens the running time of data cleaning. The invention was tested on real and synthetic data sets, and the results show higher cleaning efficiency and cleaning accuracy than currently popular systems.
Description of the drawings
Fig. 1 is a flow chart of the implementation steps of the invention;
Fig. 2(a) shows the Markov logic network index structure formed for the hospital data set under rule (r1), the FD under which a city determines its state;
Fig. 2(b) shows the Markov logic network index structure formed for the hospital data set under rule (r2), the DC under which hospitals in different states have different telephone numbers;
Fig. 2(c) shows the Markov logic network index structure formed for the hospital data set under rule (r3), the CFD: HN["ELIZA"], CT["BOAZ"] => PN["2567688400"];
Fig. 3(a) is a schematic of the Markov logic network index structure for rule r1 after the first cleaning stage;
Fig. 3(b) is a schematic of the Markov logic network index structure for rule r2 after the first cleaning stage;
Fig. 3(c) is a schematic of the Markov logic network index structure for rule r3 after the first cleaning stage;
Fig. 4 is a schematic of the second-stage cleaning process.
Detailed description of the embodiments
The technical solution of the invention is further described below with reference to the drawings and a specific embodiment:
As shown in Fig. 1, the specific implementation process and working principle of the invention are as follows:
Step (1): input the integrity constraints (ICs) and the data set containing dirty data into the framework. The dirty data set and the integrity constraints are illustrated below using Table 1:
Table 1 shows a sample of hospital information records with four attributes: hospital name (HN), city (CT), state (ST), and contact number (PN). The entries shaded gray in Table 1 are erroneous data. Three integrity constraints, r1, r2, and r3, are given:
Here D denotes the data set and t1, t2 denote two different tuples. The functional dependency (FD) rule r1 states that a city can belong to only one state. The denial constraint (DC) rule r2 states that hospitals in different states have different telephone numbers. The conditional functional dependency (CFD) rule r3 states that the name of a hospital and its corresponding city determine the hospital's telephone number.
Table 1:
Step (2): convert the different types of integrity constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice.
The specific steps are:
1) normalize the different types of input integrity constraints into Markov logic network rules using conjunctive normal form conversion rules;
2) replace the variables in the normalized rules with the constants from the data set.
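The instantiation in step 2) can be sketched as follows. The function name `instantiate`, the split of a rule into `reason` and `result` attribute lists, and the output notation (borrowed from the way rule r3 is written in the description of Fig. 2(c)) are assumptions for illustration; conversion to conjunctive normal form is not modeled.

```python
def instantiate(reason, result, row):
    """Replace each attribute variable of a rule with the tuple's constant."""
    def ground(attrs):
        return ", ".join(f'{a}["{row[a]}"]' for a in attrs)
    return f"{ground(reason)} => {ground(result)}"


# One tuple, using the attribute values reported for t3 after cleaning.
row = {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"}

# Grounding a rule of the form HN, CT => PN yields one data slice.
print(instantiate(("HN", "CT"), ("PN",), row))
```

Each grounded rule produced this way plays the role of one data slice in the later steps.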
Step (3): build a Markov logic index structure over the dirty data set. First divide the dirty data set into data blocks according to the rules, with each rule corresponding to one data block and the data slice being the smallest unit within a block; then divide each data block into groups. The specific steps are:
1) divide the entire dirty data set into multiple data blocks according to the integrity constraint rules that apply to it; each rule corresponds to one data block, and each data block contains several data slices γ;
2) within each data block, place the entries whose attributes contain the same keyword in the same group; the keyword is the reason part of the rule, so the data slices γ with the same reason part fall into one group.
Construction of the Markov logic network index is illustrated below using Fig. 2(a), Fig. 2(b), and Fig. 2(c):
Take the data set of Table 1 as the sample. The given constraint rules involve HN, CT, ST, and PN, and the data set is accordingly divided into three blocks B1, B2, and B3, one per rule, taking care to distinguish the reason attributes from the result attributes of each constraint rule. Each of the three blocks is then grouped: data slices in a block whose reason attributes share the same keyword are placed in one group. For example, the three data slices in G13 of B1 all have the same reason keyword, so they form one group. The Markov logic network index structure for B1 is shown in Fig. 2(a), that for B2 in Fig. 2(b), and that for B3 in Fig. 2(c).
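The block-and-group index described above can be sketched as a nested dictionary. The function `build_index`, the representation of a rule as a pair of reason and result attribute tuples, and the three sample tuples are hypothetical; they are not the contents of Table 1.

```python
from collections import defaultdict


def build_index(rules, tuples):
    """Return {rule name: {reason key: [(tuple id, data slice), ...]}}."""
    index = {}
    for name, (reason, result) in rules.items():
        groups = defaultdict(list)
        for tid, row in tuples.items():
            piece = {a: row[a] for a in reason + result}  # the data slice
            key = tuple(row[a] for a in reason)           # reason-part values
            groups[key].append((tid, piece))
        index[name] = dict(groups)                        # one block per rule
    return index


# Illustrative data only: one rule whose reason part is CT and result is ST.
rules = {"r1": (("CT",), ("ST",))}
tuples = {
    "t1": {"CT": "DOTHAN", "ST": "AL"},
    "t2": {"CT": "DOTHAN", "ST": "AK"},
    "t3": {"CT": "BOAZ", "ST": "AL"},
}
index = build_index(rules, tuples)
print(sorted(index["r1"]))
```

Grouping by the reason part means that later cleaning only compares data slices inside one group, which is how the index narrows the search range.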
Step (4): on the basis of step (3), perform the first cleaning stage. The reliability score is introduced as an evaluation criterion, and each data group is cleaned independently to obtain multiple data versions (each data version comes from a different block), as follows:
1) Handle abnormal data. A data slice is called "abnormal" when a data error in its reason part causes it to be assigned to the wrong group; such data slices are reassigned to their correct groups;
2) Compute the reliability score (r-score) of the abnormal data in each group using a similarity distance measure and Markov logic network weight learning. The score is computed from d(γi, γ*), the distance between data slice γi and its candidate replacement γ*, and w(γi), the Markov weight of data slice γi.
3) Clean each data block independently. Specifically, the unit of cleaning is each group in a data block: the data slice γ with the highest reliability score is selected as the replacement baseline and used to replace the other questionable data in the same group. Once every group in the data block has been cleaned, the independent cleaning of that data block is complete. The other data blocks are cleaned in the same way. The preliminary cleaning results produced by this stage are regarded as multiple data versions, with each data block constituting one data version. The Markov logic index structures after this stage are shown in Fig. 3(a), Fig. 3(b), and Fig. 3(c).
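The group-by-group replacement in item 3) can be sketched as below. The functions `clean_group` and `clean_block`, the tuple representation of a data slice, and the score values are hypothetical; the scores are taken as given inputs rather than computed.

```python
def clean_group(group, scores):
    """Replace every slice in the group with the highest-scoring one.

    group:  list of (tuple id, slice) pairs, each slice a hashable tuple
    scores: mapping slice -> reliability score (computed elsewhere)
    """
    best = max({s for _, s in group}, key=lambda s: scores[s])
    return [(tid, best) for tid, _ in group]


def clean_block(block, scores):
    """Clean each group independently; the result is one data version."""
    return {key: clean_group(group, scores) for key, group in block.items()}


# Illustrative block with a single group keyed by the reason value DOTHAN.
block = {
    ("DOTHAN",): [
        ("t1", ("DOTHAN", "AL")),
        ("t2", ("DOTHAN", "AK")),
        ("t4", ("DOTHAN", "AL")),
    ]
}
scores = {("DOTHAN", "AL"): 0.67, ("DOTHAN", "AK"): 0.33}  # made-up values
version = clean_block(block, scores)
print(version[("DOTHAN",)])
```

Running `clean_block` once per block yields one data version per rule, which is the input to the second stage.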
Step (5): because the first cleaning stage produces multi-version results, different data versions may conflict; that is, different versions may give different cleaning results for the same position in the data set. The fusion score is therefore introduced as an evaluation criterion to eliminate conflicts among the versions and obtain a final, globally unified cleaning result.
Take tuple t3 in Table 1 as an example. After the first cleaning stage, the data slice related to t3 in B1 is {CT: DOTHAN, ST: AL} (the first data version), whereas the data slice related to t3 in B3 is {HN: ELIZA, CT: BOAZ, PN: 2567688400} (the third data version). Clearly, after the first stage t3[CT] has two different values, "DOTHAN" and "BOAZ", which come from two different data versions. In other words, t3 has a conflict on attribute CT, and this conflict must be resolved to obtain a final, consistent cleaning result.
This step proceeds as follows:
1) Detect all tuples that contain conflicts and record the data slices involved in each conflict. As shown in Fig. 4, t3 corresponds to two conflicting data slices, α1∈B1 and α2∈B2, and each is used as a baseline for generating a different candidate scheme.
2) For each baseline, merge the corresponding data slices across the data blocks. Two cases arise. If the data slice to be merged does not conflict with the baseline, merge them directly. If it does conflict, find another data slice in the same block that does not conflict with the baseline and has the largest Markov weight, and merge that one instead. The merged data slice then becomes the new baseline, and the procedure is repeated until all data blocks have been merged. Note that if no suitable data slice can be found during merging, the merge is considered impossible under that baseline.
3) After step 2), several candidate schemes exist for each tuple that contains a conflict. Each candidate is scored with the fusion score (f-score), f-score(t) = w1 × … × wm, and the highest-scoring candidate is chosen as the final result. As shown in Fig. 4, for the merge scheme with α1∈B1 as the baseline, no suitable data slice can be found when merging the corresponding data slice in B3, so the merge cannot be completed under this baseline and f-score(t3) = 0. With α2∈B2 as the baseline, the merge result is t3 = {HN: ELIZA, CT: BOAZ, ST: AL, PN: 2567688400}, with f-score(t3) = 0.0678. The second merge scheme is therefore taken as the final cleaning result for t3.
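The merge-and-score procedure of this step can be sketched as follows, loosely following the t3 example in which the versions from B1 and B3 disagree on CT. The function names, the candidate lists, and all weights are hypothetical; in particular, the weights are not the ones behind the reported value 0.0678.

```python
from math import prod


def conflicts(a, b):
    """Two slices conflict if they disagree on a shared attribute."""
    return any(a[k] != b[k] for k in a.keys() & b.keys())


def fuse(base_block, candidates):
    """Merge one slice per block, starting from the base block's own slice.

    candidates: {block: [(slice, weight), ...]}, the tuple's own slice first.
    Returns (merged tuple, f-score); the f-score is 0.0 if merging fails.
    """
    first, weight = candidates[base_block][0]
    merged, weights = dict(first), [weight]
    for block, options in candidates.items():
        if block == base_block:
            continue
        pick = options[0]
        if conflicts(merged, pick[0]):
            # Fall back to the heaviest non-conflicting slice of this block.
            ok = [(s, w) for s, w in options[1:] if not conflicts(merged, s)]
            if not ok:
                return None, 0.0
            pick = max(ok, key=lambda sw: sw[1])
        merged.update(pick[0])     # the merged slice becomes the new baseline
        weights.append(pick[1])
    return merged, prod(weights)   # f-score = w1 x ... x wm


candidates = {
    "B1": [({"CT": "DOTHAN", "ST": "AL"}, 0.5), ({"CT": "BOAZ", "ST": "AL"}, 0.4)],
    "B2": [({"ST": "AL", "PN": "2567688400"}, 0.6)],
    "B3": [({"HN": "ELIZA", "CT": "BOAZ", "PN": "2567688400"}, 0.3)],
}
results = {base: fuse(base, candidates) for base in ("B1", "B3")}
best = max(results.values(), key=lambda r: r[1])
print(best[0])
```

With these made-up inputs the baseline from B1 cannot be completed and scores 0, while the baseline from B3 merges successfully, mirroring the outcome described for t3.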
Step (6): after the two cleaning stages are complete, scan the entire data set and build a hash table over its tuples; when a duplicate is found during the scan, remove it.
Step (7): output the cleaned data set.
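The duplicate removal of step (6) can be sketched with a hash set of tuples already seen. The function name `drop_duplicates` and the sample rows are hypothetical, and a Python `set` stands in for the hash table mentioned in the text.

```python
def drop_duplicates(rows):
    """Keep the first occurrence of each tuple, using a hash set of seen rows."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the tuple
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"},
    {"HN": "ELIZA", "CT": "BOAZ", "ST": "AL", "PN": "2567688400"},
    {"HN": "ELIZA", "CT": "DOTHAN", "ST": "AL", "PN": "2567688400"},
]
print(len(drop_duplicates(rows)))
```

Because each lookup in the set takes constant time on average, the scan stays linear in the number of tuples.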

Claims (6)

1. A mixed data cleaning method based on multiple data versions, characterized in that the method comprises the following steps:
(1) obtain a data set containing dirty data and the associated integrity constraint rules (ICs);
(2) convert the different types of integrity constraint rules into normalized Markov logic network rules, and instantiate the converted rules with the constants contained in each tuple of the dirty data set; each instantiated rule is called a data slice;
(3) build a Markov logic index structure over the dirty data set: first divide the dirty data set into data blocks according to the rules, with each rule corresponding to one data block and the data slice being the smallest unit within a block, and then divide each data block into data groups;
(4) on the basis of step (3), perform the first cleaning stage: introduce the reliability score as an evaluation criterion and clean each data group independently to obtain multiple data versions of preliminary cleaning results;
(5) perform the second cleaning stage: introduce the fusion score as an evaluation criterion, merge the data versions produced by the first stage, and eliminate conflicts among the versions to generate a final, unified cleaning result;
(6) mark the duplicate entries in the data set and delete the duplicates that remain after the two cleaning stages;
(7) output the cleaned data set.
2. The mixed data cleaning method based on multiple data versions according to claim 1, characterized in that step (2) is specifically:
(2.1) normalize the different types of input integrity constraints into Markov logic network rules using conjunctive normal form conversion rules;
(2.2) replace every variable in the normalized rules with the corresponding constants from the data set.
3. The mixed data cleaning method based on multiple data versions according to claim 1, characterized in that step (3) is specifically:
(3.1) divide the entire dirty data set into multiple data blocks according to the integrity constraint rules that apply to it; each rule corresponds to one data block, and each data block contains several data slices;
(3.2) within each data block, place the entries whose attributes contain the same keyword in the same group; the keyword is the reason part of the rule, so data slices with the same reason part fall into one group.
4. The mixed data cleaning method based on multiple data versions according to claim 1, characterized in that step (4) is specifically:
(4.1) handle abnormal data: a data slice is called "abnormal" when a data error in its reason part causes it to be assigned to the wrong group; such data slices are reassigned to their correct groups;
(4.2) compute the reliability score (r-score) of the abnormal data in each group using a similarity distance measure and Markov logic network weight learning;
(4.3) clean each data group independently: the unit of cleaning is each group in a data block. Select the data slice γ with the highest reliability score as the replacement baseline, and use it to replace the other questionable data in the same group. Once every data group in the data block has been cleaned, the independent cleaning of that data block is complete;
The other data blocks are cleaned in the same way. The preliminary cleaning results produced by this stage are regarded as multiple data versions, with each data block constituting one data version.
5. The mixed data cleaning method based on multiple data versions according to claim 1, characterized in that step (5) is specifically:
(5.1) first, take each of the different data versions at a conflicting position as a baseline; then, starting from each baseline, find in every data block other than the baseline's own block the data slice that does not conflict with the baseline and has the largest Markov weight, and merge it with the baseline;
(5.2) repeat the merge operation until all data blocks have been traversed; then compute the fusion score of the merge result under this baseline, f-score(t) = w1 × … × wm, where wi denotes the Markov weight of the data slice merged from the i-th data block;
(5.3) select another baseline as the starting point, perform the merge operation again, and compute and record its fusion score, until the fusion scores of the merge results under all baselines have been obtained; then select the merge result with the largest fusion score as the final, globally unified cleaning result for the tuple.
6. The mixed data cleaning method based on multiple data versions according to claim 1, characterized in that step (6) is specifically: after the two cleaning stages are complete, scan the entire data set and build a hash table over its tuples; when a duplicate is found during the scan, remove it.
CN201811628044.3A 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions Active CN109634949B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811628044.3A CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811628044.3A CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Publications (2)

Publication Number Publication Date
CN109634949A (en) 2019-04-16
CN109634949B (en) 2022-04-12

Family

ID=66079015

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811628044.3A Active CN109634949B (en) 2018-12-28 2018-12-28 Mixed data cleaning method based on multiple data versions

Country Status (1)

Country Link
CN (1) CN109634949B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2919533A1 (en) * 2012-08-01 2014-02-06 Sherpa Technologies Inc. System and method for managing versions of program assets
CN105339940A (en) * 2013-06-28 2016-02-17 甲骨文国际公司 Naive, client-side sharding with online addition of shards
CN106649644A (en) * 2016-12-08 2017-05-10 腾讯音乐娱乐(深圳)有限公司 Lyric file generation method and device
US20180150543A1 (en) * 2016-11-30 2018-05-31 Linkedin Corporation Unified multiversioned processing of derived data
US20180219888A1 (en) * 2017-01-30 2018-08-02 Splunk Inc. Graph-Based Network Security Threat Detection Across Time and Entities
CN108921399A (en) * 2018-06-14 2018-11-30 北京新广视通科技有限公司 A kind of intelligence direct management system and method

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287191A (en) * 2019-06-25 2019-09-27 北京明略软件***有限公司 Data alignment method and device, storage medium, electronic device
CN110287191B (en) * 2019-06-25 2021-07-27 北京明略软件***有限公司 Data alignment method and device, storage medium and electronic device
CN110968576A (en) * 2019-11-28 2020-04-07 哈尔滨工程大学 Content correlation-based numerical data consistency cleaning method
WO2021143463A1 (en) * 2020-01-17 2021-07-22 深圳市华傲数据技术有限公司 Data cleaning method and apparatus

Also Published As

Publication number Publication date
CN109634949B (en) 2022-04-12

Similar Documents

Publication Publication Date Title
Tabassum et al. An investigation of cross-project learning in online just-in-time software defect prediction
CN109144882B (en) Software fault positioning method and device based on program invariants
CN111753101B (en) Knowledge graph representation learning method integrating entity description and type
Nobre et al. Lineage: Visualizing multivariate clinical data in genealogy graphs
Christophides et al. End-to-end entity resolution for big data: A survey
Bininda-Emonds The evolution of supertrees
Nargesian et al. Organizing data lakes for navigation
CN109634949A (en) Mixed data cleaning method based on multiple data versions
Ge et al. A hybrid data cleaning framework using markov logic networks
CN111597347A (en) Knowledge embedded defect report reconstruction method and device
US20200320153A1 (en) Method for accessing data records of a master data management system
Deng et al. Unsupervised string transformation learning for entity consolidation
CN108959395A (en) A kind of level towards multi-source heterogeneous big data about subtracts combined cleaning method
Laure et al. Machine learning to data management: A round trip
Mahdavi et al. Semi-Supervised Data Cleaning with Raha and Baran.
Galhotra et al. Beer: blocking for effective entity resolution
US11321359B2 (en) Review and curation of record clustering changes at large scale
Song et al. Auto-validate: Unsupervised data validation using data-domain patterns inferred from data lakes
CN115438274A (en) False news identification method based on heterogeneous graph convolutional network
Imran et al. Complex process modeling in Process mining: A systematic review
Ciszak Application of clustering and association methods in data cleaning
Denaux et al. Towards Crowdsourcing Tasks for Accurate Misinformation Detection.
Zhou et al. D-bot: Database diagnosis system using large language models
Wang et al. Error diagnosis and data profiling with data x-ray
CN113516189A (en) Website malicious user prediction method based on two-stage random forest algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant