CN104881487A

CN104881487A - Data filling method and data filling system based on quality control

Info

Publication number: CN104881487A
Application number: CN201510304863.2A
Authority: CN
Inventors: 李直旭; 周剑; 杨强; 李洋
Original assignee: Zhangjiagang Institute of Industrial Technologies Soochow University
Current assignee: Suzhou Big Data Co ltd; Suzhou Big Data Research Institute Co ltd; Suzhou Big Data Trading Service Co ltd
Priority date: 2015-06-04
Filing date: 2015-06-04
Publication date: 2015-09-02
Anticipated expiration: 2035-06-04
Also published as: CN104881487B

Abstract

The invention discloses a data filling method and a data filling system based on quality control. The data filling method includes the steps of determining missing data according to existing data of a database; establishing a data dependency relationship of the database, and determining a dependency confidence level of the data dependency relationship; determining deducible data and at least one group of non-deducible data in the missing data according to the existing data and the data dependency relationship; determining one group of to-be-retrieved data from the groups of non-deducible data according to a preset rule; deducing the deducible data according to the existing data and the data dependency relationship, and calculating a deduction confidence level according to the dependency confidence level; when the deduction confidence level is higher than a preset threshold value, filling the deducible data, retrieving the to-be-retrieved data from external resources, and calculating a retrieving confidence level according to the dependency confidence level; when the retrieving confidence level is higher than a preset threshold value, filling the to-be-retrieved data. The data filling method and the data filling system based on quality control have the advantages that by means of executing deducing and retrieving alternately, high filling accuracy can be guaranteed at small cost, and the confidence level is enabled to be high due to the fact that the dependency confidence level of the data dependency relationship is taken into consideration.

Description

Data filling method and system based on quality control

Technical field

The present application relates to the technical field of database processing, and in particular to a data filling method and system based on quality control.

Background art

Data sources in all kinds of databases often contain missing information, some of it caused by missing raw data and some by operational errors. Such missing information leaves the data incomplete, which is a fairly common problem across databases. Data filling techniques were proposed to estimate, predict or recover the missing information in a data source by technical means.

Existing data filling methods for string data generally fall into two classes: deduction-based data filling methods and retrieval-based data filling methods.

Deduction-based data filling methods mainly combine given data quality rules (such as functional dependencies) to deduce the missing information from other parts of the data set. For example, consider an address data set with the known dependency "the city name determines the province name". If one tuple of the data set reads "school='Nanjing University'; city='Nanjing'; province='Jiangsu'" and another tuple reads "school='South Airways'; city='Nanjing'; province=''" (that is, the province of the second tuple is missing), the missing province of the second tuple can be deduced to be "Jiangsu" from the dependency.

Retrieval-based data filling methods mainly retrieve the missing information from external resources such as the network. When the missing information of the data set exists on the World Wide Web, such a method can find it accurately and fill it into the gaps in the data set.

However, the main drawback of deduction-based data filling methods lies in filling unique missing information: if the intact part of the data set contains no information corresponding to the missing information, the missing information cannot be deduced and filled accurately, which lowers the accuracy of data filling. Retrieval-based data filling methods can fill missing information accurately and so improve the accuracy of data filling, but retrieving the missing information requires a massive number of retrieval queries against external resources, which in turn causes a large system overhead.

Moreover, neither class of method considers how the confidence of the data dependencies in the data set affects the quality control of the filled data, so the confidence of the filled data may be low.

Summary of the invention

In view of this, the present application provides a data filling method and system based on quality control, so as to obtain a higher data filling accuracy at a lower system overhead and to improve the confidence of the filled data.

To achieve the above object, the technical solutions provided by the embodiments of the present application are as follows:

A data filling method based on quality control, comprising:

determining the missing data of a database according to the existing data in the database, building the data dependencies of the database and determining the dependency confidence of the data dependencies, and repeating the following steps until the missing data of the database have been filled:

determining, according to the existing data in the database and the data dependencies, the deducible data and at least one group of non-deducible data among the missing data of the database, and determining one group of data to be retrieved from the at least one group of non-deducible data according to a preset rule; deducing the deducible data according to the existing data in the database and the data dependencies and calculating a deduction confidence according to the dependency confidence, and filling the deducible data when the deduction confidence is greater than a preset threshold; retrieving the data to be retrieved from an external resource of the database and calculating a retrieval confidence according to the dependency confidence, and filling the data to be retrieved when the retrieval confidence is greater than the preset threshold.

Preferably, determining the deducible data and at least one group of non-deducible data among the missing data of the database according to the existing data in the database and the data dependencies comprises:

determining, from the missing data of the database and according to the existing data in the database and the data dependencies, the missing data that have a data dependency with the existing data in the database, as the deducible data among the missing data of the database;

determining the dependencies among the missing data of the database according to the existing data in the database and the data dependencies;

taking each missing datum of the database as a node and each dependency between missing data as a directed edge between nodes, building a missing-data dependency graph, and determining the at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph.

Preferably, determining the at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph comprises:

performing node merging, in which nodes of the missing-data dependency graph that have identical missing-data dependencies and have no data dependency between one another are merged into a single node;

after node merging, for any node that has multiple directed edges pointing to it from multiple nodes, deleting those directed edges, thereby generating a simplified missing-data dependency graph;

taking, from the simplified missing-data dependency graph, the missing data corresponding to nodes that have only directed edges pointing from themselves to other nodes, and the missing data corresponding to node sets that have no directed edge connecting them with other nodes, as the at least one group of non-deducible data among the missing data of the database, wherein each such node set comprises at least two nodes.

Preferably, determining one group of data to be retrieved from the at least one group of non-deducible data according to the preset rule comprises:

calculating an expected value for each missing datum in the database, the expected value being the probability that each datum in the database becomes missing data;

calculating, from the calculated expected values, an unlocking score for each missing datum in the non-deducible data, the unlocking score being used to assess the extent of the data dependencies between that missing datum and the other missing data in the non-deducible data;

selecting missing data from the non-deducible data in descending order of unlocking score and adding them to a retrieval set, until every missing datum in the non-deducible data either is in the retrieval set or can be deduced from the missing data in the retrieval set, and then taking the missing data in the retrieval set as the data to be retrieved.
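The selection rule above can be sketched in Python as follows. This is only an illustration under stated assumptions: the unlocking scores are taken as given inputs (the text does not give their formula here), each directed edge is assumed to mean that the target can be deduced from the source alone, and candidates already covered by earlier picks are skipped. The names `choose_retrieval_set`, `reachable`, `score` and `unlocks` are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def reachable(start: Iterable[str], unlocks: Dict[str, Set[str]]) -> Set[str]:
    """Return the start cells plus everything deducible from them via `unlocks`."""
    seen = set(start)
    stack = list(seen)
    while stack:
        node = stack.pop()
        for nxt in unlocks.get(node, set()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def choose_retrieval_set(
    group: List[str],
    score: Dict[str, float],
    unlocks: Dict[str, Set[str]],
) -> List[str]:
    """Pick cells in descending score order until the whole group is covered."""
    retrieval: List[str] = []
    covered: Set[str] = set()
    for cell in sorted(group, key=lambda c: score[c], reverse=True):
        if set(group) <= covered:
            break  # every cell is retrieved or deducible from retrieved cells
        if cell in covered:
            continue  # assumption: no need to retrieve a cell that is already covered
        retrieval.append(cell)
        covered = reachable(retrieval, unlocks)
    return retrieval


# Hypothetical group of non-deducible cells with made-up scores and edges.
group = ["O1", "O2", "O3", "O4"]
score = {"O1": 2.0, "O2": 1.0, "O3": 0.5, "O4": 0.1}
unlocks = {"O1": {"O2", "O3"}, "O2": {"O3"}}
chosen = choose_retrieval_set(group, score, unlocks)
```

In this toy input, retrieving `O1` makes `O2` and `O3` deducible, so only `O1` and the isolated `O4` need to be retrieved.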

Preferably, the external resource comprises network resources.

A data filling system based on quality control, comprising:

a construction module, configured to determine the missing data of a database according to the existing data in the database, build the data dependencies of the database and determine the dependency confidence of the data dependencies;

a filling module, configured to repeat the following steps until the missing data of the database have been filled:

Preferably, the filling module comprises:

a first determination module, configured to determine, from the missing data of the database and according to the existing data in the database and the data dependencies, the missing data that have a data dependency with the existing data in the database, as the deducible data among the missing data of the database;

a second determination module, configured to determine the dependencies among the missing data of the database according to the existing data in the database and the data dependencies;

a third determination module, configured to take each missing datum of the database as a node and each dependency between missing data as a directed edge between nodes, build a missing-data dependency graph, and determine the at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph.

Preferably, the third determination module comprises:

a node merging unit, configured to perform node merging, in which nodes of the missing-data dependency graph that have identical missing-data dependencies and have no data dependency between one another are merged into a single node;

a directed-edge pruning unit, configured to, after node merging, delete, for any node that has multiple directed edges pointing to it from multiple nodes, those directed edges, thereby generating a simplified missing-data dependency graph;

a search unit, configured to take, from the simplified missing-data dependency graph, the missing data corresponding to nodes that have only directed edges pointing from themselves to other nodes, and the missing data corresponding to node sets that have no directed edge connecting them with other nodes, as the at least one group of non-deducible data among the missing data of the database, wherein each such node set comprises at least two nodes.

Preferably, the filling module, when determining one group of data to be retrieved from the at least one group of non-deducible data according to the preset rule, is configured to: calculate an expected value for each missing datum in the database, the expected value being the probability that each datum in the database becomes missing data;

Preferably, the external resource comprises network resources.

In the data filling method based on quality control provided above, the missing data of the database are determined according to the existing data in the database, the data dependencies of the database are built and their dependency confidence is determined, and the following steps are repeated until the missing data of the database have been filled: the deducible data and at least one group of non-deducible data among the missing data are determined according to the existing data and the data dependencies, and one group of data to be retrieved is determined from the at least one group of non-deducible data according to a preset rule; the deducible data are deduced according to the existing data and the data dependencies, a deduction confidence is calculated according to the dependency confidence, and the deducible data are filled when the deduction confidence is greater than a preset threshold; the data to be retrieved are retrieved from an external resource of the database, a retrieval confidence is calculated according to the dependency confidence, and the data to be retrieved are filled when the retrieval confidence is greater than the preset threshold. In this way, by executing deduction and retrieval alternately, the missing data in the data set are filled efficiently and with high quality, and a higher data filling accuracy can be obtained at a lower system overhead.

Moreover, the method takes full account of the dependency confidence of the data dependencies when filling data, and calculates from it the deduction confidence of deduced data and the retrieval confidence of retrieved data. Deduced data are filled only when the deduction confidence is greater than the preset threshold, and retrieved data are filled only when the retrieval confidence is greater than the preset threshold. The filled data are therefore subject to good quality control, and their confidence is higher.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present application or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below are only some embodiments recorded in the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic diagram of the sample data table and data dependencies for the data filling method based on quality control provided by the present application;

Fig. 2 is a schematic diagram of the interaction process of the data filling method based on quality control provided by the embodiment of the present application;

Fig. 3 is a schematic diagram of the process of building the simplified missing-data dependency graph in the data filling method based on quality control provided by the embodiment of the present application;

Fig. 4 is a schematic flowchart of one embodiment of the data filling method based on quality control provided by the present application;

Fig. 5 is a schematic flowchart of another embodiment of the data filling method based on quality control provided by the present application;

Fig. 6 to Fig. 10 are charts comparing experimental data for the data filling method based on quality control provided by the present application and for the prior art;

Fig. 11 is a schematic diagram of the selection of the quality control threshold for the data filling method based on quality control provided by the present application;

Fig. 12 is a schematic structural diagram of one embodiment of the data filling system based on quality control provided by the present application;

Fig. 13 is a schematic structural diagram of another embodiment of the data filling system based on quality control provided by the present application.

Detailed description of the embodiments

To help those skilled in the art better understand the technical solutions of the present application, the technical solutions are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.

The embodiments of the present application are described in detail below with reference to the drawings.

Fig. 1 is a schematic diagram of the sample data table and data dependencies for the data filling method based on quality control provided by the present application.

Fig. 4 is a schematic flowchart of one embodiment of the data filling method based on quality control provided by the present application.

Referring to Fig. 4, the data filling method based on quality control provided by the embodiment of the present application comprises:

Step S100: determining the missing data of the database according to the existing data in the database, building the data dependencies of the database and determining the dependency confidence of the data dependencies;

The definitions used by the scheme in the embodiment of the present application are given first:

1. Dependency confidence: for attributes X and Y in a data table, suppose the functional dependency X → Y holds. If some tuples in the table violate this constraint, the functional dependency X → Y is called an approximate functional dependency, and the credibility with which the data in the table satisfy the constraint X → Y is its dependency confidence. The confidence of a deduction rule or of a retrieval query based on this approximate functional dependency is then also this dependency confidence.

2. Deduction confidence: given an approximate functional dependency X → Y, let the tuples T₁ and T₂ take the following values on the attributes X and Y:

T₁[X] = x₁, T₁[Y] = y₁; T₂[X] = x₁, T₂[Y] = □

The value of T₂ on Y is empty, represented here by a square (□). The deduction confidence of the deduced result T₂[Y] = y₁ is the product of the confidence of the deduction rule and the confidences of the values used. Each value used, such as the value x₁ of tuple T₁ on attribute X, carries its own confidence.

3. Retrieval confidence: given an approximate functional dependency X → Y, let the tuple T₁ take the following values on the attributes X and Y:

T₁[X] = x₁, T₁[Y] = □

The retrieval confidence of the retrieved result T₁[Y] = y₁ is defined as the product of the confidence of the retrieval rule and the confidences of the values used.
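As a rough illustration of these definitions, the Python sketch below estimates a dependency confidence from a table and multiplies it with the confidences of the values used. The estimator in `dependency_confidence` (the share of complete tuples that agree with the most frequent Y value for their X value) is an assumption, since the text only says the confidence reflects how well the data satisfy X → Y; the function names and the sample rows are likewise hypothetical.

```python
from math import prod
from typing import Dict, List, Optional, Sequence

Row = Dict[str, Optional[str]]


def dependency_confidence(rows: List[Row], lhs: str, rhs: str) -> float:
    """Assumed estimator: fraction of complete tuples consistent with lhs -> rhs."""
    counts: Dict[str, Dict[str, int]] = {}
    for row in rows:
        x, y = row.get(lhs), row.get(rhs)
        if x is None or y is None:
            continue  # tuples with a missing value cannot confirm or violate the rule
        counts.setdefault(x, {})
        counts[x][y] = counts[x].get(y, 0) + 1
    total = sum(sum(c.values()) for c in counts.values())
    if total == 0:
        return 0.0
    agreeing = sum(max(c.values()) for c in counts.values())
    return agreeing / total


def fill_confidence(rule_confidence: float, value_confidences: Sequence[float]) -> float:
    """Product of the rule confidence and the confidences of the values used."""
    return rule_confidence * prod(value_confidences)


def passes_quality_control(confidence: float, threshold: float) -> bool:
    """A value is filled only when its confidence is greater than the threshold."""
    return confidence > threshold


rows = [
    {"city": "Nanjing", "province": "Jiangsu"},
    {"city": "Nanjing", "province": "Jiangsu"},
    {"city": "Nanjing", "province": "Jiangsu"},
    {"city": "Suzhou", "province": "Jiangsu"},
    {"city": "Nanjing", "province": "Anhui"},  # violates city -> province
]
rule_conf = dependency_confidence(rows, "city", "province")
deduced_conf = fill_confidence(rule_conf, [1.0, 0.95])
```

With these made-up rows, four of the five tuples agree with the dependency, and the deduced value would be rejected at a threshold of 0.8 because the product falls below it.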

In the embodiment of the present application, since the database contains existing data, everything other than the existing data is missing data. The data in one database usually have certain data dependencies among them.

The data dependencies here include dependencies between existing data and missing data, between existing data and existing data, and between missing data and missing data.

Step S200: determining, according to the existing data in the database and the data dependencies, the deducible data and at least one group of non-deducible data among the missing data of the database, and determining one group of data to be retrieved from the at least one group of non-deducible data according to a preset rule;

In the embodiment of the present application, "deducible data" are missing data that can be deduced from existing data according to the data dependencies; a data dependency exists between deducible data and existing data.

For example, an address data set includes the data dependency "the city name determines the province name". If one tuple of this address data set reads "school='Nanjing University'; city='Nanjing'; province='Jiangsu'" and another tuple reads "school='South Airways'; city='Nanjing'; province=''" (that is, the province of the second tuple is missing), the missing province of the second tuple can be deduced to be "Jiangsu" according to the data dependency.
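The address example can be written as a short Python sketch. It assumes rows are dictionaries with `None` marking a missing value and that the dependency holds exactly; `deduce_by_dependency` is a hypothetical helper name, not part of the described method.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def deduce_by_dependency(rows: List[Row], lhs: str, rhs: str) -> int:
    """Fill empty `rhs` cells using the dependency lhs -> rhs; return how many were filled."""
    # Learn the lhs -> rhs mapping from tuples where both values are present.
    known: Dict[str, str] = {}
    for row in rows:
        if row[lhs] is not None and row[rhs] is not None:
            known.setdefault(row[lhs], row[rhs])
    # Fill every tuple whose rhs is missing and whose lhs value has a known mapping.
    filled = 0
    for row in rows:
        if row[rhs] is None and row[lhs] in known:
            row[rhs] = known[row[lhs]]
            filled += 1
    return filled


rows = [
    {"school": "Nanjing University", "city": "Nanjing", "province": "Jiangsu"},
    {"school": "South Airways", "city": "Nanjing", "province": None},
]
filled = deduce_by_dependency(rows, "city", "province")
```

After the call, the second tuple's province is "Jiangsu". A tuple whose city appears nowhere else with a province would stay empty, which is the case the retrieval step is meant to handle.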

In the embodiment of the present application, "non-deducible data" are missing data that cannot be deduced directly from existing data; they have no direct data dependency with existing data.

In addition, as part of the missing data, "non-deducible data" may or may not have data dependencies with other missing data.

When "non-deducible data" have data dependencies with other missing data, other missing data can be deduced from the "non-deducible data" once these have been filled (after filling they become existing data). When "non-deducible data" have no data dependency with other missing data, no other missing data can be deduced from them even after they have been filled.

Step S300: deducing the deducible data according to the existing data in the database and the data dependencies and calculating a deduction confidence according to the dependency confidence, and filling the deducible data when the deduction confidence is greater than a preset threshold; retrieving the data to be retrieved from an external resource of the database and calculating a retrieval confidence according to the dependency confidence, and filling the data to be retrieved when the retrieval confidence is greater than the preset threshold;

In the embodiment of the present application, "deducing and filling the deducible data according to the existing data in the database and the data dependencies" is called the deduction step, and "retrieving the data to be retrieved from the external resource of the database and filling them" is called the retrieval step.

Because "deducible data" are missing data that can be deduced from existing data according to the data dependencies, and a data dependency exists between deducible data and existing data, the "deducible data" can be deduced directly from the existing data and the data dependencies and then filled. Once filled, the "deducible data" become existing data.

Meanwhile, because "non-deducible data" are missing data that cannot be deduced directly from existing data and have no direct data dependency with existing data, these "non-deducible data" are looked up in external resources such as network resources and then filled, which ensures the accuracy of the filled data.

It will be understood that, in the embodiment of the present application, when a single deduction fills all the missing data, the subsequent retrieval step can be omitted, and when there are no deducible data, the retrieval step can be carried out first and the deduction step afterwards. The step numbers in this embodiment do not limit the order in which the method is carried out.

Step S400: judging whether the missing data of the database have all been filled; if not, returning to step S200; if so, ending.
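The loop formed by steps S200 to S400 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the callables `deduce`, `retrieve` and `pick_for_retrieval` are hypothetical stand-ins for the deduction rule, the external lookup and the preset selection rule, and stopping when a round fills nothing is an assumed termination condition.

```python
from typing import Callable, Dict, Iterable, List, Optional, Set, Tuple

Candidate = Optional[Tuple[str, float]]  # (proposed value, confidence), or None


def fill_with_quality_control(
    missing: Iterable[str],
    deduce: Callable[[str, Dict[str, str]], Candidate],
    retrieve: Callable[[str], Candidate],
    pick_for_retrieval: Callable[[Set[str]], Set[str]],
    threshold: float,
) -> Dict[str, str]:
    """Alternate deduction and retrieval, filling only values above the threshold."""
    remaining: Set[str] = set(missing)
    filled: Dict[str, str] = {}
    while remaining:
        progress = False
        # Deduction step: try every remaining cell against what is known so far.
        for cell in sorted(remaining):
            candidate = deduce(cell, filled)
            if candidate is not None and candidate[1] > threshold:
                filled[cell] = candidate[0]
                remaining.discard(cell)
                progress = True
        # Retrieval step: query the external resource only for the selected cells.
        for cell in sorted(pick_for_retrieval(remaining)):
            candidate = retrieve(cell)
            if candidate is not None and candidate[1] > threshold:
                filled[cell] = candidate[0]
                remaining.discard(cell)
                progress = True
        if not progress:
            break  # assumed termination: nothing more can be filled
    return filled


retrieved: List[str] = []  # records which cells triggered an external query


def toy_deduce(cell: str, filled: Dict[str, str]) -> Candidate:
    if cell == "a":
        return ("va", 0.95)  # deducible from existing data
    if cell == "c" and "b" in filled:
        return ("vc", 0.90)  # deducible only once "b" is known
    return None


def toy_retrieve(cell: str) -> Candidate:
    retrieved.append(cell)
    return ("v" + cell, 0.95)


result = fill_with_quality_control(
    ["a", "b", "c"], toy_deduce, toy_retrieve,
    lambda remaining: remaining & {"b"}, threshold=0.8,
)
```

In the toy run, only cell "b" is retrieved externally; "a" is deduced in the first round and "c" in the second, after "b" has become existing data.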

The embodiment of the present application proposes a data filling method based on quality control, in which the missing data of the database are determined according to the existing data in the database, the data dependencies among all the data in the database are built, and the following steps are repeated until the missing data of the database have been filled: the deducible data and at least one group of non-deducible data among the missing data of the database are determined according to the existing data in the database and the data dependencies, and one group of data to be retrieved is determined from the at least one group of non-deducible data according to a preset rule; the deducible data are deduced and filled according to the existing data in the database and the data dependencies; the data to be retrieved are retrieved from the external resource of the database and filled.

The method fills data by using deduction and retrieval alternately:

For example: according to the data dependencies, a first deducible data group and a first to-be-retrieved data group are determined among all the missing data to be filled in the database; the data in the first deducible data group are deduced and filled according to the data dependencies, the data in the first to-be-retrieved data group are retrieved from the external resource of the database and filled, and the first remaining missing data in the database are determined; according to the data dependencies, a second deducible data group and a second to-be-retrieved data group are determined among the first remaining missing data; the data in the second deducible data group are deduced and filled according to the data dependencies, the data in the second to-be-retrieved data group are retrieved from the external resource of the database and filled, and the second remaining missing data in the database are determined; and so on, until all the missing data to be filled in the database have been filled.

That is: a first group of missing data in the database is deduced and filled, and a second group of missing data in the database is retrieved from the external resource of the database and filled; according to the existing data, the first group of missing data and the second group of missing data, a third group of missing data in the database is deduced and filled, and a fourth group of missing data in the database is retrieved from the external resource of the database and filled; and so on, until the missing data to be filled in the database have been filled.

An example follows. The interaction process of the data filling method based on quality control provided by the embodiment of the present application is shown in Fig. 2:

(1) The interaction process of the 0.8-SDI method is as follows. (Note: SDI, Stochastic Data Imputation, is the English abbreviation for interactive imputation with quality control; 0.8 is the quality control threshold, that is, the preset threshold in the embodiment of the present application.)

(2) First deduction step (Fig. 2 (a)): according to the existing data in the table and the dependencies in Fig. 2 (b), the values of T₂[B], T₁[E] and T₁[F] can be deduced as b₁, e₁ and f₁ respectively, with confidences of 0.95, 0.95 and 0.90.

(3) First retrieval step (Fig. 2 (b)): suppose the values of T₃[B] and T₅[B] are retrieved as b₂ and b₃ respectively, with corresponding confidences of 0.95 and 0.95.

(4) Second deduction step: because of the threshold of 0.8, there are no data that can be deduced.

(5) Second retrieval step (Fig. 2 (c)): the values of T₃[C] and T₃[D] are retrieved as c₂ and d₂ respectively, with corresponding confidences of 0.95 and 0.95.

(6) Third deduction step (Fig. 2 (d)): according to the values retrieved the second time and the dependencies of the table, the values of T₄[C] and T₄[D] can be deduced as c₃ and d₃ respectively, with corresponding confidences of 0.95 and 0.95.

(7) Third retrieval step (Fig. 2 (e)): the values of T₄[E] and T₅[E] are both retrieved as e₂, and the corresponding confidences are both 1 (omitted here).

(8) Fourth deduction step (Fig. 2 (f)): according to the values of T₄[E] and T₅[E] obtained in the third retrieval and the dependency between attributes E and F, the values of T₄[F] and T₅[F] can both be deduced as f₂, and the corresponding confidences are both 1 (omitted here). At this point all the missing values have been filled.

After a deduction step has filled as many deducible missing data as possible, the following retrieval step retrieves a series of non-deducible missing data, so that some of the remaining missing data become deducible in the next deduction step. These two steps are repeated until a termination condition occurs, for example when no missing data can be filled any more, at which point the filling of the missing data ends.

Filling data by alternating the deduction step and the retrieval step keeps the system overhead low while keeping the data filling accuracy high: the missing data in the data set are filled efficiently and with high quality, and a higher data filling accuracy is obtained at a lower system overhead. The data filling method based on quality control provided by the embodiment of the present application can therefore determine a preferred scheme for data filling, and with this scheme a very high filling precision and recall can be reached at the minimum filling cost (system overhead).

Fig. 3 is a schematic diagram of the process of building the simplified missing-data dependency graph in the data filling method based on quality control provided by the embodiment of the present application.

Fig. 5 is a schematic flowchart of another embodiment of the data filling method based on quality control provided by the present application.

Referring to Fig. 5, in the data filling method based on quality control provided by the embodiment of the present application, determining the deducible data and at least one group of non-deducible data among the missing data of the database according to the existing data in the database and the data dependencies in step S200 comprises:

Step S201: determining, from the missing data of the database and according to the existing data in the database and the data dependencies, the missing data that have a data dependency with the existing data in the database, as the deducible data among the missing data of the database;

Step S202: determining the dependencies among the missing data of the database according to the existing data in the database and the data dependencies;

Step S203: taking each missing datum of the database as a node and each dependency between missing data as a directed edge between nodes, building a missing-data dependency graph, and determining the at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph.

During filling, the key to the TRIP method is to select the smallest possible amount of missing data to retrieve in the retrieval step, so that system overhead is minimized and an optimal scheduling scheme is obtained.

The algorithm for obtaining the optimal scheduling scheme is as follows:

Building the missing-data dependency graph: Fig. 3 (a), (b) and (c) show the building process.

Step 1: take every missing value that has not yet been filled as a node in the missing-data dependency graph, as shown in Fig. 3 (a).

Step 2: take every possible data dependency relationship between missing values as a directed edge between nodes. This yields the missing-data dependency graph, as shown in Fig. 3 (b).
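Steps 1 and 2 can be sketched as follows. This is only an illustration under assumptions: the function name `build_dependency_graph`, the cell identifiers, and the representation of a dependency as a `(source, target)` pair are hypothetical and do not come from the application.

```python
def build_dependency_graph(missing_cells, dependencies):
    """Build a directed graph over missing cells.

    missing_cells: iterable of hashable cell identifiers (assumed format).
    dependencies: iterable of (source, target) pairs, meaning that the
        target could be inferred once the source is known (assumed format).
    Returns a dict mapping each missing cell to the set of cells it points to.
    """
    missing = set(missing_cells)
    # Step 1: every unfilled missing cell becomes a node.
    graph = {cell: set() for cell in missing}
    # Step 2: every dependency between two missing cells becomes a directed edge.
    for source, target in dependencies:
        if source in missing and target in missing and source != target:
            graph[source].add(target)
    return graph
```

Dependencies whose source is an existing (non-missing) value are dropped here, because such targets are already deducible and do not need a node-to-node edge.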

In the embodiment of the present application, to determine at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph, the graph that has been built is first simplified, and the simplified missing-data dependency graph is then used to determine the at least one group of non-deducible data. The simplification process comprises:

Step 3: simplification of the missing-data dependency graph:

(1) Node merging: if several nodes have identical data dependencies and no data dependency relationship exists among these nodes, they are merged into one node. As shown in Fig. 3 (c), O₅ and O₆ are merged into one node, as are O₇ and O₈.

(2) Edge pruning: in the missing-data dependency graph after node merging, if there is a dependency in which several nodes must all be satisfied at the same time before another node can be inferred, the edges of such a dependency are pruned. As shown in Fig. 3 (b), the three nodes O₄, O₅, O₆ must all be satisfied before O₉ can be inferred, and these three nodes together can also infer O₇ and O₈, and O₁₁ and O₁₂. In this case the edges from O₄, O₅, O₆ to O₉ are pruned; similarly, the edges pointing to O₇ and O₈, and to O₁₁ and O₁₂, are also pruned.

This finally yields the simplified missing-data dependency graph shown in Fig. 3 (c).
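The node-merging rule can be sketched as below. The sketch assumes that "identical data dependencies" means the same set of predecessors and the same set of successors, and that the graph has no self-loops; both are interpretations, and `merge_equivalent_nodes` is a hypothetical name. Edge pruning is not covered.

```python
def merge_equivalent_nodes(graph):
    """Merge nodes that share the same predecessors and successors.

    graph: dict mapping node -> set of successor nodes (no self-loops assumed).
    Returns a dict whose keys are frozensets of merged original nodes.
    """
    # Collect predecessors of every node.
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)
    # Group nodes by their (predecessors, successors) signature. Without
    # self-loops, two nodes with the same signature cannot point at each other.
    groups = {}
    for node in graph:
        key = (frozenset(preds[node]), frozenset(graph[node]))
        groups.setdefault(key, set()).add(node)
    owner = {node: frozenset(group) for group in groups.values() for node in group}
    # Rebuild the edges between merged nodes.
    merged = {group: set() for group in set(owner.values())}
    for node, succs in graph.items():
        for succ in succs:
            merged[owner[node]].add(owner[succ])
    return merged
```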

Determining the data to be retrieved: data to be retrieved is always data that cannot be inferred, and falls into two classes:

(1) Fig. 3 (d) shows the simplified missing-data dependency graph for the second retrieval step. For a merged node such as O₅, O₆, the figure clearly shows that no edge from any other node points to it, so O₅ and O₆ must be retrieved.

(2) A node set into which no directed edge points from an outside node to an inside node. In other words, the node set is trapped in an inference deadlock: the nodes in the set cannot be inferred from nodes outside the deadlock, so they can be regarded as non-deducible and are therefore points to be retrieved. As shown in Fig. 3 (c), O₄ and O₅, O₆ form a deadlock, so either O₄ or O₅, O₆ can be retrieved. To ensure the minimum cost, that is, the fewest retrievals, O₄ is selected for retrieval. Likewise, of O₇, O₈ and O₁₁, O₁₁ is selected for retrieval.
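Both classes can be read as components of the dependency graph that receive no edge from outside: a single node with no incoming edge, or a deadlock (a cycle of mutually dependent nodes). The following sketch finds such components with a simple reachability computation. It is an illustration only; `source_components` is a hypothetical name, and the quadratic approach is chosen for brevity rather than efficiency.

```python
def reachable(graph, start):
    """Return all nodes reachable from start by following directed edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def source_components(graph):
    """Return strongly connected components with no incoming edge from outside.

    graph: dict mapping node -> set of successor nodes.
    A component of size 1 is a node nobody points to (class 1);
    a larger component is a deadlock (class 2).
    """
    reach = {node: reachable(graph, node) for node in graph}
    components = set()
    for node in graph:
        # Nodes that can reach each other belong to the same component.
        comp = frozenset({node} | {m for m in reach[node] if node in reach[m]})
        components.add(comp)
    sources = []
    for comp in components:
        has_outside_pred = any(
            t in comp for s in graph if s not in comp for t in graph[s]
        )
        if not has_outside_pred:
            sources.append(comp)
    return sources
```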

In the embodiment of the present application, determining one group of to-be-retrieved data from the at least one group of non-deducible data according to a preset rule comprises using a greedy algorithm to determine the optimal retrieval scheme:

For ease of understanding, we first assume that all missing values are known in advance and give our optimal solution. We then extend this to the real situation, in which missing values are not known in advance, and give a near-optimal solution.

1. Determining a near-optimal unlocking subset

An unlock score is defined for each node in the dependency graph, and the node with the largest unlock score is greedily selected into the unlock set.

● Unlocking a single deadlock

To achieve an optimal unlocking scheme, the greedy algorithm always prefers the node that brings the fewest retrievals and the most inferences, uses it to unlock the deadlock D in which it lies, and adds it to the unlock set. We define an unlock score for each node to assess the contribution of a node A to breaking the deadlock mentioned above. The unlock score of A can be defined as:

S_{unlock} (A | D) = | Infer (A, D, τ) | - | A |

where Infer (A, D, τ) denotes the set of all data in deadlock D that can be filled after A is retrieved with threshold τ, and |·| returns the size of a set. At the start, the unlock score of each node is computed, and the node with the largest unlock score is added to the retrieval set. The unlock scores of the remaining nodes are then updated, and the node with the largest unlock score is chosen again by the same rule, until every value in deadlock D is either in the retrieval set or can be inferred from the values in the retrieval set.
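The greedy selection can be sketched as follows. Several simplifications are assumptions rather than the method of the application: edges whose confidence falls below τ are taken to have been removed already, a node counts as inferable as soon as any one predecessor is known, and Infer is taken to include the retrieved nodes themselves. The names `infer`, `unlock_score` and `greedy_unlock` are hypothetical.

```python
def infer(retrieved, deadlock_graph):
    """Return every node that is filled once `retrieved` has been retrieved.

    deadlock_graph: dict node -> set of successors, restricted to one deadlock.
    """
    known = set(retrieved)
    frontier = list(retrieved)
    while frontier:
        node = frontier.pop()
        for nxt in deadlock_graph[node]:
            if nxt not in known:
                known.add(nxt)
                frontier.append(nxt)
    return known


def unlock_score(candidate, deadlock_graph):
    """S_unlock(A|D) = |Infer(A, D, tau)| - |A| under the assumptions above."""
    return len(infer(candidate, deadlock_graph)) - len(candidate)


def greedy_unlock(deadlock_graph):
    """Greedily add the node with the best score until the deadlock is covered."""
    unlock = set()
    all_nodes = set(deadlock_graph)
    while infer(unlock, deadlock_graph) != all_nodes:
        # Sorting makes tie-breaking deterministic (first maximum wins).
        remaining = sorted(all_nodes - infer(unlock, deadlock_graph))
        best = max(remaining, key=lambda n: unlock_score(unlock | {n}, deadlock_graph))
        unlock.add(best)
    return unlock
```

Re-evaluating the score of `unlock | {n}` on every round plays the role of "updating the unlock scores of the remaining nodes" in the description.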

● Unlocking multiple deadlocks that share nodes

The greedy algorithm above can be extended to unlock multiple deadlocks that share nodes. Let D be the set of all values belonging to these deadlocks. We compute the unlock score of each node in D and add the node with the largest score to the unlock set U_d, then repeatedly update the unlock scores of the remaining nodes and select a new node, until every node in D is either in U_d or can be inferred from the values in U_d.

2. Determining the optimal expected unlock set

Determining a near-optimal unlock set from unlock scores, as described above, rests on an important assumption: that missing values are known in advance. In the real world, however, missing values are not known in advance. We therefore compute, for each missing value, its expected values together with their probabilities, and use them to compute an expected unlock score for each candidate. Building on the greedy algorithm above, we propose a variant greedy algorithm that selects the subset with the largest expected unlock score for retrieval. Below we first describe how to estimate the expected values of a missing value and their probabilities, and then show how to compute the expected unlock score.

● Estimating expected values and their probabilities

If there is no dependency between the attributes in the table, the probability that an element y on attribute Y in table R is the value of a missing entry on that attribute is given by:

P_{E} (Y = y) = Percent (Y = y | R)

where Percent (Y = y | R) denotes the percentage of tuples whose value on attribute Y is y.
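As an illustration of the dependency-free case, the sketch below computes Percent (Y = y | R) as an empirical frequency. It assumes a table is a list of dictionaries, that `None` marks a missing value, and that the percentage is taken over the non-missing values only; the function name `value_distribution` is hypothetical.

```python
from collections import Counter


def value_distribution(rows, attribute):
    """Return {value: share of non-missing rows holding that value}."""
    values = [row[attribute] for row in rows if row.get(attribute) is not None]
    counts = Counter(values)
    total = len(values)
    # An empty table yields an empty distribution (no division is performed).
    return {value: count / total for value, count in counts.items()}
```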

If dependency relationships X_i → Y (i = 1, 2, 3, ..., k) exist between the attributes in the table, these dependencies should be taken into account. We use a Boolean vector π of length k to express the relation between another tuple T' in the table and T on X_i (i = 1, 2, 3, ..., k). We specify that if the value a_i' of tuple T' on attribute X_i equals a_i, then π[i] = true; otherwise π[i] = false.

We define the weight w (π) with which the value y' of a tuple T' on Y may become the missing value in the current column as:

where T (π) denotes the set of all i with π[i] = true, and F (π) denotes the set of all i with π[i] = false. Finally, the probability of T [Y] = y can be defined by the following formula:

P_{E} (T [Y] = y) = \sum_{π \in Θ} w (π) * Percent (Y = y | R [π])

where Θ is the set of all possible π, R [π] is the set of tuples in R that stand in relation π to tuple T, and Percent (Y = y | R [π]) denotes the percentage of tuples in R [π] with Y = y.
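The weighted sum over π can be sketched as follows. Because the weight formula itself is not reproduced in this text, the sketch takes the weight as a caller-supplied function `weight(pi)`. That parameter, the list-of-dictionaries table format, and the names `match_vector` and `expected_value_probabilities` are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def match_vector(target, other, determinants):
    """Boolean vector pi: does `other` agree with `target` on each X_i?"""
    return tuple(other.get(x) == target.get(x) for x in determinants)


def expected_value_probabilities(target, rows, determinants, attribute, weight):
    """Return {y: sum over pi of weight(pi) * Percent(Y = y | R[pi])}.

    weight: caller-supplied function mapping a Boolean tuple pi to w(pi).
    """
    # Partition the tuples with a known Y value by their relation pi to target.
    groups = defaultdict(list)
    for row in rows:
        if row.get(attribute) is not None:
            groups[match_vector(target, row, determinants)].append(row[attribute])
    probabilities = defaultdict(float)
    for pi, values in groups.items():
        for value, count in Counter(values).items():
            probabilities[value] += weight(pi) * count / len(values)
    return dict(probabilities)
```

Only the π vectors that actually occur in the table contribute, since Percent (Y = y | R [π]) is undefined for an empty R [π].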

● Estimating the expected unlock score

Based on the expected values computed for each missing value, we compute the unlock score of each unlocking subset in a deadlock using the following formula:

S_{unlock-E} (A | D) = \sum_{θ} P_{E} (θ) * | Infer (A, D, θ, τ) | - | A |

Here P_{E} (θ) is the estimated probability that each y_i is adopted as the fill for the i-th missing value B_i. θ = [y₁, y₂, ...] ranges over all possible ways of filling the B_i with values y_i, and Infer (A, D, θ, τ) denotes, for a given threshold τ and with the missing values in deadlock D taking the values in θ, the subset of all missing data that can be inferred by retrieving A.
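The expected unlock score is a probability-weighted average of inference counts, which the short sketch below makes explicit. The representation of the scenarios as `(theta, probability)` pairs and the callback `infer_size` are assumptions made for illustration.

```python
def expected_unlock_score(candidate, scenarios, infer_size):
    """Sum over theta of P_E(theta) * |Infer(A, D, theta, tau)|, minus |A|.

    candidate: the set A of nodes considered for retrieval.
    scenarios: iterable of (theta, probability) pairs (assumed format).
    infer_size: function (candidate, theta) -> number of values filled.
    """
    expected = sum(prob * infer_size(candidate, theta) for theta, prob in scenarios)
    return expected - len(candidate)
```

For example, if retrieving one node fills 4 values with probability 0.75 and 2 values with probability 0.25, the score is 0.75 * 4 + 0.25 * 2 - 1 = 2.5.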

The algorithm for the optimal expected scheme under τ-SDI provided by the embodiment of the present application is then as follows:

Retrieval and inference are carried out alternately. At each retrieval step, we build an inference dependency graph with threshold τ from the missing values. All values that cannot be inferred are put directly into R_i, and all deadlocks are put into the deadlock set. For each deadlock, we select the unlocking subset with the highest unlock score and put it into R_i. The values in R_i are then retrieved, and the next inference step follows, until no missing value remains to be filled.

Algorithm: finding the optimal expected scheme under τ-SDI

Input: an incomplete table whose set of missing values is Ο

Output: a data filling scheme S = <I₀, R₁, I₁, ..., R_n, I_n>

Let i = 0

do

1. I_i ← all values that can currently be inferred

2. Infer all values in I_i

3. i++

4. Build an inference dependency graph with threshold τ

5. R_i ← the values that cannot be inferred under the restriction τ

6. Foreach set D of deadlocks sharing nodes do

Compute the unlock score S_{unlock-E} of each unlocking subset of deadlock D

R_i ← R_i ∪ the unlocking subset with the largest unlock score S_{unlock-E}

7. Retrieve all missing values in R_i

while missing values remain to be filled

Return <I₀, R₁, I₁, ..., R_n, I_n>
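The alternating loop can be sketched in a simplified form as below. The graph construction and unlock scoring of steps 4 to 6 are abstracted behind two caller-supplied callbacks, `inferable_now` and `choose_retrieval`. These callbacks and the name `plan_filling` are hypothetical, so this is a sketch of the control flow only, not of the full algorithm.

```python
def plan_filling(missing, inferable_now, choose_retrieval):
    """Alternate inference and retrieval until nothing is missing.

    inferable_now(cell, remaining) -> True if cell can be inferred given
        that only the cells in `remaining` are still unknown.
    choose_retrieval(remaining) -> cells to retrieve in this round.
    Returns a list of ("infer", set) and ("retrieve", set) steps.
    """
    remaining = set(missing)
    scheme = []
    while True:
        # Inference step: keep inferring until no further value is reachable.
        inferred = set()
        while True:
            step = {c for c in remaining if inferable_now(c, remaining)}
            if not step:
                break
            inferred |= step
            remaining -= step
        scheme.append(("infer", inferred))
        if not remaining:
            return scheme
        # Retrieval step: must make progress, otherwise the loop cannot end.
        chosen = set(choose_retrieval(remaining)) & remaining
        if not chosen:
            raise ValueError("retrieval step selected nothing to retrieve")
        remaining -= chosen
        scheme.append(("retrieve", chosen))
```

The returned list corresponds to the scheme <I₀, R₁, I₁, ..., R_n, I_n>, with I₀ possibly empty when nothing can be inferred before the first retrieval.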

In this way, while the accuracy of the filled data is guaranteed, the amount of data that needs to be retrieved is minimized. Massive retrieval queries against external resources are avoided, retrieval query operations are reduced as far as possible, and system overhead is lowered.

Retrieving a small amount of missing data can greatly improve the filling recall of inference-based methods. To obtain the highest recall at minimal overhead, retrieval operations should be used as little as possible and inference operations as much as possible.

The experimental results of the present invention are described below:

I. Experimental environment

Operating system: Mac OS X

Processor: 4-core Intel Core i5

Memory: 8GB

Programming language: Java

II. Data sets

Four data sets were selected: two are real-life data sets and the other two are synthetic data sets.

(1) Personal information table (PersonInfo): this table contains 50,000 tuples, each with 9 attributes: name, email, title, university, street, city, state, country and mailing address. The information was collected from 1000 different universities in the United States, the United Kingdom, Canada and Australia.

(2) DBLP publication information table (DBLP): this table contains 100,000 tuples, each with 5 attributes: the title of the published paper, first author, conference name, time and venue. All paper information in the table was randomly selected from DBLP.

(3) Synthetic table Ι (Syn-Ι): we synthesized a table of 100,000 tuples with 100 attributes per tuple, containing 1000 randomly generated attribute dependencies. The confidence of each dependency is 1.0, and the first-column attribute of the table is the key attribute.

(4) Synthetic table Π (Syn-Π): the same scale as synthetic table Ι (Syn-Ι), and the first column is again the key attribute. The difference is that the confidence of each attribute dependency is a random number between 0 and 1.

All of the above tables are complete relational tables. To produce the incomplete tables needed for the experiments, we randomly removed values from the complete tables while ensuring that each tuple retained at least one key attribute. That is, for PersonInfo at least one of name and email is retained; for DBLP the paper title is retained; and for the two synthetic tables the first-column attribute is retained.

III. Experimental method

For each missing ratio (1%, 5%, 10%, 20%, 30%, 40%, 50%, 60%), we used different random seeds to create 5 incomplete tables, and the experimental results below are averages over the 5 runs. Note that for the synthetic tables we do not retrieve data from the Internet; instead, we retrieve from the original table and record the number of retrievals.

IV. Comparative analysis of experimental results

On the real data, we compare against the current state-of-the-art inference-based and retrieval-based imputation methods.

(1) Inference-based methods (Inferring-based):

InferRules: infers missing values according to the attribute dependencies of the complete part of the table.

GKNN: adopts a state-of-the-art technique for imputing missing quantitative data. It mainly computes the distance between the missing value and the training data and then selects the k nearest neighbors (here we choose k = 1).

(2) Retrieval-based methods (Retrieving-based):

WebPut: a general retrieval method that retrieves missing values from various data sets.

InfoGather: this method adopts state-of-the-art techniques and can retrieve missing values from web lists and tables.

● Accuracy

The proposed TRIP method is compared with the methods mentioned above in terms of accuracy on the PersonInfo and DBLP data sets. Three aspects are compared: (1) Precision: the proportion of filled values that are filled correctly; (2) Recall: the proportion of all missing values that are filled correctly; (3) F1: a combined measure of precision and recall, computed as 2*precision*recall/(precision+recall).
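The three measures can be computed as in the following sketch. It assumes the filled values and the ground truth are held in dictionaries keyed by cell, with `truth` covering every missing cell; the name `evaluate_filling` is hypothetical.

```python
def evaluate_filling(filled, truth):
    """Return (precision, recall, f1) for a set of filled values.

    filled: dict cell -> value that the method filled in.
    truth: dict cell -> true value, for every missing cell.
    """
    correct = sum(
        1 for cell, value in filled.items() if cell in truth and truth[cell] == value
    )
    precision = correct / len(filled) if filled else 0.0
    recall = correct / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A method that fills only 2 of 4 missing cells, both correctly, has precision 1.0 but recall 0.5, which is the pattern reported for inference-only methods.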

Fig. 6 and Fig. 7 compare the accuracy of the TRIP method with that of the 4 existing imputation methods on PersonInfo and DBLP, respectively. From Fig. 6 and Fig. 7 it can be observed that the precision of the InferRules method on the filled data is very high, at about 90%, but its recall is very low. The precision of the GKNN method is 60% to 70%, which is not very high, because GKNN is mainly aimed at imputing quantitative data whereas the data sets in our experiments are all non-quantitative. The precision and recall of the two retrieval-based methods, InfoGather and WebPut, are clearly higher than those of the inference-based methods InferRules and GKNN, and WebPut has the higher recall. The TRIP method reaches comparatively high precision and recall.

Fig. 8 shows how the F1 measure of the 5 methods changes as the missing ratio varies from 1% to 60%. As can be observed from the figure, WebPut and TRIP are clearly higher than the other methods, and TRIP is only slightly lower than WebPut.

Therefore, from the experimental results shown in Fig. 6, Fig. 7 and Fig. 8, we can clearly conclude that TRIP achieves very high precision and recall in data filling.

● Cost

On the PersonInfo and DBLP data sets respectively, the cost of TRIP is compared with that of a purely retrieval-based method (WebPut) and a purely inference-based method (InferRules), mainly in 2 aspects: (1) Time cost: the exact time spent on one filling run; (2) Queries (#Queries): the number of queries issued (τ = 0.7).

Fig. 9 compares the time cost of the TRIP method, the retrieval-based method (Retrieving-based) and the inference-based method (Inferring-based) on the PersonInfo and DBLP data sets for missing ratios from 1% to 60%. As can be seen from the figure, the time cost of the inference-based method is very low and that of the retrieval-based method is very high, while TRIP is nearly 10 times as time-efficient as the retrieval-based method.

Figure 10 compares the number of queries issued by the TRIP method and by the retrieval-based method (Retrieving-based) on the PersonInfo and DBLP data sets for missing ratios from 1% to 60%. The figure clearly shows that the TRIP method issues far fewer retrieval queries than the retrieval-based method.

Therefore, from the experimental results shown in Fig. 9 and Figure 10, we can clearly conclude that TRIP has a large advantage in both time cost and number of queries.

● Selection of τ

On the data set Syn-Π, we set the missing ratio to 0.4 to assess the optimal expected scheme under different thresholds τ.

As shown in Figure 11, as τ rises from 0 to 0.7, F1 rises from 0.4 to 0.9, but as τ rises from 0.7 to 1, F1 drops quickly. This is because a missing value can obtain values of different confidence through different attribute dependencies and different inference paths. A threshold that is either too strict or too loose therefore cannot produce good filled values. Correspondingly, as τ rises from 0 to 0.7, the filling cost increases from 36*10⁵ to a maximum of 62*10⁵, because a higher τ leaves more values that cannot be inferred, so more values need to be retrieved. But as τ rises from 0.7 to 1, the cost drops quickly to 26*10⁵, because the usable retrieval rules and inference rules become fewer and fewer as τ increases.

For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously.

The above discloses a data filling method based on quality control according to the present invention. Correspondingly, the present invention also discloses a data filling system based on quality control that applies the above data filling method based on quality control.

Figure 12 is a schematic structural diagram of an embodiment of the data filling system based on quality control provided by the present application.

Referring to Figure 12, a data filling system based on quality control provided by the embodiment of the present application comprises:

Building module 1, configured to determine the missing data of the database according to the existing data in the database, build the data dependency relationship of the database, and determine the dependency confidence level of the data dependency relationship;

Filling module 2, configured to repeat the following steps until the missing data of the database has been completely filled:

In the embodiment of the present application, referring to Figure 12, the filling module 2 comprises:

First determination module 21, configured to determine, from the missing data of the database and according to the existing data in the database and the data dependency relationship, the missing data that has a data dependency relationship with the existing data in the database, and to take it as the deducible data among the missing data of the database;

Second determination module 22, configured to determine the dependency relationships among the missing data of the database according to the existing data in the database and the data dependency relationship;

Third determination module 23, configured to take each item of missing data of the database as a node and the dependency relationships between items of missing data as directed edges between nodes, build a missing-data dependency graph, and determine at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph.

Wherein, the third determination module 23 comprises:

The filling module 2, in determining one group of to-be-retrieved data from the at least one group of non-deducible data according to the preset rule, is configured to: calculate the expected value of each item of missing data in the database, the expected value being the probability that each datum in the database becomes the missing data;

The external resource comprises Internet resources.

It should be noted that the data filling system based on quality control of this embodiment may adopt the data filling method based on quality control of the above method embodiments and may be used to realize all the technical solutions of those embodiments. The function of each of its functional modules may be implemented according to the method in the above method embodiments; for the specific implementation process, refer to the related description in the above embodiments, which is not repeated here.

It should be noted that the embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Since the device embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the description of the method embodiments.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or in software depends on the particular application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered to exceed the scope of the present invention.

The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

Finally, it should also be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the statement "comprising a ..." does not exclude the presence of other identical elements in the process, method, article or device that comprises the element.

The solution provided by the present invention has been described in detail above. Specific examples have been used herein to explain the principle and implementation of the present invention, and the description of the above embodiments is only intended to help in understanding the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and scope of application in accordance with the idea of the present invention. In summary, the content of this description should not be construed as limiting the present invention.

Claims

1. A data filling method based on quality control, characterized by comprising:

2. The method according to claim 1, characterized in that determining the deducible data and at least one group of non-deducible data among the missing data of the database according to the existing data in the database and the data dependency relationship comprises:

3. The method according to claim 2, characterized in that determining at least one group of non-deducible data among the missing data of the database according to the missing-data dependency graph comprises:

4. The method according to claim 1, characterized in that determining one group of to-be-retrieved data from the at least one group of non-deducible data according to the preset rule comprises:

5. The method according to claim 1, characterized in that the external resource comprises Internet resources.

6. A data filling system based on quality control, characterized by comprising:

7. The system according to claim 6, characterized in that the filling module comprises:

8. The system according to claim 7, characterized in that the third determination module comprises:

9. The system according to claim 6, characterized in that the filling module, in determining one group of to-be-retrieved data from the at least one group of non-deducible data according to the preset rule, is configured to: calculate the expected value of each item of missing data in the database, the expected value being the probability that each datum in the database becomes the missing data;

10. The system according to claim 6, characterized in that the external resource comprises Internet resources.