A data correspondence method based on first-order logic and neural networks
Technical field
The invention belongs to the field of data migration and data integration, and specifically relates to a data correspondence method based on first-order logic and neural networks that offers high matching efficiency and accuracy.
Background technology
With the continuous development of network and database technology, both the variety and the volume of data keep growing, so sharing heterogeneous data and converting between heterogeneous formats have become problems in urgent need of a solution. Fields such as the Semantic Web, data warehousing, P2P databases, schema integration and e-commerce have all studied the sharing and mutual conversion of heterogeneous data in depth. Schema matching, as the first step toward sharing heterogeneous data, plays an irreplaceable role in the overall data processing flow. At present, the conversion of heterogeneous data is mostly carried out by hand. This requires the operator to be thoroughly familiar with the database information, the schema structure and the semantics of the elements in each schema before the data can be shared and converted, so processing data from heterogeneous systems is a rather complicated process. As business complexity and system complexity keep increasing, the data a system needs become far more complex than existing data. Under these circumstances, completing heterogeneous data integration purely by hand is clearly too difficult, and the demand for automated integration of heterogeneous data is increasingly urgent.
To date, research on data correspondence methods has produced a number of results. SemInt (A Tool for Identifying Attribute Correspondences in Heterogeneous Databases Using Neural Networks), developed at Northwestern University in 2000, is a schema matching system that applies hybrid matching techniques; it mainly uses neural networks to determine the set of match candidates and establishes a mapping between individual attributes of two schemas, with a match cardinality of 1:1. Cupid (Generic Schema Matching with Cupid), presented at the VLDB (Very Large Data Bases) conference in 2001, is a generic hybrid matching method that combines a name matcher with a structural matching algorithm; the structural algorithm derives the similarity of attributes from the similarity of their components (mainly the attribute name and the data structure of the attribute). COMA (A System for Flexible Combination of Schema Matching Approaches), presented at the VLDB conference in 2002, is a composite schema matching method; it provides a library holding a number of different matchers and supports several ways of combining match results. SF (Similarity Flooding: A Versatile Graph Matching Algorithm), presented at the ICDE (International Conference on Data Engineering) conference in 2002, is a matching method based on the structural similarity of schemas. iMap (Discovering Complex Semantic Matches between Database Schemas), presented at the SIGMOD (Special Interest Group on Management of Data) conference in 2004, is a hybrid matching method based on schema information and instance information. The duplicate-based schema matching method presented at the ICDE conference in 2005 mainly uses the overlapping data that exist in the data sets of the schemas being matched to indicate the matching relationships between schemas, and is an instance-based schema matching technique. SMDD (Schema Mapping Method based on Data Distribution), proposed by the National University of Defense Technology at the NDBC (National Database Conference) in 2005, is a schema matching method based on features obtained by analysing data instances. In 2009-2010, Li Guohui et al. proposed a structure matching method based on functional dependencies and a structure matching method based on partial functional dependencies.
Although the methods above solve some of the problems in schema matching, they are not complete, and they make no use of historical matching information. When a data correspondence operation is performed again, schemas whose matching rules are already known must still be matched anew by the matching algorithm, which both wastes time and affects matching accuracy. The present invention makes full use of the knowledge gained from historical matches: it uses first-order logic and a neural network to train on the information in schemas that have already been matched, and thereby completes the whole data correspondence process.
Summary of the invention
The object of the invention is to provide a data correspondence method based on first-order logic and neural networks that has a shorter matching time and a higher accuracy.
The object of the invention is achieved as follows:
The invention comprises the following steps:
(1) analyse the schemas for which data matching has already been completed, and establish the formal representation of the tables and fields of the schema to be matched;
(2) convert the schemas selected for training into table vectors and store them in the table training set, each vector comprising the table name, the positive sample data, the negative sample data and the predicate set;
(3) use the table feature extraction algorithm based on first-order logic to extract features from the tables in the set;
(4) store the extracted table features;
(5) use the extracted table features to match the tables in the schema to be matched;
(6) train on the fields of the matched schemas with the generation-feedback neural network algorithm, revising the field representation and the constructed neural network;
(7) use the trained neural network and the revised field representation to perform field matching on the tables that have been matched.
A table is formalized as a six-tuple:
$T = (N, N_e, K, K_e, S_c, D)$,
where $N$ is the table name, $N_e$ is the Chinese explanation of the table name, $K$ is the primary key, $K_e$ is the Chinese meaning of the primary key, $S_c$ is the set of names and Chinese meanings of all fields other than the primary key, and $D$ is the amount of data currently in the table;
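The six-tuple can be pictured as a small record type. The following is a minimal sketch only; the class name `TableDescriptor`, its attribute names and the example values are illustrative assumptions and do not appear in the original description.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TableDescriptor:
    """Six-tuple T = (N, N_e, K, K_e, S_c, D) describing one table."""
    name: str                    # N: table name
    name_meaning: str            # N_e: Chinese explanation of the table name
    primary_key: str             # K: primary key
    primary_key_meaning: str     # K_e: Chinese meaning of the primary key
    # S_c: name -> Chinese meaning for every field except the primary key
    fields: Dict[str, str] = field(default_factory=dict)
    row_count: int = 0           # D: amount of data currently in the table

    def as_tuple(self):
        """Return the components in the order of the six-tuple."""
        return (self.name, self.name_meaning, self.primary_key,
                self.primary_key_meaning, self.fields, self.row_count)


# Hypothetical example table, invented for illustration.
example = TableDescriptor(
    name="resident",
    name_meaning="居民",
    primary_key="resident_id",
    primary_key_meaning="居民编号",
    fields={"birth_date": "出生日期"},
    row_count=1200,
)
```

Keeping $S_c$ as a mapping from field name to meaning is one possible choice; a set of (name, meaning) pairs would serve equally well.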
Each field in a table is described by a sixteen-tuple:
$T_{Attribute} = (D_L, L_E, P_R, C_T, N_T, D_T, P_K, F_K, N_U, C_V, D_F, Max, Min, Ave, Var, StaDev)$,
where $D_L$ is the length of the field name, $L_E$ is the length of the data, $P_R$ is the precision of the data, $C_T$ indicates a character type, $N_T$ indicates a numeric type, $D_T$ indicates a date type, $P_K$ indicates a primary key, $F_K$ indicates a foreign key, $N_U$ indicates whether the field may be null, $C_V$ is the unique constraint, $D_F$ is the default value, $Max$ is the maximum of the data, $Min$ is the minimum of the data, $Ave$ is the mean of the data, $Var$ is the variance of the data, and $StaDev$ is the standard deviation of the data.
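As an illustration of how such a sixteen-tuple might be assembled into a numeric vector for a neural network, consider the sketch below. The function name `field_vector`, the 0/1 encoding of the type and constraint flags, and the use of population variance are assumptions made for this example; the description does not specify them.

```python
import statistics


def field_vector(name, length, precision, kind, is_pk, is_fk,
                 nullable, unique, has_default, values):
    """Return the 16 components in the order of the sixteen-tuple.

    `kind` is one of "char", "number" or "date" (assumed encoding);
    `values` are numeric sample data for the field.
    """
    values = list(values)
    mean = statistics.fmean(values) if values else 0.0
    var = statistics.pvariance(values) if values else 0.0
    return [
        float(len(name)),                          # D_L
        float(length),                             # L_E
        float(precision),                          # P_R
        float(kind == "char"),                     # C_T
        float(kind == "number"),                   # N_T
        float(kind == "date"),                     # D_T
        float(is_pk),                              # P_K
        float(is_fk),                              # F_K
        float(nullable),                           # N_U
        float(unique),                             # C_V
        float(has_default),                        # D_F
        float(max(values)) if values else 0.0,     # Max
        float(min(values)) if values else 0.0,     # Min
        mean,                                      # Ave
        var,                                       # Var
        var ** 0.5,                                # StaDev
    ]


# Hypothetical numeric field "age" with three sample values.
vec = field_vector("age", 3, 0, "number", False, False,
                   True, False, False, [20, 30, 40])
```

In practice the components would probably be normalized to a common range before being fed to the network; that step is omitted here.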
The specific steps for matching the tables in the schema to be matched comprise:
(1) extract the table name of each table in the schema to be matched and the field names the table contains;
(2) traverse the extracted table names and field names in order; during the traversal, look up the table rule set and check whether any table satisfies a rule in it; if a table satisfies a rule, match that table with the table named in the rule and mark it as matched;
(3) continue the traversal until all tables have been traversed, then return the matching result.
The generation-feedback neural network algorithm comprises the following steps:
(1) build the initial generation-feedback network, with N neurons in the input layer and M neurons in the output layer;
(2) assign values to the parameters of the generation-feedback network, including its learning rate r, the network weights w and the bias θ of each unit, where the learning rate lies in the range 0.0 ≤ r ≤ 1.0, and the weights and biases lie in the ranges -1.0 ≤ w ≤ 1.0 and -1.0 ≤ θ ≤ 1.0 respectively;
(3) perform forward propagation and backward error propagation on the constructed generation-feedback network, revising the weights and biases at the same time;
(4) input the training data set to the neural network and use the generation-feedback neural network algorithm to prune the nodes and connections in the network, revising the initial sixteen-tuple form of the fields at the same time.
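Steps (1) to (3) can be sketched as a small feed-forward network trained by error back-propagation. The sketch below is an assumption-laden illustration: the sigmoid activation, the single hidden layer of (M+N)/2 units (the size given in the embodiment), the dictionary layout and the function names `init_network`, `forward` and `train_step` are all choices made for this example, not details fixed by the description.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_network(n_in, n_out, seed=0):
    """Build the network with weights and biases drawn from [-1.0, 1.0]."""
    rng = random.Random(seed)
    n_hid = max(1, (n_in + n_out) // 2)   # hidden size (M + N) / 2

    def layer(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)]
                for _ in range(rows)]

    return {
        "w1": layer(n_hid, n_in),
        "b1": [rng.uniform(-1.0, 1.0) for _ in range(n_hid)],
        "w2": layer(n_out, n_hid),
        "b2": [rng.uniform(-1.0, 1.0) for _ in range(n_out)],
    }


def forward(net, x):
    """Forward propagation; returns hidden and output activations."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(net["w1"], net["b1"])]
    o = [sigmoid(sum(w * hi for w, hi in zip(row, h)) + b)
         for row, b in zip(net["w2"], net["b2"])]
    return h, o


def train_step(net, x, target, r=0.2):
    """One forward pass plus backward error propagation.

    Updates weights and biases in place and returns the largest |delta w|,
    which a caller could compare against a stopping threshold.
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("learning rate r must lie in [0.0, 1.0]")
    h, o = forward(net, x)
    err_o = [oj * (1 - oj) * (tj - oj) for oj, tj in zip(o, target)]
    # Hidden errors use the output weights before they are updated.
    err_h = [hj * (1 - hj) * sum(err_o[k] * net["w2"][k][j]
                                 for k in range(len(o)))
             for j, hj in enumerate(h)]
    max_dw = 0.0
    for k, row in enumerate(net["w2"]):
        for j in range(len(row)):
            dw = r * err_o[k] * h[j]
            row[j] += dw
            max_dw = max(max_dw, abs(dw))
        net["b2"][k] += r * err_o[k]
    for j, row in enumerate(net["w1"]):
        for i in range(len(row)):
            dw = r * err_h[j] * x[i]
            row[i] += dw
            max_dw = max(max_dw, abs(dw))
        net["b1"][j] += r * err_h[j]
    return max_dw
```

A training loop would call `train_step` over the training set repeatedly and stop once the returned maximum weight change, the error rate or the iteration count crosses its configured limit.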
The beneficial effects of the invention are as follows: by fusing first-order logic with neural networks from the field of artificial intelligence, the invention effectively reduces the time needed for data matching. Table features are extracted and matched with the table feature extraction algorithm based on first-order logic, and fields are then classified and matched with the generation-feedback neural network algorithm. This reduces the time spent in the data correspondence process and improves the efficiency and accuracy of matching.
Description of drawings
Fig. 1 is a flow chart of the table feature extraction algorithm based on first-order logic;
Fig. 2 is a flow chart of table matching using the extracted features;
Fig. 3 is a flow chart of the generation-feedback neural network algorithm.
Embodiment
The invention is described in more detail below by way of example with reference to the accompanying drawings:
(1) Main processing flow
Fig. 1 is a flow chart of TIAFL (Table Information Acquisition Based on First-order Logic), the table feature extraction algorithm based on first-order logic. The algorithm extracts table features, and its steps can be summarized as follows. First, the selected training schemas are represented as table vectors and stored in a set; each vector comprises the table name, the positive sample data, the negative sample data and the predicate set. Second, the table feature extraction algorithm based on first-order logic extracts features from each table in the set. Finally, the extracted features of each table are stored for later use in table identification.
Fig. 2 is a flow chart of table matching using the extracted features, and its steps can be summarized as follows. First, the table information in the schema to be matched is extracted and the result is stored in the table matching set. Second, the table matching set of the schema to be matched is checked for remaining elements. If an element exists, the extracted table feature rule set is traversed to check whether the element satisfies a rule in it; if it does, the match is added to the table matching result. If the table matching set contains no more elements, matching ends. Finally, the matching result is returned to the user.
Fig. 3 is a flow chart of the generation-feedback neural network algorithm, whose steps fall into two phases. The first phase is the generation phase, in which the network is trained and each of its parameters is determined. The second phase is the feedback phase, in which the initial nodes of the neural network are pruned and the result is fed back into the representation of the fields.
(2) Specific algorithms
Fig. 1 is the flow chart of the TIAFL algorithm, which extracts table features. The algorithm is as follows:
1 The TIAFL algorithm
1) scan the table information in the schemas that have already been matched, and store the sample data of each table in TableInfoList;
2) extract the table name, positive sample data, negative sample data and predicate set from TableInfoList, and use them to initialize the positive sample data Pos, the negative sample data Neg and the predicate set Predicates; at the same time, initialize the set of learned rules Learned_rules;
3) traverse the Pos set; if it is empty, the algorithm ends; if it is not empty, go to step 4;
4) traverse the Neg set; if it is not empty, generate candidate literals from Predicates, assess the candidate literals with the evaluation function, add the best literal to NewRule, and then recompute the Neg set that satisfies the current conditions; repeat until the Neg set is empty;
5) add the extracted rule to Learned_rules, then recompute the sample data in Pos with respect to the rule; repeat until the rules of all tables have been extracted; then return the table name and the extracted field features of each table to the user for use in the table matching that follows.
2 Table matching using the extracted table features
1) extract the table information in the schema to be matched, that is, the table name of each table in the schema and the field names the table contains;
2) traverse the extracted table information in order; during the traversal, look up the table rule set obtained by algorithm 1 and check whether any table satisfies a rule in it; if a table satisfies a rule, match that table with the table named in the rule;
3) mark the matched table in the table information set and continue the traversal, repeating step 2 until all tables have been traversed;
4) return the matching result to the user, which completes table-level matching.
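The four steps above amount to a lookup of each candidate table against the learned rule set. A minimal sketch follows, assuming that a rule is a list of field names that must all be present and that the rule set maps each target table to its rules; the function name `match_tables` and the example data are hypothetical.

```python
def match_tables(schema, table_rules):
    """Match each table of the schema against the learned table rules.

    schema      : dict mapping table name -> set of field names
    table_rules : dict mapping target table name -> list of rules,
                  each rule being a list of required field names
    Returns a dict mapping matched table name -> target table name.
    """
    matches = {}
    for table_name, fields in schema.items():        # sequential traversal
        for target, rules in table_rules.items():
            if any(all(lit in fields for lit in rule) for rule in rules):
                matches[table_name] = target          # mark as matched
                break
    return matches


# Hypothetical schema to be matched and rule set from algorithm 1.
schema = {
    "t_person": {"id", "name", "gender"},
    "t_order": {"id", "amount", "order_date"},
    "t_misc": {"id", "remark"},
}
table_rules = {
    "resident": [["name"]],
    "purchase": [["amount", "order_date"]],
}
result = match_tables(schema, table_rules)
print(result)
```

Tables that satisfy no rule, such as `t_misc` here, are simply left out of the result and could be passed to another matcher.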
3 The generation-feedback neural network algorithm
The generation-feedback neural network algorithm consists mainly of the following steps:
(1) Build the initial generation-feedback network. The input layer has N neurons, where N is the number of attributes that describe a field; the output layer has M neurons, where M is the number of categories obtained after classification by SOM; the hidden layer used here has (M+N)/2 neurons.
(2) Assign values to the parameters of the generation-feedback network, including its learning rate r, the network weights w and the bias θ of each unit. The learning rate lies in the range 0.0 ≤ r ≤ 1.0, and the weights and biases lie in the ranges -1.0 ≤ w ≤ 1.0 and -1.0 ≤ θ ≤ 1.0 respectively.
(3) After all parameters have been assigned, perform forward propagation and backward error propagation on the constructed generation-feedback network, revising the weights and biases at the same time, until every Δw of the most recent error propagation is no greater than the set parameter value, or the error rate falls below the specified threshold, or the number of propagations reaches the preset limit. This ends the first training phase.
(4) Take the neural network obtained from the first training phase as the input of the feedback phase, and train the network again with the training set.
(5) Define a dynamic array Array that records the information of each neuron together with a logical variable flag_Removed, and a dynamic array Array2 that stores the 16 indicators of the formal representation of field attributes.
(6) Traverse each neuron of the input layer. Remove the current neuron from the network, then run the training samples through the network without that node. If every training sample is still classified correctly, remove the node and its connections, and delete the attribute item corresponding to the node from Array2; otherwise, restore the node. Continue with the next neuron and repeat the operation until all neurons have been traversed.
(7) Use the network obtained after the second training as the network for field matching and perform field matching on the schema to be matched; when extracting the feature vectors of field attributes, use the data specification as revised in the previous step.
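The pruning loop of step (6) can be sketched independently of the network itself. In the sketch below, `prune_inputs` implements the remove, check and restore cycle, while `centroid_checker` is a deliberately simple nearest-centroid classifier that stands in for the retrained neural network; both names, the stand-in classifier and the sample data are assumptions made for illustration only.

```python
def prune_inputs(attribute_names, still_correct):
    """Tentatively remove each input neuron in turn.

    A removal is kept only when `still_correct(removed)` reports that all
    training samples are still classified correctly; otherwise the neuron
    is restored. Returns the attribute items that survive pruning.
    """
    removed = [False] * len(attribute_names)   # role of flag_Removed in Array
    for i in range(len(attribute_names)):
        removed[i] = True                      # take neuron i out
        if not still_correct(removed):
            removed[i] = False                 # classification broke: restore
    # Role of Array2: the attribute items left after pruning.
    return [n for n, gone in zip(attribute_names, removed) if not gone]


def centroid_checker(samples, labels):
    """Hypothetical stand-in for the trained network: a nearest-centroid
    classifier restricted to the inputs that have not been removed."""
    def still_correct(removed):
        keep = [i for i, gone in enumerate(removed) if not gone]
        if not keep:
            return False
        centroids = {}
        for lab in sorted(set(labels)):
            rows = [s for s, l in zip(samples, labels) if l == lab]
            centroids[lab] = [sum(r[i] for r in rows) / len(rows)
                              for i in keep]
        for s, lab in zip(samples, labels):
            dist = {c: sum((s[i] - v) ** 2 for i, v in zip(keep, vec))
                    for c, vec in centroids.items()}
            if min(dist, key=dist.get) != lab:
                return False
        return True
    return still_correct


# Toy data: only the first attribute separates the two classes.
samples = [[0.0, 0.5, 0.5], [0.1, 0.5, 0.5],
           [1.0, 0.5, 0.5], [0.9, 0.5, 0.5]]
labels = [0, 0, 1, 1]
kept = prune_inputs(["L_E", "C_T", "N_T"], centroid_checker(samples, labels))
print(kept)
```

Because the check is passed in as a callable, the same loop could be driven by the actual network from the first training phase, with the callable masking the removed inputs and re-running classification.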
(3) Experimental analysis
The experiments of the invention use data from six regions as the schemas to be matched and one standard schema as the target schema. The matches between the six regions and the target schema were completed manually, and the details of the matches are shown in Table 1. In the experiments, X city, H city, B city and C city serve as the training data set, and Y city and Q province serve as the test data set.
Table 1
The results of the table matching experiments are evaluated with precision, recall and the comprehensive measure overall. Precision and recall, two measures from the field of information retrieval, cannot fully reflect the quality of table matching on their own, so the comprehensive measure is needed to reflect table matching quality. Following the procedure of algorithm 1, the data sets of X city, H city, B city and C city are first used for training to obtain the table features. Algorithm 2 is then applied to the test data sets Y city and Q province, once with a method based on semantic similarity and once with the method of the invention; the experimental results are shown in Table 2. Analysis of the results shows that, compared with relying solely on the semantic similarity between table names, the method of the invention improves precision, recall and the comprehensive measure.
Table 2
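The text does not give formulas for the three measures. The sketch below assumes the definitions customary in the schema matching literature, where overall equals recall × (2 − 1/precision), which is the same as (correct − wrong) / real; the function name `match_quality` and the example match sets are hypothetical and are not the experimental data.

```python
def match_quality(found, real):
    """Return (precision, recall, overall) for a set of proposed matches.

    found : matches returned by the matcher, as (source, target) pairs
    real  : manually determined correct matches
    Overall is computed as (2 * correct - |found|) / |real|, which equals
    recall * (2 - 1 / precision) whenever precision is positive.
    """
    found, real = set(found), set(real)
    correct = len(found & real)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(real) if real else 0.0
    overall = (2 * correct - len(found)) / len(real) if real else 0.0
    return precision, recall, overall


# Hypothetical example: 3 proposed matches, 2 of them correct, 4 real matches.
real = {("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")}
found = {("a", "A"), ("b", "B"), ("c", "X")}
precision, recall, overall = match_quality(found, real)
```

Overall becomes negative when more than half of the proposed matches are wrong, which is why it is a stricter indicator than precision or recall alone.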
In addition, the intermediate matching results obtained with algorithms 1 and 2 are matched further using algorithm 3. With the number of iterations set to 100000, the learning rate set to 0.2 and the training precision set to 0.001, the training time of direct matching and the training time of the invention are compared in Table 3. The table shows that, in the three training time comparisons performed, the invention effectively improves the efficiency and accuracy of training. In the field matching experiments with algorithm 3, the method was also compared with a classification-based neural network matching method and an attribute-based neural network matching method. Under the same conditions, the method of the invention effectively improves the accuracy of field matching and is therefore an effective method.
Table 3
Table 4