CN104573130A

CN104573130A - Entity resolution method based on group calculation and entity resolution device based on group calculation

Info

Publication number: CN104573130A
Application number: CN201510076586.4A
Authority: CN
Inventors: 刘旭东; 孙海龙; 郭莉莎; 张日崇
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2015-02-12
Filing date: 2015-02-12
Publication date: 2015-04-29
Anticipated expiration: 2035-02-12
Also published as: CN104573130B

Abstract

Embodiments of the invention provide an entity resolution method and an entity resolution device based on crowd computing. The method comprises: first, performing hierarchical clustering on the initial records in a database to obtain at least two cluster subsets; when a new record is detected to have been added to the database, obtaining, from the at least two cluster subsets, at least two relevant cluster subsets that are most relevant to the new record, and determining the candidate record pair corresponding to each of the at least two relevant cluster subsets; judging, by crowdsourced user annotation, whether at least one candidate record pair represents the same entity; if a first candidate record pair is determined to represent the same entity, adding the new record to the first cluster subset to which the first record belongs; and if none of the candidate record pairs is determined to represent the same entity, establishing a new cluster subset for the new record and establishing a label set for the new cluster subset. The method can perform entity resolution on both static and dynamic data, thereby improving resolution efficiency.

Description

Entity resolution method and device based on crowd computing

Technical field

Embodiments of the present invention relate to computer technology, and in particular to an entity resolution method and device based on crowd computing.

Background art

A database is a repository that organizes, stores, and manages data according to a data structure. With the development of information technology and the market, data management is no longer only the storage and management of data; it has evolved into the various forms of data management that users require. Entity resolution arose in the course of database management: its purpose is to identify the different records in a database that represent the same entity. With the arrival of the big data era, more and more data must be matched or integrated before it can be analysed and processed further, so the demand for high-quality entity resolution is growing rapidly.

Existing entity resolution methods mainly target static data sources (that is, the data source is assumed to be static and unchanging), and each run of entity resolution resolves the whole data source. In practice, however, data is added to, deleted from, or modified in a database in every time period; most data sources change dynamically, for example information submitted by users on social networking sites, product information on e-commerce websites, and bug repositories in software engineering. With existing methods, the whole data source in the database must be resolved again every time data is added, which is costly; that is, resolution efficiency is low.

Summary of the invention

Embodiments of the present invention provide an entity resolution method and device based on crowd computing that can perform entity resolution on both static and dynamic data sets and achieve high recall and precision at low cost, thereby improving resolution efficiency.

In a first aspect, an embodiment of the present invention provides an entity resolution method based on crowd computing, comprising:

Performing hierarchical clustering on the initial records in a database using a crowdsourcing-based hierarchical clustering method, to obtain at least two cluster subsets;

When a new record is detected to have been added to the database, obtaining feature information of the new record;

Obtaining, from the at least two cluster subsets, at least two relevant cluster subsets that are most relevant to the new record, according to the subset information of the at least two cluster subsets and the feature information of the new record; wherein the subset information of the at least two cluster subsets comprises the label set information and index information of the cluster subsets;

Determining the candidate record pair corresponding to each of the at least two relevant cluster subsets, according to the relative magnitudes of the similarity between the new record and each record in the at least two relevant cluster subsets;

Judging, by crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, adding the new record to the first cluster subset to which the first record belongs, and updating the label set of the first cluster subset; if none of the candidate record pairs is determined to represent the same entity, establishing a new cluster subset for the new record and establishing a label set for the new cluster subset; wherein the first record and the new record form the first candidate record pair.

Optionally, performing hierarchical clustering on the initial records in the database using the crowdsourcing-based hierarchical clustering method to obtain at least two cluster subsets comprises:

According to the probability that each pair of initial records represents the same entity, grouping the initial record pairs whose probability of representing the same entity is greater than an upper probability threshold into one class to form corresponding preliminary cluster subsets, and establishing a label set and an index for each preliminary cluster subset; wherein each pair of initial records forms an initial record pair;

Merging the preliminary cluster subsets hierarchically, one pair at a time, by crowdsourced user annotation, until the minimum distance between the cluster subsets after merging is greater than a lower threshold, finally obtaining the at least two cluster subsets.

Optionally, grouping, according to the probability that each pair of initial records represents the same entity, the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form corresponding preliminary cluster subsets comprises:

Obtaining the probability that each initial record pair represents the same entity;

Grouping the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class, to form corresponding preliminary cluster subsets.

Optionally, merging the preliminary cluster subsets hierarchically, one pair at a time, by crowdsourced user annotation until the minimum distance between the cluster subsets after merging is greater than the lower threshold, finally obtaining the at least two cluster subsets, comprises:

Step A: calculating the distance between each pair of preliminary cluster subsets, and selecting the pair of preliminary cluster subsets with the smallest distance as the two candidate subsets for merging;

Step B: judging whether the distance between the two candidate subsets is less than the lower threshold; if it is, selecting one record from each of the two candidate subsets to form a second candidate record pair, and sending the second candidate record pair and the label sets of the two candidate subsets to a crowdsourcing platform, so that the crowdsourcing platform judges whether the second candidate record pair represents the same entity and whether to upvote the labels in the label sets; wherein the second candidate record pair is the record pair, across the two candidate subsets, with the highest probability of representing the same entity;

Step C: receiving a first judgment result returned by the crowdsourcing platform; determining from the first judgment result whether to merge the two candidate subsets; and sorting and/or filtering the labels in the label sets according to the number of upvotes the crowdsourcing platform gave each label; if the first judgment result indicates that the two candidate subsets represent the same entity, merging them into one cluster subset, updating the label set and index of that cluster subset, and treating the merged cluster subset as a preliminary cluster subset; if the first judgment result indicates that the two candidate subsets do not represent the same entity, setting the distance between them to 1;

Returning to and repeating steps A to C until the distance between the two candidate subsets is greater than the lower threshold, and then taking the at least two preliminary cluster subsets as the at least two cluster subsets obtained.
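Steps A to C can be sketched as a bottom-up merge loop. The following is a minimal illustration only, not the patented implementation: the names `crowd_merge`, `distance`, and `ask_crowd` are hypothetical, the distance function is passed in by the caller, and the crowdsourcing platform is reduced to a callback that returns True when the two candidate subsets are judged to represent the same entity. Label upvoting and index updates are left out. The sketch assumes the lower threshold does not exceed 1, so that a rejected pair (distance set to 1) is never asked about again.

```python
from itertools import combinations
from typing import Callable, List, Set

Cluster = Set[str]


def crowd_merge(
    clusters: List[Cluster],
    distance: Callable[[Cluster, Cluster], float],
    ask_crowd: Callable[[Cluster, Cluster], bool],
    lower_threshold: float,
) -> List[Cluster]:
    """Merge clusters bottom-up until the closest pair is not below lower_threshold."""
    clusters = [set(c) for c in clusters]
    rejected = set()  # pairs the crowd judged to be different entities

    def pair_key(a: Cluster, b: Cluster):
        return frozenset((frozenset(a), frozenset(b)))

    def dist(a: Cluster, b: Cluster) -> float:
        # A rejected pair is treated as having distance 1.
        return 1.0 if pair_key(a, b) in rejected else distance(a, b)

    while len(clusters) > 1:
        # Step A: the closest pair becomes the two candidate subsets.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        a, b = clusters[i], clusters[j]
        # Step B: stop once even the closest pair is not below the threshold.
        if dist(a, b) >= lower_threshold:
            break
        # Step C: the crowd decides whether the two subsets are merged.
        if ask_crowd(a, b):
            rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
            clusters = rest + [a | b]
        else:
            rejected.add(pair_key(a, b))
    return clusters
```

Each iteration sends at most one question to the crowd, so the number of crowd tasks is bounded by the number of candidate pairs whose distance falls below the threshold.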

Optionally, obtaining the probability that the initial record pair represents the same entity comprises:

Calculating the similarity of the initial record pair from the similarities between the corresponding attributes of the two records in the pair;

Calculating, based on a machine learning model, the probability that the initial record pair represents the same entity.

Optionally, calculating the distance between each pair of preliminary cluster subsets comprises:

From each pair of preliminary cluster subsets, selecting the record pair (r_i, r_j) with the highest probability of representing the same entity, wherein r_i ∈ C_i, r_j ∈ C_j, C_i is one preliminary cluster subset of the pair, and C_j is the other preliminary cluster subset of the pair;

Obtaining the distance between each pair of preliminary cluster subsets according to a formula in maxSimi and cosinSimi; wherein maxSimi is the probability that the record pair (r_i, r_j) represents the same entity, and cosinSimi is the cosine similarity of the pair of preliminary cluster subsets.

Optionally, obtaining, from the at least two cluster subsets, the at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record comprises:

Building an inverted index from the label set information and index information of the at least two cluster subsets obtained by performing hierarchical clustering on the initial records in the database using the crowdsourcing-based hierarchical clustering method;

Searching the inverted index with the feature information of the new record, to obtain from the at least two cluster subsets the at least two relevant cluster subsets most relevant to the new record.
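The retrieval step can be illustrated with a small inverted index over label keywords. This is a sketch under stated assumptions: the functions `build_inverted_index` and `most_relevant` are hypothetical names, and relevance is approximated by the number of keywords a cluster subset's label set shares with the new record, which the source does not specify.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set


def build_inverted_index(label_sets: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each label keyword to the ids of the cluster subsets that carry it."""
    index = defaultdict(set)
    for cluster_id, labels in label_sets.items():
        for word in labels:
            index[word].add(cluster_id)
    return dict(index)


def most_relevant(index: Dict[str, Set[str]], keywords: Iterable[str], k: int = 2) -> List[str]:
    """Return up to k cluster ids, ranked by the number of shared keywords."""
    hits = Counter()
    for word in set(keywords):
        for cluster_id in index.get(word, ()):
            hits[cluster_id] += 1
    # Ties are broken by cluster id so the result is deterministic.
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [cluster_id for cluster_id, _ in ranked[:k]]
```

Only cluster subsets that share at least one keyword with the new record are touched, so the lookup cost depends on the keywords of the new record rather than on the size of the database.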

Optionally, determining the candidate record pair corresponding to each of the at least two relevant cluster subsets according to the relative magnitudes of the similarity between the new record and each record in the at least two relevant cluster subsets comprises:

Calculating the similarity between the new record and each record in the at least two relevant cluster subsets;

From each relevant cluster subset, selecting the record with the greatest similarity to the new record, and pairing it with the new record to form the candidate record pair of that relevant cluster subset; wherein the number of relevant cluster subsets equals the number of candidate record pairs.
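The selection of one candidate record pair per relevant cluster subset can be sketched as follows. The similarity measure here is an assumption: the standard-library `difflib.SequenceMatcher` ratio stands in for whatever attribute-level similarity an implementation uses, and records are simplified to single strings. The function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (stand-in measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_pairs(new_record: str, relevant: Dict[str, List[str]]) -> Dict[str, Tuple[str, str]]:
    """For each relevant cluster subset, pair its most similar record with the new record."""
    pairs = {}
    for cluster_id, records in relevant.items():
        best = max(records, key=lambda r: similarity(new_record, r))
        pairs[cluster_id] = (best, new_record)
    return pairs
```

The result holds exactly one pair per relevant cluster subset, which matches the statement that the number of candidate record pairs equals the number of relevant cluster subsets.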

Optionally, judging, by crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, adding the new record to the first cluster subset to which the first record belongs, and updating the label set of the first cluster subset; if none of the candidate record pairs is determined to represent the same entity, establishing a new cluster subset for the new record and establishing a label set for the new cluster subset, comprises:

Sending all the candidate record pairs to a crowdsourcing platform, so that the crowdsourcing platform judges whether each candidate record pair represents the same entity;

Receiving a second judgment result returned by the crowdsourcing platform, and determining from the second judgment result whether at least one of the candidate record pairs represents the same entity; if the second judgment result indicates that a first candidate record pair represents the same entity, adding the new record to the first cluster subset to which the first record belongs, and updating the label set of the first cluster subset; if the second judgment result indicates that none of the candidate record pairs represents the same entity, establishing a new cluster subset for the new record and establishing a label set for the new cluster subset.
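The decision made on the second judgment result can be sketched as below. The crowdsourcing platform is again reduced to a hypothetical callback, `crowd_says_same`, and the label set update is simplified to a set union with the keywords of the new record; the identifier scheme for the new cluster subset is also an assumption.

```python
from typing import Callable, Dict, List, Set, Tuple


def place_new_record(
    new_record: str,
    clusters: Dict[str, List[str]],
    labels: Dict[str, Set[str]],
    pairs: Dict[str, Tuple[str, str]],
    crowd_says_same: Callable[[str, str], bool],
    keywords: Set[str],
) -> str:
    """Add the new record to a matching cluster subset, or create a new one.

    Returns the id of the cluster subset that now holds the record.
    """
    for cluster_id, (existing, new) in pairs.items():
        if crowd_says_same(existing, new):
            clusters[cluster_id].append(new_record)
            labels[cluster_id] |= keywords  # update the label set
            return cluster_id
    # No candidate pair matched: open a new cluster subset with its own label set.
    n = len(clusters) + 1
    while f"c{n}" in clusters:
        n += 1
    new_id = f"c{n}"
    clusters[new_id] = [new_record]
    labels[new_id] = set(keywords)
    return new_id
```

Only the candidate pairs are judged, so the existing cluster subsets that were not retrieved as relevant are left untouched.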

In a second aspect, an embodiment of the present invention provides an entity resolution device based on crowd computing, comprising:

A hierarchical clustering module, configured to perform hierarchical clustering on the initial records in a database using a crowdsourcing-based hierarchical clustering method, to obtain at least two cluster subsets;

A detection module, configured to obtain feature information of a new record when the new record is detected to have been added to the database;

A first determination module, configured to obtain, from the at least two cluster subsets, at least two relevant cluster subsets that are most relevant to the new record, according to the subset information of the at least two cluster subsets and the feature information of the new record; wherein the subset information of the at least two cluster subsets comprises the label set information and index information of the cluster subsets;

A second determination module, configured to determine the candidate record pair corresponding to each of the at least two relevant cluster subsets, according to the relative magnitudes of the similarity between the new record and each record in the at least two relevant cluster subsets;

A division module, configured to judge, by crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity; if a first candidate record pair is determined to represent the same entity, to add the new record to the first cluster subset to which the first record belongs and update the label set of the first cluster subset; and if none of the candidate record pairs is determined to represent the same entity, to establish a new cluster subset for the new record and establish a label set for the new cluster subset; wherein the first record and the new record form the first candidate record pair.

In the present invention, hierarchical clustering is performed on the initial records in a database using a crowdsourcing-based hierarchical clustering method, to obtain at least two cluster subsets. When a new record is detected to have been added to the database, feature information of the new record is obtained. At least two relevant cluster subsets that are most relevant to the new record are then obtained from the at least two cluster subsets according to the subset information of the at least two cluster subsets and the feature information of the new record, wherein the subset information comprises the label set information and index information of the cluster subsets. The candidate record pair corresponding to each of the at least two relevant cluster subsets is determined according to the relative magnitudes of the similarity between the new record and each record in the at least two relevant cluster subsets. Crowdsourced user annotation is then used to judge whether at least one of the candidate record pairs represents the same entity: if a first candidate record pair is determined to represent the same entity, the new record is added to the first cluster subset to which the first record belongs, and the label set of the first cluster subset is updated; if none of the candidate record pairs is determined to represent the same entity, a new cluster subset is established for the new record and a label set is established for the new cluster subset; wherein the first record and the new record form the first candidate record pair. Entity resolution can thus be performed on both static and dynamic data sets, achieving high recall and precision at low cost and thereby improving resolution efficiency.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of embodiment one of the entity resolution method based on crowd computing according to the present invention;

Fig. 2 is a schematic flowchart of embodiment two of the entity resolution method based on crowd computing according to the present invention;

Fig. 3 is a schematic flowchart of embodiment three of the entity resolution method based on crowd computing according to the present invention;

Fig. 4 is a schematic structural diagram of embodiment one of the entity resolution device based on crowd computing according to the present invention.

Detailed description of the embodiments

To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

In some scenarios, different records that represent the same entity are usually not identical. The main task of entity resolution is to identify the different records in a database that represent the same entity, which is especially important when cleaning or integrating data from multiple data sources. For example, a mailing list may contain many records that in fact refer to the same physical address, but because they contain different spellings or lack some information, the records inevitably differ from one another. As another example, a company may have several databases (each belonging to a sub-department) that store user information, and typically the company wants to integrate this user information to obtain a more complete profile of each user. Because each user's information may appear in a different form in each database, with no unified identifier, identifying matching user information across the databases is not easy.

Machine learning is a multi-disciplinary field that has risen over the past twenty-odd years, drawing on probability theory, statistics, approximation theory, convex analysis, and computational complexity theory. Machine learning theory mainly concerns the design and analysis of algorithms that let computers "learn" automatically; a machine learning algorithm automatically derives rules from data and uses those rules to predict unknown data. Machine learning algorithms fall roughly into four classes: supervised learning, semi-supervised learning, unsupervised learning, and reinforcement learning. 1) Supervised learning learns a function from a given training data set; when new data arrives, the result can be predicted with this function. The training set for supervised learning must include inputs and outputs (features and targets), and the targets in the training set are labelled by people. Common supervised learning algorithms include regression analysis and statistical classification. 2) In unsupervised learning, by contrast, the training set has no human-labelled results; clustering is a common unsupervised learning algorithm. 3) Semi-supervised learning lies between supervised and unsupervised learning. 4) Reinforcement learning learns by observation: each action affects the environment, and the learner makes judgments based on the feedback it observes from the surrounding environment.

With the arrival of the big data era, the demand for high-quality entity resolution is growing rapidly. Traditional entity resolution schemes are based on machine learning. Although a large body of research exists on machine-learning-based entity resolution, the process involves semantic analysis, domain knowledge, and related experience, and a machine is less accurate than a human at judging whether different records are the same entity; that is, machine processing is still not accurate enough. With the rapid development of crowdsourcing marketplaces, manual annotation has been applied to entity resolution, that is, entity resolution is carried out on a crowdsourcing platform. Manual annotation is more accurate than machine judgment, but it incurs a larger cost in time and money.

Current crowd-computing-based entity resolution methods apply only to static databases; that is, each run resolves the whole database. The idea of crowd computing is to combine crowdsourcing and cloud computing with machine learning or artificial intelligence, solving problems by fusing the efficiency of computer processing with the accuracy of crowd wisdom. In practice, however, databases are dynamic, for example the landmark data set in Facebook or bug repositories in software engineering: new data is added to the database in every time period and must be resolved against the data already in the database. Traditional entity resolution methods that apply only to static data sources therefore cannot meet the needs of dynamic data sources.

The present invention resolves data by fusing the efficiency of computer processing with the accuracy of crowd wisdom, and proposes a crowd-computing-based scheme that can perform entity resolution on both static and dynamic data sets; the scheme achieves high recall and precision at low cost.

Fig. 1 is a schematic flowchart of embodiment one of the entity resolution method based on crowd computing according to the present invention. As shown in Fig. 1, the method of this embodiment may comprise:

S101: perform hierarchical clustering on the initial records in the database using a crowdsourcing-based hierarchical clustering method, to obtain at least two cluster subsets.

In real-world data sources, most records are not duplicates of one another. Submitting every record pair to the crowdsourcing platform for judgment is infeasible in both cost and time. The probability that each record pair is a duplicate can therefore be obtained by machine learning, and upper and lower probability thresholds can be set to filter out the record pairs that are most or least likely to represent the same entity: a record pair whose probability is greater than the upper probability threshold is taken to represent the same entity, and a record pair whose probability is less than the lower probability threshold is taken not to represent the same entity.
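The two-threshold filter can be written as a small triage function. This is an illustrative reading rather than text from the source: the function name `triage` and the returned labels are hypothetical, and routing the pairs that fall between the two thresholds to the crowd is an assumption drawn from the overall scheme.

```python
def triage(probability: float, lower: float, upper: float) -> str:
    """Classify a record pair by its machine-estimated duplicate probability."""
    if probability > upper:
        return "same"        # accepted by machine, no crowd task needed
    if probability < lower:
        return "different"   # rejected by machine, no crowd task needed
    return "ask_crowd"       # uncertain: left for crowdsourced judgment
```

Under this reading, only the pairs in the uncertain band generate crowd tasks, which is where the cost saving comes from.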

In embodiments of the present invention, a hierarchical clustering algorithm gathers all records that represent the same entity into the same subclass (so record pairs in different subclasses represent different entities). Hierarchical clustering consists of partitional clusterings at different levels, with the partitions at successive levels nested. Specifically, clustering follows a bottom-up strategy: preliminary cluster subsets are first obtained by the filtering step; then, according to the distance between each pair of preliminary cluster subsets, the preliminary cluster subsets are iteratively and hierarchically merged, in a certain order, into larger cluster subsets by crowdsourced user annotation, until the minimum distance between the cluster subsets after merging is greater than the lower threshold, that is, until the probability that records in any pair of cluster subsets represent the same entity is less than the lower probability threshold. (The smaller the distance between two cluster subsets, the greater the probability that their records are duplicates, or that the two cluster subsets represent the same entity; the larger the distance, the smaller that probability.) When the minimum distance between the cluster subsets after merging is greater than the lower threshold, the probability that record pairs across cluster subsets represent the same entity is less than the lower probability threshold, so no two cluster subsets represent the same entity, and the clustering algorithm stops.

Optionally, obtaining the preliminary cluster subsets by the filtering step comprises: performing preliminary clustering on the initial records in the database according to the upper probability threshold, for example by grouping the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form the corresponding preliminary cluster subsets; wherein the records within each preliminary cluster subset represent the same entity, whereas records in different preliminary cluster subsets do not.
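The preliminary clustering can be sketched with a union-find structure. The sketch assumes that grouping pairs "into one class" means taking connected components of the pairs above the upper probability threshold, which the source does not state explicitly; `initial_clusters` and its parameters are hypothetical names.

```python
from typing import Dict, List, Set, Tuple


def initial_clusters(
    records: List[str],
    pair_probability: Dict[Tuple[str, str], float],
    upper: float,
) -> List[Set[str]]:
    """Group records linked by pairs whose probability exceeds the upper threshold."""
    parent = {r: r for r in records}

    def find(x: str) -> str:
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pair_probability.items():
        if p > upper:
            parent[find(a)] = find(b)  # union the two groups

    groups: Dict[str, Set[str]] = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())
```

A record that has no pair above the threshold ends up in a preliminary cluster subset of its own.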

Optionally, step S101 comprises:

In embodiments of the present invention, optionally, the probability that each initial record pair represents the same entity is obtained first. Next, the initial record pairs whose probability of representing the same entity is greater than the upper probability threshold are grouped into one class to form the corresponding preliminary cluster subsets.

Further, to support later queries on dynamic data, a label set and an index are established for each preliminary cluster subset. Optionally, the label set of each preliminary cluster subset is established by machine or by manual annotation. Optionally, the attribute values of each record in the preliminary cluster subset are tokenized, stop words are removed, and words are stemmed; words (i.e. keywords) that recur across records and have a relatively large inverse document frequency (IDF) value over the whole data set are then added to the label set of the preliminary cluster subset. (The IDF value measures the general importance of a word: the IDF of a given word is the logarithm of the quotient of the total number of documents divided by the number of documents containing the word.) An index is created for each preliminary cluster subset from its label set information, wherein the index information comprises the keywords of the records in the preliminary cluster subset.

Further, the preliminary cluster subsets are merged hierarchically, one pair at a time, by crowdsourced user annotation until the minimum distance between the cluster subsets after merging is greater than the lower threshold. At that point the probability that record pairs across cluster subsets represent the same entity is less than the lower probability threshold, so no two cluster subsets represent the same entity; the clustering algorithm stops, and the at least two cluster subsets are finally obtained.
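The label set construction can be illustrated as follows, using the IDF definition given above (logarithm of the total number of documents divided by the number of documents containing the word). This is a simplified sketch: stemming is omitted, the stop-word list is a tiny placeholder, each record is treated as one document, and the cut-off `min_idf` is a hypothetical parameter standing in for "relatively large".

```python
import math
from typing import Dict, List, Set

STOP_WORDS = {"the", "a", "of", "and", "with"}  # placeholder list, not from the source


def tokenize(text: str) -> Set[str]:
    """Lower-case, split on whitespace, and drop stop words (no stemming here)."""
    return {w for w in text.lower().split() if w not in STOP_WORDS}


def idf_table(all_records: List[str]) -> Dict[str, float]:
    """IDF of each word: log(total documents / documents containing the word)."""
    n = len(all_records)
    doc_freq: Dict[str, int] = {}
    for rec in all_records:
        for w in tokenize(rec):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(n / df) for w, df in doc_freq.items()}


def label_set(cluster_records: List[str], idf: Dict[str, float], min_idf: float) -> Set[str]:
    """Keep words that recur across the cluster's records and have IDF >= min_idf."""
    counts: Dict[str, int] = {}
    for rec in cluster_records:
        for w in tokenize(rec):
            counts[w] = counts.get(w, 0) + 1
    repeated = {w for w, c in counts.items() if c >= 2}
    return {w for w in repeated if idf.get(w, 0.0) >= min_idf}
```

A word that appears in every record of the data set has an IDF of zero and is therefore never selected, however often it recurs inside a cluster subset.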

In embodiments of the present invention, the similarity of each initial record pair (an initial record pair comprises two initial records) can be represented by a feature vector, each dimension of which represents the similarity of one attribute between the two initial records. If n similarity functions are used to measure m attributes, the feature vector has n*m dimensions, and the similarity of the initial record pair can be calculated from the similarities between the corresponding attributes of the pair. Further, entity resolution based on a machine learning model can be treated as a classification problem: for example, the positive class means that two records represent the same entity, and the negative class means that the two records represent different entities. The input of a typical classifier is the feature vector representing the similarity between a pair of records, and its output is a binary classification result. The present application, however, needs the probability that a pair of records represents the same entity. Embodiments of the present invention therefore propose calculating, based on a machine learning model, the probability that the initial record pair represents the same entity. Optionally, a classifier is trained on a training set that comprises feature vectors for duplicate record pairs (records representing the same entity) and non-duplicate record pairs (records representing different entities); the trained classifier can output the probability that each initial record pair represents the same entity.
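The n*m feature vector and the probability output can be sketched with the standard library alone. The two similarity functions (token Jaccard and `difflib` ratio) are illustrative choices, not the ones named by the source, and the logistic function with hand-set weights merely stands in for a trained classifier; in practice the weights would be learned from the labelled training set.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-set overlap in [0, 1]."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def edit_ratio(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


SIM_FUNCS = [jaccard, edit_ratio]  # n = 2 similarity functions


def feature_vector(r1: Dict[str, str], r2: Dict[str, str], attributes: Sequence[str]) -> List[float]:
    """One dimension per (attribute, similarity function): n*m dimensions in total."""
    return [f(r1[a], r2[a]) for a in attributes for f in SIM_FUNCS]


def same_entity_probability(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Logistic model: maps a feature vector to a probability in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

With m = 2 attributes and the n = 2 functions above, each record pair yields a 4-dimensional vector, and the model returns a probability rather than a hard class label, which is what the threshold filtering requires.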

Optionally, the upper probability threshold and the lower probability threshold can be determined according to the target recall and precision. If the upper probability threshold is set too low, precision is reduced; if the lower probability threshold is set too high, recall is reduced; if the upper probability threshold is set too high or the lower probability threshold is set too low, filtering efficiency is affected.
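The role of the upper probability threshold in forming initial cluster subsets can be sketched as below. The text only says that pairs above the threshold are grouped into one class; treating that grouping as transitive (via a union-find structure) is an assumption of this sketch, as are the function name and the input format.

```python
def initial_clusters(record_ids, pair_probability, upper_threshold):
    """Group records whose pairwise same-entity probability exceeds the upper threshold.

    pair_probability maps (id_a, id_b) -> probability of representing the same entity.
    """
    parent = {r: r for r in record_ids}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), p in pair_probability.items():
        if p > upper_threshold:
            parent[find(a)] = find(b)  # union the two groups

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(c) for c in clusters.values())
```

Raising `upper_threshold` produces more, smaller initial cluster subsets and therefore leaves more merge decisions to the later crowdsourced stage, which is one way to read the remark about filtering efficiency.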

S102: When it is detected that a new record is added to the database, obtain the feature information of the new record.

In this embodiment of the present invention, when it is detected that a new record R is added to the database, the feature information of the new record is obtained. Optionally, the attribute values of the new record are segmented into words, stop words are removed, the words are stemmed, and keywords (that is, the feature information) are extracted according to their term frequency-inverse document frequency (TF-IDF) values. The new record can then be classified appropriately according to its feature information and the subset information of the cluster subsets that already exist in the database: if it is determined that the new record and the records in some cluster subset all represent the same entity, the new record is merged into that cluster subset; if it is determined that the new record does not represent the same entity as the records in any cluster subset, a new cluster subset is created for the new record.
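Keyword extraction by TF-IDF for a new record might look like the following sketch. The function name, the `top_k` cutoff, and the choice to count the new record itself as one document (so that every word has a document frequency of at least 1) are assumptions; stop-word removal and stemming are left out to keep the example short.

```python
import math
import re
from collections import Counter

def tfidf_keywords(new_text, corpus_texts, top_k=3):
    """Return the top_k words of new_text ranked by TF-IDF against the corpus."""
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    docs = [set(tokenize(t)) for t in corpus_texts]
    tokens = tokenize(new_text)
    tf = Counter(tokens)
    n_docs = len(docs) + 1  # existing records plus the new record (assumption)
    scores = {}
    for word, count in tf.items():
        df = 1 + sum(1 for d in docs if word in d)  # the new record contains the word
        scores[word] = (count / len(tokens)) * math.log(n_docs / df)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]
```

Words that occur in every existing record score zero and are therefore unlikely to be chosen as feature information.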

S103: Obtain, from the at least two cluster subsets, at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record.

In this embodiment of the present invention, the at least two relevant cluster subsets most relevant to the new record are obtained from the at least two cluster subsets by an information retrieval method, according to the subset information of the at least two cluster subsets and the feature information of the new record. Optionally, the subset information of the at least two cluster subsets includes the label set information and the index information of the cluster subsets.

Optionally, step S103 includes:

In this embodiment of the present invention, an inverted index is built from the label set information and index information of the at least two cluster subsets obtained by performing crowdsourcing-based hierarchical clustering on the original records in the database. Optionally, according to the label set information and index information of each cluster subset, every label in the label set of each cluster subset is used as a key and the storage address of that cluster subset is used as the value. That is, each entry in the inverted index contains an attribute value and the storage addresses of the records having that attribute value (it is called an inverted index because the attribute value determines the position of the record, rather than the record determining the attribute value). Further, retrieval is performed by an information retrieval method according to the inverted index and the feature information of the new record, so as to obtain, from the at least two cluster subsets, the at least two relevant cluster subsets most relevant to the new record, that is, the at least two relevant cluster subsets most likely to represent the same entity as the new record.
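The inverted-index lookup can be sketched as below. Cluster identifiers stand in for the storage addresses mentioned in the text, and ranking clusters by the number of matching labels is an assumed relevance measure; the text does not specify how relevance is scored. All names are hypothetical.

```python
from collections import Counter, defaultdict

def build_inverted_index(cluster_labels):
    """Map each label (key) to the set of cluster ids (value) whose label set contains it."""
    index = defaultdict(set)
    for cluster_id, labels in cluster_labels.items():
        for label in labels:
            index[label].add(cluster_id)
    return index

def most_relevant_clusters(index, keywords, k=2):
    """Return the k cluster ids sharing the most keywords with the new record."""
    hits = Counter()
    for word in keywords:
        for cluster_id in index.get(word, ()):
            hits[cluster_id] += 1
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [cluster_id for cluster_id, _ in ranked[:k]]
```

For example, a new record with keywords `["apple", "iphone", "gold"]` would retrieve a cluster labelled `{"apple", "iphone"}` ahead of one labelled `{"apple", "ipad"}`, and would not retrieve a cluster that shares no label with it.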

S104: Determine the candidate record pair corresponding to each of the at least two relevant cluster subsets according to the similarities between the new record and the records in the at least two relevant cluster subsets.

In this embodiment of the present invention, according to the similarities between the new record R and the records in the at least two relevant cluster subsets, the record R' with the greatest similarity to the new record is selected from each relevant cluster subset, thereby determining the candidate record pair (R, R') corresponding to each of the at least two relevant cluster subsets.

Optionally, step S104 includes:

In this embodiment of the present invention, the similarity between the new record R and each record in the at least two relevant cluster subsets is calculated. Further, the record R' with the greatest similarity to the new record R is selected from each relevant cluster subset and, together with the new record R, forms the candidate record pair (R, R') of that relevant cluster subset. The number of relevant cluster subsets equals the number of candidate record pairs: if the five relevant cluster subsets most relevant to the new record R are obtained in step S103, five candidate record pairs, one for each of the five relevant cluster subsets, are determined in step S104. Optionally, the similarity between the new record R and each record in the at least two relevant cluster subsets can be calculated by a text similarity algorithm.
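Selecting one candidate record pair per relevant cluster subset can be sketched as follows. Token-level Jaccard similarity is used only as a stand-in for the unspecified text similarity algorithm, and the function names and data layout are assumptions.

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two text records."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def candidate_pairs(new_record, relevant_clusters, similarity=jaccard):
    """For each relevant cluster, pair the new record R with its most similar record R'."""
    pairs = {}
    for cluster_id, records in relevant_clusters.items():
        best = max(records, key=lambda r: similarity(new_record, r))
        pairs[cluster_id] = (new_record, best)
    return pairs
```

The returned dictionary has exactly one (R, R') pair per relevant cluster subset, matching the statement that the number of candidate record pairs equals the number of relevant cluster subsets.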

S105: Judge, by means of crowdsourced user annotation, whether at least one of the candidate record pairs represents the same entity. If it is determined that a first candidate record pair represents the same entity, add the new record to the first cluster subset to which the first record belongs, and update the label set of the first cluster subset. If it is determined that none of the candidate record pairs represents the same entity, create a new cluster subset for the new record, and create a label set for the new cluster subset. The first record and the new record form the first candidate record pair.

In this embodiment of the present invention, after the candidate record pair corresponding to each of the at least two relevant cluster subsets is determined according to the similarities between the new record and the records in the at least two relevant cluster subsets, the candidate record pairs (R, R') are sent to the crowdsourcing platform, and whether at least one candidate record pair (R, R') represents the same entity is judged by means of crowdsourced user annotation. If it is determined that the first candidate record pair (R, R1') represents the same entity, the new record R is added to the first cluster subset to which the first record R1' belongs (the first cluster subset is one of the at least two relevant cluster subsets), and the label set of the first cluster subset is updated. If it is determined that none of the candidate record pairs (R, R') represents the same entity, a new cluster subset is created for the new record R and a label set is created for the new cluster subset; optionally, the label set of the new cluster subset is created by extracting the keywords of the new record. The first candidate record pair (R, R1') is one of the candidate record pairs (R, R').

Optionally, step S105 includes:

In this embodiment of the present invention, all the candidate record pairs (R, R') determined in step S104 are sent to the crowdsourcing platform so that the platform judges whether each candidate record pair (R, R') represents the same entity; for example, the platform indicates whether a candidate record pair (R, R') represents the same entity by annotating each pair. Further, the second judgment result returned by the crowdsourcing platform is received (for example, the platform's annotation of whether each candidate record pair represents the same entity), and whether at least one candidate record pair represents the same entity is determined according to the second judgment result. Because the crowdsourcing platform may include multiple crowdsourcing users, each of whom can annotate whether each candidate record pair represents the same entity, the second judgment result consists of the annotations of multiple crowdsourcing users for each candidate record pair. Optionally, a voting algorithm is used to aggregate the crowdsourced results, that is, the answer that receives more than half of the votes is selected as the result: if more than half of the crowdsourcing users annotate a candidate record pair as representing the same entity, the pair is determined to represent the same entity; if only a few, or fewer than half, of the crowdsourcing users do so, the pair is determined not to represent the same entity. If it is determined according to the second judgment result that the first candidate record pair represents the same entity, the new record is added to the first cluster subset to which the first record belongs, and the label set of the first cluster subset is updated. If it is determined according to the second judgment result that none of the candidate record pairs represents the same entity, a new cluster subset is created for the new record, and a label set is created for the new cluster subset.
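Majority voting over crowdsourced annotations, followed by placing the new record, can be sketched as below. The vote format (a list of booleans per candidate cluster), the naming of the new cluster subset, and the function names are assumptions; updating or creating the label set is omitted.

```python
def majority_says_same(votes):
    """True when more than half of the votes say the pair is the same entity."""
    return sum(votes) * 2 > len(votes)

def place_new_record(new_record, candidate_votes, clusters):
    """Add the record to the first cluster whose candidate pair wins the vote,
    otherwise create a new cluster subset for it. Returns the chosen cluster id."""
    for cluster_id, votes in candidate_votes.items():
        if majority_says_same(votes):
            clusters[cluster_id].append(new_record)
            return cluster_id
    new_id = f"new-{len(clusters)}"  # hypothetical naming scheme for new subsets
    clusters[new_id] = [new_record]
    return new_id
```

Under this rule a tie counts as "not the same entity", because the answer must receive strictly more than half of the votes.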

In this embodiment of the present invention, crowdsourcing-based hierarchical clustering is performed on the original records in the database to obtain at least two cluster subsets. Further, when it is detected that a new record is added to the database, the feature information of the new record is obtained. Further, at least two relevant cluster subsets most relevant to the new record are obtained from the at least two cluster subsets according to the subset information of the at least two cluster subsets and the feature information of the new record, where the subset information includes the label set information and index information of the cluster subsets. Further, the candidate record pair corresponding to each of the at least two relevant cluster subsets is determined according to the similarities between the new record and the records in the at least two relevant cluster subsets. Further, whether at least one candidate record pair represents the same entity is judged by means of crowdsourced user annotation: if it is determined that the first candidate record pair represents the same entity, the new record is added to the first cluster subset to which the first record belongs, and the label set of the first cluster subset is updated; if it is determined that none of the candidate record pairs represents the same entity, a new cluster subset is created for the new record, and a label set is created for the new cluster subset. The first record and the new record form the first candidate record pair. Entity resolution can thus be performed on both static and dynamic data sets, and relatively high recall and precision are achieved at relatively low cost, which improves resolution efficiency.

In this embodiment of the present invention, in step A, the distance between each pair of initial cluster subsets is calculated, and the pair of initial cluster subsets with the smallest distance is selected as the two candidate merge subsets. The pair with the smallest distance is the pair for which a record pair drawn from the two initial cluster subsets has the greatest probability of representing the same entity. Optionally, calculating the distance between each pair of initial cluster subsets includes: selecting, from each pair of initial cluster subsets, the record pair (r_i, r_j) with the greatest probability of representing the same entity, where r_i ∈ C_i, r_j ∈ C_j, C_i is one initial cluster subset of the pair, and C_j is the other initial cluster subset of the pair; and further obtaining the distance between the pair of initial cluster subsets according to a formula in which maxSimi is the probability that the record pair (r_i, r_j) represents the same entity and cosinSimi is the cosine similarity of the pair of initial cluster subsets. Optionally, the distance between each pair of initial cluster subsets may also be calculated in other ways, which are not described here.

Further, in step B, it is judged whether the distance between the two candidate merge subsets is less than the lower threshold (that is, whether the probability that a record pair drawn from the two candidate merge subsets represents the same entity is greater than the lower probability threshold). If the distance between the two candidate merge subsets is less than the lower threshold (that is, the probability that a record pair drawn from the two candidate merge subsets represents the same entity is greater than the lower probability threshold), one second record is selected from each of the two candidate merge subsets to form a second candidate record pair, and the second candidate record pair and the label sets of the two candidate merge subsets are sent to the crowdsourcing platform, so that the platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; for example, the platform indicates whether the second candidate record pair represents the same entity by annotating the pair. The second candidate record pair is the record pair, drawn from the two candidate merge subsets, with the greatest probability of representing the same entity. For example, the record pair (r₁, r₂) with the greatest probability of representing the same entity (that is, the second candidate record pair) is selected from the two candidate merge subsets C₁ and C₂, where r₁ is the second record in C₁ and r₂ is the second record in C₂.

Further, in step C, the first judgment result returned by the crowdsourcing platform is received (for example, the platform's annotation of whether the second candidate record pair represents the same entity, together with the number of likes given to the labels in the label sets of the two candidate merge subsets). According to the first judgment result, it is determined whether to merge the two candidate merge subsets, and the labels in the label sets are sorted and/or filtered according to the number of likes they received on the crowdsourcing platform. Because the crowdsourcing platform may include multiple crowdsourcing users, each of whom can annotate whether the second candidate record pair represents the same entity, the first judgment result consists of the annotations of multiple crowdsourcing users on whether the second candidate record pair represents the same entity and of the like counts for the labels in the label sets of the two candidate merge subsets. Optionally, a voting algorithm can be used to aggregate the crowdsourced results, that is, the answer that receives more than half of the votes is selected as the result: if more than half of the crowdsourcing users annotate the second candidate record pair as representing the same entity, the pair is determined to represent the same entity; if only a few, or fewer than half, of the crowdsourcing users do so, the pair is determined not to represent the same entity. If it is determined according to the first judgment result that the two candidate merge subsets represent the same entity, the two candidate merge subsets are merged into one cluster subset, the label set and index of that cluster subset are updated, and the merged cluster subset is treated as an initial cluster subset. If it is determined according to the first judgment result that the two candidate merge subsets do not represent the same entity, the distance between the two candidate merge subsets is set to 1. Steps A to C are then repeated until the distance between the two candidate merge subsets is greater than the lower threshold, which means that the probability that a record pair drawn from the two candidate merge subsets represents the same entity is less than the lower probability threshold. Because the candidate merge subsets are always the pair of initial cluster subsets with the smallest distance, if the distance between the two candidate merge subsets is greater than the lower threshold, the distance between every other pair of initial cluster subsets must also be greater than the lower threshold, that is, no two initial cluster subsets represent the same entity. The clustering algorithm therefore stops, and the at least two initial cluster subsets existing at that time are taken as the at least two cluster subsets obtained. In this way the initial cluster subsets are iteratively and hierarchically merged into larger cluster subsets in a definite order.

Fig. 2 is a schematic flowchart of Embodiment 2 of the entity resolution method based on crowd computing according to the present invention. On the basis of the foregoing embodiment, the step of performing crowdsourcing-based hierarchical clustering on the original records in the database to obtain at least two cluster subsets is described in detail. As shown in Fig. 2, the method of this embodiment may include:

S201: Obtain the probability that each original record pair represents the same entity, and perform step S202.

S202: Group the original record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form corresponding initial cluster subsets, create a label set and an index for each initial cluster subset, and perform step S203.

S203: Calculate the distance between each pair of initial cluster subsets, select the pair of initial cluster subsets with the smallest distance as the two candidate merge subsets, and perform step S204.

S204: Judge whether the distance between the two candidate merge subsets is less than the lower threshold. If so, perform step S205; if not (that is, the distance between the two candidate merge subsets is greater than the lower threshold), perform step S206.

S205: Select one second record from each of the two candidate merge subsets to form a second candidate record pair, and send the second candidate record pair and the label sets of the two candidate merge subsets to the crowdsourcing platform, so that the platform judges whether the second candidate record pair represents the same entity and whether to like the labels in the label sets; then perform step S207. The second candidate record pair is the record pair, drawn from the two candidate merge subsets, with the greatest probability of representing the same entity.

S206: Stop clustering, and take the at least two initial cluster subsets as the at least two cluster subsets obtained.

S207: Receive the first judgment result returned by the crowdsourcing platform, determine according to the first judgment result whether to merge the two candidate merge subsets, and sort and/or filter the labels in the label sets according to the number of likes they received on the crowdsourcing platform. If it is determined according to the first judgment result that the two candidate merge subsets represent the same entity, perform step S208; if it is determined that they do not represent the same entity, perform step S209.

S208: Merge the two candidate merge subsets into one cluster subset, update the label set and index of that cluster subset, treat the merged cluster subset as an initial cluster subset, and return to step S203.

S209: Set the distance between the two candidate merge subsets to 1, and return to step S203.
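The loop formed by steps S203 to S209 can be sketched as follows. Because the distance formula is not reproduced in this text, the sketch takes the distance function as a parameter; the crowdsourcing platform is likewise represented by a caller-supplied function. These parameters, all names, and the assumption that the lower threshold does not exceed 1 are hypothetical; label-set and index updates are omitted.

```python
import itertools

def hierarchical_merge(clusters, distance, crowd_says_same, lower_threshold):
    """Repeatedly merge the closest pair of clusters while the crowd agrees.

    distance(c1, c2) -> float in [0, 1]; crowd_says_same(c1, c2) -> bool.
    Assumes lower_threshold <= 1 and hashable records.
    """
    clusters = [list(c) for c in clusters]
    rejected = set()  # pairs the crowd refused to merge; their distance becomes 1

    def pair_key(i, j):
        return frozenset((tuple(clusters[i]), tuple(clusters[j])))

    def dist(i, j):
        return 1.0 if pair_key(i, j) in rejected else distance(clusters[i], clusters[j])

    while len(clusters) > 1:
        # S203: pick the pair of clusters with the smallest distance.
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: dist(*p))
        # S204 -> S206: stop once the closest pair is not below the threshold.
        if not dist(i, j) < lower_threshold:
            break
        # S205 and S207: ask the crowd about the candidate merge subsets.
        if crowd_says_same(clusters[i], clusters[j]):
            merged = clusters[i] + clusters[j]  # S208
            clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        else:
            rejected.add(pair_key(i, j))  # S209: distance set to 1
    return clusters
```

Setting a rejected pair's distance to 1 guarantees that it is never proposed again while the threshold is at most 1, so the loop terminates.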

Fig. 3 is a schematic flowchart of Embodiment 3 of the entity resolution method based on crowd computing according to the present invention. On the basis of the foregoing embodiments, as shown in Fig. 3, the method of this embodiment may include:

S301: When it is detected that a new record is added to the database, obtain the feature information of the new record.

S302: Obtain, from the at least two cluster subsets, at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record.

S303: Determine the candidate record pair corresponding to each of the at least two relevant cluster subsets, and send all the candidate record pairs to the crowdsourcing platform, so that the platform judges whether each candidate record pair represents the same entity.

S304: Receive the second judgment result returned by the crowdsourcing platform, and determine according to the second judgment result whether at least one candidate record pair represents the same entity. If it is determined according to the second judgment result that the first candidate record pair represents the same entity, perform step S305; if it is determined that none of the candidate record pairs represents the same entity, perform step S306.

S305: Add the new record to the first cluster subset to which the first record belongs, and update the label set of the first cluster subset.

S306: Create a new cluster subset for the new record, and create a label set for the new cluster subset.

In this embodiment of the present invention, the at least two cluster subsets in step S302 are the cluster subsets that already exist in the database when it is detected that a new record is added to the database.

Fig. 4 is a schematic structural diagram of Embodiment 1 of the entity resolution device based on crowd computing according to the present invention. As shown in Fig. 4, the entity resolution device 40 based on crowd computing provided in this embodiment may include: a hierarchical clustering module 401, a detection module 402, a first determination module 403, a second determination module 404, and a division module 405.

The hierarchical clustering module 401 is configured to perform crowdsourcing-based hierarchical clustering on the original records in the database to obtain at least two cluster subsets.

The detection module 402 is configured to obtain the feature information of a new record when it is detected that the new record is added to the database.

The first determination module 403 is configured to obtain, from the at least two cluster subsets, at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record, where the subset information of the at least two cluster subsets includes the label set information and index information of the cluster subsets.

The second determination module 404 is configured to determine the candidate record pair corresponding to each of the at least two relevant cluster subsets according to the similarities between the new record and the records in the at least two relevant cluster subsets.

The division module 405 is configured to judge, by means of crowdsourced user annotation, whether at least one candidate record pair represents the same entity; if it is determined that the first candidate record pair represents the same entity, add the new record to the first cluster subset to which the first record belongs, and update the label set of the first cluster subset; if it is determined that none of the candidate record pairs represents the same entity, create a new cluster subset for the new record, and create a label set for the new cluster subset. The first record and the new record form the first candidate record pair.

Optionally, the hierarchical clustering module includes:

An initial clustering unit, configured to group, according to the probability that each pair of original records represents the same entity, the original record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form corresponding initial cluster subsets, and to create a label set and an index for each initial cluster subset, where each pair of original records forms an original record pair;

A hierarchical clustering unit, configured to merge the initial cluster subsets hierarchically, level by level, by means of crowdsourced user annotation until the minimum distance between the merged cluster subsets is greater than the lower threshold, so that at least two cluster subsets are finally obtained.

Optionally, the initial clustering unit is specifically configured to:

Obtain the probability that each original record pair represents the same entity;

Optionally, the hierarchical clustering unit is specifically configured to:

Optionally, the initial clustering unit is further specifically configured to:

Optionally, the hierarchical clustering unit is further specifically configured to:

Optionally, the first determination module is specifically configured to:

Optionally, the second determination module is specifically configured to:

Optionally, the division module is specifically configured to:

The entity resolution device based on crowd computing of this embodiment may be used to carry out the technical solution of any of the foregoing embodiments of the entity resolution method based on crowd computing of the present invention. Its implementation principle and technical effect are similar and are not described again here.

A person of ordinary skill in the art will understand that all or part of the steps of the foregoing method embodiments may be carried out by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.

Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An entity resolution method based on crowd computing, characterized by comprising:

2. The method according to claim 1, characterized in that performing crowdsourcing-based hierarchical clustering on the original records in the database to obtain at least two cluster subsets comprises:

3. The method according to claim 2, characterized in that grouping, according to the probability that each pair of original records represents the same entity, the original record pairs whose probability of representing the same entity is greater than the upper probability threshold into one class to form corresponding initial cluster subsets comprises:

Obtaining the probability that each original record pair represents the same entity;

4. The method according to claim 2, characterized in that merging the initial cluster subsets hierarchically, level by level, by means of crowdsourced user annotation until the minimum distance between the merged cluster subsets is greater than the lower threshold, so that at least two cluster subsets are finally obtained, comprises:

5. The method according to claim 3, characterized in that obtaining the probability that each original record pair represents the same entity comprises:

6. The method according to claim 4, characterized in that calculating the distance between each pair of initial cluster subsets comprises:

7. The method according to any one of claims 1 to 6, characterized in that obtaining, from the at least two cluster subsets, at least two relevant cluster subsets most relevant to the new record according to the subset information of the at least two cluster subsets and the feature information of the new record comprises:

8. The method according to any one of claims 1 to 6, characterized in that determining the candidate record pair corresponding to each of the at least two relevant cluster subsets according to the similarities between the new record and the records in the at least two relevant cluster subsets comprises:

9. The method according to any one of claims 1 to 6, characterized in that judging, by means of crowdsourced user annotation, whether at least one candidate record pair represents the same entity; if it is determined that the first candidate record pair represents the same entity, adding the new record to the first cluster subset to which the first record belongs, and updating the label set of the first cluster subset; and if it is determined that none of the candidate record pairs represents the same entity, creating a new cluster subset for the new record, and creating a label set for the new cluster subset, comprises:

10. An entity resolution device based on crowd computing, characterized by comprising: