CN105677757B

CN105677757B - A big data similarity join method based on dual-affix filtering

Info

Publication number: CN105677757B
Application number: CN201511020637.8A
Authority: CN
Inventors: 王国仁; 邓诗卓; 信俊昌; 聂铁铮; 赵相国; 季航旭
Original assignee: Northeastern University China
Current assignee: Beijing Institute of Technology BIT
Priority date: 2015-12-30
Filing date: 2015-12-30
Publication date: 2019-03-26
Anticipated expiration: 2035-12-30
Also published as: CN105677757A

Abstract

The present invention provides a big data similarity join method based on dual-affix (prefix and suffix) filtering, comprising: extracting document-type data from different data sources to obtain the entity records to be cleaned; counting the word frequency of the elements in the entity records and sorting the elements of each entity record in ascending order of word frequency; using each prefix element of an entity record as an index of that record and building an inverted index table over the entity records; and performing a dual-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, in a manner that supports distributed computing. The invention filters by using the position information of elements in the prefixes and suffixes of an entity record pair, which significantly reduces the size of the candidate set; dual-affix filtering achieves good running time for data sources of different sizes and for different thresholds. Dual-affix filtering can also be carried out as distributed computation over big data, which improves the efficiency of big data cleaning.

Description

A big data similarity join method based on dual-affix filtering

Technical field

The invention belongs to the field of data mining technology, and in particular relates to a big data similarity join method based on dual-affix filtering.

Background art

With the arrival of the big data era, not all of the massive data on the Internet is useful information, and techniques for extracting useful information from massive data are attracting growing attention. Data from different sources must be integrated correctly before data mining can uncover and analyze the value it contains. Similarity join has become an essential method for data integration and cleaning. Unlike an equi-join, a similarity join computes the similarity of two records and joins the record pairs that satisfy a threshold condition. Current similarity join techniques generally have two stages, a filtering stage and a confirmation (verification) stage. The techniques differ mainly in the filtering stage, where different filtering rules discard record pairs that certainly cannot satisfy the similarity threshold, which improves join efficiency. In the confirmation stage, the record pairs that could not be filtered out directly, that is, the pairs that may satisfy the threshold condition, are evaluated to obtain the final similar record pairs and complete the similarity join. The existing literature proposes prefix filtering over ordered records: all elements of a record are sorted in a fixed order, the prefix length is computed by formula to obtain the prefix, and an inverted index is built; all records in the inverted list of the same element become candidates for one another. Other techniques are also used in the filtering stage, such as filtering by the lengths of the records and filtering by the positions of elements within the records.

Summary of the invention

The object of the present invention is to provide a big data similarity join method based on dual-affix filtering.

The technical solution of the present invention is as follows:

A big data similarity join method based on dual-affix filtering, comprising the following steps:

Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;

Step 2: word frequency statistics: count the word frequency of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;

Step 3: use each prefix element of an entity record as an index of that record, and build an inverted index table over the entity records;

Step 4: perform a dual-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, in a manner that supports distributed computing.

Step 4-1: perform the dual-affix filtering similarity join on the entity record pairs within the same index: compute the prefix length of each ascending-sorted entity record and divide the record into a prefix and a suffix; use the position information of the elements in the prefix and suffix to compute the similarity upper bound of each entity record pair; compare the upper bound with the similarity threshold; and retain the entity record pairs whose upper bound is greater than the similarity threshold, which accomplishes the filtering;

Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold, completing the big data similarity join based on dual-affix filtering.

Beneficial effects:

The invention filters by using the position information of elements in the prefixes and suffixes of an entity record pair, which significantly reduces the size of the candidate set. Dual-affix filtering achieves good running time for data sources of different sizes and for different thresholds. Dual-affix filtering can also be carried out as distributed computation over big data, which improves the efficiency of big data cleaning.

Brief description of the drawings

Fig. 1 is a schematic diagram of the big data cleaning application based on the dual-affix filtering similarity join of the present invention;

Fig. 2 shows the first case of dual-affix filtering in the specific embodiment of the present invention;

Fig. 3 shows the second case of dual-affix filtering in the specific embodiment of the present invention;

Fig. 4 shows the third case of dual-affix filtering in the specific embodiment of the present invention;

Fig. 5 shows the sorting of records and the extraction of record prefixes in the specific embodiment of the present invention;

Fig. 6 is a schematic diagram of distributed dual-affix filtering in the specific embodiment of the present invention;

Fig. 7 is a flow chart of the big data similarity join method based on dual-affix filtering in the specific embodiment of the present invention.

Detailed description of the embodiments

Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.

At present, enterprises facing rapidly growing data often need to find useful, valuable data in massive, heterogeneous, uncertain data sets. Similarity join refers to finding, in one or more data sources, the data that satisfy a similarity definition; it is used for operations such as data cleaning and data integration. For example, a social network can recommend some users as friends from a large virtual population according to user interests and friend relationships; intellectual property checking can run similarity queries over papers in a large number of different databases worldwide; and some users need to determine whether different institution names that appear during integration actually refer to the same institution: when retrieving "Northeastern University", for instance, one must determine whether the school is the one in China, Japan or the United States. Solving these problems requires similarity join or similarity query techniques. The big data cleaning application based on the dual-affix filtering similarity join is shown in Fig. 1.

As an important method for entity recognition, similarity join finds similar entities by gathering similar records together. In data cleaning and integration, similarity is computed for pairs of entity records from different data sources with a known similarity function to obtain the record pairs that satisfy the similarity threshold. After the similarity join has produced the record pairs that satisfy the threshold, a graph is built from the entity similarities using existing techniques, and related applications are implemented on the basis of its subgraphs.

A big data similarity join method based on dual-affix filtering, as shown in Fig. 7, comprises the following steps:

The purpose of the word frequency statistics is to support computing the prefix length of each entity record; ordered entity records are the prerequisite for the dual-affix filtering computation. All entity records are segmented (split according to semantics), which breaks each entity record into multiple elements. The word frequencies of the elements are counted, and the elements are arranged in ascending order of word frequency to form the set for each record.
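The ordering step can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: whitespace tokenization, counting each element once per record, and the alphabetical tie-break are all assumptions, and the name `order_by_frequency` is hypothetical.

```python
from collections import Counter

def order_by_frequency(records):
    """Sort the elements of every record by ascending global frequency.

    `records` maps a record id to its list of tokens. Each token is counted
    once per record (assumption), and ties are broken alphabetically
    (assumption) so that all records share one global order.
    """
    freq = Counter(tok for toks in records.values() for tok in set(toks))
    global_order = sorted(freq, key=lambda t: (freq[t], t))
    rank = {tok: i for i, tok in enumerate(global_order)}
    return {rid: sorted(set(toks), key=rank.__getitem__)
            for rid, toks in records.items()}

# Hypothetical input: two short records tokenized on whitespace.
records = {
    1: "she likes database and java".split(),
    2: "he likes java and music".split(),
}
ordered = order_by_frequency(records)
```

Rare elements end up at the front of each record, so the prefix built from them tends to be selective.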

As shown in Fig. 5, sequence G is the element list produced by segmenting all records and arranging the elements from low to high word frequency. For example, the record with id 1 becomes {A, B, D, E, F} after sorting. For a reordered record, the prefix length is computed according to the defined similarity function. Assuming Jaccard similarity is used with threshold t = 0.8, the prefix length of this record is 2, and the record forms two index entries keyed by A and B respectively.
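A small Python sketch of the inverted index for this example is given below. The prefix length 2 is taken from the text; the function name `build_inverted_index` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

def build_inverted_index(ordered_records, prefix_lengths):
    """Map every prefix element to the ids of the records whose prefix has it."""
    index = defaultdict(list)
    for rid, tokens in ordered_records.items():
        for tok in tokens[:prefix_lengths[rid]]:
            index[tok].append(rid)
    return dict(index)

# Record 1 from the text, already sorted by the global order G,
# with the prefix length 2 stated for Jaccard threshold t = 0.8.
ordered_records = {1: ["A", "B", "D", "E", "F"]}
index = build_inverted_index(ordered_records, {1: 2})
```

With more records, every id that shares a prefix element lands in the same list, and those lists are where candidate pairs come from.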

Step 4-1: perform the dual-affix filtering similarity join on the entity record pairs within the same index: compute the prefix length of each ascending-sorted entity record and divide the record into a prefix and a suffix; use the position information of the elements (tokens) in the prefix and suffix to compute the similarity upper bound of each entity record pair; compare the upper bound with the similarity threshold; and retain the entity record pairs whose upper bound is greater than the similarity threshold, which accomplishes the filtering;

A similarity measure is chosen: given two records s and r, the similarity function is defined as sim(s, r).

A similarity function generally has the following properties:

(1) 0 ≤ sim(s, r) ≤ 1

(2) sim(s, r) = sim(r, s)

When sim(s, r) = 0, the similarity of the two entity records is 0, meaning that they form a dissimilar entity record pair.

When sim(s, r) = 1, the two entity records are completely similar and can be taken as two records of the same entity.

When 0 < sim(s, r) < 1, the two entity records are partially similar: the larger the similarity value, the more similar they are, and the smaller the value, the less similar.

Segmenting a record yields the set corresponding to that record. For sets S and R, the different similarity functions are listed in Table 1:

Table 1: Set similarity functions

This embodiment chooses the set overlap similarity (Overlap). Each entity record is converted into a set, and the similarity between sets is computed. For example, let entity records s and r be the sentences "She likes database and java" and "He likes java and music". Converted to sets, S = {She, likes, database, and, java} and R = {He, likes, java, and, music}, so |R ∩ S| = 3 and the similarity of s and r is 3. Other set similarities can be converted to the set overlap similarity by algebraic transformation.
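The worked example can be checked with a few lines of Python. Lower-casing and whitespace splitting are assumptions made here to tokenize the two sentences; the `jaccard` helper uses the standard definition |S ∩ R| / |S ∪ R| and is included only for comparison.

```python
def overlap(s_tokens, r_tokens):
    """Set overlap similarity: the number of shared elements."""
    return len(set(s_tokens) & set(r_tokens))

def jaccard(s_tokens, r_tokens):
    """Jaccard similarity: shared elements divided by all distinct elements."""
    s, r = set(s_tokens), set(r_tokens)
    union = s | r
    return len(s & r) / len(union) if union else 0.0

# The two sentences from the text, tokenized on whitespace (assumption).
s = "She likes database and java".lower().split()
r = "He likes java and music".lower().split()

shared = overlap(s, r)   # elements in common: likes, and, java
ratio = jaccard(s, r)    # 3 shared out of 7 distinct elements
```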

The prefix length depends on the threshold of the similarity function. If the set overlap similarity (Overlap) is chosen and the similarity threshold is α, the prefix length of entity record s is prefix(s) = |s| − α + 1. If Jaccard similarity is chosen and the similarity threshold is t, the prefix length of record s is prefix(s) = |s| − t|s| + 1. By the pigeonhole principle, if the prefixes of two entity records intersect, the two records form a candidate pair for the similar entity record pairs. Dual-affix filtering is built on this basis.
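The two prefix-length formulas can be sketched as below. The source states the Jaccard formula as |s| − t|s| + 1; rounding t|s| up when it is not an integer is an assumption added here, as is the small tolerance against floating-point error. The record size 19 in the usage lines is a hypothetical value chosen so that α = 14 yields the prefix length 6 used in the figures.

```python
import math

def overlap_prefix_length(size, alpha):
    """prefix(s) = |s| - alpha + 1 for an Overlap threshold alpha."""
    return size - alpha + 1

def jaccard_prefix_length(size, t):
    """prefix(s) = |s| - t|s| + 1 for a Jaccard threshold t.

    t * size is rounded up when it is not an integer (assumption); the
    1e-9 tolerance keeps values such as 0.8 * 5 from rounding up to 5.
    """
    return size - math.ceil(t * size - 1e-9) + 1

# Record {A, B, D, E, F} with t = 0.8, as in the text.
p_jaccard = jaccard_prefix_length(5, 0.8)
# Hypothetical 19-element record with alpha = 14.
p_overlap = overlap_prefix_length(19, 14)
```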

As shown in Fig. 2, assume the set overlap similarity (Overlap) is used with similarity threshold α = 14. The similarity upper bound of entity records s and r is estimated from the positional relationship of identical elements between the prefix of s and the suffix of r. By calculation, the prefix lengths of entity records s and r are both 6; the prefixes are denoted p₁ and p₂ and the suffixes q₁ and q₂ respectively. Element B is the last common element of p₁ and p₂. The first element after B in prefix p₁, element C, is taken as the scan element, and the suffix of r is scanned to obtain the position of C in q₂. Because s and r are both sorted against the same global order G (for example, ascending word frequency), every element that follows C in p₁ must also follow C in G. Likewise, every element after C in the suffix of r also follows C in G. It follows that the elements of r lying after B and before C do not appear in s, so this part cannot contribute to the similarity upper bound upbound. For the elements after C in s and in r, the maximum possible overlap is the smaller of the two element (token) counts. The resulting upper bound of entity records s and r is given by the following formula.

upbound(s, r) = prec + 1 + min(|k₁|, |k₂|)

where prec is the number of common elements in the prefixes, k₁ is the length of the suffix of entity record s, and k₂ is the length of the suffix of entity record r, in each case counting the elements that follow the scan element, as described above;

Taking Fig. 2 as an example, upbound(s, r) = 2 + 1 + min(16, 7) = 10 < 14, so s and r certainly cannot satisfy the similarity threshold and the pair is filtered out.

Similarly, the second case arises when the scan element of entity record s does not exist in entity record r. As shown in Fig. 3, the first element after the common elements in the prefix of r, element D, is taken as the scan element and the suffix of s is scanned, giving

upbound(s, r) = 2 + 1 + min(14, 9) = 12 < 14, which also fails the condition, so this pair is likewise filtered out.

If neither the scan element of s nor that of r is found in the other record, as shown in Fig. 4, the similarity upper bound formula is upbound(s, r) = prec + 0 + min(|k₁|, |k₂|)

so upbound(s, r) = 2 + 0 + min(14, 16) = 16 > 14, and the pair is retained as a candidate.
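The three cases reduce to the same arithmetic, which the following Python sketch reproduces with the numbers read from Figs. 2 to 4. It only evaluates the formula; locating the scan element in the records is left out, and the names `upbound`, `survives_filter` and `scan_hit` are hypothetical.

```python
def upbound(prec, scan_hit, k1, k2):
    """Similarity upper bound prec + (1 or 0) + min(k1, k2).

    prec     -- number of common elements in the two prefixes
    scan_hit -- True when the scan element is found in the other record
    k1, k2   -- element counts of the remaining parts of s and r
    """
    return prec + (1 if scan_hit else 0) + min(k1, k2)

def survives_filter(prec, scan_hit, k1, k2, alpha):
    """Keep the pair only when its upper bound is greater than alpha."""
    return upbound(prec, scan_hit, k1, k2) > alpha

alpha = 14
case1 = upbound(2, True, 16, 7)     # Fig. 2
case2 = upbound(2, True, 14, 9)     # Fig. 3
case3 = upbound(2, False, 14, 16)   # Fig. 4
kept = [survives_filter(2, True, 16, 7, alpha),
        survives_filter(2, True, 14, 9, alpha),
        survives_filter(2, False, 14, 16, alpha)]
```

Only the third pair passes the filter and goes on to the exact similarity computation.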

Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold;

The constraints of other set similarity functions can ultimately be converted, by the corresponding mathematical calculation, into constraints on the overlap similarity. Take the Jaccard similarity function as an example. If the Jaccard similarity threshold is t, i.e. J(s, r) = |s ∩ r| / |s ∪ r| ≥ t, meaning that the Jaccard similarity of records s and r satisfies the threshold t, then the Overlap similarity of s and r must be at least t|s|, i.e. |s ∩ r| ≥ t|s|.
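The conversion from a Jaccard threshold to an Overlap constraint can be written out in one line. The step below uses only the standard definition of Jaccard similarity and the fact that the union of two sets is at least as large as either set; the symbol J is introduced here for the Jaccard similarity.

```latex
J(s, r) = \frac{|s \cap r|}{|s \cup r|} \ge t
\;\Longrightarrow\;
|s \cap r| \;\ge\; t\,|s \cup r| \;\ge\; t\,|s|,
\qquad \text{since } |s \cup r| \ge |s| .
```

By symmetry the same argument gives |s ∩ r| ≥ t|r|, so the bound can be taken against either record.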

Confirmation stage: entity record pairs that have been filtered out are not considered further. The pairs that were not filtered out are confirmed: the defined similarity function is used to compute the similarity between the entity records in the candidate set, the similarity of each pair is compared with the similarity threshold, and the pairs whose similarity is greater than the threshold are output;

After the filtering in step 4-1, some entity record pairs have been filtered out, that is, judged unable to satisfy the similarity threshold, while others have not. The pairs that were not filtered out must be confirmed: the similarity of each pair is computed with the defined similarity function to decide whether the pair actually meets the condition.
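A minimal Python sketch of the confirmation stage follows. The records, the candidate list and the threshold are hypothetical values chosen for illustration, and the function name `confirm` is an assumption; the comparison uses "greater than", as stated in the text.

```python
def overlap(s_tokens, r_tokens):
    """Set overlap similarity: the number of shared elements."""
    return len(set(s_tokens) & set(r_tokens))

def confirm(candidates, records, similarity, threshold):
    """Compute the exact similarity of each surviving candidate pair and
    keep only the pairs whose similarity is greater than the threshold."""
    confirmed = []
    for a, b in candidates:
        score = similarity(records[a], records[b])
        if score > threshold:
            confirmed.append((a, b, score))
    return confirmed

# Hypothetical records and candidate pairs left over after filtering.
records = {
    1: ["A", "B", "C", "D"],
    2: ["A", "B", "C", "E"],
    3: ["A", "F", "G", "H"],
}
pairs = confirm([(1, 2), (1, 3)], records, overlap, threshold=2)
```

Passing the similarity function as an argument keeps the sketch usable with Jaccard or any other measure that has been converted to the same threshold form.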

As shown in Fig. 6, dual-affix filtering is applied to the entity record pairs within the same index to carry out the filtering. The index follows the key-value model, with the index entry (Index) acting as the key, which makes distributed computing possible; many distributed computing platforms, such as Hadoop and Spark, use the key-value model. For example, the value of the index keyed by C contains the records with ids 2, 3 and 4; applying dual-affix filtering to these three records filters out the record pair (2, 4). The confirmation stage then computes the similarity of the remaining entity record pairs to obtain the final similar entity record pairs.
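The key-value grouping can be imitated in plain Python without any cluster framework. In the sketch below, `map_phase` and `reduce_phase` are hypothetical names that stand in for the map and reduce steps of a platform such as Hadoop or Spark, and the token contents of records 2, 3 and 4 are invented; only the shared key C comes from the text.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(ordered_records, prefix_lengths):
    """Emit one (prefix element, record id) pair per prefix element."""
    for rid, tokens in ordered_records.items():
        for tok in tokens[:prefix_lengths[rid]]:
            yield tok, rid

def reduce_phase(key_value_pairs):
    """Group record ids by key and emit each candidate pair exactly once."""
    groups = defaultdict(list)
    for key, rid in key_value_pairs:
        groups[key].append(rid)
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(set(rids)), 2))
    return candidates

# Hypothetical records that all carry C as their only prefix element.
ordered_records = {
    2: ["C", "X", "M"],
    3: ["C", "Y", "M"],
    4: ["C", "Z", "M"],
}
prefix_lengths = {2: 1, 3: 1, 4: 1}
candidates = reduce_phase(map_phase(ordered_records, prefix_lengths))
```

Each group can be processed independently, so the dual-affix filter and the confirmation stage would run per key on whichever worker holds that group.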

From the result of step 4, related applications can be further implemented: a graph is built from the similarities of the similar record pairs according to the prior art, and entity recognition is carried out.

Claims

1. A big data similarity join method based on dual-affix filtering, comprising the following steps:

Step 2: word frequency statistics: count the word frequency of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;

Step 3: use each prefix element of an entity record as an index of that record, and build an inverted index table over the entity records;

Step 4: perform a dual-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, in a manner that supports distributed computing;

Step 4-1: perform the dual-affix filtering similarity join on the entity record pairs within the same index: compute the prefix length of each ascending-sorted entity record and divide the record into a prefix and a suffix; use the position information of the elements in the prefix and suffix to compute the similarity upper bound of each entity record pair; compare the upper bound with the similarity threshold; and retain the entity record pairs whose upper bound is greater than the similarity threshold, which accomplishes the filtering;

Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold, completing the big data similarity join based on dual-affix filtering;

characterized in that, in step 4-1, the similarity upper bound of an entity record pair is computed from the position information of the elements in the prefix and suffix by the following formula:

upbound(s, r) = prec + 1 + min(|k₁|, |k₂|)

where upbound(s, r) is the similarity upper bound of entity records s and r, prec is the number of common elements in the prefixes, k₁ is the length of the suffix of entity record s, and k₂ is the length of the suffix of entity record r.