CN105677757B - A big data similarity join method based on double-affix filtering - Google Patents

Info

Publication number
CN105677757B
Authority
CN
China
Prior art keywords
similarity
entity record
record
entity
filtering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201511020637.8A
Other languages
Chinese (zh)
Other versions
CN105677757A (en)
Inventor
王国仁
邓诗卓
信俊昌
聂铁铮
赵相国
季航旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201511020637.8A
Publication of CN105677757A
Application granted
Publication of CN105677757B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a big data similarity join method based on double-affix (prefix and suffix) filtering, comprising: extracting document-type data from different data sources to obtain the entity records to be cleaned; counting the word frequency of the elements in the entity records and sorting the elements of each entity record in ascending order of word frequency; using each prefix element of an entity record as an index key for that record and building an inverted index table over the entity records; and performing a double-affix filtering similarity join on the entity record pairs under the same index key to obtain the entity record pairs whose similarity is greater than the similarity threshold, thereby realizing distributed computing. The invention filters by using the position information of elements in the prefixes and suffixes of an entity record pair, which greatly reduces the size of the candidate set. Double-affix filtering achieves good running times for data sources of different sizes and for different thresholds. It also supports distributed computing over big data, which improves the efficiency of big data cleaning.

Description

A big data similarity join method based on double-affix filtering
Technical field
The invention belongs to the field of data mining technology, and in particular relates to a big data similarity join method based on double-affix filtering.
Background art
With the arrival of the big data era, not all of the massive data on the Internet is useful, and techniques for extracting useful information from massive data are attracting more and more attention. Data from different sources must be integrated correctly so that data mining can uncover and analyze the great value they hold. Similarity join has become an indispensable method for data integration and cleaning. Unlike an equi-join, a similarity join computes the similarity of two records and joins the record pairs that satisfy a threshold condition. Current similarity join techniques consist mainly of two stages, a filtering stage and a confirmation stage. The techniques differ mainly in the filtering stage, where different filtering rules discard record pairs that certainly cannot satisfy the similarity threshold, which improves join efficiency. The confirmation stage then computes the similarity of the record pairs that could not be filtered out directly, that is, the pairs that may satisfy the threshold condition, obtains the final similar record pairs, and completes the similarity join. The existing literature proposes prefix filtering over ordered records: all elements of each record are sorted according to a fixed order, a prefix length is computed by formula to obtain the prefix, and an inverted index is built; all records listed under the same element of the inverted index become candidate matches for one another. Other techniques are also used in the filtering stage, such as filtering by the lengths of the records and filtering by the positions of elements within the records.
Summary of the invention
The purpose of the present invention is to provide a big data similarity join method based on double-affix filtering.
The technical solution of the present invention is as follows:
A big data similarity join method based on double-affix filtering, comprising the following steps:
Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;
Step 2: word frequency statistics: count the word frequency of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;
Step 3: use each prefix element of an entity record as an index key for that record, and build an inverted index table over the entity records;
Step 4: perform the double-affix filtering similarity join on the entity record pairs under the same index key to obtain the entity record pairs whose similarity is greater than the similarity threshold, realizing distributed computing.
Step 4-1: perform the double-affix filtering similarity join on the entity record pairs under the same index key: compute the prefix length of each sorted entity record and divide the record into a prefix and a suffix; use the position information of elements in the prefix and suffix to compute an upper bound on the similarity of each entity record pair; compare the upper bound with the similarity threshold and retain the entity record pairs whose upper bound is greater than the threshold, thereby realizing filtering;
Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold, completing the big data similarity join based on double-affix filtering.
Beneficial effects:
The invention filters by using the position information of elements in the prefixes and suffixes of an entity record pair, which greatly reduces the size of the candidate set. Double-affix filtering achieves good running times for data sources of different sizes and for different thresholds. It also supports distributed computing over big data, which improves the efficiency of big data cleaning.
Detailed description of the invention
Fig. 1 is a schematic diagram of the big data cleaning application based on the double-affix filtering similarity join of the present invention;
Fig. 2 shows the first case of double-affix filtering in the specific embodiment of the present invention;
Fig. 3 shows the second case of double-affix filtering in the specific embodiment of the present invention;
Fig. 4 shows the third case of double-affix filtering in the specific embodiment of the present invention;
Fig. 5 shows record sorting and prefix extraction in the specific embodiment of the present invention;
Fig. 6 is a schematic diagram of distributed double-affix filtering in the specific embodiment of the present invention;
Fig. 7 is a flowchart of the big data similarity join method based on double-affix filtering in the specific embodiment of the present invention.
Specific embodiment
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
At present, faced with rapidly growing data, enterprises often need to find useful and valuable data in massive, heterogeneous, uncertain data sets. Similarity join refers to finding, in one or more data sources, the data that satisfy a similarity definition, and it is used for operations such as data cleaning and data integration. For example, a social network can recommend some users as friends from a large virtual crowd according to user interests and friend relationships; intellectual property detection can run similarity queries over papers in a large number of different databases around the world; some users need to detect whether different institution names that appear during integration actually refer to the same institution, for example, when retrieving "Northeastern University" it is necessary to determine whether the school is the one in China, Japan, or the United States. Solving these problems requires similarity join or similarity query techniques. The big data cleaning application based on the double-affix filtering similarity join is shown in Fig. 1.
As an important method for entity recognition (Entity Recognition), similarity join finds similar entities by gathering similar records together. In data cleaning and integration, the similarity of entity record pairs from different data sources is computed with a known similarity function to obtain the record pairs that satisfy the similarity threshold. After the similarity join over the records that satisfy the threshold, a graph is built from the entity similarities using existing techniques, and related applications are realized on the basis of its subgraphs.
A big data similarity join method based on double-affix filtering, as shown in Fig. 7, comprises the following steps:
Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;
Step 2: word frequency statistics: count the word frequency of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;
Word frequency statistics serve to compute the prefix length of each entity record, and ordered entity records are the precondition for double-affix filtering. All entity records are tokenized (that is, split according to semantics), which breaks each entity record into multiple elements. The word frequency of the elements is counted, and the elements are arranged in ascending order of word frequency to form the set for each record.
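Step 2 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `sort_by_word_frequency`, the whitespace tokenization (the text describes semantic segmentation), the per-record counting, and the alphabetical tie-break are all assumptions made for the example.

```python
from collections import Counter

def sort_by_word_frequency(records):
    """Order each record's distinct tokens by ascending global frequency."""
    # Assumption: whitespace tokenization stands in for semantic segmentation.
    token_sets = {rid: set(text.lower().split()) for rid, text in records.items()}
    # Count in how many records each token occurs.
    freq = Counter(tok for toks in token_sets.values() for tok in toks)
    # Break frequency ties alphabetically so all records share one global order.
    return {
        rid: sorted(toks, key=lambda tok: (freq[tok], tok))
        for rid, toks in token_sets.items()
    }

# Hypothetical input built from the two example sentences of the description.
records = {
    1: "She likes database and java",
    2: "He likes java and music",
}
ordered = sort_by_word_frequency(records)
print(ordered[1])
print(ordered[2])
```

Rare tokens end up at the front of each record, which is what makes short prefixes selective when they are later used as index keys.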
Step 3: use each prefix element of an entity record as an index key for that record, and build an inverted index table over the entity records;
As shown in Fig. 5, the sequence G is the element list produced by tokenizing all records and ordering the elements from low to high word frequency. For example, the record with id 1 becomes {A, B, D, E, F} after sorting. For the reordered record, the prefix length is computed according to the chosen similarity function. Assuming Jaccard similarity with threshold t = 0.8, the prefix length of this record is 2, so the record produces two index entries, keyed by A and B respectively.
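The index construction of step 3 can be sketched as below, under stated assumptions: the helper names are hypothetical, the ceiling in the prefix length is an assumption for products t|s| that are not integers, and record 2 is invented for illustration (only record 1 comes from the text).

```python
import math
from collections import defaultdict

def jaccard_prefix_length(size, t):
    """Prefix length |s| - t|s| + 1 for Jaccard threshold t."""
    # Assumption: round t*|s| up; the small epsilon guards against float noise.
    return size - math.ceil(t * size - 1e-9) + 1

def build_inverted_index(ordered_records, t):
    """Map every prefix element to the ids of the records that contain it."""
    index = defaultdict(list)
    for rid, tokens in ordered_records.items():
        for tok in tokens[:jaccard_prefix_length(len(tokens), t)]:
            index[tok].append(rid)
    return dict(index)

ordered_records = {
    1: ["A", "B", "D", "E", "F"],  # record 1 from the description
    2: ["B", "C", "E", "F", "G"],  # hypothetical second record
}
index = build_inverted_index(ordered_records, 0.8)
print(index)
```

Records 1 and 2 meet under key B, so they would become a candidate pair for the filtering of step 4.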
Step 4: perform the double-affix filtering similarity join on the entity record pairs under the same index key to obtain the entity record pairs whose similarity is greater than the similarity threshold, realizing distributed computing.
Step 4-1: perform the double-affix filtering similarity join on the entity record pairs under the same index key: compute the prefix length of each sorted entity record and divide the record into a prefix and a suffix; use the position information of elements (tokens) in the prefix and suffix to compute an upper bound on the similarity of each entity record pair; compare the upper bound with the similarity threshold and retain the entity record pairs whose upper bound is greater than the threshold, thereby realizing filtering;
Choose a similarity measure: given two records s and r, define the similarity function sim(s, r).
A similarity function generally has the following properties:
(1) 0 ≤ sim(s, r) ≤ 1
(2) sim(s, r) = sim(r, s)
When sim(s, r) = 0, the similarity of the two entity records is 0, meaning that they form a dissimilar entity record pair.
When sim(s, r) = 1, the two entity records are completely similar and can be taken as two records of the same entity.
When 0 < sim(s, r) < 1, the two entity records are partially similar: the larger the value of the similarity function, the more similar they are, and the smaller the value, the less similar.
Tokenizing a record yields its corresponding set. For sets S and R, the different similarity functions are listed in Table 1:
Table 1: Set similarity functions
This embodiment uses the set overlap similarity (Overlap). Each entity record is converted to a set and the similarity between the sets is computed. For example, let entity records s and r represent the two sentences "She likes database and java" and "He likes java and music". Converted to sets, S = {She, likes, database, and, java} and R = {He, likes, java, and, music}, so |R ∩ S| = 3 and the similarity of s and r is 3. Other set similarity measures can be converted to the overlap similarity by algebraic transformation.
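The overlap computation in this example can be reproduced with a few lines of Python. The function name `overlap` and the lowercase, whitespace-based tokenization are assumptions for illustration.

```python
def overlap(s_tokens, r_tokens):
    """Set overlap similarity: the number of tokens shared by both records."""
    return len(set(s_tokens) & set(r_tokens))

# Assumption: lowercase whitespace tokenization of the two example sentences.
s = "She likes database and java".lower().split()
r = "He likes java and music".lower().split()

shared = sorted(set(s) & set(r))
print(shared, overlap(s, r))
```

The shared tokens are "and", "java", and "likes", which gives the overlap of 3 stated in the text.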
The prefix length depends on the threshold of the similarity function. With the set overlap similarity (Overlap) and a similarity threshold α, the prefix length of entity record s is prefix(s) = |s| - α + 1. With Jaccard similarity and a similarity threshold t, the prefix length of record s is prefix(s) = |s| - t|s| + 1. By the pigeonhole principle, two entity records that satisfy the threshold must share at least one prefix element; therefore two entity records whose prefixes intersect form a candidate similar entity record pair. Double-affix filtering is built on this basis.
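A small sketch of the Overlap prefix length and the prefix intersection test follows. The function names and the three sample records (tokens A to H, already in global order) are hypothetical and chosen only to show one pair that shares a prefix element and one that does not.

```python
def overlap_prefix_length(size, alpha):
    """Prefix length |s| - alpha + 1 for Overlap threshold alpha."""
    return max(size - alpha + 1, 0)

def prefixes_intersect(s, r, alpha):
    """Candidate test: do the prefixes of two ordered records share an element?"""
    prefix_s = set(s[:overlap_prefix_length(len(s), alpha)])
    prefix_r = set(r[:overlap_prefix_length(len(r), alpha)])
    return bool(prefix_s & prefix_r)

# Hypothetical records, each already sorted by the same global order.
s = ["A", "B", "C", "D", "E"]
r1 = ["B", "C", "D", "E", "F"]  # shares 4 elements with s
r2 = ["D", "E", "F", "G", "H"]  # shares only 2 elements with s

alpha = 3
print(prefixes_intersect(s, r1, alpha))  # prefixes {A,B,C} and {B,C,D} meet
print(prefixes_intersect(s, r2, alpha))  # prefixes {A,B,C} and {D,E,F} do not
```

The pair (s, r2) is never generated as a candidate, which is safe here because its true overlap of 2 is below the threshold of 3.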
As shown in Fig. 2, assume the set overlap similarity (Overlap) with similarity threshold α = 14. The upper bound on the similarity of entity records s and r is estimated from the positional relationship of identical elements between the prefix of entity record s and the suffix of entity record r. By calculation, the prefix lengths of entity records s and r are both 6; their prefixes are denoted p1 and p2, and their suffixes q1 and q2, respectively. Element B is the last common element of p1 and p2. The first element after B in prefix p1, element C, is taken as the scan element, and the suffix of r is scanned to obtain the position of C in q2. Because entity records s and r are both sorted according to the same global order G (for example, ascending word frequency), every element that follows C in p1 must also follow C in G. Likewise, every element that follows C in the suffix of entity record r also follows C in G. It follows that the elements of r located after B and before C do not appear in s, so this part cannot contribute to the similarity upper bound upbound. For the elements after C in entity records s and r, the maximum possible overlap is the smaller of the two element (token) counts. The final upper bound for entity records s and r is given by the following formula.
upbound(s, r) = prec + 1 + min(|k1|, |k2|)
where prec is the number of common elements in the prefixes, k1 is the length of the suffix of entity record s, and k2 is the length of the suffix of entity record r;
Taking Fig. 2 as an example, upbound(s, r) = 2 + 1 + min(16, 7) = 10 < 14, so s and r certainly cannot satisfy the similarity threshold and are filtered out.
Similarly, the second case arises when the scan element of entity record s does not exist in entity record r. As shown in Fig. 3, the first element after the common element in the prefix of entity record r, element D, is taken as the scan element, and the suffix of s is scanned, which gives
upbound(s, r) = 2 + 1 + min(14, 9) = 12 < 14. This pair also fails the condition and is filtered out.
If the scan elements of entity record s and entity record r both find no match in the other record, as shown in Fig. 4, the similarity upper bound formula becomes upbound(s, r) = prec + 0 + min(|k1|, |k2|),
so upbound(s, r) = 2 + 0 + min(14, 16) = 16 > 14, and the pair is retained.
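The arithmetic of the three cases can be checked with the short sketch below. It only evaluates the upper bound formula with the numbers quoted for Figs. 2 to 4; it does not implement the positional scan itself, and the function and variable names are assumptions.

```python
def upper_bound(prec, scan_hit, k1, k2):
    """upbound = prec + (1 if the scan element was found, else 0) + min(k1, k2)."""
    return prec + (1 if scan_hit else 0) + min(k1, k2)

ALPHA = 14  # Overlap similarity threshold used in the description

# (prec, scan element found?, k1, k2) as quoted for each figure.
cases = {
    "Fig. 2": (2, True, 16, 7),
    "Fig. 3": (2, True, 14, 9),
    "Fig. 4": (2, False, 14, 16),
}

bounds = {name: upper_bound(*args) for name, args in cases.items()}
# Retain only pairs whose upper bound is greater than the threshold.
kept = [name for name, bound in bounds.items() if bound > ALPHA]
print(bounds, kept)
```

Only the pair of Fig. 4 survives the filter and would go on to the confirmation stage.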
Step 4-2: compute the similarity of the retained entity record pairs and obtain the entity record pairs whose similarity is greater than the similarity threshold;
The constraint of any other set similarity function can ultimately be converted, by the corresponding mathematical calculation, into a constraint on the overlap similarity. Take the Jaccard similarity function as an example. If the Jaccard threshold is t, i.e., |s ∩ r| / |s ∪ r| ≥ t, so that the Jaccard similarity of records s and r satisfies the threshold t, then, because |s ∪ r| ≥ |s|, the Overlap similarity of s and r must be at least t|s|.
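The conversion from a Jaccard threshold to an Overlap constraint can be illustrated numerically. The sets below are hypothetical, and rounding t|s| up to an integer is an assumption, since the overlap is always an integer.

```python
import math

def jaccard(s, r):
    """Jaccard similarity |s ∩ r| / |s ∪ r| of two token sets."""
    return len(s & r) / len(s | r)

def min_overlap_for_jaccard(size, t):
    """Smallest integer overlap compatible with Jaccard >= t, using |s ∪ r| >= |s|."""
    return math.ceil(t * size - 1e-9)  # epsilon guards against float noise

# Hypothetical token sets.
s = {"A", "B", "C", "D", "E"}
r = {"A", "B", "C", "D", "F"}
t = 0.6

required = min_overlap_for_jaccard(len(s), t)
print(jaccard(s, r), len(s & r), required)
```

Here the Jaccard similarity is 4/6, which meets t = 0.6, and the actual overlap of 4 is indeed at least the required 3.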
Confirmation stage: entity record pairs that were filtered out are not considered further. The entity record pairs that were not filtered out are confirmed: the similarity between the entity records in the candidate set is computed with the defined similarity function and compared with the similarity threshold, and the entity record pairs whose similarity is greater than the threshold are output;
The filtering of step 4-1 eliminates some entity record pairs, namely those judged unable to satisfy the similarity threshold. The remaining pairs are not eliminated and must be confirmed: the similarity of each such pair is computed with the defined similarity function to determine whether the pair meets the condition.
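The confirmation stage can be sketched as a plain loop over the surviving candidate pairs. The function name `confirm`, the sample records, and the candidate list are hypothetical; the strict comparison follows the "greater than" wording of the text.

```python
def confirm(candidates, records, alpha):
    """Compute the exact Overlap similarity of each candidate pair and keep matches."""
    results = []
    for a, b in candidates:
        sim = len(set(records[a]) & set(records[b]))
        # The text outputs pairs "greater than" the threshold; use >= to keep ties.
        if sim > alpha:
            results.append((a, b, sim))
    return results

# Hypothetical ordered records and candidate pairs left after filtering.
records = {
    1: ["A", "B", "C", "D"],
    2: ["B", "C", "D", "E"],
    3: ["A", "E", "F", "G"],
}
candidates = [(1, 2), (1, 3)]
matches = confirm(candidates, records, alpha=2)
print(matches)
```

Because filtering only removes pairs that cannot qualify, this exact check is what decides the final output.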
As shown in Fig. 6, the double-affix filtering similarity join is performed on the entity record pairs under the same index key to realize filtering. The index follows the key-value model: the index key (Index) is the key of the key-value model, which is what enables distributed computing. Many distributed computing platforms, such as Hadoop and Spark, adopt the key-value model. For example, the value list for index key C contains the records with ids 2, 3, and 4; double-affix filtering is applied to these three records, and the record pair (2, 4) is filtered out. The confirmation stage then computes the similarity of the remaining entity record pairs to obtain the final similar entity record pairs.
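The key-value processing can be imitated on a single machine, as in the sketch below. It uses plain Python rather than any Hadoop or Spark API, the function name is hypothetical, and the index contents other than key C with ids 2, 3, and 4 are invented for illustration.

```python
from itertools import combinations

def candidate_pairs_per_key(index):
    """Reduce step: pair up the record ids that share an index key."""
    pairs = set()
    for key, ids in index.items():
        # Each key can be processed independently, e.g. by a separate worker.
        for a, b in combinations(sorted(set(ids)), 2):
            pairs.add((a, b))  # the set removes pairs that repeat under several keys
    return sorted(pairs)

# Key C with ids 2, 3, 4 follows the description; the other keys are hypothetical.
index = {"A": [1], "C": [2, 3, 4], "D": [3, 4]}
pairs = candidate_pairs_per_key(index)
print(pairs)
```

In the example of the text, double-affix filtering would then drop the pair (2, 4), and the remaining pairs would go to the confirmation stage.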
The result of step 4 can be used to implement further applications. For example, a graph can be built from the similar record pairs and their similarities using existing techniques in order to realize entity recognition.

Claims (1)

1. A big data similarity join method based on double-affix filtering, comprising the following steps:
Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;
Step 2: word frequency statistics: count the word frequency of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;
Step 3: use each prefix element of an entity record as an index key for that record, and build an inverted index table over the entity records;
Step 4: perform the double-affix filtering similarity join on the entity record pairs under the same index key to obtain the entity record pairs whose similarity is greater than the similarity threshold, realizing distributed computing;
Step 4-1: perform the double-affix filtering similarity join on the entity record pairs under the same index key: compute the prefix length of each sorted entity record and divide the record into a prefix and a suffix; use the position information of elements in the prefix and suffix to compute an upper bound on the similarity of each entity record pair; compare the upper bound with the similarity threshold and retain the entity record pairs whose upper bound is greater than the threshold, thereby realizing filtering;
Step 4-2: compute the similarity of the retained entity record pairs and obtain the entity record pairs whose similarity is greater than the similarity threshold, completing the big data similarity join based on double-affix filtering;
characterized in that the upper bound on the similarity of an entity record pair, computed in step 4-1 from the position information of elements in the prefix and suffix, is given by the formula:
upbound(s, r) = prec + 1 + min(|k1|, |k2|)
where upbound(s, r) is the similarity upper bound of entity records s and r, prec is the number of common elements in the prefixes, k1 is the length of the suffix of entity record s, and k2 is the length of the suffix of entity record r.
CN201511020637.8A 2015-12-30 2015-12-30 A big data similarity join method based on double-affix filtering Active CN105677757B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201511020637.8A CN105677757B (en) 2015-12-30 2015-12-30 A big data similarity join method based on double-affix filtering

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201511020637.8A CN105677757B (en) 2015-12-30 2015-12-30 A big data similarity join method based on double-affix filtering

Publications (2)

Publication Number Publication Date
CN105677757A CN105677757A (en) 2016-06-15
CN105677757B true CN105677757B (en) 2019-03-26

Family

ID=56298059

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201511020637.8A Active CN105677757B (en) 2015-12-30 2015-12-30 A big data similarity join method based on double-affix filtering

Country Status (1)

Country Link
CN (1) CN105677757B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108573052B (en) * 2018-04-23 2019-09-10 南京大学 A threshold-adaptive set similarity join method
CN108874880B (en) * 2018-05-04 2021-11-23 昆明理工大学 Trie-based space keyword query method and device
CN108846013B (en) * 2018-05-04 2021-11-23 昆明理工大学 Space keyword query method and device based on geohash and Patricia Trie
CN111046092B (en) * 2019-11-01 2022-06-17 东北大学 Parallel similarity connection method based on CPU-GPU heterogeneous system structure
CN111476037B (en) * 2020-04-14 2023-03-31 腾讯科技(深圳)有限公司 Text processing method and device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103257961A (en) * 2012-02-15 2013-08-21 北大方正集团有限公司 Method, device and system of bibliography repeat removal
CN103858403A (en) * 2011-10-14 2014-06-11 阿尔卡特朗讯公司 Processing messages correlated to multiple potential entities
CN104317801A (en) * 2014-09-19 2015-01-28 东北大学 Data cleaning system and method for aiming at big data

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103858403A (en) * 2011-10-14 2014-06-11 阿尔卡特朗讯公司 Processing messages correlated to multiple potential entities
CN103257961A (en) * 2012-02-15 2013-08-21 北大方正集团有限公司 Method, device and system of bibliography repeat removal
CN104317801A (en) * 2014-09-19 2015-01-28 东北大学 Data cleaning system and method for aiming at big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Efficient similarity joins for near-duplicate detection; Xiao C, Wang W, Lin X; ACM Transactions on Database Systems; 20080425; vol. 36, no. 3; see abstract, sections 1-6
Research progress on similarity join query techniques (相似性连接查询技术研究进展); 庞俊; 《计算机科学与探索》; 20131207; vol. 7, no. 1; full text

Also Published As

Publication number Publication date
CN105677757A (en) 2016-06-15

Similar Documents

Publication Publication Date Title
CN105677757B (en) A big data similarity join method based on double-affix filtering
US20170140058A1 (en) Systems and Methods for Identifying Influencers and Their Communities in a Social Data Network
Leung et al. A data science solution for mining interesting patterns from uncertain big data
CN106202211B (en) Integrated microblog rumor identification method based on microblog types
CN104462084B (en) Providing search refinement suggestions based on multiple queries
CN103577593B (en) A video aggregation method and system based on microblog hot topics
JP2017512344A (en) System and method for rapid data analysis
Ahmed et al. A literature review on NoSQL database for big data processing
CN104317801A (en) Data cleaning system and method for aiming at big data
Panigrahy et al. How user behavior is related to social affinity
CN108460153A (en) A social media friend recommendation method combining blog posts and user relationships
CN104537341A (en) Human face picture information obtaining method and device
CN107341199A (en) A recommendation method based on a general model of document information
Sapul et al. Trending topic discovery of Twitter Tweets using clustering and topic modeling algorithms
CN112559513A (en) Link data access method, device, storage medium, processor and electronic device
CN105740387B (en) A scientific literature recommendation method based on authors' frequent patterns
Xu et al. Efficient summarization framework for multi-attribute uncertain data
US20130091145A1 (en) Method and apparatus for analyzing web trends based on issue template extraction
CN105589935A (en) Social group recognition method
KR101693727B1 (en) Apparatus and method for reorganizing social issues from research and development perspective using social network
CN104952023A (en) Health information management method and system based on mobile computing
Tseng et al. Advances in knowledge discovery and data mining
Khrouf et al. Aggregating social media for enhancing conference experience
Vandaele et al. Mining topological structure in graphs through forest representations
CN110781309A (en) Entity parallel relation similarity calculation method based on pattern matching

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220323

Address after: 100081 No. 5 South Main Street, Haidian District, Beijing, Zhongguancun

Patentee after: BEIJING INSTITUTE OF TECHNOLOGY

Address before: 110819 No. 3 lane, Heping Road, Heping District, Shenyang, Liaoning 11

Patentee before: Northeastern University