CN105677757B - A big data similarity join method based on double-affix (prefix and suffix) filtering - Google Patents
- Publication number
- CN105677757B (application CN201511020637.8A)
- Authority
- CN
- China
- Prior art keywords
- similarity
- entity record
- record
- entity
- filtering
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/335—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present invention provides a big data similarity join method based on double-affix (prefix and suffix) filtering, comprising: extracting document-type data from different data sources to obtain the entity records to be cleaned; counting the word frequencies of the elements in the entity records and sorting the elements of each entity record in ascending order of word frequency; using each prefix element of an entity record as an index key for that record and building an inverted index table over the entity records; and performing double-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, thereby realizing distributed computing. The invention filters using the position information of elements in the prefixes and suffixes of an entity record pair, which significantly reduces the size of the candidate set; double-affix filtering achieves good running time for data sources of different sizes and for different thresholds. Double-affix filtering also supports distributed computing for big data, so it can be applied in distributed computing to improve the efficiency of big data cleaning.
Description
Technical field
The invention belongs to the technical field of data mining, and in particular relates to a big data similarity join method based on double-affix filtering.
Background
With the arrival of the big data era, not all of the massive data on the Internet is useful information, and techniques for extracting useful information from massive data are attracting growing attention. Data from different sources must be integrated correctly so that data mining techniques can uncover and analyze the value it contains. Similarity join has become an indispensable method for data integration and cleaning. Unlike an equi-join, a similarity join computes the similarity of two records and joins the record pairs that satisfy a threshold condition. At present, similarity join techniques generally consist of two stages, a filtering stage and a verification stage. Different similarity join techniques differ mainly in the filtering stage, where different filtering rules discard the record pairs that certainly cannot satisfy the similarity threshold, which improves join efficiency. In the verification stage, the similarity of the record pairs that could not be filtered out directly, i.e. those that may satisfy the threshold condition, is computed to obtain the final similar record pairs and complete the similarity join. The existing literature proposes prefix filtering for ordered records: all elements of a record are sorted according to a given order, the prefix length is computed by formula to obtain the prefix, and an inverted index is built; all records in the inverted list of the same element become candidates for one another. Other techniques are also used in the filtering stage, such as filtering by record length and filtering by the positions of elements within records.
Summary of the invention
The purpose of the present invention is to provide a big data similarity join method based on double-affix filtering.
The technical solution of the present invention is as follows:
A big data similarity join method based on double-affix filtering, comprising the following steps:
Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;
Step 2: word frequency statistics: count the word frequencies of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;
Step 3: use each prefix element of an entity record as an index key for that entity record, and build an inverted index table over the entity records;
Step 4: perform double-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, realizing distributed computing.
Step 4-1: perform double-affix filtering on the entity record pairs within the same index: compute the prefix length of each ascending-sorted entity record and divide the record into a prefix and a suffix; use the position information of elements in the prefixes and suffixes to compute a similarity upper bound for each entity record pair; compare the upper bound with the similarity threshold and retain the pairs whose upper bound is greater than the threshold, thereby realizing filtering;
Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold, completing the big data similarity join based on double-affix filtering.
Beneficial effects:
The invention filters using the position information of elements in the prefixes and suffixes of an entity record pair, which significantly reduces the size of the candidate set; double-affix filtering achieves good running time for data sources of different sizes and for different thresholds. Double-affix filtering also supports distributed computing for big data, so it can be applied in distributed computing to improve the efficiency of big data cleaning.
Brief description of the drawings
Fig. 1 is a schematic diagram of the big data cleaning application based on double-affix filtering similarity join according to the present invention;
Fig. 2 shows the first case of double-affix filtering in the specific embodiment of the invention;
Fig. 3 shows the second case of double-affix filtering in the specific embodiment of the invention;
Fig. 4 shows the third case of double-affix filtering in the specific embodiment of the invention;
Fig. 5 shows record sorting and prefix extraction in the specific embodiment of the invention;
Fig. 6 is a schematic diagram of distributed double-affix filtering in the specific embodiment of the invention;
Fig. 7 is a flowchart of the big data similarity join method based on double-affix filtering in the specific embodiment of the invention.
Detailed description of the embodiments
Specific embodiments of the present invention are described in detail below with reference to the accompanying drawings.
At present, enterprises facing rapidly growing data often need to find useful, valuable data in massive, heterogeneous, uncertain data sets. Similarity join refers to finding, in one or more data sources, the data that satisfy a similarity definition; it is used for operations such as data cleaning and data integration. For example, a social network can recommend some users as friends from a large virtual community according to user interests and friend relationships; intellectual property checking can run similarity queries against papers in a large number of different databases worldwide; and some users need to determine whether different institutions that appear during integration are in fact the same institution, e.g. when retrieving "Northeastern University" it is necessary to determine whether the school is the one in China, Japan or the United States. Solving these problems requires similarity join or similarity query techniques. The big data cleaning application based on double-affix filtering similarity join is shown in Fig. 1.
As an important method for entity recognition, similarity join gathers similar records together in order to find similar entities. In data cleaning and integration, the similarity of record pairs formed from the entity records of different data sources is computed with a known similarity function to obtain the record pairs that satisfy the similarity threshold. After the similarity join of the record pairs that satisfy the threshold, a graph is constructed from the entity similarities using existing techniques, and related applications are built on its subgraphs.
A big data similarity join method based on double-affix filtering, as shown in Fig. 7, comprises the following steps:
Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;
Step 2: word frequency statistics: count the word frequencies of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;
The purpose of word frequency statistics is to support computing the prefix length of entity records; ordered entity records are the precondition for double-affix filtering. All entity records are tokenized (each record is split according to its semantics). Tokenization splits every entity record into multiple elements, word frequencies are counted over the elements, and the elements are arranged in ascending order of word frequency, forming a set for each record.
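Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the whitespace tokenizer, the lowercasing, and the alphabetical tie-breaking between elements of equal frequency are choices made here and are not specified in the text; the names `tokenize` and `sort_by_frequency` are hypothetical.

```python
from collections import Counter

def tokenize(record):
    # Hypothetical tokenizer: lowercase the record and split on whitespace.
    return record.lower().split()

def sort_by_frequency(records):
    """Return each record as a list of distinct elements, rarest first."""
    token_sets = [set(tokenize(r)) for r in records]
    # Count in how many records each element occurs.
    freq = Counter(tok for toks in token_sets for tok in toks)
    # Ties are broken alphabetically so that all records share one total order.
    return [sorted(toks, key=lambda t: (freq[t], t)) for toks in token_sets]

records = ["She likes database and Java", "He likes java and music"]
ordered = sort_by_frequency(records)
for tokens in ordered:
    print(tokens)
```

Sorting every record by the same global order is what later allows prefixes of different records to be compared element by element.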
Step 3: use each prefix element of an entity record as an index key for that entity record, and build an inverted index table over the entity records;
As shown in Fig. 5, sequence G is the list of elements produced by tokenizing all records, arranged in order of word frequency from low to high. For example, the record with id 1 becomes {A, B, D, E, F} after sorting. For each reordered record, the prefix length is computed according to the chosen similarity function. Assuming Jaccard similarity with threshold t = 0.8, the prefix length of this record is 2, so the record yields two index entries, keyed by A and B respectively.
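A minimal sketch of the inverted index of step 3 follows. Record 1 is the record from the text; record 2 and the function name `build_inverted_index` are hypothetical, and the prefix length is supplied as a function so that the sketch does not commit to a particular similarity measure.

```python
from collections import defaultdict

def build_inverted_index(records, prefix_length):
    """Map each prefix element to the ids of the records whose prefix contains it.

    records: dict of record id -> elements sorted by ascending word frequency.
    prefix_length: function giving the prefix length for a record of a given size.
    """
    index = defaultdict(list)
    for rid, elements in records.items():
        for element in elements[:prefix_length(len(elements))]:
            index[element].append(rid)
    return dict(index)

records = {
    1: ["A", "B", "D", "E", "F"],  # record 1 from the text
    2: ["B", "C", "E", "F", "G"],  # hypothetical second record
}
# Prefix length 2 is the value the text derives for |s| = 5 and t = 0.8.
index = build_inverted_index(records, lambda size: 2)
print(index)
```

Records 1 and 2 both appear under key B, so they would be compared with each other; a record that shares no prefix element with record 1 never meets it.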
Step 4: perform double-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, realizing distributed computing.
Step 4-1: perform double-affix filtering on the entity record pairs within the same index: compute the prefix length of each ascending-sorted entity record and divide the record into a prefix and a suffix; use the position information of elements (tokens) in the prefixes and suffixes to compute a similarity upper bound for each entity record pair; compare the upper bound with the similarity threshold and retain the pairs whose upper bound is greater than the threshold, thereby realizing filtering;
Choose a similarity measure: given two records s and r, the similarity function is denoted sim(s, r).
A similarity function generally has the following properties:
(1) 0 ≤ sim(s, r) ≤ 1
(2) sim(s, r) = sim(r, s)
When sim(s, r) = 0, the similarity of the two entity records is 0, i.e. the two entity records form a dissimilar pair.
When sim(s, r) = 1, the two entity records are completely similar and may represent two records of the same entity.
When 0 < sim(s, r) < 1, the two entity records are partially similar: the larger the value of the similarity function, the more similar they are, and the smaller the value, the less similar.
Tokenizing a record yields the set corresponding to that record; for sets S and R, the different similarity functions are shown in Table 1:
Table 1: Set similarity functions
This embodiment uses the set overlap similarity (Overlap). Each entity record is converted into a set and the similarity between sets is computed. For example, let entity records s and r be the two sentences "She likes database and Java" and "He likes java and music". Converting them into the corresponding sets gives S = {She, likes, database, and, java} and R = {He, likes, java, and, music}, so |R ∩ S| = 3 and the Overlap similarity of s and r is 3. Other set similarities can be converted into the overlap similarity by equation transformation.
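The example above can be reproduced with Python sets. This is a small sketch; the function names are chosen here for illustration, and the Jaccard function is included only to show a second measure computed from the same intersection.

```python
def overlap(s, r):
    """Overlap similarity: the number of shared elements."""
    return len(s & r)

def jaccard(s, r):
    """Jaccard similarity: shared elements over all distinct elements.

    Assumes at least one of the two sets is non-empty.
    """
    return len(s & r) / len(s | r)

S = {"She", "likes", "database", "and", "java"}
R = {"He", "likes", "java", "and", "music"}

print(overlap(S, R))   # shared elements: likes, and, java
print(jaccard(S, R))
```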
The prefix length depends on the threshold of the similarity function. With the set overlap similarity (Overlap) and similarity threshold α, the prefix length of entity record s is prefix(s) = |s| - α + 1; with Jaccard similarity and threshold t, the prefix length of record s is prefix(s) = |s| - t|s| + 1. By the pigeonhole principle, a pair of entity records can satisfy the threshold only if their prefixes intersect, so two entity records whose prefixes share an element form a candidate pair of similar entity records. Double-affix filtering is built on this basis.
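The two prefix-length formulas and the prefix-intersection test can be sketched as follows. Several details are assumptions of this sketch rather than statements from the text: the threshold t is passed as a `Fraction` to avoid floating-point rounding, t|s| is rounded up when it is not an integer, and the record size 20 used with α = 14 is a made-up value.

```python
from fractions import Fraction
from math import ceil

def overlap_prefix_length(size, alpha):
    """prefix(s) = |s| - alpha + 1 for an overlap threshold alpha."""
    return size - alpha + 1

def jaccard_prefix_length(size, t):
    """prefix(s) = |s| - t|s| + 1 for a Jaccard threshold t (a Fraction).

    Rounding t|s| up when it is not an integer is an assumption of this sketch.
    """
    return size - ceil(t * size) + 1

def is_candidate(s, r, prefix_len_s, prefix_len_r):
    """Two sorted records are candidates when their prefixes share an element."""
    return bool(set(s[:prefix_len_s]) & set(r[:prefix_len_r]))

print(jaccard_prefix_length(5, Fraction(4, 5)))  # |s| = 5, t = 0.8
print(overlap_prefix_length(20, 14))             # hypothetical |s| = 20
print(is_candidate(["A", "B", "D", "E", "F"], ["B", "C", "E", "F", "G"], 2, 2))
```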
As shown in Fig. 2, assume the set overlap similarity (Overlap) with similarity threshold α = 14. The similarity upper bound of entity records s and r is estimated from the positions of identical elements between the prefix of entity record s and the suffix of entity record r. By calculation, the prefix length of both entity record s and entity record r is 6; the prefixes are denoted p1 and p2 and the suffixes q1 and q2, respectively. Element B is the last common element of p1 and p2. The first element after B in prefix p1, element C, is taken as the scan element, and the suffix of r is scanned to obtain the position of C in q2. Because s and r are both sorted according to the same global order G (e.g. ascending word frequency), every element after C in p1 must follow C in G. Likewise, every element after C in the suffix of r follows C in G. It follows that the elements of r located after B and before C do not appear in s, so this part cannot contribute to the similarity upper bound upbound. For the elements after C in s and in r, the maximum possible overlap is the smaller of the two element counts. The upper bound of entity records s and r is therefore given by:
upbound(s, r) = prec + 1 + min(|k1|, |k2|)
where prec is the number of common elements in the prefixes, k1 is the length of the suffix of entity record s, and k2 is the length of the suffix of entity record r;
Taking Fig. 2 as an example, upbound(s, r) = 2 + 1 + min(16, 7) = 10 < 14, so s and r certainly cannot satisfy the similarity threshold and the pair is filtered out.
Similarly, the second case arises when the scan element of entity record s does not exist in entity record r. As shown in Fig. 3, the first element D after the common elements in the prefix of entity record r is taken as the scan element and the suffix of s is scanned, giving
upbound(s, r) = 2 + 1 + min(14, 9) = 12 < 14, which also fails the condition, so this pair is filtered out as well.
If neither the scan element of entity record s nor the scan element of entity record r is found in the other record, as shown in Fig. 4, the similarity upper bound formula is upbound(s, r) = prec + 0 + min(|k1|, |k2|),
so upbound(s, r) = 2 + 0 + min(14, 16) = 16 > 14, and the pair is retained as a candidate.
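The three cases can be checked with a short sketch that evaluates the upper bound formula on the numbers quoted for Figs. 2 to 4. It is not a full implementation of the scan: whether the scan element was found is passed in as a flag, and k1 and k2 are simply the two counts that the text substitutes into min(...). The function names are hypothetical.

```python
def upbound(prec, scan_found, k1, k2):
    """Upper bound on the overlap of a record pair.

    prec: number of common elements in the two prefixes.
    scan_found: True if a scan element was located in the other record's
        suffix (cases 1 and 2), False if neither scan succeeded (case 3).
    k1, k2: the two element counts compared by min(...) in the formula.
    """
    return prec + (1 if scan_found else 0) + min(k1, k2)

def passes_filter(prec, scan_found, k1, k2, alpha):
    # A pair is kept only if its bound is greater than the threshold,
    # following the wording of step 4-1.
    return upbound(prec, scan_found, k1, k2) > alpha

ALPHA = 14
fig2 = upbound(2, True, 16, 7)    # Fig. 2 values
fig3 = upbound(2, True, 14, 9)    # Fig. 3 values
fig4 = upbound(2, False, 14, 16)  # Fig. 4 values
print(fig2, fig3, fig4)
```

Only the Fig. 4 pair exceeds the threshold of 14, so it is the only one of the three that proceeds to exact similarity computation.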
Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold;
The constraints of other set similarity functions can ultimately be converted, by the corresponding mathematical calculation, into a constraint on the overlap similarity. Take the Jaccard similarity function as an example. If the Jaccard threshold is t, i.e. J(s, r) = |s ∩ r| / |s ∪ r| ≥ t, meaning that the Jaccard similarity of records s and r satisfies threshold t, then the Overlap similarity of s and r must be at least t|s|.
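The reasoning behind this conversion is short: since |s ∪ r| ≥ |s|, the condition |s ∩ r| / |s ∪ r| ≥ t gives |s ∩ r| ≥ t|s ∪ r| ≥ t|s|. The sketch below checks this on one small pair of sets; the sets, the threshold 3/5, and the function names are made up for illustration, and exact fractions are used to avoid rounding issues.

```python
from fractions import Fraction
from math import ceil

def jaccard(s, r):
    """Exact Jaccard similarity of two non-empty sets."""
    return Fraction(len(s & r), len(s | r))

def min_overlap_for_jaccard(size_s, t):
    """Smallest integer overlap compatible with J(s, r) >= t, given only |s|."""
    return ceil(t * size_s)

s = {"A", "B", "C", "D", "E"}
r = {"A", "B", "C", "D", "F"}
t = Fraction(3, 5)

required = min_overlap_for_jaccard(len(s), t)
print(jaccard(s, r), len(s & r), required)
```

The bound is necessary but not sufficient: a pair whose overlap reaches t|s| may still fail the Jaccard threshold, which is why the exact similarity is computed afterwards.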
Verification stage: entity record pairs that have been filtered out are not considered further; the pairs that were not filtered out are verified. The similarity between the entity records of each pair in the candidate set is computed with the defined similarity function and compared with the similarity threshold, and the pairs whose similarity is greater than the threshold are output;
After the filtering of step 4-1, some entity record pairs have been filtered out, i.e. judged certainly unable to satisfy the similarity threshold, while others have not. The latter must be verified: the similarity of each such pair is computed with the defined similarity function to confirm whether the pair satisfies the condition.
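A minimal sketch of the verification stage is given below, assuming the overlap similarity and a candidate list that has already passed the filter. The record contents, the threshold value 2, and the names `verify` and `overlap` are hypothetical.

```python
def overlap(s, r):
    """Overlap similarity of two element lists."""
    return len(set(s) & set(r))

def verify(candidates, records, threshold, sim=overlap):
    """Keep the candidate pairs whose exact similarity exceeds the threshold."""
    return [(a, b) for a, b in candidates
            if sim(records[a], records[b]) > threshold]

records = {
    2: ["C", "D", "E", "H"],
    3: ["C", "D", "E", "K"],
    4: ["C", "F", "G", "M"],
}
candidates = [(2, 3), (3, 4)]  # pairs that survived filtering (hypothetical)
similar = verify(candidates, records, threshold=2)
print(similar)
```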
As shown in Fig. 6, double-affix filtering similarity join is performed on the entity record pairs within the same index to realize filtering. The index follows the key-value model: the index element (Index) is the key, which makes distributed computing possible. Many distributed computing platforms, such as Hadoop and Spark, adopt the key-value model. For example, the value for index key C contains the records with ids 2, 3 and 4; applying double-affix filtering to these three records filters out the record pair (2, 4). The similarity of the remaining entity record pairs is then computed in the verification stage to obtain the final similar entity record pairs.
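The key-value organization can be illustrated with a single-process imitation of a map, shuffle and reduce pipeline. This is a sketch only: it uses no Hadoop or Spark API, it applies one fixed prefix length to every record for simplicity, and the record contents and function names are hypothetical. Filtering and verification would be applied to the pairs produced for each key.

```python
from collections import defaultdict
from itertools import combinations

def map_phase(records, prefix_len):
    """Emit (prefix element, record id) pairs, one per prefix element."""
    for rid, elements in records.items():
        for key in elements[:prefix_len]:
            yield key, rid

def shuffle(pairs):
    """Group record ids by key, as a key-value platform would."""
    groups = defaultdict(list)
    for key, rid in pairs:
        groups[key].append(rid)
    return groups

def reduce_phase(groups):
    """Within each key, enumerate the record pairs to be filtered and verified."""
    candidates = set()
    for rids in groups.values():
        candidates.update(combinations(sorted(rids), 2))
    return candidates

records = {
    2: ["C", "D", "H"],
    3: ["C", "E", "H"],
    4: ["C", "F", "H"],
    5: ["G", "K", "H"],
}
pairs = reduce_phase(shuffle(map_phase(records, prefix_len=1)))
print(sorted(pairs))
```

Because each key is processed independently, the groups can be assigned to different workers without any coordination between them.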
Based on the result of step 4, related applications can be further implemented; for example, a graph can be built from the similar record pairs and their similarities using existing techniques in order to perform entity recognition.
Claims (1)
1. A big data similarity join method based on double-affix filtering, comprising the following steps:
Step 1: extract document-type data from different data sources to obtain the entity records to be cleaned;
Step 2: word frequency statistics: count the word frequencies of the elements in the entity records and sort the elements of each entity record in ascending order of word frequency;
Step 3: use each prefix element of an entity record as an index key for that entity record, and build an inverted index table over the entity records;
Step 4: perform double-affix filtering similarity join on the entity record pairs within the same index to obtain the entity record pairs whose similarity is greater than the similarity threshold, realizing distributed computing;
Step 4-1: perform double-affix filtering on the entity record pairs within the same index: compute the prefix length of each ascending-sorted entity record and divide the record into a prefix and a suffix; use the position information of elements in the prefixes and suffixes to compute a similarity upper bound for each entity record pair; compare the upper bound with the similarity threshold and retain the pairs whose upper bound is greater than the threshold, thereby realizing filtering;
Step 4-2: compute the similarity of the retained entity record pairs and obtain the pairs whose similarity is greater than the similarity threshold, completing the big data similarity join based on double-affix filtering;
characterized in that, in step 4-1, the similarity upper bound of an entity record pair is computed from the position information of elements in the prefixes and suffixes by the formula:
upbound(s, r) = prec + 1 + min(|k1|, |k2|)
where upbound(s, r) is the similarity upper bound of entity records s and r, prec is the number of common elements in the prefixes, k1 is the length of the suffix of entity record s, and k2 is the length of the suffix of entity record r.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511020637.8A CN105677757B (en) | 2015-12-30 | 2015-12-30 | A big data similarity join method based on double-affix filtering |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201511020637.8A CN105677757B (en) | 2015-12-30 | 2015-12-30 | A big data similarity join method based on double-affix filtering |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105677757A CN105677757A (en) | 2016-06-15 |
CN105677757B true CN105677757B (en) | 2019-03-26 |
Family
ID=56298059
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201511020637.8A Active CN105677757B (en) | 2015-12-30 | 2015-12-30 | A big data similarity join method based on double-affix filtering |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105677757B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108573052B (en) * | 2018-04-23 | 2019-09-10 | 南京大学 | A kind of similar connection method of the set of threshold adaptive |
CN108874880B (en) * | 2018-05-04 | 2021-11-23 | 昆明理工大学 | Trie-based space keyword query method and device |
CN108846013B (en) * | 2018-05-04 | 2021-11-23 | 昆明理工大学 | Space keyword query method and device based on geohash and Patricia Trie |
CN111046092B (en) * | 2019-11-01 | 2022-06-17 | 东北大学 | Parallel similarity connection method based on CPU-GPU heterogeneous system structure |
CN111476037B (en) * | 2020-04-14 | 2023-03-31 | 腾讯科技(深圳)有限公司 | Text processing method and device, computer equipment and storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103858403A (en) * | 2011-10-14 | 2014-06-11 | 阿尔卡特朗讯公司 | Processing messages correlated to multiple potential entities |
CN103257961A (en) * | 2012-02-15 | 2013-08-21 | 北大方正集团有限公司 | Method, device and system of bibliography repeat removal |
CN104317801A (en) * | 2014-09-19 | 2015-01-28 | 东北大学 | Data cleaning system and method for aiming at big data |
Non-Patent Citations (2)
Title |
---|
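Efficient similarity joins for near-duplicate detection; Xiao C, Wang W, Lin X; ACM Transactions on Database Systems; 20080425; vol. 36, no. 3; see abstract and sections 1-6 |
Research progress on similarity join query techniques; Pang Jun; Journal of Frontiers of Computer Science and Technology; 20131207; vol. 7, no. 1; full text |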
Also Published As
Publication number | Publication date |
---|---|
CN105677757A (en) | 2016-06-15 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20220323; Address after: 100081 No. 5 South Main Street, Haidian District, Beijing, Zhongguancun; Patentee after: BEIJING INSTITUTE OF TECHNOLOGY; Address before: 110819 No. 3 lane, Heping Road, Heping District, Shenyang, Liaoning 11; Patentee before: Northeastern University |