CN109977287A

CN109977287A - A method for determining the identity of real estate data from different information sources

Info

Publication number: CN109977287A
Application number: CN201910242011.3A
Authority: CN
Inventors: 刘春阳; 张旭; 王鹏; 姜越; 张华平; 张吴波; 张宝华
Original assignee: National Computer Network and Information Security Management Center
Current assignee: National Computer Network and Information Security Management Center
Priority date: 2019-03-28
Filing date: 2019-03-28
Publication date: 2019-07-05
Anticipated expiration: 2039-03-28
Also published as: CN109977287B

Abstract

The present invention relates to a method for determining the identity of real estate data from different information sources, and belongs to the technical field of internet data analysis and mining. The method is based on the housing data published by the Lianjia, 5i5j, Centaline and Maitian websites and on the correlations among those data. By analyzing the characteristics of housing data, it removes duplicate housing data in three main steps: region deduplication, residential community deduplication and house deduplication. The characteristic of housing data is that they describe actual houses: although the angle and manner of description differ, the data are strongly correlated. The method can deduplicate housing data from different websites, determine accurately and efficiently whether housing data from different information sources are identical, and remove duplicate regions and communities. It enables effective fusion of multi-source heterogeneous housing data and provides "clean" and "tidy" data for real estate market analysis.

Description

A method for determining the identity of real estate data from different information sources

Technical field

The present invention relates to a method for determining the identity of real estate data from different information sources, and belongs to the technical field of internet data analysis and mining.

Background technique

Real estate is an important pillar of the national economy and a particularly important pillar industry in China. Real estate market conditions and price trends not only bear on the overall development of the national economy but also affect people's living standards. In recent years the real estate market has been volatile and has become a focus of public attention.

How to strengthen monitoring of the real estate market and analyze housing price trends has become an important topic. As China's real estate market matures, the second-hand housing market has become increasingly central and increasingly dominates the market as a whole. In cities such as Beijing and Shanghai in particular, second-hand housing transactions already account for most of the real estate transaction volume.

With the rapid development of the real estate market and the widespread use of internet technology, many housing transaction websites now exist, such as Lianjia, 5i5j and Centaline Property. These websites give buyers and sellers a convenient online trading platform: sellers post listings on the platform and buyers trade through it. Because it is convenient to use and draws on a wide range of information sources, this mode of trading is increasingly popular. The data published on these platforms truly reflect the state of the real estate market; moreover, the websites have broad coverage and the information they carry is highly timely. Analyzing the real estate information on these real-time housing transaction websites therefore makes it possible to grasp market conditions more accurately.

Analyzing the real estate market through data from housing transaction websites has become an important method. Many real estate research institutions have recognized the value of the information published on these platforms and have begun to use it to analyze market trends. For example, the Lianjia Research Institute has used the transaction information on the Lianjia website to analyze the real estate market with good results, but the transactions it studies are limited to that one website. Many other institutions have begun to crawl data from real estate transaction websites for their research, which offers new approaches to real estate market analysis.

A better approach is to combine the transaction data published by all real estate transaction websites and analyze them as a whole, which reflects the state of the market more faithfully. However, when housing data from multiple real estate websites are aggregated, the same user may have listed the same house on several websites, or even more than once on the same website. Before market analysis, the listings from the multiple source websites must therefore be checked for identity and the duplicate houses removed; only then can a real estate analysis system use the data effectively.

The listings published on each real estate transaction website contain detailed features of the house for sale, such as its community and location, floor, area and price. One can therefore extract and analyze the key features of a house, use them to determine whether two listings refer to the same house, and remove the duplicates, providing an accurate and reliable data basis for real estate market analysis and research.

Data deduplication has been studied extensively and many methods have been proposed. Most of these methods, however, target general text data (documents): the basic idea is to compute the similarity between texts and then deduplicate.

At present there is no method for deduplicating housing data merged from multiple real estate websites. The housing data published by these websites are semi-structured and contain rich housing features, such as the community, area and floor. Judging them as plain web page text would reduce the accuracy of the result.

The present invention therefore extracts the key features of houses from the listings published by real estate transaction websites, determines the identity of multi-source housing data and removes duplicate houses. Fusing these multi-source heterogeneous housing data yields an accurate and usable data set that supports real estate market analysis.

Summary of the invention

The object of the invention is to address the technical deficiencies of existing real estate data, namely numerous information sources, disorderly information and ambiguous criteria, by proposing a method for determining the identity of real estate data from different information sources. Based on the housing data published by existing multi-source heterogeneous real estate transaction websites, the method analyzes housing features, determines whether listings refer to the same house and removes duplicate houses. The resulting data set can be used in related technical fields such as real estate analysis.

The core idea of the invention is as follows. Based on the housing data published by the Lianjia, 5i5j, Centaline and Maitian websites and the correlations among those data, and by analyzing the characteristics of housing data, duplicate housing data are removed in three main steps: region deduplication, community deduplication and house deduplication. This fuses the website data of multiple housing transaction platforms more effectively and so provides accurate and effective data support for real estate market analysis.

The characteristic of housing data is that they describe actual houses: although the angle and manner of description differ, the data are strongly correlated. Listings published by different websites may be worded differently, but when they describe the same house the following correlations hold:

1. The house address is the same: the house is in the same community and in the same building. Because few websites publish the building in which a house is located, this can be judged from the number of floors of the building;

2. The basic attributes of the house are the same: these include floor area, layout and orientation;

3. The owner's expectation is the same: when renting out or selling the house, the owner's price expectation is consistent, so the listing price can be used for the judgment.

The present invention is achieved by the following technical solution.

The method for determining the identity of real estate data comprises the following steps:

Step 1) Region deduplication: different websites describe the same region of a city differently, so regions are deduplicated first;

The reason for discriminating regions is as follows. The data of every housing transaction website use a "city -> region -> community -> house" hierarchy. To determine the location of a house, one must determine whether the communities are the same; and to identify a community, one first determines that the urban regions are the same, which improves both the accuracy and the efficiency of the judgment;

Regions are discriminated by name. Analysis of how housing transaction websites name regions, and of how people customarily refer to them, shows that a region name generally has a "core word". Some names consist of the core word alone; others append a suffix such as "district" (区) or "county" (县) to the core word. Region names are thus highly regular and can be compared with rule-based methods;
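The "core word" rule can be illustrated with a minimal Python sketch. The suffix list and the function names `core_word` and `same_region` are assumptions made for illustration; the text does not enumerate the actual rules written for each website.

```python
# Hypothetical suffix list; longer suffixes come first so that "新区"
# is removed before the shorter "区" is tried.
REGION_SUFFIXES = ("新区", "区", "县", "市")


def core_word(region_name: str) -> str:
    """Strip one administrative suffix and return the remaining core word."""
    name = region_name.strip()
    for suffix in REGION_SUFFIXES:
        # Keep at least one character so a bare suffix is not emptied.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


def same_region(name_a: str, name_b: str) -> bool:
    """Two region names match when their core words are equal."""
    return core_word(name_a) == core_word(name_b)
```

For example, `same_region("朝阳", "朝阳区")` compares the core word "朝阳" with itself and reports a match.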

Step 1) comprises the following sub-steps:

Step 1-1) To simplify later statistics and analysis, create a region information table whose fields store the city and region name collected from the website, together with the region's alias and the alias source priority;

The alias field identifies the region: when regions are processed, two regions are treated as the same region if their aliases are equal. The alias source priority is used to decide whether the alias field should be updated;

Step 1-2) Determine the priority of each real estate website;

Website source priorities are numbered 0, 1, 2; the smaller the number, the higher the priority;

Step 1-3) In the website priority order set in step 1-2), query all regions under the same city. Compare the regions in the resulting list pairwise, judging whether their names are similar, and update the region aliases according to the result. Specifically:

Whether two region names are similar is judged by rules: decision rules are written according to how each website names regions and are then applied to the names to be compared. If two region names are judged to be the same, the aliases are updated by the following rules. Let the regions being compared be A and B. If the names of A and B match, check whether their aliases have been assigned: if neither alias has been assigned, perform 1-3A); if the alias of A has been assigned and the alias of B has not, perform 1-3B); if the alias of A has not been assigned and the alias of B has, perform 1-3C); if both aliases have been assigned, perform 1-3D);

1-3A) Obtain the priority λa of the source website of region A and the priority λb of the source website of region B. If λa is greater than λb, set the alias of region A to the name of region B and the alias source priority of region A to λb, save both in the region information table, and go to step 1-4). Otherwise, set the alias of region B to the name of region A and the alias source priority of region B to λa, save both in the region information table, and go to step 1-4);

1-3B) Obtain the alias source priority λa' of region A and the priority λb of the source website of region B. If λb is greater than λa', set the alias of B to the alias of A and the alias source priority of B to λa', and go to step 1-4). Otherwise, update the alias of A to the name of B and the alias source priority of A to λb, and go to step 1-4);

1-3C) Obtain the alias source priority λb' of region B and the priority λa of the source website of region A. If λa is greater than λb', set the alias of A to the alias of B and the alias source priority of A to λb', and go to step 1-4). Otherwise, update the alias of B to the name of A and the alias source priority of B to λa, and go to step 1-4);

1-3D) Obtain the alias source priority λb' of region B and the alias source priority λa' of region A. If λa' is greater than λb', update the alias of A to the alias of B and the alias source priority of A to λb', and go to step 1-4). Otherwise, update the alias of B to the alias of A and the alias source priority of B to λa', and go to step 1-4);

Step 1-4) After the region names have been compared and matched in step 1-3), set the alias of every region whose alias field is NULL to the region's own name;
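The four alias-update cases and the NULL fill of step 1-4) can be sketched in Python as follows. The `Region` class and function names are hypothetical, and two readings are assumptions: in the "otherwise" branch of 1-3C) the stored priority is taken to be λa (mirroring 1-3B)), and in the "otherwise" branch of 1-3D) B receives the alias of A.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Region:
    name: str
    site_priority: int                    # source website priority, 0 is highest
    alias: Optional[str] = None           # None plays the role of NULL
    alias_priority: Optional[int] = None  # alias source priority


def merge_aliases(a: Region, b: Region) -> None:
    """Update aliases of two regions whose names have already matched."""
    if a.alias is None and b.alias is None:        # case 1-3A
        if a.site_priority > b.site_priority:
            a.alias, a.alias_priority = b.name, b.site_priority
        else:
            b.alias, b.alias_priority = a.name, a.site_priority
    elif b.alias is None:                          # case 1-3B: only A has an alias
        if b.site_priority > a.alias_priority:
            b.alias, b.alias_priority = a.alias, a.alias_priority
        else:
            a.alias, a.alias_priority = b.name, b.site_priority
    elif a.alias is None:                          # case 1-3C: only B has an alias
        if a.site_priority > b.alias_priority:
            a.alias, a.alias_priority = b.alias, b.alias_priority
        else:
            b.alias, b.alias_priority = a.name, a.site_priority  # assumed reading
    else:                                          # case 1-3D: both have aliases
        if a.alias_priority > b.alias_priority:
            a.alias, a.alias_priority = b.alias, b.alias_priority
        else:
            b.alias, b.alias_priority = a.alias, a.alias_priority  # assumed reading


def fill_missing_aliases(regions: Iterable[Region]) -> None:
    """Step 1-4: a region whose alias is still unset takes its own name."""
    for region in regions:
        if region.alias is None:
            region.alias = region.name
```

After `merge_aliases` has been applied to every matching pair and `fill_missing_aliases` has run, regions that refer to the same place share one alias, which is what later queries group on.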

Step 1-5) Use the region alias in the region information table as the basis for querying and aggregating urban regions;

Step 2) Community deduplication: determine the identity of the residential communities within an urban region;

Identity must be determined because websites describe communities differently. After the region deduplication of step 1), each region alias corresponds to a unique region, so the communities in the same region can be queried by region alias. Community features include the community name, building area, total number of households, property management company and latitude and longitude. The identity of communities can be determined from these features. Step 2) comprises the following sub-steps:

Step 2-1) Create a community duplicate-check table that stores the IDs of communities judged to be the same. The table has two fields: matched-source community ID and matched-target community ID;

where the community identifier is the ID of the community.

Step 2-2) In the website priority order determined in step 1-2), search the community table for communities whose region alias is the same according to the region information table, and compare their features pairwise. Let the communities being compared be C1 and C2. The sub-steps are:

Step 2-2-1) Calculate the community position difference using formula (1):

Community position difference = |longitude difference| + |latitude difference| (1)

where |longitude difference| is the absolute value of the difference between the longitudes of C1 and C2, and |latitude difference| is the absolute value of the difference between the latitudes of C1 and C2;

If C1 and C2 have no latitude and longitude information, the community position difference is 0;

Step 2-2-2) Calculate the community area similarity using formula (2):

Community area similarity = |1 - |area difference| / Min(area)| (2)

where |area difference| is the absolute value of the difference between the building areas of C1 and C2, and Min(area) is the smaller of the two building areas (the building area of C1 if they are equal);

If there is no building area information, the community area similarity is 1;

Step 2-2-3) Calculate the community household-count similarity using formula (3):

Household-count similarity = |1 - |household-count difference| / Min(household count)| (3)

where |household-count difference| is the absolute value of the difference between the total numbers of households of C1 and C2, and Min(household count) is the smaller of the two totals (the total of C1 if they are equal);

If a community has no household-count information, the household-count similarity is 1;
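Formulas (1) to (3) can be written as two small Python functions, shown here as a sketch. The function names are hypothetical, and treating a feature as missing when either community lacks it (or when a value is 0) is an assumption made to keep the division well defined.

```python
from typing import Optional, Tuple

Coordinate = Tuple[float, float]  # (longitude, latitude) in degrees


def position_difference(p1: Optional[Coordinate], p2: Optional[Coordinate]) -> float:
    """Formula (1): |lon1 - lon2| + |lat1 - lat2|, or 0 if coordinates are missing."""
    if p1 is None or p2 is None:
        return 0.0
    return abs(p1[0] - p2[0]) + abs(p1[1] - p2[1])


def ratio_similarity(v1: Optional[float], v2: Optional[float]) -> float:
    """Formulas (2) and (3): |1 - |v1 - v2| / min(v1, v2)|, or 1 if a value is missing."""
    if not v1 or not v2:  # None or 0 is treated as "no information" (assumption)
        return 1.0
    return abs(1.0 - abs(v1 - v2) / min(v1, v2))
```

With building areas of 100 and 95, `ratio_similarity` gives 1 - 5/95, about 0.947, which falls just below the 0.95 threshold used later.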

Step 2-2-4) Calculate the community name similarity using formula (4):

Community name similarity = 1 - (eDistance / maxlength(community name)) (4)

where eDistance is the string edit distance between the community names of C1 and C2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the name of C1 into the name of C2; maxlength(community name) is the larger of the lengths of the two names (the length of the name of C1 if they are equal).
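The edit-distance similarity of formula (4) can be sketched with a standard dynamic-programming Levenshtein distance. The function names are hypothetical, and returning 1 for two empty strings is an assumption, since the text does not cover that case.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute, or keep if equal
        prev = curr
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """Formula (4): 1 - eDistance / maxlength."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty names count as identical
    return 1.0 - edit_distance(a, b) / longest
```

The same function applies unchanged to any other pair of strings compared by edit distance.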

Step 2-2-5) Calculate the property management company name similarity using formula (5):

Property management company name similarity = 1 - (eDistance / maxlength(property management company name)) (5)

where eDistance is the string edit distance between the property management company names of C1 and C2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the company name of C1 into the company name of C2; maxlength(property management company name) is the larger of the lengths of the two company names (the length of the company name of C1 if they are equal).

Step 2-3) Use the community feature similarities calculated in step 2-2) to judge the identity of the communities. If the community position difference is less than the threshold T1, the area similarity and the household-count similarity are both greater than the threshold T2, and the similarities of the community name and of the property management company name are both greater than the threshold T3, the two are judged to be the same community;

Threshold T1 is chosen on the basis that a difference of 0.01 degree in longitude or latitude corresponds to a distance of about 1000 meters, and is set to 0.02; threshold T2 is set to 0.95; threshold T3 is set to 0.9;
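The decision of step 2-3) is a conjunction of threshold tests. A minimal sketch, assuming the five scores have already been computed and are passed in through a hypothetical `CommunityScores` container:

```python
from dataclasses import dataclass

# Threshold values stated in the text.
T1 = 0.02   # upper bound on position difference, in degrees
T2 = 0.95   # lower bound on area and household-count similarity
T3 = 0.9    # lower bound on the two name similarities


@dataclass
class CommunityScores:
    position_diff: float    # formula (1)
    area_sim: float         # formula (2)
    households_sim: float   # formula (3)
    name_sim: float         # formula (4)
    manager_sim: float      # formula (5)


def is_same_community(s: CommunityScores) -> bool:
    """Every condition must hold for the pair to be judged identical."""
    return (s.position_diff < T1
            and s.area_sim > T2
            and s.households_sim > T2
            and s.name_sim > T3
            and s.manager_sim > T3)
```

Because missing features default to a position difference of 0 or a similarity of 1, a pair with sparse data is decided by whichever features are present.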

Step 2-4) If communities C1 and C2 are judged to be the same community in step 2-3), record them in the community duplicate-check table as follows. Obtain the source website priority λ1 of C1 and the source website priority λ2 of C2. If λ1 is greater than λ2, save C2 as the matched source and C1 as the matched target; otherwise save C1 as the matched source and C2 as the matched target;

Step 3) House deduplication, that is, determining the identity of houses;

Identity must be determined because the same house may be listed on several websites and each website describes it differently. After the community deduplication of step 2), the same communities from different websites can be clustered using the community duplicate-check table; the house lists of the same community from different websites are then queried according to the clustering result;

House features include the number of floors of the building, the floor on which the house is located, listing price, building area, layout, floor plan and orientation; the identity of houses is then determined from these features;

Step 3) comprises the following sub-steps:

Step 3-1) Create a house duplicate-check table with two fields: matched-source house ID and matched-target house ID;

where the house identifier is the ID of the house;

Step 3-2) For each region of the city, search for the communities in that region, build a community list cList and sort it by website priority;

Step 3-3) Cluster the communities in cList using the information in the community duplicate-check table, as follows:

Step 3-3A) Construct an adjacency list G in which each node represents one community in cList;

Step 3-3B) From the community duplicate-check table, query the pairs of communities in cList that have been identified as the same community, giving a list rList. Each element of rList holds two items: the matched source and the matched target;

Step 3-3C) Traverse rList and, for each element, add an edge in G between the communities corresponding to the matched source and the matched target;

Step 3-3D) Traverse G by depth-first search to generate a spanning forest F of G. Each tree in F is a maximal connected subgraph of G;

Each tree in F represents one community;
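Steps 3-3A) to 3-3D) amount to finding the connected components of an undirected graph. A minimal sketch with an iterative depth-first search follows; the function name `cluster_communities` and the representation of rList as (source ID, target ID) tuples are assumptions.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple


def cluster_communities(community_ids: Iterable[Hashable],
                        duplicate_pairs: Iterable[Tuple[Hashable, Hashable]]
                        ) -> List[list]:
    """Group community IDs into clusters linked by duplicate-check pairs."""
    graph = defaultdict(set)              # adjacency list G
    for source_id, target_id in duplicate_pairs:
        graph[source_id].add(target_id)   # undirected edge between the pair
        graph[target_id].add(source_id)

    seen = set()
    clusters = []
    for start in community_ids:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:                      # depth-first traversal from `start`
            node = stack.pop()
            component.append(node)
            for neighbour in graph[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters
```

Each returned cluster corresponds to one tree of the forest F, so a community listed under three websites with pairs (1, 2) and (2, 3) ends up in a single cluster even though 1 and 3 were never compared directly.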

Step 3-4) Traverse each tree Tree in the forest F generated in step 3-3) and query the house table for the listings in the communities represented by all nodes of Tree;

Step 3-5) Divide houses into 5 classes by number of rooms: 1 room, 2 rooms, 3 rooms, 4 rooms, and more than 4 rooms, and create a list for each class. Add each house in the listings to the list for its room count, then traverse the houses in each list and compare their features pairwise. Let the houses being compared be H1 and H2. Specifically:

Step 3-5-1) Check whether the buildings in which the houses are located have the same number of floors. If not, the two houses are judged to be different and the comparison ends; if so, perform steps 3-5-2) to 3-5-4):

Step 3-5-2) Calculate the similarity of the floors on which the houses are located. When publishing listings, websites currently do not give the exact floor but only describe it approximately, and the descriptions differ between websites;

The calculation is as follows: build a synonym table for floor descriptions from the way each website describes floors. When the floors of two houses are compared, the similarity is 1 if the two descriptions are synonyms and 0 otherwise;

Step 3-5-2) Calculate the area similarity using formula (6):

Area similarity = |1 - |area difference| / Min(area)| (6)

where |area difference| is the absolute value of the difference between the floor areas of H1 and H2, and Min(area) is the smaller of the two floor areas (the floor area of H1 if they are equal);

Step 3-5-3) Calculate the price similarity using formula (7):

Price similarity = |1 - |price difference| / Min(price)| (7)

where |price difference| is the absolute value of the difference between the listing prices of H1 and H2, and Min(price) is the smaller of the two listing prices (the listing price of H1 if they are equal);

Step 3-5-4) Calculate the orientation similarity using formula (8):

Orientation similarity = 1 - (eDistance / maxlength(orientation)) (8)

where eDistance is the string edit distance between the orientations of H1 and H2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the orientation of H1 into the orientation of H2; maxlength(orientation) is the larger of the lengths of the two orientation strings (the length of the orientation of H1 if they are equal);

Step 3-6) If the floor similarity is 1, the area similarity and the price similarity are both greater than the threshold T4, and the orientation similarity is greater than the threshold T5, the two houses are regarded as similar;

Threshold T4 is set to 0.95 and threshold T5 is set to 0.5;
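The feature comparison of steps 3-5-1) to 3-6) can be sketched as a single predicate. The `House` fields and the floor synonym table are hypothetical; real websites use their own floor labels, and the sketch assumes positive areas and prices.

```python
from dataclasses import dataclass

T4, T5 = 0.95, 0.5  # threshold values stated in the text

# Hypothetical synonym table mapping each site's floor label to a shared key.
FLOOR_SYNONYMS = {"低楼层": "low", "低层": "low",
                  "中楼层": "mid", "中层": "mid",
                  "高楼层": "high", "高层": "high"}


@dataclass
class House:
    building_floors: int   # number of floors of the building
    floor_label: str       # approximate floor as described by the site
    area: float            # floor area, assumed > 0
    price: float           # listing price, assumed > 0
    orientation: str


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit costs."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def ratio_similarity(v1: float, v2: float) -> float:
    """Formulas (6) and (7)."""
    return abs(1.0 - abs(v1 - v2) / min(v1, v2))


def is_similar_house(h1: House, h2: House) -> bool:
    if h1.building_floors != h2.building_floors:        # building height check
        return False
    floor1 = FLOOR_SYNONYMS.get(h1.floor_label, h1.floor_label)
    floor2 = FLOOR_SYNONYMS.get(h2.floor_label, h2.floor_label)
    if floor1 != floor2:                                # floor similarity is 0
        return False
    if ratio_similarity(h1.area, h2.area) <= T4:
        return False
    if ratio_similarity(h1.price, h2.price) <= T4:
        return False
    longest = max(len(h1.orientation), len(h2.orientation)) or 1
    orientation_sim = 1.0 - edit_distance(h1.orientation, h2.orientation) / longest
    return orientation_sim > T5                         # formula (8) against T5
```

Pairs that pass this predicate are only candidates; the floor plan comparison that follows makes the final decision.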

Step 3-7) For houses judged similar in step 3-6), the layout must be examined further, because in practice many listings are similar in floor, area, price and orientation. Floor plans currently draw the wall outline of the house with thick black lines and add furniture, decorations and the like. Floor plans are therefore compared by extracting the wall information from each plan and comparing the wall distribution in the horizontal and vertical directions, as follows:

Step 3-7-1) Download the floor plans of H1 and H2 from the URLs of the listings and save them; call each one image. Compute the wall information of the floor plans of H1 and H2 separately by steps 3-7-2) to 3-7-10);

Step 3-7-2) Resize image to a resolution of 100*100, giving image100;

Step 3-7-3) Convert image100 to grayscale, giving grayimage100;

Step 3-7-4) Compute the gray-level histogram h of grayimage100;

Step 3-7-5) Extract from the histogram the gray value that represents wall lines: scan h from left to right and find the first gray value g whose difference from its neighboring region in h exceeds the threshold T6;

Threshold T6 is set to 50;

Step 3-7-6) Binarize grayimage100 using the gray value g obtained in step 3-7-5), as follows:

Traverse all pixels of grayimage100. Let the gray value of a pixel be p(i, j) and compute the absolute value of the difference between p(i, j) and g. If |p(i, j) - g| is greater than the threshold T7, set the gray value of the pixel to 255; otherwise set it to 0;

Threshold T7 is set to 10;
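The gray-level histogram and the binarization can be sketched in plain Python on an image held as a list of rows of integers in 0..255. The function names are hypothetical, and the wall gray value is passed in as a parameter because the exact rule for choosing it is not reproduced here.

```python
from typing import List

T7 = 10  # binarization tolerance stated in the text

GrayImage = List[List[int]]


def gray_histogram(gray: GrayImage) -> List[int]:
    """Count how many pixels take each of the 256 gray levels."""
    hist = [0] * 256
    for row in gray:
        for p in row:
            hist[p] += 1
    return hist


def binarize(gray: GrayImage, wall_gray: int, tolerance: int = T7) -> GrayImage:
    """Pixels within `tolerance` of the wall gray level become 0, all others 255."""
    return [[255 if abs(p - wall_gray) > tolerance else 0 for p in row]
            for row in gray]
```

After this step the wall lines are the pixels with value 0, which is what the later row and column counts rely on.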

Step 3-7-8) Remove "large-area" objects from the floor plan, as follows:

Scan the floor plan with a rectangle of fixed size and count the pixels pcount whose gray value is 0 inside the rectangle. If pcount exceeds the threshold T8, the content is regarded as a "large-area" object and is set to 0;

The rectangle is 5*5 and threshold T8 is set to 16;

Step 3-7-9) Compute the wall information in the floor plan, as follows:

Scan the binarized image row by row and count the pixels with gray value 0 in each row, giving the row histogram rowHist; scan the binarized image column by column and count the pixels with gray value 0 in each column, giving the column histogram colHist;

Step 3-7-10) Scan rowHist and find its peaks, which represent the walls of the floor plan, giving the wall distribution sequence wallSeq1; scan colHist and find its peaks, giving the wall distribution sequence wallSeq2;

The elements of wallSeq1 and wallSeq2 are pairs of the following form:

<location, length>

where location is the position of a peak in rowHist (colHist) and represents the position of a wall in the floor plan, and length is the value of the peak and represents the length of the wall;
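The row and column counts and the extraction of <location, length> pairs can be sketched as below. Treating a peak as a local maximum of the histogram, and the `min_length` filter, are assumptions; the text does not define how peaks are detected.

```python
from typing import List, Tuple


def projections(binary: List[List[int]]) -> Tuple[List[int], List[int]]:
    """Return (rowHist, colHist): counts of 0-valued pixels per row and per column."""
    row_hist = [sum(1 for p in row if p == 0) for row in binary]
    col_hist = [sum(1 for p in col if p == 0) for col in zip(*binary)]
    return row_hist, col_hist


def wall_sequence(hist: List[int], min_length: int = 1) -> List[Tuple[int, int]]:
    """Return (location, length) pairs at local maxima of `hist`.

    A value counts as a peak when it is strictly greater than its left
    neighbour and not smaller than its right neighbour, so a flat run
    contributes only its first position.
    """
    seq = []
    for i, value in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if value >= min_length and value > left and value >= right:
            seq.append((i, value))
    return seq
```

Applying `wall_sequence` to rowHist gives wallSeq1 and applying it to colHist gives wallSeq2; both come out sorted by location, which the later two-pointer comparison depends on.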

Step 3-7-11) Calculate the similarity between wallSeq1 of H1 and wallSeq1 of H2, as follows:

Step 3-7-11A) Let H1wallSeq1 and H2wallSeq1 denote wallSeq1 of H1 and wallSeq1 of H2. Let n1 be the number of pairs in H1wallSeq1 and n2 the number of pairs in H2wallSeq1. Let i1 be the index of the pair currently being compared in H1wallSeq1, initially 0, and i2 the index of the pair currently being compared in H2wallSeq1, initially 0. Let m be the number of matching elements in H1wallSeq1 and H2wallSeq1, initially 0;

Step 3-7-11B) Check i1 and i2: if i1 is less than n1 and i2 is less than n2, perform step 3-7-11C); otherwise perform step 3-7-11E);

Step 3-7-11C) Calculate the wall position difference using formula (9):

diff_location = |H1wallSeq1[i1].location - H2wallSeq1[i2].location| (9)

If diff_location is less than the threshold T9, calculate the wall length difference of the same pair using formula (10) and then increase i1 and i2 by 1; otherwise perform step 3-7-11D):

diff_length = |H1wallSeq1[i1].length - H2wallSeq1[i2].length| (10)

If diff_length is less than the threshold T10, increase m by 1; then perform step 3-7-11B);

Threshold T9 is set to 4 and threshold T10 is set to 8;

Step 3-7-11D) If H1wallSeq1[i1].location is less than H2wallSeq1[i2].location, increase i1 by 1; otherwise increase i2 by 1;

Then perform step 3-7-11B);

Step 3-7-11E) Calculate the similarity of H1wallSeq1 and H2wallSeq1 using formula (11):

S1_1 = m / max(n1, n2) (11)

where max(n1, n2) is the larger of n1 and n2 (n1 if they are equal);
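Steps 3-7-11A) to 3-7-11E) describe a two-pointer walk over two sequences sorted by location. A minimal sketch follows; the function name is hypothetical, the length difference is computed for the matched pair before the indices advance, and returning 0 when both sequences are empty is an assumption.

```python
from typing import List, Tuple

T9, T10 = 4, 8  # position and length tolerances stated in the text

WallSeq = List[Tuple[int, int]]  # (location, length) pairs sorted by location


def sequence_similarity(seq1: WallSeq, seq2: WallSeq,
                        t_location: int = T9, t_length: int = T10) -> float:
    """Formula (11): m / max(n1, n2), with m the number of matched walls."""
    n1, n2 = len(seq1), len(seq2)
    if max(n1, n2) == 0:
        return 0.0  # assumption: no walls detected means no evidence of a match
    i1 = i2 = m = 0
    while i1 < n1 and i2 < n2:
        location1, length1 = seq1[i1]
        location2, length2 = seq2[i2]
        if abs(location1 - location2) < t_location:   # formula (9)
            if abs(length1 - length2) < t_length:     # formula (10)
                m += 1
            i1 += 1
            i2 += 1
        elif location1 < location2:                   # skip the wall that lags behind
            i1 += 1
        else:
            i2 += 1
    return m / max(n1, n2)
```

Dividing by the longer sequence penalizes a plan that has extra walls with no counterpart in the other plan.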

Step 3-7-12) By the method of step 3-7-11), calculate the similarity S1_2 between wallSeq1 of H1 and wallSeq2 of H2, the similarity S2_1 between wallSeq2 of H1 and wallSeq1 of H2, and the similarity S2_2 between wallSeq2 of H1 and wallSeq2 of H2;

Step 3-7-13) Calculate the floor plan similarity of H1 and H2 using formulas (12) and (13):

S1 = max(S1_1, S1_2) (12)

S2 = max(S2_1, S2_2) (13)

Step 3-8) If S1 and S2 are both above the threshold T11, the two are judged to be the same house and the house information is updated;

Here max(S1_1, S1_2) is the larger of S1_1 and S1_2 (S1_1 if they are equal), max(S2_1, S2_2) is the larger of S2_1 and S2_2 (S2_1 if they are equal), and threshold T11 is set to 0.8;

The update is performed as follows:

Let the source website priority of house H1 be λ1 and the source website priority of house H2 be λ2. If λ1 is less than λ2, save H1 as the matched source and H2 as the matched target in the house duplicate-check table; otherwise save H2 as the matched source and H1 as the matched target in the house duplicate-check table;

Step 3-9) Every house that appears in the matched-target field of the house duplicate-check table is a duplicate house.

Beneficial effect

Compared with the prior art, the method proposed in the present invention for determining the identity of real estate data from different information sources has the following beneficial effects:

1. The method can deduplicate housing data from different websites, determine accurately and efficiently whether housing data from different information sources are identical, and remove duplicate regions and communities. It enables effective fusion of multi-source heterogeneous housing data and provides "clean" and "tidy" data for real estate market analysis;

2. The transaction information published by real estate websites truly reflects the current dynamics of the real estate market. The method proposed in the present invention can effectively fuse the transaction data of multiple real estate websites and so reflect the state of real estate transactions more faithfully. More accurate market dynamics can be mined from the housing data of multiple websites and used in government decision-making, business operations, personal investment and similar fields, advancing real estate market analysis technology and raising the level of real estate market monitoring and analysis.

Description of the drawings

Fig. 1 is the system architecture of the method of the present invention for determining the identity of real estate data from different information sources;

Fig. 2 is the system processing flow of the method of the present invention;

Fig. 3 is the cell deduplication flow of step 2 of the method of the present invention and of Embodiment 1;

Fig. 4 is the house deduplication flow chart of step 3 of the method of the present invention and of Embodiment 1;

Fig. 5 is the cell clustering schematic diagram of step 3-3 of the method of the present invention and of Embodiment 1;

Fig. 6 is the cell clustering result diagram of step 3-3 of the method of the present invention and of Embodiment 1;

Fig. 7 is the schematic diagram of classifying houses by number of rooms in step 3-5 of the method of the present invention and in Embodiment 1.

Specific embodiments

In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in more detail below through specific embodiments with reference to the drawings. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit it.

Embodiment 1

This embodiment describes a specific implementation of the method of the present invention for determining the identity of real estate data from different information sources.

The implementation is illustrated by the system architecture in Fig. 1; Fig. 2 is the system processing flow of the method. The present invention sits between the data acquisition system and the data analysis system and is an intermediate link in real estate data processing. The data acquisition system collects real estate transaction data from each real estate transaction website, including cell data, urban area data, house data, house fetched data and so on, and stores it in the real estate base database.

Using the method proposed by the present invention, the house data in the real estate base database is deduplicated and the processed data is stored in the real estate analysis database; the analysis platform performs data analysis, calculation and processing on this basis.

Table 1 House label table

Serial number	Field name	Type	Explanation
1	House_uid	Int	House identifier saved by the crawler system
2	city	Varchar(100)	City
3	district	Varchar(100)	Region
4	Community_uid	Int	Cell identifier saved by the crawler system
5	price	Float	Current listed price of the house
6	Price_unit	Varchar(50)	Unit of the price
7	Build_area	Float	Construction area of the house
8	Use_area	Float	Usable area of the house
9	Area_unit	Varchar(50)	Unit of area
10	Build_floor	Int	Total number of floors of the building in which the house is located
11	Current_floor	Varchar(50)	Floor on which the house is located
12	rooms	Int	Number of rooms in the house
13	halls	Int	Number of halls in the house
14	towards	Varchar(50)	Orientation of the house
15	repeated	Bit	Whether the house duplicates another house
16	checked	Bit	Whether the house has been checked

Table 2 Website source table

Serial number	Field name	Type	Explanation
1	Web site name	Varchar(50)	Website name
2	Website priority	Int	Website priority

Table 3 Urban area table

Table 4 Cell table

Serial number	Field name	Type	Explanation
1	community_id	Int	Cell ID
2	Area_id	Int	ID in the corresponding region table
3	Community_name	Varchar(100)	Cell name
4	latitude	Float	Latitude
5	longitude	Float	Longitude
6	repeated	Bit	Whether the cell duplicates another cell
7	checked	Bit	Whether the cell has been checked

Table 5 Cell match table

Table 6 House match table

First, the collected urban areas are deduplicated. The urban area information collected in the real estate base database is retrieved; if a region is not yet in the analysis database, it is added to the analysis database, and the region information is then processed according to step 1 of the present invention.

Different websites describe the same region of the same city differently. For example, the Lianjia website describes the regions of Shanghai as follows:

Jing'an, Xuhui, Huangpu, Changning ...

The 5i5j (Wo Ai Wo Jia) website describes the regions of Shanghai as follows:

Jing'an District, Xuhui District, Huangpu District, Changning District ...

Regions are discriminated by name: whether two entries are the same region is judged from the region name. An analysis of how websites such as Lianjia, 5i5j, Centaline and Maitian name regions, and of how people habitually refer to regions, shows that a region name generally has a "core word". Some names consist of the "core word" alone, while others append a suffix such as "District" or "County" to it, as with "Jing'an" and "Jing'an District" above. Region names are therefore highly regular and can be discriminated with a rule-based method.
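The "core word" rule can be sketched as below. This is only an illustration under stated assumptions: the suffix list and the function names core_word and same_region are hypothetical, and a real rule set would be written from the naming habits of each website.

```python
# Hypothetical suffix list for illustration; the text only names
# "District" and "County" style suffixes as examples.
SUFFIXES = (" District", " County", "区", "县")


def core_word(region_name: str) -> str:
    """Strip one known suffix, leaving the 'core word' of the region name."""
    name = region_name.strip()
    for suffix in SUFFIXES:
        # Keep the name unchanged if it consists of the suffix alone.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)].strip()
    return name


def same_region(name_a: str, name_b: str) -> bool:
    """Two names refer to the same region if their core words are equal."""
    return core_word(name_a) == core_word(name_b)
```

With these rules, "Jing'an" and "Jing'an District" reduce to the same core word and are treated as one region.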

Second, the collected cells are deduplicated. The cell information collected in the real estate base database is retrieved; if a cell is not yet in the analysis database, it is added to the analysis database.

Real estate websites also differ in how they describe the name and the property management company of the same cell. For example, Lianjia refers to the "Fahua Nanli" cell in Chongwenmen, Dongcheng District, Beijing, as "Fahua Nanli", whereas the Maitian website calls it "Fahuasi Nanli". Cell names and property management company names are therefore discriminated by string similarity.

The flow chart of cell duplicate removal is as shown in Figure 3.

The regions are queried by group according to the District_alias field and the province field of the urban area table (Table 3) to obtain the urban area list. The urban area list is traversed, and for each region the cell list ClistAll is queried by the region name in the District_alias field. Then, according to the checked field of each cell (1 means already checked, 0 means not yet checked), the cells whose checked value is 0, i.e. the cells that have not undergone deduplication, are collected into the list ClistUncheck. The cells in ClistAll and ClistUncheck are compared pairwise for similarity using the method introduced in step 2 of the present invention. If two cells are similar, the cell information is written to the cell match table (Table 5).

Finally, the house information is deduplicated. House information is queried from the house label table (Table 1) and deduplicated cell by cell; the flow is shown in Fig. 4.

The regions are queried by group according to the District_alias field and the province field of the urban area table (Table 3) to obtain the region list. The region list is traversed, and the cell list ClistAll of each region is queried by the region name in the District_alias field. After cell deduplication, the match source src_uid and the match target target_uid in the cell match table (Table 5) essentially form an undirected graph, as shown in Fig. 5. Each node in the graph corresponds to one cell, and an edge indicates that two cells are the same cell. Following the method of step 3 of the present invention, the cells that have been matched as identical are found and clustered. As shown in Fig. 6, the clustering result forms the list ClistCluster of same-cell clusters, each element of which represents one cluster of identical cells.

ClistCluster is traversed, and the house information in each cluster is looked up to obtain the list HlistAll of all houses in the cell. According to the checked field of each house (1 means already checked, 0 means not yet checked), the list HlistUncheck of houses whose checked value is 0 is obtained. To improve efficiency and reduce the number of comparisons, the houses in HlistAll and HlistUncheck are divided into 5 classes by number of rooms, as shown in the schematic diagram of Fig. 7. Using the house deduplication algorithm introduced in step 3 of the present invention, houses with the same number of rooms in HlistAll and HlistUncheck are compared pairwise.
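The bucketing by number of rooms can be illustrated with the sketch below. It assumes each house is a dict carrying the House_uid and rooms fields of Table 1; the function names, the handling of room counts below 1, and the suppression of repeated pairs are assumptions made for the example.

```python
from collections import defaultdict


def room_class(rooms: int) -> int:
    """Map a room count to one of 5 classes: 1, 2, 3, 4, or 5 for 'more than 4'.
    Counts below 1 are placed in class 1 (an assumption for this sketch)."""
    return min(max(rooms, 1), 5)


def candidate_pairs(hlist_all, hlist_uncheck):
    """Yield (unchecked_house, other_house) pairs that share a room class."""
    buckets = defaultdict(list)
    for house in hlist_all:
        buckets[room_class(house["rooms"])].append(house)

    seen = set()
    for unchecked in hlist_uncheck:
        for other in buckets[room_class(unchecked["rooms"])]:
            key = frozenset((unchecked["House_uid"], other["House_uid"]))
            # Skip self-comparison and pairs already produced in the other order.
            if len(key) == 2 and key not in seen:
                seen.add(key)
                yield unchecked, other
```

Only houses in the same class are ever paired, which is what reduces the number of comparisons relative to comparing every unchecked house with every house in the cluster.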

If the compared houses are similar, the matched houses are written to the house match table (Table 6).

The processed house data has had duplicate house records removed and can provide "clean" and "tidy" data support for subsequent real estate analysis. Moreover, the processed data is effectively separated from the originally collected data, which ensures the modularity of the analysis system and the acquisition system and improves the stability and independence of the real estate analysis system.

The above is a preferred embodiment of the present invention, and the present invention should not be limited to the content disclosed in the embodiment and the drawings. Any equivalent or modification completed without departing from the spirit disclosed by the present invention falls within the scope of protection of the present invention.

Claims

1. A method for determining the identity of real estate data from different information sources, characterized in that:

The method is based on the house data published by existing multi-source, heterogeneous real estate transaction websites; it analyzes the characteristics of houses, determines the identity of houses, and rejects duplicate houses;

Specifically, it is based on the house data published by the Lianjia, 5i5j, Centaline and Maitian websites and the correlations among them; by analyzing the characteristics of house data, duplicate house data is rejected through three major steps: region deduplication, cell deduplication and house deduplication;

Wherein, the characteristic of house data is that it describes an actual house object: although the angle and manner of description differ, there are strong correlations between the data; when house data published by different websites describe the same house, the following correlations exist even though the descriptions are not identical:

A. The house address is the same: the cell in which the house is located is the same cell, and the building in which it is located is the same building; because few websites publish information about the building in which a house is located, this can be judged from the total number of floors of that building;

B. The basic attributes of the house are the same: the basic attributes of a house include floor area, house type and orientation;

C. The owner's expectation is the same: when renting out or selling a house, the owner's price expectation is consistent, so this can be judged from the listed price;

Regions are discriminated by region name to judge whether they are the same region; an analysis of how house transaction websites name regions and of how people habitually refer to regions shows that a region name generally has a "core word"; some names consist of the "core word" alone, and others append a suffix to the "core word";

Step 1) further comprises the following sub-steps:

Step 1-1) To facilitate subsequent statistics and analysis, establish a region information table with fields that store the city of each region collected from the websites, the region name, the alias of the region and the alias source priority;

The alias field is used to identify the region: when operating on regions, whether two entries are the same region is judged by whether their aliases are identical; the alias source priority is used to decide whether the alias field should be updated;

Step 1-2) Determine the priority of each real estate website;

Step 1-3) Using the real estate website priorities set in step 1-2), query all regions under the same city; compare the regions in the resulting list pairwise, the comparison being whether the region names are similar, and update the region aliases according to the result, specifically:

Whether region names are similar is judged by a rule-based method: decision rules are written according to how each website names regions and are then applied to the region names to be judged; if two region names are determined to be the same, the region aliases are updated according to the following rules:

Let the regions currently compared be A and B. If the names of region A and region B match, check whether the aliases of A and B have been assigned: if neither alias has been assigned, perform operation 1-3A); if the alias of A has been assigned and the alias of B has not, perform operation 1-3B); if the alias of A has not been assigned and the alias of B has, perform operation 1-3C); if the aliases of both A and B have been assigned, perform operation 1-3D);

1-3A) Obtain the priority λa of the source website of region A and the priority λb of the source website of region B; if λa is greater than λb, the name of region B is taken as the alias of region A and λb as the alias source priority of region A, both are saved in the region information table, and the method skips to step 1-4); otherwise, the name of region A is taken as the alias of region B and λa as the alias source priority of region B, both are saved in the region information table, and the method skips to step 1-4);

1-3B) Obtain the alias source priority λa' of region A and the priority λb of the source website of region B; if λb is greater than λa', the alias of A is taken as the alias of B and λa' as the alias source priority of B, and the method skips to step 1-4); otherwise, the alias of A is updated to the name of B, the alias source priority of region A is updated to λb, and the method skips to step 1-4);

1-3C) Obtain the alias source priority λb' of region B and the priority λa of the source website of region A; if λa is greater than λb', the alias of B is taken as the alias of A and λb' as the alias source priority of A, and the method skips to step 1-4); otherwise, the alias of B is updated to the name of A, the alias source priority of region B is updated to λa, and the method skips to step 1-4);

1-3D) Obtain the alias source priority λb' of region B and the alias source priority λa' of region A; if λa' is greater than λb', the alias of A is updated to the alias of B, the alias source priority of A is updated to λb', and the method skips to step 1-4); otherwise, the alias of B is updated to the name of A, the alias source priority of region B is updated to λa', and the method skips to step 1-4);

Step 1-4) After the region names have been compared and matched in step 1-3), for every region whose alias field is NULL, the alias is set to the region name;
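The four update cases 1-3A) to 1-3D) and the fill-in of step 1-4) can be written out as the following sketch. The Region class and function names are hypothetical, and the sketch is called only after the two region names have been matched. One point is an assumption rather than a reading of the text: in the "otherwise" branch of 1-3C) the alias source priority written to B is taken to be λa, mirroring 1-3B).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Region:
    name: str
    site_priority: int                    # priority of the source website
    alias: Optional[str] = None           # None stands for an unassigned (NULL) alias
    alias_priority: Optional[int] = None  # alias source priority


def update_alias(a: Region, b: Region) -> None:
    """Apply rule 1-3A), 1-3B), 1-3C) or 1-3D) to two regions whose names matched."""
    a_set, b_set = a.alias is not None, b.alias is not None
    if not a_set and not b_set:                      # 1-3A)
        if a.site_priority > b.site_priority:
            a.alias, a.alias_priority = b.name, b.site_priority
        else:
            b.alias, b.alias_priority = a.name, a.site_priority
    elif a_set and not b_set:                        # 1-3B)
        if b.site_priority > a.alias_priority:
            b.alias, b.alias_priority = a.alias, a.alias_priority
        else:
            a.alias, a.alias_priority = b.name, b.site_priority
    elif b_set and not a_set:                        # 1-3C)
        if a.site_priority > b.alias_priority:
            a.alias, a.alias_priority = b.alias, b.alias_priority
        else:
            # Assumption: priority taken from A's website, mirroring 1-3B).
            b.alias, b.alias_priority = a.name, a.site_priority
    else:                                            # 1-3D)
        if a.alias_priority > b.alias_priority:
            a.alias, a.alias_priority = b.alias, b.alias_priority
        else:
            b.alias, b.alias_priority = a.name, a.alias_priority


def fill_missing_aliases(regions) -> None:
    """Step 1-4): a region whose alias is still NULL takes its own name as alias."""
    for region in regions:
        if region.alias is None:
            region.alias = region.name
```

After all pairs have been processed and the missing aliases filled in, regions that refer to the same place share one alias, which is what later queries by region alias rely on.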

Step 2) Cell deduplication: determine the identity of the cells within each urban area, specifically comprising the following sub-steps:

Step 2-1) Establish a cell duplicate-check table to store the identifiers of cells determined to be the same cell; the table has two fields: the match-source cell identifier and the match-target cell identifier;

Wherein, the cell identifier is the ID of the cell;

Step 2-2) Using the website priorities determined in step 1-2), search the cell table for cells whose region alias is the same according to the region information table, and compare the features of the cells pairwise; let the cells to be compared be C1 and C2; this specifically comprises the following sub-steps:

Wherein, |cell longitude difference| denotes the absolute value of the difference between the longitude of C1 and the longitude of C2, and |cell latitude difference| denotes the absolute value of the difference between the latitude of C1 and the latitude of C2;

Plot area similarity = |1 - |plot area difference| / Min(plot area)| (2)

Wherein, |plot area difference| denotes the absolute value of the difference between the plot area of C1 and the plot area of C2; Min(plot area) denotes the smaller of the plot area of C1 and the plot area of C2, and if the two are equal, it is the plot area of C1;

If there is no construction area information, the plot area similarity is 1;

Wherein, |cell total amount difference| denotes the absolute value of the difference between the total amount of cell C1 and the total amount of cell C2; Min(cell total amount) denotes the smaller of the total amount of cell C1 and the total amount of cell C2, and if the two are equal, it is the total amount of cell C1;

Cell name similarity = 1 - (eDistance / maxlength(cell name)) (4)

Wherein, eDistance is the string edit distance between the cell name of C1 and the cell name of C2, calculated as the minimum number of single-character insertions, modifications and deletions required to convert the cell name of C1 into the cell name of C2; maxlength(cell name) is the larger of the length of the cell name of C1 and the length of the cell name of C2, and if the two are equal, it is the length of the cell name of C1;

Property management company name similarity = 1 - (eDistance / maxlength(property management company name)) (5)

Wherein, eDistance is the string edit distance between the property management company name of C1 and that of C2, calculated as the minimum number of single-character insertions, modifications and deletions required to convert the property management company name of C1 into that of C2; maxlength(property management company name) is the larger of the length of the property management company name of C1 and that of C2, and if the two are equal, it is the length of the property management company name of C1;
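Formulas (4) and (5) share the same form, so a single sketch covers both. The edit distance below is the standard dynamic-programming (Levenshtein) computation, which matches the insert, modify and delete operations named in the text; the function names and the treatment of two empty strings are assumptions.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-character insertions, modifications
    and deletions needed to turn s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # modify or keep
        prev = curr
    return prev[-1]


def name_similarity(s: str, t: str) -> float:
    """1 - eDistance / maxlength, as in formulas (4) and (5)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0  # assumption: two empty names are treated as identical
    return 1 - edit_distance(s, t) / longest
```

As a usage example, two names that differ by one inserted character out of five, such as "法华南里" and "法华寺南里", have an edit distance of 1 and a similarity of 1 - 1/5.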

Step 2-3) Using the cell feature similarities calculated in step 2-2), judge the identity of the cells: if the position difference of the cells is less than the set threshold T1, the area similarity and the total amount similarity are greater than the set threshold T2, and the similarities of the cell name and the property management company name are greater than the set threshold T3, the two are determined to be the same cell;

Wherein, the threshold T1 is chosen on the basis that a difference of 0.01 degree in longitude or latitude corresponds to a distance of about 1000 meters, and T1 is set to 0.02; the threshold T2 is set to 0.95; the threshold T3 is set to 0.9;

Step 2-4) If cells C1 and C2 are determined to be the same cell in step 2-3), C1 and C2 are written to the cell duplicate-check table, specifically: obtain the source website priority λ1 of cell C1 and the source website priority λ2 of cell C2; if λ1 is greater than λ2, C2 is taken as the match source and C1 as the match target, and the pair is saved in the cell duplicate-check table; otherwise C1 is taken as the match source and C2 as the match target, and the pair is saved in the cell duplicate-check table;
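The threshold test of step 2-3) can be sketched as follows, using the stated values of T1, T2 and T3. The position difference, total amount similarity and name similarities are taken as precomputed inputs, because their formulas are not reproduced in this sketch; the function names, and treating a missing or non-positive area as "no area information", are assumptions.

```python
T1, T2, T3 = 0.02, 0.95, 0.9  # threshold values stated in the text


def plot_area_similarity(area1, area2) -> float:
    """Formula (2); returns 1 when area information is missing.
    A non-positive area is also treated as missing (an assumption)."""
    if area1 is None or area2 is None or min(area1, area2) <= 0:
        return 1.0
    return abs(1 - abs(area1 - area2) / min(area1, area2))


def is_same_cell(position_diff, area_sim, total_sim, name_sim, company_sim) -> bool:
    """Step 2-3): all five feature conditions must hold."""
    return (position_diff < T1
            and area_sim > T2 and total_sim > T2
            and name_sim > T3 and company_sim > T3)
```

Note that the conditions are conjunctive: a single feature below its threshold is enough to keep two cells apart.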

Wherein, the house information features include the total number of floors of the building in which the house is located, the floor on which the house is located, the listed price, the construction area, the house-type structure, the floor plan and the orientation; the identity of houses is then determined according to these house information features;

Step 3) specifically comprises the following sub-steps:

Wherein, the house identifier is the ID of the house;

Step 3-2) According to the regions of the city, search for the cells of the same region, establish the cell list cList, and sort it by website priority;

Step 3-3A) Construct an adjacency list G in which each node represents one cell of the cell list cList;

Step 3-3C) Traverse rList and, for the cells corresponding to each match source and match target, add an edge to the adjacency list G;

Step 3-3D) Traverse the adjacency list G by depth-first search to generate the spanning forest F of G; each tree in the forest F is a maximal connected subgraph of G;

Wherein, each tree in F represents one cell;
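Steps 3-3A) to 3-3D) amount to finding the connected components of an undirected graph. A minimal sketch follows; it assumes that rList is a list of (match source, match target) cell identifier pairs and that every identifier in rList also appears in cList, neither of which is spelled out in the steps above. The function name is hypothetical.

```python
def cluster_cells(c_list, r_list):
    """Group cell IDs into clusters of identical cells by depth-first search.
    Assumes every ID in r_list also appears in c_list."""
    # Step 3-3A): one node per cell.
    graph = {cell: [] for cell in c_list}
    # Step 3-3C): one undirected edge per (match source, match target) pair.
    for source, target in r_list:
        graph[source].append(target)
        graph[target].append(source)

    # Step 3-3D): iterative depth-first traversal; each tree is one component.
    visited, forest = set(), []
    for start in c_list:
        if start in visited:
            continue
        tree, stack = [], [start]
        visited.add(start)
        while stack:
            node = stack.pop()
            tree.append(node)
            for neighbour in graph[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    stack.append(neighbour)
        forest.append(tree)
    return forest
```

An explicit stack is used instead of recursion so that long chains of matched cells do not hit Python's recursion limit.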

Step 3-4) Traverse each tree Tree in the forest F generated in step 3-3), and query from the house table the list of houses in the cells represented by all nodes of Tree;

Step 3-5) Divide the houses into 5 classes by number of rooms: 1 room, 2 rooms, 3 rooms, 4 rooms and more than 4 rooms, and establish a list for each class; add each house in the house list to the corresponding class list according to its number of rooms; traverse the houses in each class list and compare the features of the houses pairwise; let the houses to be compared be H1 and H2, specifically:

Step 3-5-1) Judge whether the total number of floors of the buildings in which the houses are located is the same; if not, the two houses are judged to be different and the judgment ends; if so, perform steps 3-5-2) to 3-5-4):

Step 3-5-2) Calculate the similarity of the floor on which the house is located: when publishing house information, websites currently do not publish the specific floor but only describe the approximate floor of the house, and the descriptions also differ between websites;

The specific calculation method is: according to each website's description of floors, establish the synonyms of each floor description; when comparing the floors of two houses, if their floor descriptions belong to a pair of synonyms, the similarity is 1, otherwise it is 0;

Area similarity = |1 - |area difference| / Min(area)| (6)

Wherein, |area difference| denotes the absolute value of the difference between the floor area of H1 and the floor area of H2; Min(area) denotes the smaller of the floor area of H1 and the floor area of H2, and if the two are equal, it is the floor area of H1;

Price similarity = |1 - |price difference| / Min(price)| (7)

Wherein, |price difference| denotes the absolute value of the difference between the listed price of H1 and the listed price of H2; Min(price) denotes the smaller of the listed price of H1 and the listed price of H2, and if the two are equal, it is the listed price of H1;

Orientation similarity = 1 - (eDistance / maxlength(orientation)) (8)

Wherein, eDistance is the string edit distance between the orientation of H1 and the orientation of H2, calculated as the minimum number of single-character insertions, modifications and deletions required to convert the orientation of H1 into the orientation of H2; maxlength(orientation) is the larger of the length of the orientation of H1 and the length of the orientation of H2, and if the two are equal, it is the length of the orientation of H1;

Step 3-6) If the similarity of the floor on which the houses are located is 1, the area similarity and the price similarity are greater than the set threshold T4, and the orientation similarity is greater than the set threshold T5, the houses are regarded as similar houses;
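Steps 3-5-1) to 3-6) can be put together as in the sketch below. Several parts are assumptions for illustration: the synonym groups in FLOOR_SYNONYMS are invented examples, identical floor descriptions are treated as synonyms, the houses are dicts keyed by the field names of Table 1, the orientation similarity of formula (8) is passed in precomputed, and T4 and T5 are parameters because their values are not given here.

```python
# Hypothetical synonym groups; real groups would be built from each website's wording.
FLOOR_SYNONYMS = [
    {"low floor", "lower"},
    {"middle floor", "middle"},
    {"high floor", "upper"},
]


def floor_similarity(floor1: str, floor2: str) -> int:
    """1 if the two floor descriptions are synonyms, otherwise 0.
    Identical descriptions count as synonyms (an assumption)."""
    if floor1 == floor2:
        return 1
    return int(any(floor1 in group and floor2 in group for group in FLOOR_SYNONYMS))


def ratio_similarity(x: float, y: float) -> float:
    """Shared form of formulas (6) and (7): |1 - |x - y| / min(x, y)|.
    Assumes both values are positive."""
    return abs(1 - abs(x - y) / min(x, y))


def is_similar_house(h1: dict, h2: dict, orientation_sim: float,
                     t4: float, t5: float) -> bool:
    """Steps 3-5-1) and 3-6) for two houses of the same room class."""
    if h1["Build_floor"] != h2["Build_floor"]:   # step 3-5-1)
        return False
    return (floor_similarity(h1["Current_floor"], h2["Current_floor"]) == 1
            and ratio_similarity(h1["Build_area"], h2["Build_area"]) > t4
            and ratio_similarity(h1["price"], h2["price"]) > t4
            and orientation_sim > t5)
```

The early return on the building's floor count mirrors step 3-5-1), so the more expensive comparisons are only made for houses that could share a building.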

Step 3-7) For houses judged to be similar in step 3-6), because in practice many houses are similar in floor, area, price and orientation, the layout structure of the house must be judged further; current floor plans outline the walls of the house with relatively thick black lines and are decorated with furniture, ornaments and the like, so the method for comparing floor plans is to extract the wall information from the floor plan and compare the wall distribution in the horizontal and vertical directions, specifically:

Step 3-7-1) According to the URLs of houses H1 and H2, download and save the floor plans of H1 and H2, each denoted image, and calculate the wall information of the floor plans of H1 and H2 respectively by the method of steps 3-7-2) to 3-7-10);

Step 3-7-4) Calculate the gray-level histogram h of grayimage100;

Step 3-7-5) Extract from the gray-level histogram the color value that represents wall lines, specifically: scan the histogram h from left to right and find the first gray value g whose difference from its neighboring region in h exceeds the set threshold T6;

Wherein, the threshold T6 is set to 50;

Step 3-7-6) Binarize the grayscale image grayimage100 using the gray value g obtained in step 3-7-5), specifically:

Traverse all pixels of grayimage100; let the gray value of a pixel be p(i, j) and calculate the absolute value of the difference between p(i, j) and g; if |p(i, j) - g| is greater than the set threshold T7, the gray value of p(i, j) is set to 255, otherwise it is set to 0;

Wherein, the threshold T7 is set to 10;

Step 3-7-8) Remove "large-area" objects from the floor plan, specifically:

Scan the floor plan with a "rectangle" of a specific size and count the number pcount of pixels in the rectangle whose gray value is 0; if pcount exceeds the set threshold T8, the region is regarded as a "large-area" object and is set to 0;

Scan the binarized grayscale image row by row and count the pixels with gray value 0 in each row to obtain the histogram rowHist of gray-0 pixels per row; scan the binarized grayscale image column by column and count the pixels with gray value 0 in each column to obtain the histogram colHist of gray-0 pixels per column;

Step 3-7-10) Scan rowHist and calculate the salient points in rowHist, which represent the walls of the floor plan, to obtain the floor plan wall distribution sequence wallSeq1; scan colHist and calculate the salient points in colHist to obtain the floor plan wall distribution sequence wallSeq2;

Wherein, the elements of wallSeq1 and wallSeq2 are pairs of the following form:

<location, length>

Wherein, location is the position of a salient point in rowHist (colHist) and represents the position of the wall in the floor plan, and length is the value at the salient point in rowHist (colHist) and represents the length of the wall;
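The row and column projections and the extraction of <location, length> pairs can be sketched as follows. How a "salient point" is detected is not defined in the steps above, so the local-maximum rule and the min_length parameter used here are assumptions, as are the function names.

```python
def projection_histograms(binary):
    """Count gray-0 pixels per row (rowHist) and per column (colHist)."""
    row_hist = [sum(1 for pixel in row if pixel == 0) for row in binary]
    col_hist = [sum(1 for pixel in col if pixel == 0) for col in zip(*binary)]
    return row_hist, col_hist


def wall_sequence(hist, min_length=1):
    """Return (location, length) pairs for the salient points of hist.
    Assumed rule: a value of at least min_length that is greater than its
    left neighbour and not smaller than its right neighbour."""
    sequence = []
    for i, value in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if value >= min_length and value > left and value >= right:
            sequence.append((i, value))
    return sequence
```

In this reading, wallSeq1 would be wall_sequence(row_hist) and wallSeq2 would be wall_sequence(col_hist), both already ordered by location.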

Step 3-7-11) Calculate the similarity of wallSeq1 of H1 and wallSeq1 of H2, specifically:

Step 3-7-11A) Let n1 be the number of elements in the sequence H1wallSeq1 of H1 and n2 the number of elements in the sequence H2wallSeq1 of H2; i1 denotes the position of the element of H1wallSeq1 to be compared and is initially 0; i2 denotes the position of the element of H2wallSeq1 to be compared and is initially 0; m is the number of identical elements in H1wallSeq1 and H2wallSeq1 and is initially 0;

diff_location = |H1wallSeq1[i1].location - H2wallSeq1[i2].location| (9)

If diff_location is less than the threshold T9, i1 is incremented by 1, i2 is incremented by 1, and the wall length difference is calculated by formula (10); otherwise step 3-7-11D) is performed:

diff_length = |H1wallSeq1[i1].length - H2wallSeq1[i2].length| (10)

Step 3-7-11D) If H1wallSeq1[i1].location is less than H2wallSeq1[i2].location, i1 is incremented by 1; otherwise i2 is incremented by 1;

Perform step 3-7-11B);

S1_1 = m / max(n1, n2) (11)

Step 3-7-12) Using the method of step 3-7-11), calculate the similarity S1_2 of wallSeq1 of H1 and wallSeq2 of H2, the similarity S2_1 of wallSeq2 of H1 and wallSeq1 of H2, and the similarity S2_2 of wallSeq2 of H1 and wallSeq2 of H2;

S1 = max(S1_1, S1_2); (12)

S2 = max(S2_1, S2_2); (13)

Wherein, max(S1_1, S1_2) denotes the larger of S1_1 and S1_2 (S1_1 if the two are equal), max(S2_1, S2_2) denotes the larger of S2_1 and S2_2 (S2_1 if the two are equal), and the threshold T11 is set to 0.8;

The specific update method is as follows:

Let the source website priority of house H1 be λ1 and the source website priority of house H2 be λ2. If λ1 is less than λ2, H1 is taken as the match source and H2 as the match target, and the pair is saved in the house duplicate-check table; otherwise H2 is taken as the match source and H1 as the match target, and the pair is saved in the house duplicate-check table;
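The sequence comparison of steps 3-7-11) to 3-7-13) is a two-pointer walk over two location-ordered sequences. The sketch below fills in the parts not spelled out above with labeled assumptions: a pair counts towards m when both diff_location is below T9 and diff_length is below a second threshold (called t10 here, a hypothetical name), the length difference is computed for the matched pair before the indices advance, the loop ends when either sequence is exhausted, and two empty sequences score 0. T9 and t10 are parameters because their values are not given here.

```python
def wall_seq_similarity(seq1, seq2, t9, t10):
    """Similarity m / max(n1, n2) of two sequences of (location, length) pairs.
    t10 is an assumed threshold on the wall length difference."""
    n1, n2 = len(seq1), len(seq2)
    if max(n1, n2) == 0:
        return 0.0  # assumption: no walls extracted means no evidence of similarity
    i1 = i2 = m = 0
    while i1 < n1 and i2 < n2:
        loc1, len1 = seq1[i1]
        loc2, len2 = seq2[i2]
        if abs(loc1 - loc2) < t9:          # formula (9)
            if abs(len1 - len2) < t10:     # formula (10)
                m += 1
            i1 += 1
            i2 += 1
        elif loc1 < loc2:                  # step 3-7-11D)
            i1 += 1
        else:
            i2 += 1
    return m / max(n1, n2)                 # formula (11)


def house_type_similarity(h1_seq1, h1_seq2, h2_seq1, h2_seq2, t9, t10):
    """Formulas (12) and (13): S1 and S2 as the larger of the two pairings."""
    s1 = max(wall_seq_similarity(h1_seq1, h2_seq1, t9, t10),
             wall_seq_similarity(h1_seq1, h2_seq2, t9, t10))
    s2 = max(wall_seq_similarity(h1_seq2, h2_seq1, t9, t10),
             wall_seq_similarity(h1_seq2, h2_seq2, t9, t10))
    return s1, s2
```

Comparing each sequence of H1 against both sequences of H2 and keeping the larger value makes the result tolerant of a floor plan that one website publishes rotated by 90 degrees relative to another.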

2. The method for determining the identity of real estate data from different information sources according to claim 1, characterized in that the reason for performing region discrimination in step 1) is: the data of every house transaction website adopts a "city -> region -> cell -> house" hierarchy; to determine the location of a house, it must be determined whether the cells in which the houses are located are the same cell, and to determine a cell, it is first determined that the urban areas in which the cells are located are the same region, thereby improving the accuracy and efficiency of discrimination.

3. The method for determining the identity of real estate data from different information sources according to claim 1, characterized in that the reason for performing identity determination in step 2) is: websites describe cells differently; after the region deduplication of step 1) is completed, the alias of each region corresponds to a unique region, and the cells in the same region can be obtained by querying on the region alias;

Wherein, the cell information features include the cell name, the construction area, the total amount, the property management company and the longitude and latitude of the cell; the identity of cells is determined according to these features.

4. The method for determining the identity of real estate data from different information sources according to claim 1, characterized in that the reason for performing identity determination in step 3) is: the same house may be listed on different websites, and each website describes the house differently; after the cell deduplication of step 2) is completed, the same cells from different websites are clustered according to the information in the cell duplicate-check table; then, according to the clustering result, the house lists of the same cells originating from different websites are queried.