CN109977287A - Method for determining the identity of real-estate data from different sources
- Publication number
- CN109977287A (application number CN201910242011.3A)
- Authority
- CN
- China
- Prior art keywords
- house
- cell
- alias
- area
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/16—Real estate
Landscapes
- Business, Economics & Management (AREA)
- Tourism & Hospitality (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a method for determining whether real-estate data from different sources describe the same object, and belongs to the technical field of Internet data analysis and mining. The method is based on the house data published by the Lianjia, 5i5j (Wo Ai Wo Jia), Centaline (Zhongyuan) and Maitian websites and on the correlations within those data. House data are descriptions of actual house objects; although the angle and manner of description differ between sources, the data are strongly correlated. By analysing these characteristics, the method rejects duplicate house data in three major steps: region deduplication, community deduplication and house deduplication. The method can deduplicate house data originating from different websites, can accurately and efficiently determine the identity of house data from different sources, and can effectively remove duplicate regions and communities. It thereby achieves effective integration of multi-source heterogeneous house data and provides "clean" and "tidy" data for real-estate analysis.
Description
Technical field
The present invention relates to a method for determining the identity of real-estate data from different sources, and belongs to the technical field of Internet data analysis and mining.
Background art
Real estate is an important carrier of the national economy and a particularly important pillar industry in China. The state of the real-estate market and the trend of prices not only bear on the overall development of the national economy but also affect the living standards of the people. In recent years the real-estate market has been volatile and has become a focus of public attention.
How to strengthen monitoring of the real-estate market and analyse the trend of property prices has become an important topic. As China's real-estate market gradually matures, the core position of the second-hand housing market has become increasingly prominent and its ability to steer the whole market has gradually increased. In cities such as Beijing and Shanghai in particular, second-hand housing transactions already account for most of the real-estate market's transaction volume.
With the rapid development of the real-estate market and the widespread use of Internet technology, many house transaction websites now exist for second-hand housing, such as Lianjia, 5i5j (Wo Ai Wo Jia) and Centaline Property. These websites provide buyers and sellers with a convenient online trading platform: sellers publish listings on the platforms and buyers trade on them. Because it is convenient to operate and draws on a wide range of information sources, this online trading mode is increasingly favoured by the public. The data published on these platforms truly reflect the state of the real-estate market, and the websites have broad coverage and strong timeliness. Analysing the real-estate information on these real-time house transaction websites therefore allows the state of the real-estate market to be grasped more accurately.
Using the data of house transaction websites has become an important method of analysing the real-estate market. Many real-estate research institutions have recognised the value contained in the information published on these platforms and have begun to use the data to analyse market trends. The Lianjia research institute, for example, has analysed the real-estate market using the transaction information on the Lianjia website and achieved good results; the houses it studies, however, are limited to the transaction data of that one website. Many other institutions have begun to conduct research by crawling the data of real-estate transaction websites, which offers new research approaches for real-estate analysis.
A better approach is to combine the transaction data published by all real-estate transaction websites and analyse them as a whole, which reflects the state of the real-estate market more faithfully. When house data from multiple real-estate websites are aggregated, however, the same user may have published the same house on different websites, and the same house may even have been published more than once on the same website. Before real-estate analysis is carried out, the identity of the house information from the multiple source websites must therefore be determined and duplicate houses rejected; only then can a real-estate analysis system use the data effectively.
The house information published by each real-estate transaction website contains detailed features of the house for sale, such as the community/location, house price, floor, area and price. Starting from the extraction of these key features, the features can be analysed and used to determine the identity of houses and reject duplicates, providing an accurate and reliable data basis for real-estate market analysis and research.
Many researchers have studied data deduplication extensively and have proposed many methods. Most of the proposed methods deduplicate general text data (documents): the basic idea is to compute the similarity between texts and then deduplicate.
At present, however, there is no method for deduplicating house data merged from multiple real-estate websites. The house data published by real-estate transaction websites are semi-structured data containing rich house features, such as the community, area and floor; determining identity from the web-page text alone would reduce the accuracy of the judgment.
The present invention therefore extracts the key features of houses from the house information published by real-estate transaction websites, determines the identity of multi-source house data and removes duplicate houses. Merging these multi-source, heterogeneous house data provides an accurate and usable data set, and thus data support, for real-estate market analysis.
Summary of the invention
The object of the invention is to address the technical deficiencies of existing real-estate data, namely numerous information sources, chaotic information and ambiguous judgment criteria, by proposing a method for determining the identity of real-estate data from different sources. Based on the house data published by existing multi-source, heterogeneous real-estate transaction websites, the method analyses house features, determines the identity of houses and rejects duplicate houses; the resulting data set can be used in related technical fields such as real-estate analysis.
The core idea of the invention is based on the house data published by the Lianjia, 5i5j (Wo Ai Wo Jia), Centaline (Zhongyuan) and Maitian websites and on their correlations. By analysing the characteristics of house data, duplicate house data are rejected in three major steps (region deduplication, community deduplication and house deduplication), so that the website data of multiple house transaction platforms are merged more effectively and accurate, effective data support is provided for real-estate analysis.
House data are descriptions of actual house objects. Although the angle and manner of description differ, the data are strongly correlated. When house data published by different websites describe the same house, the following correlations hold even though the descriptions are not identical:
1. The house address is the same: the house is located in the same community and in the same building. Because few websites publish information about the building in which a house is located, this is judged from the number of floors of that building;
2. The basic attributes of the house are the same: these include the floor area, the layout (unit type) and the orientation;
3. The owner's expectation is the same: the owner's price expectation for the house is consistent when letting or selling it, so the listing price can be used for the judgment.
The present invention is achieved through the following technical solution.
The method for determining the identity of real-estate data includes the following steps:
Step 1) Region deduplication: different websites describe the same region of the same city differently, so regions are deduplicated first;
Region determination is performed for the following reason: the data of every house transaction website adopt a "city -> region -> community -> house" hierarchy. To determine the location of a house, it must be determined whether the communities in which houses are located are the same community; and before a community is determined, it is first determined that the urban regions in which the communities are located are the same region, which improves the accuracy and efficiency of the determination;
Regions are determined to be the same by their names. Analysis of how house transaction websites name regions, and of how people customarily refer to them, shows that a region name generally has a "core word": some names consist of the core word alone, while others append a suffix such as "district" or "county" to it;
That is, region names have a fairly regular structure and can be determined with a rule-based method;
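The rule-based name comparison can be sketched as follows. This is a minimal illustration only: the suffix list and the function names `core_word` and `same_region` are assumptions introduced here, since the source states only that some names append a "district"- or "county"-style suffix to the core word.

```python
# Hypothetical suffix list: the source mentions only "district"/"county"-style
# suffixes (in Chinese, for example "区" and "县").
REGION_SUFFIXES = ("区", "县")


def core_word(region_name: str) -> str:
    """Strip one trailing administrative suffix to expose the core word."""
    name = region_name.strip()
    for suffix in REGION_SUFFIXES:
        # Keep at least one character so a name equal to the suffix survives.
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[: -len(suffix)]
    return name


def same_region(name_a: str, name_b: str) -> bool:
    """Two region names match when their core words are identical."""
    return core_word(name_a) == core_word(name_b)
```

Real decision rules would be written per website, as step 1-3) describes; this sketch covers only the shared suffix case.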
Step 1) further includes the following sub-steps:
Step 1-1) To facilitate subsequent statistics and analysis, a region information table is established, with fields that store the city and the region name collected from the website, together with an alias field and an alias-source-priority field for the region;
The alias field identifies the region: when regions are operated on, two regions are judged to be the same region if their aliases are identical. The alias source priority is used to decide whether the alias field should be updated;
Step 1-2) The priority of each real-estate website is determined;
The website source priorities are 0, 1 and 2; the smaller the number, the higher the priority;
Step 1-3) According to the website priorities set in step 1-2), all regions of the same city are queried. The regions in the resulting list are compared pairwise by judging whether their names are similar, and the region aliases are updated according to the result, specifically:
Whether two region names are similar is judged with a rule-based method: decision rules are written according to the characteristics of each website's region names and are then applied to the region names to be judged. If two region names are judged identical, the region aliases are updated according to the following rules. Let the regions currently being compared be A and B. If the names of A and B match, it is checked whether the aliases of A and B have been assigned: if neither alias has been assigned, operation 1-3A) is performed; if the alias of A has been assigned and that of B has not, operation 1-3B) is performed; if the alias of A has not been assigned and that of B has, operation 1-3C) is performed; if both aliases have been assigned, operation 1-3D) is performed;
1-3A) Obtain the priority λa of the source website of region A and the priority λb of the source website of region B. If λa is greater than λb, the name of B is taken as the alias of A and λb as the alias source priority of A, both are stored in the region information table, and the procedure skips to step 1-4); otherwise, the name of A is taken as the alias of B and λa as the alias source priority of B, both are stored in the region information table, and the procedure skips to step 1-4);
1-3B) Obtain the alias source priority λa' of region A and the priority λb of the source website of region B. If λb is greater than λa', the alias of A is taken as the alias of B and λa' as the alias source priority of B, and the procedure skips to step 1-4); otherwise, the alias of A is updated to the name of B, the alias source priority of A is updated to λb, and the procedure skips to step 1-4);
1-3C) Obtain the alias source priority λb' of region B and the priority λa of the source website of region A. If λa is greater than λb', the alias of B is taken as the alias of A and λb' as the alias source priority of A, and the procedure skips to step 1-4); otherwise, the alias of B is updated to the name of A, the alias source priority of B is updated to λa, and the procedure skips to step 1-4);
1-3D) Obtain the alias source priority λb' of region B and the alias source priority λa' of region A. If λa' is greater than λb', the alias of A is updated to the alias of B, the alias source priority of A is updated to λb', and the procedure skips to step 1-4); otherwise, the alias of B is updated to the name of A, the alias source priority of B is updated to λa', and the procedure skips to step 1-4);
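The four cases 1-3A) to 1-3D) share one pattern: the label coming from the higher-priority source (smaller number) is propagated to the other region. The sketch below is one possible unified reading, not the claimed procedure itself. It assumes that an unassigned alias can be treated as the region's own name carrying its source website priority, and it resolves ties in favour of region A, which simplifies the tie handling of case 1-3B). The `Region` class and `merge_aliases` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Region:
    name: str
    source_priority: int                  # 0, 1, 2; smaller number = higher priority
    alias: Optional[str] = None           # None means "not yet assigned"
    alias_priority: Optional[int] = None  # priority of the site the alias came from


def _label(region: Region) -> Tuple[str, int]:
    """Current label and its priority: the alias if assigned, else the name."""
    if region.alias is not None:
        return region.alias, region.alias_priority
    return region.name, region.source_priority


def merge_aliases(a: Region, b: Region) -> None:
    """Call only after the names of a and b have been matched by the rules."""
    label_a, prio_a = _label(a)
    label_b, prio_b = _label(b)
    if prio_a > prio_b:
        # A's label comes from a lower-priority site: A adopts B's label.
        a.alias, a.alias_priority = label_b, prio_b
    else:
        # Otherwise B adopts A's label (assumption: ties favour A).
        b.alias, b.alias_priority = label_a, prio_a
```

Regions whose alias is still unassigned after all comparisons receive their own name as alias in step 1-4).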
Step 1-4) After the region names have been compared and matched in step 1-3), the alias of every region whose alias field is still NULL is set to the region's own name;
Step 1-5) Urban regions are queried and counted on the basis of the region aliases in the region information table;
Step 2) Community deduplication: the identity of the communities within an urban region is determined;
Identity determination is performed for the following reason: the websites describe communities differently. After the region deduplication of step 1), each region alias corresponds to a unique region, so the communities in the same region can be obtained by querying on the region alias. The community information features include the community name, building area, total amount, property management company and the longitude and latitude of the community. The identity of communities can be determined from these features. Step 2) specifically includes the following sub-steps:
Step 2-1) A community duplicate-check table is established to store the IDs of communities judged to be the same community. The table has two fields: the match-source community ID and the match-target community ID;
The community ID is the ID of the community.
Step 2-2) According to the website priorities determined in step 1-2) and the region information table, communities with the same region alias are searched for in the community table and their features are compared pairwise. Let the communities to be compared be C1 and C2; the comparison includes the following sub-steps:
Step 2-2-1) The community location difference is calculated using formula (1):
community location difference = |longitude difference| + |latitude difference|   (1)
where |longitude difference| is the absolute value of the difference between the longitudes of C1 and C2, and |latitude difference| is the absolute value of the difference between the latitudes of C1 and C2;
If C1 and C2 have no longitude and latitude information, the community location difference is 0;
Step 2-2-2) The community area similarity is calculated using formula (2):
community area similarity = |1 - |community area difference| / Min(community area)|   (2)
where |community area difference| is the absolute value of the difference between the community areas of C1 and C2, and Min(community area) is the smaller of the community areas of C1 and C2 (the community area of C1 if the two are equal);
If there is no building area information, the community area similarity is 1;
Step 2-2-3) The community total amount similarity is calculated using formula (3):
community total amount similarity = |1 - |community total amount difference| / Min(community total amount)|   (3)
where |community total amount difference| is the absolute value of the difference between the total amounts of C1 and C2, and Min(community total amount) is the smaller of the total amounts of C1 and C2 (the total amount of C1 if the two are equal);
If the community has no total amount information, the community total amount similarity is 1;
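Formulas (1) to (3) can be sketched as two small helpers. This is a hedged illustration: the function names are introduced here, missing values are modelled as `None`, a value missing on either side is treated as "no information" (the source does not say whether one or both must be missing), and the zero-denominator branch is an assumption because the source does not cover it.

```python
from typing import Optional


def position_difference(lon1: Optional[float], lat1: Optional[float],
                        lon2: Optional[float], lat2: Optional[float]) -> float:
    """Formula (1): |longitude difference| + |latitude difference|."""
    if None in (lon1, lat1, lon2, lat2):
        return 0.0  # no longitude/latitude information: the difference is 0
    return abs(lon1 - lon2) + abs(lat1 - lat2)


def relative_similarity(v1: Optional[float], v2: Optional[float]) -> float:
    """Formulas (2) and (3): |1 - |v1 - v2| / min(v1, v2)|."""
    if v1 is None or v2 is None:
        return 1.0  # missing information: the similarity is 1
    smaller = min(v1, v2)
    if smaller == 0:
        # Assumption: the source does not define the zero-denominator case.
        return 1.0 if v1 == v2 else 0.0
    return abs(1 - abs(v1 - v2) / smaller)
```

Because of the outer absolute value, the result rises again once the difference exceeds the smaller value (inputs of 100 and 300 give 1), so a caller may want to check that case separately.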
Step 2-2-4) The community name similarity is calculated using formula (4):
community name similarity = 1 - (eDistance / maxlength(community name))   (4)
where eDistance is the string edit distance between the community names of C1 and C2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the community name of C1 into the community name of C2; maxlength(community name) is the larger of the lengths of the community names of C1 and C2 (the length of the community name of C1 if the two are equal).
Step 2-2-5) The property management company name similarity is calculated using formula (5):
property management company name similarity = 1 - (eDistance / maxlength(property management company name))   (5)
where eDistance is the string edit distance between the property management company names of C1 and C2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the company name of C1 into the company name of C2; maxlength(property management company name) is the larger of the lengths of the company names of C1 and C2 (the length of the company name of C1 if the two are equal).
Step 2-3) The identity of the communities is judged from the feature similarities calculated in step 2-2). If the community location difference is less than the set threshold T1, the area similarity and the total amount similarity are greater than the set threshold T2, and the similarities of the community name and of the property management company name are greater than the set threshold T3, the two are judged to be the same community;
The threshold T1 is chosen as 0.02, on the basis that a difference of 0.01 degrees in longitude or latitude corresponds to a distance of about 1000 metres; the threshold T2 is chosen as 0.95; the threshold T3 is chosen as 0.9;
Step 2-4) If C1 and C2 are judged to be the same community in step 2-3), they are recorded in the community duplicate-check table as follows: obtain the source website priority λ1 of C1 and the source website priority λ2 of C2; if λ1 is greater than λ2, C2 is stored as the match source and C1 as the match target; otherwise, C1 is stored as the match source and C2 as the match target;
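Steps 2-3) and 2-4) reduce to a threshold test followed by a priority comparison. The sketch below takes the five feature values as already computed; the function names and the tuple used for a duplicate-check record are assumptions made for illustration.

```python
T1, T2, T3 = 0.02, 0.95, 0.9  # threshold values stated in step 2-3)


def is_same_community(pos_diff: float, area_sim: float, total_sim: float,
                      name_sim: float, company_sim: float) -> bool:
    """Step 2-3): every feature must pass its threshold."""
    return (pos_diff < T1
            and area_sim > T2 and total_sim > T2
            and name_sim > T3 and company_sim > T3)


def duplicate_record(c1_id, prio1: int, c2_id, prio2: int):
    """Step 2-4): return (match_source_id, match_target_id).

    A smaller priority number means a higher-priority website, so the
    community from the higher-priority site becomes the match source.
    """
    if prio1 > prio2:
        return (c2_id, c1_id)
    return (c1_id, c2_id)
```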
Step 3) House deduplication, that is, determining the identity of houses;
Identity determination is performed for the following reason: the same house may be listed on different websites, and each website describes houses differently. After the community deduplication of step 2), the same communities from different websites can be clustered according to the information in the community duplicate-check table; the house lists of the same community originating from different websites are then queried according to the clustering result;
The house information features include the number of floors of the building in which the house is located, the floor on which the house is located, the listing price, the floor area, the layout structure, the floor plan and the orientation; the identity of houses is then determined from these features;
Step 3) specifically includes the following sub-steps:
Step 3-1) A house duplicate-check table is established with two fields: the match-source house ID and the match-target house ID;
The house ID is the ID of the house;
Step 3-2) According to the regions of the city, the communities of the same region are searched for, a community list cList is established, and the list is sorted by website priority;
Step 3-3) According to the information in the community duplicate-check table, the communities in cList are clustered, specifically:
Step 3-3A) An adjacency list G is constructed; each node of G represents one community in cList;
Step 3-3B) The community duplicate-check table is queried for the pairs of communities in cList that were judged to be the same community, giving a list rList; each element of rList contains two items: the match source and the match target;
Step 3-3C) rList is traversed, and for each element an edge is constructed in G between the communities corresponding to the match source and the match target;
Step 3-3D) G is traversed by depth-first search to generate a spanning forest F of G; each tree in F corresponds to a maximal connected subgraph (connected component) of G;
Each tree in F represents one community;
Step 3-4) Each tree Tree in the forest F generated in step 3-3) is traversed, and the house table is queried to obtain the house list of the communities represented by all nodes of Tree;
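The clustering of step 3-3) is a connected-components computation. A minimal sketch, assuming communities are identified by sortable IDs and rList is given as (match source, match target) pairs; the function name `cluster_communities` is introduced here for illustration:

```python
from collections import defaultdict


def cluster_communities(c_list, r_list):
    """Group community IDs into connected components with an iterative
    depth-first search over the adjacency list built from r_list."""
    adjacency = defaultdict(set)
    for source, target in r_list:      # step 3-3C): one edge per matched pair
        adjacency[source].add(target)
        adjacency[target].add(source)

    visited = set()
    forest = []
    for start in c_list:               # step 3-3D): DFS from each unvisited node
        if start in visited:
            continue
        component, stack = [], [start]
        visited.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in visited:
                    visited.add(neighbour)
                    stack.append(neighbour)
        forest.append(component)       # one component = one real community
    return forest
```

Each returned component plays the role of one tree in F: in step 3-4) the house table would be queried with all IDs of a component at once.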
Step 3-5) Houses are divided into five classes by number of rooms: 1 room, 2 rooms, 3 rooms, 4 rooms and more than 4 rooms, and a list is established for each class. The houses in the house list are added to the corresponding class list according to their number of rooms; the houses in each class list are traversed and their features are compared pairwise. Let the houses to be compared be H1 and H2, specifically:
Step 3-5-1) Judge whether the buildings in which the houses are located have the same number of floors. If not, the two houses are judged not to be the same and the judgment ends; if so, steps 3-5-2) to 3-5-4) are performed:
Step 3-5-2) Calculate the similarity of the floors on which the houses are located: when publishing house information, current websites do not publish the specific floor but only describe the approximate floor, and the descriptions differ between websites;
The calculation is as follows: synonyms for each floor are established from the floor descriptions of each website; when the floors of two houses are compared, the similarity is 1 if the floors belong to a pair of synonyms and 0 otherwise;
Step 3-5-2) Calculate the area similarity using formula (6):
area similarity = |1 - |area difference| / Min(area)|   (6)
where |area difference| is the absolute value of the difference between the floor areas of H1 and H2, and Min(area) is the smaller of the floor areas of H1 and H2 (the floor area of H1 if the two are equal);
Step 3-5-3) Calculate the price similarity using formula (7):
price similarity = |1 - |price difference| / Min(price)|   (7)
where |price difference| is the absolute value of the difference between the listing prices of H1 and H2, and Min(price) is the smaller of the listing prices of H1 and H2 (the listing price of H1 if the two are equal);
Step 3-5-4) Calculate the orientation similarity using formula (8):
orientation similarity = 1 - (eDistance / maxlength(orientation))   (8)
where eDistance is the string edit distance between the orientations of H1 and H2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the orientation of H1 into the orientation of H2; maxlength(orientation) is the larger of the lengths of the orientation strings of H1 and H2 (the length of the orientation of H1 if the two are equal);
Step 3-6) If the floor similarity is 1, the area similarity and the price similarity are greater than the set threshold T4, and the orientation similarity is greater than the set threshold T5, the houses are regarded as similar houses;
The threshold T4 is chosen as 0.95 and the threshold T5 as 0.5;
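Steps 3-5-1) to 3-6) can be combined into one predicate. The sketch below is illustrative only: the `House` fields, the English floor labels and the synonym groups are hypothetical (the source says only that synonyms are built from each website's floor descriptions), and treating two identical floor labels as similar is an assumption.

```python
from dataclasses import dataclass

T4, T5 = 0.95, 0.5  # threshold values stated in step 3-6)

# Hypothetical synonym groups for approximate floor descriptions.
FLOOR_SYNONYMS = [{"low", "lower floors"}, {"middle", "mid floors"},
                  {"high", "upper floors"}]


@dataclass
class House:
    building_floors: int  # number of floors of the building
    floor_label: str      # approximate floor as described by the website
    area: float
    price: float
    orientation: str


def floor_similarity(a: str, b: str) -> int:
    if a == b:  # assumption: identical labels count as synonyms
        return 1
    return 1 if any(a in g and b in g for g in FLOOR_SYNONYMS) else 0


def relative_similarity(v1: float, v2: float) -> float:
    """Formulas (6) and (7); assumes both values are positive."""
    return abs(1 - abs(v1 - v2) / min(v1, v2))


def edit_distance(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (cs != ct)))
        prev = curr
    return prev[-1]


def orientation_similarity(a: str, b: str) -> float:
    """Formula (8); two empty strings are treated as identical (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def is_similar_house(h1: House, h2: House) -> bool:
    if h1.building_floors != h2.building_floors:  # step 3-5-1)
        return False
    return (floor_similarity(h1.floor_label, h2.floor_label) == 1
            and relative_similarity(h1.area, h2.area) > T4
            and relative_similarity(h1.price, h2.price) > T4
            and orientation_similarity(h1.orientation, h2.orientation) > T5)
```

Houses that pass this test are only candidates; the floor-plan comparison of step 3-7) makes the final decision.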
Step 3-7) For houses judged similar in step 3-6), the layout structure must be judged further, because in practice many houses are similar in floor, area, price and orientation. Current floor plans outline the walls of the house with thick black lines and also show furniture, ornaments and the like. Floor plans are therefore compared by extracting the wall information from each floor plan and comparing the wall distributions in the horizontal and vertical directions, specifically:
Step 3-7-1) The floor plans of H1 and H2 are downloaded from their URLs and saved; each saved floor plan is denoted image. The wall information of the floor plans of H1 and H2 is calculated separately by the method of steps 3-7-2) to 3-7-10);
Step 3-7-2) image is converted to a picture with a resolution of 100*100, denoted image100;
Step 3-7-3) image100 is converted to a grayscale image, denoted grayimage100;
Step 3-7-4) The gray-level histogram h of grayimage100 is calculated;
Step 3-7-5) The gray value representing the wall lines is extracted from the gray-level histogram, specifically: h is scanned from left to right to find the first gray value g whose difference from its neighbourhood in h exceeds the set threshold T6;
The threshold T6 is chosen as 50;
Step 3-7-6) grayimage100 is binarized using the gray value g obtained in step 3-7-5), specifically:
All pixels of grayimage100 are traversed. Let the gray value of a pixel be p(i, j); the absolute value of the difference between p(i, j) and g is calculated. If |p(i, j) - g| is greater than the set threshold T7, the gray value of the pixel is set to 255; otherwise it is set to 0;
The threshold T7 is chosen as 10;
Step 3-7-8) "Large-area" objects in the floor plan are removed, specifically:
The floor plan is scanned with a "rectangle" of fixed size and the number pcount of pixels with gray value 0 inside the rectangle is counted; if pcount exceeds the set threshold T8, the content is regarded as a "large-area" object and is set to 0;
The rectangle is 5*5 in size and the threshold T8 is chosen as 16;
Step 3-7-9) calculate floor plan in wall information, method particularly includes:
Grayscale image after scanning binaryzation by row calculates the pixel that gray scale is 0 in every row and counts, obtains gray scale in every row
For 0 picture element histogram rowHist;By the grayscale image after column scan binaryzation, the picture element that gray scale is 0 in each column is calculated
Number obtains the picture element histogram colHist that gray scale is 0 in each column;
Step 3-7-10) Scan rowHist and compute its peaks (salient points), each of which represents a wall of the floor plan, to obtain the floor-plan wall distribution sequence wallSeq1; scan colHist and compute its peaks to obtain the floor-plan wall distribution sequence wallSeq2;
Wherein, each element of wallSeq1 and wallSeq2 is a pair of the following form:
<location, length>
Wherein, location is the position of a peak in rowHist (colHist) and represents the position of the wall in the floor plan; length is the value of that peak in rowHist (colHist) and represents the length of the wall;
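The row and column scans of steps 3-7-9) and 3-7-10) can be illustrated with the sketch below. The function names are hypothetical, and the peak rule is an assumption: a position counts as a peak when its value is greater than its left neighbour and not smaller than its right neighbour, so a wall several pixels thick is reported once. The text itself does not define how peaks are detected.

```python
def zero_histograms(binary):
    """Return (rowHist, colHist): the number of 0-valued pixels in each row
    and in each column of a binarized image given as a 2-D list."""
    row_hist = [sum(1 for p in row if p == 0) for row in binary]
    col_hist = [sum(1 for p in col if p == 0) for col in zip(*binary)]
    return row_hist, col_hist

def wall_sequence(hist):
    """Return the <location, length> pairs of the peaks in hist.
    Assumed peak rule: hist[i] > left neighbour and hist[i] >= right
    neighbour, with positions outside the histogram treated as 0."""
    seq = []
    for i, v in enumerate(hist):
        left = hist[i - 1] if i > 0 else 0
        right = hist[i + 1] if i + 1 < len(hist) else 0
        if v > left and v >= right:
            seq.append((i, v))
    return seq
```

For a floor plan, wallSeq1 would be `wall_sequence(row_hist)` and wallSeq2 would be `wall_sequence(col_hist)`.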
Step 3-7-11) Compute the similarity between wallSeq1 of H1 and wallSeq1 of H2, specifically:
Step 3-7-11A) Let n1 be the number of pairs in H1wallSeq1 (the wallSeq1 of H1) and n2 the number of pairs in H2wallSeq1 (the wallSeq1 of H2); let i1 be the index of the pair of H1wallSeq1 currently being compared, initially 0, and i2 the index of the pair of H2wallSeq1 currently being compared, initially 0; let m be the number of identical elements in H1wallSeq1 and H2wallSeq1, initially 0;
Step 3-7-11B) Check i1 and i2: if i1 is less than n1 and i2 is less than n2, execute step 3-7-11C); otherwise execute step 3-7-11E);
Step 3-7-11C) Compute the wall position difference using formula (9):
diff_location = |H1wallSeq1[i1].location - H2wallSeq1[i2].location|  (9)
If diff_location is less than threshold T9, i1 and i2 are each increased by 1 and the wall length difference is computed according to formula (10); otherwise execute step 3-7-11D):
diff_length = |H1wallSeq1[i1].length - H2wallSeq1[i2].length|  (10)
If diff_length is less than threshold T10, m is increased by 1; execute step 3-7-11B);
Wherein, threshold T9 is chosen as 4, and threshold T10 is chosen as 8;
Step 3-7-11D) If H1wallSeq1[i1].location is less than H2wallSeq1[i2].location, i1 is increased by 1; otherwise i2 is increased by 1;
Execute step 3-7-11B);
Step 3-7-11E) Compute the similarity of H1wallSeq1 and H2wallSeq1 according to formula (11):
S1_1 = m / max(n1, n2)  (11)
Wherein max(n1, n2) denotes the larger of n1 and n2; if n1 equals n2, it is n1;
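The loop of steps 3-7-11A) to 3-7-11E) is a two-pointer walk over two sequences sorted by location. The sketch below is one possible reading: the function name is hypothetical, the length difference is assumed to be taken on the matched pair before the indices advance, and returning 0.0 for two empty sequences is an added assumption because the text does not cover that case.

```python
T9 = 4   # wall position tolerance, formula (9)
T10 = 8  # wall length tolerance, formula (10)

def wall_similarity(seq1, seq2, t9=T9, t10=T10):
    """Similarity m / max(n1, n2) of two wall sequences of (location, length)
    pairs, each sorted by location."""
    n1, n2 = len(seq1), len(seq2)
    if max(n1, n2) == 0:
        return 0.0  # assumption: undefined in the source text
    i1 = i2 = m = 0
    while i1 < n1 and i2 < n2:              # step 3-7-11B)
        loc1, len1 = seq1[i1]
        loc2, len2 = seq2[i2]
        if abs(loc1 - loc2) < t9:           # formula (9)
            if abs(len1 - len2) < t10:      # formula (10)
                m += 1
            i1 += 1
            i2 += 1
        elif loc1 < loc2:                   # step 3-7-11D)
            i1 += 1
        else:
            i2 += 1
    return m / max(n1, n2)                  # formula (11)
```

Because both sequences are produced by a left-to-right (or top-to-bottom) scan, they are already ordered by location, so no sorting step is needed before the comparison.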
Step 3-7-12) Using the method of step 3-7-11), compute the similarity S1_2 between wallSeq1 of H1 and wallSeq2 of H2, the similarity S2_1 between wallSeq2 of H1 and wallSeq1 of H2, and the similarity S2_2 between wallSeq2 of H1 and wallSeq2 of H2;
Step 3-7-13) Compute the floor-plan similarity of H1 and H2 using formulas (12) and (13):
S1 = max(S1_1, S1_2);  (12)
S2 = max(S2_1, S2_2);  (13)
Wherein, max(S1_1, S1_2) denotes the larger of S1_1 and S1_2, or S1_1 if they are equal; max(S2_1, S2_2) denotes the larger of S2_1 and S2_2, or S2_1 if they are equal;
Step 3-8) If S1 and S2 are both above the set threshold T11, the two houses are judged to be the same house and the house information is updated;
Wherein, threshold T11 is chosen as 0.8;
The specific update method is:
Let the source website priority of house H1 be λ1 and that of house H2 be λ2; if λ1 is less than λ2, H1 is saved in the house duplicate checking table as the match source and H2 as the match target; otherwise H2 is saved as the match source and H1 as the match target;
Step 3-9) Every house that appears in the match-target field of the house duplicate checking table is a duplicate house.
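The final decision of steps 3-7-13) to 3-9) can be sketched as below. The function names and the tuple layout of the duplicate checking table are hypothetical; only the thresholds and the priority rule come from the text.

```python
T11 = 0.8  # floor-plan similarity threshold from step 3-8)

def same_floor_plan(s1_1, s1_2, s2_1, s2_2, t11=T11):
    """Formulas (12) and (13): take the better match for each wall sequence
    of H1 and require both to exceed T11."""
    s1 = max(s1_1, s1_2)
    s2 = max(s2_1, s2_2)
    return s1 > t11 and s2 > t11

def duplicate_record(h1_id, priority1, h2_id, priority2):
    """Return (match_source, match_target). A smaller priority number means
    a higher-priority website, whose house becomes the match source."""
    if priority1 < priority2:
        return (h1_id, h2_id)
    return (h2_id, h1_id)

def duplicate_houses(records):
    """Step 3-9): every house appearing as a match target is a duplicate."""
    return {target for _, target in records}
```

Comparing each wall sequence of H1 against both sequences of H2 and keeping the maximum makes the check tolerant of a floor plan whose horizontal and vertical axes are swapped between two websites.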
Beneficial effect
Compared with the prior art, the house property data identity discrimination method for different data sources proposed in the present invention has the following beneficial effects:
1. The method can deduplicate house data from different websites, can accurately and efficiently discriminate the identity of house data from different information sources, and can effectively remove duplicate regions and cells; it thereby achieves effective integration of multi-source heterogeneous house data and provides "clean" and "tidy" data for real estate analysis;
2. The transaction information published by each real estate website reflects the current dynamics of the real estate market. With the method proposed in the present invention, the transaction data of multiple real estate websites can be effectively integrated so as to reflect the state of real estate transactions more faithfully, and more accurate market dynamics can be mined from the house data of multiple websites. The results can be used in fields such as government decision-making, enterprise operation and personal investment, promoting the development of real estate analysis technology and raising the level of real estate market monitoring and analysis.
Description of the drawings
Fig. 1 is the system architecture of the house property data identity discrimination method for different data sources of the present invention;
Fig. 2 is the system processing flow of the method;
Fig. 3 is the cell deduplication flow of step 2 of the method and of Embodiment 1;
Fig. 4 is the house deduplication flow chart of step 3 of the method and of Embodiment 1;
Fig. 5 is the cell clustering schematic diagram of step 3-3 of the method and of Embodiment 1;
Fig. 6 is the cell clustering result of step 3-3 of the method and of Embodiment 1;
Fig. 7 is the schematic diagram of classifying houses by number of rooms in step 3-5 of the method and in Embodiment 1.
Specific embodiment
In order to make the purpose, technical solution and advantages of the present invention clearer, the present invention is described in further detail below through specific embodiments in conjunction with the drawings. It should be understood that the specific embodiments described here only explain the present invention and are not intended to limit it.
Embodiment 1
This embodiment describes a specific implementation of the house property data identity discrimination method for different data sources of the present invention.
The implementation is illustrated by the system architecture in Fig. 1; Fig. 2 shows the system processing flow of the method. The present invention sits between the data collection system and the data analysis system and is an intermediate link in house property data processing. The data collection system collects real estate transaction data from each real estate transaction website, including cell data, urban area data, house data and house transaction data, and stores them in the real estate base database.
Using the method proposed by the present invention, the house data in the real estate base database are deduplicated and the processed data are stored in the real estate analysis database, on which the analysis platform performs data analysis, calculation and processing.
Table 1: house label list
Serial number | Field name | Type | Explanation |
1 | House_uid | Int | House identifier saved by the crawler system |
2 | city | Varchar(100) | City |
3 | district | Varchar(100) | Region |
4 | Community_uid | Int | Cell identifier saved by the crawler system |
5 | price | Float | Current listed price of the house |
6 | Price_unit | Varchar(50) | Unit of the price |
7 | Build_area | Float | Construction area of the house |
8 | Use_area | Float | Usable area of the house |
9 | Area_unit | Varchar(50) | Unit of area |
10 | Build_floor | Int | Total number of floors of the building containing the house |
11 | Current_floor | Varchar(50) | Floor on which the house is located |
12 | rooms | Int | Number of rooms in the house |
13 | halls | Int | Number of halls in the house |
14 | towards | Varchar(50) | Orientation of the house |
15 | repeated | Bit | Whether the house duplicates another house |
16 | checked | Bit | Whether the house has been checked |
Table 2: website source table
Serial number | Field name | Type | Explanation |
1 | Web site name | Varchar(50) | Website name |
2 | Website priority | Int | Website priority |
Table 3: urban area table
Table 4: cell table
Serial number | Field name | Type | Explanation |
1 | community_id | Int | Cell ID |
2 | Area_id | Int | Corresponding entry in the region table |
3 | Community_name | Varchar(100) | Cell name |
4 | latitude | Float | Latitude |
5 | longitude | Float | Longitude |
6 | repeated | Bit | Whether the cell duplicates another cell |
7 | checked | Bit | Whether the cell has been checked |
Table 5: cell match table
Table 6: house match table
First, the collected urban areas are deduplicated. The urban area information collected in the real estate base database is retrieved; any region not yet in the analysis database is added to it, and the area information is then processed according to step 1 of the present invention.
Different websites describe the same region of the same city differently. For example, the Lianjia website describes the regions of Shanghai as follows:
Jing'an, Xuhui, Huangpu, Changning ...
The 5i5j website describes the regions of Shanghai as follows:
Jing'an District, Xuhui District, Huangpu District, Changning District ...
Regions are discriminated by their names. An analysis of how websites such as Lianjia, 5i5j, Centaline and Maitian name regions, and of how people habitually refer to them, shows that a region name generally contains a "core word": some names use the "core word" directly, while others append a suffix such as "District" or "County" to it, as in "Jing'an" and "Jing'an District" above. Because region names are composed so regularly, they can be discriminated with a rule-based method.
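A rule of this kind can be sketched as follows. The suffix list and function names are illustrative assumptions; the actual decision rules written for each website are not given in the text.

```python
# Hypothetical suffix list; the Chinese suffixes correspond to
# "District" and "County" in the translated examples.
SUFFIXES = ("District", "County", "区", "县")

def core_word(name):
    """Strip one known administrative suffix to recover the 'core word'."""
    name = name.strip()
    for suffix in SUFFIXES:
        if name.endswith(suffix) and len(name) > len(suffix):
            return name[:-len(suffix)].strip()
    return name

def same_region(name_a, name_b):
    """Two region names match when their core words are identical."""
    return core_word(name_a) == core_word(name_b)
```

For example, `same_region("Jing'an", "Jing'an District")` is true under this rule, while "Xuhui District" and "Huangpu District" remain distinct.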
Second, the collected cells are deduplicated. The cell information collected in the real estate base database is retrieved; any cell not yet in the analysis database is added to it.
Real estate websites also differ in how they describe the name and property management company of the same cell. For example, the Lianjia website calls a cell near Chongwenmen in Dongcheng District, Beijing "Fahua Nanli", whereas the Maitian website calls it "Fahuasi Nanli". Cell names and property management company names are therefore discriminated with a string similarity method.
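The string similarity used for names is one minus the edit distance divided by the longer string length. A minimal sketch follows; the function names are hypothetical, and returning 1.0 for two empty strings is an added assumption, since that case is not covered by the text.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, modifications and
    deletions needed to turn string a into string b (Levenshtein distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # modify or keep
        prev = cur
    return prev[-1]

def name_similarity(a, b):
    """1 - eDistance / maxlength, in the range 0..1."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty names are treated as identical
    return 1 - edit_distance(a, b) / longest
```

Because the distance is normalised by the longer name, one extra character in a short name lowers the similarity more than the same change in a long name.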
The flow chart of cell deduplication is shown in Fig. 3.
The regions are queried in groups by the District_alias and province fields of Table 3 (urban area table) to obtain the urban area list. The urban area list is traversed and, for each region name in the District_alias field, the cell list ClistAll of that region is queried. The checked field of each cell is then examined: 1 means the cell has been checked and 0 means it has not. The cells whose checked value is 0 (cells not yet deduplicated) form the list ClistUncheck. The cells in ClistAll and ClistUncheck are compared pairwise for similarity using the method of step 2 of the present invention. If two cells are similar, their information is written to Table 5 (cell match table).
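The pairwise cell comparison of step 2 combines the position difference of formula (1), the ratio similarities of formulas (2) and (3), and the two name similarities against thresholds T1, T2 and T3. The sketch below assumes each cell is a dict with hypothetical keys `longitude`, `latitude`, `area` and `total`, that missing values are `None`, and that the two name similarities are computed elsewhere and passed in.

```python
T1, T2, T3 = 0.02, 0.95, 0.9  # thresholds given for step 2-3)

def position_difference(c1, c2):
    """Formula (1); 0 when coordinates are missing (assumed: on either cell)."""
    coords = (c1.get("longitude"), c1.get("latitude"),
              c2.get("longitude"), c2.get("latitude"))
    if None in coords:
        return 0.0
    return (abs(c1["longitude"] - c2["longitude"])
            + abs(c1["latitude"] - c2["latitude"]))

def ratio_similarity(a, b):
    """Formulas (2) and (3): |1 - |a - b| / min(a, b)|; 1 when data is missing.
    A zero minimum is also treated as missing, which is an added assumption."""
    if a is None or b is None or min(a, b) == 0:
        return 1.0
    return abs(1 - abs(a - b) / min(a, b))

def same_cell(c1, c2, name_sim, company_sim):
    """Step 2-3): all five features must pass their thresholds."""
    return (position_difference(c1, c2) < T1
            and ratio_similarity(c1.get("area"), c2.get("area")) > T2
            and ratio_similarity(c1.get("total"), c2.get("total")) > T2
            and name_sim > T3
            and company_sim > T3)
```

Note that the sketch reproduces the outer absolute value of formulas (2) and (3) as written, so two values that differ by more than about twice the smaller one also score above T2.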
Finally, the house information is deduplicated. House information is queried from Table 1 (house label list) and deduplicated cell by cell, as shown in Fig. 4.
The regions are queried in groups by the District_alias and province fields of Table 3 (urban area table) to obtain the region list. The region list is traversed and, for each region name in the District_alias field, the cell list ClistAll of that region is queried. After cell deduplication, the match source src_uid and match target target_uid recorded in the cell match table form an undirected graph, as shown in Fig. 5. Each node of the graph corresponds to a cell, and an edge indicates that two cells are the same cell. Using the method of step 3 of the present invention, the cells matched as identical are found and clustered. The clustering result, shown in Fig. 6, forms the cluster list ClistCluster, in which each element represents one cluster of identical cells.
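The clustering amounts to finding the connected components of the undirected graph built from the match pairs. A minimal sketch with an iterative depth-first search follows; the function name and the input shapes (a list of cell IDs and a list of (src_uid, target_uid) pairs) are assumptions for illustration.

```python
def cluster_cells(cell_ids, match_pairs):
    """Group cell IDs into clusters of identical cells.
    cell_ids: IDs of the cells in the region (ClistAll).
    match_pairs: (src_uid, target_uid) pairs from the cell match table."""
    adjacency = {c: [] for c in cell_ids}
    for src, target in match_pairs:
        if src in adjacency and target in adjacency:
            adjacency[src].append(target)
            adjacency[target].append(src)

    seen = set()
    clusters = []
    for start in cell_ids:
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack = [start]
        seen.add(start)
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        clusters.append(sorted(component))
    return clusters
```

A cell that matched nothing forms a cluster of its own, so every cell in the region appears in exactly one element of the result.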
ClistCluster is traversed and the house information of each cluster is looked up to obtain the list HlistAll of all houses in the cell. The checked field of each house is examined: 1 means checked and 0 means not checked; the houses whose checked value is 0 form the list HlistUncheck. To improve efficiency and reduce the number of comparisons, the houses in HlistAll and HlistUncheck are divided into 5 classes by number of rooms, as shown in the classification schematic of Fig. 7. Using the house deduplication algorithm of step 3 of the present invention, houses in HlistAll and HlistUncheck with the same number of rooms are compared pairwise.
If two compared houses are similar, the matched houses are written to the house match table.
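Classifying houses by number of rooms limits the pairwise comparison to houses in the same class. The sketch below assumes each house is a dict with hypothetical keys `house_uid` and `rooms` mirroring Table 1; the function names are illustrative.

```python
def room_class(rooms):
    """Classes 1, 2, 3 and 4 rooms, with 5 standing for 'more than 4 rooms'."""
    return min(rooms, 5)

def candidate_pairs(all_houses, unchecked):
    """Pair each unchecked house with every other house of the same class."""
    buckets = {}
    for house in all_houses:
        buckets.setdefault(room_class(house["rooms"]), []).append(house)

    pairs = []
    for u in unchecked:
        for h in buckets.get(room_class(u["rooms"]), []):
            if h["house_uid"] != u["house_uid"]:
                pairs.append((u["house_uid"], h["house_uid"]))
    return pairs
```

Only the pairs returned here would need the more expensive feature and floor-plan comparison.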
After this processing, the duplicate house data have been removed, so the data can provide "clean" and "tidy" support for subsequent real estate analysis. The processed data are also kept separate from the collected raw data, which preserves the modularity of the analysis system and the collection system and improves the stability and independence of the real estate analysis system.
The above is a preferred embodiment of the present invention, and the present invention should not be limited to the content disclosed in the embodiment and the drawings. Any equivalent or modification completed without departing from the spirit disclosed by the present invention falls within the scope of protection of the present invention.
Claims (4)
1. A house property data identity discrimination method for different data sources, characterized in that:
The method is based on the house data published by existing multi-source, heterogeneous real estate transaction websites; it analyzes house features, discriminates the identity of houses and rejects duplicate houses;
Specifically, it is based on the house data published by the Lianjia, 5i5j, Centaline and Maitian websites and the correlations among them; by analyzing the characteristics of the house data, duplicate house data are rejected through three major steps: region deduplication, cell deduplication and house deduplication;
Wherein, the characteristic of house data is that they describe actual house objects; although the angle and manner of description differ, the data are strongly correlated; when house data published by different websites describe the same house, the following correlations hold even though the descriptions are not identical:
A. The house address is the same: the cell containing the house is the same cell and the building containing it is the same building; because few websites publish the building in which a house is located, this can be judged from the total number of floors of the building;
B. The basic attributes of the house are the same: the basic attributes include floor area, house type and orientation;
C. The owner's expectation is the same: the owner's price expectation is consistent when renting out or selling the house, so it can be judged from the listed price;
The house property data identity discrimination method includes the following steps:
Step 1) Region deduplication, addressing the fact that different websites describe the same region of the same city differently;
Regions are discriminated by their names; an analysis of how house transaction websites name regions, and of how people habitually refer to them, shows that a region name generally contains a "core word": some names use the "core word" directly, while others append a suffix to it;
Step 1) further includes the following sub-steps:
Step 1-1) To simplify subsequent statistics and analysis, establish a zone information table with fields for the city and region name collected from the website, the alias of the region, and the alias source priority;
The alias field identifies the region: when operating on regions, two regions are judged to be the same region if their aliases are identical; the alias source priority is used to decide whether the alias field should be updated;
Step 1-2) Determine the priority of each real estate website;
Wherein, the website source priorities are 0, 1 and 2; the smaller the number, the higher the priority;
Step 1-3) According to the real estate website priorities set in step 1-2), query all regions of the same city; compare the resulting region list pairwise by judging whether the region names are similar, and update the region aliases according to the result, specifically:
Whether region names are similar is judged with a rule-based method: decision rules are written according to how each website names regions and are then applied to the region names to be judged; if the names of two regions are judged identical, the region aliases are updated according to the following rules:
Let the regions currently compared be A and B. If the names of region A and region B match, check whether the aliases of A and B have been assigned: if neither alias has been assigned, perform operation 1-3A); if the alias of A has been assigned and that of B has not, perform operation 1-3B); if the alias of A has not been assigned and that of B has, perform operation 1-3C); if both aliases have been assigned, perform operation 1-3D);
1-3A) Obtain the priority λa of the source website of region A and the priority λb of the source website of region B; if λa is greater than λb, store the name of region B as the alias of region A and λb as the alias source priority of region A in the zone information table, and skip to step 1-4); otherwise, store the name of region A as the alias of region B and λa as the alias source priority of region B in the zone information table, and skip to step 1-4);
1-3B) Obtain the alias source priority λa' of region A and the priority λb of the source website of region B; if λb is greater than λa', take the alias of A as the alias of B and λa' as the alias source priority of B, and skip to step 1-4); otherwise, update the alias of A to the name of B and the alias source priority of region A to λb, and skip to step 1-4);
1-3C) Obtain the alias source priority λb' of region B and the priority λa of the source website of region A; if λa is greater than λb', take the alias of B as the alias of A and λb' as the alias source priority of A, and skip to step 1-4); otherwise, update the alias of B to the name of A and the alias source priority of region B to λa, and skip to step 1-4);
1-3D) Obtain the alias source priority λb' of region B and the alias source priority λa' of region A; if λa' is greater than λb', update the alias of A to the alias of B and the alias source priority of A to λb', and skip to step 1-4); otherwise, update the alias of B to the name of A and the alias source priority of region B to λa', and skip to step 1-4);
Step 1-4) After the region names have been compared and matched in step 1-3), for every region whose alias field is NULL, set the alias to the region name;
Step 1-5) Use the region alias in the zone information table as the basis for querying and counting urban areas;
Step 2) Cell deduplication: discriminate the identity of the cells within an urban area, specifically including the following sub-steps:
Step 2-1) Establish a cell duplicate checking table to save the identifiers of cells judged to be the same cell, with two fields: match source cell identifier and match target cell identifier;
Wherein, the cell identifier is the ID of the cell;
Step 2-2) According to the website priorities determined in step 1-2) and the zone information table, search the cell table for cells with the same region alias and compare their features pairwise; let the cells to be compared be C1 and C2; specifically including the following sub-steps:
Step 2-2-1) Compute the cell position difference using formula (1):
cell position difference = |cell longitude difference| + |cell latitude difference|  (1)
Wherein, |cell longitude difference| is the absolute difference between the longitudes of C1 and C2, and |cell latitude difference| is the absolute difference between the latitudes of C1 and C2;
If C1 and C2 have no longitude and latitude information, the cell position difference is 0;
Step 2-2-2) Compute the plot area similarity using formula (2):
plot area similarity = |1 - |plot area difference| / Min(plot area)|  (2)
Wherein, |plot area difference| is the absolute difference between the plot areas of C1 and C2; Min(plot area) is the smaller of the two plot areas or, if they are equal, the plot area of C1;
If there is no construction area information, the plot area similarity is 1;
Step 2-2-3) Compute the cell total amount similarity using formula (3):
cell total amount similarity = |1 - |cell total amount difference| / Min(cell total amount)|  (3)
Wherein, |cell total amount difference| is the absolute difference between the cell total amounts of C1 and C2; Min(cell total amount) is the smaller of the two cell total amounts or, if they are equal, the cell total amount of C1;
If the cells have no total amount information, the cell total amount similarity is 1;
Step 2-2-4) Compute the cell name similarity using formula (4):
cell name similarity = 1 - (eDistance / maxlength(cell name))  (4)
Wherein, eDistance is the string edit distance between the cell names of C1 and C2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the cell name of C1 into the cell name of C2; maxlength(cell name) is the larger of the lengths of the two cell names or, if they are equal, the length of the cell name of C1;
Step 2-2-5) Compute the similarity of the cell's property management company name using formula (5):
property management company name similarity = 1 - (eDistance / maxlength(property management company name))  (5)
Wherein, eDistance is the string edit distance between the property management company names of C1 and C2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the property management company name of C1 into that of C2; maxlength(property management company name) is the larger of the lengths of the two company names or, if they are equal, the length of the company name of C1;
Step 2-3) Judge the identity of the cells from the feature similarities computed in step 2-2): if the cell position difference is less than the set threshold T1, the area similarity and total amount similarity are greater than the set threshold T2, and the similarities of the cell name and property management company name are greater than the set threshold T3, the two cells are judged to be the same cell;
Wherein, threshold T1 is chosen on the basis that a difference of 0.01 degree in longitude or latitude corresponds to a distance of about 1000 meters, and is set to 0.02; threshold T2 is chosen as 0.95; threshold T3 is chosen as 0.9;
Step 2-4) If cells C1 and C2 are judged to be the same cell in step 2-3), record C1 and C2 in the cell duplicate checking table, specifically: obtain the source website priority λ1 of cell C1 and the source website priority λ2 of cell C2; if λ1 is greater than λ2, save C2 as the match source and C1 as the match target in the cell duplicate checking table; otherwise save C1 as the match source and C2 as the match target in the cell duplicate checking table;
Step 3) House deduplication, that is, discriminating the identity of houses;
Wherein, the house information features include the total number of floors of the building, the floor of the house, the listed price, the construction area, the house type structure, the floor plan and the orientation; the identity of houses is discriminated from these features;
Step 3) specifically includes the following sub-steps:
Step 3-1) Establish a house duplicate checking table with two fields: match source house identifier and match target house identifier;
Wherein, the house identifier is the ID of the house;
Step 3-2) For each region of the city, search for the cells of that region, establish the cell list cList and sort it by website priority;
Step 3-3) Cluster the cells in the cell list cList according to the information in the cell duplicate checking table, specifically:
Step 3-3A) Construct an adjacency list G in which each node represents one cell of cList;
Step 3-3B) Query the cell duplicate checking table for the cells of cList that have been identified as the same cell, obtaining the list rList; each element of rList contains two items: the match source and the match target;
Step 3-3C) Traverse rList and, for the cells corresponding to each match source and match target, construct an edge in the adjacency list G;
Step 3-3D) Traverse the adjacency list G with depth-first search to generate the spanning forest F of G; each tree in F is a maximal connected subgraph of G;
Wherein, each tree in F represents one cell;
Step 3-4) Traverse each tree Tree of the forest F generated in step 3-3) and query the house table for the list of houses in the cells represented by all nodes of Tree;
Step 3-5) Divide houses into 5 classes by number of rooms: 1 room, 2 rooms, 3 rooms, 4 rooms and more than 4 rooms, and establish a list for each class; add each house of the house list to the list for its number of rooms, traverse the houses in each class list and compare their features pairwise; let the houses to be compared be H1 and H2; specifically:
Step 3-5-1) Judge whether the total numbers of floors of the buildings are the same; if not, the two houses are judged to be different and the judgment ends; if they are the same, execute steps 3-5-2) to 3-5-4):
Step 3-5-2) Compute the similarity of the floor on which the house is located: when publishing house information, current websites do not publish the specific floor but only describe the approximate floor of the house, and the descriptions differ between websites;
The specific calculation method is: establish synonyms for each floor according to how each website describes floors; when comparing the floors of two houses, the similarity is 1 if the floors belong to a pair of synonyms and 0 otherwise;
Step 3-5-2) Compute the area similarity using formula (6):
area similarity = |1 - |area difference| / Min(area)|  (6)
Wherein, |area difference| is the absolute difference between the floor areas of H1 and H2; Min(area) is the smaller of the two floor areas or, if they are equal, the floor area of H1;
Step 3-5-3) Compute the price similarity using formula (7):
price similarity = |1 - |price difference| / Min(price)|  (7)
Wherein, |price difference| is the absolute difference between the listed prices of H1 and H2; Min(price) is the smaller of the two listed prices or, if they are equal, the listed price of H1;
Step 3-5-4) Compute the orientation similarity using formula (8):
orientation similarity = 1 - (eDistance / maxlength(orientation))  (8)
Wherein, eDistance is the string edit distance between the orientations of H1 and H2, that is, the minimum number of single-character insertions, modifications and deletions needed to convert the orientation of H1 into the orientation of H2; maxlength(orientation) is the larger of the lengths of the two orientation strings or, if they are equal, the length of the orientation of H1;
Step 3-6) If the floor similarity is 1, the area similarity and price similarity are greater than the set threshold T4, and the orientation similarity is greater than the set threshold T5, the houses are regarded as similar houses;
Wherein, threshold T4 is chosen as 0.95, and threshold T5 is chosen as 0.5;
Step 3-7) For houses judged similar in step 3-6): because actual house information contains many houses whose floor, area, price and orientation are all similar, the layout structure of the houses must be judged further; current floor plans outline the walls of the house with thick black lines and also show furniture, ornaments and the like, so floor plans are compared by extracting the wall information from each floor plan and comparing the wall distributions in the horizontal and vertical directions, specifically:
Step 3-7-1) Download and save the floor plans of houses H1 and H2 from their URLs, denoting each as image, and compute the wall information of the floor plans of H1 and H2 separately using the method of steps 3-7-2) to 3-7-10);
Step 3-7-2) by image be converted into resolution ratio be 100*100 picture, be set as image100;
Step 3-7-3) image100 is converted into grayscale image, it is set as grayimage100;
Step 3-7-4) calculate grayimage100 grey level histogram h;
Step 3-7-5) color value for indicating wall line is extracted from grey level histogram, specifically: histogram is scanned from left to right
H, the difference for finding first and adjacent domain in histogram h is more than the gray value g of given threshold T6;
Wherein, threshold value T6 is chosen for 50;
Step 3-7-6) Binarize the grayscale image grayimage100 using the gray value g obtained in step 3-7-5), specifically:
Traverse all pixels of grayimage100; let the gray value of a pixel be p(i, j) and compute the absolute value of the difference between p(i, j) and g; if |p(i, j) - g| is greater than the given threshold T7, set the gray value of the pixel to 255, otherwise set it to 0;
Wherein, threshold T7 is chosen as 10;
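The binarization rule can be written directly from the description. A minimal sketch, assuming the grayscale image is a list of rows of integers; the function name `binarize` is hypothetical.

```python
def binarize(gray, g, t7=10):
    """Map pixels close to the wall gray value g to 0 and all others to 255.

    A pixel p becomes 255 when |p - g| > t7, otherwise 0, so wall-colored
    pixels end up as 0 in the result.
    """
    return [[255 if abs(p - g) > t7 else 0 for p in row] for row in gray]
```

For example, with g = 20 and T7 = 10, a pixel of gray value 25 becomes 0 and a pixel of gray value 31 becomes 255.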
Step 3-7-8) Remove "large-area" objects from the floor plan, specifically:
Scan the floor plan with a rectangle of a fixed size and count the pixels pcount in the rectangle whose gray value is 0; if pcount exceeds the given threshold T8, the region is regarded as a "large-area" object and is set to 0;
Wherein, the size of the rectangle is 5*5 and threshold T8 is chosen as 16;
Step 3-7-9) Compute the wall information in the floor plan, specifically:
Scan the binarized image row by row and count the pixels with gray value 0 in each row, obtaining the per-row histogram rowHist of pixels with gray value 0; scan the binarized image column by column and count the pixels with gray value 0 in each column, obtaining the per-column histogram colHist of pixels with gray value 0;
Step 3-7-10) Scan rowHist and find its peaks (salient points), each of which represents a wall of the floor plan, obtaining the floor-plan wall distribution sequence wallSeq1; scan colHist and find its peaks, obtaining the floor-plan wall distribution sequence wallSeq2;
Wherein, each element of wallSeq1 and wallSeq2 is a sequence pair of the form:
<location, length>
Wherein, location is the position of a peak in rowHist (colHist) and represents the position of the wall in the floor plan, and length is the value of that peak in rowHist (colHist) and represents the length of the wall;
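Steps 3-7-9) and 3-7-10) can be sketched as a projection followed by peak detection. The text does not say how a peak is detected; the sketch assumes a peak is a bin strictly higher than its left neighbor and at least as high as its right neighbor, with bins outside the histogram counted as 0. The function names are hypothetical.

```python
def projection_histograms(binary):
    """Count 0-valued (wall-colored) pixels per row and per column."""
    row_hist = [sum(1 for p in row if p == 0) for row in binary]
    col_hist = [sum(1 for p in col if p == 0) for col in zip(*binary)]
    return row_hist, col_hist


def wall_sequence(hist):
    """Return (location, length) pairs for the peaks of a projection histogram.

    Assumption (not stated in the text): a peak is a bin strictly higher
    than its left neighbor and at least as high as its right neighbor;
    positions outside the histogram are treated as 0. A flat plateau
    therefore yields one peak, at its leftmost position.
    """
    padded = [0] + list(hist) + [0]
    seq = []
    for i in range(1, len(padded) - 1):
        if padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]:
            seq.append((i - 1, padded[i]))  # (location, length)
    return seq
```

Applying `wall_sequence` to rowHist gives wallSeq1, and applying it to colHist gives wallSeq2.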
Step 3-7-11) Compute the similarity between the wallSeq1 of H1 and the wallSeq1 of H2, specifically:
Step 3-7-11A) Let n1 be the number of sequence pairs in H1wallSeq1 (the wallSeq1 of H1) and n2 the number of sequence pairs in H2wallSeq1 (the wallSeq1 of H2); let i1 be the position of the sequence pair of H1wallSeq1 currently being compared, initially 0, and i2 the position of the sequence pair of H2wallSeq1 currently being compared, initially 0; let m be the number of identical elements in H1wallSeq1 and H2wallSeq1, initially 0;
Step 3-7-11B) Check i1 and i2: if i1 is less than n1 and i2 is less than n2, execute step 3-7-11C); otherwise execute step 3-7-11E);
Step 3-7-11C) Compute the wall position difference by formula (9):
diff_location = |H1wallSeq1[i1].location - H2wallSeq1[i2].location| (9)
If diff_location is less than threshold T9, i1 is incremented by 1, i2 is incremented by 1, and the wall length difference is computed by formula (10); otherwise execute step 3-7-11D):
diff_length = |H1wallSeq1[i1].length - H2wallSeq1[i2].length| (10)
If diff_length is less than threshold T10, m is incremented by 1; execute step 3-7-11B);
Wherein, threshold T9 is chosen as 4 and threshold T10 is chosen as 8;
Step 3-7-11D) If H1wallSeq1[i1].location is less than H2wallSeq1[i2].location, i1 is incremented by 1; otherwise i2 is incremented by 1;
Execute step 3-7-11B);
Step 3-7-11E) Compute the similarity of H1wallSeq1 and H2wallSeq1 by formula (11):
S1_1 = m / max(n1, n2) (11)
Wherein, max(n1, n2) denotes the larger of n1 and n2; if n1 equals n2, it is n1;
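The loop of steps 3-7-11A) to 3-7-11E) amounts to a two-pointer walk over two wall sequences sorted by location. The sketch below is one reading of it, under two assumptions that the text leaves open: the length difference of formula (10) is taken for the pair that just matched on position (computed before the indices advance), and two empty sequences are given similarity 0. The function name is hypothetical.

```python
def wall_seq_similarity(seq_a, seq_b, t9=4, t10=8):
    """Similarity of two wall sequences of (location, length) pairs.

    Assumptions: sequences are ordered by location; the length check
    applies to the pair matched on position; empty input yields 0.0.
    """
    n1, n2 = len(seq_a), len(seq_b)
    if max(n1, n2) == 0:
        return 0.0
    i1 = i2 = m = 0
    while i1 < n1 and i2 < n2:                  # step 3-7-11B
        loc_a, len_a = seq_a[i1]
        loc_b, len_b = seq_b[i2]
        if abs(loc_a - loc_b) < t9:             # formula (9)
            if abs(len_a - len_b) < t10:        # formula (10)
                m += 1
            i1 += 1
            i2 += 1
        elif loc_a < loc_b:                     # step 3-7-11D
            i1 += 1
        else:
            i2 += 1
    return m / max(n1, n2)                      # formula (11)
```

Dividing by the larger of n1 and n2 means that a floor plan with many unmatched walls lowers the score even when every wall of the shorter sequence is matched.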
Step 3-7-12) By the method of step 3-7-11), compute the similarity S1_2 between the wallSeq1 of H1 and the wallSeq2 of H2, the similarity S2_1 between the wallSeq2 of H1 and the wallSeq1 of H2, and the similarity S2_2 between the wallSeq2 of H1 and the wallSeq2 of H2;
Step 3-7-13) Compute the floor-plan similarity of H1 and H2 by formulas (12) and (13):
S1 = max(S1_1, S1_2); (12)
S2 = max(S2_1, S2_2); (13)
Wherein, max(S1_1, S1_2) denotes the larger of S1_1 and S1_2 (S1_1 if they are equal), and max(S2_1, S2_2) denotes the larger of S2_1 and S2_2 (S2_1 if they are equal);
Step 3-8) If S1 and S2 are both above the given threshold T11, the two houses are determined to be the same house and the house information is updated; threshold T11 is chosen as 0.8;
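Formulas (12) and (13) and the decision of step 3-8) combine into a short check. A minimal sketch, assuming the four pairwise similarities have already been computed and reading "above" as strictly greater; the function name is hypothetical.

```python
def is_same_house(s1_1, s1_2, s2_1, s2_2, t11=0.8):
    """Decide whether two floor plans belong to the same house."""
    s1 = max(s1_1, s1_2)  # formula (12)
    s2 = max(s2_1, s2_2)  # formula (13)
    # Assumption: "above T11" is interpreted as strictly greater than T11.
    return s1 > t11 and s2 > t11
```

Taking the maximum over both pairings lets the comparison tolerate a floor plan whose row and column wall sequences are swapped relative to the other.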
The specific update method is:
Let the priority of the source website of house H1 be λ1 and the priority of the source website of house H2 be λ2; if λ1 is less than λ2, H1 is taken as the match source and H2 as the match target, and the pair is saved in the house duplicate checking table; otherwise H2 is taken as the match source and H1 as the match target, and the pair is saved in the house duplicate checking table;
Step 3-9) Every house that appears in the match target field of the house duplicate checking table is a duplicate house.
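The update rule and step 3-9) can be illustrated with an in-memory table. This is a sketch only: the table is modeled as a list of dictionaries, and the field names `match_source` and `match_target`, the function names, and the `priority` mapping from house identifier to website priority are all hypothetical.

```python
def record_duplicate(table, h1, h2, priority):
    """Append one row for two houses that were judged to be the same house.

    `priority` maps a house identifier to the priority value of its source
    website. The house with the smaller value becomes the match source.
    """
    if priority[h1] < priority[h2]:
        table.append({"match_source": h1, "match_target": h2})
    else:
        table.append({"match_source": h2, "match_target": h1})


def duplicate_houses(table):
    """Step 3-9: every house in the match target field is a duplicate."""
    return {row["match_target"] for row in table}
```

For example, recording the pairs (A, B) and (C, A) with priorities A = 1, B = 2, C = 3 keeps A as the match source in both rows and reports B and C as duplicates.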
2. The method for judging the identity of real estate data from different information sources according to claim 1, characterized in that the reason for performing region discrimination in step 1) is: the data of every house trading website adopt a hierarchical structure of "city -> region -> cell -> house"; to determine the position of a house, it must be determined whether the cells containing the houses are the same cell, and when determining a cell, it is first determined whether the urban regions containing the cells are the same region, which improves the accuracy and efficiency of the discrimination.
3. The method for judging the identity of real estate data from different information sources according to claim 1, characterized in that the reason for performing cell identity discrimination in step 2) is: websites differ in how they describe a cell; after the region deduplication of step 1) is complete, the aliases of each region correspond to a unique region, so the cells in the same region can be obtained by querying on the region aliases;
Wherein, the cell information features include the name of the cell, its construction area, total amount, property management company, and longitude and latitude; the identity of cells is discriminated according to these features.
4. The method for judging the identity of real estate data from different information sources according to claim 1, characterized in that the reason for performing house identity discrimination in step 3) is: the same house may be listed on different websites, and websites differ in how they describe a house; after the cell deduplication of step 2) is complete, the same cells from different websites are clustered according to the information in the cell duplicate checking table; the house listings of the same cell from different websites are then queried according to the clustering result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910242011.3A CN109977287B (en) | 2019-03-28 | 2019-03-28 | Method for judging identity of real estate data of different information sources |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109977287A (en) | 2019-07-05 |
CN109977287B (en) | 2021-02-02 |
Family
ID=67081085
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910242011.3A (Expired - Fee Related), CN109977287B (en) | 2019-03-28 | 2019-03-28 | Method for judging identity of real estate data of different information sources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109977287B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111125054A (en) * | 2019-11-21 | 2020-05-08 | 青岛聚好联科技有限公司 | Method and device for community data migration |
CN111259966A (en) * | 2020-01-17 | 2020-06-09 | 青梧桐有限责任公司 | Method and system for identifying homonymous cell with multi-feature fusion |
CN111260445A (en) * | 2020-01-20 | 2020-06-09 | 北京无限光场科技有限公司 | House resource information display method, device, terminal and storage medium |
CN111275096A (en) * | 2020-01-17 | 2020-06-12 | 青梧桐有限责任公司 | Homonymous cell identification method and system based on image identification |
CN111291155A (en) * | 2020-01-17 | 2020-06-16 | 青梧桐有限责任公司 | Method and system for identifying homonymous cells based on text similarity |
WO2021147458A1 (en) * | 2020-01-20 | 2021-07-29 | 腾讯科技(深圳)有限公司 | Method and device for matching wireless hotspot and point of interest |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2022225455A1 (en) * | 2021-04-20 | 2022-10-27 | Real Estate Analytics Pte. Ltd. | A system for generating a deduplicated property listing from a plurality of property listings and a method thereof |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102023984A (en) * | 2009-09-10 | 2011-04-20 | 阿里巴巴集团控股有限公司 | Method and system for screening duplicated entity data |
KR101285254B1 (en) * | 2012-08-01 | 2013-07-11 | (주)한국부동산데이타 | Online apartment house appraisal system and method thereof |
CN108536825A (en) * | 2018-04-10 | 2018-09-14 | 苏州市中地行信息技术有限公司 | A method of whether identification source of houses data repeat |
CN108763570A (en) * | 2018-06-05 | 2018-11-06 | 北京拓世寰宇网络技术有限公司 | A kind of method and device identifying the identical source of houses |
CN109035078A (en) * | 2018-08-31 | 2018-12-18 | 北京诸葛找房信息技术有限公司 | A kind of source of houses polymerization based on the similar calculating of various dimensions information |
CN109460428A (en) * | 2018-11-12 | 2019-03-12 | 百度在线网络技术(北京)有限公司 | Space-time analysis method, apparatus and storage medium |
Non-Patent Citations (1)
Title |
---|
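Gao Baolu et al.: "Consistency discrimination method for multi-source data based on geographic ontology reasoning" (基于地理本体推理的多源数据一致性判别方法), Bulletin of Science and Technology (《科技通报》) * |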
Also Published As
Publication number | Publication date |
---|---|
CN109977287B (en) | 2021-02-02 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20210202 |