CN104102667A

CN104102667A - POI (Point of Interest) information differentiation method and device

Info

Publication number: CN104102667A
Application number: CN201310125396.8A
Authority: CN
Inventors: 罗丽俊
Original assignee: Navinfo Co Ltd
Current assignee: Navinfo Co Ltd
Priority date: 2013-04-11
Filing date: 2013-04-11
Publication date: 2014-10-15

Abstract

The invention provides a POI (Point of Interest) information differentiation method and device. The method comprises the following steps: disassembling the POI information to be differentiated into a plurality of first feature words; combining the first feature words and querying a search engine to acquire a POI set; calculating a first similarity between each piece of POI information in the POI set and the POI information to be differentiated; and selecting one or more pieces of POI information as the differentiation result according to the first similarity. Because the POI information to be differentiated is disassembled into feature words that are then combined, more search conditions can be formed, more candidate results are retrieved, and the matching rate of the system is increased. The differentiation result is output according to the similarity between each POI in the retrieved set and the POI information to be differentiated.

Description

POI information differentiation method and device

Technical field

The present invention relates to the field of POI differentiation, and in particular to a POI information differentiation method and device.

Background art

At present, operators differentiate a third-party POI (Point of Interest) library mainly by extracting the main words of the POI name and POI address and formatting the phone number, and then using the main words of the name and address, together with the phone number, type, and coordinates, to search the original library for related information. Among the query results, POIs with high similarity are taken as the matching result. The similarity relies mainly on the similarity of the name main words and of the address main words, and is computed mainly with methods such as edit distance and the Jaccard similarity coefficient. With this existing method, each person can differentiate only 100-200 POIs per day; as third-party POI information increases sharply, the traditional differentiation method seriously affects the production of geographic information data.

In existing POI differentiation methods, the error caused by inconsistent classification between the third-party POI library and the original library is generally addressed by manually establishing a category mapping between the two libraries, or by manually labelling categories in the third-party POI library. This approach is coarse and introduces a certain error, which does not help narrow the matching range. Meanwhile, POI coordinates come mainly from the third-party POI library, but third-party coordinates usually have some deviation, and most third-party POI libraries contain no coordinates at all, which likewise does not help narrow the matching range. The similarity calculation relies mainly on the similarity of the main words obtained after the address and name are split. This is inaccurate for address similarity, because addresses are divided into geographic levels: main words can be duplicated across different districts, and the weights of the address levels should change according to the levels obtained after different addresses are differentiated. In addition, narrowing the matching range only by name main word, address main word, category, and coordinates causes some matching data to be missed.

In short, existing POI differentiation systems have a low matching rate and are time-consuming, which increases the difficulty of subsequent operations.

Summary of the invention

The object of the present invention is to provide a POI information differentiation method and device that improve the POI differentiation matching rate and reduce the time required.

To solve the above technical problem, the present invention provides a POI information differentiation method comprising the following steps:

disassembling the POI information to be differentiated into a plurality of first feature words;

combining the plurality of first feature words and obtaining a POI set by querying a search engine;

calculating a first similarity between each piece of POI information in the POI set and the POI information to be differentiated;

selecting, according to the first similarity, one or more pieces of POI information as the differentiation result of the POI information to be differentiated.

Preferably, calculating the first similarity between each piece of POI information in the POI set and the POI information to be differentiated further comprises:

assigning a weight to each second feature word in the POI information;

calculating a second similarity between each second feature word and the existing POI query library;

summing the products of the weight assigned to each second feature word in the POI information and its corresponding second similarity to obtain an operation result;

taking the operation result as the first similarity between the POI information and the POI information to be differentiated.

Preferably, the second feature word of the POI information is one or more of a name, an address, a phone number, and a category;

when the second feature word is a name, the second similarity between the name and the existing POI query library is the result of matching the name against the existing POI query library;

when the second feature word is an address, the second similarity between the address and the existing POI query library is obtained as follows: the address is divided by level into a plurality of sub-addresses, a weight is assigned to each sub-address, each sub-address is matched against the existing POI query library to obtain a sub-similarity, and the products of each sub-address weight and the corresponding sub-similarity are summed;

when the second feature word is a phone number, the second similarity between the phone number and the existing POI query library is the result of matching the phone number against the existing POI query library;

when the second feature word is a category, the second similarity between the category and the existing POI query library is the result of matching the category against the existing POI query library.

Preferably, the second similarity score_{addr} of the address is calculated with the following formula:

score_{addr} = \sum_{k=1}^{n} \alpha_{k} \cdot level_{k}

where n is the total number of levels into which the address is divided; level_{k} is the sub-similarity of the sub-address match at level k; and \alpha_{k} is the weight of the sub-address at the corresponding level.

Preferably, when the second feature word is an address and both the address and the existing POI query library have coordinates, the distance between the address and the existing POI query library is also calculated, and a third similarity is obtained from the calculated distance. The third similarity is compared with the similarity calculated from the divided sub-addresses of the address, and one of the two is selected as the second similarity between the address and the existing POI query library.

Preferably, the third similarity is calculated with the following formula:

score_{addr_2} = dist / dist_kind, where dist is the distance between the address and the address found in the existing POI query library, and dist_kind is the maximum length preset for each category.

Preferably, when the second feature words in the POI information are the combination of name, address, phone number, and category, the first similarity score between the POI information and the POI information to be differentiated is:

score = \alpha \cdot score_{name} + \beta \cdot score_{addr} + \chi \cdot score_{phone} + \delta \cdot score_{kind}

where \alpha, \beta, \chi, \delta are the assigned weights and \alpha + \beta + \chi + \delta = 1; score_{name} is the second similarity of the name, score_{addr} is the second similarity of the address, score_{phone} is the second similarity of the phone number, and score_{kind} is the second similarity of the category.

The present invention also provides a POI information differentiation device, comprising:

a feature word disassembly module, for disassembling the obtained POI information to be differentiated into a plurality of first feature words;

a POI set acquisition module, for combining the plurality of first feature words and obtaining, by querying a search engine, the POI set corresponding to the feature word combinations;

a similarity determination module, for calculating the first similarity between each piece of POI information in the POI set and the POI information to be differentiated;

an output module, for selecting, according to the first similarity, one or more pieces of POI information as the differentiation result of the POI information to be differentiated.

Preferably, the similarity determination module further comprises:

a weight assignment submodule, for assigning a weight to each second feature word in the POI information;

a similarity calculation submodule, for calculating the second similarity between each second feature word and the existing POI query library;

a summation submodule, for summing the products of the weight assigned to each second feature word of the POI information and its corresponding second similarity;

an operation result output submodule, for outputting the operation result as the first similarity.

Preferably, the similarity calculation submodule further comprises:

The above technical solution has the following beneficial effects: the present invention disassembles the obtained POI information into a plurality of feature words, combines the feature words to query a POI set through a search engine, and outputs the differentiation result by calculating the similarity between the POI information in the POI set and the POI information to be differentiated. This differentiation method calculates POI similarity more accurately. At the same time, the split feature words can be combined into more query conditions, so that more candidate results are retrieved and the matching rate of the system is improved.

Brief description of the drawings

Fig. 1 is a flowchart of the POI information differentiation method according to an embodiment of the present invention;

Fig. 2 is a flowchart of name differentiation according to an embodiment of the present invention;

Fig. 3 is a flowchart for determining the similarity between a feature word set and the POI information to be differentiated according to an embodiment of the present invention;

Fig. 4 is an overall flowchart of POI information differentiation according to an embodiment of the present invention;

Fig. 5 is a structural block diagram of the POI information differentiation device according to an embodiment of the present invention;

Fig. 6 is a structural block diagram of the similarity determination module according to an embodiment of the present invention.

Detailed description of the embodiments

To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the POI information differentiation method according to an embodiment of the present invention comprises:

Step S101: disassembling the POI information to be differentiated into a plurality of first feature words;

Step S102: combining the plurality of first feature words and obtaining a POI set by querying a search engine;

Step S103: calculating the first similarity between each piece of POI information in the POI set and the POI information to be differentiated;

Step S104: selecting, according to the first similarity, one or more pieces of POI information as the differentiation result of the POI information to be differentiated.
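Steps S101 to S104 can be read as a small pipeline. The following is a minimal sketch under stated assumptions, not the patented implementation: the function name `differentiate`, the `Poi` dictionary type, and the three caller-supplied callables are all hypothetical names introduced here for illustration.

```python
from typing import Callable, Dict, List, Tuple

Poi = Dict[str, str]  # hypothetical record type: field name -> value


def differentiate(
    target: Poi,
    disassemble: Callable[[Poi], List[str]],
    search: Callable[[List[str]], List[Poi]],
    similarity: Callable[[Poi, Poi], float],
    top_n: int = 1,
) -> List[Tuple[float, Poi]]:
    """Run steps S101-S104 with caller-supplied (hypothetical) components."""
    feature_words = disassemble(target)        # S101: split into feature words
    candidates = search(feature_words)         # S102: query to obtain the POI set
    scored = [(similarity(c, target), c) for c in candidates]  # S103
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]                      # S104: keep the best match(es)
```

Passing the three stages in as callables keeps the sketch independent of any particular search engine or similarity formula.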

The present invention disassembles the obtained POI information into a plurality of feature words, combines the feature words to obtain a POI set through a search engine, and outputs the differentiation result according to the similarity between the POI information in the POI set and the POI information to be differentiated. This differentiation method calculates POI similarity more accurately. At the same time, the split feature words can be combined into more query conditions, so that more candidate results are retrieved and the matching rate of the system is improved. After the selected feature word set is output, it is verified for accuracy against the existing POI query library to generate accurate POI data, and the database of the electronic map is updated according to the accurate POI data.

In step S101, the obtained POI information to be differentiated is disassembled into a plurality of first feature words; this may consist of name differentiation and address differentiation. As shown in Fig. 2, the name differentiation flow according to an embodiment of the present invention comprises: obtaining additional information, obtaining the alias, removing the prefix word, removing the suffix word, removing noise words, and obtaining the trunk word. A name can thus be split into additional information, alias, prefix word, suffix word, and trunk word. For example, after "Beijing Muyang Fashion Chain Hotel, Guomao Branch (formerly Beijing Boli Business Hotel)" is split, the prefix word is "Beijing"; the suffix word is "Chain Hotel"; the trunk word is "Muyang Fashion"; the additional information is "Guomao Branch"; and the alias is "Boli Business Hotel". Name disassembly principle: splitting off prefix and suffix words relies mainly on the corresponding dictionaries, while splitting the alias, trunk word, and additional information relies mainly on the corresponding rules.

An address can be differentiated into address levels such as province, city, district, township, road, community, landmark, and doorplate. For example, "Floors 8-9, Jiamu Garden Apartment, No. 978 Shengli South Street, Xingqing District, Yinchuan City" is split as follows: city level: Yinchuan City; district level: Xingqing District; road level: Shengli South Street; doorplate level: No. 978; landmark level: Jiamu Garden Apartment. Address disassembly principle: the province, city, district, township, road, community, and landmark levels use the corresponding dictionaries; levels such as the doorplate use rules; for words not present in a dictionary, the corresponding level is determined by rules.

Address matching: if the input POI has no coordinates, the coordinates are obtained through an address matching service.
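The name disassembly described above (dictionaries for prefix and suffix words, rules for alias and additional information) can be sketched as follows. This is an illustration only: the dictionaries `PREFIX_WORDS` and `SUFFIX_WORDS`, the parenthesis and comma rules, and the English rendering of the hotel example are all assumptions, and noise word removal is omitted.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical dictionaries; a real system would load much larger ones.
PREFIX_WORDS = ["Beijing"]
SUFFIX_WORDS = ["Chain Hotel", "Hotel"]  # longer entries first


def _strip_prefix(text: str) -> Tuple[Optional[str], str]:
    """Remove a dictionary prefix word from the start of text, if present."""
    for prefix in PREFIX_WORDS:
        if text.startswith(prefix):
            return prefix, text[len(prefix):].strip()
    return None, text


def split_name(name: str) -> Dict[str, Optional[str]]:
    parts: Dict[str, Optional[str]] = dict.fromkeys(
        ["alias", "additional", "prefix", "suffix", "trunk"])
    # Assumed rule: a parenthesised "formerly ..." part is the alias.
    match = re.search(r"\((?:formerly\s+)?(.+?)\)", name)
    if match:
        _, parts["alias"] = _strip_prefix(match.group(1).strip())
        name = (name[:match.start()] + name[match.end():]).strip()
    # Assumed rule: text after a comma is additional information (a branch).
    if "," in name:
        name, extra = name.split(",", 1)
        parts["additional"] = extra.strip()
        name = name.strip()
    parts["prefix"], name = _strip_prefix(name)
    for suffix in SUFFIX_WORDS:
        if name.endswith(suffix):
            parts["suffix"] = suffix
            name = name[:-len(suffix)].strip()
            break
    parts["trunk"] = name or None  # what remains is the trunk word
    return parts
```

Applied to the hotel example, the sketch yields the prefix, suffix, trunk word, additional information, and alias listed in the text.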

After the name of the POI information is differentiated to obtain the feature words, the category corresponding to the feature words is also obtained. The category is obtained as follows:

Preprocessing: the \chi^{2} statistic is used to compute the weight of each feature word in each category, and weights below a certain threshold are removed. The \chi^{2} statistic is computed as:

\chi^{2}(w, C) = \frac{N \times (AD - BC)^{2}}{(A + C) \times (B + D) \times (A + B) \times (C + D)}

where the feature item is w and the category is C. Let ~w denote the feature items other than w and ~C the categories other than C. The relation between feature item w and category C then falls into four cases: (w, C), (w, ~C), (~w, C), and (~w, ~C). A, B, C, and D denote the number of POIs in these four cases respectively, and the total number of POIs is N = A + B + C + D. (On the right-hand side of the formula, C is the count for the case (~w, C), not the category.)
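The \chi^{2} weight above is a direct function of the four counts. A minimal sketch follows; the function name `chi_square` and the choice to return 0.0 when the denominator is zero are assumptions made here, not something the source specifies.

```python
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square weight of a feature word for a category.

    a: POIs containing the word and in the category      (w, C)
    b: POIs containing the word, not in the category     (w, ~C)
    c: POIs in the category without the word             (~w, C)
    d: POIs with neither                                 (~w, ~C)
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # assumption: treat degenerate tables as carrying no weight
    return n * (a * d - b * c) ** 2 / denominator
```

A word that appears only inside one category gets a high weight, while a word spread evenly across categories gets a weight of zero.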

When the POI name is differentiated and the first feature words are obtained, the category weights corresponding to the first feature words are looked up; the one or more categories with the highest weights are output as the possible categories of the POI.
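Choosing the likely categories from precomputed word weights can be sketched as below. The source does not say how the weights of several feature words are combined, so summing them per category is an assumption, as are the function name `likely_categories` and the layout of the `weights` table.

```python
from collections import defaultdict
from typing import Dict, List


def likely_categories(
    feature_words: List[str],
    weights: Dict[str, Dict[str, float]],  # word -> {category: weight}
    top_n: int = 1,
) -> List[str]:
    """Return the top_n categories by total weight (summing is an assumption)."""
    totals: Dict[str, float] = defaultdict(float)
    for word in feature_words:
        for category, weight in weights.get(word, {}).items():
            totals[category] += weight
    # Sort by descending weight, breaking ties alphabetically for stability.
    ranked = sorted(totals, key=lambda cat: (-totals[cat], cat))
    return ranked[:top_n]
```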

In step S102, the plurality of first feature words are combined, and the POI set corresponding to the feature word combinations is obtained by querying a search engine. The query uses a POI search engine service and retrieves the related POI set by fields such as name, address, type, and coordinates. Specifically, the split name, the split address, the phone number, the category, the coordinates, and so on are combined in various ways, for example address with phone number, or name with address, and each resulting combination of first feature words is submitted to the search engine to obtain the POI set. The search engine may be Baidu, Google, or the like, or it may be the existing POI query library. The POI set obtained through the search engine in this step contains a plurality of pieces of POI information.
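Generating the field combinations for querying can be sketched with the standard library. The function name `query_combinations`, the dictionary representation of a query, and the default minimum of two fields per query are assumptions for illustration; the source only says that various combinations are used.

```python
from itertools import combinations
from typing import Dict, Iterator


def query_combinations(
    fields: Dict[str, str], min_size: int = 2
) -> Iterator[Dict[str, str]]:
    """Yield every combination of the non-empty fields with >= min_size fields."""
    present = {key: value for key, value in fields.items() if value}
    keys = sorted(present)
    for size in range(min_size, len(keys) + 1):
        for combo in combinations(keys, size):
            yield {key: present[key] for key in combo}
```

Each yielded dictionary would then be sent to whichever search service is in use, and the results merged into one POI set.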

In step S103, the first similarity between each piece of POI information in the POI set and the POI information to be differentiated is calculated. As shown in Fig. 3, the calculation flow according to an embodiment of the present invention comprises the following steps:

Step S1031: assigning a weight to each second feature word in the POI information;

Step S1032: calculating the second similarity between each second feature word and the existing POI query library;

Step S1033: summing the products of the weight assigned to each second feature word in the POI information and its corresponding second similarity;

Step S1034: taking the operation result as the first similarity between the POI information and the POI information to be differentiated.

In step S1031, a weight is assigned to each second feature word in the POI information. Weights may be assigned according to whether the feature word is a name, an address, a phone number, a category, and so on; the magnitude of each weight may be set as needed or set automatically according to the category.

In step S1032, the second similarity between each second feature word and the existing POI query library is calculated. When the feature word is a name, the second similarity between the name and the existing POI query library is the result of matching the name against the existing POI query library.

In step S1033, the products of the weight assigned to each second feature word in the POI information and its corresponding second similarity are summed. When the second feature words in the POI information are name, address, phone number, and category, the first similarity score between the POI information and the POI information to be differentiated is:

score = \alpha \cdot score_{name} + \beta \cdot score_{addr} + \chi \cdot score_{phone} + \delta \cdot score_{kind}

In step S1034, the operation result of the summation in step S1033 is output.
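The weighted sum of steps S1031 to S1034 can be sketched as follows. The function name `first_similarity` and the example weight values used in the test are hypothetical; the only constraint taken from the text is that the weights sum to 1.

```python
import math
from typing import Dict


def first_similarity(second: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-field second similarities (name, addr, phone, kind)."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")
    # A field without a computed similarity contributes 0 (an assumption).
    return sum(weight * second.get(field, 0.0) for field, weight in weights.items())
```

For instance, with weights 0.4, 0.3, 0.2, and 0.1 for name, address, phone, and category, a candidate with a perfect name and category match, a half address match, and a different phone number scores 0.4 + 0.15 + 0 + 0.1.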

In an embodiment of the present invention, when the second feature word is an address, the address feature word is divided by level into a plurality of sub-addresses, and the weight of each sub-address is determined by its level. The weights change dynamically with the levels present in the differentiated address.

When the second feature word is a name, score_{name} is the second similarity of the name. It is based mainly on three comparisons: the similarity between the trunk word plus suffix word and the POI name in the existing library; the similarity between the alias and the POI name in the query library; and the similarity between the trunk word and the trunk word of the POI name in the query library. The maximum of these similarities is taken as the second similarity of the POI name; if this value is below a predetermined threshold, the record is removed. Here, second similarity = edit distance / longest word length. Edit distance, also called Levenshtein distance, is the minimum number of edit operations required to change one string into another. The permitted edit operations are replacing one character with another, inserting a character, and deleting a character.
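The edit distance and the ratio named above can be sketched as follows. `levenshtein` and `edit_ratio` follow the text as written. Note that the ratio as written is 0 for identical strings, so the helper `name_score` uses its complement, 1 - ratio, so that a higher value means a closer match; that complement, the function names, and the pair-list interface are assumptions made for this sketch.

```python
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions, and deletions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def edit_ratio(a: str, b: str) -> float:
    """Edit distance divided by the longer length, as stated in the text."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def name_score(pairs: List[Tuple[str, str]], threshold: float) -> Optional[float]:
    """Best similarity over (query string, library string) pairs.

    Uses 1 - edit_ratio (an assumption). Returns None when the best value is
    below the threshold, meaning the record would be dropped.
    """
    best = max((1.0 - edit_ratio(q, lib) for q, lib in pairs), default=0.0)
    return best if best >= threshold else None
```

The three pairs would be trunk plus suffix against the library name, alias against the library name, and trunk against the library trunk.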

In an embodiment of the present invention, when the second feature word is an address, score_{addr} is calculated as score_{addr} = \sum_{k=1}^{n} \alpha_{k} \cdot level_{k}, where n is the total number of levels into which the address is divided; level_{k} is the sub-similarity of the sub-address match at level k; and \alpha_{k} is the weight of the sub-address at the corresponding level. The weights change dynamically with the levels present in the differentiated address. For example, for the address "Jiamu Garden Apartment, No. 978 Shengli South Street": road level, "Shengli South Street", weight 0.5; doorplate level, "No. 978", weight 0.2; landmark level, "Jiamu Garden Apartment", weight 0.3. If the address is "No. 978 Shengli South Street", then: road level, "Shengli South Street", weight 0.7; doorplate level, "No. 978", weight 0.3.
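The level-weighted address score with weights that depend on which levels are present can be sketched as follows. The two weight sets are the ones given in the example above; the lookup table `LEVEL_WEIGHTS`, the function name `address_score`, the English level names, and the use of exact string equality (1.0 or 0.0) as the sub-similarity are assumptions.

```python
from typing import Dict, FrozenSet

# Weight sets keyed by the levels present in the address; both entries come
# from the example in the text. Other level combinations would need their own.
LEVEL_WEIGHTS: Dict[FrozenSet[str], Dict[str, float]] = {
    frozenset({"road", "doorplate", "landmark"}):
        {"road": 0.5, "doorplate": 0.2, "landmark": 0.3},
    frozenset({"road", "doorplate"}):
        {"road": 0.7, "doorplate": 0.3},
}


def address_score(query: Dict[str, str], candidate: Dict[str, str]) -> float:
    """Sum over levels of alpha_k * level_k for the levels present in query."""
    weights = LEVEL_WEIGHTS[frozenset(query)]
    return sum(
        weight * (1.0 if candidate.get(level) == query[level] else 0.0)
        for level, weight in weights.items()
    )
```

With all three levels present, a candidate that matches the road and doorplate but not the landmark scores 0.5 + 0.2.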

When the second feature word is an address and both the address and the existing POI query library have coordinates, the distance between the address and the existing POI query library is also calculated, and a third similarity is obtained from the calculated distance. The third similarity is compared with the similarity calculated from the divided sub-addresses of the address, and one of the two is selected as the second similarity between the address and the existing POI query library. The distance is obtained by converting the address into geographic coordinates with geocoding and then computing the distance between this address and the corresponding address in the POI query library.

score_{addr_2} = dist / dist_kind, where dist is the distance between the address and the existing POI query library, and dist_kind is the maximum length preset for each category. When both the address and the existing POI query library have coordinates, the larger of score_{addr} and score_{addr_2} is taken as the second similarity of the address.
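The distance-based score and the selection of the larger value can be sketched as below. The source does not say how the distance between two coordinates is computed, so the haversine great-circle formula with a mean Earth radius of 6,371 km is an assumption, as are the function names. `score_addr_2` implements the ratio exactly as written in the text; a system that wants a nearer candidate to score higher might instead use a complement such as 1 - dist / dist_kind, which the source does not state.

```python
import math


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two points given in degrees."""
    radius_m = 6371000.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(h))


def score_addr_2(dist: float, dist_kind: float) -> float:
    """Third similarity as written in the text: dist / dist_kind."""
    return dist / dist_kind


def address_second_similarity(score_addr: float, dist: float, dist_kind: float) -> float:
    """Pick the larger of the level-based score and the distance-based score."""
    return max(score_addr, score_addr_2(dist, dist_kind))
```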

In an embodiment of the present invention, when the second feature word is a phone number, score_{phone} is the phone similarity: 1 means the phone numbers are equal, and 0 means they are not. When the second feature word is a category, score_{kind} is the category similarity: 1 means the categories are equal, and 0 means they are not.
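The two binary scores can be sketched as below. The text says phone numbers are formatted before comparison but not how, so reducing each number to its digits is an assumption, as are the function names and the choice to score two empty phone numbers as 0.

```python
import re


def normalize_phone(phone: str) -> str:
    """Keep digits only (an assumed formatting rule)."""
    return re.sub(r"\D", "", phone)


def phone_score(a: str, b: str) -> float:
    """1.0 when both numbers are non-empty and equal after formatting, else 0.0."""
    na, nb = normalize_phone(a), normalize_phone(b)
    return 1.0 if na and na == nb else 0.0


def kind_score(a: str, b: str) -> float:
    """1.0 when the categories are equal, else 0.0."""
    return 1.0 if a == b else 0.0
```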

As shown in Fig. 4, the overall flow of POI information differentiation according to an embodiment of the present invention comprises the following steps:

Obtaining POI information: POI information provided by a third party is obtained; it may be provided by a service provider or by an individual.

Formatting: the obtained POI information is formatted; the formatting procedure is prior art.

Name splitting: the name is split into additional information, alias, prefix word, suffix word, trunk word, and so on; name splitting may use prior art.

Category acquisition: the category is obtained from the split name; category acquisition may use prior art.

Address splitting: the address can be differentiated into address levels such as province, city, district, township, road, community, landmark, and doorplate; address splitting may use prior art.

Coordinate acquisition: the address is converted into geographic coordinates using geocoding. Geocoding is an encoding method based on spatial positioning technology; it provides a way to convert geographic location information described as an address into geographic coordinates that can be used in a GIS (Geographic Information System).

Query matching: the feature word sets are queried against the existing POI library to obtain the first similarity between each feature word set and the POI information to be differentiated.

Record output: the qualifying feature word sets are output according to the similarity.

As shown in Fig. 5, the POI information differentiation device according to an embodiment of the present invention comprises:

a feature word disassembly module 100, for disassembling the obtained POI information to be differentiated into a plurality of first feature words;

a POI set acquisition module 200, for combining the plurality of first feature words and obtaining a POI set by querying a search engine;

a similarity determination module 300, for calculating the first similarity between each piece of POI information in the POI set and the POI information to be differentiated;

an output module 400, for selecting, according to the first similarity, one or more pieces of POI information as the differentiation result of the POI information to be differentiated.

As shown in Fig. 6, the similarity determination module according to an embodiment of the present invention comprises:

a weight assignment submodule 301, for assigning a weight to each second feature word in the POI information;

a similarity calculation submodule 302, for calculating the second similarity between each second feature word and the existing POI query library;

a summation submodule 303, for summing the products of the weight assigned to each second feature word of the POI information and its corresponding second similarity;

an operation result output submodule 304, for outputting the operation result as the first similarity between the POI information and the POI information to be differentiated.

The similarity calculation submodule of the present invention further operates as follows: when the second feature word is a name, the second similarity between the name and the existing POI query library is the result of matching the name against the existing POI query library.

In an embodiment of the present invention, when the second feature word is an address, score_{addr} is calculated as score_{addr} = \sum_{k=1}^{n} \alpha_{k} \cdot level_{k}, where n is the total number of levels into which the address is divided; level_{k} is the sub-similarity of the sub-address match at level k; and \alpha_{k} is the weight of the sub-address at the corresponding level. The weights change dynamically with the levels present in the differentiated address.

In the above technical solution, the obtained POI information is disassembled into a plurality of feature words, the feature words are combined to obtain a POI set through a search engine, and the differentiation result is output according to the similarity between the POI information in the POI set and the POI information to be differentiated. This differentiation method calculates POI similarity more accurately. At the same time, the split feature words can be combined into more query conditions, so that more candidate results are retrieved and the matching rate of the system is improved.

The above is the preferred embodiment of the present invention. It should be pointed out that those skilled in the art can make further improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims

1. A POI information differentiation method, characterized by comprising the following steps:

2. The POI information differentiation method according to claim 1, characterized in that calculating the first similarity between each piece of POI information in the POI set and the POI information to be differentiated further comprises:

3. The POI information differentiation method according to claim 2, characterized in that the second feature word of the POI information is one or more of a name, an address, a phone number, and a category;

4. The POI information differentiation method according to claim 3, characterized in that the second similarity score_{addr} of the address is calculated with the following formula:

score_{addr} = \sum_{k=1}^{n} \alpha_{k} \cdot level_{k}

5. The POI information differentiation method according to claim 3, characterized in that, when the second feature word is an address and both the address and the existing POI query library have coordinates, the distance between the address and the existing POI query library is also calculated, a third similarity is obtained from the calculated distance, the third similarity is compared with the similarity calculated from the divided sub-addresses of the address, and one of the two is selected as the second similarity between the address and the existing POI query library.

6. The POI information differentiation method according to claim 5, characterized in that the third similarity is calculated with the following formula:

7. The POI information differentiation method according to any one of claims 3-6, characterized in that, when the second feature words in the POI information are the combination of name, address, phone number, and category, the first similarity score between the POI information and the POI information to be differentiated is:

score = \alpha \cdot score_{name} + \beta \cdot score_{addr} + \chi \cdot score_{phone} + \delta \cdot score_{kind}

8. A POI information differentiation device, characterized by comprising:

a POI set acquisition module, for combining the plurality of first feature words and obtaining a POI set by querying a search engine;

9. The POI information differentiation device according to claim 8, characterized in that the similarity determination module further comprises:

10. The POI information differentiation device according to claim 9, characterized in that the similarity calculation submodule further comprises: