CN105808609B

CN105808609B - Method and equipment for judging data redundancy of information points

Info

Publication number: CN105808609B
Application number: CN201410854997.7A
Authority: CN
Inventors: 杨自华; 张文斗
Original assignee: Autonavi Software Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2014-12-31
Filing date: 2014-12-31
Publication date: 2020-04-14
Anticipated expiration: 2034-12-31
Also published as: CN105808609A

Abstract

The invention discloses a method and a device for judging data redundancy of information points. The method comprises: pairing the POIs in the electronic map database two by two; and, for each pairing, performing the following steps: calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing; judging whether the distance is greater than or equal to a preset distance threshold value; if yes, determining that the two POIs are not redundant data; if not, calculating the POI similarity according to the attribute information of the two POIs and judging whether the POI similarity is greater than or equal to a preset similarity threshold value; if so, determining that the two POIs are redundant data, and otherwise determining that the two POIs are not redundant data. With this scheme, whether two paired pieces of POI data are redundant can be determined in the data fusion stage, so that the volume of stored POI data is reduced, the storage resources occupied by POIs are saved, and the performance of the system is improved.

Description

Method and equipment for judging data redundancy of information points

Technical Field

The invention relates to the technical field of navigation electronic maps, in particular to a method and equipment for judging data redundancy of a Point of interest (POI).

Background

With the development of science and technology, the demand for data keeps growing and massive amounts of data are generated. For example, the POI data in a navigation electronic map mainly includes: name, address, category (e.g., hotel, hospital, gas station, parking lot, dining), location (e.g., latitude and longitude coordinates), telephone number, business hours, and surrounding environment (e.g., nearby hotels, restaurants, shops).

However, POI data is obtained from many sources, such as field collection, third-party purchase, and network capture, and data from different sources differs in data format, text description, and the like. POI data that describes the same POI but comes from different sources is therefore very likely to differ, so that multiple pieces of POI data are stored in the electronic map database for the same POI. For example, for a POI that is the kendir store in tokyo, the POI data describing it may read "kendir (tokyo store)", "tokyo street × kendir", and "kygkistry store in international business center B". A large amount of POI data is thus redundant, and how to identify the redundant data in the electronic map database has become an urgent problem.

Disclosure of Invention

In view of this, embodiments of the present invention provide a method and an apparatus for determining redundancy of information point data, which are used to solve the problem in the prior art that the redundancy of information point data is relatively serious.

A method for judging data redundancy of information points comprises the following steps:

pairwise matching POIs in the electronic map database;

for each pairing, the following steps are performed:

calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing;

judging whether the distance between the two POIs is larger than or equal to a preset distance threshold value or not;

if yes, determining that the two POIs in the pairing are not redundant data;

if not, calculating the similarity of the POI according to the attribute information of the two POI in the pairing, and judging whether the similarity of the POI is larger than or equal to a preset similarity threshold value, if so, determining that the two POI are redundant data, otherwise, determining that the two POI are not redundant data.

An information point data redundancy discrimination apparatus comprising:

the matching unit is used for pairwise matching POI in the electronic map database;

the distance calculation unit is used for calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing aiming at each pairing and triggering the judgment unit;

the judging unit is used for judging whether the distance between the two POIs is larger than or equal to a preset distance threshold value; if yes, determining that the two POIs in the pairing are not redundant data; if not, triggering the similarity judgment unit;

and the similarity judgment unit is used for calculating the similarity of the POI according to the attribute information of the two POI in the pairing, judging whether the similarity of the POI is greater than or equal to a preset similarity threshold value, if so, determining that the two POI are redundant data, and otherwise, determining that the two POI are not redundant data.

The invention has the following beneficial effects:

the embodiment of the invention pairs POIs in the electronic map database; for each pairing, the following steps are performed: calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing; judging whether the distance between the two POIs is larger than or equal to a preset distance threshold value or not; if yes, determining that the two POIs in the pairing are not redundant data; if not, calculating the similarity of the POI according to the attribute information of the two POI in the pairing, and judging whether the similarity of the POI is larger than or equal to a preset similarity threshold value, if so, determining that the two POI are redundant data, otherwise, determining that the two POI are not redundant data. According to the scheme, the redundant data of the POI in the electronic map database can be determined, the POI in the electronic map database is subjected to redundancy judgment, when a plurality of POI belonging to the redundant data are determined, the redundant POI are combined in time, the data volume of POI storage is reduced, the storage resources occupied by the POI can be saved, and the performance of the system is improved.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1A is a flowchart illustrating a method for determining redundancy of information point data according to an embodiment of the present invention;

FIG. 1B is a second flowchart of a method for determining redundancy of information point data according to an embodiment of the present invention;

FIG. 2A is a schematic structural diagram of a device for determining redundancy of information point data according to an embodiment of the present invention;

FIG. 2B is a second schematic structural diagram of a device for determining redundancy of information point data according to an embodiment of the present invention.

Detailed Description

In order to achieve the purpose of the invention, the embodiment of the invention provides a method and equipment for judging data redundancy of information points, in which the POIs in an electronic map database are paired two by two and, for each pairing, the following steps are performed: calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing; judging whether the distance between the two POIs is greater than or equal to a preset distance threshold value; if yes, determining that the two POIs in the pairing are not redundant data; if not, calculating the POI similarity according to the attribute information of the two POIs in the pairing and judging whether the POI similarity is greater than or equal to a preset similarity threshold value; if so, determining that the two POIs are redundant data, and otherwise determining that the two POIs are not redundant data. This scheme can determine which POI data in the electronic map database is redundant.

It should be noted that the method for determining information point data redundancy according to the embodiment of the present invention can mainly be applied to: deduplication during production and updating of a navigation POI database; deduplication and fusion when third-party data is merged, with alias information obtained at the same time; deduplication of data uploaded by clients; re-ranking of retrieval results; and elimination of data redundancy caused by human error (such as wrongly written characters). The method can eliminate a large amount of manual deduplication work and, because the program runs efficiently, is suitable for judging redundancy in massive information point data during big data processing.

Various embodiments of the present invention are described in further detail below with reference to the accompanying drawings. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The first embodiment is as follows:

FIG. 1A is a schematic flowchart of a method for determining information point data redundancy according to an embodiment of the present invention. The method may proceed as follows.

Step 101: pairing the POIs in the electronic map database two by two.

For each pairing, the following steps 102 to 108 are performed:

step 102: calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing;

step 103: judging whether the distance between the two POIs is larger than or equal to a preset distance threshold value, if so, executing a step 104, otherwise, executing a step 105;

step 104: determining that the two POIs in the pair are not redundant data;

step 105: calculating POI similarity according to the attribute information of the two POIs in the pairing;

step 106: judging whether the similarity of the POI is more than or equal to a preset similarity threshold, if so, executing a step 107, otherwise, executing a step 108;

step 107: determining the two POIs as redundant data;

step 108: determining that the two POIs are not redundant data.

Wherein step 102 is re-executed for the next pairing after step 104, step 107 and step 108.
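The flow of steps 101 to 108 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the source does not say how the distance is computed, so the haversine great-circle formula is assumed here, and the threshold values, the dictionary keys (`id`, `lat`, `lon`) and the `similarity_fn` callback are all hypothetical.

```python
import math
from itertools import combinations

# Hypothetical values; the source only says the thresholds are "preset".
DISTANCE_THRESHOLD_M = 500.0
SIMILARITY_THRESHOLD = 0.8


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (an assumed distance measure)."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def is_redundant(poi_a, poi_b, similarity_fn):
    # Steps 102-104: a pair at or beyond the distance threshold is not redundant.
    d = haversine_m(poi_a["lat"], poi_a["lon"], poi_b["lat"], poi_b["lon"])
    if d >= DISTANCE_THRESHOLD_M:
        return False
    # Steps 105-108: otherwise the similarity threshold decides.
    return similarity_fn(poi_a, poi_b) >= SIMILARITY_THRESHOLD


def find_redundant_pairs(pois, similarity_fn):
    # Step 101: every unordered pair of POIs is examined once.
    return [(a["id"], b["id"])
            for a, b in combinations(pois, 2)
            if is_redundant(a, b, similarity_fn)]
```

Because the distance test runs first, the more expensive attribute comparison is only evaluated for pairs that are geographically close.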

Preferably, in the embodiment of the present invention, in order to further improve the accuracy of the POI redundancy judgment, before step 101 of the method flow shown in fig. 1A, step 100 may be further included, as shown in fig. 1B:

step 100: preprocessing the attribute information of the POIs in the electronic map database.

In this embodiment of the present invention, the attribute information of the POI at least includes one of the following attribute information:

POI name, POI address, phone number, POI type. In the foregoing step 100, preprocessing a POI name in the attribute information of the POI specifically includes:

step a1, determining whether the POI name contains a name prefix according to a preset administrative division prefix table, and if so, recording the name prefix and the name prefix identification of the POI name.

Specifically, when a POI in the electronic map database contains a POI name, the POI name is checked against the preset administrative division prefix table. If the POI name contains an administrative division prefix, that prefix is recorded and its identifier is determined.

For example: for the POI "beijing hailong mansion", "beijing city" is the administrative division prefix of the POI, and the identifier of "beijing city" is recorded.

A POI name may contain more than one administrative division prefix.

In the process of determining whether the POI name contains an administrative division prefix, if the POI name is found to contain at least two administrative division prefixes, for example "beijing" and "hubei" in "beijing hubei mansion", the administrative division prefix appearing first is used as the name prefix of the POI, and its identifier is used as the name prefix identifier of the POI; that is, the identifier of "beijing" is used as the name prefix identifier of "beijing hubei mansion".

Table 1 shows a preset administrative division prefix table:

Administrative division prefix	Identification	Administrative division prefix	Identification
All-grass of Anyang	100009	In the bar	100013
Anyang city	100009	City of Bazhong	100013
Linzhou city	100009	Pingchang county	100013
Anyang county	100009	Nanjiang county	100013
Inhuang county	100009	Tongjiang county	100013
Tangyin county	100009	Bazhou district	100013
Huaxian county	100009		
…	…	…	…

TABLE 1
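Step a1 can be sketched as a table lookup that keeps the prefix occurring first in the name. The sketch below is illustrative only: the function name `find_name_prefix` is hypothetical, the table holds a few entries taken from Table 1, and breaking ties at the same position in favour of the longer prefix is an assumption the source does not state.

```python
# A few entries from Table 1; a real table would be far larger.
PREFIX_TABLE = {
    "Anyang city": 100009,
    "Anyang county": 100009,
    "Tangyin county": 100009,
    "Bazhou district": 100013,
}


def find_name_prefix(name, prefix_table=PREFIX_TABLE):
    """Return (prefix, identifier) for the prefix occurring first in name, or None."""
    best_key, best_prefix = None, None
    for prefix in prefix_table:
        pos = name.find(prefix)
        if pos == -1:
            continue
        # Earlier position wins; at equal position the longer prefix wins (assumption).
        key = (pos, -len(prefix))
        if best_key is None or key < best_key:
            best_key, best_prefix = key, prefix
    if best_prefix is None:
        return None
    return best_prefix, prefix_table[best_prefix]
```

With this rule, a name containing both "Anyang city" and "Bazhou district" is assigned the identifier of whichever appears first.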

Step a2, determining whether the POI name contains a name suffix according to a preset suffix list, and recording the name suffix and the name suffix identification of the POI name if the POI name contains the name suffix.

Specifically, according to an end-anchored maximum matching rule (the longest matching suffix wins), the terminal characters of the POI name are matched against the suffix table, and whether the POI name contains a name suffix is determined from the matching result. If the POI name contains a name suffix, the name suffix and the name suffix identification of the POI name are recorded.

As shown in table 2, is a preset suffix table:

TABLE 2

For example: if the POI name is 'God software company', the terminal characters of the POI name are matched against the suffix table; the matching result is that the POI name contains 'company', so the name suffix 'company' and its name suffix identification are recorded. If the POI name is 'Goodware Co., Ltd', the matching result is that the POI name contains 'Co., Ltd', so the name suffix 'Co., Ltd' and its name suffix identification are recorded.

Step a3, judging whether the POI name contains brackets, and if so, deleting the brackets and the content in the brackets from the POI name and recording the bracket content.
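The end-anchored maximum matching of step a2 can be illustrated with the short sketch below. The suffix table contents and identifiers are hypothetical (Table 2 is not reproduced in this text), as is the function name `find_name_suffix`; candidates are tried from longest to shortest so that the longest matching suffix wins.

```python
# Hypothetical suffix table; identifiers are invented for illustration.
SUFFIX_TABLE = {
    "company": 200001,
    "Ltd": 200002,
    "Co., Ltd": 200003,
}


def find_name_suffix(name, suffix_table=SUFFIX_TABLE):
    """Return (suffix, identifier) for the longest suffix ending the name, or None."""
    for suffix in sorted(suffix_table, key=len, reverse=True):
        if name.endswith(suffix):
            return suffix, suffix_table[suffix]
    return None
```

Trying the longer candidates first is what keeps 'Co., Ltd' from being recorded as the shorter 'Ltd'.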

Specifically, if the parentheses are included in the POI name and the parentheses are at the end of the POI name, the parentheses and the content in the parentheses are deleted and recorded.

For example: "Syngnathus mansion (Zhongguancun street)" becomes "Syngnathus mansion" after preprocessing, and "Zhongguancun street" is recorded.

Step a4, judging whether the POI name contains a chain store name according to a preset chain store name table, and if so, replacing the POI name with the chain store name.
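A minimal sketch of step a3, assuming that only a parenthesized group at the very end of the name is removed and that both ASCII and full-width parentheses may occur; the function name `strip_trailing_parentheses` is hypothetical.

```python
import re

# Matches one parenthesized group at the end of the string, with optional
# whitespace around it; accepts ASCII and full-width parentheses.
_TRAILING_PAREN = re.compile(r"\s*[(（]([^()（）]*)[)）]\s*$")


def strip_trailing_parentheses(name):
    """Return (name without trailing parentheses, recorded content or None)."""
    match = _TRAILING_PAREN.search(name)
    if match is None:
        return name, None
    return name[:match.start()], match.group(1)
```

The recorded content is returned alongside the shortened name so that it can later take part in the bracket content similarity.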

Specifically, the POI name of the POI is compared with the chain store name table, and if the POI name of the POI includes the chain store name as a result of the comparison, the POI name is replaced with the matched chain store name.

For example: if the POI name is "kentucky restaurant limited" or "kentucky restaurant", and the POI name includes the chain name "kentucky", the POI name "kentucky restaurant limited" or "kentucky restaurant" is replaced with "kentucky".

Table 3 shows a preset chain store name table:

Name chain store data	Identification	Name chain store data	Identification
Kendyl	100001	Nine-head hawk	100005
Root of beautiful Sweetclover	100002	Victory guest	100006
Fishing on the seabed	100003	Beautiful jiangnan	100007
Dong Leshun (a Chinese character of Dong)	100004		
…	…	…	…

TABLE 3

When the chain store name appears inside the parentheses of the POI name, the parentheses and their content are deleted and the parenthesized content is recorded; the POI name does not need to be replaced with the matching chain store name.

For example: "Syngnathus building (by Kendeki XXX store)" becomes "Syngnathus building" after preprocessing, and "by Kendeki XXX store" is recorded.

When name data of POI data is preprocessed, the name chain store data may be processed after the name prefix processing and the name suffix processing.
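The chain store replacement of step a4 might look like the sketch below. It assumes that parentheses have already been stripped from the name, so a chain name that appeared only inside parentheses never reaches this function; the table entry, its identifier, and the function name `normalize_chain_name` are hypothetical.

```python
# Hypothetical chain store table with a single illustrative entry.
CHAIN_STORE_TABLE = {"kentucky": 100001}


def normalize_chain_name(name, chain_table=CHAIN_STORE_TABLE):
    """Replace the whole name with a contained chain store name, if any.

    Returns (normalized name, chain identifier or None). Longer chain names
    are tried first so that the most specific entry wins.
    """
    for chain in sorted(chain_table, key=len, reverse=True):
        if chain in name:
            return chain, chain_table[chain]
    return name, None
```

Names without a chain store entry pass through unchanged.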

Step a5, according to the preset synonym abbreviation list, judging whether the POI name contains synonym or abbreviation, if yes, replacing the corresponding synonym or abbreviation in the POI name with the default word corresponding to the synonym or abbreviation in the synonym abbreviation list.

Specifically, the POI name of the POI is matched against the synonym abbreviation table to determine whether the POI name contains a synonym or abbreviation. If it does, the default word defined in the synonym abbreviation table replaces the synonym or abbreviation in the POI name, and the other parts of the POI name are kept unchanged.

Table 4 shows a preset synonym abbreviation table:

Synonyms/abbreviations	Identification	Default value
Institute of science of China	100001	Y
Chinese academy of sciences	100001	
Chinese academy of sciences	100001	
Consultation	100002	Y
Information	100002	

TABLE 4

In Table 4, a word marked Y in the "Default value" column is the default word for the synonyms or abbreviations that share its identifier. For example, if the POI name is "beijing chinese academy", the POI name includes the abbreviation "chinese academy", and the default word "chinese scientific institute" corresponding to the abbreviation replaces the abbreviation in the POI name, so that the POI name is preprocessed to "beijing chinese scientific institute".
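The default-word substitution of step a5 can be sketched as below. Rows are modelled as (word, identifier, is_default) tuples, mirroring the three columns of Table 4; the row contents follow the "chinese academy" example in the text, and the function names are hypothetical.

```python
# Rows modelled on Table 4: (word, identifier, is_default).
SYNONYM_ROWS = [
    ("chinese scientific institute", 100001, True),
    ("chinese academy", 100001, False),
    ("consultation", 100002, True),
    ("information", 100002, False),
]


def build_replacements(rows):
    """Map every non-default word to the default word sharing its identifier."""
    defaults = {ident: word for word, ident, is_default in rows if is_default}
    return {word: defaults[ident]
            for word, ident, is_default in rows if not is_default}


def replace_synonyms(name, rows=SYNONYM_ROWS):
    """Replace synonyms or abbreviations in name; other parts stay unchanged."""
    replacements = build_replacements(rows)
    for word in sorted(replacements, key=len, reverse=True):
        name = name.replace(word, replacements[word])
    return name
```

The same lookup structure would serve the alias table of step a6, since both tables pair variant words with one default word per identifier.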

Step a6, according to the preset name alias table, determining whether the POI name contains an alias, and if so, replacing the alias in the POI name with the default word corresponding to that alias in the name alias table.

Specifically, the POI name of the POI is matched against the name alias table to determine whether the POI name contains an alias. If it does, the default word defined in the alias table replaces the alias in the POI name, and the rest of the POI name is kept unchanged.

Table 5 shows a preset alias table:

Alias	Identification	Default value
National stadium	100003	Y
Bird nest	100003	
University of Beijing teachers		Y
North Master Da		
…	…	…

TABLE 5

In Table 5, a word marked Y in the "Default value" column is the default word for the aliases that share its identifier. If the POI name is "beijing university", the POI name includes the alias "beijing university", and the default word "beijing university" corresponding to the alias replaces the alias in the POI name, so that the POI name is preprocessed to "beijing university", i.e., beijing university of haidingdistrict.

Preferably, in order to further improve the accuracy of preprocessing the POI name, in the embodiment of the present invention, before the POI name of the POI is preprocessed, preliminary processing is also required, such as checking the validity of the POI name, converting between full-width and half-width characters, and converting between traditional and simplified characters.
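The full-width to half-width conversion can be illustrated with Python's standard `unicodedata` module. This sketch assumes Unicode NFKC normalization is an acceptable way to fold full-width Latin letters, digits and punctuation to their half-width forms; the function name `normalize_width` is hypothetical. Conversion between traditional and simplified characters is not covered, since it needs a character mapping table that the standard library does not provide.

```python
import unicodedata


def normalize_width(text):
    """Fold full-width characters to half-width equivalents using NFKC.

    NFKC also applies other compatibility mappings, which is assumed to be
    acceptable for POI names and addresses.
    """
    return unicodedata.normalize("NFKC", text)
```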

Specifically, the preprocessing the POI address in the attribute information of the POI in step 100 specifically includes:

step b1, determining whether the POI address contains an address prefix according to a preset administrative division prefix table, and if so, recording the address prefix and the address prefix identification of the POI address.

The determination method is the same as the method of whether the POI name includes the name prefix, and is not described in detail here.

For example: the POI address is 'Beijing City Beijing four-ring West road', and after address prefix preprocessing is carried out, the address prefix of the POI address is determined to be 'Beijing City'.

In an embodiment of the present invention, the POI address may include multiple administrative division prefixes at the same time. In the process of determining whether the address prefix included in the POI address includes the administrative division prefix, if the POI address includes at least two administrative division prefixes, for example, "shanghai city" and "nanjing", which appear in "shanghai city, nanjing road 25", the administrative division prefix appearing first is used as the address prefix of the POI address, and the identifier of the administrative division prefix is used as the address prefix identifier of the POI, that is, the identifier of "shanghai city" is used as the address prefix identifier of "shanghai city, nanjing road 25".

Step b2, judging whether the POI address contains brackets, and if so, deleting the brackets and the content in the brackets from the POI address.

Step b3, according to the preset synonym abbreviation list, judging whether the POI address contains a synonym or abbreviation, and if so, replacing the corresponding synonym or abbreviation in the POI address with the default word corresponding to the synonym or abbreviation in the synonym abbreviation list.

Step b4, judging whether the POI address contains an alias according to a preset address alias table, and if so, replacing the corresponding alias in the POI address with the default word corresponding to the alias in the address alias table.

It should be noted that the processing of whether the POI address contains brackets, a synonym, or an alias is the same as the processing of whether the POI name contains brackets, a synonym, or an alias, and is not described in detail here.

Preferably, in order to further improve the accuracy of preprocessing the POI address, in the embodiment of the present invention, before the POI address is preprocessed, preliminary processing is also required, such as checking the validity of the POI address, converting between full-width and half-width characters, and converting between traditional and simplified characters.

Specifically, the preprocessing of the phone number in the POI attribute information in step 100 specifically includes:

converting the telephone number of the POI according to a preset telephone number format.

Specifically, the telephone numbers of the POIs are converted into one uniform format.
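The source does not specify the uniform telephone format. Purely as an illustration, the sketch below assumes a digits-only target format: full-width digits are folded to ASCII first, then every non-digit character is dropped. The function name `normalize_phone` and the target format are both assumptions.

```python
import re
import unicodedata


def normalize_phone(raw):
    """Reduce a telephone number to ASCII digits only (an assumed uniform format)."""
    folded = unicodedata.normalize("NFKC", raw)  # fold full-width digits and symbols
    return re.sub(r"[^0-9]", "", folded)
```

Under this assumed format, two differently punctuated spellings of one number compare equal after normalization.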

Specifically, in step 101, pairwise pairing is performed on POIs in the electronic map database, which mainly includes:

selecting any two POIs from the electronic map database and determining whether the two POIs are redundant data. For example, if the electronic map database contains POI1, POI2, POI3 and POI4, pairwise pairing of the POIs in the electronic map database yields: POI1 and POI2, POI1 and POI3, POI1 and POI4, POI2 and POI3, POI2 and POI4, POI3 and POI4.
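The pairing in this example corresponds to enumerating all unordered pairs, which Python's standard `itertools.combinations` does directly. The function name `pair_pois` is hypothetical.

```python
from itertools import combinations


def pair_pois(pois):
    """Return every unordered pair of POIs, each pair exactly once."""
    return list(combinations(pois, 2))


pairs = pair_pois(["POI1", "POI2", "POI3", "POI4"])
```

For n POIs this produces n(n-1)/2 pairs, which is why the inexpensive distance test is applied before the attribute similarity is computed.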

In step 105, when the attribute information includes the POI name, calculating the name similarity of the two POIs; when the attribute information comprises POI addresses, calculating the address similarity of the two POIs; when the attribute information comprises telephone numbers, calculating the telephone number similarity of the two POIs; and when the attribute information comprises the types of the POI, calculating the type similarity of the two POI.

Specifically, calculating the name similarity of the two POIs in the embodiment of the present invention includes: judging whether the name prefix identifications of the two POIs are consistent; if they are not consistent, determining that the name similarity of the two POIs is 0; if they are consistent, determining that the name prefix similarity of the two POIs is 1, calculating the main body similarity, the name suffix similarity and the bracket content similarity of the two POIs, and calculating the name similarity of the two POIs according to the main body similarity, the name suffix similarity and the bracket content similarity. Here, the main body refers to the part of the POI name other than the name prefix, the name suffix and the parenthesis content.

Specifically, since the number of POI name prefixes is uncertain, and there may be a plurality of POI name prefixes, or one POI name prefix, or there may be no POI name prefix, when determining whether the name prefix identifiers of the two POIs are consistent, the following cases are included:

Case 1: both POI names contain a name prefix. If the name prefix identifications of the two POI names are the same, the two POI name prefixes are determined to be consistent; if they are different, the two POI name prefixes are determined to be inconsistent.

Case 2: one of the two POI names contains a name prefix and the other does not. The two POI name prefixes are determined to be consistent.

Case 3: neither POI name contains a name prefix. The two POI name prefixes are determined to be consistent.
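The three cases reduce to a single rule: the prefixes are inconsistent only when both names carry a prefix and the identifiers differ. A minimal sketch, in which a missing prefix is represented by `None` and the function name `prefixes_consistent` is hypothetical:

```python
def prefixes_consistent(prefix_id_a, prefix_id_b):
    """Apply the three-case rule; None means the name has no prefix."""
    if prefix_id_a is None or prefix_id_b is None:
        return True  # cases 2 and 3: at least one name has no prefix
    return prefix_id_a == prefix_id_b  # case 1: compare identifiers
```

Treating a missing prefix as consistent avoids rejecting pairs in which one data source simply omitted the administrative division.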

Specifically, when the name prefix similarity of two POIs is determined to be 1, calculating the subject similarity, the name suffix similarity and the bracket content similarity of the two POIs, and calculating the name similarity of the two POIs according to the subject similarity, the name suffix similarity and the bracket content similarity.

Calculating name suffix similarity of two POIs, comprising: judging whether the name suffix identifications of the two POIs are consistent, if so, determining that the similarity of the name suffixes of the two POIs is 1; and if not, determining that the similarity of the name suffixes of the two POIs is 0.

When judging whether the name suffix identifications of the two POIs are consistent, the following conditions are included:

Case 1: both POI names contain a name suffix. If the name suffix identifications of the two POI names are the same, the two POI name suffixes are determined to be consistent; if they are different, the two POI name suffixes are determined to be inconsistent.

Case 2: one of the two POI names contains a name suffix and the other does not. The two POI name suffixes are determined to be consistent.

Case 3: neither POI name contains a name suffix. The two POI name suffixes are determined to be consistent.

Specifically, the similarity between the parenthesized contents in the POI names of the two POIs is calculated by the following formula:

where A represents the parenthesis content in the name of one of the two POIs, |A| represents the character string length of A, B represents the parenthesis content in the name of the other POI, |B| represents the character string length of B, and Edit(A, B) represents the edit distance between A and B.

Specifically, the calculating of the subject similarity of the two POIs may specifically be obtained by the following formula:

wherein α is a preset weighting harmonic coefficient (for example, 0.8, but not limited thereto), and $S_1$ is the main body similarity of the POI names of the two POIs calculated based on the edit distance, specifically

where A represents the main body content in the name of one of the two POIs, |A| represents the character string length of A, B represents the main body content in the name of the other POI, |B| represents the character string length of B, and Edit(A, B) represents the edit distance between A and B; and $S_2$ is the main body similarity of the POI names of the two POIs calculated based on the Jaccard coefficient, specifically

$$S_2 = \frac{|A \cap B|}{|A \cup B|}$$

It should be noted that the edit distance refers to the minimum number of operations for changing one character string into another character string through the insertion and deletion operations.

For example, if A is "abcd" and B is "abdd", then |A| = 4, |B| = 4, Edit(A, B) = 2, |A ∩ B| = 3, and |A ∪ B| = 4.
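The quantities in this example can be checked with a short Python sketch. It assumes, as the definition above states, that only insertions and deletions count toward the edit distance, so the distance equals |A| + |B| minus twice the length of the longest common subsequence. It also assumes that the intersection and union are taken over the sets of distinct characters, which is the reading that reproduces the example values. The function names are hypothetical, and the sketch covers only these building blocks, not the combined similarity formulas.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0]
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def indel_distance(a, b):
    """Minimum number of single-character insertions and deletions turning a into b."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


def jaccard(a, b):
    """Jaccard coefficient over the sets of distinct characters in a and b."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0
```

For "abcd" and "abdd" the distance is 2 (delete "c", insert "d"), whereas a distance that also allowed substitutions would give 1.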

When the main body similarity, the name suffix similarity and the bracket content similarity of the two POIs are obtained through calculation, the name similarity of the two POIs is obtained through calculation in the following mode:

$$S_{\text{name}} = S_{\text{suffix}} \cdot w_{\text{suffix}} + S_{\text{subject}} \cdot w_{\text{subject}} + S_{\text{bracket}} \cdot w_{\text{bracket}}$$

where $S_{\text{name}}$ represents the name similarity of the two POIs, $S_{\text{suffix}}$ represents the name suffix similarity of the two POIs, $S_{\text{subject}}$ represents the main body similarity of the two POIs, $S_{\text{bracket}}$ represents the bracket content similarity of the two POIs, $w_{\text{suffix}}$ is the weight of the name suffix similarity, $w_{\text{subject}}$ is the weight of the main body similarity, and $w_{\text{bracket}}$ is the weight of the bracket content similarity, with $w_{\text{suffix}} + w_{\text{subject}} + w_{\text{bracket}} = 1$.

It should be noted that each weight value may be determined empirically or according to actual needs and is not limited herein; the main body weight is generally larger than the suffix weight and the bracket content weight.

For example: the main body weight is 0.8, the suffix weight is 0.1, and the bracket content weight is 0.1.
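As a worked illustration of the weighted sum, the sketch below uses the example weights 0.8, 0.1 and 0.1 as defaults and checks that the weights sum to 1. The function name `name_similarity` and the input similarity values are hypothetical.

```python
def name_similarity(s_suffix, s_subject, s_bracket,
                    w_suffix=0.1, w_subject=0.8, w_bracket=0.1):
    """Weighted sum of the three component similarities; weights must sum to 1."""
    if abs(w_suffix + w_subject + w_bracket - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return s_suffix * w_suffix + s_subject * w_subject + s_bracket * w_bracket


# Matching suffixes, half-similar main bodies, unrelated bracket content:
# 1.0 * 0.1 + 0.5 * 0.8 + 0.0 * 0.1 = 0.5
score = name_similarity(1.0, 0.5, 0.0)
```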

In the embodiment of the present invention, calculating the address similarity between the two POIs includes: judging whether the address prefix identifications of the two POIs are consistent; if they are not consistent, determining that the address similarity of the two POIs is 0; if they are consistent, determining that the address prefix similarity of the two POIs is 1, calculating the main body similarity and the address subsequence similarity of the two POIs, and calculating the address similarity of the two POIs according to the main body similarity and the address subsequence similarity. Here, the main body refers to the part of the POI address other than the address prefix, the address suffix and the parenthesis content.

Specifically, since the number of POI address prefixes is uncertain, there may be multiple, one, or no address prefixes, when determining whether the address prefix identifiers of the two POIs are consistent, the following cases are included:

Case 1: both POI addresses contain an address prefix. If the address prefix identifiers of the two POI addresses are the same, the address prefixes of the two POIs are determined to be consistent; if they are different, the address prefixes are determined to be inconsistent.

Case 2: one POI address contains an address prefix and the other does not. The address prefixes of the two POIs are determined to be consistent.

Case 3: neither POI address contains an address prefix. The address prefixes of the two POIs are determined to be consistent.
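The three cases can be sketched in a few lines. The sketch assumes, for illustration only, that each address carries a single prefix identifier string and that a missing prefix is represented by None; the function name is hypothetical.

```python
from typing import Optional


def address_prefixes_consistent(prefix_id_a: Optional[str],
                                prefix_id_b: Optional[str]) -> bool:
    """Return True when two address prefix identifiers count as consistent."""
    # Cases 2 and 3: at least one address has no prefix, so nothing
    # contradicts the other address and the prefixes are consistent.
    if prefix_id_a is None or prefix_id_b is None:
        return True
    # Case 1: both addresses have a prefix; the identifiers must be equal.
    return prefix_id_a == prefix_id_b
```

Only Case 1 can yield an inconsistent result, so a missing prefix never forces the address similarity to 0.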

Specifically, when the address prefix similarity of the two POIs is determined to be 1, the main body similarity and the address subsequence similarity of the two POIs are calculated, and the address similarity of the two POIs is calculated according to the main body similarity and the address subsequence similarity.

Since the POI address of the POI is preprocessed, a portion other than the address prefix and the address suffix of the POI address can be regarded as a subject of the POI address.

Specifically, the main body similarity of two POI addresses is calculated by the following formula:

wherein S_MiddA(A, B) is the main body similarity of the two POIs, A is the main body of one of the two POI addresses, B is the main body of the other POI address, m is the number of address elements obtained by word segmentation of main body A, n is the number of address elements obtained by word segmentation of main body B, a_i is the maximum among the similarities between the i-th address element in main body A and each address element in main body B, and b_j is the maximum among the similarities between the j-th address element in main body B and each address element in main body A; the similarity between address elements of main body A and main body B is obtained according to the following formula:

wherein S(A_i, B_j) represents the similarity between the i-th address element in main body A and the j-th address element in main body B, |A_i| and |B_j| are respectively the lengths of address elements A_i and B_j, and Edit(A_i, B_j) is the edit distance between address elements A_i and B_j.
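The quantities defined above can be illustrated with a short sketch. Two forms are assumed here purely for illustration and are not taken from the text: the element similarity is assumed to be 1 - Edit(A_i, B_j) / max(|A_i|, |B_j|), and the main body similarity is assumed to be the mean of all a_i and b_j, that is (sum of a_i + sum of b_j) / (m + n). Word segmentation is assumed to have been done beforehand, and all function names are hypothetical.

```python
from typing import List


def edit_distance(s: str, t: str) -> int:
    """Levenshtein edit distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def element_similarity(x: str, y: str) -> float:
    """ASSUMED form: 1 - Edit(x, y) / max(|x|, |y|)."""
    longest = max(len(x), len(y))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(x, y) / longest


def body_similarity(a_elems: List[str], b_elems: List[str]) -> float:
    """ASSUMED form: mean of the best-match similarities a_i and b_j."""
    if not a_elems or not b_elems:
        return 0.0
    # a_i: best match of each element of A among the elements of B.
    a_best = [max(element_similarity(x, y) for y in b_elems) for x in a_elems]
    # b_j: best match of each element of B among the elements of A.
    b_best = [max(element_similarity(y, x) for x in a_elems) for y in b_elems]
    return (sum(a_best) + sum(b_best)) / (len(a_elems) + len(b_elems))
```

Under these assumptions, two main bodies that differ only in a house number still score high, because the remaining elements match exactly and the differing elements are only one edit apart.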

It should be noted that there may be an inclusion relationship between the POI addresses of the two POIs, and therefore, in addition to calculating the main body similarity of the two POIs, the address subsequence similarity of the two POIs also needs to be calculated.

It should be noted that the address subsequence refers to one address string that includes another address string, that is, the address string in the POI address of one POI in the two POIs includes the address string in the POI address of the other POI.

In the embodiment of the present invention, calculating the address subsequence similarity of two POIs specifically includes:

and judging whether the address of one POI is completely contained in the address of the other POI in the POI addresses of the two POIs, if so, determining that the address subsequence similarity of the two POIs is 1, and otherwise, determining that the address subsequence similarity of the two POIs is 0.

For example: the two POI addresses are respectively 'middle guan street 135' and 'middle guan street 35', no inclusion relationship exists between the two POI addresses, and the similarity of the address sub-sequences of the two POI addresses is 0.
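The containment test can be sketched as a plain substring check. This is an illustrative reading of "completely contained": the function name is hypothetical, and treating an empty address as never contained is an assumption, since the text does not cover missing addresses.

```python
def address_subsequence_similarity(addr_a: str, addr_b: str) -> int:
    """Return 1 if one address string fully contains the other, else 0."""
    # Assumption: an empty address is not treated as contained in anything.
    if not addr_a or not addr_b:
        return 0
    return 1 if (addr_a in addr_b or addr_b in addr_a) else 0
```

Applied to the example above, neither "middle guan street 135" nor "middle guan street 35" occurs as a whole inside the other, so the result is 0.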

After the main body similarity and the address subsequence similarity of the two POIs have been calculated, the address similarity of the two POIs is calculated from them as follows:

when the address subsequence similarity of the two POIs is 0, the address similarity of the two POIs is taken to be their calculated main body similarity;

when the address subsequence similarity of the two POIs is 1, the address similarity of the two POIs is obtained by the following formula:

wherein S_MiddA(A, B) represents the main body similarity of the two POIs; S_{address subsequence} represents the address subsequence similarity of the two POIs, whose value is 1; w_{MiddA weight} represents the weight value of the main body similarity of the two POIs; and w_{address subsequence weight} represents the weight value of the address subsequence similarity of the two POIs, wherein w_{MiddA weight} + w_{address subsequence weight} = 1.

It should be noted that the weight value corresponding to the "MiddA weight" and the weight value corresponding to the "address subsequence weight" may be determined according to actual needs or according to empirical values, which is not limited herein. For example: the weight value corresponding to the "MiddA weight" is 0.6, and the weight value corresponding to the "address subsequence weight" is 0.4.
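The way the two branches combine can be sketched as follows. The weighted-sum form used when the address subsequence similarity is 1 is an assumption, chosen to be consistent with the weights and their example values 0.6 and 0.4; the function and parameter names are hypothetical.

```python
def address_similarity(prefix_consistent: bool, s_body: float, s_subseq: int,
                       w_body: float = 0.6, w_subseq: float = 0.4) -> float:
    """Address similarity of two POIs (illustrative sketch)."""
    # Inconsistent address prefix identifiers give an address similarity of 0.
    if not prefix_consistent:
        return 0.0
    # No containment relation: fall back to the main body similarity alone.
    if s_subseq == 0:
        return s_body
    # Containment relation: ASSUMED weighted sum of the two similarities.
    return s_body * w_body + s_subseq * w_subseq
```

With the example weights, a containment relation lifts a main body similarity of 0.5 to an address similarity of 0.7.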

In the embodiment of the present invention, calculating the similarity between the telephone numbers of the two POIs specifically includes: judging whether the telephone numbers of the two POIs are consistent; if the two POIs are consistent, determining that the telephone number similarity of the two POIs is 1; and if the two POIs are not consistent, determining that the telephone number similarity of the two POIs is 0.

In the embodiment of the present invention, calculating the type similarity of the two POIs specifically includes: judging whether the POI types of the two POIs are consistent; if they are consistent, determining that the type similarity of the two POIs is 1; if they are inconsistent, determining that the type similarity of the two POIs is 0.

In the embodiment of the present invention, the POI similarity of the two POIs is calculated from the similarities of the attribute information they contain: the sum of the products of each attribute similarity and the weight value corresponding to that attribute is determined as the POI similarity of the two POIs, where the weight values corresponding to the attributes sum to 1. Taking attribute information that includes the name, address, telephone number and type as an example, the POI similarity of the two POIs is calculated from the calculated name similarity, address similarity, telephone number similarity and type similarity by the following formula:

S = S_{name} * w_{name weight} + S_{address} * w_{address weight} + S_{telephone number} * w_{telephone number weight} + S_{type} * w_{type weight}

wherein S_{name} represents the name similarity of the two POIs, S_{address} represents the address similarity of the two POIs, S_{telephone number} represents the telephone number similarity of the two POIs, S_{type} represents the type similarity of the two POIs, w_{name weight} is the weight value of the name similarity, w_{address weight} is the weight value of the address similarity, w_{telephone number weight} is the weight value of the telephone number similarity, and w_{type weight} is the weight value of the type similarity, wherein w_{name weight} + w_{address weight} + w_{telephone number weight} + w_{type weight} = 1.

Preferably, to further improve the accuracy of the POI similarity, in the embodiment of the present invention the distance similarity of the two POIs is also taken into account when calculating the POI similarity: when the distance between the two POIs is greater than or equal to the preset distance threshold, the distance similarity of the two POIs is determined to be 0, and when the distance is less than the preset distance threshold, the distance similarity is determined to be 1. Specifically, the POI similarity may be calculated as follows:

S = (S_{name} * w_{name weight} + S_{address} * w_{address weight} + S_{distance} * w_{distance weight} + S_{telephone number} * w_{telephone number weight} + S_{type} * w_{type weight}) * 100

wherein S_{name} represents the name similarity of the two POIs, S_{address} represents the address similarity of the two POIs, S_{distance} represents the distance similarity of the two POIs, S_{telephone number} represents the telephone number similarity of the two POIs, S_{type} represents the type similarity of the two POIs, w_{name weight} is the weight value of the name similarity, w_{address weight} is the weight value of the address similarity, w_{distance weight} is the weight value of the distance similarity, w_{telephone number weight} is the weight value of the telephone number similarity, and w_{type weight} is the weight value of the type similarity, wherein w_{name weight} + w_{address weight} + w_{distance weight} + w_{telephone number weight} + w_{type weight} = 1.

It should be noted that the weight values corresponding to the "name weight", the "address weight", the "distance weight", the "telephone number weight" and the "type weight" may be determined according to actual needs or according to experimental data, which is not limited herein. In general, the weight values corresponding to the "name weight" and the "address weight" are greater than those corresponding to the "distance weight", the "telephone number weight" and the "type weight".

For example: the weight value corresponding to the "name weight" is 0.45, the weight value corresponding to the "address weight" is 0.45, the weight value corresponding to the "distance weight" is 0, the weight value corresponding to the "telephone number weight" is 0.05, and the weight value corresponding to the "type weight" is 0.05.
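The five-term formula with the example weights above can be sketched as follows. The dictionary keys and the function name are hypothetical; the attribute similarities are assumed to have been computed beforehand, and the result is scaled by 100 as in the formula.

```python
from typing import Dict

# Example weight values from the text: name 0.45, address 0.45, distance 0,
# telephone number 0.05, type 0.05.
EXAMPLE_WEIGHTS: Dict[str, float] = {
    "name": 0.45, "address": 0.45, "distance": 0.0, "phone": 0.05, "type": 0.05,
}


def poi_similarity(sims: Dict[str, float],
                   weights: Dict[str, float] = EXAMPLE_WEIGHTS) -> float:
    """Weighted POI similarity on a 0 to 100 scale."""
    # The weight values are required to sum to 1.
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return 100 * sum(sims[key] * weight for key, weight in weights.items())
```

For instance, a pair with name similarity 1, address similarity 0.5, distance similarity 1, telephone number similarity 0 and type similarity 1 scores 72.5, which would then be compared with the preset similarity threshold.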

In the embodiment of the present invention, after the redundant data in the electronic map database is determined by the method shown in fig. 1A or fig. 2A, the method may further include the following step: processing the determined redundant data to remove the redundant data from the electronic map database. For example: POI1 and POI2 are determined to be redundant data; POI1 and POI3 are determined to be redundant data; POI2 and POI8 are determined to be redundant data; and so on. POI1, POI2, POI3, POI8, … are then redundant data and need to be processed: they may be merged into one POI, or an optimal one among them may be selected as the final POI and the other POIs deleted.
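Collecting pairwise results into groups, as in the POI1, POI2, POI3, POI8 example, can be sketched with a union-find structure. The choice of union-find is an assumption made for illustration; the text does not say how the groups are formed, and the function name is hypothetical.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def group_redundant(pairs: Iterable[Tuple[Hashable, Hashable]]) -> List[Set]:
    """Group POI ids so that every redundant pair lands in the same set."""
    parent: Dict[Hashable, Hashable] = {}

    def find(x):
        # Register unseen ids as their own root, then walk up to the root,
        # shortening the path along the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups: Dict[Hashable, Set] = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())
```

Each resulting group can then be merged into a single POI, or reduced to one selected POI with the others deleted.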

Through the scheme of the first embodiment of the invention, every two POIs in the electronic map database are paired; for each pairing, the following steps are performed: calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing; judging whether the distance between the two POIs is larger than or equal to a preset distance threshold value or not; if yes, determining that the two POIs in the pairing are not redundant data; if not, calculating the similarity of the POI according to the attribute information of the two POI in the pairing, and judging whether the similarity of the POI is larger than or equal to a preset similarity threshold value, if so, determining that the two POI are redundant data, otherwise, determining that the two POI are not redundant data. According to the scheme, the redundant data of the POI in the electronic map database can be determined, the POI in the electronic map database is subjected to redundancy judgment, when a plurality of POI belonging to the redundant data are determined, the redundant POI are combined in time, the data volume of POI storage is reduced, the storage resources occupied by the POI can be saved, and the performance of the system is improved.
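The overall flow of the first embodiment can be sketched as follows. The haversine great-circle formula is an assumption, since the text only says that the distance is calculated from the longitude and latitude coordinates; the dictionary keys, function names and the similarity callback are hypothetical.

```python
import math
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float,
                radius_m: float = 6371000.0) -> float:
    """Great-circle distance in metres (assumed distance measure)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(h))


def find_redundant_pairs(pois: List[Dict],
                         similarity: Callable[[Dict, Dict], float],
                         dist_threshold_m: float,
                         sim_threshold: float) -> List[Tuple]:
    """Pair POIs two by two and return the pairs judged redundant."""
    redundant = []
    for a, b in combinations(pois, 2):
        d = haversine_m(a["lat"], a["lon"], b["lat"], b["lon"])
        # Distance at or above the threshold: not redundant, and the
        # attribute similarity is never computed for this pair.
        if d >= dist_threshold_m:
            continue
        if similarity(a, b) >= sim_threshold:
            redundant.append((a["id"], b["id"]))
    return redundant
```

The distance test acts as an inexpensive filter, so the more costly attribute comparison runs only for pairs that are geographically close.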

Example two:

Fig. 2A is a schematic structural diagram of a device for calculating similarity of information points according to the second embodiment of the present invention. The device has the function of executing the method according to the first embodiment of the present invention, and includes: a matching unit 21, a distance calculating unit 22, a judging unit 23, and a similarity judging unit 24, wherein:

the matching unit 21 is used for pairwise matching POI in the electronic map database;

the distance calculating unit 22 is used for calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing and triggering the judging unit 23;

the judging unit 23 is configured to judge whether the distance between the two POIs is greater than or equal to a preset distance threshold; if yes, determining that the two POIs in the pairing are not redundant data; if not, the similarity judgment unit 24 is triggered;

and the similarity judgment unit 24 is configured to calculate POI similarity according to the attribute information of the two POIs in the pairing, and judge whether the POI similarity is greater than or equal to a preset similarity threshold, if so, determine that the two POIs are redundant data, and if not, determine that the two POIs are not redundant data.

Specifically, the attribute information of the POI includes at least one of: POI name, POI address, telephone number, POI type;

the similarity determination unit 24 calculates the similarity according to the attribute information of the two POIs in the pairing, and specifically includes: when the attribute information comprises POI names, calculating the name similarity of the two POIs; when the attribute information comprises POI addresses, calculating the address similarity of the two POIs; when the attribute information comprises telephone numbers, calculating the telephone number similarity of the two POIs; when the attribute information comprises POI types, calculating the type similarity of the two POIs; and calculating the POI similarity of the two POIs according to the name similarity, the address similarity, the telephone number similarity and the type similarity obtained by calculation.

Optionally, the apparatus further comprises a preprocessing unit 25, as shown in fig. 2B, wherein:

the preprocessing unit 25 is configured to preprocess the attribute information of the POIs in the electronic map database before the matching unit 21 pairwise pairs the POIs in the electronic map database.

Specifically, the preprocessing unit 25 preprocesses the POI name in the attribute information of the POI, which specifically includes: determining whether the POI name contains a name prefix according to a preset administrative division prefix table, and if so, recording the name prefix and the name prefix identifier of the POI name; determining whether the POI name contains a name suffix according to a preset suffix list, and if so, recording the name suffix and the name suffix identifier of the POI name; judging whether the POI name contains brackets, and if so, deleting the brackets and the content in the brackets from the POI name and recording the content in the brackets; judging whether the POI name contains a chain store name according to a preset chain store name table, and if so, replacing the POI name with the chain store name; judging whether the POI name contains synonyms or abbreviations according to a preset synonym and abbreviation table, and if so, replacing the corresponding synonyms or abbreviations in the POI name with the default words corresponding to them in the synonym and abbreviation table; judging whether the POI name contains an alias according to a preset name alias table, and if so, replacing the alias in the POI name with the default word corresponding to that alias in the name alias table;

the similarity determination unit 24 calculates the name similarity of the two POIs, and specifically includes: judging whether the name prefix identifications of the two POIs are consistent; if the two POIs are not consistent, determining that the name similarity of the two POIs is 0; if the two POIs are consistent, determining that the similarity of the name prefixes of the two POIs is 1, calculating the similarity of the body, the name suffix and the parenthesis content of the two POIs, and calculating the similarity of the names of the two POIs according to the similarity of the body, the similarity of the name suffix and the parenthesis content, wherein the body refers to the part of the POI names except the name prefixes, the name suffixes and the parenthesis content.
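Part of the name preprocessing performed by the preprocessing unit 25 can be sketched as follows. All table contents are hypothetical placeholders, and the chain store and alias replacements are omitted for brevity. As a simplification, the sketch removes the bracketed content first so that the suffix can be matched with a plain end-of-string test, and it records the suffix string itself instead of a separate suffix identifier.

```python
import re
from typing import Dict, List, Optional

# Hypothetical lookup tables; real tables would be far larger.
PREFIX_TABLE: Dict[str, str] = {"Beijing": "110000", "Shanghai": "310000"}
SUFFIX_LIST: List[str] = ["Hotel", "Hospital"]
SYNONYMS: Dict[str, str] = {"Rd": "Road"}


def preprocess_name(name: str) -> Dict[str, Optional[str]]:
    """Split a POI name into prefix, suffix, bracket content and main body."""
    record: Dict[str, Optional[str]] = {
        "prefix": None, "prefix_id": None, "suffix": None, "bracket": None,
    }
    # Delete the first bracketed part (ASCII or full-width brackets) and
    # record its content.
    match = re.search(r"[(（]([^)）]*)[)）]", name)
    if match:
        record["bracket"] = match.group(1).strip()
        name = (name[:match.start()] + name[match.end():]).strip()
    # Record the administrative division prefix and its identifier.
    for prefix, prefix_id in PREFIX_TABLE.items():
        if name.startswith(prefix):
            record["prefix"], record["prefix_id"] = prefix, prefix_id
            name = name[len(prefix):].strip()
            break
    # Record the name suffix.
    for suffix in SUFFIX_LIST:
        if name.endswith(suffix):
            record["suffix"] = suffix
            name = name[:-len(suffix)].strip()
            break
    # Replace synonyms or abbreviations by their default words.
    record["body"] = " ".join(SYNONYMS.get(word, word) for word in name.split())
    return record
```

The returned record holds exactly the pieces that the name similarity calculation compares: the prefix identifier for the consistency check, and the main body, suffix and bracket content for the three component similarities.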

Specifically, the preprocessing unit 25 preprocesses the POI address in the attribute information of the POI, which specifically includes: determining whether the POI address contains an address prefix according to a preset administrative division prefix table, and if so, recording the address prefix and the address prefix identifier of the POI address; judging whether the POI address contains brackets, and if so, deleting the brackets and the content in the brackets from the POI address; judging whether the POI address contains synonyms or abbreviations according to a preset synonym and abbreviation table, and if so, replacing the corresponding synonyms or abbreviations in the POI address with the default words corresponding to them in the synonym and abbreviation table; judging whether the POI address contains an alias according to a preset address alias table, and if so, replacing the alias in the POI address with the default word corresponding to that alias in the address alias table;

the similarity determination unit 24 calculates the address similarity of the two POIs, and specifically includes: judging whether the address prefix identifications of the two POIs are consistent; if the two POIs are not consistent, determining that the address similarity of the two POIs is 0; if the two POIs are consistent, determining that the similarity of the address prefixes of the two POIs is 1, calculating the similarity of the main body and the address subsequence of the two POIs, and calculating the similarity of the addresses of the two POIs according to the similarity of the main body and the address subsequence, wherein the main body refers to the contents of the POIs except the address prefixes, the address suffixes and the parenthesis.

Specifically, the similarity determination unit 24 calculates the subject similarity of two POIs, and obtains the subject similarity according to the following formula:

wherein S_MiddA(A, B) is the main body similarity of the two POIs, A is the main body of one POI, B is the main body of the other POI, m is the number of address elements obtained by word segmentation of main body A, n is the number of address elements obtained by word segmentation of main body B, a_i is the maximum among the similarities between the i-th address element in main body A and each address element in main body B, and b_j is the maximum among the similarities between the j-th address element in main body B and each address element in main body A;

wherein, the similarity of the address elements in the main body A and the main body B is obtained according to the following formula:

wherein S(A_i, B_j) represents the similarity between the i-th address element in main body A and the j-th address element in main body B, |A_i| and |B_j| are respectively the lengths of address elements A_i and B_j, and Edit(A_i, B_j) is the edit distance between address elements A_i and B_j;

and/or,

the calculating the address subsequence similarity of the two POIs specifically comprises:

and judging whether the address of one POI is completely contained in the address of the other POI in the two POIs, if so, determining that the address subsequence similarity of the two POIs is 1, and otherwise, determining that the address subsequence similarity of the two POIs is 0.

Specifically, the preprocessing unit 25 preprocesses the phone number in the POI attribute information, and specifically includes:

converting the telephone number of the POI according to a preset telephone number format;

the calculating of the similarity between the telephone numbers of the two POIs by the similarity determination unit 24 specifically includes: judging whether the telephone numbers of the two POIs are consistent; if the two POIs are consistent, determining that the telephone number similarity of the two POIs is 1; and if the two POIs are not consistent, determining that the telephone number similarity of the two POIs is 0.
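Telephone number preprocessing and comparison can be sketched as follows. The preset telephone number format is not specified in the text, so a digits-only form is assumed here for illustration; country codes, extensions and missing numbers are not handled, and the function names are hypothetical.

```python
import re


def normalize_phone(raw: str) -> str:
    """Convert a telephone number to an ASSUMED digits-only format."""
    return re.sub(r"\D", "", raw)


def phone_similarity(raw_a: str, raw_b: str) -> int:
    """Return 1 if the normalized telephone numbers are identical, else 0."""
    return 1 if normalize_phone(raw_a) == normalize_phone(raw_b) else 0
```

Normalizing before comparing means that differences in punctuation or spacing alone do not turn a matching number into a mismatch.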

Specifically, the calculating of the type similarity of the two POIs by the similarity determination unit 24 specifically includes: judging whether the POI types of the two POIs are consistent; if they are consistent, determining that the type similarity of the two POIs is 1; if they are inconsistent, determining that the type similarity of the two POIs is 0.

It should be noted that the computing device according to the embodiment of the present invention may be implemented by hardware, or may be implemented by software, which is not limited specifically herein.

The computing equipment provided by the embodiment of the invention can determine whether two or more POI data are redundant data or not in the data fusion stage, and if the two or more POI data are redundant data, the redundant POI data are combined in time, so that the data volume of stored POI data is reduced, system resources occupied by the POI data can be saved, and the performance of the system is improved.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device), or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for judging data redundancy of information points is characterized by comprising the following steps:

pairwise matching POIs in the electronic map database;

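for each pairing, the following steps are performed:

calculating the distance between the two POIs according to the longitude and latitude coordinates of the two POIs in the pairing;

judging whether the distance between the two POIs is greater than or equal to a preset distance threshold;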

if yes, determining that the two POIs in the pairing are not redundant data;

if not, calculating POI similarity according to the attribute information of the two POIs in the pairing, and judging whether the POI similarity is larger than or equal to a preset similarity threshold value, if so, determining that the two POIs are redundant data, otherwise, determining that the two POIs are not redundant data;

the attribute information of the POI includes at least one of: POI name, POI address, telephone number, POI type;

before pairwise pairing POIs in the electronic map database, the method further comprises the following steps:

preprocessing attribute information of POI in the electronic map database;

preprocessing a POI address in the attribute information of the POI, which specifically comprises the following steps:

determining whether the POI address contains an address prefix according to a preset administrative division prefix table, and if so, recording the address prefix and an address prefix identifier of the POI address;

judging whether the POI address contains brackets or not, and if so, deleting the brackets and the content in the brackets from the POI address;

judging whether the POI address contains synonyms or acronyms or not according to a preset synonym abbreviation list, and if so, replacing the corresponding synonyms or acronyms in the POI address with default words corresponding to the synonyms or acronyms in the synonym abbreviation list;

judging whether the POI address contains an alias according to a preset address alias table, and if so, replacing the alias in the POI address with the default word corresponding to that alias in the address alias table;

the calculating the address similarity of the two POIs specifically includes:

judging whether the address prefix identifications of the two POIs are consistent;

if the two POIs are not consistent, determining that the address similarity of the two POIs is 0;

if the two POIs are consistent, determining that the similarity of the address prefixes of the two POIs is 1, calculating the similarity of the main body and the address subsequence of the two POIs, and calculating the similarity of the addresses of the two POIs according to the similarity of the main body and the address subsequence, wherein the main body refers to the partial content of the address of the POI except the address prefixes, the address suffixes and the parenthesis content;

and the address subsequence similarity is used for judging whether the address of one POI is completely contained in the address of the other POI in the two POIs.

2. The method according to claim 1, wherein calculating the similarity according to the attribute information of the two POIs in the pair specifically comprises:

when the attribute information comprises POI names, calculating the name similarity of the two POIs;

when the attribute information comprises POI addresses, calculating the address similarity of the two POIs;

when the attribute information comprises telephone numbers, calculating the telephone number similarity of the two POIs;

when the attribute information comprises POI types, calculating the type similarity of the two POIs;

and calculating the POI similarity of the two POIs according to the name similarity, the address similarity, the telephone number similarity and the type similarity obtained by calculation.

3. The method according to claim 2, wherein preprocessing the POI name in the attribute information of the POI specifically comprises:

determining whether the POI name contains a name prefix according to a preset administrative division prefix table, and if so, recording the name prefix and the name prefix identification of the POI name;

determining whether the POI name contains a name suffix or not according to a preset suffix list, and recording the name suffix and the name suffix identification of the POI name if the POI name contains the name suffix;

judging whether the POI name contains brackets or not, if so, deleting the brackets and the content in the brackets from the POI name, and recording the content in the brackets;

judging whether the POI name contains the chain store name or not according to a preset chain store name table, and if so, replacing the POI name with the chain store name;

judging whether the POI name contains synonyms or short words or not according to a preset synonym short word table, and if so, replacing the corresponding synonyms or short words in the POI name by default words corresponding to the synonyms or short words in the synonym short word table;

judging whether the POI name contains an alias according to a preset name alias table, and if so, replacing the alias in the POI name with the default word corresponding to that alias in the name alias table;

the calculating of the name similarity of the two POIs specifically includes:

judging whether the name prefix identifications of the two POIs are consistent;

if the two POIs are not consistent, determining that the name similarity of the two POIs is 0;

if the two POIs are consistent, determining that the similarity of the name prefixes of the two POIs is 1, calculating the similarity of the body, the name suffix and the parenthesis content of the two POIs, and calculating the similarity of the names of the two POIs according to the similarity of the body, the similarity of the name suffix and the parenthesis content, wherein the body refers to the part of the POI names except the name prefixes, the name suffixes and the parenthesis content.

4. The method of claim 1, wherein the calculating the subject similarity of the two POIs specifically comprises:

the subject similarity of the two POIs is calculated according to the following formula:

wherein S(A_i, B_j) represents the similarity between the i-th address element in main body A and the j-th address element in main body B, |A_i| and |B_j| are respectively the lengths of address elements A_i and B_j, and Edit(A_i, B_j) is the edit distance between address elements A_i and B_j;

and/or,

5. The method of claim 1, wherein preprocessing the telephone number in the POI attribute information specifically comprises:

converting the telephone number of the POI according to a preset telephone number format;

calculating the telephone number similarity of the two POIs, specifically comprising:

judging whether the telephone numbers of the two POIs are consistent;

if they are consistent, determining that the telephone number similarity of the two POIs is 1;

and if they are not consistent, determining that the telephone number similarity of the two POIs is 0.

6. The method of claim 1, wherein calculating the type similarity of the two POIs comprises:

judging whether the POI types of the two POIs are consistent;

if they are consistent, determining that the type similarity of the two POIs is 1;

and if they are not consistent, determining that the type similarity of the two POIs is 0.
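The telephone number similarity of claim 5 and the type similarity of claim 6 follow the same all-or-nothing rule, which can be sketched in Python as below. The function name is hypothetical, and "consistent" is assumed here to mean equality of the already preprocessed values.

```python
def exact_match_similarity(value_a, value_b):
    """Return 1 when the two attribute values are consistent, otherwise 0.

    ASSUMPTION: consistency is modelled as plain equality of values that
    have already been preprocessed.
    """
    return 1 if value_a == value_b else 0


# Usage with hypothetical values: the same function serves both attributes.
phone_sim = exact_match_similarity("01012345678", "01012345678")
type_sim = exact_match_similarity("hotel", "hospital")
```

Here `phone_sim` is 1 because the numbers match and `type_sim` is 0 because the types differ.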

7. An apparatus for discriminating data redundancy of information points, comprising:

the judging unit is used for judging whether the distance between the two POIs is larger than or equal to a preset distance threshold value; if yes, determining that the two POIs in the pairing are not redundant data; if not, triggering the similarity judgment unit;

the similarity judgment unit is used for calculating POI similarity according to the attribute information of the two POIs in the pairing, judging whether the POI similarity is larger than or equal to a preset similarity threshold value, if so, determining that the two POIs are redundant data, otherwise, determining that the two POIs are not redundant data;

the discrimination apparatus further includes: a pre-processing unit, wherein:

the preprocessing unit is used for preprocessing the attribute information of the POIs in the electronic map database before the matching unit pairs the POIs in the electronic map database;

the preprocessing unit preprocesses the POI address in the attribute information of the POI, and specifically includes:

the similarity judgment unit calculating the address similarity of the two POIs specifically includes:

8. The apparatus according to claim 7, wherein the similarity determination unit calculates the similarity according to the attribute information of the two POIs in the pair, and specifically includes:

9. The apparatus according to claim 8, wherein the preprocessing unit preprocesses the POI address in the attribute information of the POI, and specifically includes:

10. The apparatus according to claim 7, wherein the similarity determination unit calculates the subject similarity between two POIs, and specifically includes:

[symbol not reproduced] is the edit distance between address elements A_i and B_j;

and/or,

11. The apparatus according to claim 7, wherein the preprocessing unit preprocesses the phone number in the POI attribute information, and specifically includes:

the similarity determination unit calculating the telephone number similarity of the two POIs specifically includes:

judging whether the telephone numbers of the two POIs are consistent;

12. The apparatus according to claim 7, wherein the similarity determination unit calculates the type similarity between the two POIs, and specifically includes:

judging whether the POI types of the two POIs are consistent;