CN102789467A

CN102789467A - Data fusion method, data fusion device and data processing system

Info

Publication number: CN102789467A
Application number: CN2011101317655A
Authority: CN
Inventors: 张轩; 王东海
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2011-05-20
Filing date: 2011-05-20
Publication date: 2012-11-21

Abstract

The invention belongs to the field of information processing technology and provides a data fusion method, a data fusion device and a data processing system. The data fusion method includes the following steps: receiving input data; judging whether data identical to the input data exist in pre-stored data; and, when such identical data exist, adding the new description information in the input data to the identical data. Different description information about the same data can thus be fused, the information of the data is enriched, and user satisfaction with search results is improved.

Description

Data fusion method, data fusion device and data processing system

Technical field

The invention belongs to the technical field of information processing, and in particular relates to a data fusion method, a data fusion device and a data processing system.

Background

Point of interest (Point of Interest, POI) data generally include information such as name, category, address, and longitude and latitude. POI data can be acquired in several ways, for example field collection and Internet collection. Because the acquisition methods differ, records collected for the same POI may carry different description information.

How can the different description information of the same POI be merged? The key is how to judge that several collected POI records refer to the same POI. The prior art judges this by directly comparing the names of the POI records, which has a high error rate: because of the different acquisition methods, the names may not be exactly the same even though the records represent the same POI, for example:

Name 1: Quanjude (Yuquan Road)

Address 1: No. 44, Fuxing Road, Haidian District, Beijing;

Name 2: Quanjude Yuquan Road Branch

Address 2: No. 44, Fuxing Road, Haidian District, Beijing;

Although name 1 and name 2 differ, they denote the same position on the map and should therefore be regarded as the same POI. In addition, because the volume of POI data is large, judging whether records are the same POI by comparing their names pairwise takes a great deal of time, so the cost is high and the efficiency is low.

Summary of the invention

An embodiment of the invention provides a data fusion method intended to solve the problem that identical POI data carry different description information.

The embodiment of the invention is implemented as a data fusion method comprising the following steps:

Receiving input data;

Judging whether data identical to the input data exist in the pre-stored data;

When data identical to the input data exist in the pre-stored data, adding the new description information in the input data to the identical data.

Another object of the embodiment of the invention is to provide a data fusion device, comprising:

A data receiving unit, used to receive input data;

A judging unit, used to judge whether data identical to the input data exist in the pre-stored data;

A data fusion unit, used to add the new description information in the input data to the identical data when data identical to the input data exist in the pre-stored data.

A further object of the embodiment of the invention is to provide a data processing system comprising the data fusion device.

In the embodiments of the invention, it is judged whether data identical to the input data exist in the pre-stored data, and when they do, the new description information in the input data is added to the identical data. This effectively enriches the information of the data, reduces data redundancy, and improves user satisfaction with search results.

Description of drawings

Fig. 1 is a flowchart of the data fusion method provided by embodiment one of the invention;

Fig. 2 is a detailed flowchart of the identical-data judgment provided by embodiment two of the invention;

Fig. 3 is a structural diagram of the data fusion device provided by embodiment three of the invention;

Fig. 4 is a structural diagram of the judging unit provided by embodiment three of the invention.

Detailed description

In order to make the objects, technical solutions and advantages of the invention clearer, the invention is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

The embodiment of the invention judges whether data identical to the input data exist in the pre-stored data and, when they do, adds the new description information in the input data to the identical data. This effectively enriches the information of the data, reduces data redundancy, and improves user satisfaction with search results.

The technical solution of the invention is described below through specific embodiments.

Embodiment one:

Fig. 1 shows the flow of the data fusion method provided by embodiment one of the invention. The process is detailed as follows:

In step S101, input data are received.

In this embodiment, the data include but are not limited to point of interest data.

In step S102, it is judged whether data identical to the input data exist in the pre-stored data. If the result is "yes", step S103 is executed; if the result is "no", step S104 is executed.

In this embodiment, in order to fuse the different description information of identical data among the collected data and so enrich the data, the input data are compared with the pre-stored data as soon as they are received, to judge whether the pre-stored data contain data identical to the input data. The concrete steps of this judgment are shown in Fig. 2.

In step S103, when data identical to the input data exist, the new description information in the input data is added to the identical data.

In this embodiment, fusion means merging the different or new description information of several identical data into a single datum. For example:

Data 1:

Name 1: Quanjude (Yuquan Road)

Address 1: No. 44, Fuxing Road, Haidian District, Beijing

Phone 1: 12345678;

Data 2:

Name 2: Quanjude Yuquan Road Branch

Address 2: No. 44, Fuxing Road, Haidian District, Beijing

Phone 2: 87654321;

The judgment finds that data 1 and data 2 are the same data, so they are fused. The fused data are:

Name: Quanjude (Yuquan Road) or Quanjude Yuquan Road Branch

Address: No. 44, Fuxing Road, Haidian District, Beijing

Phone: 12345678 or 87654321;

Fusing the different description information of identical data effectively enriches the original data and improves user satisfaction with search results. Moreover, the original identical data contain some identical information (here, the address), so fusion reduces the redundancy of identical data and saves storage space.
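The fusion of data 1 and data 2 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `fuse` and the representation of a record as a dictionary mapping each field to a list of values are assumptions made for the example.

```python
def fuse(stored, incoming):
    """Merge the description fields of two records judged to be the same POI.

    Each record maps a field name to a list of values (an assumed layout).
    """
    merged = {}
    fields = list(stored) + [f for f in incoming if f not in stored]
    for field in fields:
        values = []
        for record in (stored, incoming):
            for value in record.get(field, []):
                if value not in values:      # identical values are kept once
                    values.append(value)
        merged[field] = values
    return merged


data1 = {"name": ["Quanjude (Yuquan Road)"],
         "address": ["No. 44, Fuxing Road, Haidian District, Beijing"],
         "phone": ["12345678"]}
data2 = {"name": ["Quanjude Yuquan Road Branch"],
         "address": ["No. 44, Fuxing Road, Haidian District, Beijing"],
         "phone": ["87654321"]}
fused = fuse(data1, data2)
```

The fused record keeps both names and both phone numbers, while the shared address is stored only once, which is the redundancy reduction the text describes.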

In step S104, when no data identical to the input data exist, the input data are stored.

In this embodiment, when no data identical to the input data exist, the input data are new data and are stored directly, so that they can be compared with the next input data.
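Steps S101 to S104 amount to a fuse-or-store loop, sketched below under simplifying assumptions. The names `process`, `same_address` and `union` are hypothetical; in particular, `same_address` is only a placeholder for the much richer identical-data judgment of embodiment two.

```python
def process(record, store, is_same, fuse):
    """S101-S104: fuse the input into a matching stored record, else store it."""
    for i, existing in enumerate(store):
        if is_same(existing, record):          # S102: identical data found
            store[i] = fuse(existing, record)  # S103: add new description info
            return "fused"
    store.append(record)                       # S104: new data, store directly
    return "stored"


# Placeholder judgment and fusion, for illustration only.
def same_address(a, b):
    return a["address"] == b["address"]


def union(a, b):
    return {**b, **a}   # keep existing fields, add fields only b has


store = []
first = process({"name": "Quanjude", "address": "No. 44, Fuxing Road"},
                store, same_address, union)
second = process({"address": "No. 44, Fuxing Road", "phone": "12345678"},
                 store, same_address, union)
```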

In the embodiments of the invention, it is judged whether data identical to the input data exist in the pre-stored data, and when they do, the new description information in the input data is added to the identical data. This effectively enriches the information of the data, improves user satisfaction with search results, reduces data redundancy and saves storage space.

Embodiment two:

Fig. 2 shows the detailed flow, provided by embodiment two of the invention, of judging whether data identical to the input data exist in the pre-stored data:

In step S201, the input data are preprocessed;

In this embodiment, preprocessing includes but is not limited to address-to-coordinate conversion, name splitting and address splitting.

Address-to-coordinate conversion means that, when the data have no longitude and latitude, the longitude and latitude are obtained from the address of the data.

Name splitting divides the name of the data into an address prefix, a core part, a branch part, and a key name (Keyname) plus suffix category part. The address prefix is obtained by segmenting the name into words, extracting the address words according to an address word sequence table, and then removing the last address word. For example, for "Beijing Changping District Huilongguan Restaurant", the address words are "Beijing Changping District Huilongguan"; removing the last address word gives the address prefix "Beijing Changping District", and "Huilongguan" is kept for the core part that follows. The branch part is obtained by finding the branch suffix word through "()" and a branch suffix table, and then checking whether the word before the branch suffix word is a place name or a street name, which yields the complete branch name. What remains after removing the address prefix and the branch part is the core part. Finally, taking the core part as the basis, the corresponding keyname and suffix category part are found by matching against a Keyname configuration table and a suffix category table, both of which are compiled manually.
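The address-prefix rule (take the leading address words, then give the last one back to the core part) can be sketched as follows. The table `ADDRESS_WORDS` and the function `split_address_prefix` are hypothetical, and the input is assumed to be already segmented into words.

```python
# Hypothetical stand-in for the address word table; real entries would come
# from the manually maintained tables the text describes.
ADDRESS_WORDS = {"Beijing", "Changping District", "Huilongguan"}


def split_address_prefix(words):
    """Split segmented name words into (address prefix, remaining words)."""
    count = 0
    while count < len(words) and words[count] in ADDRESS_WORDS:
        count += 1
    keep = max(count - 1, 0)   # the last address word is not part of the prefix
    return words[:keep], words[keep:]


prefix, rest = split_address_prefix(
    ["Beijing", "Changping District", "Huilongguan", "Restaurant"])
```

Here `prefix` holds the address prefix and `rest` starts with "Huilongguan", which stays available for the core part.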

In step S202, the name of the preprocessed data is cut into two-character words (binary word segmentation), and each resulting word is combined with the longitude and latitude of the data to generate a corresponding key;

In this embodiment, the name of the preprocessed data is cut into two-character words, and each resulting word is combined with the longitude and latitude of the data to generate a corresponding key (KEY). For example, the name "Kfc Zhongguancun Branch" is cut into 7 two-character words such as "kf", "fc", "Zhongguan", "Guancun" and "cun Branch". The longitude and latitude of the data are then range-checked to verify that the data lie within China, i.e. that 43.005 <= latitude <= 144.015 and 18.0 <= longitude <= 54.0 hold. When they hold, the grid indices are calculated as lat_key = int((latitude - 43.005) * 1000) / 15 and long_key = int((longitude - 18.000) * 1000) / 10, where 1000 expresses a range of one kilometer and 10 and 15 are constants. Finally, each word (for example "Zhongguan") + lat_key + long_key forms a KEY. Every pre-stored KEY has a corresponding data list containing all the data related to that KEY. When input data are received, the KEYs are compared first, and only when a KEY matches are the data in its list compared. This greatly reduces the number of comparisons and improves the efficiency of the identical-data judgment.
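A minimal sketch of the KEY generation follows. The bounds and divisors are copied from the text as written; the use of integer division, the "|" separator inside the KEY and all function names are assumptions of this example. The sample coordinates are chosen only so that they pass the range check as stated.

```python
from collections import defaultdict

# Bounds and divisors copied from the text; "|" and integer division are
# assumptions of this sketch.
LAT_MIN, LAT_MAX = 43.005, 144.015
LNG_MIN, LNG_MAX = 18.0, 54.0


def bigrams(name):
    """Cut a name into overlapping two-character words."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def grid_keys(latitude, longitude):
    """Return (lat_key, long_key), or None when the range check fails."""
    if not (LAT_MIN <= latitude <= LAT_MAX and LNG_MIN <= longitude <= LNG_MAX):
        return None
    lat_key = int((latitude - LAT_MIN) * 1000) // 15
    long_key = int((longitude - LNG_MIN) * 1000) // 10
    return lat_key, long_key


def make_keys(name, latitude, longitude):
    """Combine every two-character word with the grid cell of the record."""
    cell = grid_keys(latitude, longitude)
    if cell is None:
        return []
    return [f"{word}|{cell[0]}|{cell[1]}" for word in bigrams(name)]


# Inverted index: KEY -> list of stored records that produced that KEY.
index = defaultdict(list)
record = {"name": "中关村店", "latitude": 116.3, "longitude": 39.985}
for key in make_keys(record["name"], record["latitude"], record["longitude"]):
    index[key].append(record)
```

Because candidate records are found by a dictionary lookup on the KEY, an incoming record is compared only with records that share a two-character word in the same grid cell, rather than with the whole data set.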

In step S203, the corresponding pre-stored data lists are looked up according to the keys;

In this embodiment, according to the keys, the corresponding pre-stored data lists are searched in the shard where the input data lie and in the 8 shards surrounding the longitude and latitude of the input data.
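Searching the record's own shard plus its 8 neighbours means visiting a 3 x 3 block of grid cells. The sketch below assumes, purely for illustration, that the index is keyed by `(word, lat_key, long_key)` tuples and that stored records are referenced by id; `neighbour_cells` and `lookup` are hypothetical names.

```python
from collections import defaultdict


def neighbour_cells(lat_key, long_key):
    """The cell itself plus the 8 cells around it (a 3 x 3 block)."""
    return [(lat_key + d_lat, long_key + d_long)
            for d_lat in (-1, 0, 1) for d_long in (-1, 0, 1)]


def lookup(index, words, lat_key, long_key):
    """Collect stored record ids sharing a word in any of the 9 cells."""
    found = []
    for cell in neighbour_cells(lat_key, long_key):
        for word in words:
            for record_id in index.get((word, *cell), []):
                if record_id not in found:
                    found.append(record_id)
    return found


index = defaultdict(list)
index[("中关", 4886, 2198)].append("poi-1")   # same cell
index[("中关", 4887, 2199)].append("poi-2")   # diagonal neighbour
index[("中关", 4890, 2198)].append("poi-3")   # outside the 3 x 3 block
hits = lookup(index, ["中关", "关村"], 4886, 2198)
```

Including the neighbouring cells avoids missing a match when two records of the same POI fall on opposite sides of a cell boundary.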

In step S204, the input data are compared for similarity with each datum in the data lists;

In this embodiment, the similarity parameters include but are not limited to at least one of the following: identical core word, identical suffix category, two-character word similarity ratio, inverse document frequency comprehensive similarity, substring, subsequence and single-character inclusion rate. The identical core word is obtained directly after preprocessing.

The two-character word similarity ratio (bigram_similar) is compared only when the following conditions are met: the names of the two compared data share at least two consecutive identical characters, and the physical distance between the two data is within one kilometer. bigram_similar is calculated by counting, after binary word segmentation of the names, the number a of identical words and the number b of different words, and then computing a / (a + b). For example, for "Huilongguan Wang Bashu Boiled Fish" and "Wang Bashu Boiled Fish Huilongguan Branch", the identical words ("Huilong | Longguan | Wangba | Bashu | Shushui | Shuizhu | Zhuyu") number 8 pairs and the different words ("Guanwang | Yuhui | Guandian") number 3 pairs, so bigram_similar is 8 / (8 + 3) = 0.727.
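The a / (a + b) ratio can be sketched as follows. The text does not spell out how repeated words are counted, so this example makes the assumption of counting distinct two-character words with sets; figures for a given pair of names may therefore differ from a count made another way. Returning `None` when a precondition fails, and measuring distance in meters, are also choices of the sketch.

```python
def bigrams(name):
    """Cut a name into overlapping two-character words."""
    return [name[i:i + 2] for i in range(len(name) - 1)]


def bigram_similar(name_a, name_b, distance_m):
    """Return a / (a + b), or None when the preconditions are not met."""
    words_a, words_b = set(bigrams(name_a)), set(bigrams(name_b))
    same = words_a & words_b            # a: words present in both names
    if not same or distance_m > 1000:   # need a shared word and <= 1 km
        return None
    different = words_a ^ words_b       # b: words present in only one name
    return len(same) / (len(same) + len(different))


ratio = bigram_similar("abcd", "abce", distance_m=50)
```

For "abcd" and "abce" the shared words are "ab" and "bc" and the differing words are "cd" and "ce", so the ratio is 2 / (2 + 2).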

The identical suffix category parameter (key-categorysuf_similar) is obtained by comparing the keyname plus suffix category parts of the two data. The detailed process is illustrated as follows:

Suppose the keyname of data 1 is 1k and its suffix category is 1s, and the keyname of data 2 is 2k and its suffix category is 2s. The calculation is as follows:

If 1k is not empty or 2k is not empty        // a keyname is present
{
    If 1k is not equal to 2k                 // the keynames differ
    {
        key-categorysuf_similar = 0;
    }
    Else                                     // 1k equals 2k
    {
        If 1s is not empty and 2s is not empty   // both have a suffix category
        {
            If 1s equals 2s                  // the suffix categories are equal
            {
                key-categorysuf_similar = 1; // same keyname, same suffix category
            }
            Else                             // same keyname, different suffix category
            {
                key-categorysuf_similar = 0;
            }
        }
        Else
        {
            key-categorysuf_similar = 2;     // same keyname, no suffix category
        }
    }
}
Else                                         // neither has a keyname
{
    If 1s is not empty and 2s is not empty   // both have a suffix category
    {
        If 1s equals 2s                      // the suffix categories are equal
        {
            key-categorysuf_similar = 3;     // no keyname, same suffix category
        }
        Else
        {
            key-categorysuf_similar = 0;
        }
    }
    Else                                     // no keyname and no suffix category
    {
        key-categorysuf_similar = 1;
    }
};
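The decision procedure above can be transcribed into a runnable function. This is a sketch that follows the listed branches literally, including the value 1 returned when neither record has a keyname or a suffix category; the function name and the use of `None` or an empty string for a missing value are assumptions.

```python
def key_category_similar(k1, s1, k2, s2):
    """Return the keyname / suffix category code for two records.

    k1, k2: keynames; s1, s2: suffix categories (None or "" when absent).
    """
    if k1 or k2:                         # a keyname is present
        if k1 != k2:
            return 0                     # keynames differ
        if s1 and s2:
            return 1 if s1 == s2 else 0  # same keyname, compare categories
        return 2                         # same keyname, no suffix category
    if s1 and s2:
        return 3 if s1 == s2 else 0      # no keyname, compare categories
    return 1                             # no keyname, no suffix category


code = key_category_similar("KFC", "restaurant", "KFC", "restaurant")
```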

When the two compared data satisfy the conditions that bigram_similar is greater than a preset threshold, that the broad category does not conflict with the keyname (for example, school versus square), and that the addresses are not clearly different (clearly different meaning that the physical distance between the two data is greater than 30 meters), the inverse document frequency comprehensive similarity (idf_similar) parameter is then calculated.

The inverse document frequency comprehensive similarity (idf_similar) is a blended similarity obtained from the name similarity, the address similarity, the phone similarity and the distance similarity. The overall formula is:

idf_similar = 0.85 * name_similar + 0.05 * address_similar + 0.05 * phone_similar + 0.05 * lating_similar

The name similarity (name_similar) is the most important of these. It is calculated as follows:

name_similar = W_same_scores_total * 2 / (1_scores + 2_scores)

where

W_same_scores_total = Wsame_1 * Wsame_1_scores + Wsame_2 * Wsame_2_scores + ... + Wsame_n * Wsame_n_scores

1_scores = 1_w_1 * 1_w_1_scores + ... + 1_w_t * 1_w_t_scores

2_scores = 2_w_1 * 2_w_1_scores + ... + 2_w_f * 2_w_f_scores

Wsame_i is an identical word obtained after binary word segmentation of the names of the two data, and Wsame_i_scores is the score (i.e. weight) computed in advance for that word. Likewise, 1_w_i and 2_w_i are the words of the first and second name, with scores 1_w_i_scores and 2_w_i_scores.

The name_similar calculation reduces the influence of non-core words in the names of the data. Take "Haihuang Hotel" and "Haihuang Guesthouse": because "Haihuang" occurs rarely in the names of the data, Wsame_scores(Haihuang) is high, whereas "Hotel" and "Guesthouse" occur frequently, so Wsame_scores(Hotel) and Wsame_scores(Guesthouse) are low. The value of Wsame_scores(Haihuang) * 2 / (Wsame_scores(Haihuang) * 2 + Wsame_scores(Hotel) + Wsame_scores(Guesthouse)) is nevertheless high, so "Haihuang Hotel" and "Haihuang Guesthouse" can be judged very similar. By the same calculation, the name_similar of "Haihuang Hotel" and "Xinghai Hotel" is very low, so the two can be judged dissimilar.
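The weighted name similarity can be sketched as follows. The score table is hypothetical (the text only says scores are computed in advance and are higher for rarer words), names are passed in as lists of words, and each word contributes its score once per occurrence; all of these are assumptions of the example.

```python
def name_similar(words_a, words_b, scores, default=1.0):
    """2 * (score of shared words) / (score of name a + score of name b)."""
    def total(words):
        return sum(scores.get(word, default) for word in words)

    same = [word for word in words_a if word in words_b]
    denominator = total(words_a) + total(words_b)
    return 2 * total(same) / denominator if denominator else 0.0


# Hypothetical weights: rare words score high, common words score low.
scores = {"Haihuang": 9.0, "Xinghai": 9.0, "Hotel": 1.0, "Guesthouse": 1.0}

close = name_similar(["Haihuang", "Hotel"], ["Haihuang", "Guesthouse"], scores)
far = name_similar(["Haihuang", "Hotel"], ["Xinghai", "Hotel"], scores)
```

With these weights the first pair shares the rare word and scores 2 * 9 / (10 + 10) = 0.9, while the second pair shares only the common word and scores 2 * 1 / (10 + 10) = 0.1.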

lating_similar is calculated as: lating_similar = MIN(100.0 / distance, 1);

The address similarity (address_similar) is obtained by dividing the addresses of the two compared data into six levels (province, city, county, district, street and number) and comparing them level by level;

The phone similarity (phone_similar) is compared as follows:

phone_similar = 1 when phone a and phone b are identical; phone_similar = 0.7 when the last 7 digits of phone a and phone b are identical; phone_similar = 0 in all other cases;
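The distance and phone components, together with the weighted sum that yields idf_similar, can be sketched as below. Treating `distance` as meters, returning 1 for a zero distance, and returning 0 when a phone number is missing are assumptions; the address similarity is taken as an input because the text gives no formula for it.

```python
def lating_similar(distance_m):
    """MIN(100.0 / distance, 1); a zero distance is treated as a full match."""
    if distance_m <= 0:
        return 1.0
    return min(100.0 / distance_m, 1.0)


def phone_similar(phone_a, phone_b):
    """1 for identical numbers, 0.7 for identical last 7 digits, else 0."""
    if not phone_a or not phone_b:
        return 0.0                      # assumption: missing numbers never match
    if phone_a == phone_b:
        return 1.0
    if len(phone_a) >= 7 and len(phone_b) >= 7 and phone_a[-7:] == phone_b[-7:]:
        return 0.7
    return 0.0


def idf_similar(name_sim, address_sim, phone_sim, lating_sim):
    """Weighted blend with the weights given in the text."""
    return (0.85 * name_sim + 0.05 * address_sim
            + 0.05 * phone_sim + 0.05 * lating_sim)


blended = idf_similar(0.9, 1.0,
                      phone_similar("01012345678", "12345678"),
                      lating_similar(200.0))
```

Because the name carries a weight of 0.85, two records with dissimilar names cannot reach a high idf_similar even when address, phone and position all agree.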

A substring is a character string that occurs contiguously in a longer string, for example: "abc" is a substring of "abcef";

A subsequence is a character string whose characters occur in order in a longer string, for example: "abc" is a subsequence of "axbxc";

The single-character inclusion rate is the probability that the individual characters of the shorter string occur in the longer string, for example: the probability that "a" and "b" of "abc" occur in "abdef" is 1/5 = 0.2.
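The substring and subsequence relations can be checked with a few lines of standard Python. The function names are illustrative only.

```python
def is_substring(short, long):
    """True when `short` occurs contiguously inside `long`."""
    return short in long


def is_subsequence(short, long):
    """True when the characters of `short` occur in `long` in the same order."""
    remaining = iter(long)
    # `ch in remaining` consumes the iterator up to the first match, so each
    # character must be found after the previous one.
    return all(ch in remaining for ch in short)
```

For instance, `is_substring("abc", "abcef")` and `is_subsequence("abc", "axbxc")` both hold, matching the two examples in the text.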

In this embodiment, computing the similarity of the compared data along multiple dimensions effectively reduces the error rate of the identical-data judgment and improves both its recall and its efficiency.

In step S205, when the similarity meets the preset threshold, the compared data are judged to be the same data.

In this embodiment, the calculated similarity is compared with preset thresholds to judge whether the compared data are the same data. For example: when bigram_similar >= 0.8, the compared data are judged to be the same data; when bigram_similar < 0.2, they are judged not to be the same data; when 0.4 <= bigram_similar < 0.8, the data are judged to be the same if idf_similar > 0.9, or if their keyname and suffix category are identical, or if their names have a substring or subsequence relation; they are also judged to be the same if idf_similar <= 0.9, the suffix categories are identical, and idf_similar >= 0.5 after the suffix words are removed; when 0.2 <= bigram_similar < 0.4, the data are judged to be the same if 0.5 > cal_similar > 0.1 and either the names of the two data have a substring or subsequence relation or their keyname and suffix category are both identical. In all other cases the compared data are judged not to be the same data.
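The example thresholds can be collected into one decision function. This is a sketch under stated assumptions: the `Signals` container and its field names are hypothetical, all similarity values are assumed to be computed beforehand, and `cal_similar` is taken as an input because the text does not define how it is calculated.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    bigram_similar: float
    idf_similar: float
    idf_similar_no_suffix: float   # idf_similar after removing suffix words
    cal_similar: float             # not defined in the text; supplied as input
    same_key_and_category: bool    # keyname and suffix category both identical
    same_category: bool            # suffix categories identical
    sub_relation: bool             # substring or subsequence relation of names


def is_same(s):
    """Apply the example thresholds; True means 'same data'."""
    b = s.bigram_similar
    if b >= 0.8:
        return True
    if b < 0.2:
        return False
    if b >= 0.4:                                   # 0.4 <= b < 0.8
        if s.idf_similar > 0.9 or s.same_key_and_category or s.sub_relation:
            return True
        return s.same_category and s.idf_similar_no_suffix >= 0.5
    # 0.2 <= b < 0.4
    return 0.1 < s.cal_similar < 0.5 and (s.sub_relation or s.same_key_and_category)


verdict = is_same(Signals(0.5, 0.95, 0.0, 0.0, False, False, False))
```

The cheap bigram ratio settles the clear cases at both ends, and the more expensive signals are consulted only in the two middle bands.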

Embodiment three:

Fig. 3 shows the structure of the data fusion device provided by embodiment three of the invention. For ease of explanation, only the parts relevant to the embodiment of the invention are shown.

The data fusion device may be a software unit, a hardware unit or a combined software and hardware unit running in a data processing system, or it may be integrated as an independent add-on into the data processing system or run in an application system of the data processing system.

The data fusion device comprises a data receiving unit 31, a judging unit 32, a data fusion unit 33 and a direct storage unit 34. The function of each unit is as follows:

The data receiving unit 31 is used to receive input data;

The judging unit 32 is used to judge whether data identical to the input data exist in the pre-stored data. When the result is "yes", the new description information in the input data is added to the identical data by the data fusion unit 33; when the result is "no", the input data are stored by the direct storage unit 34. The judging unit 32 further comprises a preprocessing module 41, a name cutting module 42, a search module 43, a similarity comparison module 44 and a determination module 45 (as shown in Fig. 4). The function of each module is as follows:

The preprocessing module 41 is used to preprocess the input data;

The name cutting module 42 is used to perform binary word segmentation on the name of the preprocessed data and to combine each resulting word with the longitude and latitude of the data to generate a corresponding key;

The search module 43 is used to look up the corresponding pre-stored data list according to the key;

The similarity comparison module 44 is used to compare the input data for similarity with each datum in the data list;

The determination module 45 is used to judge that the compared data are the same data when the similarity meets the preset threshold.

In this embodiment, each module is implemented as described above, and the details are not repeated here.

In the embodiments of the invention, it is judged whether data identical to the input data exist in the pre-stored data, and when they do, the new description information in the input data is added to the identical data. This effectively enriches the information of the data, reduces data redundancy, and improves user satisfaction with search results. Moreover, during the identical-data judgment, computing the similarity of the compared data along multiple dimensions effectively reduces the error rate of the judgment and improves both its recall and its efficiency.

The above are merely preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A data fusion method, characterized in that the method comprises the following steps:

Receiving input data;

2. The method of claim 1, characterized in that the step of judging whether data identical to the input data exist in the pre-stored data is specifically:

Preprocessing the input data;

Performing binary word segmentation on the name of the preprocessed data, and combining each resulting word with the longitude and latitude of the data to generate a corresponding key;

Looking up the corresponding pre-stored data list according to the key;

Comparing the input data for similarity with each datum in the data list; when the similarity meets a preset threshold, judging that the compared data and the input data are the same data.

3. The method of claim 2, characterized in that the preprocessing comprises address-to-coordinate conversion, name splitting and address splitting;

The address-to-coordinate conversion means that, when the data have no longitude and latitude, the longitude and latitude are obtained from the address of the data;

The name splitting divides the name of the data into an address prefix, a core part, a branch part, and a key name plus suffix category part;

The address splitting divides the address of the data by province, city, county, district, street and number.

4. The method of claim 2, characterized in that the parameters of the similarity comparison comprise at least one of the following: identical core word, identical suffix category, two-character word similarity ratio, inverse document frequency comprehensive similarity, substring, subsequence and single-character inclusion rate.

5. The method of claim 1, characterized in that the method further comprises the following step:

When no data identical to the input data exist, storing the input data.

6. A data fusion device, characterized in that the device comprises:

A data receiving unit, used to receive input data;

A judging unit, used to judge whether data identical to the input data exist in the pre-stored data; and

7. The device of claim 6, characterized in that the judging unit further comprises:

A preprocessing module, used to preprocess the input data;

A name cutting module, used to perform binary word segmentation on the name of the preprocessed data and to combine each resulting word with the longitude and latitude of the data to generate a corresponding key;

A search module, used to look up the corresponding pre-stored data list according to the key;

A similarity comparison module, used to compare the input data for similarity with each datum in the data list;

A determination module, used to judge that the compared data and the input data are the same data when the similarity meets a preset threshold.

8. The device of claim 7, characterized in that the preprocessing comprises address-to-coordinate conversion, name splitting and address splitting;

The name splitting divides the name of the data into an address prefix, a core part, a branch part, and a key name plus suffix category part;

9. The device of claim 7, characterized in that the parameters of the similarity comparison comprise at least one of the following: identical core word, identical suffix category, two-character word similarity ratio, inverse document frequency comprehensive similarity, substring, subsequence and single-character inclusion rate.

10. The device of claim 6, characterized in that the device further comprises:

A storage unit, used to store the input data when no data identical to the input data exist.

11. A data processing system, characterized in that the data processing system comprises the data fusion device of any one of claims 6 to 10.