CN105989078A

CN105989078A - Index construction method for structured peer-to-peer network as well as retrieval method, apparatus and system

Info

Publication number: CN105989078A
Application number: CN201510072216.3A
Authority: CN
Inventors: 刘大伟; 刘玮; 隋雪青; 程学旗; 戴鹏飞
Original assignee: Yantai Zhong Ke Network Technical Institute
Current assignee: Yantai Zhong Ke Network Technical Institute
Priority date: 2015-02-11
Filing date: 2015-02-11
Publication date: 2016-10-05
Anticipated expiration: 2035-02-11
Also published as: CN105989078B

Abstract

The invention discloses an index construction method for a structured peer-to-peer network, together with a retrieval method, apparatus and system. The index construction method comprises: selecting hash function index parameters; mapping index data into hash tables according to a hash function family, wherein each piece of index data undergoes k hash operations and enters a bucket labelled by a k-dimensional vector; for each bucket in the hash tables, calculating the l2 norm of a random point p; estimating the normal distribution of the index data set D according to the l2 norm of the random point p; dividing the bucket space into a conventional region and a sparse region according to the normal distribution; mapping each bucket in the hash tables to a key of a one-dimensional distributed hash table according to the conventional region and the sparse region; and inserting each key in sequence into the nodes of the peer-to-peer network according to the Chord routing protocol. The method extends the locality-sensitive hashing algorithm to a distributed structured peer-to-peer network, increases retrieval speed, and preserves the retrieval precision of the original centralized locality-sensitive hashing algorithm.

Description

A method for constructing an index for a structured peer-to-peer network, and a retrieval method, apparatus and system

Technical field

The present invention relates to data processing fields such as information retrieval and cluster analysis, and in particular to a method for constructing an index for a structured peer-to-peer network, and a retrieval method, apparatus and system.

Background art

With the rapid development of Internet information, massive data (such as text documents, pictures and videos) are classified and then represented, according to their features, as high-dimensional objects in massive data sets. Many applications need to perform efficient and scalable information retrieval in large-scale distributed network environments. How to perform traditional nearest-neighbor retrieval effectively in a distributed environment has become an important problem.

In the prior art, research on the nearest-neighbor search problem falls into two classes: exact queries and approximate queries.

For exact queries, tree index structures such as the R-tree, TV-tree and KD-tree are often used to partition the space. This approach achieves good results in low-dimensional spaces. Real data sets, however, are often high-dimensional, and tree-based retrieval methods suffer from the curse of dimensionality, with retrieval performance that can be even worse than a linear scan.

For approximate queries, retrieval methods based on hash algorithms can perform nearest-neighbor search effectively in high-dimensional spaces, but at reduced query precision.

Recent research has shown that locality-sensitive hashing (LSH) can effectively realize nearest-neighbor search in high-dimensional spaces. The LSH approximation method was proposed by Indyk and Motwani; it hashes objects of a high-dimensional space using multiple hash functions that preserve locality. LSH places similar data objects in nearby positions with high probability.

Here, the metric space considered is a d-dimensional Euclidean space, where d is a relatively large number (high-dimensional), and D is a set of points in the d-dimensional space. The exact nearest neighbor of a query point q is defined as the point p^* ∈ D closest to q. Formally, there exists no other p ∈ D satisfying d(p, q) < d(p^*, q), where d(·,·) denotes the distance between two points. Locality-sensitive hashing supports c-approximate nearest-neighbor search, defined as follows: for a given query point q, if the distance from a point p to q is at most c times the distance from q to its nearest neighbor p^*, then p is called a c-approximate nearest neighbor of q, where c >= 1 is the approximation factor. In our study, the l_p norm is used to represent the distance between two points p and q, i.e. d(p, q) = ||p - q||_p.
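The definition above can be checked directly on a small data set. The following is a minimal sketch for the l_2 case only; the function names `euclidean` and `is_c_approximate_nn` are illustrative and do not come from the patent.

```python
import math


def euclidean(p, q):
    """l2 distance between two points of equal dimension."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def is_c_approximate_nn(p, q, dataset, c):
    """Return True if d(p, q) <= c * d(p*, q), where p* is the exact
    nearest neighbor of q in dataset (found here by a linear scan)."""
    if c < 1:
        raise ValueError("c must be >= 1")
    nearest = min(euclidean(x, q) for x in dataset)
    return euclidean(p, q) <= c * nearest


D = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
q = (1.0, 0.0)
print(is_c_approximate_nn((1.0, 1.0), q, D, c=1))  # an exact nearest neighbor
print(is_c_approximate_nn((3.0, 4.0), q, D, c=2))  # too far for c = 2
```

With c = 1 the test reduces to exact nearest-neighbor membership; larger c admits progressively farther points.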

The basic idea of LSH is to hash data points with several hash functions, each of which guarantees that similar data objects are placed in nearby positions with high probability. An LSH function family can be formally defined as follows: a family of functions H = {h: D → U}, where U is the LSH bucket space. For any m, n ∈ D, if:

if d(m, n) <= r₁, then Pr[h(m) = h(n)] >= p₁;

if d(m, n) >= r₂, then Pr[h(m) = h(n)] <= p₂,

then H is called (r₁, r₂, p₁, p₂)-sensitive, where r₁ < r₂, p₁ > p₂, and Pr[h(m) = h(n)] denotes the collision probability.

It can be seen that an LSH function family is defined under a specific distance metric and distribution. Here the distance function is taken to be the l_p norm on Euclidean space, and the LSH function family is based on p-stable distributions, where p = 1 corresponds to the Cauchy distribution and p = 2 to the normal distribution. The LSH function family is defined as follows:

h_{a, B}(v) = \lfloor \frac{a \cdot v + B}{W} \rfloor

where a is a d-dimensional vector whose elements follow a p-stable distribution, and B follows the uniform distribution on [0, W]. Each function therefore maps a d-dimensional data point to an integer. In the LSH scheme based on p-stable functions, multiple hash functions are generally used to generate a bucket label g(v). The bucket label g(v) is a k-dimensional integer vector, g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v)). All data points of the data set are hashed into buckets of the hash table, each bucket being labelled by a k-dimensional vector. To ensure the precision of similarity search, l independent hash tables are often used to build the LSH index structure. In the later retrieval stage, the query point is hashed l times, falling into l buckets of the l different hash tables, and the other data in the buckets containing the query point are retrieved.
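The scheme described above can be sketched in a few lines. This is a minimal, centralized illustration for the p = 2 case only, assuming a has independent standard normal entries and B is uniform on [0, W]; the helper names (`make_hash`, `make_table`, `build_index`, `candidates`) and the parameter values in the usage lines are hypothetical.

```python
import math
import random


def make_hash(d, W, rng):
    """Draw one h_{a,B}: a has i.i.d. N(0, 1) entries (2-stable), B ~ U[0, W]."""
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    B = rng.uniform(0.0, W)
    return lambda v: math.floor((sum(ai * vi for ai, vi in zip(a, v)) + B) / W)


def make_table(d, k, W, rng):
    """Return g(v) = (h_1(v), ..., h_k(v)), the k-dimensional bucket label."""
    hs = [make_hash(d, W, rng) for _ in range(k)]
    return lambda v: tuple(h(v) for h in hs)


def build_index(points, d, k, l, W, seed=0):
    """Hash every point into l independent tables keyed by bucket label."""
    rng = random.Random(seed)
    gs = [make_table(d, k, W, rng) for _ in range(l)]
    tables = [dict() for _ in range(l)]
    for v in points:
        for g, table in zip(gs, tables):
            table.setdefault(g(v), []).append(v)
    return gs, tables


def candidates(q, gs, tables):
    """Union of the l buckets that the query point q falls into."""
    out = set()
    for g, table in zip(gs, tables):
        out.update(table.get(g(q), []))
    return out


points = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (50.0, 50.0, 50.0)]
gs, tables = build_index(points, d=3, k=4, l=3, W=4.0)
print(gs[0](points[0]), len(candidates(points[0], gs, tables)))
```

Points are stored as tuples so that they can be collected in a set; a real index would store object identifiers rather than the vectors themselves.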

In recent years LSH has seen some theoretical improvements; Datar proposed locality-sensitive hash functions for the l_p metric. A significant drawback of LSH is that, to achieve good retrieval performance, a large number of hash tables must be maintained, and these hash tables do not scale. Some methods attempt to extend LSH to different application scenarios, including centralized and distributed environments. Bawa proposed building a scalable LSH method with tunable parameters in a P2P system. Lv proposed a multi-probe LSH method that reduces the number of hash tables and improves the time and space efficiency of the algorithm. Haghani used a two-layer mapping to implement a multi-probe LSH method supporting distributed similarity search.

Retrieval of massive high-dimensional data in a distributed environment relies on a series of gateway nodes placed at specified positions. When the system starts, any node can access these gateway nodes. This centralized configuration guarantees the validity of the mapping, but in a dynamic distributed network environment it is difficult to maintain the load balance of the system. The prior art includes:

1) Space-filling curves (SFC), such as the Hilbert curve, map points of a multidimensional space to a one-dimensional space while keeping the positional relationships of the points in the original multidimensional space unchanged. The resulting one-dimensional data points can be placed in the data structures of a structured P2P overlay network, such as a B+ tree or a DHT structure. Note that the original Hilbert curve assumes a uniform distribution and cannot adjust spatial density according to the data distribution. However, as the spatial dimension rises, the efficiency of an SFC declines, so it cannot be used for nearest-neighbor retrieval over a high-dimensional index.

2) Structured peer-to-peer networks are a way of managing large amounts of distributed data. Common distributed hash table (DHT) overlay networks such as Chord, Pastry and CAN are used for fast routed lookups. Chord, for example, uses consistent hashing to map keys to nodes and is scalable, keeping the number of keys managed by each node almost identical. Using SHA-1 as the base hash function, Chord provides good load balancing, scalability and availability. In addition, Chord places no constraints on the structure of lookup keys, and its key space is flat, which gives great flexibility in how objects are mapped to Chord keys. However, these networks do not support keyword retrieval queries.
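The consistent-hashing rule that Chord uses to assign keys to nodes can be illustrated as follows. This is a simplified sketch, not an implementation of Chord: it keeps a global sorted list of node identifiers and omits finger tables, routing hops and node joins or departures. The class name `Ring` and the node names are hypothetical.

```python
import bisect
import hashlib


def chord_id(name: str) -> int:
    """160-bit identifier obtained from SHA-1, as in Chord's identifier space."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


class Ring:
    """Sorted ring of node identifiers with successor lookup."""

    def __init__(self, node_names):
        self.nodes = sorted((chord_id(n), n) for n in node_names)
        self.ids = [node_id for node_id, _ in self.nodes]

    def successor(self, key_id: int) -> str:
        """A key is stored on the first node whose id is >= the key id,
        wrapping around to the smallest id at the end of the ring."""
        idx = bisect.bisect_left(self.ids, key_id) % len(self.ids)
        return self.nodes[idx][1]


ring = Ring(["node-a", "node-b", "node-c", "node-d"])
print(ring.successor(chord_id("some-object-key")))
```

Because SHA-1 spreads both node names and keys roughly uniformly over the identifier space, each node ends up responsible for a similar share of keys, provided the keys themselves are hashed; this is the property the text above refers to as load balancing.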

In view of these shortcomings, the purpose of the present invention is as follows: for the distributed retrieval of massive high-dimensional data, a structured peer-to-peer network structure is introduced in the distributed scenario, nearest-neighbor search in high-dimensional space is based on LSH, and an SFC is used to map the LSH bucket space effectively onto the DHT node space, i.e. to realize the mapping from multiple dimensions to one dimension. Taking load balancing and object storage location further into account, a method that transforms a non-uniform space-filling curve (SFC) is proposed to handle the load balancing of distributed requests in a distributed hash table (DHT) network environment.

Summary of the invention

The object of the present invention is to provide a method for constructing an index for a structured peer-to-peer network, and a retrieval method, apparatus and system, so as to solve the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.

In one aspect, a method for constructing an index for a structured peer-to-peer network is provided, comprising:

S1: given an index data set D under the l₂ norm, selecting hash function index parameters (W, k, l);

where W denotes the threshold (upper bound) of the uniform-distribution interval, k denotes the number of hash operations performed on each piece of index data, and l denotes the number of hash tables;

S2: mapping all index data in the index data set D into the buckets of l hash tables according to the following formula:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

where h(v) denotes the hash function, v denotes a piece of index data in the index data set D, a denotes a vector following a stable distribution, a₁...a_k denote k different vectors drawn from the stable distribution, B denotes a uniformly distributed value, B₁...B_k denote k different values drawn from the uniform distribution, and g(v) denotes a k-dimensional bucket vector;

S3: calculating the l₂ norm ||p||₂ of a random point p in a bucket according to the following formula:

\|p\|_2 = (|p_1|^2 + |p_2|^2 + \ldots + |p_k|^2)^{\frac{1}{2}};

where ||p||₂ denotes the l₂ norm of p, p denotes a k-dimensional bucket vector, and p₁...p_k denote the individual dimensions of the k-dimensional bucket vector;

S4: calculating the statistical characteristics of the index data set D according to the following formula:

N(\frac{k}{2}, \frac{\sqrt{k} \|p\|_2}{W});

where N(·,·) denotes a normal distribution, k denotes the number of hash operations performed on each piece of index data, ||p||₂ denotes the l₂ norm of p, and W denotes the threshold (upper bound) of the uniform-distribution interval;

S5: dividing the bucket space into a conventional region and a sparse region according to said statistical characteristics;

S6: mapping the bucket vectors to the keys of a one-dimensional distributed hash table according to the conventional region and the sparse region;

S7: inserting each key in turn into the nodes of the peer-to-peer network according to the Chord routing protocol.

An apparatus for constructing an index for a structured peer-to-peer network is also provided, the apparatus comprising:

an index parameter selection module, configured to select, for a given index data set D under the l₂ norm, hash function index parameters (W, k, l);

a hash table generation module, configured to map all index data in the index data set D into the buckets of l hash tables according to the following formula:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

a norm calculation module, configured to calculate the l₂ norm ||p||₂ of a random point p in a bucket according to the following formula:

\|p\|_2 = (|p_1|^2 + |p_2|^2 + \ldots + |p_k|^2)^{\frac{1}{2}};

a statistical characteristics calculation module, configured to calculate the statistical characteristics of the index data set D according to the following formula:

N(\frac{k}{2}, \frac{\sqrt{k} \|p\|_2}{W});

a region division module, configured to divide the bucket space into a conventional region and a sparse region according to said statistical characteristics;

a bucket vector mapping module, configured to map the bucket vectors to the keys of a one-dimensional distributed hash table according to the conventional region and the sparse region; and

a peer-to-peer network mapping module, configured to insert each key in turn into the nodes of the peer-to-peer network according to the Chord routing protocol.

In another aspect, a retrieval method for a structured peer-to-peer network is also provided, comprising:

S11: selecting, as retrieval parameters, the hash function parameters (W, k, l) used for indexing;

S12: obtaining, according to the following formula, the l k-dimensional bucket vectors corresponding to each query point in the l hash tables:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

where v denotes the data to be retrieved, a denotes a vector following a stable distribution, a₁...a_k denote k different vectors drawn from the stable distribution, B denotes a uniformly distributed value, B₁...B_k denote k different values drawn from the uniform distribution, and g(v) denotes a k-dimensional bucket vector;

S13: determining the region in which each query point lies, according to the region division of the indexing stage;

S14: according to the region in which each query point lies, mapping the l k-dimensional bucket vectors corresponding to each query point in the l hash tables to keys of the one-dimensional distributed hash table;

S15: locating the node of the peer-to-peer network according to the Chord routing protocol;

S16: computing Euclidean distances within the same bucket vector in the hash table of said node, and returning the results to a node.

A retrieval apparatus for a structured peer-to-peer network is also provided, the apparatus comprising:

a retrieval parameter selection module, configured to select, as retrieval parameters, the hash function parameters (W, k, l) used for indexing;

where W denotes the threshold (upper bound) of the uniform-distribution interval, k denotes the number of hash operations performed on each piece of retrieval data, and l denotes the number of hash tables;

a bucket vector acquisition module, configured to obtain, according to the hash function family used for indexing, the l k-dimensional bucket vectors corresponding to each query point in the l hash tables;

a region location module, configured to determine the region in which each query point lies, according to the region division of the indexing stage;

a bucket vector mapping module, configured to map, according to the region in which each query point lies, the l k-dimensional bucket vectors corresponding to each query point in the l hash tables to keys of the one-dimensional distributed hash table;

a node location module, configured to locate the node of the peer-to-peer network according to the Chord routing protocol;

a Euclidean distance calculation module, configured to compute Euclidean distances within the same bucket vector in the hash table of said node and to return the results to a node.

In yet another aspect, a retrieval system for a structured peer-to-peer network is also provided, the system comprising an index construction apparatus and a retrieval apparatus;

the index construction apparatus being the above-described apparatus for constructing an index for a structured peer-to-peer network;

the retrieval apparatus being the above-described retrieval apparatus for a structured peer-to-peer network.

Beneficial effects of the present invention: the present invention performs similarity search on data in a high-dimensional space by means of a locality-sensitive hashing algorithm, and combines a typical hash algorithm with a non-uniform Hilbert curve to map the high-dimensional locality-sensitive hash bucket space to the keys of a one-dimensional distributed hash table. This solves the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.

Compared with the prior art, the present invention has the following three advantages:

(1) Efficiency and effectiveness: the locality-sensitive hashing algorithm has been shown to perform approximate nearest-neighbor retrieval effectively in high-dimensional spaces. The index structure of the present invention improves the scalability of the locality-sensitive hashing algorithm and extends LSH to a distributed structured peer-to-peer network, without reducing the query precision or query rate of the original locality-sensitive hashing algorithm.

(2) Locality and load balancing: in a distributed scenario, load is shared across all nodes, and single-point-of-failure problems need to be handled. In the retrieval stage of locality-sensitive hashing, some specific algorithms, such as the multi-probe method, need to probe multiple nearby data buckets. On the index structure of the present invention, the load brought by such retrieval is often not uniformly distributed. To ensure reliability, both the locality of the index and the load balancing of the distributed hash table need to be satisfied.

(3) Discretization and scalability: when a centralized method is applied in a distributed environment, how to discretize it is the most important and also the most difficult problem, and it directly affects the scalability of the system. In a P2P network environment, global configuration such as gateway nodes or super nodes is detrimental to scalability. In the application scenario of the present invention, the support mechanisms of the DHT overlay network are used to control the dynamic changes of the whole network, so that the system is not damaged by upper-layer applications.

Brief description of the drawings

To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is the basic curve of the Hilbert curve in two-dimensional space;

Fig. 2 is the second-order curve of the Hilbert curve in two-dimensional space;

Fig. 3 is a flowchart of the index construction method provided by Embodiment one of the present invention;

Fig. 4 is a schematic structural diagram of the index construction apparatus provided by Embodiment two of the present invention;

Fig. 5 is a flowchart of the retrieval method provided by Embodiment three of the present invention;

Fig. 6 is a schematic structural diagram of the retrieval apparatus provided by Embodiment four of the present invention;

Fig. 7 is the two-dimensional non-uniform Hilbert curve of specific example one of the present invention;

Fig. 8 is the normal-distribution region division diagram of specific example one of the present invention;

Fig. 9 is a schematic structural diagram of the retrieval system provided by Embodiment five of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.

The technical solutions of the present invention are illustrated below by means of specific embodiments.

Introduction to the Hilbert curve: so that the mapping process can be shown intuitively, assume k = 2.

Fig. 1 is the basic curve of the Hilbert curve in two-dimensional space.

Fig. 2 is the second-order curve of the Hilbert curve in two-dimensional space.

The Hilbert curve is a space-filling curve. As shown in Fig. 1, it divides the space into 4 blocks. The numbers along the curve give the corresponding mapping from the two-dimensional coordinate space to the one-dimensional space.

As shown in Fig. 2, the whole space is divided into 16 regions, each quarter block being further divided into 4 smaller blocks. Although each region is cut into smaller units, the curve remains continuous. The Hilbert curve is injective and preserves the original spatial relationships: points that are close in the d-dimensional space become binary numbers after mapping and remain close in the one-dimensional space. However, not every point maintains this spatial relationship after mapping.
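The two-dimensional mapping described above can be reproduced with the widely used iterative conversion from grid coordinates to a Hilbert index. The sketch below is the standard uniform curve, not the non-uniform variant of the invention; the function name `xy2d` is illustrative, n is the grid side (2 for the basic curve, 4 for the second-order curve), and the orientation of the curve may differ from that drawn in Fig. 1 and Fig. 2.

```python
def xy2d(n, x, y):
    """Map cell (x, y) of an n x n grid (n a power of two) to its
    position d along the Hilbert curve, 0 <= d < n * n."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


# Basic curve: the four cells are visited in the order 0, 1, 2, 3.
print([xy2d(2, x, y) for x, y in [(0, 0), (0, 1), (1, 1), (1, 0)]])

# Second-order curve: recover the visiting order of the 16 cells.
order = sorted((xy2d(4, x, y), (x, y)) for x in range(4) for y in range(4))
print([cell for _, cell in order])
```

Consecutive positions along the curve always belong to neighboring cells, which is the continuity property mentioned above; the converse does not hold, since two neighboring cells can be far apart along the curve.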

The present invention uses the locality-preserving property of the Hilbert curve to map the LSH index structure to the DHT name space. Note that retrieval precision is determined by the LSH method; the Hilbert curve serves only as a mapping method and does not affect the effectiveness or efficiency of nearest-neighbor search in the DHT network.

It should be understood that the metric space considered by the present invention is Euclidean space, in which the distance between data is computed as the Euclidean distance.

Embodiment one

This embodiment provides a method for constructing an index for a structured peer-to-peer network. Referring to Fig. 3, the method flow provided by this embodiment is as follows:

S1: given an index data set D under the l₂ norm, hash function index parameters (W, k, l) are selected.

Here W denotes the threshold (upper bound) of the uniform-distribution interval, k denotes the number of hash operations performed on each piece of index data, and l denotes the number of hash tables.

S2: according to the following formula, all index data in the index data set D are mapped into the buckets of l hash tables. Each piece of index data undergoes k hash operations and enters a bucket labelled by a k-dimensional vector in each hash table; each bucket is identified by a k-dimensional vector, and each hash table comprises a series of buckets:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

where h(v) denotes the hash function, v denotes a piece of index data in the index data set D, a denotes a vector following a stable distribution, a₁...a_k denote k different vectors drawn from the stable distribution, B denotes a uniformly distributed value, B₁...B_k denote k different values drawn from the uniform distribution, and g(v) denotes a k-dimensional bucket vector.

S3: for each bucket in the l hash tables, the l₂ norm ||p||₂ of a random point p is calculated.

The l₂ norm of the random point p is calculated according to the following formula:

\|p\|_2 = (|p_1|^2 + |p_2|^2 + \ldots + |p_k|^2)^{\frac{1}{2}},

where ||p||₂ denotes the l₂ norm of p, p denotes a k-dimensional bucket vector, and p₁...p_k denote the individual dimensions of the k-dimensional bucket vector.

S4: the normal distribution of the index data set D is estimated from the l₂ norm ||p||₂ of the random point p.

The normal distribution of the index data set D is estimated according to the following formula:

N(\frac{k}{2}, \frac{\sqrt{k} \|p\|_2}{W}),

where N(·,·) denotes a normal distribution, k denotes the number of hash operations performed on each piece of index data, ||p||₂ denotes the l₂ norm of p, and W denotes the threshold (upper bound) of the uniform-distribution interval.

S5: according to said normal distribution, the bucket space is divided into a conventional region and a sparse region, where the conventional region accounts for 80% of the total region and the sparse region for 20%.

S6: according to the conventional region and the sparse region, each bucket in the l hash tables is mapped to a key of the one-dimensional distributed hash table.

Specifically, l Chord rings are generated from the l hash tables, and each key is inserted in turn into the nodes of the peer-to-peer network according to the Chord routing protocol, using the consistent hashing algorithm.
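Steps S3 to S5 can be sketched as follows. The text does not state whether the second parameter of N(·,·) is a standard deviation or a variance, nor how the 80%/20% split is measured; this sketch assumes a standard deviation and takes the conventional region to be the central interval holding 80% of the probability mass. The function names and the scalar passed to `region_of` are hypothetical.

```python
import math
from statistics import NormalDist


def l2_norm(p):
    """S3: ||p||_2 = (|p_1|^2 + ... + |p_k|^2)^(1/2)."""
    return math.sqrt(sum(abs(x) ** 2 for x in p))


def estimate_distribution(p, k, W):
    """S4: N(k/2, sqrt(k) * ||p||_2 / W).
    Assumption: the second parameter is used as a standard deviation,
    so p must not be the zero vector."""
    return NormalDist(mu=k / 2, sigma=math.sqrt(k) * l2_norm(p) / W)


def conventional_interval(dist, mass=0.8):
    """S5 (assumed reading): central interval containing `mass` of the
    probability; everything outside it is the sparse region."""
    tail = (1.0 - mass) / 2
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)


def region_of(value, interval):
    """Classify a scalar coordinate as conventional or sparse."""
    lo, hi = interval
    return "conventional" if lo <= value <= hi else "sparse"


dist = estimate_distribution((3.0, 4.0), k=2, W=5.0)
interval = conventional_interval(dist)
print(interval, region_of(1.0, interval), region_of(100.0, interval))
```

Under these assumptions the interval is symmetric about k/2, and its width grows with ||p||₂ and shrinks as W increases.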

Steps S3 to S6 above are the key points of the present invention and are explained below. In a distributed hash table network organized as a Chord ring, nodes are distributed approximately uniformly, and node identifiers are provided by SHA-1. Because the mapping results produced by the Hilbert curve are not uniformly distributed, load balancing cannot be guaranteed. There are two reasons for this load imbalance:

(1) The Hilbert curve maps points of the multidimensional space uniformly onto the one-dimensional curve space, which leads to an uneven distribution of IDs in the DHT name space.

(2) The hash function hashes index data into intervals of length W. For the normally distributed integer space, splitting by length W creates a large number of small intervals, and hence a large number of small rectangular cells that fill the whole region. The elements h_{a,B}(p) of the k-dimensional bucket vector g(p) are rarely mapped to the sparse edge regions of the normal distribution. In the sparse regions, however, the density of the Hilbert curve is still very high, so the curve over the edge regions is wasted.

The method provided by this embodiment improves the Hilbert curve by making it non-uniform. Because each h(p) follows a normal distribution that depends on the point p itself, the distribution of the sum of k normal distributions, N(\frac{k}{2}, \frac{\sqrt{k} \|p\|_2}{W}), is used to estimate the normal distribution of the k-dimensional bucket vector g(p), so that the bucket space is divided into two regions: a conventional region and a sparse region. Curves of different densities are used for the two regions. Using the non-uniform Hilbert curve, the present invention maps high-dimensional bucket vectors to the keys of the one-dimensional distributed hash table, thereby realizing approximate retrieval of high-dimensional data and solving the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.
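The text does not spell out how the curves of different density are constructed. One possible reading, shown here purely as an illustration for k = 2, is to quantize each coordinate with fine cells inside the conventional interval and coarse cells in the sparse tails, and then apply an ordinary Hilbert index to the resulting cell coordinates. The cell counts, the clipping range [x_min, x_max] and all function names are assumptions of this sketch, not details from the patent.

```python
def hilbert_index(n, x, y):
    """Standard Hilbert index of cell (x, y) on an n x n grid, n a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def nonuniform_cell(x, lo, hi, x_min, x_max, n_cells=16):
    """Map coordinate x to a cell index in [0, n_cells).
    Assumed layout: n_cells // 8 coarse cells per sparse tail, the rest
    (fine cells) inside the conventional interval [lo, hi].
    Requires x_min < lo < hi < x_max and n_cells >= 8."""
    tail_cells = n_cells // 8
    conv_cells = n_cells - 2 * tail_cells
    x = min(max(x, x_min), x_max)  # clip outliers into the grid
    if x < lo:
        frac = (x - x_min) / (lo - x_min)
        return min(int(frac * tail_cells), tail_cells - 1)
    if x <= hi:
        frac = (x - lo) / (hi - lo)
        return tail_cells + min(int(frac * conv_cells), conv_cells - 1)
    frac = (x - hi) / (x_max - hi)
    return tail_cells + conv_cells + min(int(frac * tail_cells), tail_cells - 1)


def bucket_to_key(bucket, lo, hi, x_min, x_max, n_cells=16):
    """Map a 2-dimensional bucket vector to a one-dimensional key."""
    cx = nonuniform_cell(bucket[0], lo, hi, x_min, x_max, n_cells)
    cy = nonuniform_cell(bucket[1], lo, hi, x_min, x_max, n_cells)
    return hilbert_index(n_cells, cx, cy)


print(bucket_to_key((0.2, -3.0), lo=-1.0, hi=1.0, x_min=-10.0, x_max=10.0))
```

With the values in the usage line, a conventional cell is about 0.17 wide while a tail cell is 4.5 wide, so most of the key space is spent where most buckets fall.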

Embodiment two

This embodiment provides an apparatus for constructing an index for a structured peer-to-peer network. The apparatus is used to perform the index construction method for a structured peer-to-peer network of Embodiment one above. Referring to Fig. 4, the apparatus comprises:

an index parameter selection module 1, configured to select, for a given index data set D under the l₂ norm, hash function index parameters (W, k, l).

A hash table generation module 2, connected to the index parameter selection module 1 and configured to map all index data in the index data set D into the buckets of l hash tables according to the following formula. Each piece of index data undergoes k hash operations and enters a bucket labelled by a k-dimensional vector in each hash table; each bucket is identified by a k-dimensional vector, and each hash table comprises a series of buckets:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

A norm calculation module 3, connected to the hash table generation module 2 and configured to calculate, for each bucket in the l hash tables, the l₂ norm ||p||₂ of a random point p in the bucket.

The l₂ norm of the random point p is calculated according to the following formula:

\|p\|_2 = (|p_1|^2 + |p_2|^2 + \ldots + |p_k|^2)^{\frac{1}{2}};

where p denotes a k-dimensional vector, and p₁, p₂, ..., p_k denote the individual dimensions of the k-dimensional vector.

A statistical characteristics calculation module 4, connected to the norm calculation module 3 and configured to estimate the normal distribution of the index data set D from the l₂ norm ||p||₂ of the random point p.

The normal distribution of the index data set D is estimated according to the following formula:

N(\frac{k}{2}, \frac{\sqrt{k} \|p\|_2}{W});

where k denotes the number of hash operations performed on each piece of index data, W denotes the threshold (upper bound) of the uniform-distribution interval, and ||p||₂ denotes the l₂ norm of the random point p.

A region division module 5, connected to the statistical characteristics calculation module 4 and the hash table generation module 2 and configured to divide the bucket space into a conventional region and a sparse region according to said normal distribution, where the conventional region accounts for 80% of the total region and the sparse region for 20%.

A bucket vector mapping module 6, connected to the region division module 5 and the hash table generation module 2 and configured to map, according to the conventional region and the sparse region, each bucket in the l hash tables to a key of the one-dimensional distributed hash table.

A peer-to-peer network mapping module 7, connected to the bucket vector mapping module 6 and configured to generate l Chord rings from the l hash tables and to insert each key in turn into the nodes of the peer-to-peer network according to the Chord routing protocol, using the consistent hashing algorithm.

The apparatus provided by this embodiment improves the Hilbert curve by making it non-uniform. The present invention proposes a scalable method for constructing an index for a structured peer-to-peer network, which performs nearest-neighbor search in high-dimensional massive data sets. A locality-sensitive hashing algorithm under the l_p norm is used to perform similarity search on data in a high-dimensional space, and a typical hash algorithm is combined with a non-uniform Hilbert curve to map the high-dimensional locality-sensitive hash bucket space to a one-dimensional DHT index space. The method of the present invention takes into account both the requirements of similarity retrieval and the maintenance of the structured peer-to-peer network: the index itself is locality-sensitive, and the DHT network provides load balancing. Approximate retrieval of high-dimensional data is thereby realized, solving the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.

Embodiment three

This embodiment provides a retrieval method for a structured peer-to-peer network. Referring to Fig. 5, the method proceeds as follows:

S11: select the hash function retrieval parameters (W, k, l) used for the index.

S12: according to the following formula, obtain the l k-dimensional bucket vectors corresponding to each point to be retrieved in the l hash tables:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

where v denotes the data to be retrieved; a denotes a vector obeying a stable distribution, and a₁...a_k denote k different vectors chosen from the vectors obeying the stable distribution; B denotes a vector obeying a uniform distribution, and B₁...B_k denote k different vectors chosen from the vectors obeying the uniform distribution; g(v) denotes one k-dimensional bucket vector.

S13: determine the region in which each point to be retrieved lies, according to the region division made in the index stage.

S14: according to the region in which each point to be retrieved lies, map the l k-dimensional bucket vectors corresponding to that point in the l hash tables to key values of the one-dimensional distributed hash table.

S15: locate the node of the peer-to-peer network according to the Chord routing protocol.

S16: within the same bucket vector in the hash table of that node, calculate the Euclidean distance from each point in the bucket to the point to be retrieved, and return the results to one node. From these results, the n points with the smallest Euclidean distance are chosen as the n nearest neighbors of the point to be retrieved.

Preferably, the calculated Euclidean distances may be sorted and the sorted results returned to an interface for display, so that the user can inspect them directly and choose from them the n nearest neighbors that meet the user's needs.
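The final selection step can be sketched as follows. This is a minimal illustration, assuming the candidate points of one bucket have already been collected at a single node; the function name `n_nearest` and the sample points are hypothetical.

```python
import heapq
import math

def n_nearest(query, bucket_points, n):
    """Return the n points of one bucket closest to `query` in Euclidean distance."""
    return heapq.nsmallest(n, bucket_points, key=lambda p: math.dist(query, p))

# Hypothetical 2-dimensional candidates found in the query's bucket.
candidates = [(3.0, 4.0), (1.0, 1.0), (0.0, 2.0), (5.0, 5.0)]
print(n_nearest((0.0, 0.0), candidates, 2))
```

`heapq.nsmallest` returns the results already ordered by distance, which matches the optional sorting step described above.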

The method provided by this embodiment achieves approximate retrieval of high-dimensional data and solves the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.

Embodiment four

This embodiment provides a retrieval device for a structured peer-to-peer network. Referring to Fig. 6, the device includes:

The retrieval parameter selection module 11, which selects the hash function retrieval parameters (W, k, l) used for the index.

The bucket vector acquisition module 12, which is connected to the retrieval parameter selection module and obtains, according to the following formula, the l k-dimensional bucket vectors corresponding to each point to be retrieved in the l hash tables:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

The region locating module 13, which is connected to the retrieval parameter selection module 11 and the bucket vector acquisition module 12 and determines the region in which each point to be retrieved lies, according to the region division made in the index stage.

The bucket vector mapping module 14, which is connected to the retrieval parameter selection module 11 and the region locating module 13 and, according to the region in which each point to be retrieved lies, maps the l k-dimensional bucket vectors corresponding to that point in the l hash tables to key values of the one-dimensional distributed hash table.

The node locating module 15, which is connected to the bucket vector mapping module 14 and locates the node of the peer-to-peer network according to the Chord routing protocol.

The Euclidean distance calculation module 16, which is connected to the node locating module 15, calculates Euclidean distances within the same bucket vector in the hash table of that node, and returns the results to one node. From these results, the n points with the smallest Euclidean distance are chosen as the n nearest neighbors of the point to be retrieved.

The device provided by this embodiment achieves approximate retrieval of high-dimensional data and solves the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.

Fig. 7 is a two-dimensional non-uniform Hilbert curve diagram for a concrete example.

The technical solution is illustrated by the following example:

First, 5 index data items v of dimension 16 are given, namely v₁, v₂, …, v₅, as shown in Table 1 below.

Table 1:

1	-1.31	-1.35	-0.20	0.67	1.03	0.89	1.44	-1.10	-0.03	-0.86	1.53	-1.09	0.09	-0.62	-1.40	1.42
2	-0.43	3.03	-0.12	-1.21	0.73	-1.15	0.33	-0.24	-0.16	0.08	-0.77	0.03	-1.49	0.75	-1.42	0.29
3	0.34	0.73	1.49	0.72	-0.30	-1.07	-0.75	0.32	0.63	-1.21	0.37	0.55	-0.74	-0.19	0.49	0.20
4	3.58	-0.06	1.41	1.63	0.29	-0.81	1.37	0.31	1.09	-1.11	-0.23	1.10	-1.06	0.89	-0.18	1.59
5	2.77	0.71	1.42	0.49	-0.79	-2.94	-1.71	-0.86	1.11	-0.01	1.12	1.54	2.35	-0.76	-0.20	-0.80

Next, the hash function index parameters are chosen: W=2, k=8, l=1.

Eight vectors a are drawn at random from the vectors obeying a normal distribution, as shown in Table 2 below. These are eight 16-dimensional vectors, one vector a per row, denoted a₁, a₂, …, a₈.

Table 2:

1	-0.20	-1.21	2.91	0.83	1.38	-1.06	-0.47	-0.27	1.10	-0.28	0.70	-2.05	-0.35	-0.82	-1.58	0.51
2	0.28	0.03	-1.33	1.13	0.35	-0.30	0.02	-0.26	-1.75	-0.29	-0.83	-0.98	-1.16	-0.53	-2.00	0.96
3	0.52	-0.02	-0.03	-0.80	1.02	-0.13	-0.71	1.35	-0.22	-0.59	-0.29	-0.85	-1.12	2.53	1.66	0.31
4	-1.26	-0.87	-0.18	0.79	-1.33	-2.33	-1.45	0.33	0.39	0.45	-0.13	0.18	-0.48	0.86	-1.36	0.46
5	-0.85	-0.33	0.55	1.04	-1.12	1.26	0.66	-0.07	-0.20	-0.22	-0.30	0.02	0.05	0.83	1.53	0.47
6	-0.21	0.63	0.18	-1.03	0.95	0.31	0.14	0.52	0.26	-0.94	-0.16	-0.15	-0.53	1.68	-0.88	-0.48
7	-0.71	-1.17	-0.19	-0.27	1.53	-0.25	-1.06	1.60	1.23	-0.23	-1.51	-0.44	-0.16	0.28	-0.26	0.44
8	0.39	-1.25	-0.95	-0.74	-0.51	-0.32	0.01	-3.03	-0.46	1.24	-1.07	0.93	0.35	-0.03	0.18	-1.57

Eight values B are drawn at random from the values uniformly distributed on [0, W], as shown in Table 3 below. They are denoted B₁, B₂, …, B₈.

Table 3:

1	1.61
2	0.48
3	0.98
4	1.15
5	1.77
6	0.34
7	0.37
8	0.06

Then, according to the hash function h_{a,B}(v), k hash functions are formed for each index data item, giving the hash function family g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v)), where k is 8. After these steps, the 5 index data items fall into 5 k-dimensional vector buckets in the 1 hash table. Each index data item yields one bucket with components g₁, g₂, …, g₈, giving 5 eight-dimensional bucket vectors, as shown in Table 4 below.

Table 4:

1	10.23	6.86	-2.37	0.60	2.41	1.87	2.80	-4.14
2	-0.81	5.24	2.99	2.11	-4.77	7.11	0.12	-1.97
3	5.85	-1.40	2.54	3.68	1.18	1.06	0.79	-4.38
4	6.31	2.58	4.68	-0.24	1.42	1.97	0.26	-4.25
5	6.51	-5.04	-4.61	6.60	-5.12	-3.01	-3.36	5.97

Table 4 above constitutes one index hash table. Because l=1 was chosen, only 1 table is generated.
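The bucket computation can be sketched in code. This is a minimal sketch that assumes the standard p-stable form h_{a,B}(v) = floor((a·v + B) / W), which this section does not spell out; Table 4 lists unrounded values, so the floor step in particular is an assumption. The function names, the seed and the sample vector are hypothetical, and the random draws do not reproduce Tables 2 and 3.

```python
import math
import random

def make_hash_family(dim, k, W, seed=0):
    """Draw k Gaussian vectors a_j (2-stable) and k offsets B_j uniform on [0, W]."""
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]
    B = [rng.uniform(0.0, W) for _ in range(k)]
    return a, B

def bucket_vector(v, a, B, W):
    """g(v): one hash value per (a_j, B_j) pair, i.e. a k-dimensional bucket vector."""
    return [
        math.floor((sum(x * y for x, y in zip(a_j, v)) + b_j) / W)
        for a_j, b_j in zip(a, B)
    ]

W, k, dim = 2.0, 8, 16                 # W and k as in the example above
a, B = make_hash_family(dim, k, W)
v = [0.1 * i for i in range(dim)]      # hypothetical 16-dimensional index data item
print(bucket_vector(v, a, B, W))
```

For l > 1, the same construction would be repeated with l independent draws of (a, B), producing one bucket vector per hash table.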

Next, these 5 index data items are mapped to key values of the one-dimensional distributed hash table according to the non-uniform Hilbert curve.

For ease of explanation, only the first 2 of the 8 dimensions of the 5 bucket vectors are used to demonstrate the procedure, as shown in Table 5 below:

Table 5:

1	10.23	6.86
2	-0.81	5.24
3	5.85	-1.40
4	6.31	2.58
5	6.51	-5.04

Here, the invention takes the above 5 bucket vectors as the random points.

In the region division of this example, k=2 (k=8 in the previous step; only the first 2 dimensions are taken here for simplicity) and W=2. The l₂ norm of each random point p is calculated according to the following formula:

\|p\|_2 = \left( |p_1|^2 + |p_2|^2 + \cdots + |p_k|^2 \right)^{\frac{1}{2}};

The l₂ norms of the random points p are averaged, giving 7.74.
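The averaged norm can be checked directly from the five two-dimensional bucket vectors of Table 5. The variable and function names below are illustrative only.

```python
import math

# The five 2-dimensional bucket vectors from Table 5.
bucket_vectors = [
    (10.23, 6.86),
    (-0.81, 5.24),
    (5.85, -1.40),
    (6.31, 2.58),
    (6.51, -5.04),
]

def l2_norm(p):
    """Square root of the sum of squared components."""
    return math.sqrt(sum(abs(x) ** 2 for x in p))

norms = [l2_norm(p) for p in bucket_vectors]
mean_norm = sum(norms) / len(norms)
print(round(mean_norm, 2))  # 7.74
```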

The corresponding normal distribution is calculated to be N(1, 3.87); it is plotted in Fig. 8 for ease of inspection.

Fig. 8 is the normal distribution area plot for this concrete example. The abscissa is the one-dimensional data value, and the ordinate is the percentage of the total area taken up by the cumulative area. The region division in each dimension is therefore as follows:

The dividing points between the conventional region and the sparse region are -2.20 and 4.20. The sparse region is (-∞, -2.20) and (4.20, +∞), and it is treated as only a single square for Hilbert curve fitting. The conventional region is (-2.20, 4.20), within which squares of side W=2 are used for Hilbert curve fitting. This gives the following non-uniform Hilbert curve fitting:

Each dimension is divided into 6 regions: (-∞, -3), (-3, -1), (-1, 1), (1, 3), (3, 5), (5, +∞).
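The per-dimension division can be sketched as a lookup against the five interior boundaries. This is only an illustration of the cell assignment: the 0-based cell numbering and the handling of values that fall exactly on a boundary (the source lists open intervals) are assumptions, and the subsequent non-uniform Hilbert encoding is not reproduced here.

```python
import bisect

BOUNDARIES = [-3, -1, 1, 3, 5]  # interior boundaries of the 6 regions of one dimension

def region_index(x):
    """Return 0..5: 0 is (-inf, -3), 1 is (-3, -1), ..., 5 is (5, +inf)."""
    return bisect.bisect_right(BOUNDARIES, x)

# The five 2-dimensional bucket vectors from Table 5.
points = [(10.23, 6.86), (-0.81, 5.24), (5.85, -1.40), (6.31, 2.58), (6.51, -5.04)]
cells = [tuple(region_index(x) for x in p) for p in points]
print(cells)  # [(5, 5), (2, 5), (5, 1), (5, 3), (5, 0)]
```

Six cells per dimension need 3 bits each, so a two-dimensional cell needs 6 bits in total.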

It can be seen that a 6-bit Hilbert code is required. The 5 data items after mapping are shown in Table 6.

Table 6:

1	011011
2	10110
3	100001
4	11101
5	100100
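For orientation, the sketch below shows the textbook uniform Hilbert mapping from a two-dimensional grid cell to a one-dimensional index. It is not the non-uniform variant used in this example and is not expected to reproduce the codes in Table 6; the function name `xy2d` and the 4x4 grid are illustrative assumptions.

```python
def xy2d(n, x, y):
    """Map cell (x, y) of an n-by-n grid (n a power of two) to its Hilbert index."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate or flip the quadrant so the next level sees a canonical orientation.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d

n = 4
order = sorted(((x, y) for x in range(n) for y in range(n)), key=lambda c: xy2d(n, *c))
print(order[:8])
```

The useful property for a DHT index is locality: cells that are consecutive along the curve are always neighbors in the grid, so nearby buckets tend to receive nearby keys.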

The above steps complete the construction of the index.

The technical solution of the retrieval method of the invention is as follows:

Likewise, a 16-dimensional data item v to be retrieved is taken and passed through the same steps as in index construction. The node of the peer-to-peer network is then located according to the Chord routing protocol.

Next, the Euclidean distance is calculated within the same bucket vector in the hash table of that node. If a result lies in the same bucket, it is the nearest-neighbor data found; if it lies in no bucket, the search returns no nearest-neighbor result.

Embodiment five

This embodiment provides a retrieval system for a structured peer-to-peer network. Referring to Fig. 9, the system includes a device 20 for building an index and a retrieval device 30.

The device 20 for building an index is the index-building device provided in Embodiment two above.

The retrieval device 30 is the retrieval device provided in Embodiment four above.

The system provided by this embodiment improves the Hilbert curve by making it non-uniform. Because each h(p) obeys a normal distribution that depends on the point p itself, the distribution of the sum of k normal distributions is used to estimate the normal distribution of the k-dimensional bucket vector g(p). The bucket space is thereby divided into two regions, a conventional region and a sparse region, and curves of different density are used for the two regions. The invention uses the non-uniform Hilbert curve to map high-dimensional bucket vectors to key values of the one-dimensional distributed hash table. It thereby achieves approximate retrieval of high-dimensional data and solves the problems in the prior art that, as the spatial dimension rises, space-filling curves cannot be used for nearest-neighbor retrieval over a high-dimensional index, and that structured peer-to-peer networks do not support keyword retrieval queries.
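One way to picture the split into a conventional and a sparse region is to take the central interval of the estimated normal distribution that holds 80% of the probability mass. The sketch below does this with the standard library. Treating the conventional region as the central 80%, treating the second distribution parameter as a standard deviation, and the function name `central_interval` are all assumptions not confirmed by the text; the standard normal parameters are placeholders.

```python
from statistics import NormalDist

def central_interval(mu, sigma, mass=0.80):
    """Return (low, high) so that the interval holds `mass` of N(mu, sigma), centered on mu."""
    tail = (1.0 - mass) / 2.0
    dist = NormalDist(mu, sigma)
    return dist.inv_cdf(tail), dist.inv_cdf(1.0 - tail)

low, high = central_interval(0.0, 1.0)   # placeholder parameters: standard normal
print(round(low, 3), round(high, 3))     # -1.282 1.282
```

Values inside the interval would fall in the conventional region and be covered by a dense curve; values outside it would fall in the sparse region and share coarse cells.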

It should be noted that, when the index-building device of the above embodiment builds an index and when the retrieval device performs retrieval, the division into the functional modules described above is only an example. In practice, the functions may be allocated to different functional modules as required; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the index-building device of the above embodiment belongs to the same concept as the index-building method embodiment, and the retrieval device of the above embodiment belongs to the same concept as the retrieval method embodiment. For their specific implementation, refer to the method embodiments, which are not repeated here.

With the above technical solution, the invention extends the locality-sensitive hashing algorithm to a distributed structured peer-to-peer network, improving retrieval speed while retaining the retrieval precision of the original centralized locality-sensitive hashing algorithm.

The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.

All or part of the steps in the embodiments of the invention may be implemented in hardware, or by a program instructing the relevant hardware. The program may be stored in a readable storage medium, such as an optical disc or a hard disk.

The foregoing are only preferred embodiments of the invention and are not intended to limit it. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within the scope of protection of the invention.

Claims

1. A method of building an index for a structured peer-to-peer network, characterized by comprising:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

S3: according to the norm formula, calculate the l₂ norm \|p\|_2 of the random point p in the bucket:

\|p\|_2 = \left( |p_1|^2 + |p_2|^2 + \cdots + |p_k|^2 \right)^{\frac{1}{2}};

N\left( \frac{k}{2}, \frac{\sqrt{k} \, \|p\|_2}{W} \right);

2. The method of building an index for a structured peer-to-peer network according to claim 1, characterized in that the conventional region accounts for 80% of the total area and the sparse region accounts for 20% of the total area.

3. A device for building an index for a structured peer-to-peer network, characterized in that the device includes:

an index parameter selection module (1), for selecting, given an l₂-dimensional index data set D, the hash function index parameters (W, k, l);

a hash table generation module (2), for mapping all the index data in the index data set D into the buckets of l hash tables according to the following formula:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

a norm calculation module (3), for calculating the l₂ norm \|p\|_2 of the random point p in the bucket according to the following formula:

\|p\|_2 = \left( |p_1|^2 + |p_2|^2 + \cdots + |p_k|^2 \right)^{\frac{1}{2}};

a data statistical characteristics calculation module (4), for calculating the data statistical characteristics of the index data set D according to the following formula:

N\left( \frac{k}{2}, \frac{\sqrt{k} \, \|p\|_2}{W} \right);

a region division module (5), for dividing the bucket space into a conventional region and a sparse region according to the data statistical characteristics;

a bucket vector mapping module (6), for mapping the bucket vectors to key values of the one-dimensional distributed hash table according to the conventional region and the sparse region; and

a peer-to-peer network mapping module (7), for inserting each key value in sequence into the nodes of the peer-to-peer network according to the Chord routing protocol.

4. The device for building an index for a structured peer-to-peer network according to claim 3, characterized in that the conventional region accounts for 80% of the total area and the sparse region accounts for 20% of the total area.

5. A retrieval method for a structured peer-to-peer network, characterized by comprising:

S11: select the hash function retrieval parameters (W, k, l) used for the index;

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

S16: within the same bucket vector in the hash table of the node, calculate the Euclidean distance from each point in the bucket to the query point, and return the Euclidean distance results to one node.

6. The retrieval method for a structured peer-to-peer network according to claim 5, characterized in that the region includes a sparse region and a conventional region, the conventional region accounts for 80% of the total area, and the sparse region accounts for 20% of the total area.

7. A retrieval device for a structured peer-to-peer network, characterized in that the device includes:

a retrieval parameter selection module (11), for selecting the hash function retrieval parameters (W, k, l) used for the index;

a bucket vector acquisition module (12), for obtaining, according to the following formula, the l k-dimensional bucket vectors corresponding to each point to be retrieved in the l hash tables:

g(v) = (h_{a_1, B_1}(v), \ldots, h_{a_k, B_k}(v));

a region locating module (13), for determining the region in which each point to be retrieved lies, according to the region division made in the index stage;

a bucket vector mapping module (14), for mapping, according to the region in which each point to be retrieved lies, the l k-dimensional bucket vectors corresponding to that point in the l hash tables to key values of the one-dimensional distributed hash table;

a node locating module (15), for locating the node of the peer-to-peer network according to the Chord routing protocol; and

a Euclidean distance calculation module (16), for calculating Euclidean distances within the same bucket vector in the hash table of the node and returning the Euclidean distance results to one node.

8. The retrieval device for a structured peer-to-peer network according to claim 7, wherein the region includes a sparse region and a conventional region, the conventional region accounts for 80% of the total area, and the sparse region accounts for 20% of the total area.

9. A retrieval system for a structured peer-to-peer network, characterized in that the system includes a device (20) for building an index for a structured peer-to-peer network and a retrieval device (30) for a structured peer-to-peer network;

the device for building an index for a structured peer-to-peer network is the index-building device according to any one of claims 3 to 5;

the retrieval device for a structured peer-to-peer network is the retrieval device according to claim 7.