CN1220159C - High-dimensional vector data quick similar search method - Google Patents


Info

Publication number: CN1220159C (application published as CN1477563A)
Authority: CN (China)
Application number: CN 03129687
Inventors: 董道国 (Dong Daoguo), 薛向阳 (Xue Xiangyang)
Assignee: Fudan University
Classification: Information Retrieval; DB Structures and FS Structures Therefor
Legal status: Expired - Fee Related
Abstract

The present invention relates to a fast similarity search method for high-dimensional vector data. A new index structure, the ordered VA-File, is proposed: the approximate vectors in a VA-File are sorted and reorganized so that data clustered together in the high-dimensional space are stored, as far as possible, at adjacent positions in the file. The ordered VA-File is then adaptively partitioned into a certain number of classes according to the practical application, with the data of each class stored contiguously in the file. During a query, only the few classes closest to the query vector are selected for processing, which greatly improves query efficiency. Moreover, the number of classes to be searched can be adjusted to meet different precision requirements on the query result. The method greatly reduces disk-access cost and improves query efficiency.

Description

A fast similarity search method for high-dimensional vector data
Technical field
The invention belongs to data processing fields such as multimedia information retrieval, data mining, and cluster analysis, and specifically relates to a method for fast similarity search over high-dimensional vector data.
Background art
Over the past decade, similarity retrieval of high-dimensional vector data has occupied an increasingly important position in fields such as multimedia information retrieval, data mining, and cluster analysis. In many practical applications in these fields, one wishes to quickly find, in a massive multimedia database, the k objects most similar or most relevant to a given query object — the so-called k-nearest-neighbour (k-NN) query.
The technical route for realizing a k-NN query is as follows: extract a (usually high-dimensional) feature vector from each multimedia object in the database, using this vector to describe the content of the corresponding object; this yields a feature-vector database. The feature vector of the query object is extracted with the same feature-extraction algorithm, and the similarity (or relevance) between any two objects is measured by the distance between their feature vectors. A k-NN similarity query therefore amounts to searching the feature-vector database for the k vectors closest to the query vector; the multimedia objects corresponding to these k vectors are the desired result of the k-NN similarity query.
In practice, the dimensionality of the feature vectors describing object content ranges from tens to hundreds or even thousands. The simplest and most direct way to realize a k-NN query is sequential scan (SScan): read each feature vector in the database in turn, compute its distance to the query vector, and keep the k vectors with the smallest distances as the final query result. When the volume of feature data is large, the full database must reside on disk, so SScan incurs heavy disk-I/O and CPU computation cost. To accelerate queries and improve search efficiency, the most common approach is to reduce these costs with an index structure.
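As a concrete illustration (not part of the patent text), the SScan baseline described above can be sketched in Python. The function name `sscan_knn`, the toy data, and the choice of Euclidean distance are illustrative assumptions:

```python
import heapq
import math

def sscan_knn(database, query, k):
    """Sequential scan (SScan): read every feature vector, compute its
    distance to the query vector, and keep the k smallest distances."""
    heap = []  # max-heap via negated distances, holds the current best k
    for idx, vec in enumerate(database):
        d = math.dist(vec, query)  # Euclidean distance (Python 3.8+)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    # return (distance, index) pairs sorted by increasing distance
    return sorted((-nd, idx) for nd, idx in heap)

db = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
result = sscan_knn(db, [0.0, 0.0], 2)
print(result)  # nearest two vectors are at indices 0 and 3
```

Every vector is touched once, which is exactly the disk-I/O and CPU cost the index structures below try to avoid.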
To enable fast similarity retrieval of high-dimensional vectors, many high-dimensional index structures have been proposed. The R-Tree family has attracted the most attention among multidimensional index structures. In the 1980s, Guttman extended the B+-Tree to multiple dimensions and proposed the R-Tree [1], which manages data with a tree structure: each internal node stores the minimal bounding rectangle (MBR: Minimal Bounding Rectangle) of all data in that node, and true data appear only in the leaf nodes. Reference [7] gives a k-NN search algorithm based on the R-Tree: starting from the root node, it prunes a large number of paths by computing the minimal distance MINDIST and the minimal maximal distance MINMAXDIST between the query vector and each MBR, visiting only subtrees that may contain results, thereby reducing disk-access cost. However, serious overlap and dead-space phenomena between R-Tree internal nodes hurt R-Tree-based search efficiency. To improve R-Tree performance, the R+-Tree [5], R*-Tree [4], X-Tree [6], SS-Tree [2], SR-Tree [3], and others were proposed in succession, but the query performance of these tree indexes degrades rapidly as the data dimensionality grows; beyond about 20 dimensions it can be even worse than SScan — the so-called "curse of dimensionality".
Besides tree index structures, Weber et al. proposed the vector-approximation file (VA-File) in [7]. It divides each dimension of the data space into several quantization intervals, so the whole data space is split into a large number of non-overlapping cells; each cell can be represented by a short binary bit string that takes very little storage space. Every high-dimensional data vector falls into exactly one cell, and the bit string of that cell is used to approximate the original data; this bit string is called the approximate vector. The approximate vectors, arranged in the original data order, form the VA-File. A similarity query on a VA-File proceeds in two steps: first, every approximate vector in the file is scanned, and a large number of data objects that cannot satisfy the query condition are filtered out using the distance between the query vector and the cell each approximate vector represents, keeping a candidate set; then the original vectors of all candidates are read and their distances computed exactly, yielding the final k-NN query result. As can be seen, the VA-File merely compresses the original vectors with a simple quantization approximation to reduce disk-I/O cost, and uses no complex data structure to organize and manage the quantized approximate vectors. For the cluster-distributed data found in many practical applications, the first step often filters out only a small fraction of the data, so the query cost of the second step remains large. Later, to further improve search efficiency, [8] gave an algorithm for approximate k-NN retrieval based on the VA-File: among the results obtained in the first step, it directly returns the k objects with the smallest lower-bound distances to the query vector as the result set, without performing an exact k-NN search or reading the raw data. Because the distance computed in the first step is only an approximation, the result of such a query is an approximate k-NN.
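The quantization step at the heart of the VA-File can be sketched as follows (an illustrative assumption, not the patent's exact encoding; the function `quantize` and its parameters are hypothetical names):

```python
def quantize(vec, bits_per_dim, lo, hi):
    """Map each coordinate to the index of its quantization interval.
    The tuple of small integers is the 'approximate vector' that a
    VA-File would pack into a short bit string."""
    levels = 1 << bits_per_dim  # number of intervals per dimension
    approx = []
    for x, a, b in zip(vec, lo, hi):
        cell = int((x - a) / (b - a) * levels)
        approx.append(min(max(cell, 0), levels - 1))  # clamp x == b edge case
    return tuple(approx)

# 2 bits per dimension -> 4 intervals over [0, 1) in each dimension
approx = quantize([0.1, 0.6, 0.9], 2, [0.0] * 3, [1.0] * 3)
print(approx)  # (0, 2, 3)
```

With b bits per dimension a d-dimensional vector shrinks to b·d bits, which is why scanning the approximate file is so much cheaper than scanning the raw vectors.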
List of references
1. Guttman A. "R-Trees: A dynamic index structure for spatial searching", Proc. ACM SIGMOD Int. Conf. on Management of Data, Boston, MA, 1984: 47-57.
2. White D.A., Jain R. "Similarity indexing with the SS-Tree", Proc. 12th Int. Conf. on Data Engineering, New Orleans, LA, 1996.
3. N. Katayama and S. Satoh. "The SR-Tree: An index structure for high-dimensional nearest neighbor queries", Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, USA, 1997: 369-380.
4. N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger. "The R*-Tree: an efficient and robust access method for points and rectangles", Proc. of the SIGMOD Conf., Atlantic City, NJ, June 1990: 322-331.
5. Sellis T., Roussopoulos N. and Faloutsos C. "The R+-Tree: A dynamic index for multidimensional objects", Proc. 13th Int. Conf. on Very Large Databases, Brighton, England, 1987: 507-518.
6. Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. "The X-Tree: An index structure for high-dimensional data", Proc. of the 22nd VLDB Conference, 1996: 28-39.
7. Roger Weber, Hans-J. Schek, Stephen Blott, "A Quantitative Analysis and Performance Study for Similarity Search Methods in High-Dimensional Spaces", Proc. of the 24th VLDB Conference, New York, USA, 1998.
8. R. Weber, K. Böhm, "Trading Quality for Time with Nearest Neighbor Search", Proc. of the 7th Conf. on Extending Database Technology, Konstanz, Germany, March 2000.
Summary of the invention
The object of the invention is to propose an improved approximate k-NN search algorithm for the VA-File, to further accelerate similarity search over high-dimensional vector data.
The high-dimensional vector similarity search method proposed by the invention builds on the approximate k-NN search algorithm for the VA-File: the VA-File is sorted and reorganized according to a certain rule, so that during an approximate k-NN query it is no longer necessary, as with the plain VA-File, to scan and compute over all approximate vectors — only a portion of them is scanned and computed, further accelerating approximate k-NN retrieval. We call the VA-File after this sorting and reorganization the Ordered VA-File.
The steps of the invention are as follows. (1) Organize the approximate vectors in the VA-File by a chosen rule: each approximate vector to be inserted is placed at the position that minimizes its mean distance to the approximate vectors near that position, so that data clustered together in the high-dimensional space are stored, as far as possible, at adjacent positions in the file; any distance measure may be used for the distance between vectors. (2) Perform cluster segmentation on the sorted approximate vectors to obtain a number of class-center vectors. The approximate vectors of each class are stored contiguously in the Ordered VA-File; distances between approximate vectors within a class are small, while distances between classes are large. The class-center vectors are kept in main memory; each class-center vector represents one contiguous run of approximate vectors, i.e. it is associated with a (start position, end position) pair in the sorted approximate-vector file. (3) At retrieval time, first compute the distance between the query vector and each class-center vector; since the class-center vectors reside in main memory, this distance computation requires no disk access. (4) For the classes whose centers are closest, obtain their start and end positions in the Ordered VA-File and scan the approximate vectors stored contiguously between those positions to obtain the approximate query result. Because the whole approximate-vector file is no longer scanned, the disk-access cost is significantly reduced and query speed is improved.
Embodiment
In the invention, the Ordered VA-File can be built by sorting and reorganizing the VA-File with the following insertion algorithm. To insert a new vector into the Ordered VA-File, first quantize and compress the vector to obtain its corresponding approximate vector, then proceed according to whether the current Ordered VA-File has already undergone cluster segmentation:
1) If cluster segmentation has not yet been performed, search the whole Ordered VA-File for the position at which the mean distance between the inserted approximate vector and the m approximate vectors before and after that position is minimal, and insert the new vector at that position.
2) If cluster segmentation has already been performed, first compute the distance between the new vector and each class center, then search within the class of minimal distance for the position at which the mean distance between the inserted approximate vector and the m approximate vectors before and after that position is minimal, and insert the vector at that position.
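A minimal sketch of the position search used in both cases (the function name `best_insert_position` and the toy 1-D data are assumptions for illustration; the window approximates "the m approximate vectors before and after the position"):

```python
import math

def best_insert_position(ordered, new_vec, m):
    """Return the index at which new_vec should be inserted so that its
    mean distance to the (up to) m approximate vectors on either side
    of that position is minimal."""
    best_pos, best_mean = 0, float("inf")
    for pos in range(len(ordered) + 1):
        window = ordered[max(pos - m, 0):pos + m]  # neighbours around pos
        if not window:
            continue
        mean = sum(math.dist(new_vec, v) for v in window) / len(window)
        if mean < best_mean:
            best_pos, best_mean = pos, mean
    return best_pos

data = [[0.0], [0.1], [5.0]]
pos = best_insert_position(data, [4.9], m=1)
data.insert(pos, [4.9])
print(pos, data)  # 3 [[0.0], [0.1], [5.0], [4.9]]
```

The scan is linear in the file size; in case 2) it would run only over the slice belonging to the closest class.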
In the invention, the built Ordered VA-File can be clustered with the following simple cluster-segmentation algorithm. The Ordered VA-File stores vectors that are close in the high-dimensional space at nearby positions as far as possible. Before querying, the Ordered VA-File must be divided into a certain number N of classes; the approximate vectors of each class occupy consecutive positions in the Ordered VA-File, distances between approximate vectors within a class should be small, and distances between classes large. A query then needs to examine only the approximate vectors of a few classes, achieving fast similarity retrieval. Denote the Ordered VA-File containing n vectors as (v_0, v_1, ..., v_{n-1}); it is divided into N classes as follows:
1) Compute the distance between each approximate vector and the one at the preceding position, denoted (dist_1, dist_2, ..., dist_{n-1}), where dist_i = D(v_i, v_{i-1}) for i > 0, and D is a distance metric between vectors.
2) For each vector, compute the ratio between the distance obtained above and the sum of the distances of its m preceding and m following gaps, denoted (ratio_1, ratio_2, ..., ratio_{n-1}), where ratio_i = dist_i / Σ_{max(i-m,0) ≤ j ≤ min(i+m,n-1), j ≠ i} dist_j, for i > 0.
3) Select the positions corresponding to the N-1 largest ratios computed in the previous step; taking these N-1 positions as cut points, divide the whole Ordered VA-File into N classes.
4) Take the mean of the approximate vectors in each class as the class center.
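The four steps above can be sketched in Python (a toy implementation under stated assumptions: 1-D data, Euclidean distance, the 1-based dist_i of the text mapped to a 0-based list; `segment` is a hypothetical name):

```python
import math

def segment(ordered, N, m):
    """Cut a sorted approximate-vector file into N contiguous classes at
    the N-1 gaps whose distance is largest relative to the sum of the
    neighbouring gaps (steps 1-4 above)."""
    n = len(ordered)
    # step 1: distance of each vector to its predecessor
    dist = [math.dist(ordered[i], ordered[i - 1]) for i in range(1, n)]
    # step 2: ratio of each gap to the sum of its 2m neighbouring gaps
    ratio = []
    for i in range(len(dist)):
        lo, hi = max(i - m, 0), min(i + m, len(dist) - 1)
        neigh = sum(dist[j] for j in range(lo, hi + 1) if j != i)
        ratio.append(dist[i] / neigh if neigh else float("inf"))
    # step 3: the N-1 largest ratios mark the cut points
    cuts = sorted(sorted(range(len(ratio)), key=ratio.__getitem__)[-(N - 1):])
    bounds = [0] + [c + 1 for c in cuts] + [n]
    classes = [ordered[bounds[k]:bounds[k + 1]] for k in range(N)]
    # step 4: class centre = mean of the vectors in each class
    centers = [[sum(col) / len(cls) for col in zip(*cls)] for cls in classes]
    return classes, centers

data = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0], [9.2]]
classes, centers = segment(data, N=3, m=2)
print([len(c) for c in classes])  # [3, 2, 2]
```

On this toy file the two widest gaps (0.2→5.0 and 5.1→9.0) become the cut points, so the three natural clusters each land in their own contiguous class.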
In the invention, the approximate k-NN query algorithm is as follows. The information of each class of the Ordered VA-File (the class-center vector and the class's start and end positions in the Ordered VA-File) is kept in main memory at all times. An approximate k-NN query then does not need to scan the whole file; only a small fraction of the approximate vectors need be read for distance computation. The steps are:
1) Compute the distance between the query vector and each class center, and select the L closest classes;
2) Compute the distances between the query vector and the approximate vectors in the L selected classes, and take the objects corresponding to the k approximate vectors with the smallest distances as the result set of the approximate k-NN query.
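These two steps can be sketched as follows (illustrative only; `approx_knn` is a hypothetical name, and the classes/centers reuse the toy segmentation above in spirit):

```python
import heapq
import math

def approx_knn(classes, centers, query, k, L):
    """Approximate k-NN on a segmented Ordered VA-File: rank the class
    centres (held in main memory, so no disk access), scan only the L
    closest classes, and return the k vectors nearest the query."""
    order = sorted(range(len(centers)),
                   key=lambda c: math.dist(query, centers[c]))
    candidates = [(math.dist(query, v), v)
                  for c in order[:L] for v in classes[c]]
    return [v for _, v in heapq.nsmallest(k, candidates)]

classes = [[[0.0], [0.1], [0.2]], [[5.0], [5.1]], [[9.0], [9.2]]]
centers = [[0.1], [5.05], [9.1]]
result = approx_knn(classes, centers, [5.2], k=2, L=1)
print(result)  # [[5.1], [5.0]]
```

Only the two vectors of the single selected class are scanned, instead of all seven — this is exactly where the disk-access saving over a full VA-File scan comes from.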
Since this algorithm computes distances only over the L classes closest to the query vector, its efficiency is far higher than the traditional approximate k-NN search algorithm on the VA-File; and because the L selected classes are usually the ones nearest the query vector, they are the most likely to contain the query vector's k nearest neighbours, so the approximate result set retains comparatively high quality.
In the invention, the relevant parameters are determined as follows:
1) Choice of N: the larger N is, the fewer data each class can contain and the smaller the query cost, but the more main-memory space the class information kept in memory requires. The principle for choosing N is therefore: select N as large as the main-memory capacity available to the search system in the practical application allows. Two extreme cases can arise under this principle. (1) Memory is large enough and N is set to the database size n; this is equivalent to reading the whole Ordered VA-File into memory, the query performs no disk read or write at all, and search efficiency is maximal. (2) Memory is minimal and N is set to 1, i.e. the entire Ordered VA-File forms one class; a query must then scan all approximate vectors from disk, and search efficiency equals that of the original VA-File. This shows that in the worst case the Ordered VA-File performs the same as the traditional VA-File;
2) Choice of L: during a query, only the approximate vectors of the L classes nearest the query vector are examined, so the larger L is, the higher the precision of the query result and the larger the query cost; the smaller L is, the lower the precision and the smaller the corresponding query cost. L can therefore be adapted to the user's search requirements: choose a larger L when precision matters more, and a smaller L when efficiency matters more. In the extreme case L = N, the query's efficiency and result are exactly those of the traditional VA-File.
In summary, the invention sorts and reorganizes a VA-File into an Ordered VA-File that supports cluster segmentation and fast querying, achieving very high search efficiency tuned to the user's actual demands. In the worst case — minimal memory combined with the highest precision requirement — the efficiency of the Ordered VA-File equals that of the traditional VA-File.

Claims (3)

1. A method for fast similarity search over high-dimensional vector data, characterized by the following concrete steps:
(1) sort and reorganize the approximate vectors in a vector-approximation file according to the following rule: each approximate vector to be inserted is placed at the position minimizing the mean distance between it and the m approximate vectors before and after that position; the vector-approximation file after this sorting is called the ordered vector-approximation file;
(2) perform cluster segmentation on the ordered vector-approximation file, the approximate vectors of each class being stored contiguously in the ordered vector-approximation file, with small distances between approximate vectors within a class and large distances between classes; take the mean of the data of each class as the class-center vector, and keep all class-center vectors in main memory;
(3) at retrieval time, first compute the distance between the query vector and each class-center vector;
(4) select the L classes with the smallest distances, compute the distances between the query vector and the approximate vectors in the L selected classes, and take the objects corresponding to the k approximate vectors with the smallest distances as the result set of the approximate k-nearest-neighbour query,
wherein in step (2) the ordered vector-approximation file is cluster-segmented by the following algorithm: denote the ordered vector-approximation file containing n vectors as (v_0, v_1, ..., v_{n-1}); it is divided into N classes as follows:
A. compute the distance between each approximate vector and the one at the preceding position, denoted (dist_1, dist_2, ..., dist_{n-1}), where dist_i = D(v_i, v_{i-1}) for i > 0, and D is a distance metric between vectors;
B. for each vector, compute the ratio between the distance obtained above and the sum of the distances of its m preceding and m following gaps, denoted (ratio_1, ratio_2, ..., ratio_{n-1}), where ratio_i = dist_i / Σ_{max(i-m,0) ≤ j ≤ min(i+m,n-1), j ≠ i} dist_j, for i > 0;
C. select the positions corresponding to the N-1 largest ratios computed in the previous step; taking these N-1 positions as cut points, divide the whole ordered vector-approximation file into N classes;
D. take the mean of the approximate vectors in each class as the class center.
2. The search method according to claim 1, characterized in that the number of class divisions N is determined by the following principle: select N as large as the main-memory capacity available to the search system in the practical application allows.
3. The search method according to claim 1, characterized in that the number of classes L visited during k-nearest-neighbour similarity search is determined by the following principle: if the precision requirement on the query result is high and query cost is not a concern, select a larger L; if search efficiency matters more, select a smaller L.
Application CN 03129687 (priority date 2003-07-03, filing date 2003-07-03), "High-dimensional vector data quick similar search method", granted as CN1220159C; status: Expired - Fee Related.

Priority Applications (1)

CN 03129687 — priority date 2003-07-03, filing date 2003-07-03 — CN1220159C (en): High-dimensional vector data quick similar search method

Publications (2)

CN1477563A — published 2004-02-25
CN1220159C — granted 2005-09-21



Legal Events

- C06 / PB01: Publication
- C10 / SE01: Entry into substantive examination
- C14 / GR01: Patent granted (granted publication date: 2005-09-21)
- C17 / CF01: Patent right terminated due to non-payment of annual fee (termination date: 2010-07-03)