CN1220159C - High-dimensional vector data quick similar search method - Google Patents


Info

Publication number: CN1220159C (application published as CN1477563A)
Authority: CN (China)
Application number: CN 03129687
Inventors: 董道国 (Dong Daoguo), 薛向阳 (Xue Xiangyang)
Assignee: Fudan University
Classification: Information Retrieval; DB Structures and FS Structures Therefor
Legal status: Expired - Fee Related
Abstract

The present invention relates to a fast similarity search method for high-dimensional vector data. A new index structure, the ordered VA-File, is proposed: the approximate vectors in a VA-File are sorted and reorganized so that data clustered together in the high-dimensional space are stored, as far as possible, at adjacent positions in the file. The ordered VA-File is then adaptively partitioned into a certain number of classes according to the practical application, with the data of each class stored contiguously in the file. During a query, only the few classes closest to the query vector are selected for processing, which greatly improves query efficiency. Moreover, the number of classes to be searched can be adjusted to meet different precision requirements on the query result. The method greatly reduces disk-access cost and improves query efficiency.

Description

A fast similarity search method for high-dimensional vector data
Technical field
The invention belongs to data processing fields such as multimedia information retrieval, data mining, and cluster analysis, and specifically relates to a method for fast similarity search over high-dimensional vector data.
Background art
Over the past decade, similarity retrieval of high-dimensional vector data has occupied an increasingly important position in fields such as multimedia information retrieval, data mining, and cluster analysis. In many practical applications in these fields, one wishes to quickly find, in a massive multimedia database, the k objects most similar or most relevant to a given query object — the so-called k-nearest-neighbour (k-NN) query.
The technical route for realizing a k-NN query is as follows: extract a (usually high-dimensional) feature vector from each multimedia object in the database, using this vector to describe the content of the corresponding object; this yields a feature-vector database. The feature vector of the query object is extracted with the same feature-extraction algorithm, and the similarity (or relevance) between any two objects is measured by the distance between their feature vectors. A k-NN similarity query therefore amounts to searching the feature-vector database for the k vectors closest to the query vector; the multimedia objects corresponding to these k vectors are the desired result of the k-NN similarity query.
In practice, the dimensionality of the feature vectors describing object content ranges from tens to hundreds or even thousands. The simplest and most direct way to realize a k-NN query is sequential scan (SScan): read each feature vector in the database in turn, compute its distance to the query vector, and keep the k vectors with the smallest distances as the final query result. When the volume of feature data is large, the full database must reside on disk, so SScan incurs heavy disk-I/O and CPU computation cost. To accelerate queries and improve search efficiency, the most common approach is to reduce these costs with an index structure.
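As a concrete illustration (not part of the patent text), the SScan baseline described above can be sketched in Python. The function name `sscan_knn`, the toy data, and the choice of Euclidean distance are illustrative assumptions:

```python
import heapq
import math

def sscan_knn(database, query, k):
    """Sequential scan (SScan): read every feature vector, compute its
    distance to the query vector, and keep the k smallest distances."""
    heap = []  # max-heap via negated distances, holds the current best k
    for idx, vec in enumerate(database):
        d = math.dist(vec, query)  # Euclidean distance (Python 3.8+)
        if len(heap) < k:
            heapq.heappush(heap, (-d, idx))
        elif d < -heap[0][0]:
            heapq.heapreplace(heap, (-d, idx))
    # return (distance, index) pairs sorted by increasing distance
    return sorted((-nd, idx) for nd, idx in heap)

db = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.5, 0.0]]
result = sscan_knn(db, [0.0, 0.0], 2)
print(result)  # nearest two vectors are at indices 0 and 3
```

Every vector is touched once, which is exactly the disk-I/O and CPU cost the index structures below try to avoid.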
To enable fast similarity retrieval of high-dimensional vectors, many high-dimensional index structures have been proposed. The R-Tree family has attracted the most attention among multidimensional index structures. In the 1980s, Guttman extended the B+-Tree to multiple dimensions and proposed the R-Tree [1], which manages data with a tree structure: each internal node stores the minimal bounding rectangle (MBR: Minimal Bounding Rectangle) of all data in that node, and true data appear only in the leaf nodes. Reference [7] gives a k-NN search algorithm based on the R-Tree: starting from the root node, it prunes a large number of paths by computing the minimal distance MINDIST and the minimal maximal distance MINMAXDIST between the query vector and each MBR, visiting only subtrees that may contain results, thereby reducing disk-access cost. However, serious overlap and dead-space phenomena between R-Tree internal nodes hurt R-Tree-based search efficiency. To improve R-Tree performance, the R+-Tree [5], R*-Tree [4], X-Tree [6], SS-Tree [2], SR-Tree [3], and others were proposed in succession, but the query performance of these tree indexes degrades rapidly as the data dimensionality grows; beyond about 20 dimensions it can be even worse than SScan — the so-called "curse of dimensionality".
Besides tree index structures, Weber et al. proposed the vector-approximation file (VA-File) in [7]. It divides each dimension of the data space into several quantization intervals, so the whole data space is split into a large number of non-overlapping cells; each cell can be represented by a short binary bit string that takes very little storage space. Every high-dimensional data vector falls into exactly one cell, and the bit string of that cell is used to approximate the original data; this bit string is called the approximate vector. The approximate vectors, arranged in the original data order, form the VA-File. A similarity query on a VA-File proceeds in two steps: first, every approximate vector in the file is scanned, and a large number of data objects that cannot satisfy the query condition are filtered out using the distance between the query vector and the cell each approximate vector represents, keeping a candidate set; then the original vectors of all candidates are read and their distances computed exactly, yielding the final k-NN query result. As can be seen, the VA-File merely compresses the original vectors with a simple quantization approximation to reduce disk-I/O cost, and uses no complex data structure to organize and manage the quantized approximate vectors. For the cluster-distributed data found in many practical applications, the first step often filters out only a small fraction of the data, so the query cost of the second step remains large. Later, to further improve search efficiency, [8] gave an algorithm for approximate k-NN retrieval based on the VA-File: among the results obtained in the first step, it directly returns the k objects with the smallest lower-bound distances to the query vector as the result set, without performing an exact k-NN search or reading the raw data. Because the distance computed in the first step is only an approximation, the result of such a query is an approximate k-NN.
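The quantization step at the heart of the VA-File can be sketched as follows (an illustrative assumption, not the patent's exact encoding; the function `quantize` and its parameters are hypothetical names):

```python
def quantize(vec, bits_per_dim, lo, hi):
    """Map each coordinate to the index of its quantization interval.
    The tuple of small integers is the 'approximate vector' that a
    VA-File would pack into a short bit string."""
    levels = 1 << bits_per_dim  # number of intervals per dimension
    approx = []
    for x, a, b in zip(vec, lo, hi):
        cell = int((x - a) / (b - a) * levels)
        approx.append(min(max(cell, 0), levels - 1))  # clamp x == b edge case
    return tuple(approx)

# 2 bits per dimension -> 4 intervals over [0, 1) in each dimension
approx = quantize([0.1, 0.6, 0.9], 2, [0.0] * 3, [1.0] * 3)
print(approx)  # (0, 2, 3)
```

With b bits per dimension a d-dimensional vector shrinks to b·d bits, which is why scanning the approximate file is so much cheaper than scanning the raw vectors.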
List of references
1. Guttman A. "R-Trees: A dynamic index structure for spatial searching", Proc. ACM SIGMOD Int. Conf. on Management of Data, Boston, MA, 1984: 47-57.
2. White D.A., Jain R. "Similarity indexing with the SS-Tree", Proc. 12th Int. Conf. on Data Engineering, New Orleans, LA, 1996.
3. N. Katayama and S. Satoh. "The SR-Tree: An index structure for high-dimensional nearest neighbor queries", Proc. of the ACM SIGMOD Int. Conf. on Management of Data, Tucson, Arizona, USA, 1997: 369-380.
4. N. Beckmann, H.P. Kriegel, R. Schneider and B. Seeger. "The R*-Tree: an efficient and robust access method for points and rectangles", Proc. of the SIGMOD Conf., Atlantic City, NJ, June 1990: 322-331.
5. Sellis T., Roussopoulos N. and Faloutsos C. "The R+-Tree: A dynamic index for multidimensional objects", Proc. 13th Int. Conf. on Very Large Databases, Brighton, England, 1987: 507-518.
6. Stefan Berchtold, Daniel A. Keim, and Hans-Peter Kriegel. "The X-Tree: An index structure for high-dimensional data", Proc. of the 22nd VLDB Conference, 1996: 28-39.
7. Roger Weber, Hans-J. Schek, Stephen Blott, "A Quantitative Analysis and Performance Study for Similarity Search Methods in High-Dimensional Spaces", Proc. of the 24th VLDB Conference, New York, USA, 1998.
8. R. Weber, K. Böhm, "Trading Quality for Time with Nearest Neighbor Search", Proc. of the 7th Conf. on Extending Database Technology, Konstanz, Germany, March 2000.
Summary of the invention
The object of the invention is to propose an improved approximate k-NN search algorithm for the VA-File, to further accelerate similarity search over high-dimensional vector data.
The high-dimensional vector similarity search method proposed by the invention builds on the approximate k-NN search algorithm for the VA-File: the VA-File is sorted and reorganized according to a certain rule, so that during an approximate k-NN query it is no longer necessary, as with the plain VA-File, to scan and compute over all approximate vectors — only a portion of them is scanned and computed, further accelerating approximate k-NN retrieval. We call the VA-File after this sorting and reorganization the Ordered VA-File.
The steps of the invention are as follows. (1) Organize the approximate vectors in the VA-File by a chosen rule: each approximate vector to be inserted is placed at the position that minimizes its mean distance to the approximate vectors near that position, so that data clustered together in the high-dimensional space are stored, as far as possible, at adjacent positions in the file; any distance measure may be used for the distance between vectors. (2) Perform cluster segmentation on the sorted approximate vectors to obtain a number of class-center vectors. The approximate vectors of each class are stored contiguously in the Ordered VA-File; distances between approximate vectors within a class are small, while distances between classes are large. The class-center vectors are kept in main memory; each class-center vector represents one contiguous run of approximate vectors, i.e. it is associated with a (start position, end position) pair in the sorted approximate-vector file. (3) At retrieval time, first compute the distance between the query vector and each class-center vector; since the class-center vectors reside in main memory, this distance computation requires no disk access. (4) For the classes whose centers are closest, obtain their start and end positions in the Ordered VA-File and scan the approximate vectors stored contiguously between those positions to obtain the approximate query result. Because the whole approximate-vector file is no longer scanned, the disk-access cost is significantly reduced and query speed is improved.
Embodiment
In the invention, the Ordered VA-File can be built by sorting and reorganizing the VA-File with the following insertion algorithm. To insert a new vector into the Ordered VA-File, first quantize and compress the vector to obtain its corresponding approximate vector, then proceed according to whether the current Ordered VA-File has already undergone cluster segmentation:
1) If cluster segmentation has not yet been performed, search the whole Ordered VA-File for the position at which the mean distance between the inserted approximate vector and the m approximate vectors before and after that position is minimal, and insert the new vector at that position.
2) If cluster segmentation has already been performed, first compute the distance between the new vector and each class center, then search within the class of minimal distance for the position at which the mean distance between the inserted approximate vector and the m approximate vectors before and after that position is minimal, and insert the vector at that position.
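A minimal sketch of the position search used in both cases (the function name `best_insert_position` and the toy 1-D data are assumptions for illustration; the window approximates "the m approximate vectors before and after the position"):

```python
import math

def best_insert_position(ordered, new_vec, m):
    """Return the index at which new_vec should be inserted so that its
    mean distance to the (up to) m approximate vectors on either side
    of that position is minimal."""
    best_pos, best_mean = 0, float("inf")
    for pos in range(len(ordered) + 1):
        window = ordered[max(pos - m, 0):pos + m]  # neighbours around pos
        if not window:
            continue
        mean = sum(math.dist(new_vec, v) for v in window) / len(window)
        if mean < best_mean:
            best_pos, best_mean = pos, mean
    return best_pos

data = [[0.0], [0.1], [5.0]]
pos = best_insert_position(data, [4.9], m=1)
data.insert(pos, [4.9])
print(pos, data)  # 3 [[0.0], [0.1], [5.0], [4.9]]
```

The scan is linear in the file size; in case 2) it would run only over the slice belonging to the closest class.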
In the invention, the built Ordered VA-File can be clustered with the following simple cluster-segmentation algorithm. The Ordered VA-File stores vectors that are close in the high-dimensional space at nearby positions as far as possible. Before querying, the Ordered VA-File must be divided into a certain number N of classes; the approximate vectors of each class occupy consecutive positions in the Ordered VA-File, distances between approximate vectors within a class should be small, and distances between classes large. A query then needs to examine only the approximate vectors of a few classes, achieving fast similarity retrieval. Denote the Ordered VA-File containing n vectors as (v_0, v_1, ..., v_{n-1}); it is divided into N classes as follows:
1) Compute the distance between each approximate vector and the one at the preceding position, denoted (dist_1, dist_2, ..., dist_{n-1}), where dist_i = D(v_i, v_{i-1}) for i > 0, and D is a distance metric between vectors.
2) For each vector, compute the ratio between the distance obtained above and the sum of the distances of its m preceding and m following gaps, denoted (ratio_1, ratio_2, ..., ratio_{n-1}), where ratio_i = dist_i / Σ_{max(i-m,0) ≤ j ≤ min(i+m,n-1), j ≠ i} dist_j, for i > 0.
3) Select the positions corresponding to the N-1 largest ratios computed in the previous step; taking these N-1 positions as cut points, divide the whole Ordered VA-File into N classes.
4) Take the mean of the approximate vectors in each class as the class center.
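The four steps above can be sketched in Python (a toy implementation under stated assumptions: 1-D data, Euclidean distance, the 1-based dist_i of the text mapped to a 0-based list; `segment` is a hypothetical name):

```python
import math

def segment(ordered, N, m):
    """Cut a sorted approximate-vector file into N contiguous classes at
    the N-1 gaps whose distance is largest relative to the sum of the
    neighbouring gaps (steps 1-4 above)."""
    n = len(ordered)
    # step 1: distance of each vector to its predecessor
    dist = [math.dist(ordered[i], ordered[i - 1]) for i in range(1, n)]
    # step 2: ratio of each gap to the sum of its 2m neighbouring gaps
    ratio = []
    for i in range(len(dist)):
        lo, hi = max(i - m, 0), min(i + m, len(dist) - 1)
        neigh = sum(dist[j] for j in range(lo, hi + 1) if j != i)
        ratio.append(dist[i] / neigh if neigh else float("inf"))
    # step 3: the N-1 largest ratios mark the cut points
    cuts = sorted(sorted(range(len(ratio)), key=ratio.__getitem__)[-(N - 1):])
    bounds = [0] + [c + 1 for c in cuts] + [n]
    classes = [ordered[bounds[k]:bounds[k + 1]] for k in range(N)]
    # step 4: class centre = mean of the vectors in each class
    centers = [[sum(col) / len(cls) for col in zip(*cls)] for cls in classes]
    return classes, centers

data = [[0.0], [0.1], [0.2], [5.0], [5.1], [9.0], [9.2]]
classes, centers = segment(data, N=3, m=2)
print([len(c) for c in classes])  # [3, 2, 2]
```

On this toy file the two widest gaps (0.2→5.0 and 5.1→9.0) become the cut points, so the three natural clusters each land in their own contiguous class.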
In the invention, the approximate k-NN query algorithm is as follows. The information of each class of the Ordered VA-File (the class-center vector and the class's start and end positions in the Ordered VA-File) is kept in main memory at all times. An approximate k-NN query then does not need to scan the whole file; only a small fraction of the approximate vectors need be read for distance computation. The steps are:
1) Compute the distance between the query vector and each class center, and select the L closest classes;
2) Compute the distances between the query vector and the approximate vectors in the L selected classes, and take the objects corresponding to the k approximate vectors with the smallest distances as the result set of the approximate k-NN query.
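These two steps can be sketched as follows (illustrative only; `approx_knn` is a hypothetical name, and the classes/centers reuse the toy segmentation above in spirit):

```python
import heapq
import math

def approx_knn(classes, centers, query, k, L):
    """Approximate k-NN on a segmented Ordered VA-File: rank the class
    centres (held in main memory, so no disk access), scan only the L
    closest classes, and return the k vectors nearest the query."""
    order = sorted(range(len(centers)),
                   key=lambda c: math.dist(query, centers[c]))
    candidates = [(math.dist(query, v), v)
                  for c in order[:L] for v in classes[c]]
    return [v for _, v in heapq.nsmallest(k, candidates)]

classes = [[[0.0], [0.1], [0.2]], [[5.0], [5.1]], [[9.0], [9.2]]]
centers = [[0.1], [5.05], [9.1]]
result = approx_knn(classes, centers, [5.2], k=2, L=1)
print(result)  # [[5.1], [5.0]]
```

Only the two vectors of the single selected class are scanned, instead of all seven — this is exactly where the disk-access saving over a full VA-File scan comes from.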
Since this algorithm computes distances only over the L classes closest to the query vector, its efficiency is far higher than the traditional approximate k-NN search algorithm on the VA-File; and because the L selected classes are usually the ones nearest the query vector, they are the most likely to contain the query vector's k nearest neighbours, so the approximate result set retains comparatively high quality.
In the invention, the relevant parameters are determined as follows:
1) Choice of N: the larger N is, the fewer data each class can contain and the smaller the query cost, but the more main-memory space the class information kept in memory requires. The principle for choosing N is therefore: select N as large as the main-memory capacity available to the search system in the practical application allows. Two extreme cases can arise under this principle. (1) Memory is large enough and N is set to the database size n; this is equivalent to reading the whole Ordered VA-File into memory, the query performs no disk read or write at all, and search efficiency is maximal. (2) Memory is minimal and N is set to 1, i.e. the entire Ordered VA-File forms one class; a query must then scan all approximate vectors from disk, and search efficiency equals that of the original VA-File. This shows that in the worst case the Ordered VA-File performs the same as the traditional VA-File;
2) Choice of L: during a query, only the approximate vectors of the L classes nearest the query vector are examined, so the larger L is, the higher the precision of the query result and the larger the query cost; the smaller L is, the lower the precision and the smaller the corresponding query cost. L can therefore be adapted to the user's search requirements: choose a larger L when precision matters more, and a smaller L when efficiency matters more. In the extreme case L = N, the query's efficiency and result are exactly those of the traditional VA-File.
In summary, the invention sorts and reorganizes a VA-File into an Ordered VA-File that supports cluster segmentation and fast querying, achieving very high search efficiency tuned to the user's actual demands. In the worst case — minimal memory combined with the highest precision requirement — the efficiency of the Ordered VA-File equals that of the traditional VA-File.

Claims (3)

1. A method for fast similarity search over high-dimensional vector data, characterized by the following concrete steps:
(1) sort and reorganize the approximate vectors in a vector-approximation file according to the following rule: each approximate vector to be inserted is placed at the position minimizing the mean distance between it and the m approximate vectors before and after that position; the vector-approximation file after this sorting is called the ordered vector-approximation file;
(2) perform cluster segmentation on the ordered vector-approximation file, the approximate vectors of each class being stored contiguously in the ordered vector-approximation file, with small distances between approximate vectors within a class and large distances between classes; take the mean of the data of each class as the class-center vector, and keep all class-center vectors in main memory;
(3) at retrieval time, first compute the distance between the query vector and each class-center vector;
(4) select the L classes with the smallest distances, compute the distances between the query vector and the approximate vectors in the L selected classes, and take the objects corresponding to the k approximate vectors with the smallest distances as the result set of the approximate k-nearest-neighbour query,
wherein in step (2) the ordered vector-approximation file is cluster-segmented by the following algorithm: denote the ordered vector-approximation file containing n vectors as (v_0, v_1, ..., v_{n-1}); it is divided into N classes as follows:
A. compute the distance between each approximate vector and the one at the preceding position, denoted (dist_1, dist_2, ..., dist_{n-1}), where dist_i = D(v_i, v_{i-1}) for i > 0, and D is a distance metric between vectors;
B. for each vector, compute the ratio between the distance obtained above and the sum of the distances of its m preceding and m following gaps, denoted (ratio_1, ratio_2, ..., ratio_{n-1}), where ratio_i = dist_i / Σ_{max(i-m,0) ≤ j ≤ min(i+m,n-1), j ≠ i} dist_j, for i > 0;
C. select the positions corresponding to the N-1 largest ratios computed in the previous step; taking these N-1 positions as cut points, divide the whole ordered vector-approximation file into N classes;
D. take the mean of the approximate vectors in each class as the class center.
2. The search method according to claim 1, characterized in that the number of class divisions N is determined by the following principle: select N as large as the main-memory capacity available to the search system in the practical application allows.
3. The search method according to claim 1, characterized in that the number of classes L visited during k-nearest-neighbour similarity search is determined by the following principle: if the precision requirement on the query result is high and query cost is not a concern, select a larger L; if search efficiency matters more, select a smaller L.
Application CN 03129687 (priority date 2003-07-03, filing date 2003-07-03), "High-dimensional vector data quick similar search method", granted as CN1220159C; status: Expired - Fee Related.

Priority Applications (1)

CN 03129687 — priority date 2003-07-03, filing date 2003-07-03 — CN1220159C (en): High-dimensional vector data quick similar search method

Publications (2)

CN1477563A — published 2004-02-25
CN1220159C — granted 2005-09-21



Legal Events

- C06 / PB01: Publication
- C10 / SE01: Entry into substantive examination
- C14 / GR01: Patent granted (granted publication date: 2005-09-21)
- C17 / CF01: Patent right terminated due to non-payment of annual fee (termination date: 2010-07-03)