CN110909197A - High-dimensional feature processing method and device - Google Patents

High-dimensional feature processing method and device Download PDF

Info

Publication number
CN110909197A
CN110909197A (Application CN201911066022.7A)
Authority
CN
China
Prior art keywords
clustering
clusters
hnsw
dimensional features
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911066022.7A
Other languages
Chinese (zh)
Inventor
陈晓东
朱金华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen ZNV Technology Co Ltd
Nanjing ZNV Software Co Ltd
Original Assignee
Shenzhen ZNV Technology Co Ltd
Nanjing ZNV Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen ZNV Technology Co Ltd, Nanjing ZNV Software Co Ltd filed Critical Shenzhen ZNV Technology Co Ltd
Priority to CN201911066022.7A priority Critical patent/CN110909197A/en
Publication of CN110909197A publication Critical patent/CN110909197A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/55 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Library & Information Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

A method and a device for processing high-dimensional features. The method comprises: selecting sample features and coarsely quantizing them to generate a plurality of cluster centers; adding each high-dimensional feature to be inserted into the HNSW cluster of its nearest cluster center; and calculating the distances between a target feature and the cluster centers, performing an HNSW-algorithm search within a preset number of the closest clusters, and returning a retrieval result. In the embodiments of the application, coarse quantization is used for clustering, which suits large-scale distributed shard storage and allows elastic scaling of mass storage; searching with the HNSW algorithm in the preset number of closest clusters improves the recall rate; and combining the advantages of vector quantization and HNSW reduces the computational cost and time of data insertion and retrieval.

Description

High-dimensional feature processing method and device
Technical Field
The present application relates to image retrieval, and in particular, to a method and an apparatus for processing high-dimensional features.
Background
Deep learning is now widely applied in many fields, particularly in application scenarios involving face recognition. The public security department of a large city collects real-time video through a large number of deployed video surveillance systems, and the high-dimensional face features extracted by a face recognition system can reach the tens of millions each day and the tens of billions each year. Existing high-dimensional feature storage and indexing schemes designed on product quantization alone or on HNSW alone perform poorly when searching feature libraries at that scale: completing second-level queries over a face feature library of billions of entries while maintaining a high recall rate is nearly impossible with the prior art.
Disclosure of Invention
The application provides a method and a device for processing high-dimensional features.
According to a first aspect of the present application, there is provided a method for processing high-dimensional features, comprising:
selecting sample features, and coarsely quantizing the sample features to generate a plurality of cluster centers;
adding each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle;
and calculating the distances between a target feature and the cluster centers, performing an HNSW-algorithm search in a preset number of the closest clusters, and returning a retrieval result.
Further, the coarsely quantizing the sample features comprises:
clustering the sample features by a K-Means method.
Further, the adding each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle comprises:
calculating the distances between the high-dimensional feature to be inserted and the cluster centers, and finding the nearest cluster center by distance ranking;
and inserting the high-dimensional feature to be inserted into the HNSW cluster corresponding to the nearest cluster center.
Further, the method further comprises:
sorting the retrieval results in the preset number of clusters in ascending order of distance to the target feature and returning them.
According to a second aspect of the present application, there is provided a processing apparatus for high-dimensional features, comprising:
a coarse quantization module, configured to select sample features, coarsely quantize the sample features, and generate a plurality of cluster centers;
a clustering module, configured to add each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle;
and a retrieval module, configured to calculate the distances between a target feature and the cluster centers, perform an HNSW-algorithm search in a preset number of the closest clusters, and return a retrieval result.
Further, the coarse quantization module comprises a clustering unit;
the clustering unit is configured to cluster the sample features by a K-Means method.
Further, the clustering module comprises:
a search unit, configured to calculate the distances between the high-dimensional feature to be inserted and the cluster centers and to find the nearest cluster center by distance ranking;
and an insertion unit, configured to insert the high-dimensional feature to be inserted into the HNSW cluster corresponding to the nearest cluster center.
Further, the apparatus further comprises:
a return module, configured to sort the retrieval results in the preset number of clusters in ascending order of distance to the target feature and return them.
According to a third aspect of the present application, there is provided a processing apparatus for high-dimensional features, comprising:
a memory for storing a program;
a processor for implementing the above method by executing the program stored in the memory.
According to a fourth aspect of the present application, there is provided a computer readable storage medium comprising a program executable by a processor to implement the above method.
Due to the adoption of the above technical solution, the beneficial effects of the present application are as follows:
In the embodiments of the application, the method comprises selecting sample features and coarsely quantizing them to generate a plurality of cluster centers; adding each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle; and calculating the distances between a target feature and the cluster centers, performing an HNSW-algorithm search in a preset number of the closest clusters, and returning a retrieval result. Clustering by coarse quantization suits large-scale distributed shard storage and allows elastic scaling of mass storage; searching with the HNSW algorithm in the preset number of closest clusters improves the recall rate; and combining the advantages of vector quantization and HNSW reduces the computational cost and time of data insertion and retrieval.
Drawings
FIG. 1 is a flow chart of a method in one implementation of one embodiment of the present application;
FIG. 2 is a schematic diagram of NSW storage and retrieval;
FIG. 3 is a schematic diagram of HNSW storage and retrieval;
FIG. 4 is a basic schematic diagram of a storage retrieval scheme combining vector quantization and HNSW;
FIG. 5 is a flow chart of a method in another implementation of the first embodiment of the present application;
FIG. 6 is a schematic diagram of program modules of an apparatus according to a second embodiment of the present application;
fig. 7 is a schematic diagram of program modules of an apparatus according to another embodiment of the second embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the following detailed description and accompanying drawings. The present application may be embodied in many different forms and is not limited to the embodiments described herein. The following detailed description is provided to facilitate a thorough understanding of the present disclosure; words indicating orientation, such as top, bottom, left, and right, are used solely to describe the illustrated structures in connection with the accompanying figures.
One skilled in the relevant art will recognize, however, that one or more of the specific details can be omitted, or other methods, components, or materials can be used. In some instances, some embodiments are not described or not described in detail.
The numbering of the components as such, e.g., "first", "second", etc., is used herein only to distinguish the objects as described, and does not have any sequential or technical meaning.
Furthermore, the technical features, aspects or characteristics described herein may be combined in any suitable manner in one or more embodiments. It will be readily appreciated by those of skill in the art that the order of the steps or operations of the methods associated with the embodiments provided herein may be varied. Thus, any sequence in the figures and examples is for illustrative purposes only and does not imply a requirement in a certain order unless explicitly stated to require a certain order.
The first embodiment is as follows:
As shown in fig. 1, one embodiment of the method for processing high-dimensional features of the present application includes the following steps:
Step 102: select sample features and coarsely quantize them to generate a plurality of cluster centers. A sample feature in this application is a high-dimensional feature used as a sample.
Step 104: add each high-dimensional feature to be inserted into the corresponding HNSW (Hierarchical Navigable Small World graph) cluster according to the nearest-distance principle.
Step 106: calculate the distances between the target feature and the cluster centers, perform an HNSW-algorithm search in a preset number of the closest clusters, and return a retrieval result. The target feature in this application is the high-dimensional feature to be retrieved.
HNSW is an optimized version of NSW (Navigable Small World graphs). The naive NSW graph-building algorithm is as follows: as shown in fig. 2, points are inserted one by one; for each new point, the m nearest points (m is set by the user) are found by naive search, and the new point is connected to those m points to form the graph, where 21 is the entry point and 22 is the query point.
The solid lines connect neighboring points, and the dashed lines form the "highway mechanism". If the search starts from point 21, the query point 22 can be reached quickly through the dashed "highway" links. Points inserted earlier are more likely to form such long-range "highway" connections, while points inserted later are less likely to.
As shown in fig. 3, the HNSW graph-building algorithm is similar to a skip-list structure, where 31 is the entry point, 32 is the query point, A is layer 0, B is layer 1, and C is layer 2. All points in the data set are in layer 0. The highest layer a point reaches is computed, given a set constant ml, by the formula floor(-ln(uniform(0,1)) × ml), where × denotes multiplication, floor() rounds down, uniform(0,1) draws a random value from the uniform distribution on (0,1), and ln() is the natural logarithm. When a point is inserted, its layer is computed first; then, in the NSW graph of each layer from that layer down, its t nearest neighbors are found and connected, and this is done for every layer's graph.
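The layer-assignment formula above can be sketched in a few lines of Python. This is an illustrative sketch: the function name `random_level` and the default for ml are not from the patent.

```python
import math
import random

def random_level(ml: float = 1.0) -> int:
    """Pick the highest layer for a new HNSW point.

    With u drawn from uniform(0, 1), -ln(u) is exponentially
    distributed, so floor(-ln(u) * ml) puts most points in layer 0
    and exponentially fewer in each higher layer, mirroring the
    skip-list structure described in the text.
    """
    # 1.0 - random.random() lies in (0, 1], which avoids log(0).
    u = 1.0 - random.random()
    return math.floor(-math.log(u) * ml)
```

With ml = 0 every point stays in layer 0; larger ml values yield taller hierarchies.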
The search starts from any point on the top layer (labeled C in fig. 3); the algorithm greedily traverses elements of that layer until a local minimum is reached. The search then switches to the next lower layer (which has shorter links) and repeats, starting from the element that was the local minimum in the previous layer. Visited points are recorded in a visited list so that they are not evaluated again later.
By adopting this layered structure, HNSW reduces the computational complexity of NSW from polylogarithmic to logarithmic.
The basic principle of the storage and retrieval scheme combining vector quantization and HNSW is shown in fig. 4. First, K-Means is used to coarsely quantize a batch of sample features, obtaining N cluster centers C1, C2, … CN; each high-dimensional feature is then compared against the distance to every center and assigned to its nearest cluster center, forming N HNSW clusters.
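The K-Means coarse-quantization step can be sketched as below. This is a minimal version of Lloyd's iterations with a deterministic initialization chosen for reproducibility; a production system would use a library implementation with random or k-means++ seeding, which the patent does not specify.

```python
import numpy as np

def kmeans_centers(samples, n_centers, iters=20):
    """Minimal K-Means (Lloyd iterations) for the coarse-quantization
    step: returns n_centers cluster centers learned from sample features.
    Deterministic init (first n_centers samples) is used for clarity only."""
    x = np.asarray(samples, dtype=float)
    centers = x[:n_centers].copy()
    for _ in range(iters):
        # Assign every sample to its nearest center (Euclidean distance).
        d = np.linalg.norm(x[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of the samples assigned to it.
        for c in range(n_centers):
            pts = x[labels == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
    return centers
```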
The construction process is: in the first step, for each high-dimensional feature to be added, its distances to all cluster centers are calculated, and the nearest cluster center is found by distance ranking; in the second step, the feature is inserted into the HNSW cluster corresponding to that center, with the insertion proceeding exactly as in ordinary HNSW construction. The distance between a feature in a cluster and that cluster's center is therefore no greater than its distance to the center of any other cluster.
The retrieval process is: in the first step, for the target feature being searched, the distances to all cluster centers are calculated and the nearest cluster center is found; in the second step, the nearest neighboring features are searched with the HNSW algorithm within the cluster corresponding to that nearest center. In this two-step process, the first step quickly locates the subspace of interest via the cluster centers, greatly reducing the number of search computations; the second step searches within the region of interest following the HNSW principle, applying the skip-list and neighbor-graph ideas to reduce the search complexity to the logarithmic level.
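The two-step retrieval described above can be illustrated with a small sketch. For brevity, a brute-force scan stands in for the per-cluster HNSW search, and the names here (`coarse_then_fine_search`, the `clusters` mapping) are illustrative, not from the patent.

```python
import numpy as np

def coarse_then_fine_search(query, centers, clusters, k=5):
    """Two-step search: pick the nearest coarse center, then search only
    inside that center's cluster. A brute-force scan stands in here for
    the per-cluster HNSW search described in the text."""
    # Step 1: locate the subspace of interest via the cluster centers.
    nearest = int(np.argmin(np.linalg.norm(centers - query, axis=1)))
    # Step 2: search only the features stored under that center.
    members = clusters[nearest]                      # (m, d) array
    d = np.linalg.norm(members - query, axis=1)
    order = np.argsort(d)[:k]
    return nearest, members[order], d[order]
```

The payoff is that step 2 touches only one cluster's features instead of the whole library.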
As shown in fig. 5, another embodiment of the method for processing high-dimensional features of the present application includes the following steps:
Step 502: select sample features.
A portion of all high-dimensional features is selected as samples; the samples should preferably be representative, reflecting the distribution characteristics of the features. For example, in one embodiment, taking high-dimensional face features as an example, about 1/20 of the high-dimensional features may be selected as samples; the proportion of selected samples to all high-dimensional features may also be set according to experience.
Step 504: train the sample features with K-Means (the K-means clustering algorithm) to generate N cluster centers. The number N of cluster centers may be an empirical value or may be set as desired.
Step 506: calculate the distances between the high-dimensional feature to be inserted and each cluster center, and find the cluster center closest to it by distance ranking.
Step 508: insert the high-dimensional feature into the HNSW cluster corresponding to the cluster center closest to it.
Step 510: calculate the distances between the target feature and each cluster center, search for the nearest neighboring features with the HNSW algorithm in the K closest clusters, and return the retrieval results. K is the preset number of clusters to be searched. In one embodiment, the search may be performed in parallel in the K closest clusters.
Step 512: sort the retrieval results from the K clusters, specifically in ascending order of distance to the target feature, and return the sorted results.
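Steps 502 to 512 can be tied together in a compact sketch. The class name `CoarseHNSWStore` is illustrative; plain Python lists with brute-force distance stand in for the per-cluster HNSW graphs, and `k_clusters` plays the role of K above.

```python
import numpy as np

class CoarseHNSWStore:
    """Sketch of the flow in steps 502-512: coarse centers, insertion of
    each feature into its nearest cluster, then a search over the K
    closest clusters with results merged and sorted by distance."""

    def __init__(self, centers):
        self.centers = np.asarray(centers, dtype=float)
        self.clusters = {i: [] for i in range(len(self.centers))}

    def insert(self, feat):
        # Steps 506-508: nearest center by distance, append to its cluster.
        i = int(np.argmin(np.linalg.norm(self.centers - feat, axis=1)))
        self.clusters[i].append(np.asarray(feat, dtype=float))
        return i

    def search(self, target, k_clusters=2, topn=3):
        # Steps 510-512: rank centers, probe the K nearest clusters,
        # merge hits and sort ascending by distance to the target.
        order = np.argsort(np.linalg.norm(self.centers - target, axis=1))
        hits = []
        for ci in order[:k_clusters]:
            for feat in self.clusters[int(ci)]:
                hits.append((float(np.linalg.norm(feat - target)), feat))
        hits.sort(key=lambda h: h[0])
        return hits[:topn]
```

Probing K > 1 clusters is what recovers recall lost when a true neighbor falls just inside an adjacent cluster.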
Example two:
as shown in fig. 6, the high-dimensional feature processing apparatus 600 of the present application, in one embodiment, may include a coarse quantization module 610, a clustering module 620, and a retrieval module 630.
The coarse quantization module 610 is configured to select sample features, perform coarse quantization on the sample features, and generate a plurality of clustering centers;
a clustering module 620, configured to add each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle;
and a retrieval module 630, configured to calculate the distances between the target feature and the cluster centers, perform a search in a preset number of the closest clusters, and return a retrieval result.
As shown in fig. 7, another embodiment of a processing apparatus 700 for high-dimensional features of the present application may include a coarse quantization module, a clustering module, a retrieval module, and a return module.
The coarse quantization module 710 is configured to select sample features, perform coarse quantization on the sample features, and generate a plurality of clustering centers;
a clustering module 720, configured to add each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle;
a retrieval module 730, configured to calculate the distances between the target feature and the cluster centers, perform a search in a preset number of the closest clusters, and return a retrieval result;
and a return module 740, configured to sort the retrieval results in the preset number of clusters in ascending order of distance to the target feature and return them.
Further, the coarse quantization module 710 may include a clustering unit 711; the clustering unit 711 is configured to cluster the sample features by a K-Means method.
Further, the clustering module 720 may include a search unit 721 and an insertion unit 722.
The search unit 721 is configured to calculate the distances between the high-dimensional feature and the cluster centers, and to find the nearest cluster center by distance ranking;
and the insertion unit 722 is configured to insert the high-dimensional feature into the HNSW cluster corresponding to the nearest cluster center.
Example three:
the processing apparatus of the high-dimensional features of the present application, in one embodiment, includes a memory and a processor.
A memory for storing a program;
and the processor is used for executing the program stored in the memory to realize the method in the first embodiment.
Example four:
a computer-readable storage medium comprising a program executable by a processor to perform the method of the first embodiment.
Those skilled in the art will appreciate that all or part of the steps of the methods in the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium, which may include read-only memory, random access memory, a magnetic disk, an optical disk, and the like.
The foregoing is a more detailed description of the present application in connection with specific embodiments, and the specific implementation of the present application is not limited to these descriptions. Those of ordinary skill in the art may make various simple derivations or substitutions without departing from the concept of the present application.

Claims (10)

1. A method for processing high-dimensional features, comprising:
selecting sample features, and coarsely quantizing the sample features to generate a plurality of cluster centers;
adding each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle;
and calculating the distances between a target feature and the cluster centers, performing an HNSW-algorithm search in a preset number of the closest clusters, and returning a retrieval result.
2. The method of claim 1, wherein the coarsely quantizing the sample features comprises:
clustering the sample features by a K-Means method.
3. The method of claim 2, wherein the adding each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle comprises:
calculating the distances between the high-dimensional feature to be inserted and the cluster centers, and finding the nearest cluster center by distance ranking;
and inserting the high-dimensional feature to be inserted into the HNSW cluster corresponding to the nearest cluster center.
4. The method of any of claims 1 to 3, further comprising:
sorting the retrieval results in the preset number of clusters in ascending order of distance to the target feature and returning them.
5. An apparatus for processing high-dimensional features, comprising:
a coarse quantization module, configured to select sample features, coarsely quantize the sample features, and generate a plurality of cluster centers;
a clustering module, configured to add each high-dimensional feature to be inserted into the corresponding HNSW cluster according to the nearest-distance principle;
and a retrieval module, configured to calculate the distances between a target feature and the cluster centers, perform an HNSW-algorithm search in a preset number of the closest clusters, and return a retrieval result.
6. The apparatus of claim 5, wherein the coarse quantization module comprises a clustering unit;
the clustering unit is configured to cluster the sample features by a K-Means method.
7. The apparatus of claim 6, wherein the clustering module comprises:
a search unit, configured to calculate the distances between the high-dimensional feature to be inserted and the cluster centers and to find the nearest cluster center by distance ranking;
and an insertion unit, configured to insert the high-dimensional feature to be inserted into the HNSW cluster corresponding to the nearest cluster center.
8. The apparatus of any of claims 5 to 7, further comprising:
a return module, configured to sort the retrieval results in the preset number of clusters in ascending order of distance to the target feature and return them.
9. An apparatus for processing high-dimensional features, comprising:
a memory for storing a program;
a processor for implementing the method of any one of claims 1-4 by executing a program stored by the memory.
10. A computer-readable storage medium, comprising a program executable by a processor to implement the method of any one of claims 1-4.
CN201911066022.7A 2019-11-04 2019-11-04 High-dimensional feature processing method and device Pending CN110909197A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911066022.7A CN110909197A (en) 2019-11-04 2019-11-04 High-dimensional feature processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911066022.7A CN110909197A (en) 2019-11-04 2019-11-04 High-dimensional feature processing method and device

Publications (1)

Publication Number Publication Date
CN110909197A true CN110909197A (en) 2020-03-24

Family

ID=69814823

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911066022.7A Pending CN110909197A (en) 2019-11-04 2019-11-04 High-dimensional feature processing method and device

Country Status (1)

Country Link
CN (1) CN110909197A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738194A (en) * 2020-06-29 2020-10-02 深圳力维智联技术有限公司 Evaluation method and device for similarity of face images
CN111881767A (en) * 2020-07-03 2020-11-03 深圳力维智联技术有限公司 Method, device and equipment for processing high-dimensional features and computer-readable storage medium
CN112286942A (en) * 2020-12-25 2021-01-29 成都索贝数码科技股份有限公司 Data retrieval method based on regional hierarchical route map algorithm
WO2022063150A1 (en) * 2020-09-27 2022-03-31 阿里云计算有限公司 Data storage method and device, and data query method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049514A (en) * 2012-12-14 2013-04-17 杭州淘淘搜科技有限公司 Balanced image clustering method based on hierarchical clustering
CN107944046A (en) * 2017-12-15 2018-04-20 清华大学 Extensive high dimensional data method for quickly retrieving and system
CN109753577A (en) * 2018-12-29 2019-05-14 深圳云天励飞技术有限公司 A kind of method and relevant apparatus for searching for face
CN110134804A (en) * 2019-05-20 2019-08-16 北京达佳互联信息技术有限公司 Image search method, device and storage medium
CN110297935A (en) * 2019-06-28 2019-10-01 京东数字科技控股有限公司 Image search method, device, medium and electronic equipment

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049514A (en) * 2012-12-14 2013-04-17 杭州淘淘搜科技有限公司 Balanced image clustering method based on hierarchical clustering
CN107944046A (en) * 2017-12-15 2018-04-20 清华大学 Extensive high dimensional data method for quickly retrieving and system
CN109753577A (en) * 2018-12-29 2019-05-14 深圳云天励飞技术有限公司 A kind of method and relevant apparatus for searching for face
CN110134804A (en) * 2019-05-20 2019-08-16 北京达佳互联信息技术有限公司 Image search method, device and storage medium
CN110297935A (en) * 2019-06-28 2019-10-01 京东数字科技控股有限公司 Image search method, device, medium and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111738194A (en) * 2020-06-29 2020-10-02 深圳力维智联技术有限公司 Evaluation method and device for similarity of face images
CN111738194B (en) * 2020-06-29 2024-02-02 深圳力维智联技术有限公司 Method and device for evaluating similarity of face images
CN111881767A (en) * 2020-07-03 2020-11-03 深圳力维智联技术有限公司 Method, device and equipment for processing high-dimensional features and computer-readable storage medium
CN111881767B (en) * 2020-07-03 2023-11-03 深圳力维智联技术有限公司 Method, device, equipment and computer readable storage medium for processing high-dimensional characteristics
WO2022063150A1 (en) * 2020-09-27 2022-03-31 阿里云计算有限公司 Data storage method and device, and data query method and device
CN112286942A (en) * 2020-12-25 2021-01-29 成都索贝数码科技股份有限公司 Data retrieval method based on regional hierarchical route map algorithm

Similar Documents

Publication Publication Date Title
CN110909197A (en) High-dimensional feature processing method and device
Engilberge et al. Sodeep: a sorting deep net to learn ranking loss surrogates
CN106649647B (en) Search result ordering method and device based on artificial intelligence
JP4298749B2 (en) Time-series data dimension compressor
WO2018015848A2 (en) Finding k extreme values in constant processing time
Tombari et al. Full-search-equivalent pattern matching with incremental dissimilarity approximations
CN107590505B (en) Learning method combining low-rank representation and sparse regression
Mittal et al. Hyperstar: Task-aware hyperparameters for deep networks
CN109753577B (en) Method and related device for searching human face
CN114090663B (en) User demand prediction method applying artificial intelligence and big data optimization system
CN111461753A (en) Method and device for recalling knowledge points in intelligent customer service scene
CN108763295A (en) A kind of video approximate copy searching algorithm based on deep learning
JP2022539423A (en) Image feature extraction and network training method, device and equipment
CN107426610B (en) Video information synchronization method and device
CN112085058A (en) Object combination recall method and device, electronic equipment and storage medium
CN115422479A (en) Track association method and device, electronic equipment and machine-readable storage medium
EP1207464A2 (en) Database indexing using a tree structure
CN108170664B (en) Key word expansion method and device based on key words
CN113010788B (en) Information pushing method and device, electronic equipment and computer readable storage medium
Kontkanen et al. On Bayesian case matching
US8805104B1 (en) Selecting example-based predictors based on spatial continuity
KR100282608B1 (en) How to Configure Spatial Indexes to Support Similarity Search
CN107038169B (en) Object recommendation method and object recommendation device
CN114547286A (en) Information searching method and device and electronic equipment
CN109886185B (en) Target identification method, device, electronic equipment and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination