WO2021081913A1 - Vector query method, device, electronic equipment, and storage medium - Google Patents

Vector query method, device, electronic equipment, and storage medium

Info

Publication number
WO2021081913A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
sample
query
residual
vectors
Prior art date
Application number
PCT/CN2019/114795
Other languages
English (en)
French (fr)
Inventor
张家兴
Original Assignee
北京欧珀通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京欧珀通信有限公司
Priority to CN201980099370.6A (patent CN114245896A)
Priority to PCT/CN2019/114795 (patent WO2021081913A1)
Publication of WO2021081913A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50 Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/51 Indexing; Data structures therefor; Storage structures

Definitions

  • This application relates to the field of data processing technology, and more specifically, to a vector query method, device, electronic equipment, and storage medium.
  • In a vector query, the query content and the sample content are represented by vectors, the query is performed on those vectors, and the sample content that matches the query content is finally obtained.
  • As the content of users' queries becomes more complex, queries take more time, so query efficiency needs to be improved.
  • To this end, this application proposes a vector query method, device, electronic equipment, and storage medium.
  • In a first aspect, an embodiment of the present application provides a vector query method, which includes: obtaining a query vector; obtaining, according to a pre-established first index, a first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector, where the first index includes a plurality of first clusters obtained by clustering sample vectors and a first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors; obtaining the residual vector between the query vector and the target vector as a query residual vector; obtaining, according to a pre-established second index, the code corresponding to each sample vector in the plurality of sample vectors, where the second index includes the code corresponding to each sample residual vector, obtained by applying product quantization to the sample residual vector corresponding to each sample vector, and the sample residual vector is the residual vector between the sample vector and the target vector; and obtaining from the plurality of sample vectors, according to the query residual vector and the code corresponding to each sample residual vector, the sample vector whose distance to the query vector satisfies a second set distance condition, as the query result.
  • An embodiment of the present application provides a vector query device. The device includes a vector acquisition module, a first determination module, a residual acquisition module, a second determination module, and a vector determination module.
  • The vector acquisition module is used to obtain a query vector.
  • The first determination module is used to obtain, according to a pre-established first index, a first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector. The first index includes a plurality of first clusters obtained by clustering sample vectors and a first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors.
  • The residual acquisition module is used to obtain the residual vector between the query vector and the target vector as a query residual vector.
  • The second determination module is used to obtain, according to a pre-established second index, the code corresponding to each sample vector in the plurality of sample vectors, where the second index includes the code corresponding to each sample residual vector, obtained by applying product quantization to the sample residual vector corresponding to each sample vector.
  • An embodiment of the present application provides an electronic device, including: one or more processors; a memory; and one or more application programs, where the one or more application programs are stored in the memory, are configured to be executed by the one or more processors, and are configured to execute the vector query method provided in the above-mentioned first aspect.
  • An embodiment of the present application provides a computer-readable storage medium. The computer-readable storage medium stores program code, and the program code can be invoked by a processor to execute the vector query method provided in the first aspect.
  • The solution provided by this application obtains a query vector; obtains, according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition, as the target vector; and obtains the residual vector between the query vector and the target vector as the query residual vector. Then, according to the pre-established second index, it obtains the code corresponding to each sample vector in the plurality of sample vectors, where the second index includes the code corresponding to each sample residual vector, obtained by applying product quantization to the sample residual vector corresponding to each sample vector. Finally, according to the query residual vector and the code corresponding to each sample residual vector, it obtains from the plurality of sample vectors the sample vector whose distance to the query vector satisfies the second set distance condition, as the query result. Vector retrieval is thus realized through coarse clustering followed by product quantization, which reduces the complexity of retrieval step by step, so that both the speed and the accuracy of vector retrieval can be maintained.
  • Fig. 1 shows a flowchart of an index construction method according to an embodiment of the present application.
  • Fig. 2 shows a flowchart of step S110 in a method for constructing an index according to an embodiment of the present application.
  • FIG. 3 shows a schematic diagram of the principle of establishing a second index provided by an embodiment of the present application.
  • Fig. 4 shows a flowchart of a vector query method according to an embodiment of the present application.
  • Fig. 5 shows a flowchart of step S220 in the vector query method provided in an embodiment of the present application.
  • FIG. 6 shows a flowchart of step S250 in the vector query method provided in an embodiment of the present application.
  • Fig. 7 shows a flowchart of a vector query method according to another embodiment of the present application.
  • Fig. 8 shows a block diagram of a vector query device according to an embodiment of the present application.
  • Fig. 9 is a block diagram of an electronic device for executing the vector query method according to an embodiment of the present application.
  • Fig. 10 shows a storage unit for storing or carrying program code that implements the vector query method according to an embodiment of the present application.
  • Current index structures mainly include tree index structures, hash indexes, graph indexes, and vector quantization.
  • For tree index structures, generally speaking, the tree index performs well when the spatial dimension is relatively low, but when the vector dimension is relatively high, performance and accuracy are not ideal.
  • For hash indexes, although the index can be built quickly, retrieval accuracy is poor. For collections of more than tens of millions of high-dimensional vectors, the accuracy is usually below 50%, which makes this method difficult to apply in most scenarios.
  • For graph indexes, good results can be achieved in vector similarity calculation at the scale of tens of millions of vectors, but when the data scale reaches hundreds of millions, building the index takes a very long time and retrieval is also slow, which cannot meet the needs of online computing. In addition, adding subsequently added samples to the index triggers wide-ranging changes in the index structure, so performance is difficult to guarantee.
  • For vector quantization, such as clustering and product quantization, relying solely on clustering or solely on product quantization for vector similarity calculation over hundreds of millions of vectors means that, if retrieval accuracy is to be ensured, the indexing time becomes long.
  • In view of the above problems, the inventor proposes the vector query method, device, electronic device, and storage medium provided by the embodiments of the application.
  • Using the coarse clustering result of the sample data, the cluster center closest to the query vector is obtained. Then, within the cluster where that cluster center is located, the sample vector closest to the query vector is obtained, so that the distance between the query vector and every sample vector need not be computed by brute force.
  • This reduces the query time, and the use of product quantization can effectively improve the accuracy of the vector query.
  • the specific vector query method will be described in detail in the subsequent embodiments.
  • FIG. 1 shows a schematic flowchart of an index construction method provided by an embodiment of the present application.
  • The index construction method clusters the sample data and establishes a first index according to the clustering result, then uses the product quantization method to obtain a codebook according to the clustering result and establishes a second index. The first index and the second index are used in the process of vector query.
  • the index construction method can be applied to electronic devices. The following will take an electronic device as an example to illustrate the specific process of this embodiment. Of course, it is understandable that the electronic device applied in this embodiment may be a server or other devices, which is not limited here.
  • the flow shown in Fig. 1 will be described in detail below, and the index construction method may specifically include the following steps:
  • Step S110 Perform clustering on all sample vectors to obtain multiple first clusters.
  • Sample content can be images, videos, audio, documents, web pages, news posts, and other types of content; the specific type of sample content is not limited.
  • For example, when the electronic device is used to process and query images, the sample content may be image content.
  • The sample content can be processed to obtain a sample vector corresponding to the sample content, and the sample vector is used to characterize the features of the sample content.
  • Features can be extracted according to the type of sample content to form the sample vector.
  • For image content, image features such as brightness value, gray value, number of pixels, gray average value, and gray median value can be extracted as the elements of the sample vector.
  • For audio content, features such as timbre, pitch, volume, text content, and keywords in the audio can be extracted as the elements of the sample vector.
  • For document content, features such as word segmentation, keywords, and word frequency can be extracted as the elements of the sample vector.
  • The specific method of obtaining the sample vector is not limited.
  • A larger number of features can be extracted, thereby forming a high-dimensional sample vector. The specific dimension of the sample vector is not limited; for example, it may be several hundred or several thousand dimensions.
  • In this way, a sample vector corresponding to each piece of sample content can be obtained.
  • The business scenario is generally to find a few vectors similar to the query vector among about one billion high-dimensional sample vectors. If this problem were solved through clustering alone, the number of clusters might need to exceed 10,000, and the clustering would converge very slowly. Therefore, coarse clustering can first be performed: after the sample vectors are obtained, all of them are clustered.
  • Clustering generates multiple clusters, that is, multiple categories. The generated clusters are regarded as the first clusters, and each first cluster corresponds to a cluster center.
  • The cluster center can be understood as the centroid of the cluster, which lies at the center of the distribution of the sample vectors under the cluster.
  • Each cluster center can be represented by a vector with the same dimension as the sample vector, which is the center vector of the cluster; the center vector of a first cluster is used as the first cluster center vector.
  • The K-means clustering algorithm can be used to cluster all the sample vectors.
  • Referring to Figure 2, clustering all sample vectors to obtain multiple first clusters includes:
  • Step S111 Determine the number of clusters according to the number of all the sample vectors, or according to a set algorithm.
  • The number of clusters can be determined according to the number of sample vectors and the required precision, so that the number of first clusters obtained after clustering is the predetermined number of clusters.
  • The number of clusters can also be determined according to a set algorithm, such as the elbow method or the silhouette coefficient.
  • The number of clusters can satisfy a certain relationship with the number of sample vectors, so as to achieve the effect of coarse clustering.
  • Step S112 According to the number of clusters, use the K-means clustering algorithm to cluster all the sample vectors to obtain a plurality of first clusters, where the number of first clusters equals the number of clusters.
  • The electronic device can set and adjust the clustering parameters of the K-means clustering algorithm according to the number of clusters, and then cluster all the sample vectors, so that the number of first clusters obtained is the number of clusters determined in step S111, which implements coarse clustering of the sample vectors.
  • The method of clustering all the sample vectors is not limited; clustering can also be performed with other methods, such as a hierarchical clustering algorithm or a density-based clustering algorithm.
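  • The coarse clustering of steps S111 and S112 can be sketched as follows. This is a minimal illustration of Lloyd's K-means in plain Python, not the application's implementation; the function names, the fixed iteration count, the random seed, and the square-root heuristic in `choose_cluster_count` are all assumptions made for the example.

```python
import math
import random


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, centers):
    """Index of the center closest to `vector`."""
    return min(range(len(centers)), key=lambda c: squared_l2(vector, centers[c]))


def choose_cluster_count(n_samples):
    # Hypothetical heuristic (an assumption, not from the source):
    # roughly sqrt(n) clusters keeps the clustering coarse.
    return max(1, round(math.sqrt(n_samples)))


def kmeans(vectors, k, iterations=20, seed=0):
    """Lloyd's K-means. Returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest center.
        assignment = [nearest(v, centers) for v in vectors]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    # Final assignment against the final centers.
    assignment = [nearest(v, centers) for v in vectors]
    return centers, assignment
```

  • The returned centers play the role of the first cluster center vectors, and the assignment records which first cluster each sample vector falls into.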
  • Step S120 Obtain the first cluster center vector corresponding to the cluster center in each first cluster.
  • After clustering, multiple first clusters are obtained, and the cluster center vector corresponding to the cluster center of each first cluster is the first cluster center vector described in step S110.
  • Step S130 Obtain the first cluster center vector that is closest to each sample vector.
  • For each sample vector, the distance between the sample vector and each first cluster center vector can be calculated, and the first cluster center vector closest to the sample vector can be determined from the calculated distances. The first cluster center vector closest to a sample vector is the one corresponding to the first cluster to which the sample vector belongs.
  • The distance between the sample vector and the first cluster center vector characterizes how far apart the two are; it may be the Euclidean distance, the Mahalanobis distance, the cosine distance, and so on, which is not limited here.
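  • Two of the distance measures named above can be written out as a short sketch. The function names are illustrative assumptions; the Mahalanobis distance is omitted because it additionally requires a covariance matrix estimated from the data.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

  • Note that the two measures can rank neighbors differently: cosine distance ignores vector length, whereas Euclidean distance does not.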
  • Step S140 Establish the index relationship between each first cluster and the corresponding first cluster center vector, and the index relationship between each sample vector and the corresponding first cluster, to obtain the first index.
  • Sample vectors whose closest first cluster center vector is the same correspond to the first cluster of that first cluster center vector, which determines the sample vectors under each first cluster. Based on this, the index relationship between each sample vector and the corresponding first cluster can be established, the inverted index relationship between each first cluster and its corresponding sample vectors can be established, and the index relationship between each first cluster and the corresponding first cluster center vector can also be established, thereby obtaining the first index. With the first index, the first cluster center vector corresponding to each first cluster, the first cluster to which each sample vector belongs, and the sample vectors under each first cluster can all be queried.
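  • The three lookups that the first index supports can be sketched with ordinary Python containers. The dictionary layout and the name `build_first_index` are assumptions for illustration; the source does not prescribe a storage format.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_first_index(vectors, centers):
    """Assign each sample to its nearest center and build inverted lists."""
    sample_to_cluster = []
    cluster_to_samples = {c: [] for c in range(len(centers))}
    for sample_id, vector in enumerate(vectors):
        cluster_id = min(
            range(len(centers)),
            key=lambda c: squared_l2(vector, centers[c]),
        )
        sample_to_cluster.append(cluster_id)            # sample -> first cluster
        cluster_to_samples[cluster_id].append(sample_id)  # first cluster -> samples
    return {
        "centers": centers,                      # first cluster -> center vector
        "sample_to_cluster": sample_to_cluster,
        "cluster_to_samples": cluster_to_samples,
    }
```

  • Keeping the inverted lists per cluster means a query only has to visit the samples under the selected first cluster rather than the whole collection.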
  • Step S150 Obtain a sample residual vector corresponding to each sample vector in all the sample vectors, where the sample residual vector is a residual vector between each sample vector and its corresponding first cluster center vector.
  • The first cluster center vector corresponding to each sample vector in the first index can be used to obtain the sample residual vector between each sample vector and its corresponding first cluster center vector.
  • Specifically, the corresponding first cluster center vector can be subtracted from the sample vector to obtain the residual vector between the two.
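  • The subtraction in step S150 is element-wise, as the following sketch shows. The helper names and the `sample_to_cluster` list (mapping each sample to the index of its first cluster) are assumptions for the example.

```python
def residual(vector, center):
    """Element-wise difference: vector minus its cluster center."""
    if len(vector) != len(center):
        raise ValueError("vector and center must have the same dimension")
    return [x - c for x, c in zip(vector, center)]


def sample_residuals(vectors, centers, sample_to_cluster):
    """Residual of every sample against the center of its own cluster."""
    return [
        residual(v, centers[sample_to_cluster[i]])
        for i, v in enumerate(vectors)
    ]
```

  • Because samples in the same first cluster lie near their center, the residuals tend to be smaller in magnitude than the original vectors, which is what makes them easier to quantize.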
  • Step S160 Reduce the dimension of each sample residual vector to multiple sample sub-vectors in multiple subspaces, where the sample sub-vectors correspond one-to-one to the subspaces.
  • After coarse clustering, product quantization may then be performed for finer-grained calculation and indexing.
  • The process of performing product quantization may be step S160 to step S190 in the embodiment of this application.
  • The first cluster center vector has the same dimension as the sample vector, so the sample residual vector corresponding to each sample vector also has the same dimension as the sample vector; that is, the sample residual vector is also high-dimensional.
  • Processing high-dimensional vectors is more complex, so the sample residual vector can be reduced in dimensionality.
  • The sample residual vector can be reduced to multiple sample sub-vectors in multiple subspaces, with the sample sub-vectors corresponding one-to-one to the subspaces; that is, each sample residual vector has exactly one sample sub-vector in each subspace.
  • The subspaces can be obtained by equal division, which ensures that the sample sub-vectors in different subspaces have the same number of dimensions. The division need not be equal, in which case the sample sub-vectors in different subspaces may have different numbers of dimensions.
  • For example, suppose the sample residual vector is a 30-dimensional vector, expressed as (i_1, i_2, i_3, ..., i_28, i_29, i_30), and is divided into 6 subspaces.
  • The corresponding sub-vector in the first subspace can be (i_1, i_2, i_3, i_4, i_5), in the second subspace (i_6, i_7, i_8, i_9, i_10), in the third subspace (i_11, i_12, i_13, i_14, i_15), in the fourth subspace (i_16, i_17, i_18, i_19, i_20), in the fifth subspace (i_21, i_22, i_23, i_24, i_25), and in the sixth subspace (i_26, i_27, i_28, i_29, i_30).
  • The above subspace division and sub-vector division are only examples.
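  • The equal division in this example can be reproduced with list slicing. The sketch below assumes contiguous, equally sized subspaces, and uses the integers 1 to 30 as stand-ins for the components i_1 to i_30; the function name is hypothetical.

```python
def split_into_subvectors(vector, num_subspaces):
    """Split a vector into equally sized, contiguous sub-vectors."""
    if num_subspaces <= 0 or len(vector) % num_subspaces != 0:
        raise ValueError("dimension must be divisible by num_subspaces")
    width = len(vector) // num_subspaces
    return [vector[j * width:(j + 1) * width] for j in range(num_subspaces)]


# Stand-in for the 30-dimensional residual (i_1, ..., i_30).
residual_vector = list(range(1, 31))
sub_vectors = split_into_subvectors(residual_vector, 6)
```

  • Here `sub_vectors[0]` holds components 1 to 5 and `sub_vectors[5]` holds components 26 to 30, matching the first and sixth subspaces of the example.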
  • Step S170 Cluster the sample sub-vectors in the same subspace to obtain multiple second clusters in each subspace and a second cluster center vector corresponding to each second cluster.
  • The sample sub-vectors can be clustered subspace by subspace.
  • The K-means clustering algorithm can be used to cluster the sample sub-vectors in the same subspace. The clustering result includes multiple second clusters, each of which also corresponds to a cluster center, referred to as the second cluster center.
  • Each second cluster center corresponds to a second cluster center vector, which has the same dimension as the sample sub-vectors in that subspace.
  • For the specific clustering method, refer to the clustering method in step S110, which is not repeated here.
  • Each sample residual vector is divided into the same number of sample sub-vectors. After the sample sub-vectors in each subspace are clustered, multiple second clusters and their corresponding second cluster center vectors are obtained. The same clustering algorithm can be used in every subspace to obtain the same number of second clusters; alternatively, different clustering algorithms can be used in different subspaces to obtain different numbers of second clusters.
  • The sample sub-vectors may also be processed before the sample sub-vectors in the same subspace are clustered, in order to improve accuracy.
  • For example, a reference orthogonal matrix may be used to transform each sample sub-vector before clustering is performed.
  • The reference orthogonal matrix can be determined with the optimized product quantization method. For example, the quantization error function can be minimized by iterative optimization, which finally yields the reference orthogonal matrix.
  • Transforming the sample sub-vectors with the reference orthogonal matrix makes the clustering error smaller and improves the clustering accuracy.
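  • One property that makes an orthogonal transform safe to apply before clustering is that it preserves Euclidean distances, so nearest-neighbor relationships are unchanged. The sketch below only demonstrates this property with a fixed 2-D rotation; it does not perform the iterative optimization that yields the reference orthogonal matrix, and the angle and vectors are arbitrary values chosen for illustration.

```python
import math


def transform(vector, matrix):
    """Multiply a vector by a square matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


theta = math.pi / 6  # arbitrary rotation angle for the illustration
R = [
    [math.cos(theta), -math.sin(theta)],
    [math.sin(theta), math.cos(theta)],
]

a, b = [1.0, 2.0], [4.0, 6.0]
before = euclidean_distance(a, b)
after = euclidean_distance(transform(a, R), transform(b, R))
```

  • The two distances agree up to floating-point rounding; what the optimization changes is how the variance is spread across subspaces, not the geometry of the data.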
  • For each first cluster, the sample residual vectors of the sample vectors corresponding to that first cluster can be reduced in dimension to multiple sample sub-vectors, and the sample sub-vectors in the same subspace are then clustered. In this way, when the second index established subsequently is used for vector query, after the first cluster corresponding to the query vector is found, the sample vector that serves as the query result can be found faster according to the second index.
  • Step S180 Encode each second cluster to obtain a sub-code corresponding to each second cluster, wherein multiple sub-codes corresponding to each sample residual vector constitute a code corresponding to each sample residual vector.
  • The second clusters in each subspace can be coded to obtain the sub-code corresponding to each second cluster, so that an index can be established according to these sub-codes.
  • For example, each subspace can correspond to L second clusters, and the L second clusters in the first subspace can be coded in order to obtain the sub-codes 1, 2, 3, ..., L.
  • Step S190 Establish an index relationship between the sub-code corresponding to each sample residual vector and the second cluster center vector to obtain a second index.
  • For each sample residual vector, each of its sample sub-vectors is examined to determine the second cluster center vector closest to that sample sub-vector in the corresponding subspace, together with the second cluster corresponding to that second cluster center vector. The sub-code of that second cluster is then associated with the sample sub-vector. This determines the sub-code for each sample sub-vector of the sample residual vector, and the index relationship between the sub-codes corresponding to each sample residual vector and the second cluster center vectors is established to obtain the second index.
  • The sub-codes corresponding to the sample sub-vectors of a sample residual vector together constitute the code of that sample residual vector.
  • For example, suppose the dimension of the sample vector is 256 and it is divided into 4 subspaces, so the sample sub-vectors in each subspace have 64 dimensions.
  • The second clusters in each subspace can be encoded as 1-byte integers: in each subspace, clustering the sample sub-vectors generates 256 second clusters, and the sub-codes of these 256 second clusters serve as the codebook of that subspace.
  • Since each second cluster is encoded as a 1-byte integer, when the sample residual vector is approximated by its corresponding second cluster center vector in each subspace, it can be quantized into a code of four 1-byte integers, namely the sub-codes of the 4 second clusters.
  • Here, the corresponding second cluster center vector in each subspace is the second cluster center vector closest to the sample sub-vector of the sample residual vector in that subspace.
  • For example, for a sample residual vector A, the sub-code of the second cluster whose center vector is nearest is 23 in the first subspace, 148 in the second subspace, 235 in the third subspace, and 230 in the fourth subspace. Approximating the sample residual vector by the corresponding second cluster center vector in each subspace, it is quantized into the code (23, 148, 235, 230), which is the code corresponding to the sample residual vector. According to the second cluster center vectors corresponding to this code, the index relationship between the sub-codes corresponding to the sample residual vector and the second cluster center vectors is established to obtain the second index.
  • With the second index, the second cluster center vector corresponding to each sub-code can be queried, and the combination of the four queried second cluster center vectors serves as an approximation of the sample residual vector.
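  • Encoding a residual into sub-codes and recovering its approximation can be sketched as follows. The codebooks here are tiny hand-written stand-ins (2 subspaces, 2 centers each) rather than trained ones, sub-codes are 0-based list positions, and the names `encode` and `decode` are assumptions for the example.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(residual, codebooks):
    """Quantize a residual into one sub-code per subspace."""
    width = len(residual) // len(codebooks)
    code = []
    for j, codebook in enumerate(codebooks):
        sub = residual[j * width:(j + 1) * width]
        # Sub-code = position of the nearest second cluster center.
        code.append(min(range(len(codebook)),
                        key=lambda c: squared_l2(sub, codebook[c])))
    return tuple(code)


def decode(code, codebooks):
    """Concatenate the center vectors that the sub-codes point to."""
    approx = []
    for sub_code, codebook in zip(code, codebooks):
        approx.extend(codebook[sub_code])
    return approx


# Toy codebooks: one list of center vectors per subspace.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [5.0, 5.0]],
]
code = encode([0.9, 1.2, 4.0, 6.0], codebooks)
approximation = decode(code, codebooks)
```

  • With at most 256 centers per subspace, every sub-code fits in one byte, so `bytes(code)` stores the whole code in as many bytes as there are subspaces, which corresponds to the four 1-byte integers of the 256-dimensional example.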
  • The index construction method provided by the embodiment of the present application performs coarse clustering of the sample vectors to obtain multiple first clusters and the first cluster center vector corresponding to each first cluster, and establishes, according to the coarse clustering result, the index relationship between each first cluster and the corresponding first cluster center vector and the index relationship between each sample vector and the corresponding first cluster, to obtain the first index.
  • Then, the sample residual vector between each sample vector and its corresponding first cluster center vector is obtained, reduced in dimension, and clustered; each second cluster is coded; and finally the index relationship between the sub-codes corresponding to each sample residual vector and the second cluster center vectors is established to obtain the second index. In the index construction process, coarse clustering first realizes a rough division, and product quantization then builds an approximate fine-grained index, which greatly reduces the index construction time.
  • FIG. 4 shows a schematic flowchart of a vector query method provided by an embodiment of the present application.
  • The vector query method uses the coarse clustering result of the sample data to obtain the cluster center closest to the query vector, and then, within the cluster where that cluster center is located, uses the index established by product quantization to obtain the sample vector closest to the query vector, thereby improving the efficiency of vector query.
  • the vector query method can be applied to the above-mentioned electronic device. The following will elaborate on the process shown in FIG. 4, and the vector query method may specifically include the following steps:
  • Step S210 Obtain a query vector.
  • The query vector may be a vector generated according to the query content required by the user.
  • For example, the query vector may be generated from text content or from image content, which is not limited here. For the manner of generating the query vector, refer to the method of generating the sample vector in the foregoing embodiment, which is not repeated here.
  • Step S220 According to the pre-established first index, obtain a first cluster center vector whose distance from the query vector satisfies a first set distance condition as a target vector, where the first index includes a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors.
  • The pre-established first index consists of the index relationship between each first cluster and the corresponding first cluster center vector, and the index relationship between each sample vector and the corresponding first cluster. Therefore, the first index may include a plurality of first clusters obtained by clustering the sample vectors and a first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors.
  • For the method of establishing the first index, refer to the content in the foregoing embodiment, which is not repeated here.
  • The first index may be stored in the electronic device in advance.
  • The first cluster center vector whose distance from the query vector satisfies the first set distance condition can be obtained according to the first index, and the obtained first cluster center vector is used as the target vector.
  • step S220 may include:
  • Step S221 Obtain a first cluster center vector corresponding to each first cluster according to the first index.
  • Step S222 Calculate the distance between the query vector and each first cluster center vector respectively.
  • Step S223 According to the distance between the query vector and each first cluster center vector, obtain a first cluster center vector whose distance to the query vector satisfies a first set distance condition.
  • According to the first index, the first cluster center vector corresponding to each first cluster can be queried, and the distance between the query vector and each first cluster center vector is then calculated separately to obtain the first cluster center vector whose distance to the query vector satisfies the first set distance condition.
  • The obtained first cluster center vector is used as the target vector, and the first cluster corresponding to the target vector is the cluster that best matches the query vector.
  • The first set distance condition may include: the first cluster center vector with the smallest distance from the query vector; or a first cluster center vector whose distance from the query vector is less than a first distance threshold.
  • When the first set distance condition is the first cluster center vector with the smallest distance from the query vector, the distance between the query vector and each first cluster center vector can be calculated separately, the resulting distances sorted from smallest to largest, and the first cluster center vector corresponding to the minimum distance determined as the target vector.
  • When the first set distance condition is a first cluster center vector whose distance to the query vector is less than the first distance threshold, the distance between the query vector and each first cluster center vector can be calculated separately, and each first cluster center vector whose distance is less than the first distance threshold is then selected as a target vector.
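  • The two variants of the first set distance condition can be sketched as follows. This is a minimal illustration, not the application's implementation: the function names, the use of Euclidean distance, and the sample data are assumptions made for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_targets(query, centers, threshold=None):
    """Return indices of the first cluster centers that meet the condition.

    threshold=None: the single center with the smallest distance.
    threshold=t:    every center whose distance to the query is below t.
    """
    distances = [euclidean(query, c) for c in centers]
    if threshold is None:
        return [min(range(len(centers)), key=distances.__getitem__)]
    return [i for i, d in enumerate(distances) if d < threshold]

# Hypothetical first cluster center vectors and query vector.
centers = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
query = [0.9, 1.2]
nearest = select_targets(query, centers)                # minimum-distance condition
within = select_targets(query, centers, threshold=2.0)  # threshold condition
```

  • With the threshold variant, more than one first cluster may be selected, so the later steps would be repeated for each selected target vector.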
  • Step S230 Obtain a residual vector between the query vector and the target vector as a query residual vector.
  • the query vector and the target vector may be subtracted to obtain a residual vector between the query vector and the target vector.
  • Step S240: Obtain the code corresponding to each sample vector in the plurality of sample vectors according to a pre-established second index. The second index includes the code corresponding to each sample residual vector, obtained by performing product quantization on the sample residual vector corresponding to each sample vector, where the sample residual vector is the residual vector between the sample vector and the target vector.
  • the second index may be the index relationship between the sub-code corresponding to each sample residual vector and the second cluster center vector.
  • The second index may include the code corresponding to each sample residual vector obtained by product quantization of the sample residual vector corresponding to each sample vector, that is, the code corresponding to each sample residual vector obtained when the second index is established in the foregoing embodiment. For the method of establishing the second index, refer to the foregoing embodiment; it is not repeated here.
  • Step S250: According to the query residual vector and the code corresponding to each sample residual vector, obtain a sample vector whose distance to the query vector satisfies a second set distance condition from the multiple sample vectors, as the query result.
  • step S250 may include:
  • Step S251 Obtain the distance between the query vector and each sample vector according to the query residual vector and the code corresponding to each sample residual vector.
  • The code includes a plurality of sub-codes, and the second index also includes a second cluster center vector corresponding to each sub-code. The sub-codes are obtained by reducing each sample residual vector into multiple sub-sample vectors in multiple subspaces, clustering the sub-sample vectors in the same subspace to obtain multiple second clusters in each subspace, and encoding the second clusters. The multiple sub-sample vectors correspond to the multiple subspaces one to one.
  • step S251 may include:
  • reducing the dimension of the query residual vector into multiple sub-vectors in multiple subspaces as multiple sub-query vectors, the multiple sub-query vectors corresponding to the multiple subspaces one to one; obtaining, according to the second index, the second cluster center vector corresponding to each sub-code in the multiple sub-codes of each sample residual vector; and obtaining the distance between the query vector and each sample vector according to the multiple sub-query vectors and the multiple second cluster center vectors corresponding to each sample residual vector.
  • The query residual vector can be reduced in dimension in the same way as when the second index was established in the foregoing embodiment, that is, it is likewise reduced to multiple sub-vectors in multiple subspaces.
  • These sub-vectors serve as the multiple sub-query vectors, which correspond to the multiple subspaces one to one.
  • the code corresponding to each sample residual vector and the second cluster center vector corresponding to each sub-code in the code can be known. Then, the distance between the query vector and each sample vector can be obtained according to multiple sub-query vectors and multiple second cluster center vectors corresponding to each of the sample residual vectors.
  • Obtaining the distance between the query vector and each sample vector includes: for any sample residual vector in the plurality of sample residual vectors, calculating, in each subspace, the distance between the sub-query vector and the second cluster center vector corresponding to that sample residual vector in the same subspace; and, for each sample residual vector, summing the distances calculated in each subspace and, according to the correspondence between each sample residual vector and its sample vector, obtaining the distance between the query vector and each sample vector.
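  • The per-subspace distance calculation and summation can be sketched as follows. This is an illustration under stated assumptions: squared Euclidean distance is used so that the per-subspace sums add up to the whole-vector squared distance, the precomputed lookup table is an optimization added for the example, and the names `split`, `pq_distances`, `codebooks`, and `codes` are hypothetical.

```python
def split(vector, m):
    """Split a vector into m equal-length sub-vectors, one per subspace.

    Assumes len(vector) is divisible by m.
    """
    step = len(vector) // m
    return [vector[i * step:(i + 1) * step] for i in range(m)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pq_distances(query_residual, codes, codebooks):
    """Approximate squared distance from the query to every encoded sample.

    codebooks[j][k] is the k-th second cluster center vector in subspace j.
    codes[sample_id][j] is the sub-code of that sample in subspace j.
    """
    m = len(codebooks)
    sub_queries = split(query_residual, m)
    # table[j][k]: squared distance in subspace j between the sub-query
    # vector and the k-th second cluster center vector.
    table = [[squared_distance(sub_queries[j], center) for center in codebooks[j]]
             for j in range(m)]
    # Sum the per-subspace distances selected by each sample's sub-codes.
    return {sample_id: sum(table[j][code[j]] for j in range(m))
            for sample_id, code in codes.items()}

# Hypothetical data: 2 subspaces, 2 second cluster centers per subspace.
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [2.0, 2.0]]]
codes = {"a": [0, 1], "b": [1, 0]}
distances = pq_distances([1.0, 1.0, 2.0, 2.0], codes, codebooks)
```

  • Because each sample is represented by its second cluster center vectors rather than its exact residual, the result is an approximation of the true distance.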
  • Each sample residual vector corresponds to a sample vector: the sample residual vector is the residual vector between the sample vector and the first cluster center vector, and the query residual vector is the residual vector between the query vector and the first cluster center vector. Therefore, calculating the distance between the query residual vector and the sample residual vector amounts to calculating the distance between the query vector and the sample vector, since (A-B)-(C-B) equals A-C.
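  • The identity above can be checked numerically. The sketch below uses hypothetical vectors and assumes Euclidean distance; `residual` is an illustrative helper, not a name from the application.

```python
import math

def residual(vector, center):
    """Residual vector: element-wise difference between a vector and a center."""
    return [v - c for v, c in zip(vector, center)]

# Hypothetical query vector, sample vector, and shared first cluster center.
query = [3.0, 4.0]
sample = [1.0, 1.0]
center = [2.0, 2.0]

# Distance between the two residual vectors (the center cancels out) ...
d_residuals = math.dist(residual(query, center), residual(sample, center))
# ... equals the distance between the original vectors.
d_original = math.dist(query, sample)
```

  • The equality holds only when both residuals are taken against the same first cluster center vector, which is why the sample residual vector is defined relative to the target vector.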
  • Step S252 According to the distance between the query vector and each sample vector, obtain a sample vector whose distance to the query vector satisfies a second set distance condition from the multiple sample vectors as a query result.
  • the second set distance condition includes: a sample vector with the smallest distance from the query vector; or a sample vector with a distance less than a second distance threshold from the query vector.
  • In the vector query method provided by the embodiments of the present application, a query vector is obtained; a first cluster center vector whose distance to the query vector meets the first set distance condition is obtained as the target vector according to the pre-established first index; and the residual vector between the query vector and the target vector is obtained as the query residual vector. The code corresponding to each sample vector in the multiple sample vectors is then obtained according to the pre-established second index, which includes the code corresponding to each sample residual vector obtained by product quantization of the sample residual vector corresponding to each sample vector. Finally, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance from the query vector satisfies the second set distance condition is obtained from the multiple sample vectors as the query result. Vector retrieval is thus realized through coarse clustering and product quantization, which reduces the complexity of vector retrieval step by step, so that both the speed and the accuracy of vector retrieval can be guaranteed.
  • FIG. 7 shows a schematic flowchart of a vector query method provided by another embodiment of the present application.
  • the vector query method can be applied to the above-mentioned electronic devices.
  • the flow shown in FIG. 7 will be described in detail below, and the vector query method may specifically include the following steps:
  • Step S310 Obtain a query vector.
  • step S310 may include: obtaining a service query request; judging whether the service query request carries a vector; if it carries a vector, using the vector as the query vector; if it does not carry a vector, generating the query vector corresponding to the service query request.
  • the electronic device can parse the parameters in the service query request and determine whether the parameters carry a vector. If it carries a vector, it can directly use the carried vector as the query vector. If the vector is not carried, the query vector can be generated in the manner in the foregoing embodiment.
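  • Step S310 can be sketched as follows. The request layout (a dictionary with optional `"vector"` and `"content"` keys) and the `toy_embed` feature extractor are hypothetical stand-ins; the application does not specify a request format or a particular way of generating the vector.

```python
def get_query_vector(request, embed):
    """Return the vector carried by the request, or generate one from its content.

    request: dict parsed from the service query request (assumed layout).
    embed:   callable that turns query content into a vector (assumed).
    """
    vector = request.get("vector")
    if vector is not None:
        return list(vector)            # the request carries a vector: use it directly
    return embed(request["content"])   # otherwise generate the query vector

def toy_embed(text):
    # Toy stand-in for feature extraction: [character count, word count].
    return [float(len(text)), float(len(text.split()))]

with_vector = get_query_vector({"vector": [0.1, 0.2]}, toy_embed)
without_vector = get_query_vector({"content": "red shoes"}, toy_embed)
```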
  • Step S320 Determine whether there is a historical query result corresponding to the query vector.
  • Since the electronic device serves the queries of different users, and the same user may query the same content multiple times, the electronic device may store past historical query results corresponding to the query vector. After obtaining the query vector, it can judge whether a historical query result corresponding to the query vector is stored locally, and decide according to the result of this judgment whether to execute the query process.
  • Step S330 If there is a historical query result corresponding to the query vector, use the sample vector in the historical query result as the query result.
  • If the historical query result corresponding to the query vector is stored locally, it can be used directly as the query result for this query vector, saving the time that the query process would otherwise take.
  • Step S340: If there is no historical query result corresponding to the query vector, obtain a first cluster center vector whose distance to the query vector satisfies the first set distance condition according to the pre-established first index, as the target vector. The first index includes a plurality of first clusters obtained by clustering the sample vectors and the first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors.
  • If there is no historical query result corresponding to the query vector, steps S340 to S370 are performed.
  • Step S350 Obtain a residual vector between the query vector and the target vector as a query residual vector.
  • Step S360: Obtain the code corresponding to each sample vector in the plurality of sample vectors according to a pre-established second index. The second index includes the code corresponding to each sample residual vector, obtained by performing product quantization on the sample residual vector corresponding to each sample vector, where the sample residual vector is the residual vector between the sample vector and the target vector.
  • Step S370: According to the query residual vector and the code corresponding to each sample residual vector, obtain a sample vector whose distance to the query vector satisfies a second set distance condition from the multiple sample vectors, as the query result.
  • steps S340 to S370 can refer to the content of the foregoing embodiment, which will not be repeated here.
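  • The history check in steps S320 to S370 can be sketched as follows. This is a simplified illustration: `HistoryCache`, `query_with_history`, and the `search` callable are hypothetical names, exact-match lookup on the vector is an assumption, and storing each new result for later reuse is one possible policy rather than something the text prescribes.

```python
class HistoryCache:
    """Stores historical query results keyed by the exact query vector."""

    def __init__(self):
        self._results = {}

    def lookup(self, query):
        return self._results.get(tuple(query))

    def store(self, query, result):
        self._results[tuple(query)] = result

def query_with_history(query, cache, search):
    """Return a historical result if one exists, else run the full query."""
    cached = cache.lookup(query)
    if cached is not None:
        return cached              # S330: reuse the historical query result
    result = search(query)         # S340 to S370: index-based query
    cache.store(query, result)     # assumed policy: remember for next time
    return result

calls = []

def slow_search(query):
    # Stand-in for the index-based query; records how often it runs.
    calls.append(tuple(query))
    return ["sample_7"]

cache = HistoryCache()
first = query_with_history([0.1, 0.2], cache, slow_search)
second = query_with_history([0.1, 0.2], cache, slow_search)
```

  • In this sketch the second call is answered from the cache, so the index-based query runs only once.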
  • the above-mentioned multiple sample vectors are stored in the first database, that is, the query results obtained in the above steps are the query results obtained based on the first database.
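  • The sample vectors, first index, and second index of a second database may differ from those of the first database.
  • the vector query method may also include: using the query result as a first query result; obtaining a second query result according to the first index and the second index established from the sample vectors in the second database; and merging the first query result and the second query result to obtain a third query result as the query result corresponding to the query vector.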
  • the first index and the second index can be established based on the sample vector in the second database, and then the vector query can be performed in the manner of the above step S340 to step S370 to obtain the second query result. After that, the first query result and the second query result are combined to obtain the third query result, and the third query result is used as the query result of this vector query, so that the vector query is more accurate.
  • merging may refer to taking both the first query result and the second query result as the final query result.
  • The databases may include a large database and a small database. The large database can refer to a database with a large amount of sample data, mainly used to store historical sample vectors, and the small database can refer to a database with relatively little sample data.
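  • Merging the first and second query results can be sketched as follows. The text only states that both results are taken as the final result; representing each result as (sample id, distance) pairs, removing duplicates, and ordering by distance are assumptions added for this example, and `merge_results` is a hypothetical name.

```python
def merge_results(first_result, second_result):
    """Combine two query results into a third query result.

    Each result is a list of (sample_id, distance) pairs. Duplicated sample
    ids keep their smallest distance; the output is ordered by distance.
    """
    best = {}
    for sample_id, distance in first_result + second_result:
        if sample_id not in best or distance < best[sample_id]:
            best[sample_id] = distance
    return sorted(best.items(), key=lambda item: item[1])

# Hypothetical results from the first (large) and second (small) database.
first_result = [("a", 0.4), ("b", 0.9)]
second_result = [("c", 0.2), ("a", 0.4)]
third_result = merge_results(first_result, second_result)
```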
  • In the vector query method provided by the embodiments of the present application, a query vector is obtained and it is then determined whether there is a historical query result corresponding to the query vector. If there is, the historical query result can be used directly as the query result, reducing the amount of processing. If there is not, a first cluster center vector whose distance to the query vector meets the first set distance condition is obtained as the target vector according to the pre-established first index, and the residual vector between the query vector and the target vector is obtained as the query residual vector. The code corresponding to each sample vector in the multiple sample vectors is then obtained according to the pre-established second index, which includes the code corresponding to each sample residual vector obtained by product quantization of the sample residual vector corresponding to each sample vector. Finally, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance to the query vector meets the second set distance condition is obtained from the multiple sample vectors as the query result. Vector retrieval is thus realized through coarse clustering and product quantization, which reduces the complexity of vector retrieval step by step, so that both the speed and the accuracy of vector retrieval can be guaranteed.
  • FIG. 8 shows a structural block diagram of a vector query device 400 provided by an embodiment of the present application.
  • the vector query device 400 can be applied to the above-mentioned electronic equipment.
  • the vector query device 400 includes: a vector acquiring module 410, a first determining module 420, a residual acquiring module 430, a second determining module 440, and a vector determining module 450.
  • The vector acquiring module 410 is configured to obtain a query vector.
  • The first determining module 420 is configured to obtain, according to a pre-established first index, a first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector. The first index includes a plurality of first clusters obtained by clustering the sample vectors and the first cluster center vector corresponding to each first cluster, and each first cluster includes multiple sample vectors.
  • The residual acquiring module 430 is configured to acquire the residual vector between the query vector and the target vector as a query residual vector.
  • The second determining module 440 is configured to obtain the code corresponding to each sample vector in the plurality of sample vectors according to a pre-established second index. The second index includes the code corresponding to each sample residual vector, obtained by performing product quantization on the sample residual vector corresponding to each sample vector, where the sample residual vector is the residual vector between the sample vector and the target vector.
  • The vector determining module 450 is configured to obtain, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance from the query vector satisfies a second set distance condition from the multiple sample vectors, as the query result.
  • The vector determining module 450 may include a distance calculation unit and a vector filtering unit.
  • The distance calculation unit is used to obtain the distance between the query vector and each sample vector according to the query residual vector and the code corresponding to each sample residual vector. The vector filtering unit is used to obtain, according to the distance between the query vector and each sample vector, a sample vector whose distance to the query vector satisfies the second set distance condition from the plurality of sample vectors, as the query result.
  • The code includes a plurality of sub-codes, and the second index also includes a second cluster center vector corresponding to each sub-code. The sub-codes are obtained by reducing each sample residual vector into multiple sub-sample vectors in multiple subspaces, clustering the sub-sample vectors in the same subspace to obtain multiple second clusters in each subspace, and encoding the second clusters. The multiple sub-sample vectors correspond to the multiple subspaces one to one.
  • The distance calculation unit may be specifically configured to: reduce the dimension of the query residual vector into multiple sub-vectors in multiple subspaces as multiple sub-query vectors, the multiple sub-query vectors corresponding to the multiple subspaces one to one; obtain, according to the second index, the second cluster center vector corresponding to each sub-code in the multiple sub-codes of each sample residual vector; and obtain the distance between the query vector and each sample vector according to the multiple sub-query vectors and the multiple second cluster center vectors corresponding to each sample residual vector.
  • The distance calculation unit obtaining the distance between the query vector and each sample vector according to the multiple sub-query vectors and the multiple second cluster center vectors corresponding to each sample residual vector includes: for any sample residual vector in the plurality of sample residual vectors, calculating, in each subspace, the distance between the sub-query vector and the second cluster center vector corresponding to that sample residual vector in the same subspace; and, for each sample residual vector, summing the distances calculated in each subspace and, according to the correspondence between each sample residual vector and its sample vector, obtaining the distance between the query vector and each sample vector.
  • the vector query device 400 may further include a first index building module.
  • The first index building module can be used to: cluster all sample vectors to obtain multiple first clusters; obtain the first cluster center vector corresponding to the cluster center of each first cluster; obtain the first cluster center vector closest to each sample vector; and establish the index relationship between each first cluster and its corresponding first cluster center vector, and the index relationship between each sample vector and its corresponding first cluster, to obtain the first index.
  • The first index building module clustering all the sample vectors to obtain multiple first clusters includes: determining the number of clusters according to the number of all the sample vectors or according to a set algorithm; and, according to the number of clusters, clustering all the sample vectors with a K-means clustering algorithm to obtain a plurality of first clusters, the number of first clusters being the same as the number of clusters.
  • the vector query device 400 may also include a second index building module.
  • The second index building module may be used to: obtain the sample residual vector corresponding to each sample vector in all the sample vectors, where the sample residual vector is the residual vector between each sample vector and its corresponding first cluster center vector; reduce each sample residual vector into multiple sub-sample vectors in multiple subspaces, the multiple sub-sample vectors corresponding to the multiple subspaces one to one; cluster the sub-sample vectors in the same subspace to obtain multiple second clusters in each subspace and the second cluster center vector corresponding to each second cluster; encode each second cluster to obtain the sub-code corresponding to each second cluster, the sub-codes of each sample residual vector constituting the code corresponding to that sample residual vector; and establish the index relationship between the sub-codes corresponding to each sample residual vector and the second cluster center vectors, to obtain the second index.
  • The first set distance condition includes: the first cluster center vector with the smallest distance from the query vector; or a first cluster center vector whose distance from the query vector is less than a first distance threshold.
  • the second set distance condition includes: a sample vector with the smallest distance from the query vector; or a sample vector with a distance less than a second distance threshold from the query vector.
  • The vector query device 400 may further include a cache query module, configured to determine whether there is a historical query result corresponding to the query vector before the first cluster center vector whose distance from the query vector satisfies the first set distance condition is obtained as the target vector according to the pre-established first index. If there is no historical query result corresponding to the query vector, the first determining module 420 obtains, according to the pre-established first index, a first cluster center vector whose distance from the query vector satisfies the first set distance condition, as the target vector.
  • the vector query device 400 may further include a result determination module.
  • the result determination module is configured to, if there is a historical query result corresponding to the query vector, use the sample vector in the historical query result as the query result.
  • the plurality of sample vectors are stored in a first database.
  • the vector query device 400 may further include: a result identification module, a result query module, and a result merging module.
  • The result identification module is used to take the query result as the first query result; the result query module is used to obtain the second query result according to the first index and the second index established from the sample vectors in the second database; the result merging module is used to merge the first query result and the second query result to obtain a third query result as the query result corresponding to the query vector.
  • The vector acquiring module 410 may include: a request obtaining unit, configured to obtain a service query request; a vector determining unit, configured to determine whether the service query request carries a vector; a first execution unit, configured to use the vector as the query vector if a vector is carried; and a second execution unit, configured to generate a query vector corresponding to the service query request if no vector is carried.
  • The residual acquiring module 430 may be specifically configured to: subtract the target vector from the query vector to obtain the residual vector between the query vector and the target vector.
  • the vector query device 400 may further include an index update module.
  • the index update module is configured to update the first index and the second index according to the newly acquired sample vector every preset time interval.
  • the coupling between the modules may be electrical, mechanical or other forms of coupling.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software function modules.
  • the electronic device 100 may be an electronic device capable of running an application program, such as a server.
  • The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs. The one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to execute the method described in the foregoing method embodiment.
  • the processor 110 may include one or more processing cores.
  • The processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 120 and by calling the data stored in the memory 120.
  • The processor 110 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • the processor 110 may be integrated with one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is used for rendering and drawing of display content; the modem is used for processing wireless communication. It can be understood that the above-mentioned modem may not be integrated into the processor 110, but may be implemented by a communication chip alone.
  • The memory 120 may include random access memory (RAM) or read-only memory (ROM).
  • The memory 120 may be used to store instructions, programs, codes, code sets or instruction sets.
  • The memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, sound playback function, or image playback function), instructions for implementing the various method embodiments, and so on.
  • The data storage area can also store data (such as phone book, audio and video data, chat record data) created by the electronic device 100 during use.
  • FIG. 10 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • The computer-readable storage medium 800 stores program code, and the program code can be invoked by a processor to execute the method described in the foregoing method embodiment.
  • the computer-readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium.
  • the computer-readable storage medium 800 has storage space for the program code 810 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • the program code 810 may be compressed in a suitable form, for example.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

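This application discloses a vector query method, device, electronic equipment, and storage medium. The vector query method includes: obtaining a query vector; obtaining, according to a pre-established first index, a first cluster center vector whose distance from the query vector satisfies a first set distance condition, as a target vector; obtaining the residual vector between the query vector and the target vector as a query residual vector; obtaining, according to a pre-established second index, the code corresponding to each sample vector in a plurality of sample vectors, where the second index includes the code corresponding to each sample residual vector obtained by performing product quantization on the sample residual vector corresponding to each sample vector; and obtaining, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance from the query vector satisfies a second set distance condition from the plurality of sample vectors, as a query result. The method can increase the speed of vector queries.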

Description

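Vector query method, device, electronic equipment, and storage medium
Technical Field
This application relates to the field of data processing technology, and more specifically, to a vector query method, device, electronic equipment, and storage medium.
Background
With the development of the Internet, more and more users use the Internet to search for content of interest. In a search, the query content and the sample content are usually represented by vectors, the query is performed according to the vectors, and the sample content that matches the query content is finally obtained. As user needs grow, the content that users query becomes more and more complex and queries take more time, so query efficiency needs to be improved.
Summary of the Invention
In view of the above problems, this application proposes a vector query method, device, electronic equipment, and storage medium.
In a first aspect, an embodiment of the present application provides a vector query method, including: obtaining a query vector; obtaining, according to a pre-established first index, a first cluster center vector whose distance from the query vector satisfies a first set distance condition, as a target vector, where the first index includes a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors; obtaining the residual vector between the query vector and the target vector as a query residual vector; obtaining, according to a pre-established second index, the code corresponding to each sample vector in the plurality of sample vectors, where the second index includes the code corresponding to each sample residual vector obtained by performing product quantization on the sample residual vector corresponding to each sample vector, and the sample residual vector is the residual vector between the sample vector and the target vector; and obtaining, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance from the query vector satisfies a second set distance condition from the plurality of sample vectors, as a query result.
In a second aspect, an embodiment of the present application provides a vector query device, including a vector acquiring module, a first determining module, a residual acquiring module, a second determining module, and a vector determining module. The vector acquiring module is configured to obtain a query vector. The first determining module is configured to obtain, according to a pre-established first index, a first cluster center vector whose distance from the query vector satisfies a first set distance condition, as a target vector, where the first index includes a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, and each first cluster includes a plurality of sample vectors. The residual acquiring module is configured to obtain the residual vector between the query vector and the target vector as a query residual vector. The second determining module is configured to obtain, according to a pre-established second index, the code corresponding to each sample vector in the plurality of sample vectors, where the second index includes the code corresponding to each sample residual vector obtained by performing product quantization on the sample residual vector corresponding to each sample vector, and the sample residual vector is the residual vector between the sample vector and the target vector. The vector determining module is configured to obtain, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance from the query vector satisfies a second set distance condition from the plurality of sample vectors, as a query result.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; and one or more application programs, where the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to execute the vector query method provided in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing program code, where the program code can be invoked by a processor to execute the vector query method provided in the first aspect.
In the solution provided by this application, a query vector is obtained; a first cluster center vector whose distance from the query vector satisfies the first set distance condition is obtained as the target vector according to the pre-established first index; and the residual vector between the query vector and the target vector is obtained as the query residual vector. The code corresponding to each sample vector in the plurality of sample vectors is then obtained according to the pre-established second index, which includes the code corresponding to each sample residual vector obtained by performing product quantization on the sample residual vector corresponding to each sample vector. Finally, according to the query residual vector and the code corresponding to each sample residual vector, a sample vector whose distance from the query vector satisfies the second set distance condition is obtained from the plurality of sample vectors as the query result. Vector retrieval is thus realized through coarse clustering and product quantization, which reduces the complexity of vector retrieval step by step, so that both the speed and the accuracy of vector retrieval can be guaranteed.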
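Brief Description of the Drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 shows a flowchart of an index construction method according to an embodiment of the present application.
FIG. 2 shows a flowchart of step S110 in the index construction method according to an embodiment of the present application.
FIG. 3 shows a schematic diagram of the principle of establishing the second index provided by an embodiment of the present application.
FIG. 4 shows a flowchart of a vector query method according to an embodiment of the present application.
FIG. 5 shows a flowchart of step S220 in the vector query method provided in an embodiment of the present application.
FIG. 6 shows a flowchart of step S250 in the vector query method provided in an embodiment of the present application.
FIG. 7 shows a flowchart of a vector query method according to another embodiment of the present application.
FIG. 8 shows a block diagram of a vector query device according to an embodiment of the present application.
FIG. 9 is a block diagram of an electronic device for executing the vector query method according to an embodiment of the present application.
FIG. 10 shows a storage unit, according to an embodiment of the present application, for storing or carrying program code that implements the vector query method according to an embodiment of the present application.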
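Detailed Description
To enable those skilled in the art to better understand the solutions of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments.
With the arrival of the network era, more and more users use the Internet to query the content they need, for example pictures, products, and news. Content queries are implemented with vector retrieval. For example, the query content entered by the user is quantized into a vector, a brute-force algorithm then compares the distance or similarity between this vector and the vector of each sample content, and the query result is determined from the comparison. When the data scale is large, this creates heavy computational pressure.
In traditional vector retrieval technology, an index structure is also established and used during vector retrieval. Current index structures mainly include tree indexes, hash indexes, graph indexes, and vector quantization. Tree indexes are generally efficient when the space dimension is low, but their performance and accuracy are unsatisfactory when the vector dimension is high. Hash indexes can be built quickly but fall short in retrieval accuracy: for high-dimensional vectors at the scale of tens of millions or more, accuracy is usually below 50%, which makes them hard to apply in most scenarios. Graph indexes achieve good results for similarity computation over tens of millions of vectors, but when the data scale reaches hundreds of millions, index construction takes a long time, retrieval also takes a long time, and the requirements of online computation cannot be met; moreover, adding index entries for subsequently added samples causes wide-ranging cascading changes to the index structure, so performance is hard to guarantee. For vector quantization (for example clustering and product quantization), when handling similarity computation over hundreds of millions of vectors, relying on clustering alone or on product quantization alone requires a long index construction time if retrieval accuracy is to be ensured.
In view of the above problems, after long-term research the inventor proposes the vector query method, device, electronic equipment, and storage medium provided by the embodiments of the present application. Using the coarse clustering result of the sample data, the cluster center closest to the query vector is obtained; then, according to the index established by product quantization, the sample vector closest to the query vector is obtained within the cluster of that cluster center. There is thus no need to compute by brute force the distance between the query vector and every sample vector, which greatly shortens query time, and the use of product quantization effectively improves the accuracy of vector queries. The specific vector query method is described in detail in the subsequent embodiments.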
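Please refer to FIG. 1, which shows a schematic flowchart of an index construction method provided by an embodiment of the present application. The index construction method clusters the sample data and establishes a first index from the clustering result, then applies product quantization on the basis of the clustering result to obtain a codebook and establish a second index; both the first index and the second index are used in the vector query process. In a specific embodiment, the index construction method can be applied to an electronic device. The specific flow of this embodiment is described below taking an electronic device as an example; it can be understood that the electronic device may be a server or a similar device, which is not limited here. The flow shown in FIG. 1 is described in detail below, and the index construction method may specifically include the following steps:
Step S110: Cluster all sample vectors to obtain a plurality of first clusters.
In the embodiments of the present application, the electronic device may store a large amount of sample content. Sample content may be of many types, such as images, video, audio, documents, web pages, and news posts; the specific type is not limited. For example, when the electronic device is used to process and query images, the sample content may be image content. To make the sample content usable for content analysis and querying, it can be processed to obtain the sample vector corresponding to the sample content, which characterizes the features of the sample content.
In some implementations, features can be extracted from the sample content according to its type to form the sample vector. For example, for image-type sample content, image features such as brightness value, gray value, number of pixels, gray mean, and gray median can be extracted as the elements of the sample vector. For audio sample content, features such as timbre, pitch, volume, and the text content and keywords in the audio can be extracted as the elements of the sample vector. For document content, features such as word segments, keywords, and word frequency can be extracted as the elements of the sample vector. Of course, the specific way of obtaining sample vectors is not limited.
In some implementations, for more complex types of sample content, a larger number of features can be extracted, forming a high-dimensional sample vector. The specific dimension of the sample vector is not limited; for example, it may be several hundred or several thousand dimensions.
Further, after a sample vector is constructed for each sample content, the sample vector corresponding to each sample content is obtained.
In some implementations, the scale of the sample vectors is usually very large, for example on the order of one billion, and the business scenario is generally to find, among one billion high-dimensional sample vectors, the few vectors similar to the query vector. If this problem were solved by clustering alone, the number of clusters might need to be in the tens of thousands or more, and clustering would converge very slowly. All sample vectors can therefore first be divided by coarse clustering; that is, after the sample vectors are obtained, all sample vectors are clustered.
In some implementations, clustering the sample vectors produces multiple clusters, that is, multiple categories. The resulting clusters are taken as first clusters, and each first cluster has one cluster center. The cluster center can be understood as the centroid of the cluster, located at the center of the distribution of the sample vectors in that cluster. Each cluster center can be represented by a vector of the same dimension as the sample vectors, called the center vector of the cluster center; the center vector of a first cluster is taken as the first cluster center vector.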
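In some implementations, all sample vectors may be clustered using the K-means clustering algorithm. Specifically, referring to Fig. 2, clustering all sample vectors to obtain a plurality of first clusters includes:
Step S111: Determine the number of clusters according to the number of all the sample vectors, or according to a set algorithm.
In some implementations, because the sample vectors only need to be coarsely clustered, the number of clusters may be determined from the number of sample vectors and the required precision, so that the number of first clusters obtained equals the predetermined number of clusters. The number of clusters may also be determined by a set algorithm, such as the elbow method or the silhouette coefficient. In addition, the number of clusters may satisfy a certain relationship with the number of sample vectors, so as to achieve the effect of coarse clustering.
Step S112: According to the number of clusters, cluster all the sample vectors using the K-means clustering algorithm to obtain a plurality of first clusters, the number of first clusters being equal to the number of clusters.
In some implementations, the electronic device may set and adjust the clustering parameters of the K-means clustering algorithm according to the number of clusters and then cluster all sample vectors, so that the number of first clusters obtained equals the number determined in step S111, achieving the coarse clustering of the sample vectors. When clustering with the K-means clustering algorithm, K sample vectors may be selected at random as the initial mean vectors; the distance from each sample to each mean vector is computed, and the sample is assigned to the cluster with the smallest distance; new mean vectors are then computed, and the process iterates until the mean vectors no longer change or the maximum number of iterations is reached.
Of course, the specific method of clustering all the vectors is not limited; other clustering methods, such as hierarchical clustering algorithms or density-based clustering algorithms, may also be used.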
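The K-means procedure described for step S112 might look like the following minimal sketch. It assumes squared Euclidean distance and a fixed random seed for reproducibility; the names `kmeans`, `sq_dist`, `max_iter`, and `seed` are illustrative and not part of the application.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=50, seed=0):
    """Cluster `vectors` into k clusters; return (centroids, assignment)."""
    rng = random.Random(seed)
    # Pick k distinct sample vectors at random as the initial mean vectors.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignment = []
    for _ in range(max_iter):
        # Assign every sample to the cluster whose mean vector is closest.
        assignment = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        # Recompute each mean vector from the samples assigned to it.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                dim = len(members[0])
                new_centroids.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centroids.append(centroids[c])  # empty cluster: keep old center
        if new_centroids == centroids:  # mean vectors no longer change
            break
        centroids = new_centroids
    return centroids, assignment
```

The loop stops either when the mean vectors stop changing or when `max_iter` is reached, mirroring the two termination conditions named in the text.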
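Step S120: Obtain the first cluster center vector corresponding to the cluster center of each first cluster.
In the embodiments of this application, clustering all sample vectors yields the plurality of first clusters together with the first cluster center vector corresponding to the cluster center of each first cluster, i.e. the first cluster center vectors obtained in step S110.
Step S130: Obtain the first cluster center vector closest to each sample vector.
In the embodiments of this application, for each sample vector, the distance between the sample vector and each first cluster center vector may be computed. From these distances, the first cluster center vector closest to each sample vector can be determined. The first cluster center vector closest to a sample vector is the first cluster center vector of the first cluster to which that sample vector belongs. The distance between a sample vector and a first cluster center vector characterizes how far apart the two are and may be the Euclidean distance, the Mahalanobis distance, the cosine distance, or the like, which is not limited here.
Step S140: Establish the index relationship between each first cluster and its corresponding first cluster center vector, and the index relationship between each sample vector and its corresponding first cluster, to obtain the first index.
In the embodiments of this application, after the first cluster center vector closest to each sample vector has been determined, the sample vectors that share the same first cluster center vector can be identified and assigned to the first cluster of that center vector, which determines the sample vectors under each first cluster. On this basis, the index relationship between each sample vector and its first cluster can be established, together with an inverted index from each first cluster to its sample vectors and the index relationship between each first cluster and its first cluster center vector, giving the first index. From the first index, one can look up the first cluster center vector of each first cluster, the first cluster to which each sample vector belongs, and the sample vectors under each first cluster.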
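As an illustration of steps S130 and S140, the first index can be pictured as three lookup tables. The sketch below is an assumption about one possible in-memory layout (plain dictionaries, squared Euclidean distance); the name `build_first_index` and the dictionary layout are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_first_index(samples, centroids):
    """Build the three lookups that make up the first index.

    samples:   dict mapping sample id -> sample vector
    centroids: list of first cluster center vectors (position = cluster id)
    """
    cluster_to_centroid = dict(enumerate(centroids))
    sample_to_cluster = {}
    cluster_to_samples = {cid: [] for cid in cluster_to_centroid}  # inverted index
    for sid, vec in samples.items():
        # The closest first cluster center decides which cluster the sample joins.
        cid = min(cluster_to_centroid, key=lambda c: sq_dist(vec, cluster_to_centroid[c]))
        sample_to_cluster[sid] = cid
        cluster_to_samples[cid].append(sid)
    return cluster_to_centroid, sample_to_cluster, cluster_to_samples
```

The inverted table `cluster_to_samples` is what lets a later query restrict its search to the samples of a single first cluster.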
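Step S150: Obtain the sample residual vector corresponding to each of the sample vectors, the sample residual vector being the residual vector between each sample vector and its corresponding first cluster center vector.
In the embodiments of this application, after the first index is built, the residual vector between each sample vector and its corresponding first cluster center vector, as recorded in the first index, may be obtained and taken as the sample residual vector. The residual vector may be obtained by subtracting the corresponding first cluster center vector from the sample vector.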
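The subtraction in step S150 is an element-wise operation. A one-function sketch, with the hypothetical name `residual`:

```python
def residual(vector, center):
    """Element-wise difference: the vector minus its first cluster center vector."""
    return [v - c for v, c in zip(vector, center)]
```

For instance, a sample vector (3, 5) assigned to a cluster centered at (1, 2) has the residual (2, 3).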
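Step S160: Reduce each sample residual vector to a plurality of sample sub-vectors in a plurality of subspaces, the plurality of sample sub-vectors corresponding one-to-one to the plurality of subspaces.
In the embodiments of this application, after the coarse clustering of the sample vectors, product quantization may be performed to build a finer index. Specifically, the product quantization process may be steps S160 to S190 of this embodiment.
In the embodiments of this application, sample vectors are usually high-dimensional (for example 256-dimensional), and the first cluster center vectors have the same dimensionality as the sample vectors, so the sample residual vector of each sample vector also has the same dimensionality as the sample vector; that is, the sample residual vector is also high-dimensional. Because high-dimensional vectors are costly to process, the sample residual vectors may be reduced in dimensionality.
In some implementations, a sample residual vector may be reduced to a plurality of sample sub-vectors in a plurality of subspaces, in one-to-one correspondence; that is, each sample residual vector has one sample sub-vector in each subspace, and the sample sub-vectors of all subspaces, combined, form the complete sample residual vector. Reducing the dimensionality of the sample residual vectors in this way lowers the processing complexity and provides a finer partition on top of the coarse clustering. The subspaces may be obtained by an even split, so that the sample sub-vectors in different subspaces have the same number of dimensions; an uneven split is also possible, in which case the sample sub-vectors in different subspaces may have different numbers of dimensions.
For example, a 30-dimensional sample residual vector (i_1, i_2, i_3, …, i_28, i_29, i_30) may be split evenly into 6 subspaces: the sub-vector in the first subspace may be (i_1, i_2, i_3, i_4, i_5), in the second subspace (i_6, i_7, i_8, i_9, i_10), in the third subspace (i_11, i_12, i_13, i_14, i_15), in the fourth subspace (i_16, i_17, i_18, i_19, i_20), in the fifth subspace (i_21, i_22, i_23, i_24, i_25), and in the sixth subspace (i_26, i_27, i_28, i_29, i_30). Of course, this division of the space and of the sub-vectors is only an example.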
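The even split in this example can be sketched as below. The helper name `split_into_subvectors` is hypothetical, and the sketch covers only the even-split case; an uneven split would need explicit subspace boundaries.

```python
def split_into_subvectors(vector, num_subspaces):
    """Split a vector evenly into `num_subspaces` contiguous sub-vectors."""
    if len(vector) % num_subspaces != 0:
        raise ValueError("dimension must be divisible by the number of subspaces")
    width = len(vector) // num_subspaces
    return [vector[j * width:(j + 1) * width] for j in range(num_subspaces)]
```

Applied to a 30-dimensional vector with 6 subspaces, it returns six 5-dimensional sub-vectors whose concatenation is the original vector.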
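Step S170: Cluster the sample sub-vectors in the same subspace to obtain a plurality of second clusters in each subspace and the second cluster center vector corresponding to each second cluster.
In the embodiments of this application, after each sample residual vector has been reduced to sample sub-vectors in a plurality of subspaces, the sample sub-vectors in the same subspace may be clustered, subspace by subspace. As a specific implementation, the K-means clustering algorithm may be used to cluster the sample sub-vectors in the same subspace. The clustering result includes a plurality of second clusters, each of which also has a cluster center, taken as the second cluster center. As with the clustering in step S110, each second cluster center has a second cluster center vector, whose dimensionality equals that of the sample sub-vectors. For the specific clustering procedure, refer to the clustering in step S110; it is not repeated here.
In some implementations, because every sample residual vector is reduced in the same way, every sample residual vector is divided into the same number of sample sub-vectors. Clustering the sample sub-vectors in each subspace yields a plurality of second clusters and their second cluster center vectors. The same clustering algorithm may be used in every subspace, giving the same number of second clusters; different clustering algorithms may also be used in different subspaces, giving different numbers of clusters.
In some implementations, before the sample sub-vectors in the same subspace are clustered, the sample sub-vectors may be further processed to improve accuracy. Specifically, each sample sub-vector may be transformed by a reference orthogonal matrix before clustering.
The reference orthogonal matrix may be determined by the optimized product quantization method. For example, it may be obtained by minimizing the quantization error function, or by minimizing the quantization error function and then refining the result iteratively.
In this implementation, because the sample sub-vectors are transformed by the reference orthogonal matrix before clustering, the clustering error is smaller and the clustering is more accurate.
In some implementations, the clustering of the sample sub-vectors in the same subspace may also be carried out within each first cluster: for the sample residual vectors of the sample vectors belonging to one first cluster, after they are reduced to sample sub-vectors, the sample sub-vectors in the same subspace are clustered. When the second index built in this way is used for a vector query, once the first cluster corresponding to the query vector has been found, the sample vector serving as the query result can be found more quickly from the second index.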
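Step S180: Encode each second cluster to obtain the sub-code corresponding to each second cluster, where the plurality of sub-codes corresponding to each sample residual vector constitute the code of that sample residual vector.
In the embodiments of this application, after the sample sub-vectors in each subspace have been clustered and the second clusters of each subspace obtained, the second clusters in each subspace may be encoded to obtain the sub-code of each second cluster, so that an index can be built from the sub-codes.
For example, if each sample residual vector is reduced to sample sub-vectors in 2 subspaces and each subspace has L second clusters, the L second clusters of the first subspace may be encoded in order, giving the sub-codes 1, 2, 3, …, L.
Step S190: Establish the index relationship between the sub-codes corresponding to each sample residual vector and the second cluster center vectors, to obtain the second index.
In the embodiments of this application, after each second cluster has been encoded, then for each sample residual vector and each of its sample sub-vectors, the closest second cluster center vector in the corresponding subspace, and the second cluster of that center vector, are determined, and the sub-code of that second cluster is associated with the sample sub-vector. This determines the sub-code of every sample sub-vector of the sample residual vector and establishes the index relationship between the sub-codes of each sample residual vector and the second cluster center vectors, giving the second index. The sub-codes of the sample sub-vectors of a sample residual vector together constitute the code of that sample residual vector.
The process of building the second index is described below with reference to the example in Fig. 3. In the product quantization training stage, for N sample residual vectors of 256 dimensions (the dimensionality of the sample vectors), each may be split evenly into 4 subspaces, so that the sample sub-vectors in each subspace have 64 dimensions. In each subspace, clustering the sample sub-vectors produces 256 second clusters, each of which can be encoded as a 1-byte integer; the sub-codes of the 256 second clusters serve as the codebook of that subspace. Because every second cluster is encoded as a 1-byte integer, when a sample residual vector is approximated by its second cluster center vectors in the subspaces, it can be quantized into a 4-byte code, i.e. the sub-codes of 4 second clusters. Here, the second cluster center vector in each subspace is the one closest to the sample sub-vector of the sample residual vector in that subspace. For example, suppose that for sample residual vector A the closest second cluster center vector belongs to the second cluster with sub-code 23 in the first subspace, 148 in the second subspace, 235 in the third subspace, and 230 in the fourth subspace. When A is approximated by these second cluster center vectors, it is quantized as (23, 148, 235, 230), which is the code of the sample residual vector. From the second cluster center vectors corresponding to the code, the index relationship between the sub-codes of the sample residual vector and the second cluster center vectors is established, giving the second index. When a sample residual vector is looked up, the second index and the code of the sample residual vector yield the second cluster center vector of each sub-code, and the combination of the four second cluster center vectors serves as the approximation of the sample residual vector.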
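The encoding and lookup just described can be sketched as follows, assuming the per-subspace second cluster center vectors have already been trained. In this sketch a sub-code is simply the zero-based position of a center in its subspace's list (the text numbers sub-codes from 1 in one example); `encode`, `decode`, and the `codebooks` layout are illustrative assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def encode(residual_vec, codebooks):
    """Quantize a residual vector into one sub-code per subspace.

    codebooks[j] is the list of second cluster center vectors of subspace j;
    the sub-code is the position of the closest center in that list.
    """
    width = len(residual_vec) // len(codebooks)
    code = []
    for j, book in enumerate(codebooks):
        sub = residual_vec[j * width:(j + 1) * width]
        code.append(min(range(len(book)), key=lambda c: sq_dist(sub, book[c])))
    return tuple(code)


def decode(code, codebooks):
    """Approximate a residual vector by concatenating the centers its code names."""
    approx = []
    for j, sub_code in enumerate(code):
        approx.extend(codebooks[j][sub_code])
    return approx
```

With 4 subspaces of 256 centers each, as in the Fig. 3 example, every sub-code fits in one byte and the whole code in four bytes, regardless of the 256-dimensional input.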
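The index construction method provided by the embodiments of this application coarsely clusters the sample vectors to obtain a plurality of first clusters and their first cluster center vectors and, from the coarse clustering result, establishes the index relationship between each first cluster and its first cluster center vector and between each sample vector and its first cluster, giving the first index. Further, the sample residual vector between each sample vector and its first cluster center vector is computed, reduced in dimensionality, and clustered; each second cluster is encoded; and the index relationship between the sub-codes of each sample residual vector and the second cluster center vectors is established, giving the second index. The index is thus built by a coarse clustering that provides a rough partition, followed by product quantization that provides an approximate fine partition, which greatly reduces the index construction time.
Referring to Fig. 4, Fig. 4 shows a schematic flowchart of the vector query method provided by an embodiment of this application. The method uses the coarse clustering result of the sample data to obtain the cluster center closest to the query vector and then, using the index built by product quantization, obtains the sample vector closest to the query vector within the cluster of that center, improving the efficiency of the vector query. In a specific embodiment, the vector query method may be applied to the electronic device described above. The flow shown in Fig. 4 is described in detail below. The vector query method may specifically include the following steps: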
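Step S210: Obtain a query vector.
In the embodiments of this application, the query vector may be a vector generated from the content the user wants to query, for example from text content or from image content, which is not limited here. The query vector may be generated in the same way as the sample vectors in the foregoing embodiments, which is not repeated here.
Step S220: According to a pre-established first index, obtain the first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector, the first index including a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, each first cluster including a plurality of sample vectors.
In the embodiments of this application, the pre-established first index consists of the index relationship between each first cluster and its first cluster center vector and the index relationship between each sample vector and its first cluster. The first index may therefore include the plurality of first clusters obtained by clustering the sample vectors and the first cluster center vector corresponding to each first cluster, each first cluster including a plurality of sample vectors. For how the first index is built, refer to the foregoing embodiments; it is not repeated here.
The first index may be stored in the electronic device in advance. When a query vector is obtained, the first cluster center vector whose distance to the query vector satisfies the first set distance condition may be obtained according to the first index and taken as the target vector.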
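In some implementations, referring to Fig. 5, step S220 may include:
Step S221: According to the first index, obtain the first cluster center vector corresponding to each first cluster.
Step S222: Compute the distance between the query vector and each first cluster center vector.
Step S223: According to the distances between the query vector and the first cluster center vectors, obtain the first cluster center vector whose distance to the query vector satisfies the first set distance condition.
It will be understood that the first cluster center vector of each first cluster can be looked up from the first index, after which the distance between the query vector and each first cluster center vector is computed and the first cluster center vector whose distance to the query vector satisfies the first set distance condition is obtained. The first cluster center vector so obtained is the target vector, and the first cluster corresponding to the target vector is the cluster that best matches the query vector.
In some implementations, the first set distance condition may include: being the first cluster center vector with the smallest distance to the query vector; or being a first cluster center vector whose distance to the query vector is less than a first distance threshold.
In this case, when the first set distance condition is being the first cluster center vector with the smallest distance to the query vector, the distances between the query vector and the first cluster center vectors may be computed and sorted in ascending order, and the first cluster center vector with the smallest distance is taken as the target vector. When the first set distance condition is being a first cluster center vector whose distance to the query vector is less than the first distance threshold, the distances are computed and the first cluster center vectors whose distances are less than the first distance threshold are selected as target vectors.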
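The two variants of the first set distance condition can be sketched in one helper. This is an assumption-laden illustration using Euclidean distance; `select_targets` and its `threshold` parameter are hypothetical names. It takes the minimum directly rather than sorting all distances, which gives the same target.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_targets(query, centroids, threshold=None):
    """Return the ids of the first clusters whose centers satisfy the condition.

    centroids: dict mapping cluster id -> first cluster center vector
    threshold: None selects only the nearest center; otherwise every center
               closer than `threshold` is selected.
    """
    dists = {cid: euclidean(query, center) for cid, center in centroids.items()}
    if threshold is None:
        return [min(dists, key=dists.get)]
    return [cid for cid, d in dists.items() if d < threshold]
```

With the threshold variant more than one first cluster may qualify, in which case the later steps would be repeated for each selected cluster.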
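Step S230: Obtain the residual vector between the query vector and the target vector, as a query residual vector.
In the embodiments of this application, the target vector may be subtracted from the query vector to obtain the residual vector between the query vector and the target vector.
Step S240: According to a pre-established second index, obtain the code corresponding to each of the plurality of sample vectors, the second index including the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector, the sample residual vector being the residual vector between the sample vector and the target vector.
In the embodiments of this application, the second index may be the index relationship between the sub-codes of each sample residual vector and the second cluster center vectors. The second index may include the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector, i.e. the code of each sample residual vector obtained when the second index was built in the foregoing embodiments. For how the second index is built, refer to the foregoing embodiments; it is not repeated here.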
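Step S250: According to the query residual vector and the code of each sample residual vector, obtain, from the plurality of sample vectors, the sample vector whose distance to the query vector satisfies a second set distance condition, as the query result.
In the embodiments of this application, referring to Fig. 6, step S250 may include:
Step S251: According to the query residual vector and the code of each sample residual vector, obtain the distance between the query vector and each sample vector.
In the embodiments of this application, as follows from the process of building the second index in the foregoing embodiments, a code includes a plurality of sub-codes, and the second index further includes the second cluster center vector corresponding to each sub-code. The sub-codes are obtained by reducing each sample residual vector to a plurality of sample sub-vectors in a plurality of subspaces, clustering the sample sub-vectors in the same subspace to obtain a plurality of second clusters in each subspace, and encoding the second clusters; the sample sub-vectors correspond one-to-one to the subspaces.
In some implementations, step S251 may include:
reducing the query residual vector to a plurality of sub-vectors in a plurality of subspaces, as a plurality of query sub-vectors, the query sub-vectors corresponding one-to-one to the subspaces; according to the second index, obtaining the second cluster center vector corresponding to each of the sub-codes of each sample residual vector; and, according to the query sub-vectors and the second cluster center vectors corresponding to each sample residual vector, obtaining the distance between the query vector and each sample vector.
It will be understood that the query residual vector may be reduced in the same way as in the process of building the second index in the foregoing embodiments, i.e. into a plurality of sub-vectors in a plurality of subspaces, which are taken as the query sub-vectors and correspond one-to-one to the subspaces. From the second index, the code of each sample residual vector and the second cluster center vector of each sub-code in the code are known. The distance between the query vector and each sample vector can then be obtained from the query sub-vectors and the second cluster center vectors corresponding to each sample residual vector.
Obtaining the distance between the query vector and each sample vector according to the query sub-vectors and the second cluster center vectors corresponding to each sample residual vector includes: for any one of the sample residual vectors, computing, in each subspace, the distance between the query sub-vector and the second cluster center vector of that sample residual vector in the same subspace; and, for each sample residual vector, according to the correspondence between sample residual vectors and sample vectors, summing the distances computed in the subspaces to obtain the distance between the query vector and each sample vector.
It will be understood that each sample residual vector corresponds to a sample vector, that the sample residual vector is the residual vector between the sample vector and the first cluster center vector, and that the query residual vector is the residual vector between the query vector and the same first cluster center vector. Computing the distance between the query residual vector and the sample residual vector therefore amounts to computing the distance between the query vector and the sample vector; for example, (A-B)-(C-B) equals A-C. The distance between the query residual vector and the sample residual vector can in turn be obtained by computing, in each subspace, the distance between the query sub-vector and the second cluster center vector and adding the resulting distances, which gives the distance between the query vector and the sample vector.
Step S252: According to the distance between the query vector and each sample vector, obtain, from the plurality of sample vectors, the sample vector whose distance to the query vector satisfies the second set distance condition, as the query result.
In some implementations, the second set distance condition includes: being the sample vector with the smallest distance to the query vector; or being a sample vector whose distance to the query vector is less than a second distance threshold.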
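Steps S251 and S252 can be sketched as a table-lookup distance computation. The sketch assumes squared Euclidean distance, for which the per-subspace values add up exactly to the distance between the query residual and the quantized sample residual; the names `approximate_distances`, `nearest_sample`, `codes`, and `codebooks` are illustrative and not from the application.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def approximate_distances(query_residual, codes, codebooks):
    """Approximate squared distance from the query to every encoded sample.

    codes:     dict mapping sample id -> tuple of sub-codes (one per subspace)
    codebooks: codebooks[j][c] is the second cluster center c of subspace j
    """
    width = len(query_residual) // len(codebooks)
    # table[j][c]: distance between the j-th query sub-vector and center c,
    # computed once per query and reused for every sample.
    table = []
    for j, book in enumerate(codebooks):
        sub_q = query_residual[j * width:(j + 1) * width]
        table.append([sq_dist(sub_q, center) for center in book])
    # Sum the per-subspace entries selected by each sample's sub-codes.
    return {
        sid: sum(table[j][c] for j, c in enumerate(code))
        for sid, code in codes.items()
    }


def nearest_sample(distances):
    """Second set distance condition, nearest-sample variant."""
    return min(distances, key=distances.get)
```

Because the table is built once per query, each sample costs only one lookup and one addition per subspace instead of a full high-dimensional distance computation.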
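In the vector query method provided by the embodiments of this application, a query vector is obtained; according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition is obtained as the target vector; the residual vector between the query vector and the target vector is obtained as the query residual vector; according to the pre-established second index, the code corresponding to each of the plurality of sample vectors is obtained, the second index including the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector; and, according to the query residual vector and the code of each sample residual vector, the sample vector whose distance to the query vector satisfies the second set distance condition is obtained from the plurality of sample vectors as the query result. Vector retrieval is thus implemented through coarse clustering and product quantization, which reduces the complexity of vector retrieval step by step and ensures both its speed and its accuracy.
Referring to Fig. 7, Fig. 7 shows a schematic flowchart of the vector query method provided by another embodiment of this application. The vector query method may be applied to the electronic device described above. The flow shown in Fig. 7 is described in detail below. The vector query method may specifically include the following steps:
Step S310: Obtain a query vector.
In some implementations, step S310 may include: obtaining a service query request; determining whether the service query request carries a vector; if it carries a vector, taking that vector as the query vector; and if it does not carry a vector, generating a query vector corresponding to the service query request.
It will be understood that, after obtaining the service query request, the electronic device may parse the parameters in the request and determine whether they carry a vector. If a vector is carried, it may be used directly as the query vector; if not, the query vector may be generated in the manner described in the foregoing embodiments.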
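The branching in step S310 might be sketched as follows. The request is modeled as a plain dictionary, and the field names `"vector"` and `"content"` as well as the `embed` callback are hypothetical; the application does not specify a request format or a particular feature extractor.

```python
def get_query_vector(request, embed):
    """Return the query vector for a service query request.

    request: dict that may carry a ready-made vector under "vector"
             (hypothetical field name), otherwise raw content under "content"
    embed:   callable that turns raw content into a vector
    """
    vector = request.get("vector")
    if vector is not None:
        return list(vector)  # the request already carries a vector
    return embed(request["content"])  # otherwise generate one from the content
```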
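Step S320: Determine whether a historical query result corresponding to the query vector exists.
In some implementations, because the electronic device serves queries from different users and the same user may query the same content several times, the electronic device may store past query results in association with their query vectors. After a query vector is obtained, the electronic device may determine whether a historical query result for that query vector is stored locally and, according to the outcome, decide whether to carry out the query process.
Step S330: If a historical query result corresponding to the query vector exists, take the sample vector in the historical query result as the query result.
It will be understood that, if a historical query result for the query vector is stored locally, it can be used directly as the query result for the current query vector, which saves the time the query process would take.
Step S340: If no historical query result corresponding to the query vector exists, obtain, according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition, as the target vector, the first index including a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, each first cluster including a plurality of sample vectors.
It will be understood that, if no historical query result for the query vector is stored locally, the query process is carried out, i.e. steps S340 to S370 are executed.
Step S350: Obtain the residual vector between the query vector and the target vector, as the query residual vector.
Step S360: According to the pre-established second index, obtain the code corresponding to each of the plurality of sample vectors, the second index including the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector, the sample residual vector being the residual vector between the sample vector and the target vector.
Step S370: According to the query residual vector and the code of each sample residual vector, obtain, from the plurality of sample vectors, the sample vector whose distance to the query vector satisfies the second set distance condition, as the query result.
In the embodiments of this application, for steps S340 to S370, refer to the foregoing embodiments; they are not repeated here.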
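The history check of steps S320 to S340 behaves like a cache in front of the index search. A minimal sketch, assuming exact-match lookup on the query vector and an in-memory dictionary; `query_with_cache` and the `search` callback are hypothetical names.

```python
def query_with_cache(query, cache, search):
    """Return a stored historical result if present, otherwise run the search.

    cache:  dict mapping a query vector (as a tuple) -> stored query result
    search: callable performing the full index-based query (steps S340 to S370)
    """
    key = tuple(query)  # lists are not hashable, so use a tuple as the key
    if key in cache:
        return cache[key]  # historical query result exists: skip the search
    result = search(query)
    cache[key] = result  # store the result for later identical queries
    return result
```

Exact-match keys only help when the same vector is submitted again; how near-identical queries should be treated is left open by the text.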
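In some implementations, the plurality of sample vectors are stored in a first database, i.e. the query result obtained in the above steps is a query result based on the first database. A second database may also exist, and its sample vectors, first index, and second index may differ from those of the first database. The vector query method may further include:
taking the query result as a first query result; obtaining a second query result according to the first index and the second index built from the sample vectors in the second database; and merging the first query result and the second query result to obtain a third query result, as the query result corresponding to the query vector.
It will be understood that a vector query may be performed, in the manner of steps S340 to S370 above, using the first index and the second index built from the sample vectors in the second database, to obtain the second query result. The first query result and the second query result are then merged into the third query result, which is taken as the query result of the current vector query, making the vector query more accurate. Merging may mean taking both the first query result and the second query result as the final query result.
In some scenarios, the databases may include a large database and a small database. The large database holds a large amount of sample data and is mainly used to store historical sample vectors; the small database holds relatively little sample data and is mainly used to store recently acquired sample vectors. With the above approach, the vector query can cover all sample vectors, improving the accuracy of the vector query.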
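One way to picture the merge of the large-database and small-database results is sketched below. The text only says that both results are taken as the final result; the de-duplication, the sorting by distance, and the optional `top_k` cut are assumptions added for illustration, as is the name `merge_results`.

```python
def merge_results(first, second, top_k=None):
    """Merge two result lists of (sample_id, distance) pairs.

    A sample id that appears in both lists keeps its smaller distance.
    The merged list is ordered by ascending distance (an assumption).
    """
    best = {}
    for sid, dist in list(first) + list(second):
        if sid not in best or dist < best[sid]:
            best[sid] = dist
    merged = sorted(best.items(), key=lambda item: item[1])
    return merged if top_k is None else merged[:top_k]
```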
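In the vector query method provided by the embodiments of this application, a query vector is obtained and it is determined whether a historical query result for the query vector exists. If one exists, it can be used directly as the query result, which reduces the amount of processing. If none exists, the first cluster center vector whose distance to the query vector satisfies the first set distance condition is obtained, according to the pre-established first index, as the target vector; the residual vector between the query vector and the target vector is obtained as the query residual vector; according to the pre-established second index, the code corresponding to each of the plurality of sample vectors is obtained, the second index including the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector; and, according to the query residual vector and the code of each sample residual vector, the sample vector whose distance to the query vector satisfies the second set distance condition is obtained from the plurality of sample vectors as the query result. Vector retrieval is thus implemented through coarse clustering and product quantization, which reduces the complexity of vector retrieval step by step and ensures both its speed and its accuracy.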
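Referring to Fig. 8, a structural block diagram of a vector query apparatus 400 provided by an embodiment of this application is shown. The vector query apparatus 400 may be applied to the electronic device described above. The vector query apparatus 400 includes a vector acquisition module 410, a first determination module 420, a residual acquisition module 430, a second determination module 440, and a vector determination module 450. The vector acquisition module 410 is configured to obtain a query vector. The first determination module 420 is configured to obtain, according to a pre-established first index, the first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector, the first index including a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, each first cluster including a plurality of sample vectors. The residual acquisition module 430 is configured to obtain the residual vector between the query vector and the target vector, as a query residual vector. The second determination module 440 is configured to obtain, according to a pre-established second index, the code corresponding to each of the plurality of sample vectors, the second index including the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector, the sample residual vector being the residual vector between the sample vector and the target vector. The vector determination module 450 is configured to obtain, according to the query residual vector and the code of each sample residual vector, the sample vector whose distance to the query vector satisfies a second set distance condition from the plurality of sample vectors, as the query result.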
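In some implementations, the vector determination module 450 may include a distance calculation unit and a vector screening unit. The distance calculation unit is configured to obtain the distance between the query vector and each sample vector according to the query residual vector and the code of each sample residual vector; the vector screening unit is configured to obtain, according to the distance between the query vector and each sample vector, the sample vector whose distance to the query vector satisfies the second set distance condition from the plurality of sample vectors, as the query result.
In this implementation, the code includes a plurality of sub-codes, and the second index further includes the second cluster center vector corresponding to each sub-code. The sub-codes are obtained by reducing each sample residual vector to a plurality of sample sub-vectors in a plurality of subspaces, clustering the sample sub-vectors in the same subspace to obtain a plurality of second clusters in each subspace, and encoding the second clusters; the sample sub-vectors correspond one-to-one to the subspaces.
Further, the distance calculation unit may be specifically configured to: reduce the query residual vector to a plurality of sub-vectors in a plurality of subspaces, as a plurality of query sub-vectors, the query sub-vectors corresponding one-to-one to the subspaces; obtain, according to the second index, the second cluster center vector corresponding to each of the sub-codes of each sample residual vector; and obtain the distance between the query vector and each sample vector according to the query sub-vectors and the second cluster center vectors corresponding to each sample residual vector.
In this case, the distance calculation unit obtaining the distance between the query vector and each sample vector according to the query sub-vectors and the second cluster center vectors corresponding to each sample residual vector includes: for any one of the sample residual vectors, computing, in each subspace, the distance between the query sub-vector and the second cluster center vector of that sample residual vector in the same subspace; and, for each sample residual vector, according to the correspondence between sample residual vectors and sample vectors, summing the distances computed in the subspaces to obtain the distance between the query vector and each sample vector.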
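In some implementations, the vector query apparatus 400 may further include a first index building module. The first index building module may be configured to: cluster all sample vectors to obtain a plurality of first clusters; obtain the first cluster center vector corresponding to the cluster center of each first cluster; obtain the first cluster center vector closest to each sample vector; and establish the index relationship between each first cluster and its first cluster center vector and between each sample vector and its first cluster, to obtain the first index.
In this implementation, the first index building module clustering all sample vectors to obtain a plurality of first clusters includes: determining the number of clusters according to the number of all the sample vectors, or according to a set algorithm; and, according to the number of clusters, clustering all the sample vectors using the K-means clustering algorithm to obtain a plurality of first clusters, the number of first clusters being equal to the number of clusters.
Further, the vector query apparatus 400 may also include a second index building module. The second index building module may be configured to: obtain the sample residual vector corresponding to each of the sample vectors, the sample residual vector being the residual vector between each sample vector and its corresponding first cluster center vector; reduce each sample residual vector to a plurality of sample sub-vectors in a plurality of subspaces, the sample sub-vectors corresponding one-to-one to the subspaces; cluster the sample sub-vectors in the same subspace to obtain a plurality of second clusters in each subspace and the second cluster center vector corresponding to each second cluster; encode each second cluster to obtain the sub-code of each second cluster, where the sub-codes corresponding to each sample residual vector constitute the code of that sample residual vector; and establish the index relationship between the sub-codes of each sample residual vector and the second cluster center vectors, to obtain the second index.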
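In some implementations, the first set distance condition includes: being the first cluster center vector with the smallest distance to the query vector; or being a first cluster center vector whose distance to the query vector is less than a first distance threshold.
In some implementations, the second set distance condition includes: being the sample vector with the smallest distance to the query vector; or being a sample vector whose distance to the query vector is less than a second distance threshold.
In some implementations, the vector query apparatus 400 may further include a cache query module. The cache query module is configured to determine, before the first cluster center vector whose distance to the query vector satisfies the first set distance condition is obtained as the target vector according to the pre-established first index, whether a historical query result corresponding to the query vector exists. If no historical query result corresponding to the query vector exists, the first determination module obtains, according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition, as the target vector.
In this implementation, the vector query apparatus 400 may further include a result determination module. The result determination module is configured to take the sample vector in the historical query result as the query result if a historical query result corresponding to the query vector exists.
In some implementations, the plurality of sample vectors are stored in a first database. The vector query apparatus 400 may further include a result identification module, a result query module, and a result merging module. The result identification module is configured to take the query result as a first query result; the result query module is configured to obtain a second query result according to the first index and the second index built from the sample vectors in a second database; the result merging module is configured to merge the first query result and the second query result to obtain a third query result, as the query result corresponding to the query vector.
In some implementations, the vector acquisition module 410 may include: a request acquisition unit, configured to obtain a service query request; a vector judgment unit, configured to determine whether the service query request carries a vector; a first execution unit, configured to take the vector as the query vector if a vector is carried; and a second execution unit, configured to generate a query vector corresponding to the service query request if no vector is carried.
In some implementations, the residual acquisition module 430 may be specifically configured to subtract the target vector from the query vector to obtain the residual vector between the query vector and the target vector.
In some implementations, the vector query apparatus 400 may further include an index update module. The index update module is configured to update the first index and the second index at preset intervals according to newly acquired sample vectors.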
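Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and modules described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, the coupling between modules may be electrical, mechanical, or of another form.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.
In summary, the solution provided by this application,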
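Referring to Fig. 9, a structural block diagram of an electronic device provided by an embodiment of this application is shown. The electronic device 100 may be a server or another electronic device capable of running application programs. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs, where the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to perform the methods described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 uses various interfaces and lines to connect the parts of the electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 120 and by calling the data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one of the hardware forms of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or a combination of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and so on; the GPU is responsible for rendering and drawing the displayed content; the modem handles wireless communication. It will be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include Random Access Memory (RAM) and may also include Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area. The program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the foregoing method embodiments, and so on. The data storage area may store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat records).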
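Referring to Fig. 10, a structural block diagram of a computer-readable storage medium provided by an embodiment of this application is shown. The computer-readable storage medium 800 stores program code that can be called by a processor to execute the methods described in the foregoing method embodiments.
The computer-readable storage medium 800 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. Optionally, the computer-readable storage medium 800 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 800 has storage space for program code 810 that performs any of the method steps of the above methods. The program code may be read from or written to one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application and not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.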

Claims (20)

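  1. A vector query method, characterized in that the method comprises:
    obtaining a query vector;
    according to a pre-established first index, obtaining the first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector, the first index comprising a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, each first cluster comprising a plurality of sample vectors;
    obtaining the residual vector between the query vector and the target vector, as a query residual vector;
    according to a pre-established second index, obtaining the code corresponding to each of the plurality of sample vectors, the second index comprising the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector, the sample residual vector being the residual vector between the sample vector and the target vector;
    according to the query residual vector and the code of each sample residual vector, obtaining, from the plurality of sample vectors, the sample vector whose distance to the query vector satisfies a second set distance condition, as a query result.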
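  2. The method according to claim 1, characterized in that the obtaining, according to the query residual vector and the code of each sample residual vector, from the plurality of sample vectors, the sample vector whose distance to the query vector satisfies the second set distance condition, as the query result, comprises:
    obtaining the distance between the query vector and each sample vector according to the query residual vector and the code of each sample residual vector;
    according to the distance between the query vector and each sample vector, obtaining, from the plurality of sample vectors, the sample vector whose distance to the query vector satisfies the second set distance condition, as the query result.
  3. The method according to claim 2, characterized in that the code comprises a plurality of sub-codes, the second index further comprises the second cluster center vector corresponding to each sub-code, and the sub-codes are obtained by reducing each sample residual vector to a plurality of sample sub-vectors in a plurality of subspaces, clustering the sample sub-vectors in the same subspace to obtain a plurality of second clusters in each subspace, and encoding the second clusters, the plurality of sample sub-vectors corresponding one-to-one to the plurality of subspaces.
  4. The method according to claim 3, characterized in that the obtaining the distance between the query vector and each sample vector according to the query residual vector and the code of each sample residual vector comprises:
    reducing the query residual vector to a plurality of sub-vectors in a plurality of subspaces, as a plurality of query sub-vectors, the plurality of query sub-vectors corresponding one-to-one to the plurality of subspaces;
    according to the second index, obtaining the second cluster center vector corresponding to each of the plurality of sub-codes of each sample residual vector;
    obtaining the distance between the query vector and each sample vector according to the plurality of query sub-vectors and the plurality of second cluster center vectors corresponding to each sample residual vector.
  5. The method according to claim 4, characterized in that the obtaining the distance between the query vector and each sample vector according to the plurality of query sub-vectors and the plurality of second cluster center vectors corresponding to each sample residual vector comprises:
    for any one of the plurality of sample residual vectors, computing, in each subspace, the distance between the query sub-vector and the second cluster center vector of that sample residual vector in the same subspace;
    for each sample residual vector, according to the correspondence between sample residual vectors and sample vectors, summing the distances computed in the subspaces to obtain the distance between the query vector and each sample vector.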
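  6. The method according to any one of claims 1-5, characterized in that the process of pre-establishing the first index comprises:
    clustering all sample vectors to obtain a plurality of first clusters;
    obtaining the first cluster center vector corresponding to the cluster center of each first cluster;
    obtaining the first cluster center vector closest to each sample vector;
    establishing the index relationship between each first cluster and its corresponding first cluster center vector, and the index relationship between each sample vector and its corresponding first cluster, to obtain the first index.
  7. The method according to claim 6, characterized in that the clustering all sample vectors to obtain a plurality of first clusters comprises:
    determining the number of clusters according to the number of all the sample vectors, or according to a set algorithm;
    according to the number of clusters, clustering all the sample vectors using the K-means clustering algorithm to obtain a plurality of first clusters, the number of first clusters being equal to the number of clusters.
  8. The method according to claim 6 or 7, characterized in that the process of pre-establishing the second index comprises:
    obtaining the sample residual vector corresponding to each of the sample vectors, the sample residual vector being the residual vector between each sample vector and its corresponding first cluster center vector;
    reducing each sample residual vector to a plurality of sample sub-vectors in a plurality of subspaces, the plurality of sample sub-vectors corresponding one-to-one to the plurality of subspaces;
    clustering the sample sub-vectors in the same subspace to obtain a plurality of second clusters in each subspace and the second cluster center vector corresponding to each second cluster;
    encoding each second cluster to obtain the sub-code corresponding to each second cluster;
    establishing the index relationship between the sub-codes corresponding to each sample residual vector and the second cluster center vectors, to obtain the second index, wherein the plurality of sub-codes corresponding to each sample residual vector constitute the code of that sample residual vector.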
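  9. The method according to any one of claims 1-8, characterized in that the first set distance condition comprises:
    being the first cluster center vector with the smallest distance to the query vector; or
    being a first cluster center vector whose distance to the query vector is less than a first distance threshold.
  10. The method according to any one of claims 1-9, characterized in that the second set distance condition comprises:
    being the sample vector with the smallest distance to the query vector; or
    being a sample vector whose distance to the query vector is less than a second distance threshold.
  11. The method according to any one of claims 1-10, characterized in that, before the obtaining, according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition, as the target vector, the method further comprises:
    determining whether a historical query result corresponding to the query vector exists;
    if no historical query result corresponding to the query vector exists, performing the obtaining, according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition, as the target vector.
  12. The method according to claim 11, characterized in that the method further comprises:
    if a historical query result corresponding to the query vector exists, taking the sample vector in the historical query result as the query result.
  13. The method according to any one of claims 1-12, characterized in that the plurality of sample vectors are stored in a first database, and the method further comprises:
    taking the query result as a first query result;
    obtaining a second query result according to the first index and the second index built from the sample vectors in a second database;
    merging the first query result and the second query result to obtain a third query result, as the query result corresponding to the query vector.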
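  14. The method according to any one of claims 1-13, characterized in that the obtaining a query vector comprises:
    obtaining a service query request;
    determining whether the service query request carries a vector;
    if a vector is carried, taking the vector as the query vector;
    if no vector is carried, generating a query vector corresponding to the service query request.
  15. The method according to any one of claims 1-14, characterized in that the obtaining, according to the pre-established first index, the first cluster center vector whose distance to the query vector satisfies the first set distance condition comprises:
    according to the first index, obtaining the first cluster center vector corresponding to each first cluster;
    computing the distance between the query vector and each first cluster center vector;
    according to the distance between the query vector and each first cluster center vector, obtaining the first cluster center vector whose distance to the query vector satisfies the first set distance condition.
  16. The method according to any one of claims 1-15, characterized in that the obtaining the residual vector between the query vector and the target vector comprises:
    subtracting the target vector from the query vector to obtain the residual vector between the query vector and the target vector.
  17. The method according to any one of claims 1-16, characterized in that the method further comprises:
    at preset intervals, updating the first index and the second index according to newly acquired sample vectors.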
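  18. A vector query apparatus, characterized in that the apparatus comprises a vector acquisition module, a first determination module, a residual acquisition module, a second determination module, and a vector determination module, wherein
    the vector acquisition module is configured to obtain a query vector;
    the first determination module is configured to obtain, according to a pre-established first index, the first cluster center vector whose distance to the query vector satisfies a first set distance condition, as a target vector, the first index comprising a plurality of first clusters obtained by clustering sample vectors and the first cluster center vector corresponding to each first cluster, each first cluster comprising a plurality of sample vectors;
    the residual acquisition module is configured to obtain the residual vector between the query vector and the target vector, as a query residual vector;
    the second determination module is configured to obtain, according to a pre-established second index, the code corresponding to each of the plurality of sample vectors, the second index comprising the code of each sample residual vector obtained by applying product quantization to the sample residual vector of each sample vector, the sample residual vector being the residual vector between the sample vector and the target vector;
    the vector determination module is configured to obtain, according to the query residual vector and the code of each sample residual vector, the sample vector whose distance to the query vector satisfies a second set distance condition from the plurality of sample vectors, as a query result.
  19. An electronic device, characterized by comprising:
    one or more processors;
    a memory;
    one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs are configured to perform the method according to any one of claims 1-17.
  20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores program code, and the program code can be called by a processor to perform the method according to any one of claims 1-17.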
PCT/CN2019/114795 2019-10-31 2019-10-31 Vector query method, apparatus, electronic device, and storage medium WO2021081913A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
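CN201980099370.6A CN114245896A (zh) 2019-10-31 2019-10-31 Vector query method, apparatus, electronic device, and storage medium
PCT/CN2019/114795 WO2021081913A1 (zh) 2019-10-31 2019-10-31 Vector query method, apparatus, electronic device, and storage medium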

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/114795 WO2021081913A1 (zh) 2019-10-31 2019-10-31 Vector query method, apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021081913A1 true WO2021081913A1 (zh) 2021-05-06

Family

ID=75714814

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114795 WO2021081913A1 (zh) 2019-10-31 2019-10-31 Vector query method, apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114245896A (zh)
WO (1) WO2021081913A1 (zh)

Also Published As

Publication number Publication date
CN114245896A (zh) 2022-03-25

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19950674

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19950674

Country of ref document: EP

Kind code of ref document: A1