WO2023030184A1 - Data retrieval method and related device - Google Patents

Data retrieval method and related device

Info

Publication number
WO2023030184A1
Authority
WO
WIPO (PCT)
Prior art keywords
vector
vectors
index node
target
retrieval
Prior art date
Application number
PCT/CN2022/115091
Other languages
French (fr)
Chinese (zh)
Inventor
金钊
刘文杰
李会峰
聂光耀
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Publication of WO2023030184A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/907 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]

Definitions

  • This application relates to the field of computers, and in particular to a data retrieval method and a related device.
  • The amount of data (including, for example, image data, video data, audio data, and text data) stored in digital information repositories such as Internet and cloud-based databases is growing dramatically. Processing search queries on unstructured data in an accurate and resource-efficient manner is a technical challenge.
  • Similarity search is a data search method that finds unstructured data objects by comparing the similarity between a search object and the data objects in a search database.
  • A similarity search typically involves creating metadata for each data object stored in the database, creating metadata for the search object, and then comparing the search object's metadata with the data objects' metadata.
  • The metadata for each data object may be in the form of a feature vector, that is, a multi-dimensional numerical vector representing the data object.
  • Similarity search can be defined as finding the feature vector most similar to a given feature vector (e.g., a query vector) from among multiple feature vectors stored in a database.
  • Similarity searching generally involves converting a search object (e.g., an image, video sample, audio sample, or text) into a first vector representing the search object using a feature extraction algorithm.
  • The first vector is then used to search a database of feature vectors (e.g., the database may include a plurality of third vectors) to locate one or more third vectors most similar to the first vector.
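The definition of similarity search above can be illustrated with a minimal brute-force sketch. Cosine similarity is chosen here only as one possible similarity measure; the function names, vector identifiers, and sample data are hypothetical and do not come from the application.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(first_vector, third_vectors, k=1):
    """Return the ids of the k stored vectors most similar to the query."""
    ranked = sorted(
        third_vectors.items(),
        key=lambda item: cosine_similarity(first_vector, item[1]),
        reverse=True,
    )
    return [vector_id for vector_id, _ in ranked[:k]]

# Hypothetical database of third vectors (feature vectors of data objects).
database = {"img_a": [1.0, 0.0], "img_b": [0.0, 1.0], "img_c": [0.9, 0.1]}
result = most_similar([1.0, 0.05], database, k=2)
```

A scan like this compares the query with every stored vector, which is the cost that sharding and routing aim to reduce.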
  • the retrieval system may include routing nodes and index nodes (index nodes may also be called shards), and the third vectors are hashed using locality-sensitive hashing (LSH), so that each third vector is mapped to a corresponding index node.
  • the routing node distributes the first vector to every index node. Each index node independently accesses the approximate nearest neighbor (ANN) index it contains, computes the set of results in that index most similar to the first vector, and returns them to the routing node. The routing node collects the results returned by all index nodes and merges them.
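The fan-out behaviour described above, where every index node receives the query and the routing node merges the partial results, might be sketched as follows. The dot product stands in for the similarity measure and a plain dictionary stands in for each node's ANN index; both are simplifying assumptions, and all names are hypothetical.

```python
import heapq

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_shard(first_vector, shard, k):
    """Local top-k on one index node, as a list of (score, vector_id)."""
    return heapq.nlargest(
        k, ((dot(first_vector, vec), vec_id) for vec_id, vec in shard.items())
    )

def scatter_gather(first_vector, shards, k):
    """Send the query to every shard, then merge the partial results."""
    partial = []
    for shard in shards:  # fan-out: every index node does work for every query
        partial.extend(search_shard(first_vector, shard, k))
    merged = heapq.nlargest(k, partial)
    return [vec_id for _, vec_id in merged], len(shards)

# Three hypothetical index nodes (shards).
shards = [
    {"v1": [1.0, 0.0], "v2": [0.0, 1.0]},
    {"v3": [0.8, 0.2], "v4": [0.1, 0.9]},
    {"v5": [0.5, 0.5]},
]
top, nodes_visited = scatter_gather([1.0, 0.0], shards, k=2)
```

In this sketch the number of nodes visited always equals the number of shards, so adding index nodes does not reduce the work done per query.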
  • the present application provides a data retrieval system, the data retrieval system includes a routing node and a plurality of index nodes, wherein,
  • the routing node is used to obtain the first vector
  • feature extraction can be performed on the retrieval object to obtain the first vector; for example, the retrieval object can be video data, audio data, image data, text data, or other unstructured data.
  • image data may be represented by an original feature vector obtained from a color histogram of the original image data
  • video data may be represented by an original feature vector obtained from the original video data using the scale-invariant feature transform (SIFT) or 3D-SIFT, or obtained from a discriminative video descriptor (DVD).
  • the objects involved in the embodiments of the present application may include data objects and search objects
  • the search objects may be the objects that the user needs to query
  • the data objects may be pre-stored objects in the search database
  • the first vector may be the above-mentioned original feature vector or a vector obtained by performing other processing (such as normalization) on the original feature vector, which is not limited here;
  • the routing information may be stored in the form of a data table, where the routing information may include the identifier of each second vector (the identifier may directly or indirectly point to the storage address of the corresponding second vector) and the index node corresponding to each second vector (for example, the identifier of the index node, which may directly or indirectly point to the address of the corresponding index node).
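One possible in-memory shape for such a routing table is sketched below. The field names, identifiers, and the lookup helper are illustrative assumptions; the application does not prescribe a concrete layout.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RoutingEntry:
    second_vector_id: str  # identifier that points at the stored second vector
    second_vector: tuple   # the representative vector itself (hypothetical inline copy)
    index_node_id: str     # identifier that points at the corresponding index node

# Hypothetical routing information: two second vectors map to the same node.
routing_information = [
    RoutingEntry("c0", (1.0, 0.0), "node-A"),
    RoutingEntry("c1", (0.0, 1.0), "node-B"),
    RoutingEntry("c2", (0.7, 0.7), "node-A"),
]

def node_for(second_vector_id):
    """Look up the index node that corresponds to a second vector."""
    for entry in routing_information:
        if entry.second_vector_id == second_vector_id:
            return entry.index_node_id
    raise KeyError(second_vector_id)
```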
  • the routing node obtains a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information includes a plurality of second vectors and an index node corresponding to each second vector, each of the second vectors is used to represent one or more third vectors stored on the corresponding index node, the third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
  • each of the second vectors is used to represent one or more third vectors stored on the corresponding index node. It can be understood that each second vector may represent the characteristics of some or all of the third vectors stored on the corresponding index node, that is, the second vector can be used as a feature representation of some or all of those third vectors (the similarity between the second vector and those third vectors is high, e.g., they all belong to the same cluster).
  • a target vector whose vector similarity with the first vector is greater than a threshold may be determined from the multiple second vectors in the routing information, where the target vector may be the second vector that is most similar, or relatively similar, to the first vector among the multiple second vectors. The target vector can be determined based on, for example, the first index (coarse quantizer index) constructed above.
  • the number of target vectors can be one or more. When the number is 1, the target vector can be the second vector most similar to the first vector among the multiple second vectors; when the number is greater than 1, the target vectors can be the second vectors with the highest similarity to the first vector among the multiple second vectors;
  • the number of target vectors may be a preset number. The more target vectors there are, the higher the accuracy of the final retrieval result may be, but the more index nodes the determined target vectors may be distributed over, which in turn increases the amount of concurrent data access and may increase the delay of the retrieval process. In practical applications, a balanced value can be selected based on retrieval accuracy and delay requirements;
  • the threshold here may be related to the number of target vectors and the structure of the first index. The larger the number of target vectors, the lower the threshold; the better the performance of the first index (for example, the better it can identify second vectors highly similar to the first vector), the higher the threshold;
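The selection of target vectors can be sketched as a filter on similarity followed by a cap on the number of targets. This is a minimal illustration under stated assumptions: cosine similarity as the measure, a linear scan instead of a coarse quantizer index, and hypothetical names (`select_target_index_nodes`, `max_targets`) and data.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_target_index_nodes(first_vector, routing_information, threshold, max_targets):
    """Pick up to max_targets second vectors whose similarity to the query
    exceeds the threshold, and return them with their index nodes."""
    scored = [
        (cosine(first_vector, vec), vec_id, node_id)
        for vec_id, vec, node_id in routing_information
    ]
    hits = sorted((s for s in scored if s[0] > threshold), reverse=True)[:max_targets]
    target_vectors = [vec_id for _, vec_id, _ in hits]
    target_nodes = sorted({node_id for _, _, node_id in hits})
    return target_vectors, target_nodes

# Hypothetical routing information: (second vector id, second vector, index node id).
routing_information = [
    ("c0", (1.0, 0.0), "node-A"),
    ("c1", (0.0, 1.0), "node-B"),
    ("c2", (0.7, 0.7), "node-C"),
]
targets, nodes = select_target_index_nodes(
    (0.9, 0.3), routing_information, threshold=0.5, max_targets=2
)
```

Raising `max_targets` from 1 to 2 in this example widens the search from one index node to two, which mirrors the accuracy versus delay trade-off described above.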
  • the routing node passes the first vector to the target index node
  • the target index node is configured to determine the retrieval result of the first vector from multiple third vectors stored by itself.
  • the target index node determines a retrieval result from the plurality of third vectors based on a similarity comparison between the first vector and the plurality of stored third vectors.
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidate vectors it stores, so that each retrieval only needs to visit one or a small number of index nodes.
  • This improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
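The flow described above can be put together in a short end-to-end sketch: the routing node picks the index node whose second vector is most similar to the query, and only that node searches its own third vectors. The class names, the dot-product similarity, and the sample data are assumptions made for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class IndexNode:
    def __init__(self, third_vectors):
        self.third_vectors = third_vectors  # {vector_id: vector}
        self.queries_served = 0

    def search(self, first_vector, k):
        """Rank only the vectors stored on this node."""
        self.queries_served += 1
        ranked = sorted(
            self.third_vectors,
            key=lambda vid: dot(first_vector, self.third_vectors[vid]),
            reverse=True,
        )
        return ranked[:k]

class RoutingNode:
    def __init__(self, routing_information, index_nodes):
        self.routing_information = routing_information  # [(second_vector, node_id)]
        self.index_nodes = index_nodes                  # {node_id: IndexNode}

    def retrieve(self, first_vector, k):
        """Route the query to the single most similar second vector's node."""
        _, node_id = max(
            self.routing_information, key=lambda entry: dot(first_vector, entry[0])
        )
        return node_id, self.index_nodes[node_id].search(first_vector, k)

index_nodes = {
    "node-A": IndexNode({"a1": (1.0, 0.1), "a2": (0.9, 0.0)}),
    "node-B": IndexNode({"b1": (0.0, 1.0), "b2": (0.1, 0.9)}),
}
router = RoutingNode(
    [((0.95, 0.05), "node-A"), ((0.05, 0.95), "node-B")], index_nodes
)
visited, result = router.retrieve((1.0, 0.0), k=1)
```

Only `node-A` serves this query; `node-B` stays idle and remains free to serve other queries concurrently.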
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • the third vectors within each cluster have higher similarity to one another, and third vectors in different clusters have lower similarity.
  • the target vector may be one of the multiple third vectors (for example, it may be the centroid of a cluster obtained after clustering multiple third vectors).
  • the target index node can store multiple vectors for data retrieval, and the proportion of the plurality of third vectors among those multiple vectors is greater than a target ratio.
  • the target ratio may be a value close to 1, such as 80%, 90%, 95%.
  • the vector similarity between the first vector and each of M target vectors among the plurality of second vectors is greater than a threshold, the M target vectors correspond to the target index node in the routing information, and M is a positive integer greater than 1.
  • multiple target vectors with high similarity to the first vector can be determined in the routing information, which increases the number of third vectors used in the subsequent retrieval for the first vector and thereby increases the precision of the search results.
  • each second vector corresponds to a cluster
  • each second vector is a vector corresponding to a cluster center of the cluster.
  • the original vector set can be clustered to divide the original feature vectors into similarity subspaces. Clustering divides a data set into different classes or clusters according to a certain standard (such as distance), so that the similarity of data objects in the same cluster is as large as possible and the difference between data objects in different clusters is also as large as possible. That is, after clustering, data of the same class are gathered together as much as possible, and data of different classes are separated as much as possible.
  • Clustering methods may include, but are not limited to, K-means, mean shift clustering, density-based clustering, agglomerative hierarchical clustering, graph community detection, Gaussian mixture model k-means (GMM k-means), and other clustering methods. The cluster centroid set of the global vector space can be obtained through clustering.
  • Each centroid vector in the cluster centroid set corresponds to the center of a similarity subspace and represents one vector subspace of the global vector space.
  • The centroid vector of each clustering category can be used as the second vector in the routing information; alternatively, a vector located near the centroid vector in the corresponding clustering category, or a vector similar to the centroid vector, can be used as the second vector. As long as the second vector can represent the corresponding clustering category, how it is selected is not limited.
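As an illustration of obtaining centroids that could serve as second vectors, the following is a minimal sketch of K-means (Lloyd's algorithm) with squared Euclidean distance. The fixed initial centroids and sample vectors are assumptions chosen to keep the example deterministic; the application permits other clustering methods and other ways of choosing the second vector.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(column) / n for column in zip(*vectors))

def k_means(vectors, initial_centroids, iterations=10):
    """Alternate between assigning vectors to the nearest centroid and
    recomputing each centroid as the mean of its assigned vectors."""
    centroids = list(initial_centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(v, centroids[i]),
            )
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical third vectors forming two well-separated groups.
third_vectors = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
second_vectors, clusters = k_means(
    third_vectors, initial_centroids=[(0.0, 0.0), (10.0, 10.0)]
)
```

Each resulting centroid summarizes one cluster, which is the role the second vector plays in the routing information.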
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vectors on the index nodes are also deployed based on clusters, that is, based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node), so the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidate vectors it stores. Each retrieval therefore only needs to visit one or a small number of index nodes to obtain accurate retrieval results, which improves similarity retrieval performance at large vector scale.
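Cluster-based deployment might be sketched as follows: each cluster is placed on one index node as a whole, and the routing information records which node holds it. The round-robin assignment is purely an assumption for illustration, as are the function name and sample data; the application does not specify how clusters are assigned to nodes.

```python
def deploy_clusters(clusters, node_ids):
    """Place whole clusters on index nodes and build routing information.

    clusters: {centroid: [third vectors in that cluster]}
    Returns (routing_information, node_storage).
    """
    routing_information = []  # [(second_vector, index_node_id)]
    node_storage = {node_id: [] for node_id in node_ids}
    for i, (centroid, members) in enumerate(clusters.items()):
        node_id = node_ids[i % len(node_ids)]  # hypothetical round-robin placement
        routing_information.append((centroid, node_id))
        node_storage[node_id].extend(members)  # a cluster is never split across nodes
    return routing_information, node_storage

# Hypothetical clusters keyed by their centroid (the second vector).
clusters = {
    (0.0, 0.5): [(0.0, 0.0), (0.0, 1.0)],
    (10.0, 10.5): [(10.0, 10.0), (10.0, 11.0)],
    (5.0, 5.0): [(5.0, 5.0)],
}
routing_information, node_storage = deploy_clusters(clusters, ["node-A", "node-B"])
```

Because a cluster is stored on a single node, the vectors most similar to a query that matches that cluster's centroid can be found without visiting other nodes.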
  • the target index node includes one or more index nodes, and the number of index nodes included in the target index node is less than a preset number.
  • the preset number may be 2, 3, 4, 5 and so on.
  • the retrieval result is a subset of the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • the routing node is further configured to transfer the target vector to the target index node; the target index node is specifically configured to: determine, based on the target vector and a first mapping relationship, the one or more third vectors from the plurality of third vectors stored by itself, the first mapping relationship indicating the mapping relationship between the target vector and one or more third vectors in the plurality of third vectors; and determine the retrieval result of the first vector from the one or more third vectors.
  • each index node may store one or more second vectors corresponding to the third vectors stored by itself, and the one or more second vectors may be used to represent the third vectors stored by the index node itself.
  • the index node can determine one or more third vectors corresponding to the target vector based on the target vector (that is, a second vector), where the target vector can be used as a representation of the one or more third vectors, and then the index node may determine the retrieval result from the one or more third vectors based on the first vector.
  • because the target vector can be used as a representation of the one or more third vectors, and the similarity between the first vector and the target vector is greater than the threshold, determining the retrieval result from only these third vectors speeds up the retrieval process while still meeting the retrieval accuracy requirement.
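The use of the first mapping relationship on the index node can be sketched as a lookup from the received target vector to its associated third vectors, followed by a search restricted to those vectors. The dictionary-based mapping, the dot-product similarity, and all identifiers below are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class IndexNode:
    def __init__(self, third_vectors, first_mapping):
        self.third_vectors = third_vectors  # {vector_id: vector}
        self.first_mapping = first_mapping  # {target_vector_id: [vector_id, ...]}

    def search(self, first_vector, target_vector_id, k):
        """Compare the query only with the third vectors mapped to the target vector."""
        candidates = self.first_mapping[target_vector_id]
        ranked = sorted(
            candidates,
            key=lambda vid: dot(first_vector, self.third_vectors[vid]),
            reverse=True,
        )
        return ranked[:k], len(candidates)

node = IndexNode(
    third_vectors={
        "t1": (1.0, 0.0), "t2": (0.9, 0.1),
        "t3": (0.0, 1.0), "t4": (0.1, 0.9),
    },
    first_mapping={"c0": ["t1", "t2"], "c1": ["t3", "t4"]},
)
# The routing node passes both the first vector and the target vector id "c0".
result, compared = node.search((0.8, 0.2), "c0", k=1)
```

Here only two of the four stored vectors are compared with the query, which is the source of the speed-up described above.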
  • the target index node is specifically configured to: determine a subset of the third vectors from the plurality of third vectors stored by itself according to the first vector, where the third vectors in the subset correspond to the same cluster, and the vector similarity between the cluster center of the cluster and the first vector is greater than a threshold; and determine the retrieval result of the first vector from the subset of the third vectors.
  • each index node may store one or more fourth vectors corresponding to the third vectors stored by itself, and the one or more fourth vectors may be used to represent the third vectors stored by the index node itself.
  • the index node can determine the fourth vector with the highest similarity to the first vector from the one or more fourth vectors and obtain the one or more third vectors corresponding to the determined fourth vector, where the fourth vector can be used as a representation of the one or more third vectors, and then the index node can determine the retrieval result from the one or more third vectors based on the first vector.
  • because the fourth vector can be used as a representation of the one or more third vectors, and the similarity between the fourth vector and the first vector is greater than the threshold, determining the retrieval result from only these third vectors speeds up the retrieval process while still meeting the retrieval precision requirement.
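In this variant the index node itself chooses which cluster to search, using the fourth vectors it stores. A minimal sketch, assuming dot-product similarity and hypothetical identifiers and data:

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def search_with_fourth_vectors(first_vector, fourth_vectors, members, third_vectors, k):
    """Pick the most similar fourth vector, then search only its third vectors.

    fourth_vectors: {fourth_vector_id: vector}
    members:        {fourth_vector_id: [third_vector_id, ...]}
    third_vectors:  {third_vector_id: vector}
    """
    best = max(fourth_vectors, key=lambda fid: dot(first_vector, fourth_vectors[fid]))
    ranked = sorted(
        members[best],
        key=lambda vid: dot(first_vector, third_vectors[vid]),
        reverse=True,
    )
    return best, ranked[:k]

fourth_vectors = {"f0": (0.95, 0.05), "f1": (0.05, 0.95)}
members = {"f0": ["t1", "t2"], "f1": ["t3", "t4"]}
third_vectors = {
    "t1": (1.0, 0.0), "t2": (0.9, 0.1),
    "t3": (0.0, 1.0), "t4": (0.1, 0.9),
}
chosen, result = search_with_fourth_vectors(
    (0.2, 0.8), fourth_vectors, members, third_vectors, k=2
)
```

Unlike the mapping-based variant, the routing node does not need to pass the target vector along; the index node derives the candidate set from the first vector alone.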
  • the present application provides a data retrieval method, the method comprising:
  • the routing information includes a plurality of second vectors and an index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, the third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidate vectors it stores, so that each retrieval only needs to visit one or a small number of index nodes.
  • This improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • each second vector corresponds to a cluster
  • each second vector is a vector corresponding to a cluster center of the cluster.
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vector stored in each index node is a vector included in one or more clusters.
  • the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
  • the target index node includes one or more index nodes.
  • the retrieval result is a subset of the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • the present application provides a data retrieval device, the device is applied to a computer system, and the computer system includes a routing node and a plurality of index nodes, and the routing node includes:
  • An acquisition module configured to acquire the first vector
  • a routing module configured to obtain a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information includes a plurality of second vectors and an index node corresponding to each of the second vectors, each of the second vectors is used to represent one or more third vectors stored on the corresponding index node, the third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
  • a sending module configured to transfer the first vector to the target index node
  • the target index node includes:
  • the retrieval module is configured to determine the retrieval result of the first vector from a plurality of third vectors stored in itself.
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidate vectors it stores, so that each retrieval only needs to visit one or a small number of index nodes.
  • This improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • each second vector corresponds to a cluster
  • each second vector is a vector corresponding to a cluster center of the cluster.
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vector stored in each index node is a vector included in one or more clusters.
  • the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
  • the target index node includes one or more index nodes.
  • the retrieval result is a subset of the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • the sending module is further configured to transfer the target vector to the target index node;
  • the retrieval module is specifically configured to determine the one or more third vectors from the plurality of third vectors stored by itself based on the target vector and a first mapping relationship, the first mapping relationship indicating a mapping relationship between the target vector and one or more third vectors in the plurality of third vectors;
  • a retrieval result for the first vector is determined from the one or more third vectors.
  • the retrieval module is specifically configured to determine a subset of the third vectors from the plurality of third vectors stored by itself according to the first vector, where the third vectors in the subset correspond to the same cluster, and the vector similarity between the cluster center of the cluster and the first vector is greater than a threshold;
  • a retrieval result of the first vector is determined from the subset of the third vectors.
  • the present application provides a data retrieval device, the device comprising:
  • An acquisition module configured to acquire the first vector
  • a routing module configured to obtain a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information includes a plurality of second vectors and an index node corresponding to each of the second vectors, each of the second vectors is used to represent one or more third vectors stored on the corresponding index node, the third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
  • a sending module configured to deliver the first vector to the target index node, where the first vector is used to instruct the target index node to determine a retrieval result of the first vector from a plurality of third vectors stored by itself .
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidate vectors it stores, so that each retrieval only needs to visit one or a small number of index nodes.
  • This improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • each second vector corresponds to a cluster
  • each second vector is a vector corresponding to a cluster center of the cluster.
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vector stored in each index node is a vector included in one or more clusters.
  • the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
  • the target index node includes one or more index nodes.
  • the retrieval result is a subset of the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • the present application provides a computer system, including a first index node and a second index node, the first index node includes a first memory and a first processor, and the second index node includes a second memory and a second processor, where,
  • the first memory is configured to store a plurality of first vectors for data retrieval, the similarity between different first vectors in the plurality of first vectors is greater than a threshold, and each first vector is a representation of a data object;
  • the second memory is configured to store a plurality of second vectors for data retrieval, the similarity between different second vectors in the plurality of second vectors is greater than the threshold, and the second vectors are a representation of a data object, and the vector similarity between the plurality of first vectors and the plurality of second vectors is less than the threshold;
  • the first processor configured to perform data retrieval based on the plurality of first vectors
  • the second processor is configured to perform data retrieval based on the plurality of second vectors.
  • the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node), so the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidate vectors it stores. Each retrieval therefore only needs to visit one or a small number of index nodes to obtain accurate retrieval results, which improves similarity retrieval performance at large vector scale. The concurrency performance of cluster retrieval can grow linearly with the size of the cluster, effectively addressing the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
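The storage condition stated for the two index nodes (similarity within each node's vectors above the threshold, similarity across the two nodes below it) can be expressed as a small check. Cosine similarity, the function name, and the sample vectors are assumptions made for illustration only.

```python
import math
from itertools import combinations, product

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def satisfies_placement(first_vectors, second_vectors, threshold):
    """True if vectors on the same node are pairwise more similar than the
    threshold and vectors on different nodes are less similar than it."""
    within_first = all(
        cosine(a, b) > threshold for a, b in combinations(first_vectors, 2)
    )
    within_second = all(
        cosine(a, b) > threshold for a, b in combinations(second_vectors, 2)
    )
    across = all(
        cosine(a, b) < threshold for a, b in product(first_vectors, second_vectors)
    )
    return within_first and within_second and across

# Hypothetical contents of the first memory and the second memory.
first_memory = [(1.0, 0.0), (0.9, 0.1)]
second_memory = [(0.0, 1.0), (0.1, 0.9)]
ok = satisfies_placement(first_memory, second_memory, threshold=0.8)
```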
  • the multiple first vectors are vectors included in one or more first clusters
  • the multiple second vectors are vectors included in one or more second clusters
  • the first cluster is different from the second cluster.
  • the first cluster and the second cluster are clusters.
  • the first processor is configured to determine a retrieval result from the plurality of first vectors based on a third vector, where the third vector is a representation of the first retrieval object;
  • the second processor is configured to determine a retrieval result from the plurality of second vectors based on a fourth vector, where the fourth vector is a representation of a second retrieval object.
  • both the first index node and the second index node are communicatively connected to a routing node, and both the third vector and the fourth vector are sent from the routing node.
  • the first retrieval object and the second retrieval object include one or more of text data, audio data, image data, or video data.
  • the embodiment of the present application provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is run on a computer, the computer is caused to execute the above-mentioned first aspect and any optional method thereof.
  • the embodiment of the present application provides a computer program, which, when running on a computer, enables the computer to execute the above-mentioned first aspect and any optional method thereof, the second aspect and any optional method thereof, and the third aspect and any optional method thereof.
  • the present application provides a chip system, which includes a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data or information involved in the above methods.
  • the chip system further includes a memory, and the memory is used for storing necessary program instructions and data of the execution device or the training device.
  • the system-on-a-chip may consist of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic diagram of an application architecture of an embodiment of the present application
  • FIG. 2 is a schematic diagram of a retrieval scenario in an embodiment of the present application
  • FIG. 3 is a schematic diagram of an application architecture of an embodiment of the present application.
  • FIG. 4 is a schematic flow chart of a data retrieval method in an embodiment of the present application.
  • FIG. 5 is a schematic flow chart of a data retrieval method in an embodiment of the present application.
  • FIG. 6 is a schematic flow chart of a data retrieval method according to an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a data retrieval device according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a data retrieval device according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a data retrieval device according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a server according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a chip according to an embodiment of the present application.
  • the embodiment of the present application can be applied to an indexing system for data retrieval. Next, the application scenario of the embodiment of the present application will be introduced first.
  • FIG. 1 shows a schematic architecture of an indexing system 100 .
  • Indexing system 100 may include client system 130 , computer system 160 , and third party system 170 connected to each other through network 110 .
  • although FIG. 1 shows a schematic arrangement (connection relationship) of client system 130, computer system 160, third-party system 170, and network 110, the embodiment of the present application does not limit the arrangement of client system 130, computer system 160, third-party system 170, and network 110.
  • two or more of client system 130 , computer system 160 , and third party system 170 may bypass network 110 and connect directly to each other.
  • two or more of client system 130, computer system 160, and third party system 170 may be physically or logically co-located with one another in whole or in part.
  • although FIG. 1 shows a specific number of client systems 130, computer systems 160, third-party systems 170, and networks 110, the embodiments of the present application do not limit the number of client systems 130, computer systems 160, third-party systems 170, and networks 110.
  • indexing system 100 may include multiple client systems 130 , computer systems 160 , third party systems 170 and network 110 .
  • the present application may include any suitable network 110 .
  • network 110 may include an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), part of the Internet, part of the public switched telephone network (PSTN), a cellular telephone network, or a combination of two or more of these.
  • Network 110 may include one or more networks 110 .
  • link 150 may connect client system 130, computer system 160, and third party system 170 to communication network 110 or to each other.
  • the application may include any suitable link 150 .
  • one or more links 150 include one or more wired links (such as digital subscriber line (DSL) or data over cable service interface specification (DOCSIS)), wireless links (such as Wi-Fi or worldwide interoperability for microwave access (WiMAX)), or optical links (such as synchronous optical networking (SONET) or synchronous digital hierarchy (SDH)).
  • one or more links 150 each include an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, a portion of the Internet, a portion of the PSTN, a cellular technology-based network, a network based on satellite communication technology, another link 150, or a combination of two or more such links 150.
  • Link 150 need not be the same throughout indexing system 100 .
  • One or more first links 150 may differ from one or more second links 150 in one or more respects.
  • client system 130 may be an electronic device that includes hardware, software, or embedded logic components, or a combination of two or more such components, and that is capable of performing appropriate functions.
  • client system 130 may include a computer system such as a desktop computer, notebook or laptop computer, netbook, tablet computer, e-book reader, GPS device, camera, personal digital assistant (PDA), handheld electronic device, cellular phone, smart phone, other suitable electronic device, or any suitable combination thereof.
  • the application may include any suitable client system 130 .
  • Client system 130 may enable network users at client system 130 to access network 110 .
  • Client system 130 may enable its users to communicate with other users at other client systems 130 .
  • client system 130 may include a web browser.
  • a user at client system 130 may enter a uniform resource locator (URL) or other address that directs the web browser to a particular server (e.g., server 162 or a server associated with third-party system 170), and the web browser can generate a hypertext transfer protocol (HTTP) request and pass the HTTP request to the server.
  • the server may accept HTTP requests and deliver one or more hypertext markup language (HTML) documents to client system 130 in response to the HTTP requests.
  • the client system 130 can render a web interface (such as a webpage) based on the HTML file from the server for presentation to the user (for example, refer to FIG. 2 ).
  • This disclosure contemplates any suitable source files.
  • the web interface may be rendered from HTML files, Extensible Hypertext Markup Language (XHTML) files, or Extensible Markup Language (XML) files according to specific needs.
  • the user can enter a uniform resource locator (URL) or another address related to data retrieval that directs the web browser to a specific server (such as server 162 or a server associated with third-party system 170). The user can then input the data to be retrieved in the web browser, and the web browser can generate an HTTP request containing the data to be retrieved and transmit the HTTP request to the server.
  • the server can accept the HTTP request, obtain the retrieval result based on the data retrieval method provided in the embodiment of the present application, and transmit one or more HTML files to the client system 130 in response to the HTTP request; the retrieval result can be included in the HTML file.
  • the client system 130 may render a web interface (eg, a web page) for presentation to the user based on the HTML file from the server (eg, present retrieval results to the user).
  • the client system 130 may include an application program (application, APP) for providing a data retrieval function.
  • Computer system 160 may be accessed by other components of indexing system 100 directly or via network 110 .
  • client system 130 may access computer system 160 directly or via network 110 using a web browser or an APP associated with computer system 160 .
  • computer system 160 may include one or more servers 162 .
  • Each server 162 may be a unitary server or a distributed server spanning multiple computers or multiple data centers.
  • each server 162 may include hardware, software, or embedded logic components, or a combination of two or more such components for performing the appropriate functionality implemented or supported by server 162 .
  • computer system 160 may include one or more data stores 164 . Data storage 164 may be used to store various types of information.
  • each data store 164 may be a relational database, a columnar database, or another suitable database.
  • although this disclosure describes or illustrates particular types of databases, this disclosure contemplates any suitable type of database.
  • Certain embodiments may provide interfaces that enable client systems 130 , computer systems 160 , or third-party systems 170 to manage, retrieve, modify, add, or delete information stored in data storage 164 .
  • computer system 160 may store one or more third vectors in one or more data stores 164 .
  • computer system 160 is capable of linking various entities.
  • computer system 160 may enable users to interact with each other and receive content from third-party systems 170 or other entities, or allow users to interact with such entities through application programming interfaces (APIs) or other communication channels.
  • third-party systems 170 may include third-party content object providers.
  • Third-party content object providers may include one or more sources of content objects that may be delivered to client system 130 .
  • a content object may include information about things or activities of interest to a user, such as movie showtimes, movie reviews, restaurant reviews, restaurant menus, product information and reviews, or other suitable information.
  • content objects may include incentive content objects (eg, coupons, discount coupons, gift certificates, or other suitable incentive objects).
  • the computer system 160 may include a routing node and multiple index nodes (for example, index node 1, index node 2, ..., index node N), where the routing node and each index node may be an independent server or a logical node on a server (such as a virtual machine) with data storage and data processing capabilities, which is not limited here.
  • the routing node can obtain the first vector, and based on certain rules, determine the index node that needs to perform data indexing operations from multiple index nodes.
  • each index node can include a subspace model and a sub-index. The subspace model may include multiple third vectors, and the index node may determine the index result from the subspace model based on the sub-index and the first vector, and feed the result back to the routing node.
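  • The division of labour between the routing node and the index nodes described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the class names `IndexNode` and `RoutingNode`, the dot product used as the similarity measure, and the brute-force scan that stands in for the sub-index are all hypothetical.

```python
from dataclasses import dataclass, field


def dot(a, b):
    """Dot product, used here as an assumed similarity measure."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class IndexNode:
    """Holds the third vectors of one subspace (hypothetical class)."""
    node_id: int
    third_vectors: dict = field(default_factory=dict)  # object id -> vector

    def search(self, first_vector, k=1):
        # Brute-force scan standing in for the sub-index.
        ranked = sorted(self.third_vectors.items(),
                        key=lambda item: dot(first_vector, item[1]),
                        reverse=True)
        return [obj_id for obj_id, _ in ranked[:k]]


@dataclass
class RoutingNode:
    """Maps representative (second) vectors to index nodes (hypothetical class)."""
    routes: list  # list of (second_vector, IndexNode)

    def retrieve(self, first_vector, k=1):
        # Pick the index node whose second vector is most similar to the query.
        _, target = max(self.routes, key=lambda r: dot(first_vector, r[0]))
        return target.node_id, target.search(first_vector, k)


node1 = IndexNode(1, {"a": [1.0, 0.0], "b": [0.9, 0.1]})
node2 = IndexNode(2, {"c": [0.0, 1.0], "d": [0.1, 0.9]})
router = RoutingNode([([1.0, 0.0], node1), ([0.0, 1.0], node2)])
result = router.retrieve([0.2, 0.8])  # only index node 2 is visited
```

  • In this sketch the query touches a single index node, which is the property the architecture relies on for scalability.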
  • FIG. 4 is a schematic flowchart of a data retrieval method provided by an embodiment of the present application.
  • the method can be applied to a computer system, and the computer system includes a routing node and a plurality of index nodes.
  • the method includes:
  • the routing node acquires a first vector.
  • the routing node may obtain the first vector used to represent the retrieval object.
  • the user can input a retrieval object (or data object) on the terminal side, where the retrieval object can be video data, audio data, image data, text data, or another unstructured data object. The terminal side can transfer the retrieval object to the computer system, and the computer system performs the mapping from the retrieval object to the first vector; alternatively, a computing unit other than the computer system (for example, the terminal side, or a computing unit connected between the terminal side and the computer system) performs the mapping from the retrieval object to the first vector.
  • the routing node in the computer system can obtain the first vector obtained through mapping.
  • feature extraction can be performed on the retrieval object to obtain the first vector; for example, the retrieval object can be video data, audio data, image data, text data, or another unstructured data object.
  • image data may be represented by an original feature vector obtained from a color histogram of the original image data, and video data may be represented by an original feature vector obtained from the scale-invariant feature transform (SIFT) or 3D-SIFT of the original video data, or from a discriminative video descriptor (DVD); such an original feature vector characterizes the first vector.
  • Many different feature vector formats are known for representing different kinds of data objects, any of which may be suitable for the feature extraction process.
  • the first vector may be the aforementioned original feature vector or a vector obtained by performing other processing (such as normalization) on the original feature vector, which is not limited here.
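  • As one illustration of the feature extraction and normalization mentioned above, the sketch below builds a coarse color histogram from image pixels and scales it to unit length. It is an assumption-laden example: the function names `color_histogram` and `l2_normalize`, the choice of two bins per channel, and L2 normalization as the "other processing" are all hypothetical choices, not details from the present application.

```python
import math


def color_histogram(pixels, bins_per_channel=2):
    """Build a joint RGB histogram from (r, g, b) tuples with values in 0..255."""
    step = 256 // bins_per_channel

    def bucket(value):
        # Clamp so that 255 never falls outside the last bin.
        return min(value // step, bins_per_channel - 1)

    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        idx = (bucket(r) * bins_per_channel + bucket(g)) * bins_per_channel + bucket(b)
        hist[idx] += 1.0
    return hist


def l2_normalize(vector):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


# Two reddish pixels and one blue pixel (made-up data).
pixels = [(255, 0, 0), (250, 10, 5), (0, 0, 255)]
original_feature_vector = color_histogram(pixels)
first_vector = l2_normalize(original_feature_vector)
```

  • Normalizing to unit length makes the dot product of two such vectors equal to their cosine similarity, which is convenient when cosine similarity is the chosen measure.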
  • the retrieval object (eg, text data) can be represented in a d-dimensional vector space, where d represents any suitable number of dimensions.
  • Retrieval objects (eg, text data) can be represented in a vector space as first vectors called term embeddings.
  • Text data may correspond to the coordinates of a particular point in vector space (ie, the end point of the vector).
  • a retrieval object (eg, text data) may be mapped to a first vector in a vector space by applying a function defined by a dictionary.
  • dictionaries trained to map text to vector representations may be utilized, or such dictionaries themselves may be generated by training.
  • a model such as Word2vec may be used to map a retrieved object (eg, text data) to a vector representation in a vector space.
  • a retrieval object (e.g., text data) may also be mapped to a vector representation by using a machine learning model (e.g., a neural network).
  • the machine learning model may have been trained using a sequence of training data (eg, corpora each including a retrieval object (eg, text data)).
  • retrieval objects can be represented in a vector space as vectors called feature vectors or object embeddings.
  • a retrieval object may be mapped to a vector based on one or more properties, attributes, or characteristics of the retrieval object, the relationship of the object to other objects, or any other suitable information associated with the object.
  • a function may map retrieved objects to vectors through feature extraction, which may start from an initial set of measurement data and establish derived values (eg, features).
  • objects, including videos or images may be mapped to vectors by using algorithms to detect or isolate various desired portions or shapes of objects.
  • the features used to compute the vectors can be based on information obtained from edge detection, corner detection, blob detection, ridge detection, scale-invariant feature transform, edge direction, changing intensity, autocorrelation, motion detection, optical flow, thresholding, blob extraction, template matching, Hough transform (e.g., line, circle, ellipse, arbitrary shape), or any other suitable information.
  • retrieval objects comprising audio data may be mapped to a vector based on features such as spectral slope, tonality coefficient, audio spectral centroid, audio spectral envelope, Mel frequency cepstrum, or any other suitable information.
  • the function can map the retrieved object to a vector using a transformed reduced feature set (eg, feature selection).
  • the function may map the retrieved object to the first vector based on one or more n-grams associated with the retrieved object.
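  • A mapping from n-grams to a vector can be sketched with the hashing trick. This is only one possible function and is not described in the present application: the names `char_ngrams` and `ngram_vector`, the use of character trigrams, the 64-dimensional output, and CRC32 as the hash are all assumptions made for illustration.

```python
import math
import zlib


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_vector(text, n=3, dim=64):
    """Hash each n-gram into one of `dim` buckets and L2-normalize the counts."""
    vec = [0.0] * dim
    for gram in char_ngrams(text.lower(), n):
        vec[zlib.crc32(gram.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))


query = ngram_vector("vector retrieval")
near = ngram_vector("vector retrieved")  # shares most trigrams with the query
```

  • Texts that share many n-grams land in many of the same buckets, so their vectors have a high cosine similarity; hash collisions make the mapping approximate.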
  • the routing node obtains a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information includes a plurality of second vectors and the index node corresponding to each of the second vectors, each of the second vectors is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information.
  • the routing node may also obtain routing information, where the routing information may include multiple second vectors and an index node corresponding to each second vector.
  • the routing node is used as a data transfer station. After obtaining the first vector, it needs to determine the information of the index node to which the first vector is to be sent (for example, an index node ID that clearly indicates which of the multiple index nodes is meant). In the embodiment of the present application, the process of determining the index node from the first vector can be realized through routing information.
  • feature extraction can be performed on multiple candidate data objects (such as unstructured data objects). During the feature extraction process, information can be extracted from the unstructured data objects included in the object database, and a corresponding original feature vector is generated for each candidate data object of the plurality of candidate data objects.
  • the unstructured data objects included in the object database may be one of video data, audio data, image data, text data, or other unstructured data objects.
  • each image object can be characterized by an original feature vector obtained from the color histogram of the original image data, and each video object can be characterized by an original feature vector obtained from the SIFT or 3D-SIFT of the original video data or from the discriminative video descriptor (DVD).
  • a number of different feature vector formats are known for representing different kinds of data objects, any of which are suitable for the feature extraction process to convert candidate data objects into their respective raw feature vectors.
  • each of the second vectors is used to represent one or more third vectors stored on the corresponding index node. It can be understood that each second vector may represent the characteristics of some or all of the third vectors stored on the corresponding index node; that is, the second vector can serve as a feature representation of some or all of the third vectors stored on the corresponding index node (the similarity between the second vector and those third vectors is high; for example, they all belong to the same cluster).
  • the clusters here may be obtained by clustering or by locality-sensitive hashing.
  • the second vector may be the cluster center.
  • the original feature vectors in the original vector set can be clustered by similarity to generate the similarity subspaces into which the set is divided. Clustering divides a data set into different classes or clusters according to a certain standard (such as distance), so that the similarity of data objects in the same cluster is as large as possible and the difference between data objects in different clusters is also as large as possible. That is to say, after clustering, data of the same class are gathered together as much as possible, and data of different classes are separated as much as possible.
  • Clustering methods may include, but are not limited to, K-means, mean shift clustering, density-based clustering, agglomerative hierarchical clustering, graph community detection, Gaussian mixture model K-means (GMM K-means), and other clustering methods. The cluster centroid set of the global vector space can be obtained through clustering.
  • Each centroid vector in the cluster centroid set corresponds to the center of a similarity subspace, representing a vector subspace of a global vector space.
  • the centroid vector of each clustering category can be used as the second vector in the routing information, or a vector located near the centroid vector in the corresponding clustering category can be used as the second vector in the routing information, or a vector similar to the centroid vector can be used as the second vector in the routing information. As long as the second vector can represent the corresponding clustering category, how the second vector is selected is not limited.
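  • Obtaining the cluster centroid set can be sketched with plain K-means (Lloyd's algorithm). This is a minimal illustration, assuming squared Euclidean distance and caller-supplied initial centroids; the function name `kmeans` and the sample data are hypothetical, and a production system would typically use a library implementation.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain Lloyd's K-means; returns (centroids, cluster label per vector)."""
    centroids = [list(c) for c in initial_centroids]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        assignment = [min(range(len(centroids)),
                          key=lambda j: squared_distance(v, centroids[j]))
                      for v in vectors]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return centroids, assignment


# Made-up third vectors forming two obvious groups.
third_vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
second_vectors, labels = kmeans(third_vectors,
                                initial_centroids=[[0.0, 0.0], [10.0, 10.0]])
```

  • Here the resulting centroids play the role of the second vectors, and each label identifies the similarity subspace a third vector belongs to.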
  • the clustering category of each second vector can be mapped to one or more index nodes; that is, each second vector has a mapping relationship with an index node. For example, the second vectors can include second vector 1 to second vector 100, and the index nodes can include index node 1 to index node 10; second vector 1 to second vector 10 can be mapped to index node 1, and so on, with second vector 71 to second vector 80 mapped to index node 8, second vector 81 to second vector 90 mapped to index node 9, and second vector 91 to second vector 100 mapped to index node 10.
  • secondary clustering can be performed, according to the number of index nodes, on the multiple second vectors obtained by clustering. The clustering methods can include but are not limited to K-means, mean shift clustering, density-based clustering, agglomerative hierarchical clustering, graph community detection, Gaussian mixture model K-means (GMM K-means), and other clustering methods; a second cluster centroid set can be obtained through this clustering.
  • the second vectors mapped to each index node may all belong to the same clustering category obtained after secondary clustering, and the second vectors mapped to different index nodes belong to different clustering categories obtained after secondary clustering.
  • other second vectors in the clustering category where a centroid vector is located can be mapped to other index nodes in addition to the index node to which the centroid vector is mapped (provided that the similarity between the second vector and the centroid vector in the second cluster centroid set corresponding to the other index node is very high). That is to say, a small number of the second vectors mapped to different index nodes may also belong to the same clustering category obtained after secondary clustering.
  • the second vectors may include second vector 1 to second vector 100, and the index nodes may include index node 1 to index node 10. After secondary clustering, second vector 1 to second vector 10 belong to the same clustering category, second vector 11 to second vector 20 belong to the same clustering category, second vector 21 to second vector 30 belong to the same clustering category, second vector 31 to second vector 40 belong to the same clustering category, second vector 41 to second vector 50 belong to the same clustering category, second vector 51 to second vector 60 belong to the same clustering category, second vector 61 to second vector 70 belong to the same clustering category, second vector 71 to second vector 80 belong to the same clustering category, second vector 81 to second vector 90 belong to the same clustering category, and second vector 91 to second vector 100 belong to the same clustering category.
  • the second vectors mapped to each index node may all belong to the same clustering category obtained after secondary clustering, and the second vectors mapped to different index nodes belong to different clustering categories obtained after secondary clustering. For example, second vector 1 to second vector 10 can be mapped to index node 1, second vector 11 to second vector 20 to index node 2, second vector 21 to second vector 30 to index node 3, second vector 31 to second vector 40 to index node 4, second vector 41 to second vector 50 to index node 5, second vector 51 to second vector 60 to index node 6, second vector 61 to second vector 70 to index node 7, second vector 71 to second vector 80 to index node 8, second vector 81 to second vector 90 to index node 9, and second vector 91 to second vector 100 to index node 10.
  • the second vectors mapped to each index node may all belong to the same clustering category obtained after secondary clustering, and a small number of the second vectors mapped to different index nodes may also belong to the same clustering category obtained after secondary clustering. For example, if the vector similarity between second vector 10 and the centroid vector of the clustering category to which second vector 11 to second vector 20 belong is very high, then second vector 1 to second vector 10 are mapped to index node 1, and second vector 10 together with second vector 11 to second vector 20 are mapped to index node 2. That is, second vector 10 is mapped to both index node 1 and index node 2.
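  • The example mapping of 100 second vectors onto 10 index nodes, with second vector 10 additionally mapped to index node 2, can be written out as a small table-building sketch. The function name `build_routing` and its `extra` parameter are hypothetical; the numbers follow the example in the text.

```python
from collections import defaultdict


def build_routing(groups, extra=()):
    """Build {second_vector_id: set of index_node_ids}.

    groups: {index_node_id: iterable of second-vector ids}, one secondary
            cluster per index node.
    extra:  (second_vector_id, index_node_id) pairs for boundary vectors that
            are additionally mapped to a neighbouring index node.
    """
    routing = defaultdict(set)
    for node_id, vector_ids in groups.items():
        for vid in vector_ids:
            routing[vid].add(node_id)
    for vid, node_id in extra:
        routing[vid].add(node_id)
    return routing


# Second vectors 1..100 in ten secondary clusters of ten, one per index node.
groups = {node: range((node - 1) * 10 + 1, node * 10 + 1) for node in range(1, 11)}
# Second vector 10 is also very similar to the centroid of node 2's cluster.
routing = build_routing(groups, extra=[(10, 2)])
```

  • Mapping a boundary vector to two index nodes costs some duplicated storage but reduces the chance that a query near the boundary misses relevant third vectors.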
  • the vector similarity described in the embodiment of the present application may also be called a similarity measure, and the similarity measure may be cosine similarity, Minkowski distance, Mahalanobis distance, Jaccard similarity coefficient or any suitable similarity measure.
  • as an example and not by way of limitation, the similarity measure of two vectors can be the cosine similarity; as another example and not by way of limitation, the similarity measure of two vectors can be the Euclidean distance.
  • the similarity measure of two vectors may indicate how similar two objects or n-grams respectively corresponding to the two vectors are to each other.
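  • Two of the similarity measures named above, cosine similarity and Euclidean distance, can be computed as in the following sketch. The function names are illustrative, and the cosine function assumes that neither input is the all-zero vector.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; assumes neither is all-zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between u and v; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))
```

  • Note the difference in direction: a larger cosine similarity means more similar vectors, whereas a larger Euclidean distance means less similar vectors, so a condition such as "similarity greater than a threshold" corresponds to "distance below a bound" when a distance is used.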
  • mapping relationship between each second vector and the index node can be established, that is, the routing information is constructed.
  • the routing information may be stored in the form of a data table, where the routing information may include the identifier of each second vector (the identifier may directly or indirectly point to the storage address of the corresponding second vector) and the index node corresponding to each second vector (for example, the identifier of the index node, which may directly or indirectly point to the address of the corresponding index node).
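  • A data-table form of the routing information can be sketched as a list of rows, each pairing a second-vector identifier with an index-node identifier. Everything concrete here is hypothetical and for illustration only: the class name `RouteEntry`, the identifier strings, the in-memory `vector_store`, and the network addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEntry:
    second_vector_id: str  # points (directly or indirectly) to the stored vector
    index_node_id: str     # points (directly or indirectly) to the node address


# Hypothetical identifiers and addresses, for illustration only.
vector_store = {"sv-1": [1.0, 0.0], "sv-2": [0.0, 1.0]}
node_addresses = {"node-1": "10.0.0.1:9000", "node-2": "10.0.0.2:9000"}

routing_table = [
    RouteEntry("sv-1", "node-1"),
    RouteEntry("sv-2", "node-2"),
]


def resolve(entry):
    """Follow both identifiers to the second vector and the node address."""
    return vector_store[entry.second_vector_id], node_addresses[entry.index_node_id]
```

  • Keeping identifiers rather than the vectors themselves in the table lets the second vectors live in a separate index structure while the table stays small.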
  • the routing node needs to identify which second vector or second vectors in the routing information have a high vector similarity with the first vector (for example, a vector similarity greater than the threshold). Specifically, a first index can be constructed from the above-mentioned first cluster centroid set (the index type of the first index includes but is not limited to the hierarchical navigable small world (HNSW) layered graph structure, locality-sensitive hashing (LSH), etc.); the first index may also be called a coarse quantizer index. Through the coarse quantizer index, the one or more second vectors most similar to the first vector can be found quickly.
  • the above-mentioned first index may be saved on the routing node.
  • each second vector corresponds to an index node. For the third vectors in the clustering category where each second vector is located (the third vectors in each clustering category can be called a vector subspace), different approximate nearest neighbor (ANN) algorithms may be selected according to the number of vectors to perform ANN model training, and a sub-index model corresponding to each vector subspace may be generated.
  • the ANN algorithm includes but is not limited to scalar quantization (SQ), product quantization (PQ), HNSW, inverted product quantization, etc. This is equivalent to each clustering category forming a sub-index model; that is, each second vector can correspond to a sub-index model, and the sub-index model corresponding to each second vector can then be deployed on the index node corresponding to that second vector.
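  • Of the algorithms listed, scalar quantization is the simplest to illustrate. The sketch below is an assumption-based example rather than the sub-index of the present application: it learns a per-dimension range from the vectors of one subspace and stores each component as an integer code in 0..255. The function names `train_sq`, `encode`, and `decode` are hypothetical.

```python
def train_sq(vectors):
    """Learn per-dimension minimum and maximum from the vectors of one subspace."""
    columns = list(zip(*vectors))
    return [min(c) for c in columns], [max(c) for c in columns]


def encode(vector, low, high, levels=255):
    """Map each float to an integer code in 0..levels (0 if the range is empty)."""
    return [round((x - lo) / (hi - lo) * levels) if hi > lo else 0
            for x, lo, hi in zip(vector, low, high)]


def decode(codes, low, high, levels=255):
    """Approximately reconstruct the original floats from the codes."""
    return [lo + c / levels * (hi - lo) for c, lo, hi in zip(codes, low, high)]


# Made-up third vectors of one vector subspace.
subspace = [[0.0, -1.0], [0.5, 0.0], [1.0, 1.0]]
low, high = train_sq(subspace)
codes = encode([0.5, 0.0], low, high)
approx = decode(codes, low, high)
```

  • Each component then needs one byte instead of a 4-byte or 8-byte float, at the cost of a reconstruction error of at most one quantization step per dimension in this sketch.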
  • the routing node can determine the target index node from the multiple index nodes according to the first vector and the routing information, where the first vector and the multiple The vector similarity between target vectors in the second vectors is greater than a threshold, and the target vectors correspond to the target index nodes in the routing information.
  • a target vector whose vector similarity to the first vector is greater than a threshold may be determined from the multiple second vectors in the routing information, where the target vector may be the second vector that is most similar, or relatively similar, to the first vector among the multiple second vectors. The target vector can be determined based on, for example, the first index (coarse quantizer index) constructed above.
  • the number of target vectors can be one or more. When the number of target vectors is 1, the target vector can be the second vector most similar to the first vector among the multiple second vectors; when there are multiple target vectors, the target vectors may be the second vectors with the highest similarity to the first vector among the multiple second vectors.
  • the determined target vector is as close as possible to the second vector with the highest similarity to the first vector, but it may not be exactly the second vector with the highest similarity to the first vector among the multiple second vectors.
  • the number of target vectors may be a preset number. The more target vectors there are, the higher the accuracy of the final retrieval result may be, but the greater the number of index nodes over which the determined target vectors are distributed may be, which in turn increases the amount of concurrent data access and may increase the delay of the retrieval process. In practical applications, a balanced value can be selected based on retrieval accuracy and delay requirements.
  • the threshold here may be related to the number of target vectors and the structure of the first index. The larger the number of target vectors, the lower the threshold; the better the performance of the first index (for example, the more closely it can find the vectors with high similarity to the first vector), the higher the threshold.
  • the index node corresponding to the target vector may be determined according to the routing information.
  • determining the index node corresponding to the target vector may include determining information such as the identification (for example, a node number) and address of the index node corresponding to the target vector.
  • an index node corresponding to each target vector may be determined.
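  • the routing decision described above (threshold on similarity, a preset number of target vectors, then a lookup of the corresponding index nodes in the routing information) can be sketched as follows. The function name `select_target_index_nodes`, the list-of-pairs layout of the routing information, cosine similarity as the measure, and the sample values are assumptions made for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def select_target_index_nodes(first_vector, routing_info, threshold, max_targets):
    """Pick target vectors and the index nodes they map to.

    routing_info is a list of (second_vector, node_id) pairs (assumed layout).
    Returns (target_vectors, node_ids); node_ids keeps first-seen order without repeats.
    """
    scored = [
        (cosine_similarity(first_vector, second_vector), second_vector, node_id)
        for second_vector, node_id in routing_info
    ]
    # Keep only second vectors above the threshold, best first, capped at max_targets.
    candidates = [item for item in scored if item[0] > threshold]
    candidates.sort(key=lambda item: item[0], reverse=True)
    chosen = candidates[:max_targets]

    target_vectors = [second_vector for _, second_vector, _ in chosen]
    node_ids = []
    for _, _, node_id in chosen:
        if node_id not in node_ids:
            node_ids.append(node_id)
    return target_vectors, node_ids


routing_info = [
    ([1.0, 0.0], "node-1"),
    ([0.8, 0.6], "node-1"),
    ([0.0, 1.0], "node-2"),
    ([-1.0, 0.0], "node-3"),
]
targets, nodes = select_target_index_nodes([1.0, 0.2], routing_info, 0.5, 3)
```

  • in this example two target vectors pass the threshold and both map to the same index node, so only one index node is visited; raising `max_targets` or lowering `threshold` can only keep or increase the number of nodes visited.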
  • the target vector may be a vector among the multiple third vectors (for example, may be a cluster centroid of multiple third vectors).
  • the target index node can store multiple vectors for data retrieval, and the proportion of the multiple third vectors among these multiple vectors is greater than a target ratio.
  • the target ratio may be a value close to 1, such as 80%, 90%, 95%.
  • the vector similarity between the first vector and each of M target vectors among the plurality of second vectors is greater than a threshold, the M target vectors correspond to the target index node in the routing information, and M is a positive integer greater than 1.
  • multiple target vectors with high similarity to the first vector can be determined in the routing information, thereby increasing the number of third vectors used in the subsequent retrieval for the first vector and improving the precision of the retrieval results.
  • the target index node includes one or more index nodes, and the number of index nodes included in the target index node is less than a preset number.
  • the preset number may be 2, 3, 4, 5 and so on.
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is to say, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the vectors stored by itself, so that each retrieval only needs to visit one or a small number of index nodes.
  • this improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • the routing node transmits the first vector to the target index node.
  • after the routing node determines the target index node, it can transfer the first vector to the target index node, and the target index node then performs data retrieval for the first vector based on the multiple third vectors stored by itself.
  • the routing node may transfer the first vector to multiple target index nodes, so that each target index node performs data retrieval for the first vector based on the multiple third vectors stored by itself.
  • the target index node determines the retrieval result of the first vector from multiple third vectors stored by itself.
  • after receiving the first vector, the target index node can determine the retrieval result of the first vector from the multiple third vectors stored by itself; optionally, the retrieval result is determined from the multiple third vectors based on a similarity comparison between the first vector and the stored multiple third vectors.
  • the routing node can also transfer information about the vector subspace corresponding to the target vector to the target index node; the target index node can then obtain the sub-index model corresponding to that vector subspace and, based on the sub-index model, perform a similarity comparison between the first vector and the stored multiple third vectors to determine the retrieval result from the multiple third vectors.
  • the retrieval result may be a part of the multiple third vectors (for example, it may be one or more third vectors with the highest similarity).
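  • a minimal sketch of retrieval on the target index nodes follows: each node ranks its own third vectors against the first vector and returns the best few. The step that merges the per-node results at the routing node is an assumption added for the multi-node case and is not described in the text above; the function names, cosine similarity, and the sample data are likewise illustrative.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_on_index_node(first_vector, third_vectors, top_k):
    """Rank one node's third vectors (id -> vector); returns (similarity, id), best first."""
    scored = (
        (cosine_similarity(first_vector, vector), vector_id)
        for vector_id, vector in third_vectors.items()
    )
    return heapq.nlargest(top_k, scored)


def merge_results(per_node_results, top_k):
    """Assumed routing-node step: combine hits from several index nodes."""
    all_hits = (hit for hits in per_node_results for hit in hits)
    return heapq.nlargest(top_k, all_hits)


node_a = {"a1": [1.0, 0.0], "a2": [0.9, 0.1]}
node_b = {"b1": [0.7, 0.7], "b2": [0.0, 1.0]}
query = [1.0, 0.0]

hits_a = retrieve_on_index_node(query, node_a, top_k=2)
hits_b = retrieve_on_index_node(query, node_b, top_k=2)
merged = merge_results([hits_a, hits_b], top_k=3)
```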
  • the routing node may also transmit the target vector to the target index node; furthermore, the target index node may, based on the target vector and a first mapping relationship, determine one or more third vectors from the multiple third vectors stored by itself, where the first mapping relationship includes the plurality of second vectors and the correspondence between each second vector and third vectors; the target index node then determines a retrieval result from the one or more third vectors based on the first vector.
  • each index node may store one or more second vectors corresponding to the third vectors stored by itself, and the one or more second vectors may be used to represent the third vectors stored by the index node itself.
  • the index node can determine one or more third vectors corresponding to the target vector based on the target vector (that is, a second vector), and the target vector can be used as a representation of one or more third vectors, and then the index node Retrieval results may be determined from one or more third vectors based on the first vector.
  • since the target vector can be used as a representation of one or more third vectors, and the similarity between the first vector and the target vector is greater than the threshold, determining the retrieval result from only the one or more third vectors speeds up the retrieval process while still satisfying the retrieval accuracy.
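  • the use of the first mapping relationship on the index node can be sketched as follows: the target vector received from the routing node selects a subset of the stored third vectors, and only that subset is compared with the first vector. Identifying second and third vectors by string ids, the dictionary layout of the mapping, and the function name are assumptions for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_with_mapping(first_vector, target_id, first_mapping, stored, top_k=1):
    """Search only the third vectors that the target vector represents.

    first_mapping: second-vector id -> list of third-vector ids (assumed layout).
    stored: third-vector id -> vector, i.e. everything held by this index node.
    """
    candidate_ids = first_mapping[target_id]
    ranked = sorted(
        candidate_ids,
        key=lambda vector_id: cosine_similarity(first_vector, stored[vector_id]),
        reverse=True,
    )
    return ranked[:top_k]


stored = {
    "t1": [1.0, 0.0],
    "t2": [0.9, 0.2],
    "t3": [0.0, 1.0],
    "t4": [0.1, 0.9],
}
first_mapping = {"s0": ["t1", "t2"], "s1": ["t3", "t4"]}

result = retrieve_with_mapping([1.0, 0.0], "s0", first_mapping, stored, top_k=1)
```

  • here only two of the four stored vectors are compared with the query, which is the source of the speed-up described above.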
  • the target index node may determine part of the third vectors from multiple third vectors stored by itself according to the first vector, where each third vector in the part of the third vectors corresponds to The same cluster, and the vector similarity between the cluster center of the same cluster and the first vector is greater than a threshold; determine the retrieval result from the part of the third vector based on the first vector.
  • each index node may store one or more fourth vectors corresponding to the third vectors stored by itself, and the one or more fourth vectors may be used to represent the third vectors stored by the index node itself.
  • the index node can determine the fourth vector with the highest similarity from one or more fourth vectors, and obtain one or more third vectors corresponding to the determined fourth vector, and the fourth vector can be used as The representation of one or more third vectors, and then the index node can determine the retrieval result from the one or more third vectors based on the first vector.
  • since the fourth vector can be used as a representation of one or more third vectors, and the similarity between the fourth vector and the first vector is greater than the threshold, determining the retrieval result from only the one or more third vectors speeds up the retrieval process while still satisfying the retrieval precision.
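  • when the index node chooses the cluster itself, the same idea can be sketched with locally stored fourth vectors: the node first picks the fourth vector most similar to the first vector, then searches only the third vectors of that cluster. The list-of-pairs layout, the function name, and the sample data are illustrative assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def retrieve_via_local_centers(first_vector, clusters, top_k):
    """clusters: list of (fourth_vector, {third_vector_id: third_vector}) pairs."""
    # Step 1: choose the cluster whose fourth vector is most similar to the query.
    _, members = max(
        clusters, key=lambda cluster: cosine_similarity(first_vector, cluster[0])
    )
    # Step 2: rank only the third vectors inside that cluster.
    ranked = sorted(
        members,
        key=lambda vector_id: cosine_similarity(first_vector, members[vector_id]),
        reverse=True,
    )
    return ranked[:top_k]


clusters = [
    ([1.0, 0.1], {"t1": [1.0, 0.0], "t2": [0.9, 0.3]}),
    ([0.1, 1.0], {"t3": [0.0, 1.0], "t4": [0.2, 0.9]}),
]
result = retrieve_via_local_centers([0.0, 1.0], clusters, top_k=2)
```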
  • the routing node can receive the semantic vector to be retrieved, determine, according to the routing information, the number of the semantic vector shard to be accessed (for example, shard 1), and forward the semantic vector to be retrieved to shard 1. Shard 1 compares the semantic vector to be retrieved with its subspace model to obtain the corresponding sub-index, uses the sub-index to calculate similarities with the semantic vector to be retrieved, obtains the identifiers of the most similar one or more third vectors, and returns them to the routing node.
  • the set of semantic vectors of the webpages to be indexed can be clustered a first time by similarity, according to the number of similarity subspaces to be divided, to obtain the first cluster centroid set of the global vector space.
  • the generated first cluster centroid set can be clustered again according to the number of shards to obtain the second cluster centroid set.
  • each centroid in the second cluster centroid set corresponds to a webpage semantic vector shard.
  • for the generated first cluster centroid set, construct the first index, called the coarse quantizer index, and construct the mapping relationship between the first index and the webpage semantic vector shards.
  • the first index and the mapping relationship together constitute a routing device, which is installed on the routing node.
  • the first cluster centroid set is assigned to the semantic vector shards according to the above mapping relationship to generate a subspace model.
  • use the routing device to find the first cluster centroid and the second cluster centroid corresponding to each original vector to be indexed, and store the vector in the vector subspace corresponding to the first cluster centroid, within the semantic vector shard corresponding to the second cluster centroid.
  • assign the generated subspace model to each semantic vector shard according to the second cluster centroids, and construct the sub-index models: use the semantic vectors stored in the vector subspace corresponding to each first cluster centroid to generate the sub-index model corresponding to that vector subspace.
  • the sub-index models are used to construct a second index, and all generated sub-indexes are allocated to the shards according to the subspace model.
  • the routing device may be used at the cooperative forwarding node to find the corresponding semantic vector shard and route the request to that shard.
  • on the semantic vector shard that is routed to, look up the subspace model and find the vector subspace corresponding to the most similar centroid in the first cluster centroid set.
  • call the sub-index generated for that vector subspace, calculate the group of vectors most similar to the original vector to be retrieved, and return it to the routing node.
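  • the two-level layout and the retrieval path above can be sketched end to end as follows. The sketch assumes both centroid sets are already available (the clustering itself is omitted), uses squared Euclidean distance as the similarity measure, and replaces each trained sub-index with an exhaustive scan of its vector subspace; all function names and sample values are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(vector, candidates):
    """Position of the candidate closest to vector."""
    return min(
        range(len(candidates)), key=lambda i: squared_distance(vector, candidates[i])
    )


def build(first_centroids, second_centroids, vectors):
    """Place each vector in its subspace on the shard that owns that subspace."""
    # Mapping relationship: first-centroid position -> shard position.
    centroid_to_shard = [nearest(c, second_centroids) for c in first_centroids]
    # shards[s][c] is the vector subspace of first centroid c held by shard s.
    shards = [dict() for _ in second_centroids]
    for vector_id, vector in vectors.items():
        c = nearest(vector, first_centroids)
        shards[centroid_to_shard[c]].setdefault(c, {})[vector_id] = vector
    return centroid_to_shard, shards


def query(vector, first_centroids, centroid_to_shard, shards, top_k):
    """Route to one shard, then scan the matching subspace on that shard."""
    c = nearest(vector, first_centroids)          # coarse quantizer lookup
    shard_id = centroid_to_shard[c]               # routing via the mapping relationship
    subspace = shards[shard_id].get(c, {})        # subspace lookup on the shard
    ranked = sorted(subspace, key=lambda vid: squared_distance(vector, subspace[vid]))
    return shard_id, ranked[:top_k]


first_centroids = [[0.0, 0.0], [1.0, 1.0], [10.0, 10.0], [11.0, 11.0]]
second_centroids = [[0.5, 0.5], [10.5, 10.5]]
vectors = {
    "p1": [0.1, 0.0],
    "p2": [0.9, 1.2],
    "p3": [10.2, 9.9],
    "p4": [11.0, 11.3],
    "p5": [10.9, 10.8],
}

centroid_to_shard, shards = build(first_centroids, second_centroids, vectors)
answer = query([11.0, 11.0], first_centroids, centroid_to_shard, shards, top_k=2)
```

  • the query touches a single shard and a single subspace on it; in a deployment along the lines described above, the scan inside `query` would be replaced by a call to the sub-index model trained for that subspace.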
  • An embodiment of the present application provides a data retrieval method applied to a computer system that includes a routing node and a plurality of index nodes. The method includes: the routing node obtains a first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector; the routing node determines a target index node from the plurality of index nodes according to the first vector and the routing information, where the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; the routing node transmits the first vector to the target index node; and the target index node determines a retrieval result from the multiple third vectors stored by itself based on a similarity comparison between the first vector and the stored multiple third vectors. In this embodiment of the application, the routing node determines, based on similarity, the index node to which the first vector needs to be routed.
  • FIG. 5 is a schematic flow chart of a data retrieval method provided by the embodiment of the present application. The method includes:
  • for the description of step 501, reference may be made to the description of step 401 in the foregoing embodiment, and details are not repeated here.
  • for the description of step 502, reference may be made to the description of step 402 in the foregoing embodiment, and details are not repeated here.
  • for the description of step 503, reference may be made to the description of step 403 and step 404 in the foregoing embodiment, and details are not repeated here.
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is to say, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the vectors stored by itself, so that each retrieval only needs to visit one or a small number of index nodes.
  • this improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • each second vector corresponds to a cluster
  • each second vector is a vector corresponding to a cluster center of the cluster.
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vector stored in each index node is a vector included in one or more clusters.
  • the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
  • the target index node includes one or more index nodes.
  • the retrieval result is some of the vectors in the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • FIG. 6 is a schematic flow chart of a data retrieval method provided in an embodiment of the present application, and the method includes:
  • for the description of step 601, reference may be made to the description of step 403 in the foregoing embodiment, and details are not repeated here.
  • for the description of step 602, reference may be made to the description of step 404 in the foregoing embodiment, and details are not repeated here.
  • multiple vectors used for data retrieval are stored on the target index node, and the proportion of the multiple third vectors among these multiple vectors is greater than the target ratio.
  • the retrieval result is some of the vectors in the plurality of third vectors.
  • the present application also provides a computer system, including a first index node and a second index node, where the first index node includes a first memory and a first processor, and the second index node includes a second memory and a second processor. The first memory is configured to store a plurality of first vectors for data retrieval, the similarity between different first vectors among the plurality of first vectors is greater than a threshold, and each first vector is a representation of a data object; the second memory is configured to store a plurality of second vectors for data retrieval, the similarity between different second vectors among the plurality of second vectors is greater than the threshold, each second vector is a representation of a data object, and the vector similarity between the plurality of first vectors and the plurality of second vectors is smaller than the threshold; the first processor is configured to perform data retrieval based on the plurality of first vectors; and the second processor is configured to perform data retrieval based on the plurality of second vectors.
  • the third vectors on the index nodes are deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is to say, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node), so the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the vectors stored by itself. Each retrieval therefore only needs to visit one or a small number of index nodes to obtain accurate retrieval results, which improves similarity retrieval performance at large vector scale; the concurrency performance of cluster retrieval can grow linearly with the size of the cluster, effectively addressing the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
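  • the placement property of the two index nodes (similarity above the threshold within a node, below the threshold across nodes) can be checked with a short sketch such as the following. The function name `placement_is_valid`, cosine similarity as the measure, and the sample vectors are assumptions for illustration.

```python
import math
from itertools import combinations, product


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def placement_is_valid(node_a_vectors, node_b_vectors, threshold):
    """True if vectors are similar within each node and dissimilar across nodes."""
    within_a = all(
        cosine_similarity(u, v) > threshold for u, v in combinations(node_a_vectors, 2)
    )
    within_b = all(
        cosine_similarity(u, v) > threshold for u, v in combinations(node_b_vectors, 2)
    )
    across = all(
        cosine_similarity(u, v) < threshold
        for u, v in product(node_a_vectors, node_b_vectors)
    )
    return within_a and within_b and across


first_node = [[1.0, 0.0], [0.9, 0.1]]
second_node = [[0.0, 1.0], [0.1, 0.9]]

valid = placement_is_valid(first_node, second_node, threshold=0.5)
# Moving a vector to the wrong node breaks the within-node condition.
invalid = placement_is_valid(first_node + [[0.0, 1.0]], second_node, threshold=0.5)
```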
  • the multiple first vectors are vectors included in one or more first clusters
  • the multiple second vectors are vectors included in one or more second clusters
  • the first cluster is different from the second cluster.
  • the first cluster and the second cluster are clusters.
  • the first processor is configured to determine a retrieval result from the plurality of first vectors based on a third vector, where the third vector is a representation of the first retrieval object;
  • the second processor is configured to determine a retrieval result from the plurality of second vectors based on a fourth vector, where the fourth vector is a representation of the second retrieval object.
  • both the first index node and the second index node are communicatively connected to a routing node, and both the third vector and the fourth vector are sent from the routing node.
  • the first retrieval object and the second retrieval object include one or more of text data, audio data, image data, or video data.
  • FIG. 7 is a schematic structural diagram of a data retrieval device 700 provided by the embodiment of the present application.
  • the device can be applied to a computer system, and the computer system includes a routing node and a plurality of index nodes.
  • the routing node includes:
  • for the specific description of the obtaining module 701, reference may be made to the description of step 401 in the above-mentioned embodiment, and details are not repeated here.
  • a routing module 702, configured to determine a target index node from the plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
  • for the specific description of the routing module 702, reference may be made to the description of step 402 in the above embodiment, and details are not repeated here.
  • a sending module 703, configured to transfer the first vector to the target index node.
  • the target index node includes:
  • the retrieval module 704 is configured to determine the retrieval result of the first vector from a plurality of third vectors stored in itself.
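  • the division of the device 700 into modules can be sketched as two small classes. The method names, the dictionary-shaped request handled by the obtaining module, the inner product as the similarity measure, and the in-process call that stands in for network transfer are all assumptions for illustration; the application does not specify them.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class IndexNode:
    def __init__(self, third_vectors):
        self.third_vectors = third_vectors  # third-vector id -> vector

    def retrieve(self, first_vector, top_k=1):
        """Retrieval module 704: rank the stored third vectors by inner product."""
        ranked = sorted(
            self.third_vectors,
            key=lambda vid: dot(first_vector, self.third_vectors[vid]),
            reverse=True,
        )
        return ranked[:top_k]


class RoutingNode:
    def __init__(self, routing_info, index_nodes, threshold):
        self.routing_info = routing_info  # list of (second_vector, node_id)
        self.index_nodes = index_nodes    # node_id -> IndexNode
        self.threshold = threshold

    def obtain(self, request):
        """Obtaining module 701: take the first vector from a (hypothetical) request."""
        return request["vector"]

    def route(self, first_vector):
        """Routing module 702: nodes whose second vector exceeds the threshold."""
        node_ids = []
        for second_vector, node_id in self.routing_info:
            if dot(first_vector, second_vector) > self.threshold and node_id not in node_ids:
                node_ids.append(node_id)
        return node_ids

    def send(self, first_vector, node_ids, top_k=1):
        """Sending module 703: a direct call stands in for the network transfer."""
        return {nid: self.index_nodes[nid].retrieve(first_vector, top_k) for nid in node_ids}


index_nodes = {
    "n1": IndexNode({"a": [1.0, 0.0], "b": [0.8, 0.2]}),
    "n2": IndexNode({"c": [0.0, 1.0]}),
}
router = RoutingNode([([1.0, 0.0], "n1"), ([0.0, 1.0], "n2")], index_nodes, threshold=0.5)

first_vector = router.obtain({"vector": [0.9, 0.1]})
results = router.send(first_vector, router.route(first_vector))
```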
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is to say, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the vectors stored by itself, so that each retrieval only needs to visit one or a small number of index nodes.
  • this improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • each of the second vectors corresponds to a cluster
  • each of the second vectors is a vector corresponding to a cluster center of the cluster.
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vector stored in each index node is a vector included in one or more clusters.
  • the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
  • the target index node includes one or more index nodes.
  • the retrieval result is some of the vectors in the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • the sending module is further configured to transfer the target vector to the target index node;
  • the retrieval module is specifically configured to determine the one or more third vectors from a plurality of third vectors stored in itself based on the target vector and the first mapping relationship, the first mapping relationship includes one or more second vectors and the corresponding relationship between each second vector and the third vector in the one or more second vectors;
  • a retrieval result is determined from the one or more third vectors based on the first vector.
  • the retrieval module is specifically configured to, according to the first vector, determine a part of the third vector from a plurality of third vectors stored in itself, wherein each of the part of the third vector The third vector corresponds to the same cluster, and the vector similarity between the cluster center of the same cluster and the first vector is greater than a threshold;
  • a retrieval result is determined from the portion of third vectors based on the first vector.
  • FIG. 8 is a schematic structural diagram of a data retrieval device provided in an embodiment of the present application.
  • the device 800 includes:
  • for the specific description of the obtaining module 801, reference may be made to the description of step 501 in the above embodiment, and details are not repeated here.
  • a routing module 802, configured to determine a target index node from the plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
  • for the specific description of the routing module 802, reference may be made to the description of step 502 in the above embodiment, and details are not repeated here.
  • a sending module 803, configured to transmit the first vector to the target index node, where the first vector is used to instruct the target index node to determine the retrieval result of the first vector from multiple third vectors stored by itself.
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is to say, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the vectors stored by itself, so that each retrieval only needs to visit one or a small number of index nodes.
  • this improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • each of the second vectors corresponds to a cluster
  • each of the clusters includes one or more of the third vectors
  • different second vectors in the plurality of second vectors correspond to different clusters.
  • each second vector corresponds to a cluster
  • each second vector is a vector corresponding to a cluster center of the cluster.
  • each of the index nodes is configured to store multiple third vectors according to clusters, and each cluster includes at least one third vector.
  • the third vector stored in each index node is a vector included in one or more clusters.
  • the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
  • the target index node includes one or more index nodes.
  • the retrieval result is some of the vectors in the plurality of third vectors.
  • the first vector is a representation of a retrieval object
  • the retrieval object includes one or more of text data, audio data, image data, or video data.
  • FIG. 9 is a schematic structural diagram of a data retrieval device provided by an embodiment of the present application.
  • the device 900 can be applied to a first index node, and the first index node stores multiple first and third vectors.
  • the vector similarity between different first and third vectors among the plurality of first and third vectors is greater than a threshold, and the device 900 includes:
  • An acquisition module 901 configured to acquire a first vector, the vector similarity between the first vector and the plurality of first and third vectors is greater than a threshold;
  • a retrieval module 902 configured to determine a retrieval result from the multiple first and third vectors based on the similarity comparison between the first vector and the multiple first and third vectors.
  • multiple vectors used for data retrieval are stored on the first index node, and the proportion of the multiple first and third vectors among these multiple vectors is greater than the target ratio.
  • the retrieval result is some of the vectors in the plurality of third vectors.
  • the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is to say, the third vectors with high similarity to the first vector are stored on a specific index node (the candidate index node). Therefore, the routing node only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the vectors stored by itself, so that each retrieval only needs to visit one or a small number of index nodes.
  • this improves similarity retrieval performance at large vector scale: the concurrency performance of cluster retrieval can increase linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
  • the terminal device 1000 includes: a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004 (the number of processors 1003 in the terminal device 1000 can be one or more, and one processor is taken as an example in FIG. 10), where the processor 1003 may include an application processor 10031 and a communication processor 10032.
  • the receiver 1001, the transmitter 1002, the processor 1003 and the memory 1004 may be connected through a bus or in other ways.
  • the memory 1004 may include read-only memory and random-access memory, and provides instructions and data to the processor 1003.
  • a part of the memory 1004 may also include a non-volatile random access memory (NVRAM).
  • the memory 1004 stores processors and operating instructions, executable modules or data structures, or their subsets, or their extended sets, wherein the operating instructions may include various operating instructions for implementing various operations.
  • the processor 1003 controls the operation of the terminal device.
  • various components of the terminal device are coupled together through a bus system, where the bus system may include a power bus, a control bus, and a status signal bus in addition to a data bus.
  • the various buses are referred to as bus systems in the figures.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1003 or implemented by the processor 1003 .
  • the processor 1003 may be an integrated circuit chip, which has a signal processing capability. In the implementation process, each step of the above method may be completed by an integrated logic circuit of hardware in the processor 1003 or instructions in the form of software.
  • the above-mentioned processor 1003 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the processor 1003 may implement or execute various methods, steps, and logic block diagrams disclosed in the embodiments of the present application.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module can be located in a storage medium that is well established in the field, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004, and completes the steps of the above method in combination with its hardware.
  • the receiver 1001 can be used to receive input digital or character information, and to generate signal input related to the settings and function control of the terminal device.
  • the transmitter 1002 can be used to output digital or character information through the first interface; the transmitter 1002 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1002 can also include a display device such as a display screen.
  • the processor 1003 is configured to execute the relevant steps performed on the terminal side in the data retrieval method described in the above embodiment.
  • FIG. 11: the server 1100 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 1111 (for example, one or more processors), memory 1132, and one or more storage media 1130 (for example, one or more mass storage devices) that store application programs 1142 or data 1144.
  • the memory 1132 and the storage medium 1130 may be temporary storage or persistent storage.
  • the program stored in the storage medium 1130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server.
  • the central processing unit 1111 may be configured to communicate with the storage medium 1130 , and execute a series of instruction operations in the storage medium 1130 on the server 1100 .
  • the server 1100 can also include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, one or more input and output interfaces 1158, and/or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
  • the central processing unit 1111 is configured to execute the data retrieval method described in the embodiments corresponding to FIG. 4 and FIG. 6 .
  • the embodiment of the present application also provides a computer program product, which, when running on a computer, causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned server.
  • An embodiment of the present application also provides a computer-readable storage medium that stores a program for signal processing; when the program is run on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned server.
  • the execution device, server, or terminal device provided in the embodiment of the present application may specifically be a chip.
  • the chip includes: a processing unit and a communication unit.
  • the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, pins, or circuits.
  • the processing unit can execute the computer-executed instructions stored in the storage unit, so that the chips in the execution device execute the data processing methods described in the above embodiments, or the chips in the server execute the data processing methods described in the above embodiments.
  • the storage unit is a storage unit in the chip, such as a register, a cache, etc.
  • the storage unit may also be a storage unit located outside the chip in the wireless access device, such as read-only memory (ROM) or another type of static storage device that can store static information and instructions, or random access memory (RAM).
  • FIG. 12 is a schematic structural diagram of a chip provided by the embodiment of the present application.
  • the chip can be represented as a neural network processor NPU 1200, and the NPU 1200 is mounted on the main CPU (Host CPU) as a coprocessor, and the tasks are assigned by the Host CPU.
  • the core part of the NPU is the operation circuit 1203, and the controller 1204 controls the operation circuit 1203 to extract data in the memory (weight memory or input memory) and perform operations.
  • the operation circuit 1203 includes multiple processing engines (PEs).
  • in some implementations, the operation circuit 1203 is a two-dimensional systolic array.
  • the operation circuit 1203 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition.
  • in some implementations, the operation circuit 1203 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 1202, and caches it in each PE in the operation circuit.
  • the operation circuit fetches the data of matrix A from the input memory 1201 and performs matrix operation with matrix B, and the obtained partial or final results of the matrix are stored in an accumulator (accumulator) 1208 .
  • the unified memory 1206 is used to store input data and output data.
  • the weight data is transferred to the weight memory 1202 through the direct memory access controller (DMAC) 1205.
  • the input data is also transferred to the unified memory 1206 through the DMAC.
  • the bus interface unit (BIU) 1210 is used for the interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1209.
  • the bus interface unit 1210 is used by the instruction fetch buffer 1209 to obtain instructions from the external memory, and by the DMAC 1205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to move the input data in the external memory DDR to the unified memory 1206 , move the weight data to the weight memory 1202 , or move the input data to the input memory 1201 .
  • the vector computing unit 1207 includes a plurality of processing units and, where necessary, further processes the output of the operation circuit, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for network computations in a neural network other than those of the convolutional and fully connected layers, such as batch normalization, pixel-level summation, and upsampling of feature planes.
  • the vector computation unit 1207 can store the vector of the processed output to unified memory 1206 .
  • the vector computing unit 1207 can apply a linear or nonlinear function to the output of the operation circuit 1203, for example performing linear interpolation on the feature plane extracted by a convolution layer, or applying a function to a vector of accumulated values to generate activation values.
  • the vector computation unit 1207 generates normalized values, pixel-level summed values, or both.
  • the vector of processed outputs can be used as an activation input to the operation circuit 1203, for example for use in subsequent layers of a neural network.
  • An instruction fetch buffer 1209 connected to the controller 1204 is used to store instructions used by the controller 1204.
  • the unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch buffer 1209 are all on-chip memories. External memory is private to the NPU hardware architecture.
  • the processor mentioned above can be a general-purpose central processing unit, microprocessor, ASIC, or one or more integrated circuits for controlling the execution of the above-mentioned programs.
  • the device embodiments described above are only illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they can be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • the connection relationship between the modules indicates that they have communication connections, which can be specifically implemented as one or more communication buses or signal lines.
  • the essence of the technical solution of this application, or the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disc, and includes several instructions that cause a computer device (which can be a personal computer, training device, network device, etc.) to execute the methods described in the embodiments of the present application.
  • all or part of them may be implemented by software, hardware, firmware or any combination thereof.
  • When implemented using software, it may be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device such as a training device or a data center integrated with one or more available media.
  • the available medium may be a magnetic medium (such as a floppy disk, a hard disk, or a magnetic tape), an optical medium (such as a DVD), or a semiconductor medium (such as a solid state disk (SSD)), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present application discloses a data retrieval method, comprising: a routing node determining a target vector according to the vector similarity between a first vector and a plurality of second vectors, determining a target index node among a plurality of index nodes on the basis of routing information, and transmitting the first vector to the target index node, such that the target index node determines a retrieval result from the third vectors that it stores. In the present application, the third vectors having high similarity to the first vector are stored on a specific index node; the routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result on the basis of the candidates that it stores. Each retrieval thus needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance for large-scale vector retrieval.

Description

A data retrieval method and related device
This application claims priority to Chinese Patent Application No. 202111017328.0, filed with the China Patent Office on August 31, 2021 and entitled "A Data Retrieval Method and Related Device", which is incorporated herein by reference in its entirety.
Technical field
This application relates to the field of computers, and in particular to a data retrieval method and related device.
Background
The amount of data (for example, image, video, audio, and text data) stored in digital information repositories, such as online Internet and cloud-based databases, is growing dramatically. Processing search queries over unstructured data accurately and with efficient use of resources is a technical challenge.
Similarity search is a data search method that searches for unstructured data objects by comparing the similarity between a retrieval object and the data objects in a search database. It typically involves creating metadata for each data object stored in the database, creating metadata for the retrieval object, and then comparing the metadata of the retrieval object with the metadata of the data objects. The metadata of each data object may take the form of a feature vector, that is, a multi-dimensional numerical vector representing the data object. In this sense, similarity search can be defined as finding, among the feature vectors stored in a database, the feature vectors most similar to a given feature vector (for example, a query vector).
Similarity search therefore typically involves using a feature extraction algorithm to convert a retrieval object (for example, an image, a video sample, an audio sample, or text) into a first vector representing that retrieval object. The first vector is then used to search a database of feature vectors (for example, a database that includes multiple third vectors) to locate the one or more third vectors most similar to the first vector.
Existing approaches implement similarity search with a hash-based approximate nearest neighbor (ANN) algorithm. Specifically, the retrieval system may include a routing node and index nodes (an index node may also be called a shard). The third vectors are hashed using locality sensitive hashing (LSH) so that each third vector is mapped to a corresponding index node. During retrieval, the routing node distributes the first vector to every index node; each index node independently accesses the ANN index it holds, computes the set of results in that index most similar to the first vector, and returns them to the routing node. The routing node collects the results returned by all index nodes, then merges and processes them.
One problem with existing LSH-ANN-based indexing and search algorithms is that, as the number of third vectors grows, the number of index nodes also grows. The number of interactions between the routing node and the index nodes therefore increases, and the latency the routing node spends collecting the results returned by each index node and merging them increases significantly.
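The cost pattern of the existing scheme can be illustrated with a minimal sketch. It assumes an in-memory dictionary of shards and Euclidean distance; the names `scatter_gather_search`, `shards`, and `requests_sent` are hypothetical and do not come from this application. The point is only that the router contacts every shard and merges every partial result, so both counts grow with the number of shards.

```python
import heapq
import math


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def scatter_gather_search(query, shards, k):
    """Send the query to every shard, then merge the per-shard top-k lists."""
    partial_results = []
    requests_sent = 0
    for shard_id, vectors in shards.items():
        requests_sent += 1  # one router-to-shard interaction per shard
        # Each shard computes its own local top-k independently.
        local_top = heapq.nsmallest(
            k, ((l2(query, v), shard_id, i) for i, v in enumerate(vectors))
        )
        partial_results.extend(local_top)
    # The router merges all partial results into a global top-k.
    merged = heapq.nsmallest(k, partial_results)
    return merged, requests_sent


shards = {
    "shard-0": [[0.0, 0.0], [0.1, 0.0]],
    "shard-1": [[5.0, 5.0]],
    "shard-2": [[9.0, 9.0], [0.0, 0.2]],
}
results, requests_sent = scatter_gather_search([0.0, 0.0], shards, k=2)
```

In this toy data both nearest vectors live on a single shard, yet all three shards are queried; that overhead is what the routing scheme described in this application aims to avoid.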
Summary of the invention
In a first aspect, this application provides a data retrieval system. The data retrieval system includes a routing node and a plurality of index nodes, wherein,
the routing node is configured to obtain a first vector;
In a possible implementation, feature extraction may be performed on a retrieval object to obtain the first vector. The retrieval object may be, for example, video data, audio data, image data, text data, or other unstructured data. For image data, the first vector may be represented by an original feature vector obtained from the color histogram of the original image data; for video data, the first vector may be represented by an original feature vector obtained from the scale-invariant feature transform (SIFT) or 3D-SIFT of the original video data, or from a discriminate video descriptor (DVD). Many feature vector formats are well known for representing different kinds of data objects, and any of them may be used in the feature extraction process;
The objects involved in the embodiments of this application may include data objects and retrieval objects. A retrieval object may be an object that the user needs to query, and a data object may be an object pre-stored in the search database;
In a possible implementation, the first vector may be the original feature vector described above, or a vector obtained by further processing the original feature vector (for example, normalization); this is not limited here;
In a possible implementation, the routing information may be stored in the form of a data table. The routing information may include an identifier of each second vector (the identifier may point, directly or indirectly, to the storage address of the corresponding second vector) and the index node corresponding to each second vector (for example, an identifier of the index node, which may point, directly or indirectly, to the address of that index node). Although this disclosure describes storing the routing information in a particular manner, it contemplates storing the routing information in any suitable manner;
the routing node obtains a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information includes a plurality of second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information;
In a possible implementation, the statement that each second vector is used to represent one or more third vectors stored on the corresponding index node can be understood as follows: each second vector can express the features of some or all of the third vectors stored on the corresponding index node. That is, the second vector can serve as a feature representation of some or all of those third vectors (the similarity between the second vector and those third vectors is high; for example, they all belong to the same cluster).
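The routing information described above can be pictured as a small table. The following is a minimal sketch under stated assumptions: the record type `RouteEntry`, its field names, and the helper functions are hypothetical illustrations, and the application itself leaves the storage format open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteEntry:
    vector_id: str        # identifier of a second vector (hypothetical naming)
    second_vector: tuple  # representative vector for one group of third vectors
    index_node_id: str    # identifier of the index node storing that group


def build_routing_table(entries):
    """Key the routing entries by second-vector identifier."""
    return {entry.vector_id: entry for entry in entries}


def index_node_for(table, vector_id):
    """Look up the index node that a given second vector points to."""
    return table[vector_id].index_node_id


table = build_routing_table([
    RouteEntry("c0", (1.0, 0.0), "node-A"),
    RouteEntry("c1", (0.9, 0.2), "node-A"),  # several second vectors may share a node
    RouteEntry("c2", (0.0, 1.0), "node-B"),
])
```

Keeping the second vector and its index node in the same record means that once a target vector is chosen, the target index node follows from a single lookup.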
In a possible implementation, a target vector whose vector similarity to the first vector is greater than the threshold may be determined from the plurality of second vectors in the routing information. The target vector may be the second vector that is most similar, or relatively similar, to the first vector; for example, it may be determined based on the first index (coarse quantizer index) constructed above. There may be one or more target vectors. When there is one target vector, it may be the second vector most similar to the first vector; when there are several, they may be the second vectors ranked highest by similarity to the first vector;
It should be understood that the number of target vectors may be preset. The more target vectors there are, the higher the accuracy of the final retrieval result may be; however, the determined target vectors may then be spread across more index nodes, which increases the amount of concurrent data and may increase retrieval latency. In practice, a balanced value can be chosen based on the retrieval accuracy and latency requirements;
It should be understood that the threshold here may be related to the number of target vectors and to the structure of the first index: the more target vectors there are, the lower the threshold; the better the first index performs (for example, when it can identify vectors close to those most similar to the first vector), the higher the threshold;
the routing node passes the first vector to the target index node;
the target index node is configured to determine a retrieval result for the first vector from the plurality of third vectors that it stores.
In a possible implementation, the target index node determines the retrieval result from the plurality of third vectors based on a similarity comparison between the first vector and the stored third vectors.
In the embodiments of this application, the routing node determines, based on similarity, the index node to which the first vector needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). That is, the third vectors with high similarity to the first vector are all stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result based on the candidates it stores. Each retrieval thus needs to visit only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrency performance of cluster retrieval can grow linearly with the size of the cluster, which effectively addresses the performance and scalability problems of vector similarity retrieval at massive vector scale.
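The routing flow described above can be sketched end to end. This is an illustration under stated assumptions rather than the claimed implementation: it uses the inner product as the similarity measure, a brute-force scan in place of the first index and of each node's internal index, and hypothetical names (`select_target_nodes`, `routed_search`, `routing`, `node_store`).

```python
import heapq


def dot(a, b):
    """Inner product, used here as the (assumed) similarity measure."""
    return sum(x * y for x, y in zip(a, b))


def select_target_nodes(query, routing, m, threshold):
    """Pick the index nodes of the m most similar second vectors above the threshold."""
    scored = [(dot(query, vec), node) for vec, node in routing]
    top = heapq.nlargest(m, scored, key=lambda item: item[0])
    nodes = []
    for score, node in top:
        # Several target vectors may map to the same node; keep each node once.
        if score > threshold and node not in nodes:
            nodes.append(node)
    return nodes


def routed_search(query, routing, node_store, m, threshold, k):
    """Search only the target index nodes and return the global top-k hits."""
    targets = select_target_nodes(query, routing, m, threshold)
    hits = []
    for node in targets:
        hits.extend((dot(query, v), node, i) for i, v in enumerate(node_store[node]))
    return targets, heapq.nlargest(k, hits)


routing = [((1.0, 0.0), "node-A"), ((0.0, 1.0), "node-B"), ((-1.0, 0.0), "node-C")]
node_store = {
    "node-A": [(0.9, 0.1), (0.8, 0.3)],
    "node-B": [(0.1, 0.9)],
    "node-C": [(-0.9, 0.0)],
}
targets, hits = routed_search((1.0, 0.1), routing, node_store, m=1, threshold=0.5, k=1)
```

Raising `m` or lowering `threshold` widens the set of nodes visited, which reflects the trade-off between accuracy and latency discussed above.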
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
The third vectors within each cluster have relatively high similarity to one another, and third vectors in different clusters have relatively low similarity.
Because the third vectors stored on the target index node are the third vectors in the cluster to which the target vector belongs, the vector similarity between the target vector and each of the plurality of third vectors is greater than a threshold, and the vector similarity between different third vectors among the plurality of third vectors is greater than a threshold. Optionally, in a possible implementation, the target vector may be one of the plurality of third vectors (for example, the centroid of a cluster obtained by clustering the plurality of third vectors).
In a possible implementation, the target index node may also store vectors other than the plurality of third vectors (these vectors may also serve as third vectors for data retrieval and have low similarity to the plurality of third vectors, but there are relatively few of them on the target index node). That is, the target index node may store multiple vectors used for data retrieval, and the plurality of third vectors account for a proportion of those vectors greater than a target ratio. For example, the target ratio may be a value close to 1, such as 80%, 90%, or 95%.
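The target-ratio condition can be expressed as a simple check. The sketch below assumes that each vector on a node carries a cluster label; the function name `meets_target_ratio` and the labels are hypothetical.

```python
def meets_target_ratio(labels_on_node, target_cluster, target_ratio):
    """Return True if the target cluster's share of the node exceeds target_ratio."""
    in_cluster = sum(1 for label in labels_on_node if label == target_cluster)
    return in_cluster / len(labels_on_node) > target_ratio


# Nine vectors from the target cluster and one unrelated vector on the same node.
labels = ["c0"] * 9 + ["c7"]
```

With these labels the share is 0.9, so the check passes for a target ratio of 0.8 and fails for 0.95.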
In a possible implementation, the vector similarity between the first vector and each of M target vectors among the plurality of second vectors is greater than a threshold, the M target vectors correspond to the target index node in the routing information, and M is a positive integer greater than 1. To ensure retrieval accuracy, multiple target vectors with high similarity to the first vector can be determined in the routing information, which increases the number of third vectors used in the subsequent retrieval for the first vector and thereby increases the accuracy of the retrieval result.
In a possible implementation, each second vector corresponds to one cluster, and each second vector is the vector corresponding to the cluster center of that cluster.
在一种可能的实现中,可以对生成的对原始特征向量按原始向量集合待划分的相似度子空间数量相似度进行一次聚类,其中聚类(clustering)是按照某个特定标准(如距离)把一个数据集分割成不同的类或簇,使得同一个簇内的数据对象的相似性尽可能大,同时不在同一个簇中的数据对象的差异性也尽可能地大。也即聚类后同一类的数据尽可能聚集到一起,不同类数据尽量分离。聚类方法可以包括但不限于K均值(k-means)、均值漂移聚类、基于密度的聚类方法、凝聚层次聚类、图团体检测(graph community detection)、高斯混合模型k-means(gaussian mixture model kmeans,GMM k-means)等聚类方法,通过聚类可以得到全局向量空间的聚类质心集合。聚类质心集合中的每个质心向量对应一个相似度子空间的中心,代表一个全局向量空间的一个向量子空间。In a possible implementation, a clustering can be performed on the similarity of the generated similarity subspaces to be divided into the original feature vector according to the original vector set, wherein the clustering (clustering) is based on a certain standard (such as distance ) divides a data set into different classes or clusters, so that the similarity of data objects in the same cluster is as large as possible, and the difference of data objects that are not in the same cluster is also as large as possible. That is to say, after clustering, the data of the same class are gathered together as much as possible, and the data of different classes are separated as much as possible. Clustering methods may include, but are not limited to, K-means, mean shift clustering, density-based clustering methods, agglomerative hierarchical clustering, graph community detection, Gaussian mixture model k-means (gaussian mixture model kmeans, GMM k-means) and other clustering methods, the cluster centroid set of the global vector space can be obtained through clustering. Each centroid vector in the cluster centroid set corresponds to the center of a similarity subspace, representing a vector subspace of a global vector space.
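As an illustration of how clustering yields a centroid set, here is a minimal sketch of plain K-means (Lloyd's algorithm), one of the methods listed above. It assumes squared Euclidean distance and caller-supplied initial centroids; the function names and sample data are hypothetical, and a production system would more likely rely on an existing clustering library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, initial_centroids, iterations=10):
    """Plain Lloyd's algorithm: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each vector to its nearest centroid.
        assignment = [
            min(range(len(centroids)), key=lambda j: squared_distance(v, centroids[j]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [v for v, a in zip(vectors, assignment) if a == j]
            if members:  # leave an empty cluster's centroid where it is
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment


vectors = [(0, 0), (0, 2), (10, 10), (10, 12)]
centroids, assignment = kmeans(vectors, initial_centroids=[(0, 0), (10, 10)])
```

Each resulting centroid could then serve as a second vector, and the assignment determines which third vectors are stored together on an index node.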
其中,各个聚类类别的质心向量可以作为路由信息中的第二向量,或者是在相应的聚类类别中位于质心向量附近的向量可以作为路由信息中的第二向量,或者是与质心向量相似的向量可以作为路由信息中的第二向量,只要能够代表对应的聚类类别,并不限定怎么选取第二向量。Wherein, the centroid vector of each clustering category can be used as the second vector in the routing information, or a vector located near the centroid vector in the corresponding clustering category can be used as the second vector in the routing information, or a vector similar to the centroid vector The vector of can be used as the second vector in the routing information, as long as it can represent the corresponding clustering category, there is no limitation on how to select the second vector.
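As an illustration of how the centroid set might be obtained, the following is a minimal K-means sketch in plain Python. It assumes squared Euclidean distance and caller-supplied initial centroids; both choices, and all function names, are assumptions made for the example rather than details of this application.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def assign(vectors, centroids):
    """Group each vector under its nearest centroid."""
    clusters = [[] for _ in centroids]
    for v in vectors:
        nearest = min(range(len(centroids)),
                      key=lambda i: squared_distance(v, centroids[i]))
        clusters[nearest].append(v)
    return clusters

def kmeans(vectors, initial_centroids, iterations=10):
    """Run a fixed number of K-means iterations.

    Returns (centroids, clusters); an empty cluster keeps its old centroid."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        clusters = assign(vectors, centroids)
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, assign(vectors, centroids)
```

In this sketch each returned centroid would play the role of a second vector, and the members of its cluster would be the third vectors it represents.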
In a possible implementation, each index node is configured to store multiple third vectors by cluster, and each cluster includes at least one third vector. The third vectors on the index nodes are also deployed by cluster, that is, by vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval thus needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale.
In a possible implementation, the target index node includes one or more index nodes, and the number of index nodes included in the target index node is less than a preset number. The preset number may be, for example, 2, 3, 4, or 5. In this embodiment of the application, each retrieval needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale.
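One way to keep the number of target index nodes below the preset number is sketched here. It assumes the routing step has already produced (similarity, node id) pairs, and it keeps the best-scoring distinct nodes; the tie-breaking and the function name are assumptions for illustration only.

```python
def pick_target_nodes(scored_entries, preset_number):
    """Pick distinct index nodes in decreasing order of similarity.

    The result always contains at least one node (if any entry exists)
    and otherwise fewer than `preset_number` nodes."""
    limit = max(1, preset_number - 1)
    targets = []
    for _, node_id in sorted(scored_entries, key=lambda e: e[0], reverse=True):
        if node_id not in targets:
            targets.append(node_id)
        if len(targets) >= limit:
            break
    return targets
```

For example, with a preset number of 3 the sketch returns at most two index nodes, however many second vectors passed the threshold.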
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
In a possible implementation, the routing node is further configured to pass the target vector to the target index node. The target index node is specifically configured to: determine, based on the target vector and a first mapping relationship, one or more third vectors from the multiple third vectors it stores, where the first mapping relationship indicates the mapping between the target vector and the one or more third vectors among the multiple third vectors; and determine the retrieval result of the first vector from the one or more third vectors.
In a possible implementation, each index node may store one or more second vectors corresponding to the third vectors it stores, and the one or more second vectors may be used to represent those third vectors. During retrieval, the index node may determine, based on the target vector (that is, one of the second vectors), the one or more third vectors corresponding to the target vector. The target vector serves as a representation of the one or more third vectors, and the index node may then determine the retrieval result from the one or more third vectors based on the first vector.
Because the target vector serves as a representation of the one or more third vectors, and the similarity between the first vector and the target vector is greater than the threshold, determining the retrieval result from the one or more third vectors speeds up retrieval while still meeting the required retrieval accuracy.
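The lookup through the first mapping relationship can be sketched as follows. The example assumes the mapping is a dictionary keyed by the target vector, and uses negative squared Euclidean distance as the similarity; the class name, method name, and `top_k` parameter are hypothetical.

```python
def similarity(a, b):
    """Negative squared Euclidean distance (larger means more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

class IndexNode:
    def __init__(self, first_mapping):
        # first_mapping: target (second) vector -> list of third vectors
        self.first_mapping = first_mapping

    def search(self, first_vector, target_vector, top_k=1):
        """Rank only the third vectors mapped to the given target vector."""
        candidates = self.first_mapping.get(target_vector, [])
        ranked = sorted(candidates,
                        key=lambda v: similarity(first_vector, v),
                        reverse=True)
        return ranked[:top_k]
```

Only the third vectors mapped to the received target vector are compared with the first vector, which is where the speed-up in this sketch comes from.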
In a possible implementation, the target index node is specifically configured to: determine, according to the first vector, a subset of third vectors from the multiple third vectors it stores, where the third vectors in the subset belong to the same cluster and the vector similarity between the cluster center of that cluster and the first vector is greater than a threshold; and determine the retrieval result of the first vector from the subset of third vectors.
In a possible implementation, each index node may store one or more fourth vectors corresponding to the third vectors it stores, and the one or more fourth vectors may be used to represent those third vectors. During retrieval, the index node may determine, from the one or more fourth vectors, the fourth vector with the greatest similarity to the first vector, and obtain the one or more third vectors corresponding to that fourth vector. The fourth vector serves as a representation of the one or more third vectors, and the index node may then determine the retrieval result from the one or more third vectors based on the first vector.
Because the fourth vector serves as a representation of the one or more third vectors, and the similarity between the fourth vector and the first vector is greater than the threshold, determining the retrieval result from the one or more third vectors speeds up retrieval while still meeting the required retrieval accuracy.
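A minimal sketch of this variant follows, in which the index node picks the local cluster itself. It assumes each fourth vector is stored alongside the third vectors it represents and again uses negative squared Euclidean distance as the similarity; the class and method names are hypothetical.

```python
def similarity(a, b):
    """Negative squared Euclidean distance (larger means more similar)."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))

class ClusteredIndexNode:
    def __init__(self, clusters):
        # clusters: list of (fourth_vector, [third vectors in that cluster])
        self.clusters = clusters

    def search(self, first_vector, top_k=1):
        """Find the most similar fourth vector, then rank its members."""
        _, members = max(self.clusters,
                         key=lambda c: similarity(first_vector, c[0]))
        ranked = sorted(members,
                        key=lambda v: similarity(first_vector, v),
                        reverse=True)
        return ranked[:top_k]
```

Unlike the target-vector variant, no second vector needs to be passed from the routing node here; the node compares the first vector with its own fourth vectors.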
In a second aspect, this application provides a data retrieval method, the method including:
obtaining a first vector;
obtaining a target index node from a plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; and
passing the first vector to the target index node, where the first vector is used to instruct the target index node to determine the retrieval result of the first vector from the multiple third vectors it stores.
In this embodiment of the application, the routing node determines, based on similarity, the index node to which the query (the first vector) needs to be routed, and the third vectors on the index nodes are likewise deployed by vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval thus needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrent retrieval performance of the cluster can grow linearly with cluster size, which addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
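The three steps of the method can be put together in a short end-to-end sketch. It assumes inner product as the similarity, models each index node as a list of third vectors held in a dictionary, and merges the per-node results at the caller; the merge step, the names, and the node ids are assumptions for illustration and are not prescribed by this application.

```python
def dot(a, b):
    """Inner product, used here as the (assumed) similarity measure."""
    return sum(x * y for x, y in zip(a, b))

def route(first_vector, routing_info, threshold):
    """Return ids of index nodes whose second vector beats the threshold."""
    targets = []
    for second_vector, node_id in routing_info:
        if dot(first_vector, second_vector) > threshold and node_id not in targets:
            targets.append(node_id)
    return targets

def search_on_node(first_vector, stored_third_vectors, top_k):
    """Local search performed by one index node over its own vectors."""
    ranked = sorted(stored_third_vectors,
                    key=lambda v: dot(first_vector, v), reverse=True)
    return ranked[:top_k]

def retrieve(first_vector, routing_info, index_nodes, threshold, top_k=1):
    """Route the query, search only the target nodes, merge their results."""
    contacted = route(first_vector, routing_info, threshold)
    merged = []
    for node_id in contacted:
        merged.extend(search_on_node(first_vector, index_nodes[node_id], top_k))
    merged.sort(key=lambda v: dot(first_vector, v), reverse=True)
    return merged[:top_k], contacted
```

The second return value lists the index nodes that were actually contacted, which makes it easy to see that nodes holding dissimilar clusters are never queried.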
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
In a possible implementation, each second vector corresponds to one cluster, and each second vector is the vector corresponding to the cluster center of that cluster.
In a possible implementation, each index node is configured to store multiple third vectors by cluster, and each cluster includes at least one third vector.
In a possible implementation, the third vectors stored on each index node are the vectors included in one or more clusters.
In a possible implementation, the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
In a possible implementation, the target index node includes one or more index nodes.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
In a third aspect, this application provides a data retrieval apparatus. The apparatus is applied to a computer system that includes a routing node and a plurality of index nodes. The routing node includes:
an obtaining module, configured to obtain a first vector;
a routing module, configured to obtain a target index node from the plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; and
a sending module, configured to pass the first vector to the target index node.
The target index node includes:
a retrieval module, configured to determine the retrieval result of the first vector from the multiple third vectors it stores.
In this embodiment of the application, the routing node determines, based on similarity, the index node to which the query (the first vector) needs to be routed, and the third vectors on the index nodes are likewise deployed by vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval thus needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrent retrieval performance of the cluster can grow linearly with cluster size, which addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
In a possible implementation, each second vector corresponds to one cluster, and each second vector is the vector corresponding to the cluster center of that cluster.
In a possible implementation, each index node is configured to store multiple third vectors by cluster, and each cluster includes at least one third vector.
In a possible implementation, the third vectors stored on each index node are the vectors included in one or more clusters.
In a possible implementation, the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
In a possible implementation, the target index node includes one or more index nodes.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
In a possible implementation, the sending module is further configured to pass the target vector to the target index node;
the retrieval module is specifically configured to determine, based on the target vector and a first mapping relationship, one or more third vectors from the multiple third vectors it stores, where the first mapping relationship indicates the mapping between the target vector and the one or more third vectors among the multiple third vectors; and
to determine the retrieval result of the first vector from the one or more third vectors.
In a possible implementation, the retrieval module is specifically configured to determine, according to the first vector, a subset of third vectors from the multiple third vectors it stores, where the third vectors in the subset belong to the same cluster and the vector similarity between the cluster center of that cluster and the first vector is greater than a threshold; and
to determine the retrieval result of the first vector from the subset of third vectors.
In a fourth aspect, this application provides a data retrieval apparatus, the apparatus including:
an obtaining module, configured to obtain a first vector;
a routing module, configured to obtain a target index node from a plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; and
a sending module, configured to pass the first vector to the target index node, where the first vector is used to instruct the target index node to determine the retrieval result of the first vector from the multiple third vectors it stores.
In this embodiment of the application, the routing node determines, based on similarity, the index node to which the query (the first vector) needs to be routed, and the third vectors on the index nodes are likewise deployed by vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval thus needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrent retrieval performance of the cluster can grow linearly with cluster size, which addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
In a possible implementation, each second vector corresponds to one cluster, and each second vector is the vector corresponding to the cluster center of that cluster.
In a possible implementation, each index node is configured to store multiple third vectors by cluster, and each cluster includes at least one third vector.
In a possible implementation, the third vectors stored on each index node are the vectors included in one or more clusters.
In a possible implementation, the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than a threshold, and the at least two target vectors correspond to the target index node in the routing information.
In a possible implementation, the target index node includes one or more index nodes.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
In a fifth aspect, this application provides a computer system including a first index node and a second index node, where the first index node includes a first memory and a first processor, and the second index node includes a second memory and a second processor, where:
the first memory is configured to store a plurality of first vectors used for data retrieval, the similarity between different first vectors among the plurality of first vectors is greater than a threshold, and each first vector is a representation of a data object;
the second memory is configured to store a plurality of second vectors used for data retrieval, the similarity between different second vectors among the plurality of second vectors is greater than the threshold, each second vector is a representation of a data object, and the vector similarity between the plurality of first vectors and the plurality of second vectors is less than the threshold;
the first processor is configured to perform data retrieval based on the plurality of first vectors; and
the second processor is configured to perform data retrieval based on the plurality of second vectors.
In this embodiment of the application, the third vectors on the index nodes are deployed by vector similarity (vectors with greater similarity are deployed on the same index node, and vectors with less similarity are deployed on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval thus needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrent retrieval performance of the cluster can grow linearly with cluster size, which addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
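The placement condition of the fifth aspect (similar vectors on the same index node, dissimilar vectors on different index nodes) can be checked with a small sketch. It assumes cosine similarity and a pairwise reading of the condition; the function names are hypothetical.

```python
import math
from itertools import combinations, product

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

def satisfies_placement(first_vectors, second_vectors, threshold):
    """True if vectors on each node are mutually similar (> threshold)
    and every cross-node pair is dissimilar (< threshold)."""
    within_first = all(cosine(a, b) > threshold
                       for a, b in combinations(first_vectors, 2))
    within_second = all(cosine(a, b) > threshold
                        for a, b in combinations(second_vectors, 2))
    across = all(cosine(a, b) < threshold
                 for a, b in product(first_vectors, second_vectors))
    return within_first and within_second and across
```

Here `first_vectors` and `second_vectors` follow the naming of the fifth aspect, that is, the vectors stored in the first memory and the second memory respectively.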
In a possible implementation, the plurality of first vectors are the vectors included in one or more first clusters, the plurality of second vectors are the vectors included in one or more second clusters, and the first clusters are different from the second clusters.
In a possible implementation, the first clusters and the second clusters are obtained by clustering.
In a possible implementation, the first processor is configured to determine a retrieval result from the plurality of first vectors based on a third vector, where the third vector is a representation of a first retrieval object; and
the second processor is configured to determine a retrieval result from the plurality of second vectors based on a fourth vector, where the fourth vector is a representation of a second retrieval object.
In a possible implementation, the first index node and the second index node are both communicatively connected to a routing node, and the third vector and the fourth vector are both sent by the routing node.
In a possible implementation, the first retrieval object and the second retrieval object each include one or more of text data, audio data, image data, or video data.
In a sixth aspect, an embodiment of this application provides a computer-readable storage medium storing a computer program that, when run on a computer, causes the computer to perform the first aspect and any optional method thereof, the second aspect and any optional method thereof, and the third aspect and any optional method thereof.
In a seventh aspect, an embodiment of this application provides a computer program that, when run on a computer, causes the computer to perform the first aspect and any optional method thereof, the second aspect and any optional method thereof, and the third aspect and any optional method thereof.
In an eighth aspect, this application provides a chip system. The chip system includes a processor configured to support an execution device or a training device in implementing the functions involved in the foregoing aspects, for example, sending or processing the data or information involved in the foregoing methods. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete components.
Brief Description of Drawings
FIG. 1 is a schematic diagram of an application architecture according to an embodiment of this application;
FIG. 2 is a schematic diagram of a retrieval scenario according to an embodiment of this application;
FIG. 3 is a schematic diagram of an application architecture according to an embodiment of this application;
FIG. 4 is a schematic flowchart of a data retrieval method according to an embodiment of this application;
FIG. 5 is a schematic flowchart of a data retrieval method according to an embodiment of this application;
FIG. 6 is a schematic flowchart of a data retrieval method according to an embodiment of this application;
FIG. 7 is a schematic structural diagram of a data retrieval apparatus according to an embodiment of this application;
FIG. 8 is a schematic structural diagram of a data retrieval apparatus according to an embodiment of this application;
FIG. 9 is a schematic structural diagram of a data retrieval apparatus according to an embodiment of this application;
FIG. 10 is a schematic structural diagram of a terminal device according to an embodiment of this application;
FIG. 11 is a schematic structural diagram of a server according to an embodiment of this application;
FIG. 12 is a schematic structural diagram of a chip according to an embodiment of this application.
具体实施方式Detailed ways
下面结合本发明实施例中的附图对本发明实施例进行描述。本发明的实施方式部分使用的术语仅用于对本发明的具体实施例进行解释,而非旨在限定本发明。Embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention. The terms used in the embodiments of the present invention are only used to explain specific examples of the present invention, and are not intended to limit the present invention.
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。Embodiments of the present application are described below in conjunction with the accompanying drawings. Those of ordinary skill in the art know that, with the development of technology and the emergence of new scenarios, the technical solutions provided in the embodiments of the present application are also applicable to similar technical problems.
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、***、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。The terms "first", "second" and the like in the specification and claims of the present application and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific sequence or sequence. It should be understood that the terms used in this way can be interchanged under appropriate circumstances, and this is merely a description of the manner in which objects with the same attribute are described in the embodiments of the present application. Furthermore, the terms "comprising" and "having", as well as any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, product, or apparatus comprising a series of elements is not necessarily limited to those elements, but may include elements not expressly included. Other elements listed explicitly or inherent to the process, method, product, or apparatus.
本申请实施例可以应用于用于进行数据检索的索引***中,接下来首先介绍本申请实施例的应用场景。The embodiment of the present application can be applied to an indexing system for data retrieval. Next, the application scenario of the embodiment of the present application will be introduced first.
FIG. 1 shows a schematic architecture of an indexing system 100. The indexing system 100 may include a client system 130, a computer system 160, and a third-party system 170 connected to each other through a network 110. Although FIG. 1 shows one arrangement (connection relationship) of the client system 130, the computer system 160, the third-party system 170, and the network 110, the embodiments of the present application do not limit how these components are arranged. By way of example and not limitation, two or more of the client system 130, the computer system 160, and the third-party system 170 may bypass the network 110 and connect directly to each other. As another example, two or more of the client system 130, the computer system 160, and the third-party system 170 may be physically or logically co-located with one another, in whole or in part. In addition, although FIG. 1 shows a specific number of client systems 130, computer systems 160, third-party systems 170, and networks 110, the embodiments of the present application do not limit the number of any of these components. By way of example and not limitation, the indexing system 100 may include multiple client systems 130, computer systems 160, third-party systems 170, and networks 110.
In one possible implementation, the present application may include any suitable network 110. By way of example and not limitation, one or more portions of the network 110 may include an ad hoc network, an intranet, an extranet, a virtual private network (VPN), a local area network (LAN), a wireless LAN (WLAN), a wide area network (WAN), a wireless WAN (WWAN), a metropolitan area network (MAN), a portion of the Internet, a portion of the public switched telephone network (PSTN), a cellular telephone network, or a combination of two or more of these. The network 110 may include one or more networks 110.
In one possible implementation, links 150 may connect the client system 130, the computer system 160, and the third-party system 170 to the communication network 110 or to each other. The present application may include any suitable link 150. In some embodiments, one or more links 150 include one or more wired links (for example, digital subscriber line (DSL) or data over cable service interface specification (DOCSIS)), wireless links (for example, Wi-Fi or worldwide interoperability for microwave access (WiMAX)), or optical links (for example, synchronous optical networking (SONET) or synchronous digital hierarchy (SDH)). In particular embodiments, one or more links 150 each include an ad hoc network, an intranet, an extranet, a VPN, a LAN, a WLAN, a WAN, a WWAN, a MAN, a portion of the Internet, a portion of the PSTN, a network based on cellular technology, a network based on satellite communication technology, another link 150, or a combination of two or more such links 150. The links 150 need not be the same throughout the indexing system 100. One or more first links 150 may differ from one or more second links 150 in one or more respects.
In one possible implementation, the client system 130 may be an electronic device that includes hardware, software, or embedded logic components, or a combination of two or more such components, and that is capable of performing the appropriate functions implemented or supported by the client system 130. By way of example and not limitation, the client system 130 may include a computer system such as a desktop computer, a notebook or laptop computer, a netbook, a tablet computer, an e-book reader, a GPS device, a camera, a personal digital assistant (PDA), a handheld electronic device, a cellular phone, a smartphone, another suitable electronic device, or any suitable combination thereof. The present application may include any suitable client system 130. The client system 130 may enable a network user at the client system 130 to access the network 110. The client system 130 may enable its user to communicate with other users at other client systems 130.
In one possible implementation, the client system 130 may include a web browser. A user at the client system 130 may enter a uniform resource locator (URL) or another address that directs the web browser to a particular server (for example, a server 162 or a server associated with the third-party system 170), and the web browser may generate a hypertext transfer protocol (HTTP) request and pass the HTTP request to the server. The server may accept the HTTP request and, in response, deliver one or more hypertext markup language (HTML) files to the client system 130. The client system 130 may render a web interface (for example, a web page) based on the HTML files from the server for presentation to the user (see, for example, FIG. 2). The present disclosure contemplates any suitable source file. By way of example and not limitation, the web interface may be rendered from HTML files, Extensible Hypertext Markup Language (XHTML) files, or Extensible Markup Language (XML) files, according to specific needs.
Here, the user may enter a uniform resource locator (URL) or another data-retrieval-related address that directs the web browser to a particular server (for example, a server 162 or a server associated with the third-party system 170). The user may then enter the data to be retrieved in the web browser, and the web browser may generate an HTTP request containing the data to be retrieved and pass the HTTP request to the server. The server may accept the HTTP request, obtain a retrieval result based on the data retrieval method provided in the embodiments of the present application, and, in response to the HTTP request, deliver one or more HTML files to the client system 130; the HTML files may include the retrieval result. The client system 130 may render a web interface (for example, a web page) based on the HTML files from the server for presentation to the user (for example, to present the retrieval result to the user).
In one possible implementation, the client system 130 may include an application (APP) that provides a data retrieval function.
The computer system 160 may be accessed by other components of the indexing system 100 directly or via the network 110. By way of example and not limitation, the client system 130 may access the computer system 160 directly or via the network 110 using a web browser or an APP associated with the computer system 160. In particular embodiments, the computer system 160 may include one or more servers 162. Each server 162 may be a unitary server or a distributed server spanning multiple computers or multiple data centers. In particular embodiments, each server 162 may include hardware, software, or embedded logic components, or a combination of two or more such components, for performing the appropriate functions implemented or supported by the server 162. In particular embodiments, the computer system 160 may include one or more data stores 164. The data stores 164 may be used to store various types of information. In particular embodiments, the information stored in a data store 164 may be organized according to a particular data structure. In particular embodiments, each data store 164 may be a relational database, a columnar database, a correlation database, or another suitable database. Although the present disclosure describes or illustrates particular types of databases, the present disclosure contemplates any suitable type of database. Particular embodiments may provide interfaces that enable the client system 130, the computer system 160, or the third-party system 170 to manage, retrieve, modify, add, or delete the information stored in the data stores 164.
In particular embodiments, the computer system 160 may store one or more third vectors in one or more data stores 164.
In particular embodiments, the computer system 160 is capable of linking various entities. By way of example and not limitation, the computer system 160 may enable users to interact with each other and to receive content from the third-party system 170 or other entities, or may allow users to interact with these entities through an application programming interface (API) or other communication channels.
In particular embodiments, the third-party system 170 may include one or more types of servers, one or more databases, one or more interfaces (including but not limited to APIs), one or more web services, one or more content sources, one or more networks, or any other suitable components (for example, components with which a server may communicate). The third-party system 170 may be operated by an entity different from the entity that operates the computer system 160. In particular embodiments, however, the computer system 160 and the third-party system 170 may operate in conjunction with each other to provide retrieval services to users of the computer system 160 or the third-party system 170.
In particular embodiments, the third-party system 170 may include a third-party content object provider. A third-party content object provider may include one or more sources of content objects that may be delivered to the client system 130. By way of example and not limitation, a content object may include information about things or activities of interest to the user, such as movie showtimes, movie reviews, restaurant reviews, restaurant menus, product information and reviews, or other suitable information. As another example and not limitation, content objects may include incentive content objects (for example, coupons, discount tickets, gift certificates, or other suitable incentive objects).
The architecture of the computer system 160 that performs data retrieval in the embodiments of the present application is described next.
As shown in FIG. 3, the computer system 160 may include a routing node and multiple index nodes (for example, index node 1, index node 2, ..., index node N). The routing node and each index node may be an independent server, or a logical node on a server (for example, a virtual machine) that has data storage and data processing capabilities; this is not limited here. The routing node may obtain a first vector and, based on certain rules, determine from the multiple index nodes the index node or nodes that need to perform the data indexing operation. Each index node may include a subspace model and a sub-index, and the subspace model may include multiple third vectors. An index node may determine an index result from its subspace model based on the sub-index and the first vector, and feed the result back to the routing node.
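The division of labor between the routing node and the index nodes can be sketched as follows. This is a minimal illustration, not the implementation in the application: the class names, the `select_nodes` callback, the use of a dot product as the score, and the brute-force scan that stands in for the sub-index are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product, used here as a stand-in similarity score."""
    return sum(x * y for x, y in zip(a, b))


@dataclass
class IndexNode:
    """Hypothetical index node holding third vectors keyed by object id."""
    node_id: str
    third_vectors: Dict[str, Vector] = field(default_factory=dict)

    def search(self, first_vector: Vector, k: int) -> List[Tuple[str, float]]:
        # Brute-force scan standing in for the sub-index (an assumption).
        scored = [(obj_id, dot(first_vector, vec))
                  for obj_id, vec in self.third_vectors.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:k]


@dataclass
class RoutingNode:
    """Hypothetical routing node that forwards a query to selected index nodes."""
    index_nodes: Dict[str, IndexNode]
    # Rule that picks index node ids for a query; left pluggable on purpose.
    select_nodes: Callable[[Vector], List[str]]

    def retrieve(self, first_vector: Vector, k: int) -> List[Tuple[str, float]]:
        merged: List[Tuple[str, float]] = []
        for node_id in self.select_nodes(first_vector):
            merged.extend(self.index_nodes[node_id].search(first_vector, k))
        # Merge the per-node results into a single top-k list.
        merged.sort(key=lambda item: item[1], reverse=True)
        return merged[:k]
```

Keeping the selection rule as a callback separates the routing decision from the per-node search, so either part can be swapped without touching the other.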
Referring to FIG. 4, FIG. 4 is a schematic flowchart of a data retrieval method provided by an embodiment of the present application. The method may be applied to a computer system that includes a routing node and multiple index nodes, and the method includes:
401. The routing node acquires a first vector.
In the embodiments of the present application, the routing node may obtain a first vector that represents the retrieval object.
The first vector is described next:
1. How the routing node obtains the first vector:
In one possible implementation, the user may input a retrieval object (also called a data object) on the device side. The retrieval object may be one of video data, audio data, image data, or text data, or another unstructured data object. The device side may then pass the retrieval object to the computer system, and the computer system maps the retrieval object to the first vector; alternatively, the mapping from the retrieval object to the first vector is performed by a computing unit other than the computer system (for example, the device side, or another computing unit connected between the device side and the computer system). The routing node in the computer system can then obtain the first vector produced by the mapping.
2. How the first vector is generated:
In one possible implementation, feature extraction may be performed on the retrieval object to obtain the first vector (or called ). The retrieval object may be, for example, one of video data, audio data, image data, or text data, or another unstructured data object. For example, the first vector for image data may be characterized by an original feature vector obtained from a color histogram of the original image data, and the first vector for video data may be characterized by an original feature vector obtained from a scale-invariant feature transform (SIFT) or 3D-SIFT of the original video data, or from a discriminative video descriptor (DVD). Many different feature vector formats are known for representing different kinds of data objects, and any of these formats may be used in the feature extraction process.
In one possible implementation, the first vector may be the original feature vector described above, or a vector obtained by further processing the original feature vector (for example, normalization); this is not limited here.
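As a concrete illustration of the color-histogram case followed by normalization, the sketch below builds a coarse RGB histogram and scales it to unit length. The function names, the bin layout, and the choice of L2 normalization are assumptions for the example; the text above only says that a color histogram and normalization may be used.

```python
import math
from typing import Iterable, List, Tuple


def color_histogram(pixels: Iterable[Tuple[int, int, int]],
                    bins_per_channel: int = 4) -> List[float]:
    """Count pixels into a joint RGB histogram with bins_per_channel**3 bins.

    Each pixel is an (r, g, b) tuple with channel values in 0..255.
    """
    step = math.ceil(256 / bins_per_channel)
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in pixels:
        # Clamp so that the value 255 always lands in the last bin.
        ri = min(r // step, bins_per_channel - 1)
        gi = min(g // step, bins_per_channel - 1)
        bi = min(b // step, bins_per_channel - 1)
        hist[(ri * bins_per_channel + gi) * bins_per_channel + bi] += 1.0
    return hist


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit Euclidean length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


# Hypothetical three-pixel "image": one black pixel and two white pixels.
raw_feature = color_histogram([(0, 0, 0), (255, 255, 255), (255, 255, 255)],
                              bins_per_channel=2)
first_vector = l2_normalize(raw_feature)
```

Normalizing to unit length makes inner-product and cosine comparisons equivalent, which is convenient if the later similarity measure is cosine-based.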
In one possible implementation, a retrieval object (for example, text data) may be represented in a d-dimensional vector space, where d denotes any suitable number of dimensions. The retrieval object (for example, text data) may be represented in the vector space as a first vector called a term embedding. The text data may correspond to the coordinates of a particular point in the vector space (that is, the end point of the vector). By way of example and not limitation, the retrieval object (for example, text data) may be mapped to the first vector in the vector space by applying a function defined by a dictionary. As another example and not limitation, a dictionary trained to map text to vector representations may be used, or such a dictionary may itself be generated by training. As another example and not limitation, a model such as Word2vec may be used to map the retrieval object (for example, text data) to a vector representation in the vector space. In one possible implementation, the retrieval object (for example, text data) may be mapped to a vector representation in the vector space by a machine learning model (for example, a neural network). The machine learning model may have been trained on a sequence of training data (for example, corpora that each include retrieval objects such as text data).
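The dictionary-based mapping can be illustrated with a toy lookup table. The table contents, the whitespace tokenization, and the choice to average the term vectors into a single first vector are assumptions made for the example; a trained model such as Word2vec would supply the table in practice.

```python
from typing import Dict, List


def embed_text(text: str, term_vectors: Dict[str, List[float]], dim: int) -> List[float]:
    """Map text to a d-dimensional vector by averaging known term vectors.

    Terms missing from the dictionary are skipped; if no term is known,
    a zero vector of length dim is returned.
    """
    known = [term_vectors[t] for t in text.lower().split() if t in term_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


# Hypothetical two-dimensional term embeddings.
toy_dictionary = {"red": [1.0, 0.0], "car": [0.0, 1.0]}
first_vector = embed_text("Red car", toy_dictionary, dim=2)
```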
In one possible implementation, a retrieval object may be represented in the vector space as a vector called a feature vector or an object embedding. For example, the retrieval object may be mapped to a vector based on one or more properties, attributes, or features of the retrieval object, the relationship of the object to other objects, or any other suitable information associated with the object. By way of example and not limitation, a function may map the retrieval object to a vector through feature extraction, which may start from an initial set of measured data and build derived values (for example, features). By way of example and not limitation, an object that includes video or an image may be mapped to a vector by using an algorithm to detect or isolate various desired portions or shapes of the object. The features used to compute the vector may be based on information obtained from edge detection, corner detection, blob detection, ridge detection, scale-invariant feature transform, edge direction, changing intensity, autocorrelation, motion detection, optical flow, thresholding, blob extraction, template matching, Hough transform (for example, lines, circles, ellipses, arbitrary shapes), or any other suitable information. As another example and not limitation, a retrieval object that includes audio data may be mapped to a vector based on features such as spectral slope, tonality coefficient, audio spectral centroid, audio spectral envelope, Mel-frequency cepstrum, or any other suitable information. In one possible implementation, when the retrieval object has data that is too large to be processed efficiently or includes redundant data, the function may map the retrieval object to a vector using a transformed, reduced feature set (for example, feature selection). In particular embodiments, the function may map the retrieval object to the first vector based on one or more n-grams associated with the retrieval object. Although the above embodiments describe representing the retrieval object in a vector space in specific ways, the present application contemplates representing the retrieval object in a vector space in any suitable way.
402. The routing node obtains a target index node from the multiple index nodes according to the first vector and routing information, where the routing information includes multiple second vectors and the index node corresponding to each second vector, each second vector is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the multiple second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information.
In the embodiments of the present application, the routing node may also obtain routing information, where the routing information may include multiple second vectors and the index node corresponding to each second vector.
Specifically, in a distributed retrieval architecture the routing node acts as a data relay. After obtaining the first vector, it needs to determine information about the index node that will perform data retrieval for the first vector (for example, an index node identifier that unambiguously indicates which of the multiple index nodes is meant). In the embodiments of the present application, this determination from the first vector to the index node can be made through the routing information.
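Step 402 can be sketched as a scan over the routing information that keeps every index node whose second vector is sufficiently similar to the first vector. The representation of routing information as a list of `(second_vector, node_id)` pairs and the use of cosine similarity are assumptions for the example; the text requires only some vector similarity that is compared with a threshold.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def select_target_nodes(first_vector: Sequence[float],
                        routing_info: List[Tuple[Sequence[float], str]],
                        threshold: float) -> List[str]:
    """Return ids of index nodes whose second vector exceeds the similarity threshold."""
    targets: List[str] = []
    for second_vector, node_id in routing_info:
        if cosine_similarity(first_vector, second_vector) > threshold:
            if node_id not in targets:  # several second vectors may share a node
                targets.append(node_id)
    return targets


# Hypothetical routing information: three second vectors on two index nodes.
routing_info = [([1.0, 0.0], "node1"), ([0.0, 1.0], "node2"), ([0.9, 0.1], "node1")]
targets = select_target_nodes([1.0, 0.0], routing_info, threshold=0.8)
```

With this routing table only `node1` is selected, so `node2` never has to search its third vectors for this query.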
The routing information is described next.
1. How the routing information is constructed:
In one possible implementation, feature extraction may be performed on multiple candidate data objects (for example, unstructured data objects). During feature extraction, information may be extracted from the unstructured data objects included in an object database to produce a corresponding original feature vector for each of the candidate data objects. For example, the unstructured data objects included in the object database may be one of video data, audio data, image data, or text data, or other unstructured data objects. For example, each image object may be characterized by a respective original feature vector obtained from a color histogram of the original image data, and each video object may be characterized by a respective original feature vector obtained from SIFT or 3D-SIFT of the original video data, or from a discriminative video descriptor (DVD). Many different feature vector formats are known for representing different kinds of data objects, and any of these formats is suitable for the feature extraction process, which converts the candidate data objects into their respective original feature vectors.
In one possible implementation, the statement that each second vector is used to represent one or more third vectors stored on the corresponding index node can be understood as follows: each second vector can express the features of some or all of the third vectors stored on the corresponding index node. That is, the second vector can serve as a feature representation of some or all of the third vectors stored on the corresponding index node (the similarity between the second vector and those third vectors is high; for example, they all belong to the same cluster).
The cluster here may be obtained by clustering or by local hashing. Optionally, the second vector may be the cluster center.
In one possible implementation, the generated original feature vectors may be clustered once by similarity, with the number of clusters given by the number of similarity subspaces into which the original vector set is to be divided. Clustering partitions a data set into different classes or clusters according to a specific criterion (such as distance), so that data objects within the same cluster are as similar as possible and data objects in different clusters are as different as possible. In other words, after clustering, data of the same class are gathered together as much as possible and data of different classes are separated as much as possible. The clustering method may include, but is not limited to, k-means, mean-shift clustering, density-based clustering, agglomerative hierarchical clustering, graph community detection, and Gaussian mixture model k-means (GMM k-means). Clustering yields a set of cluster centroids for the global vector space. Each centroid vector in the set corresponds to the center of one similarity subspace and represents one vector subspace of the global vector space.
The centroid vector of each cluster may serve as a second vector in the routing information; alternatively, a vector located near the centroid vector in the corresponding cluster, or a vector similar to the centroid vector, may serve as a second vector in the routing information. How the second vector is selected is not limited, as long as it can represent the corresponding cluster.
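The first clustering pass and the choice of centroids as second vectors can be sketched with a plain k-means loop. This is one of the listed clustering options, written from scratch for illustration; the function names, the deterministic initialization from the first k points, and the fixed iteration count are assumptions for the example.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point: Sequence[float], centroids: List[List[float]]) -> int:
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points: List[Sequence[float]], k: int,
           iterations: int = 20) -> Tuple[List[List[float]], List[int]]:
    """Lloyd's algorithm; returns (centroids, cluster index of each point).

    Assumes len(points) >= k. Initial centroids are the first k points,
    which keeps the sketch deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    for _ in range(iterations):
        members: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for p in points:
            members[nearest(p, centroids)].append(p)
        for j, group in enumerate(members):
            if group:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids, [nearest(p, centroids) for p in points]


# Hypothetical original feature vectors forming two obvious groups.
features = [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]]
second_vectors, assignment = kmeans(features, k=2)
```

Here the centroids themselves are used as second vectors; picking the member closest to each centroid instead would be an equally valid reading of the paragraph above.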
In one possible implementation, after the multiple second vectors are obtained, the cluster of each second vector may be mapped to one or more index nodes; that is, a mapping relationship is established between the second vectors and the index nodes. For example, suppose the second vectors are second vector 1 to second vector 100 and the index nodes are index node 1 to index node 10. Then second vectors 1 to 10 may be mapped to index node 1, second vectors 11 to 20 to index node 2, second vectors 21 to 30 to index node 3, second vectors 31 to 40 to index node 4, second vectors 41 to 50 to index node 5, second vectors 51 to 60 to index node 6, second vectors 61 to 70 to index node 7, second vectors 71 to 80 to index node 8, second vectors 81 to 90 to index node 9, and second vectors 91 to 100 to index node 10.
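The numbering in this example follows a simple block pattern, which can be written down directly. The function name and the dictionary representation are assumptions for the illustration, and the block pattern reproduces only this particular example, not a general mapping rule.

```python
import math
from typing import Dict


def block_mapping(num_second_vectors: int, num_index_nodes: int) -> Dict[int, int]:
    """Map second vector i (1-based) to an index node (1-based) in equal blocks."""
    per_node = math.ceil(num_second_vectors / num_index_nodes)
    return {i: (i - 1) // per_node + 1 for i in range(1, num_second_vectors + 1)}


mapping = block_mapping(100, 10)
# Second vectors 1..10 land on index node 1, 11..20 on index node 2, and so on.
```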
The rules for establishing the mapping relationship from the second vectors to the index nodes are described next.
In one possible implementation, the multiple second vectors obtained by clustering may be clustered a second time according to the number of index nodes. The clustering method may include, but is not limited to, k-means, mean-shift clustering, density-based clustering, agglomerative hierarchical clustering, graph community detection, and Gaussian mixture model k-means (GMM k-means). This clustering yields a second set of cluster centroids for the global vector space. Each centroid vector (a second vector) in the second set of cluster centroids may be mapped to one index node, and the other second vectors in the cluster of that centroid vector may be mapped to the same index node as the centroid vector.
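The secondary clustering can be sketched by clustering the second vectors into as many groups as there are index nodes and assigning each group to one node. The compact k-means helper, its deterministic initialization, and the `(second_vector, node_id)` output format are assumptions for the illustration, and the sketch requires at least as many second vectors as index nodes.

```python
from typing import List, Sequence, Tuple


def _sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _kmeans_labels(points: List[Sequence[float]], k: int,
                   iterations: int = 20) -> List[int]:
    """Cluster label per point from a basic k-means (first k points as seeds)."""
    centroids = [list(p) for p in points[:k]]

    def label(p: Sequence[float]) -> int:
        return min(range(k), key=lambda j: _sq_dist(p, centroids[j]))

    for _ in range(iterations):
        groups: List[List[Sequence[float]]] = [[] for _ in range(k)]
        for p in points:
            groups[label(p)].append(p)
        for j, group in enumerate(groups):
            if group:
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return [label(p) for p in points]


def build_routing_info(second_vectors: List[Sequence[float]],
                       node_ids: List[str]) -> List[Tuple[Sequence[float], str]]:
    """Secondary clustering: one cluster of second vectors per index node."""
    labels = _kmeans_labels(second_vectors, k=len(node_ids))
    return [(vec, node_ids[lab]) for vec, lab in zip(second_vectors, labels)]


# Hypothetical second vectors and index node ids.
routing_info = build_routing_info(
    [[0.0, 0.0], [10.0, 10.0], [0.0, 1.0], [10.0, 11.0]], ["node1", "node2"])
```

Grouping similar second vectors on the same node means a query that is close to one of them tends to need only that node, which is the point of the secondary clustering.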
It should be understood that, optionally, the second vectors mapped to each index node may all belong to the same cluster obtained after the secondary clustering, and the second vectors mapped to different index nodes may belong to different clusters obtained after the secondary clustering.
It should be understood that, optionally, the other second vectors in the cluster category of a centroid vector may be mapped not only to the index node to which that centroid vector is mapped, but also to other index nodes (provided that the second vector is highly similar to the centroid vector, in the second cluster centroid set, that corresponds to the other index node). That is, a small number of the second vectors mapped to different index nodes may belong to the same cluster category obtained by the secondary clustering.
For example, the second vectors may include second vector 1 to second vector 100, and the index nodes may include index node 1 to index node 10. After the secondary clustering, second vectors 1 to 10 belong to one cluster category, second vectors 11 to 20 to another, second vectors 21 to 30 to another, second vectors 31 to 40 to another, second vectors 41 to 50 to another, second vectors 51 to 60 to another, second vectors 61 to 70 to another, second vectors 71 to 80 to another, second vectors 81 to 90 to another, and second vectors 91 to 100 to another.
Optionally, the second vectors mapped to each index node may all belong to the same cluster category obtained by the secondary clustering, while the second vectors mapped to different index nodes belong to different cluster categories. For example, second vectors 1 to 10 can be mapped to index node 1, second vectors 11 to 20 to index node 2, second vectors 21 to 30 to index node 3, second vectors 31 to 40 to index node 4, second vectors 41 to 50 to index node 5, second vectors 51 to 60 to index node 6, second vectors 61 to 70 to index node 7, second vectors 71 to 80 to index node 8, second vectors 81 to 90 to index node 9, and second vectors 91 to 100 to index node 10.
Optionally, the second vectors mapped to each index node may all belong to the same cluster category obtained by the secondary clustering, while a small number of the second vectors mapped to different index nodes may also belong to the same cluster category. For example, if second vector 10 is highly similar to the centroid vector of the cluster category to which second vectors 11 to 20 belong, second vectors 1 to 10 can be mapped to index node 1, and second vector 10 together with second vectors 11 to 20 can be mapped to index node 2. That is, second vector 10 is mapped to both index node 1 and index node 2.
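The mapping rule above, including the optional overlap in which a second vector is also mapped to another node whose centroid is very similar to it, can be sketched as follows. This is a minimal illustration rather than the method of this application: the function names, the dictionary layout, the use of cosine similarity, and the `overlap_threshold` value are all assumptions, and the node centroids are taken as given instead of being computed by a secondary clustering step.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_routing_info(second_vectors, node_centroids, overlap_threshold=0.99):
    """Map each second-vector id to a list of index-node ids.

    second_vectors: {vector_id: vector}
    node_centroids: {node_id: centroid vector produced by secondary clustering}
    overlap_threshold: hypothetical cutoff; a vector is additionally mapped to
        any other node whose centroid similarity reaches this value.
    """
    routing = {}
    for vec_id, vec in second_vectors.items():
        sims = {node: cosine(vec, c) for node, c in node_centroids.items()}
        primary = max(sims, key=sims.get)  # node of the closest centroid
        extras = [n for n, s in sims.items()
                  if n != primary and s >= overlap_threshold]
        routing[vec_id] = [primary] + sorted(extras)
    return routing


# Toy data (hypothetical): two index nodes, three second vectors.
centroids = {1: [1.0, 0.0], 2: [0.0, 1.0]}
vectors = {"v1": [0.9, 0.1], "v2": [0.1, 0.9], "v3": [1.0, 1.0]}
routing_info = build_routing_info(vectors, centroids, overlap_threshold=0.7)
```

Here `v3` lies between the two centroids, so it ends up mapped to both nodes, which mirrors the case of second vector 10 in the example above.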
The vector similarity described in the embodiments of this application may also be called a similarity measure. The similarity measure may be cosine similarity, Minkowski distance, Mahalanobis distance, the Jaccard similarity coefficient, or any other suitable similarity measure. By way of example and not limitation, the similarity measure of [Figure PCTCN2022115091-appb-000001] and [Figure PCTCN2022115091-appb-000002] may be the cosine similarity. As another example and not limitation, the similarity measure of [Figure PCTCN2022115091-appb-000003] and [Figure PCTCN2022115091-appb-000004] may be the Euclidean distance [Figure PCTCN2022115091-appb-000005]. The similarity measure of two vectors may indicate how similar the two objects or n-grams corresponding to those vectors are to each other. Although this disclosure describes computing similarity measures between vectors in a particular manner, it contemplates computing them in any suitable manner.
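As a small illustration of two of the measures named above, the following sketch computes cosine similarity and Euclidean distance using their standard textbook definitions. The function names are chosen here for illustration and do not come from this application.

```python
import math


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Euclidean norm of the difference u - v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Note that the two measures point in opposite directions: a larger cosine similarity means more similar vectors, whereas a smaller Euclidean distance does, so a "similarity greater than a threshold" test has to be inverted when a distance is used.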
In the manner described above, the mapping relationship between each second vector and the index nodes can be established; that is, the routing information is constructed.
In a possible implementation, the routing information can be stored in the form of a data table. The routing information may include an identifier of each second vector (the identifier may point, directly or indirectly, to the storage address of the corresponding second vector) and the index node corresponding to each second vector (for example, an identifier of the index node that points, directly or indirectly, to the address of that index node). Although this disclosure describes storing the routing information in a particular manner, it contemplates storing the routing information in any suitable manner.
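One way to picture such a data table is as a list of rows, each pairing a second-vector identifier with an index-node identifier. The sketch below is only an assumed layout: the `RoutingEntry` type, its field names, and the `nodes_for` helper are hypothetical, and the rows reproduce the 100-vector, 10-node example given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingEntry:
    second_vector_id: int  # resolves (directly or indirectly) to the stored second vector
    index_node_id: int     # resolves (directly or indirectly) to the index node address


# Second vectors 1..10 -> node 1, 11..20 -> node 2, ..., 91..100 -> node 10.
routing_table = [RoutingEntry(i, (i - 1) // 10 + 1) for i in range(1, 101)]


def nodes_for(vector_id):
    """Return the sorted index-node ids recorded for one second vector."""
    return sorted({e.index_node_id for e in routing_table
                   if e.second_vector_id == vector_id})
```

Because a second vector may be mapped to more than one index node, the lookup returns a list rather than a single identifier.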
In a possible implementation, after obtaining the first vector, the routing node needs to identify which second vector or vectors in the routing information are highly similar to the first vector (for example, with a vector similarity greater than a threshold). Specifically, a first index can be built from the first cluster centroid set described above. The index type of the first index includes, but is not limited to, a hierarchical graph structure based on the hierarchical navigable small world (HNSW) algorithm and locality-sensitive hashing (LSH). The first index may also be called a coarse quantizer index. Through the coarse quantizer index, the one or more second vectors most similar to the first vector can be found quickly. Optionally, the first index may be stored on the routing node.
In a possible implementation, each second vector corresponds to one index node, and the third vectors in the cluster category of each second vector (the third vectors in one cluster category may be referred to as a vector subspace) can be deployed on the corresponding index node. For the third vectors in the vector subspaces deployed on each index node, different ANN algorithms can be selected according to the number of vectors for ANN model training, generating a sub-index model for each vector subspace. The ANN algorithms include, but are not limited to, scalar quantization (scalar quantizer, SQ), product quantization (PQ), HNSW, and inverted product quantization. In effect, each cluster category forms one sub-index model; that is, each second vector corresponds to one sub-index model, and the sub-index model of each second vector can be deployed on the index node corresponding to that second vector.
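The idea of picking an ANN algorithm per vector subspace according to its size can be sketched as a simple selection function. The source does not say which algorithm suits which size, so the size cutoffs and the pairing of sizes with SQ, HNSW, and PQ below are purely illustrative assumptions.

```python
def choose_sub_index_type(num_vectors, small=10_000, large=1_000_000):
    """Pick a sub-index type for one vector subspace from its vector count.

    The cutoffs `small` and `large` and the algorithm chosen for each range
    are hypothetical; they only illustrate a size-dependent choice.
    """
    if num_vectors < small:
        return "SQ"    # scalar quantization for small subspaces (assumed)
    if num_vectors < large:
        return "HNSW"  # graph index for medium subspaces (assumed)
    return "PQ"        # product quantization for large subspaces (assumed)


# Hypothetical subspace sizes keyed by the second vector that represents them.
subspace_sizes = {"centroid-1": 500, "centroid-2": 50_000, "centroid-3": 5_000_000}
sub_index_plan = {cid: choose_sub_index_type(n) for cid, n in subspace_sizes.items()}
```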
In this embodiment of the application, after obtaining the first vector, the routing node can determine a target index node from the multiple index nodes according to the first vector and the routing information, where the vector similarity between the first vector and a target vector among the multiple second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information.
In a possible implementation, a target vector whose vector similarity to the first vector is greater than a threshold can be determined from the multiple second vectors in the routing information. The target vector may be the second vector that is most similar, or relatively similar, to the first vector; for example, it can be determined using the first index (coarse quantizer index) built above. There may be one or more target vectors. When there is one target vector, it may be the second vector most similar to the first vector; when there are several, they may be the second vectors ranked highest by similarity to the first vector.
It should be understood that when the first index (coarse quantizer index) is a hierarchical graph structure such as HNSW, the determined target vector approximates, as closely as possible, a vector with high similarity to the first vector; it need not be the exact highest-similarity vector.
It should be understood that the number of target vectors may be preset. The more target vectors there are, the higher the accuracy of the final retrieval result may be. However, the determined target vectors may then be spread over more index nodes, which increases the number of concurrent requests and may increase the latency of the retrieval process. In practice, a balanced value can be chosen based on the required retrieval accuracy and latency.
It should be understood that the threshold here may be related to the number of target vectors and to the structure of the first index. The larger the number of target vectors, the lower the threshold; the better the first index performs (for example, when it can find vectors close to those with high similarity to the first vector), the higher the threshold.
After the target vector is determined, the index node corresponding to the target vector (which may be called the target index node) can be determined according to the routing information. Determining the index node corresponding to the target vector may include determining information such as the identifier (for example, the node number) and the address of that index node. When there are multiple target vectors, the index node corresponding to each target vector can be determined.
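The routing step described above (find the target vectors, then look up their index nodes) can be sketched as follows. This is an assumption-laden illustration: it scans all second vectors exhaustively instead of using a coarse quantizer index such as HNSW or LSH, it uses cosine similarity, and the names `select_target_nodes`, `m`, and `threshold` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_target_nodes(first_vector, second_vectors, routing_info, m=2, threshold=0.5):
    """Return (target_vector_ids, target_index_node_ids) for one query.

    second_vectors: {vector_id: vector}
    routing_info:   {vector_id: [index_node_id, ...]}
    m:              preset number of target vectors to keep (assumed parameter)
    threshold:      minimum similarity for a target vector (assumed parameter)
    """
    scored = sorted(
        ((cosine(first_vector, vec), vec_id) for vec_id, vec in second_vectors.items()),
        reverse=True,
    )
    targets = [vec_id for sim, vec_id in scored[:m] if sim > threshold]
    nodes = sorted({node for vec_id in targets for node in routing_info[vec_id]})
    return targets, nodes


# Toy data (hypothetical): three second vectors spread over two index nodes.
second_vectors = {"c1": [1.0, 0.0], "c2": [0.8, 0.6], "c3": [0.0, 1.0]}
routing_info = {"c1": [1], "c2": [1], "c3": [2]}
targets, nodes = select_target_nodes([1.0, 0.2], second_vectors, routing_info)
```

In this toy case both target vectors live on index node 1, so the query is forwarded to a single node even though two target vectors were selected. Raising `m` tends to widen the set of nodes contacted, which is the accuracy versus latency trade-off noted above.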
Because the third vectors stored on the target index node are the third vectors in the cluster category to which the target vector belongs, the vector similarity between the target vector and each of the multiple third vectors is greater than a threshold, and the vector similarity between different third vectors among the multiple third vectors is greater than a threshold. Optionally, in a possible implementation, the target vector may be one of the multiple third vectors (for example, the cluster centroid of the multiple third vectors).
In a possible implementation, the target index node may also store vectors other than the multiple third vectors (these vectors can also serve as third vectors for data retrieval; their similarity to the multiple third vectors is low, but they are few in number on the target index node). That is, the target index node may store multiple vectors used for data retrieval, and the multiple third vectors are those whose share of the multiple vectors exceeds a target ratio. For example, the target ratio may be a value close to 1, such as 80%, 90%, or 95%.
In a possible implementation, the vector similarity between the first vector and each of M target vectors among the multiple second vectors is greater than a threshold, the M target vectors correspond to the target index node in the routing information, and M is a positive integer greater than 1. To ensure retrieval accuracy, multiple target vectors that are highly similar to the first vector can be determined in the routing information. This increases the number of third vectors used in the subsequent retrieval for the first vector and thereby improves the accuracy of the retrieval result.
In a possible implementation, the target index node includes one or more index nodes, and the number of index nodes included is less than a preset number. The preset number may be, for example, 2, 3, 4, or 5.
In this embodiment of the application, the routing node determines, based on similarity, the index node to which the query should be routed, and the third vectors on the index nodes are also deployed based on vector similarity (highly similar vectors are deployed on the same index node, and less similar vectors on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (the candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one index node or a small number of index nodes to obtain an accurate result. This improves similarity retrieval performance at large vector scale, allows the concurrent retrieval performance of the cluster to grow linearly with cluster size, and effectively addresses the performance and scalability problems of vector similarity retrieval over massive vector collections.
403. The routing node passes the first vector to the target index node.
In this embodiment of the application, after determining the target index node, the routing node can pass the first vector to the target index node, and the target index node then performs data retrieval for the first vector based on the multiple third vectors it stores.
It should be understood that when there are multiple target index nodes, the routing node can pass the first vector to each of them, so that each target index node performs data retrieval for the first vector based on the multiple third vectors it stores.
404. The target index node determines the retrieval result of the first vector from the multiple third vectors it stores.
In this embodiment of the application, after receiving the first vector, the target index node can determine the retrieval result of the first vector from the multiple third vectors it stores. Optionally, the retrieval result can be determined from the multiple third vectors based on a similarity comparison between the first vector and the stored third vectors.
In a possible implementation, the target index node stores third vectors of multiple vector subspaces, and for the third vectors in the vector subspaces deployed on each index node, different ANN algorithms are selected according to the number of vectors for ANN model training, generating a sub-index model for each vector subspace. The routing node may therefore also pass information about the vector subspace corresponding to the target vector to the target index node. The target index node can then obtain the sub-index model corresponding to that vector subspace and, based on the sub-index model, compare the first vector with the stored third vectors for similarity to determine the retrieval result from the multiple third vectors.
The retrieval result may be some of the multiple third vectors (for example, the one or more third vectors with the highest similarity).
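Step 404 can be pictured with an exhaustive top-k search over the third vectors held by one index node. This is only a sketch: a deployed index node would use its trained sub-index model rather than a full scan, and the function name, the dictionary layout, and the choice of cosine similarity are assumptions.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_index_node(first_vector, third_vectors, k=2):
    """Return the ids of the k stored third vectors most similar to the query.

    third_vectors: {vector_id: vector} held by one index node (assumed layout).
    """
    return heapq.nlargest(
        k, third_vectors, key=lambda vid: cosine(first_vector, third_vectors[vid])
    )


# Toy data (hypothetical) for a single index node.
stored = {"doc_a": [1.0, 0.0], "doc_b": [0.9, 0.1], "doc_c": [0.0, 1.0]}
result = search_index_node([1.0, 0.05], stored, k=2)
```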
In a possible implementation, the routing node may also pass the target vector to the target index node. The target index node can then determine one or more third vectors from the multiple third vectors it stores, based on the target vector and a first mapping relationship, where the first mapping relationship includes the multiple second vectors and the correspondence between each second vector and the third vectors. The retrieval result is then determined from the one or more third vectors based on the first vector.
In a possible implementation, each index node may store one or more second vectors corresponding to the third vectors it stores, and these second vectors can be used to represent those third vectors. During retrieval, the index node can determine, based on the target vector (that is, one of the second vectors), the one or more third vectors corresponding to the target vector. The target vector serves as a representation of those third vectors, and the index node can then determine the retrieval result from them based on the first vector.
Because the target vector serves as a representation of the one or more third vectors, and the similarity between the first vector and the target vector is greater than the threshold, determining the retrieval result from the one or more third vectors speeds up retrieval while still meeting the required retrieval accuracy.
In a possible implementation, the target index node may determine, according to the first vector, a subset of third vectors from the multiple third vectors it stores, where every third vector in the subset belongs to the same cluster, and the vector similarity between the cluster center of that cluster and the first vector is greater than a threshold. The retrieval result is then determined from the subset of third vectors based on the first vector.
In a possible implementation, each index node may store one or more fourth vectors corresponding to the third vectors it stores, and these fourth vectors can be used to represent those third vectors. During retrieval, the index node can determine the fourth vector with the highest similarity to the first vector from the one or more fourth vectors and obtain the one or more third vectors corresponding to it. The fourth vector serves as a representation of those third vectors, and the index node can then determine the retrieval result from them based on the first vector.
Because the fourth vector serves as a representation of the one or more third vectors, and the similarity between the fourth vector and the first vector is greater than the threshold, determining the retrieval result from the one or more third vectors speeds up retrieval while still meeting the required retrieval accuracy.
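The two-stage search on an index node described above, in which a representative (fourth) vector is chosen first and only its third vectors are then compared, can be sketched as follows. The data layout, the function name, and the use of cosine similarity are assumptions made for this illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_with_cluster_centers(first_vector, clusters, k=1):
    """Two-stage search on one index node.

    clusters: {cluster_id: (fourth_vector, {third_vector_id: vector})}
        where the fourth vector is the cluster center (assumed layout).
    Returns (chosen_cluster_id, top_k_third_vector_ids).
    """
    # Stage 1: pick the cluster whose center is most similar to the query.
    best = max(clusters, key=lambda cid: cosine(first_vector, clusters[cid][0]))
    members = clusters[best][1]
    # Stage 2: rank only the third vectors inside that cluster.
    ranked = sorted(members, key=lambda vid: cosine(first_vector, members[vid]),
                    reverse=True)
    return best, ranked[:k]


# Toy data (hypothetical): two clusters on one index node.
clusters = {
    "A": ([1.0, 0.0], {"a1": [1.0, 0.1], "a2": [1.0, -0.2]}),
    "B": ([0.0, 1.0], {"b1": [0.1, 1.0]}),
}
chosen, hits = search_with_cluster_centers([1.0, 0.1], clusters, k=1)
```

Only the members of the chosen cluster are scored in stage 2, which is where the speed-up comes from. The cost is that a close vector sitting in a neighboring cluster can be missed.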
The data retrieval method provided by this application is described next in the application scenario of web page semantic feature indexing.
In a possible implementation, the routing node can receive the semantic vector to be retrieved, determine from the routing information the number of the semantic vector shard to be accessed (shard 1), and forward the semantic vector to be retrieved to that shard. Shard 1 compares the semantic vector to be retrieved with the subspace model on shard 1 to obtain the corresponding sub-index, uses the sub-index to compute similarity with the semantic vector to be retrieved, obtains the identifiers of the one or more most similar third vectors, and returns them to the routing node.
In a possible implementation, in the index construction phase, the set of web page semantic vectors to be indexed can first be clustered by similarity, according to the number of similarity subspaces to be divided, to obtain the first cluster centroid set of the global vector space. The generated first cluster centroid set can then be clustered according to the number of shards to obtain the second cluster centroid set. Each centroid in the second cluster centroid set corresponds to one web page semantic vector shard. A first index, called the coarse quantizer index, is built over the first cluster centroid set, and a mapping relationship is built between the first index and the web page semantic vector shards. The first index and the mapping relationship together form a routing device, which is installed on the routing node. The first cluster centroid set is distributed to the semantic vector shards according to this mapping relationship to generate the subspace models. Using the routing device, the first cluster centroid and the second cluster centroid corresponding to each original vector to be indexed are found, and the vector is stored in the vector subspace corresponding to that first cluster centroid, within the semantic vector shard corresponding to that second cluster centroid. The generated subspace models are assigned to the semantic vector shards according to the second cluster centroids, and sub-index models are built: the semantic vectors stored in the vector subspace of each first cluster centroid are used to generate the sub-index model for that vector subspace. For the original vectors stored in the vector subspace of each first cluster centroid, a second index is built using the sub-index model. All generated sub-indexes are distributed to the shards according to the subspace models.
In a possible implementation, in the retrieval phase, the routing device can be used at the cooperative forwarding node to find the corresponding semantic vector shard and route the query to that shard. On the shard that receives the query, the subspace model is consulted to find the vector subspace whose centroid, in the first cluster centroid set, is most similar to the query. Within that vector subspace, the sub-index generated for it is invoked to compute the group of vectors most similar to the original vector to be retrieved, and the result is returned to the routing node.
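The retrieval phase above can be traced end to end with a toy in-memory model. Everything in this sketch is assumed for illustration: the dictionaries stand in for the coarse quantizer index, the centroid-to-shard mapping, and the per-subspace sub-indexes; the shard and page identifiers are invented; and all searches are exhaustive scans using cosine similarity.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# First-level centroids (stand-in for the coarse quantizer index).
CENTROIDS = {"s0": [1.0, 0.0], "s1": [0.7, 0.7], "s2": [0.0, 1.0]}
# Mapping from first-level centroid to semantic vector shard (routing information).
SHARD_OF = {"s0": "shard-1", "s1": "shard-1", "s2": "shard-2"}
# Vector subspaces: page semantic vectors grouped by first-level centroid.
SUBSPACES = {
    "s0": {"page-1": [0.95, 0.05], "page-2": [0.9, -0.1]},
    "s1": {"page-3": [0.6, 0.8]},
    "s2": {"page-4": [0.1, 0.9]},
}


def retrieve(query, k=1):
    """Route a query to a shard, pick a subspace there, and rank its vectors."""
    # Routing node: the nearest first-level centroid decides the shard.
    nearest = max(CENTROIDS, key=lambda cid: cosine(query, CENTROIDS[cid]))
    shard = SHARD_OF[nearest]
    # Shard: consult its own subspace model (the centroids assigned to it).
    local = [cid for cid, s in SHARD_OF.items() if s == shard]
    subspace_id = max(local, key=lambda cid: cosine(query, CENTROIDS[cid]))
    # Shard: search inside the chosen vector subspace.
    members = SUBSPACES[subspace_id]
    ranked = sorted(members, key=lambda pid: cosine(query, members[pid]),
                    reverse=True)
    return shard, ranked[:k]
```

For example, `retrieve([1.0, 0.0])` is routed to `shard-1` and only the two vectors in subspace `s0` are compared, so `shard-2` is never contacted.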
An embodiment of this application provides a data retrieval method. The method is applied to a computer system that includes a routing node and multiple index nodes, and the method includes: the routing node obtains a first vector and routing information, where the routing information includes multiple second vectors and the index node corresponding to each second vector; the routing node determines a target index node from the multiple index nodes according to the first vector and the routing information, where the vector similarity between the first vector and a target vector among the multiple second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; the routing node passes the first vector to the target index node; and the target index node determines a retrieval result from multiple stored third vectors based on a similarity comparison between the first vector and those third vectors. In this embodiment of the application, the routing node determines, based on similarity, the index node to which the query should be routed, and the third vectors on the index nodes are also deployed based on vector similarity (highly similar vectors are deployed on the same index node, and less similar vectors on different index nodes). In other words, the third vectors that are highly similar to the first vector are all stored on specific index nodes (the candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one index node or a small number of index nodes to obtain an accurate result. This improves similarity retrieval performance at large vector scale, allows the concurrent retrieval performance of the cluster to grow linearly with cluster size, and effectively addresses the performance and scalability problems of vector similarity retrieval over massive vector collections.
A data retrieval method provided by an embodiment of this application is described next with the routing node as the executing entity. Referring to FIG. 5, FIG. 5 is a schematic flowchart of a data retrieval method provided by an embodiment of this application. The method includes:
501. Obtain a first vector.
For a description of step 501, refer to the description of step 401 in the foregoing embodiment; details are not repeated here.
502. Obtain a target index node from the multiple index nodes according to the first vector and routing information, where the routing information includes multiple second vectors and the index node corresponding to each second vector, each second vector represents one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, the vector similarity between the first vector and a target vector among the multiple second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information.
其中,步骤502的描述可以参照上述实施例中步骤402的描述,这里不再赘述。Wherein, the description of step 502 may refer to the description of step 402 in the foregoing embodiment, and details are not repeated here.
503、向所述目标索引节点传递所述第一向量,所述第一向量用于指示所述目标索引节点从自身存储的多个第三向量中确定所述第一向量的检索结果。503. Transfer the first vector to the target index node, where the first vector is used to instruct the target index node to determine a retrieval result of the first vector from multiple third vectors stored by itself.
其中,步骤503的描述可以参照上述实施例中步骤403以及步骤404的描述,这里不再赘述。For the description of step 503, reference may be made to the description of step 403 and step 404 in the above-mentioned embodiment, and details are not repeated here.
In this embodiment of the present application, the routing node determines, based on similarity, the index node to which the query needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with higher similarity are deployed on the same index node, and vectors with lower similarity are deployed on different index nodes). That is, the third vectors that are highly similar to the first vector are stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrency of cluster retrieval can grow linearly with the cluster size, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
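Steps 501 to 503 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: cosine similarity stands in for the unspecified vector similarity, and the names `cosine`, `route`, and the list-of-pairs layout of `routing_info` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route(first_vector, routing_info, threshold):
    """Step 502: pick the index nodes whose second vectors exceed the threshold.

    routing_info is a list of (second_vector, index_node_id) pairs
    (hypothetical layout). Each node is returned once, in first-match order.
    """
    targets = []
    for second_vector, node_id in routing_info:
        similar = cosine(first_vector, second_vector) > threshold
        if similar and node_id not in targets:
            targets.append(node_id)
    return targets


# Two second vectors point at node-a, one at node-b.
routing_info = [
    ([1.0, 0.0], "node-a"),
    ([0.9, 0.1], "node-a"),
    ([0.0, 1.0], "node-b"),
]

# Step 503 would then send the first vector only to the returned nodes.
print(route([1.0, 0.05], routing_info, threshold=0.8))
```

In this sketch the routing node compares the query against a short list of second vectors only, so the cost of routing does not depend on how many third vectors the index nodes hold.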
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
In a possible implementation, each second vector corresponds to one cluster obtained by clustering, and each second vector is the vector corresponding to the cluster center of that cluster.
In a possible implementation, each index node is configured to store a plurality of third vectors by cluster, and each cluster includes at least one third vector.
In a possible implementation, the third vectors stored on each index node are the vectors included in one or more clusters.
In a possible implementation, the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than the threshold, and the at least two target vectors correspond to the target index node in the routing information.
In a possible implementation, the target index node includes one or more index nodes.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
Next, a data retrieval method provided by an embodiment of the present application is described with an index node as the execution subject. The method may be applied to a target index node that stores a plurality of third vectors, where the vector similarity between different third vectors among the plurality of third vectors is greater than a threshold. Referring to FIG. 6, FIG. 6 is a schematic flowchart of a data retrieval method provided by an embodiment of the present application. The method includes:
601. Acquire a first vector.
For the description of step 601, refer to the description of step 403 in the foregoing embodiment; details are not repeated here.
602. Determine a retrieval result for the first vector from the plurality of third vectors stored on the target index node.
For the description of step 602, refer to the description of step 404 in the foregoing embodiment; details are not repeated here.
In a possible implementation, the target index node stores a plurality of vectors used for data retrieval, and the plurality of third vectors are vectors whose share of the plurality of vectors is greater than a target ratio.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
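Steps 601 and 602 can be illustrated with a small top-k search on the index node. This is a hedged sketch: the text does not fix the similarity measure or the size of the result, so cosine similarity, the parameter `k`, and the `search` function with its dictionary layout are assumptions made for illustration.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(first_vector, third_vectors, k):
    """Step 602: return the ids of the k stored vectors most similar to the query.

    third_vectors maps a data-object id to its vector (hypothetical layout).
    """
    scored = (
        (cosine(first_vector, vector), object_id)
        for object_id, vector in third_vectors.items()
    )
    # nlargest keeps only k candidates in memory while scanning all vectors.
    return [object_id for _, object_id in heapq.nlargest(k, scored)]


stored = {
    "img-1": [1.0, 0.0],
    "img-2": [0.8, 0.6],
    "img-3": [0.0, 1.0],
}
print(search([1.0, 0.1], stored, k=2))
```

The result is a subset of the stored third vectors, which matches the implementation in which the retrieval result is only part of the plurality of third vectors.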
In addition, the present application further provides a computer system including a first index node and a second index node. The first index node includes a first memory and a first processor, and the second index node includes a second memory and a second processor. The first memory is configured to store a plurality of first vectors used for data retrieval, where the similarity between different first vectors among the plurality of first vectors is greater than a threshold and each first vector is a representation of a data object. The second memory is configured to store a plurality of second vectors used for data retrieval, where the similarity between different second vectors among the plurality of second vectors is greater than the threshold, each second vector is a representation of a data object, and the vector similarity between the plurality of first vectors and the plurality of second vectors is less than the threshold. The first processor is configured to perform data retrieval based on the plurality of first vectors, and the second processor is configured to perform data retrieval based on the plurality of second vectors.
In this embodiment of the present application, the third vectors on the index nodes are deployed based on vector similarity (vectors with higher similarity are deployed on the same index node, and vectors with lower similarity are deployed on different index nodes). That is, the third vectors that are highly similar to the first vector are stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrency of cluster retrieval can grow linearly with the cluster size, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
In a possible implementation, the plurality of first vectors are vectors included in one or more first clusters, the plurality of second vectors are vectors included in one or more second clusters, and the first clusters are different from the second clusters.
In a possible implementation, the first clusters and the second clusters are clusters obtained by clustering.
In a possible implementation, the first processor is configured to determine a retrieval result from the plurality of first vectors based on a third vector, where the third vector is a representation of a first retrieval object;
the second processor is configured to determine a retrieval result from the plurality of second vectors based on a fourth vector, where the fourth vector is a representation of a second retrieval object.
In a possible implementation, both the first index node and the second index node are communicatively connected to a routing node, and both the third vector and the fourth vector are sent by the routing node.
In a possible implementation, the first retrieval object and the second retrieval object each include one or more of text data, audio data, image data, or video data.
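The similarity-based deployment described for the two index nodes can be sketched as placing whole clusters on nodes, so that the members of one cluster never end up on different nodes. The round-robin placement, the `deploy` function, and the dictionary layout below are assumptions made for illustration; the text does not prescribe how clusters are assigned to index nodes.

```python
def deploy(clusters, node_ids):
    """Place each cluster on exactly one index node and derive routing information.

    clusters maps a cluster id to {"center": vector, "members": {id: vector}}
    (hypothetical layout). Returns (routing_info, node_storage), where
    routing_info is a list of (cluster_center, node_id) pairs.
    """
    if not node_ids:
        raise ValueError("at least one index node is required")
    routing_info = []
    node_storage = {node_id: {} for node_id in node_ids}
    for position, (cluster_id, cluster) in enumerate(sorted(clusters.items())):
        # Round-robin placement is an assumption, not taken from the text.
        node_id = node_ids[position % len(node_ids)]
        routing_info.append((cluster["center"], node_id))
        node_storage[node_id][cluster_id] = cluster["members"]
    return routing_info, node_storage


clusters = {
    "c0": {"center": [1.0, 0.0], "members": {"a": [0.9, 0.1]}},
    "c1": {"center": [0.0, 1.0], "members": {"b": [0.1, 0.9]}},
    "c2": {"center": [-1.0, 0.0], "members": {"c": [-0.9, 0.1]}},
}
routing_info, node_storage = deploy(clusters, ["node-1", "node-2"])
print(routing_info)
```

Because each cluster center is recorded together with the node that holds the cluster, the resulting list can serve as the routing information that a routing node consults.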
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of a data retrieval device 700 provided by an embodiment of the present application. The device may be applied to a computer system that includes a routing node and a plurality of index nodes. The routing node includes:
an acquisition module 701, configured to acquire a first vector;
For a specific description of the acquisition module 701, refer to the description of step 401 in the foregoing embodiment; details are not repeated here.
a routing module 702, configured to obtain a target index node from the plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector; each second vector represents one or more third vectors stored on the corresponding index node; each third vector is a representation of a data object; the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold; and the target vector corresponds to the target index node in the routing information;
For a specific description of the routing module 702, refer to the description of step 402 in the foregoing embodiment; details are not repeated here.
a sending module 703, configured to transfer the first vector to the target index node.
For a specific description of the sending module 703, refer to the description of step 403 in the foregoing embodiment; details are not repeated here.
The target index node includes:
a retrieval module 704, configured to determine a retrieval result for the first vector from the plurality of third vectors stored on the target index node.
For a specific description of the retrieval module 704, refer to the description of step 404 in the foregoing embodiment; details are not repeated here.
In this embodiment of the present application, the routing node determines, based on similarity, the index node to which the query needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with higher similarity are deployed on the same index node, and vectors with lower similarity are deployed on different index nodes). That is, the third vectors that are highly similar to the first vector are stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrency of cluster retrieval can grow linearly with the cluster size, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
In a possible implementation, each second vector corresponds to one cluster obtained by clustering, and each second vector is the vector corresponding to the cluster center of that cluster.
In a possible implementation, each index node is configured to store a plurality of third vectors by cluster, and each cluster includes at least one third vector.
In a possible implementation, the third vectors stored on each index node are the vectors included in one or more clusters.
In a possible implementation, the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than the threshold, and the at least two target vectors correspond to the target index node in the routing information.
In a possible implementation, the target index node includes one or more index nodes.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
In a possible implementation, the sending module is further configured to transfer the target vector to the target index node;
the retrieval module is specifically configured to: determine, based on the target vector and a first mapping relationship, the one or more third vectors from the plurality of third vectors stored on the target index node, where the first mapping relationship includes one or more second vectors and, for each of the one or more second vectors, the correspondence between that second vector and third vectors; and
determine the retrieval result from the one or more third vectors based on the first vector.
In a possible implementation, the retrieval module is specifically configured to: determine, according to the first vector, a subset of third vectors from the plurality of third vectors stored on the target index node, where every third vector in the subset corresponds to the same cluster, and the vector similarity between the cluster center of that cluster and the first vector is greater than a threshold; and
determine the retrieval result from the subset of third vectors based on the first vector.
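The two retrieval-module variants above can be sketched side by side. In the first, the routing node also passes the target vector, and the index node looks up the matching third vectors in the first mapping relationship. In the second, the index node itself selects the cluster whose center is similar enough to the first vector. Cosine similarity, the result size `k`, the function names, and the dictionary keyed by cluster-center tuples are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(first_vector, members, k):
    """Order candidate ids by similarity to the query and keep the best k."""
    by_similarity = sorted(
        members,
        key=lambda object_id: cosine(first_vector, members[object_id]),
        reverse=True,
    )
    return by_similarity[:k]


def search_with_target(first_vector, target_vector, first_mapping, k):
    """Variant 1: the target vector selects the candidate third vectors."""
    members = first_mapping.get(tuple(target_vector), {})
    return rank(first_vector, members, k)


def search_by_center(first_vector, first_mapping, threshold, k):
    """Variant 2: the node picks the most similar cluster center above the threshold."""
    best = max(first_mapping, key=lambda c: cosine(first_vector, c), default=None)
    if best is None or cosine(first_vector, best) <= threshold:
        return []
    return rank(first_vector, first_mapping[best], k)


# first_mapping: second vector (as a tuple) -> {object id: third vector}
first_mapping = {
    (1.0, 0.0): {"doc-1": [0.9, 0.1], "doc-2": [1.0, 0.3]},
    (0.0, 1.0): {"doc-3": [0.1, 0.9]},
}
print(search_with_target([1.0, 0.0], [1.0, 0.0], first_mapping, k=1))
print(search_by_center([0.0, 1.0], first_mapping, threshold=0.8, k=5))
```

Either way, the final comparison against the first vector runs over one cluster rather than over every third vector stored on the node.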
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a data retrieval device provided by an embodiment of the present application. The device 800 includes:
an acquisition module 801, configured to acquire a first vector;
For a specific description of the acquisition module 801, refer to the description of step 501 in the foregoing embodiment; details are not repeated here.
a routing module 802, configured to obtain a target index node from the plurality of index nodes according to the first vector and routing information, where the routing information includes a plurality of second vectors and the index node corresponding to each second vector; each second vector represents one or more third vectors stored on the corresponding index node; each third vector is a representation of a data object; the vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold; and the target vector corresponds to the target index node in the routing information;
For a specific description of the routing module 802, refer to the description of step 502 in the foregoing embodiment; details are not repeated here.
a sending module 803, configured to transfer the first vector to the target index node, where the first vector instructs the target index node to determine a retrieval result for the first vector from the plurality of third vectors that the target index node stores.
For a specific description of the sending module 803, refer to the description of step 503 in the foregoing embodiment; details are not repeated here.
In this embodiment of the present application, the routing node determines, based on similarity, the index node to which the query needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with higher similarity are deployed on the same index node, and vectors with lower similarity are deployed on different index nodes). That is, the third vectors that are highly similar to the first vector are stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrency of cluster retrieval can grow linearly with the cluster size, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
In a possible implementation, each second vector corresponds to one cluster, each cluster includes one or more third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
In a possible implementation, each second vector corresponds to one cluster obtained by clustering, and each second vector is the vector corresponding to the cluster center of that cluster.
In a possible implementation, each index node is configured to store a plurality of third vectors by cluster, and each cluster includes at least one third vector.
In a possible implementation, the third vectors stored on each index node are the vectors included in one or more clusters.
In a possible implementation, the vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than the threshold, and the at least two target vectors correspond to the target index node in the routing information.
In a possible implementation, the target index node includes one or more index nodes.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In a possible implementation, the first vector is a representation of a retrieval object, and the retrieval object includes one or more of text data, audio data, image data, or video data.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of a data retrieval device provided by an embodiment of the present application. The device 900 may be applied to a first index node that stores a plurality of first third vectors, where the vector similarity between different first third vectors among the plurality of first third vectors is greater than a threshold. The device 900 includes:
an acquisition module 901, configured to acquire a first vector, where the vector similarity between the first vector and the plurality of first third vectors is greater than a threshold;
For a specific description of the acquisition module 901, refer to the description of step 601 in the foregoing embodiment; details are not repeated here.
a retrieval module 902, configured to determine a retrieval result from the plurality of first third vectors based on a similarity comparison between the first vector and the plurality of first third vectors.
For a specific description of the retrieval module 902, refer to the description of step 602 in the foregoing embodiment; details are not repeated here.
In a possible implementation, the first index node stores a plurality of vectors used for data retrieval, and the plurality of first third vectors are vectors whose share of the plurality of vectors is greater than a target ratio.
In a possible implementation, the retrieval result is a subset of the plurality of third vectors.
In this embodiment of the present application, the routing node determines, based on similarity, the index node to which the query needs to be routed, and the third vectors on the index nodes are also deployed based on vector similarity (vectors with higher similarity are deployed on the same index node, and vectors with lower similarity are deployed on different index nodes). That is, the third vectors that are highly similar to the first vector are stored on specific index nodes (candidate index nodes). The routing node therefore only needs to send the first vector to the target index node, and the target index node can obtain an accurate retrieval result from the candidates it stores. Each retrieval needs to access only one or a small number of index nodes to obtain an accurate result, which improves similarity retrieval performance at large vector scale. The concurrency of cluster retrieval can grow linearly with the cluster size, which effectively addresses the performance and scalability problems of vector similarity retrieval in massive-vector scenarios.
Next, a terminal device provided by an embodiment of the present application is introduced. Referring to FIG. 10, FIG. 10 is a schematic structural diagram of a terminal device provided by an embodiment of the present application. The terminal device 1000 may specifically be a virtual reality (VR) device, a mobile phone, a tablet, a laptop computer, a smart wearable device, a monitoring data processing device, a server, or the like, which is not limited here. Specifically, the terminal device 1000 includes a receiver 1001, a transmitter 1002, a processor 1003, and a memory 1004 (the terminal device 1000 may include one or more processors 1003; one processor is taken as an example in FIG. 10), where the processor 1003 may include an application processor 10031 and a communication processor 10032. In some embodiments of the present application, the receiver 1001, the transmitter 1002, the processor 1003, and the memory 1004 may be connected through a bus or in another manner.
The memory 1004 may include a read-only memory and a random access memory, and provides instructions and data to the processor 1003. A part of the memory 1004 may further include a non-volatile random access memory (NVRAM). The memory 1004 stores operating instructions for the processor, executable modules, or data structures, or a subset or an extended set thereof, where the operating instructions may include various operating instructions for implementing various operations.
The processor 1003 controls the operation of the terminal device. In a specific application, the components of the terminal device are coupled together through a bus system, which may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. For clarity of description, however, the various buses are all referred to as the bus system in the figure.
The methods disclosed in the foregoing embodiments of the present application may be applied to the processor 1003 or implemented by the processor 1003. The processor 1003 may be an integrated circuit chip with signal processing capability. During implementation, the steps of the foregoing methods may be completed by an integrated logic circuit of hardware in the processor 1003 or by instructions in the form of software. The processor 1003 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 1003 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium that is mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 1004, and the processor 1003 reads the information in the memory 1004 and completes the steps of the foregoing methods in combination with its hardware.
The receiver 1001 may be configured to receive input digit or character information and to generate signal input related to the settings and function control of the terminal device. The transmitter 1002 may be configured to output digit or character information through a first interface; the transmitter 1002 may further be configured to send an instruction to a disk group through the first interface to modify data in the disk group; and the transmitter 1002 may further include a display device such as a display screen.
In this embodiment of the present application, in one case, the processor 1003 is configured to execute the steps performed on the device side in the data retrieval method described in the foregoing embodiments.
An embodiment of the present application further provides a server. Referring to FIG. 11, FIG. 11 is a schematic structural diagram of a server provided by an embodiment of the present application. Specifically, the server 1100 is implemented by one or more servers. The server 1100 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPU) 1111 (for example, one or more processors), a memory 1132, and one or more storage media 1130 (for example, one or more mass storage devices) that store application programs 1142 or data 1144. The memory 1132 and the storage medium 1130 may provide transient storage or persistent storage. The program stored in the storage medium 1130 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the server. Further, the central processing unit 1111 may be configured to communicate with the storage medium 1130 and to execute, on the server 1100, the series of instruction operations in the storage medium 1130.
The server 1100 may further include one or more power supplies 1126, one or more wired or wireless network interfaces 1150, and one or more input/output interfaces 1158; or one or more operating systems 1141, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
In this embodiment of the present application, the central processing unit 1111 is configured to execute the data retrieval method described in the embodiments corresponding to FIG. 4 and FIG. 6.
An embodiment of this application further provides a computer program product that, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing server.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a program for signal processing that, when run on a computer, causes the computer to perform the steps performed by the foregoing execution device, or causes the computer to perform the steps performed by the foregoing server.
The execution device, server, or terminal device provided in the embodiments of this application may specifically be a chip. The chip includes a processing unit and a communication unit. The processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute computer-executable instructions stored in a storage unit, so that the chip in the execution device performs the data processing method described in the foregoing embodiments, or so that the chip in the server performs the data processing method described in the foregoing embodiments. Optionally, the storage unit is a storage unit inside the chip, such as a register or a cache. The storage unit may also be a storage unit located outside the chip within the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, refer to FIG. 12, which is a schematic structural diagram of a chip provided in an embodiment of this application. The data retrieval method described in the embodiments corresponding to FIG. 4, FIG. 5, and FIG. 6 can be implemented in the chip shown in FIG. 12. Specifically, the chip may take the form of a neural network processing unit (NPU) 1200. The NPU 1200 is mounted on a host CPU as a coprocessor, and the host CPU assigns tasks to it. The core part of the NPU is the operation circuit 1203. The controller 1204 controls the operation circuit 1203 to fetch data from a memory (the weight memory or the input memory) and perform operations on it.
The data retrieval method described in the embodiments corresponding to FIG. 4, FIG. 5, and FIG. 6 can be completed jointly by the host CPU and the NPU in the chip shown in FIG. 12.
In some implementations, the operation circuit 1203 internally includes a plurality of processing engines (PEs). In some implementations, the operation circuit 1203 is a two-dimensional systolic array. The operation circuit 1203 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 1203 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The operation circuit fetches the data corresponding to matrix B from the weight memory 1202 and caches it on each PE in the operation circuit. The operation circuit fetches the data of matrix A from the input memory 1201, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator 1208.
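The accumulation described above can be illustrated in software. The following is a minimal sketch, not the NPU's actual implementation: the function name `matmul_accumulate` and the `tile` parameter are hypothetical, and the nested list `acc` merely plays the role of the accumulator 1208 by collecting partial products one slice of the shared dimension at a time.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Compute C = A x B, accumulating partial results tile by tile.

    `a` is m x k (input matrix A), `b` is k x n (weight matrix B).
    `tile` is a hypothetical slice width along the shared dimension k.
    """
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions of A and B do not match")

    # Stand-in for the accumulator: holds partial sums until all tiles are done.
    acc = [[0.0] * n for _ in range(m)]

    for k0 in range(0, k, tile):
        k1 = min(k0 + tile, k)
        for i in range(m):
            for j in range(n):
                # Partial dot product over the current slice of k.
                acc[i][j] += sum(a[i][p] * b[p][j] for p in range(k0, k1))
    return acc


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6]]
    B = [[7, 8], [9, 10], [11, 12]]
    print(matmul_accumulate(A, B))
```

After the first tile, `acc` holds only a partial result; it becomes the final matrix C once every slice of the shared dimension has been processed.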
The unified memory 1206 is used to store input data and output data. Weight data is transferred directly to the weight memory 1202 through the direct memory access controller (DMAC) 1205. Input data is also transferred to the unified memory 1206 through the DMAC.
The bus interface unit (BIU) 1210 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (IFB) 1209.
The bus interface unit 1210 allows the instruction fetch buffer 1209 to obtain instructions from an external memory, and allows the direct memory access controller 1205 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data from the external DDR memory to the unified memory 1206, to transfer weight data to the weight memory 1202, or to transfer input data to the input memory 1201.
The vector computing unit 1207 includes a plurality of operation processing units and, where necessary, further processes the output of the operation circuit, for example by vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. It is mainly used for computation in the non-convolutional and non-fully-connected layers of a neural network, such as batch normalization, pixel-level summation, and upsampling of feature planes.
In some implementations, the vector computing unit 1207 can store a processed output vector in the unified memory 1206. For example, the vector computing unit 1207 may apply a linear function or a nonlinear function to the output of the operation circuit 1203, for example to linearly interpolate the feature planes extracted by a convolutional layer, or to a vector of accumulated values in order to generate activation values. In some implementations, the vector computing unit 1207 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the operation circuit 1203, for example for use in a subsequent layer of the neural network.
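As an illustration of this kind of post-processing, the sketch below normalizes a vector of accumulated values and then applies a nonlinear function to produce activation values. It is a software analogy under stated assumptions only: the function names, the choice of ReLU as the nonlinear function, and the `eps` constant are illustrative and are not specified by this application.

```python
import math
from typing import List


def normalize(values: List[float], eps: float = 1e-5) -> List[float]:
    """Shift and scale a vector to zero mean and (approximately) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def relu(values: List[float]) -> List[float]:
    """A simple nonlinear function: negative entries are clamped to zero."""
    return [max(0.0, v) for v in values]


def postprocess(accumulated: List[float]) -> List[float]:
    """Turn a vector of accumulated values into activation values."""
    return relu(normalize(accumulated))


if __name__ == "__main__":
    print(postprocess([1.0, 2.0, 3.0]))
```

The returned list could then be fed back as the activation input of a subsequent layer, which mirrors the data flow described above.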
The instruction fetch buffer 1209, which is connected to the controller 1204, is used to store the instructions used by the controller 1204.
The unified memory 1206, the input memory 1201, the weight memory 1202, and the instruction fetch buffer 1209 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the foregoing may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the foregoing programs.
It should also be noted that the apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided in this application, a connection between modules indicates that they have a communication connection, which may be specifically implemented as one or more communication buses or signal lines.
From the description of the foregoing implementations, a person skilled in the art can clearly understand that this application can be implemented by software plus the necessary general-purpose hardware, and can of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can easily be implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also vary, for example analog circuits, digital circuits, or dedicated circuits. For this application, however, a software implementation is the better implementation in most cases. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a training device, a network device, or the like) to perform the methods described in the embodiments of this application.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium on which a computer can store data, or a data storage device, such as a training device or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).

Claims (29)

  1. A data retrieval system, wherein the data retrieval system comprises a routing node and a plurality of index nodes, wherein
    the routing node is configured to: obtain a first vector;
    obtain a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information comprises a plurality of second vectors and an index node corresponding to each of the second vectors, each of the second vectors is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, a vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; and
    transmit the first vector to the target index node; and
    the target index node is configured to determine a retrieval result of the first vector from a plurality of third vectors stored on the target index node.
  2. The system according to claim 1, wherein each of the second vectors corresponds to one cluster, each cluster comprises one or more of the third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
  3. The system according to claim 1 or 2, wherein each of the second vectors corresponds to one cluster, and each of the second vectors is a vector corresponding to a cluster center of that cluster.
  4. The system according to any one of claims 1 to 3, wherein each of the index nodes is configured to store a plurality of the third vectors by cluster, and each cluster comprises at least one of the third vectors.
  5. The system according to any one of claims 1 to 4, wherein the third vectors stored on each of the index nodes are vectors comprised in one or more clusters.
  6. The system according to any one of claims 1 to 5, wherein a vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than the threshold, and the at least two target vectors correspond to the target index node in the routing information.
  7. The system according to any one of claims 1 to 6, wherein the target index node comprises one or more index nodes.
  8. The system according to any one of claims 1 to 7, wherein the retrieval result is a subset of the plurality of third vectors.
  9. The system according to any one of claims 1 to 8, wherein the first vector is a representation of a retrieval object, and the retrieval object comprises one or more of text data, audio data, image data, or video data.
  10. The system according to any one of claims 1 to 9, wherein the routing node is further configured to transmit the target vector to the target index node; and
    the target index node is specifically configured to:
    determine, based on the target vector and a first mapping relationship, the one or more third vectors from the plurality of third vectors stored on the target index node, wherein the first mapping relationship indicates a mapping between the target vector and one or more third vectors among the plurality of third vectors; and
    determine the retrieval result of the first vector from the one or more third vectors.
  11. The system according to any one of claims 1 to 9, wherein the target index node is specifically configured to:
    determine, according to the first vector, a subset of third vectors from the plurality of third vectors stored on the target index node, wherein the subset of third vectors corresponds to a same cluster, and a vector similarity between a cluster center of the cluster and the first vector is greater than a threshold; and
    determine the retrieval result of the first vector from the subset of third vectors.
  12. A data retrieval method, wherein the method comprises:
    obtaining a first vector;
    obtaining a target index node from the plurality of index nodes according to the first vector and routing information, wherein the routing information comprises a plurality of second vectors and an index node corresponding to each of the second vectors, each of the second vectors is used to represent one or more third vectors stored on the corresponding index node, each third vector is a representation of a data object, a vector similarity between the first vector and a target vector among the plurality of second vectors is greater than a threshold, and the target vector corresponds to the target index node in the routing information; and
    transmitting the first vector to the target index node, wherein the first vector is used to instruct the target index node to determine a retrieval result of the first vector from a plurality of third vectors stored on the target index node.
  13. The method according to claim 12, wherein each of the second vectors corresponds to one cluster, each cluster comprises one or more of the third vectors, and different second vectors among the plurality of second vectors correspond to different clusters.
  14. The method according to claim 12 or 13, wherein each of the second vectors corresponds to one cluster, and each of the second vectors is a vector corresponding to a cluster center of that cluster.
  15. The method according to any one of claims 12 to 14, wherein each of the index nodes is configured to store a plurality of the third vectors by cluster, and each cluster comprises at least one of the third vectors.
  16. The method according to any one of claims 12 to 15, wherein the third vectors stored on each of the index nodes are vectors comprised in one or more clusters.
  17. The method according to any one of claims 12 to 16, wherein a vector similarity between the first vector and each of at least two target vectors among the plurality of second vectors is greater than the threshold, and the at least two target vectors correspond to the target index node in the routing information.
  18. The method according to any one of claims 12 to 17, wherein the target index node comprises one or more index nodes.
  19. The method according to any one of claims 12 to 18, wherein the retrieval result is a subset of the plurality of third vectors.
  20. The method according to any one of claims 12 to 19, wherein the first vector is a representation of a retrieval object, and the retrieval object comprises one or more of text data, audio data, image data, or video data.
  21. A computer system, comprising a first index node and a second index node, wherein the first index node comprises a first memory and a first processor, and the second index node comprises a second memory and a second processor, wherein
    the first memory is configured to store a plurality of first vectors used for data retrieval, wherein a similarity between different first vectors among the plurality of first vectors is greater than a threshold, and each first vector is a representation of a data object;
    the second memory is configured to store a plurality of second vectors used for data retrieval, wherein a similarity between different second vectors among the plurality of second vectors is greater than the threshold, each second vector is a representation of a data object, and a vector similarity between the plurality of first vectors and the plurality of second vectors is less than the threshold;
    the first processor is configured to perform data retrieval based on the plurality of first vectors; and
    the second processor is configured to perform data retrieval based on the plurality of second vectors.
  22. The computer system according to claim 21, wherein the plurality of first vectors are vectors comprised in one or more first clusters, the plurality of second vectors are vectors comprised in one or more second clusters, and the first cluster is different from the second cluster.
  23. The computer system according to claim 21 or 22, wherein the first cluster and the second cluster are clusters obtained by clustering.
  24. The computer system according to claim 22 or 23, wherein
    the first processor is configured to determine a retrieval result from the plurality of first vectors based on a third vector, wherein the third vector is a representation of a first retrieval object; and
    the second processor is configured to determine a retrieval result from the plurality of second vectors based on a fourth vector, wherein the fourth vector is a representation of a second retrieval object.
  25. The computer system according to claim 24, wherein the first index node and the second index node are both communicatively connected to a routing node, and the third vector and the fourth vector are both sent by the routing node.
  26. The computer system according to claim 24 or 25, wherein the first retrieval object and the second retrieval object comprise one or more of text data, audio data, image data, or video data.
  27. A data retrieval apparatus, wherein the apparatus comprises a memory and a processor, the memory stores code, and the processor is configured to obtain the code and perform the method according to any one of claims 12 to 20.
  28. A computer storage medium, wherein the computer storage medium stores one or more instructions that, when executed by one or more computers, cause the one or more computers to implement the method according to any one of claims 12 to 20.
  29. A computer program product, wherein the computer program product comprises code that, when executed, implements the steps of the method according to any one of claims 12 to 20.
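
For illustration only, the routing flow recited in claims 1, 6, 7, and 10 can be sketched as follows. This is a minimal sketch under stated assumptions and does not limit the claims: the class names `RoutingNode` and `IndexNode`, the cluster identifiers, the use of cosine similarity as the vector similarity, and the `top_k` parameter are all hypothetical choices made for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, used here as one possible vector similarity."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


class IndexNode:
    """Stores third vectors grouped by cluster (hypothetical layout)."""

    def __init__(self, clusters: Dict[str, List[Tuple[str, Vector]]]):
        self.clusters = clusters  # cluster id -> [(object id, third vector)]

    def search(self, query: Vector, cluster_ids: List[str], top_k: int = 1) -> List[str]:
        # Only the clusters selected by the routing node are scanned.
        candidates = [item for cid in cluster_ids for item in self.clusters[cid]]
        ranked = sorted(candidates, key=lambda item: cosine(query, item[1]), reverse=True)
        return [object_id for object_id, _ in ranked[:top_k]]


class RoutingNode:
    """Holds routing information: (cluster id, second vector, index node name)."""

    def __init__(self, routes: List[Tuple[str, Vector, str]], nodes: Dict[str, IndexNode]):
        self.routes = routes
        self.nodes = nodes

    def retrieve(self, query: Vector, threshold: float, top_k: int = 1) -> List[str]:
        # Select every second vector whose similarity to the query exceeds the threshold.
        targets: Dict[str, List[str]] = {}
        for cluster_id, centroid, node_name in self.routes:
            if cosine(query, centroid) > threshold:
                targets.setdefault(node_name, []).append(cluster_id)
        # Forward the query only to the target index nodes.
        results: List[str] = []
        for node_name, cluster_ids in targets.items():
            results.extend(self.nodes[node_name].search(query, cluster_ids, top_k))
        return results


if __name__ == "__main__":
    nodes = {
        "A": IndexNode({"c0": [("cat", [1.0, 0.1]), ("dog", [0.9, 0.2])]}),
        "B": IndexNode({"c1": [("car", [0.1, 1.0])]}),
    }
    router = RoutingNode([("c0", [1.0, 0.0], "A"), ("c1", [0.0, 1.0], "B")], nodes)
    print(router.retrieve([1.0, 0.05], threshold=0.8))
```

In this sketch a query close to the second vector of cluster `c0` is forwarded only to index node `A`, so index node `B` performs no work for that query.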
PCT/CN2022/115091 2021-08-31 2022-08-26 Data retrieval method and related device WO2023030184A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111017328.0 2021-08-31
CN202111017328.0A CN115730116A (en) 2021-08-31 2021-08-31 Data retrieval method and related equipment

Publications (1)

Publication Number Publication Date
WO2023030184A1 true WO2023030184A1 (en) 2023-03-09

Family

ID=85291868

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/115091 WO2023030184A1 (en) 2021-08-31 2022-08-26 Data retrieval method and related device

Country Status (2)

Country Link
CN (1) CN115730116A (en)
WO (1) WO2023030184A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116595065B (en) * 2023-05-09 2024-04-02 上海任意门科技有限公司 Content duplicate identification method, device, system and storage medium
CN117609285A (en) * 2023-11-09 2024-02-27 中移互联网有限公司 Vector retrieval method and device, electronic equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110019096A (en) * 2017-12-29 2019-07-16 上海全土豆文化传播有限公司 The generation method and device of index file
CN110019875A (en) * 2017-12-29 2019-07-16 上海全土豆文化传播有限公司 The generation method and device of index file
WO2019225274A1 (en) * 2018-05-25 2019-11-28 日本電信電話株式会社 Clustering device, clustering method, program, and data structure
WO2021067376A1 (en) * 2019-10-02 2021-04-08 Google Llc Accelerated embedding layer computations
US20210224583A1 (en) * 2020-01-16 2021-07-22 International Business Machines Corporation Multilevel clustering of vector-based data
CN113297264A (en) * 2020-04-10 2021-08-24 阿里巴巴集团控股有限公司 Method and device for massively parallel processing of database

Also Published As

Publication number Publication date
CN115730116A (en) 2023-03-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22863322

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE