WO2022063150A1 - Data storage method and device, and data query method and device - Google Patents

Data storage method and device, and data query method and device

Info

Publication number
WO2022063150A1
Authority
WO
WIPO (PCT)
Prior art keywords
cluster center
center point
data
target
cluster
Prior art date
Application number
PCT/CN2021/119760
Other languages
English (en)
French (fr)
Inventor
楼仁杰
李飞飞
占超群
魏闯先
Original Assignee
阿里云计算有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里云计算有限公司
Publication of WO2022063150A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Definitions

  • This specification relates to the technical field of data processing, and in particular to a data storage method and device and a data query method and device.
  • The K-means clustering algorithm clusters data according to similarity distance, dividing all the data into multiple spaces.
  • The query vector only needs to search the data points in the same space as itself to ensure high accuracy.
  • The embodiments of this specification provide a data storage method.
  • This specification also relates to a data storage device, a data query method and device, two kinds of computing devices, and two kinds of computer-readable storage media, so as to solve the technical defects existing in the prior art.
  • A data storage method, comprising:
  • clustering the data set to be stored, and determining multiple cluster center points;
  • for each cluster center point in the plurality of cluster center points, determining the corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • storing the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • Determining the corresponding neighbor cluster center point includes:
  • the point is determined as the neighbor cluster center point corresponding to the first cluster center point.
  • Storing the to-be-stored data in the to-be-stored data set according to the clustering subspace includes:
  • according to the first data to be stored and the first target cluster center point, determining the first target neighbor cluster center point corresponding to the first target cluster center point;
  • storing the first data to be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • Determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set includes:
  • sorting the second distances from nearest to farthest, selecting the first N second distances in the ordering, where N is the second preset value, and determining the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
  • Determining the first target neighbor cluster center point corresponding to the first target cluster center point according to the first data to be stored and the first target cluster center point includes:
  • the method further includes:
  • a search space corresponding to the data to be queried is determined.
  • a data query method comprising:
  • a search space corresponding to the data to be queried is determined.
  • Determining the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points includes:
  • sorting the fourth distances from nearest to farthest, selecting the first N fourth distances in the ordering, where N is the fourth preset value, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
  • Determining the target cluster subspace corresponding to the second target cluster center point includes:
  • obtaining each neighbor cluster center point corresponding to the second target cluster center point, and determining each such neighbor cluster center point as a second target neighbor cluster center point corresponding to the second target cluster center point;
  • determining the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point as the target cluster subspace.
  • Determining the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace includes:
  • sorting the fifth distances from nearest to farthest, selecting the first N fifth distances in the ordering, where N is the fifth preset value, and determining the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  • Calculating the fifth distance between the target cluster subspace and the data to be queried includes:
  • calculating a sixth distance between the target point and the data to be queried, and determining the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.
  • a data storage device comprising:
  • a first determining module configured to cluster the data set to be stored, and determine a plurality of cluster center points
  • a second determination module configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • the storage module is configured to store the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • a data query device comprising:
  • the third determination module is configured to determine the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points;
  • a fourth determination module configured to determine a target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;
  • a fifth determining module is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions to implement the following methods:
  • clustering the data set to be stored, and determining multiple cluster center points;
  • for each cluster center point in the plurality of cluster center points, determining the corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • storing the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • a computing device including:
  • the memory is used to store computer-executable instructions
  • the processor is used to execute the computer-executable instructions to implement the following methods:
  • a search space corresponding to the data to be queried is determined.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the data storage method.
  • a computer-readable storage medium which stores computer-executable instructions, and when the instructions are executed by a processor, implements the steps of the data query method.
  • The data storage method provided in this specification clusters the data set to be stored and determines multiple cluster center points; then, for each cluster center point in the multiple cluster center points, it determines the corresponding neighbor cluster center point according to the cluster center point and determines the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace; finally, it stores the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • In this case, the data to be stored is not stored directly in a single-layer space centered on a cluster center point. Instead, the neighbor cluster center points are further determined from the cluster center point, and the data to be stored is stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point, combining the clustering algorithm with the idea of the nearest neighbor graph.
  • The neighbor graph relationship of second-level subspaces is introduced, and the first-level space is further divided into second-level clustering subspaces, thereby improving the retrieval accuracy of the entire index. This second-level subspace division also saves the cost of multi-layer clustering, so index construction is faster.
  • This specification thus improves retrieval accuracy without introducing additional index construction and storage cost.
  • FIG. 1 is an ANN index structure diagram based on K-means clustering provided by an embodiment of this specification;
  • FIG. 3 is an index structure diagram combining K-means clustering and a nearest neighbor graph provided by an embodiment of this specification;
  • FIG. 5 is a schematic structural diagram of a data storage device provided by an embodiment of the present specification.
  • FIG. 6 is a schematic structural diagram of a data query apparatus provided by an embodiment of the present specification.
  • FIG. 7 is a structural block diagram of a computing device provided by an embodiment of the present specification.
  • FIG. 8 is a structural block diagram of another computing device provided by an embodiment of the present specification.
  • ANN retrieval: Approximate Nearest Neighbor search exploits the fact that data forms clustered distributions as the amount of data grows. It classifies or encodes the data in the database by analyzing and clustering it; for target data, it predicts the data category to which the target belongs according to its data characteristics, and returns some or all of that category as the retrieval result. That is, the N nearest data items (i.e., vectors) in a high-dimensional space are quickly retrieved through pre-built indexes, but only approximately accurate results can be returned, and absolute accuracy cannot be guaranteed.
  • The core idea of approximate nearest neighbor retrieval is to search for data items that are likely to be nearest neighbors rather than being limited to returning only the most probable items, improving retrieval efficiency at the expense of accuracy within an acceptable range.
  • Vector index: a type of index structure that provides ANN retrieval capabilities for high-dimensional vector data.
  • K-means clustering algorithm: an iterative clustering analysis algorithm. Its steps are to randomly select K objects as the initial cluster center points, calculate the distance between each object and each cluster center point, and assign each object to the cluster center point closest to it; a cluster center point together with the objects assigned to it represents a cluster.
  • The K-means clustering algorithm originated from a vector quantization method in signal processing, and is also a classic clustering analysis method in the field of data mining.
  • Voronoi space: a subdivision of hyperspace, characterized in that any position in a subspace is closest to the center point of that subspace and farther from the center points of adjacent subspaces, and each subspace contains one and only one center point.
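The Voronoi property above (every position belongs to the subspace of its closest center point) can be illustrated with a minimal sketch. The function name `nearest_center`, the use of Euclidean distance, and the sample coordinates are assumptions made for illustration; they are not taken from the specification.

```python
import math

def nearest_center(point, centers):
    """Return the index of the center closest to `point`, i.e. its Voronoi cell."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))

# Hypothetical 2-D center points, chosen only for illustration.
centers = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
cell = nearest_center((3.0, 1.0), centers)  # closest to (4.0, 0.0), so cell 1
```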
  • The K-means clustering algorithm clusters data according to similarity distance, dividing all the data into multiple spaces.
  • The query vector then only needs to search the data points in the same space as itself.
  • For a vector index that relies on the K-means clustering algorithm alone for space division, retrieval accuracy drops severely when the data clusters poorly, and when the query vector falls on the boundary of multiple spaces, the system must search multiple spaces adjacent to the query vector to ensure accuracy.
  • Each space is adjacent to many other spaces, which further exacerbates this problem.
  • FIG. 1 is an ANN index structure diagram based on K-means clustering.
  • The K-means clustering algorithm is used to cluster the data to be stored, and 8 different cluster center points, C0-C7, are determined.
  • Each cluster center point corresponds to one Voronoi space.
  • When the data q to be queried lies on a space boundary, the spaces of C0, C2 and C3 all have to be searched at the same time, which greatly reduces retrieval efficiency.
  • This specification proposes a data storage method and device, and a data query method and device.
  • After the to-be-stored data set is clustered and multiple cluster center points are determined, the corresponding neighbor cluster center point is determined for each cluster center point, and the space formed by the cluster center point and the corresponding neighbor cluster center point is determined as a clustering subspace; the to-be-stored data is then stored in the clustering subspaces.
  • The clustering algorithm is combined with the idea of the nearest neighbor graph.
  • On the basis of K-means clustering, the neighbor graph relationship of second-level subspaces is introduced, and each first-layer space Ci is further divided into second-layer clustering subspaces, thereby improving the retrieval efficiency and accuracy of the entire index.
  • a data storage method is provided.
  • This specification also relates to a data storage device, a data query method and device, two computing devices, and two computer-readable storage media, which are described in detail one by one in the following embodiments.
  • FIG. 2 shows a flowchart of a data storage method provided according to an embodiment of the present specification, which specifically includes the following steps:
  • Step 202: Cluster the data set to be stored, and determine a plurality of cluster center points.
  • The to-be-stored data set is the set of all to-be-stored data and includes a plurality of to-be-stored data items.
  • The K-means clustering algorithm can be used to cluster the data set to be stored, so as to determine the multiple cluster center points.
  • K data items can be randomly selected from the data set to be stored, and the selected K data items are determined as K cluster center points.
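Step 202 can be sketched as a small K-means loop. This is a minimal illustration rather than the implementation described in the specification: the function name `kmeans`, the fixed iteration count, the seed, the Euclidean metric and the mean-update step (the standard K-means refinement) are all assumptions.

```python
import math
import random

def kmeans(data, k, iterations=10, seed=0):
    """Tiny K-means sketch: random initial centers, then assign/update rounds."""
    rng = random.Random(seed)
    centers = rng.sample(data, k)  # K randomly selected data items
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in data:
            # Assign each item to the cluster center point closest to it.
            nearest = min(range(k), key=lambda i: math.dist(x, centers[i]))
            clusters[nearest].append(x)
        # Move each center to the mean of its members (keep it if the cluster is empty).
        centers = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centers[i]
            for i, members in enumerate(clusters)
        ]
    return centers

# Hypothetical data set with two well-separated groups.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(data, k=2)
```

On this toy data the two returned centers settle on the means of the two groups; real data sets generally need a convergence check instead of a fixed iteration count.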
  • Step 204: For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace.
  • On the basis of clustering the data set to be stored and determining multiple cluster center points, the corresponding neighbor cluster center point is further determined for each cluster center point, and the space formed by the cluster center point and the corresponding neighbor cluster center point is determined as a clustering subspace.
  • Each cluster center point Ci finds the n neighbor cluster center points closest to it; Ci and its n corresponding neighbor cluster center points can then form multiple clustering subspaces.
  • The specific implementation process of determining the corresponding neighbor cluster center point may be as follows:
  • Sort the first distances from nearest to farthest, select the first N first distances in the ordering, where N is the first preset value, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
  • The first preset value can be set in advance, such as 2, 3, 4, etc. After the calculated first distances are sorted from near to far, the first preset value is used to select the multiple neighbor cluster center points closest to the cluster center point.
  • Suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is determined as the first cluster center point, and C1, C2, ..., Ck are determined as the second cluster center points. Calculate the distances between C0 and each of C1, C2, ..., Ck. Assume the distance between C0 and each of C1, C2, ..., C7 is a, the distance to each of C8, C9, ..., C20 is b, and the distance to each of C21, C22, ..., Ck is c, where a is less than b and b is less than c. If the first preset value is 7, the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7.
  • The above steps are also performed for C1, and the neighbor cluster center points corresponding to C1 are determined to be C0, C2, C7, C8, C9, C10 and C11; the above steps are also performed for C2, and the neighbor cluster center points corresponding to C2 are determined to be C0, C1, C3, C12, C13, C14 and C15; and so on, until the neighbor cluster center points corresponding to Ck are determined.
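The selection by first preset value described above can be sketched as follows. The function name `neighbor_centers`, the Euclidean metric and the one-dimensional sample centers are assumptions for illustration only.

```python
import math

def neighbor_centers(centers, n):
    """For each center index i, return the indices of its n nearest other centers."""
    neighbors = {}
    for i, ci in enumerate(centers):
        others = [j for j in range(len(centers)) if j != i]
        # Sort the first distances from nearest to farthest.
        others.sort(key=lambda j: math.dist(ci, centers[j]))
        neighbors[i] = others[:n]  # n plays the role of the first preset value
    return neighbors

# Hypothetical 1-D cluster center points.
centers = [(0.0,), (1.0,), (2.5,), (6.0,)]
neighbors = neighbor_centers(centers, n=2)
```

The result is a small nearest neighbor graph over the cluster center points: each key is a center and each value lists its selected neighbors in order of increasing distance.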
  • The corresponding neighbor cluster center points can also be filtered out by setting a distance threshold.
  • In that case, for each cluster center point in the plurality of cluster center points, the specific implementation process of determining the corresponding neighbor cluster center point according to the cluster center point may be as follows:
  • Determine the first distances that are smaller than the first distance threshold, and determine the second cluster center points corresponding to those first distances as the neighbor cluster center points corresponding to the first cluster center point.
  • The first distance threshold may be set in advance.
  • Suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is determined as the first cluster center point, and C1, C2, ..., Ck are determined as the second cluster center points. Calculate the distances between C0 and each of C1, C2, ..., Ck. Assume the distance between C0 and each of C1, C2, ..., C7 is 1, the distance to each of C8, C9, ..., C20 is 2, and the distance to each of C21, C22, ..., Ck is 3. If the first distance threshold is 1.5, the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7.
  • The above steps are also performed for C1, C2, and so on.
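The threshold variant can be sketched with the distances from the example above. The function name `neighbors_within` is hypothetical, and k = 25 is an arbitrary assumption, since the text leaves k unspecified.

```python
def neighbors_within(first_distances, threshold):
    """Keep the centers whose first distance is below the first distance threshold."""
    return [name for name, d in first_distances.items() if d < threshold]

# First distances from C0: 1 for C1..C7, 2 for C8..C20, 3 for C21..Ck (k = 25 assumed).
first_distances = {}
for j in range(1, 26):
    first_distances[f"C{j}"] = 1 if j <= 7 else 2 if j <= 20 else 3

result = neighbors_within(first_distances, threshold=1.5)  # C1 through C7
```

Unlike the preset-value variant, the number of neighbors selected this way varies from center to center, depending on how many distances fall below the threshold.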
  • Each cluster center point Ci and one of its neighbor cluster center points Cj constitute a clustering subspace B(i, j); that is, each cluster center point Ci and its multiple neighbor cluster center points can constitute multiple clustering subspaces.
  • Suppose the determined neighbor cluster center points corresponding to C0 are C1, C2, ..., C7, and the neighbor cluster center points corresponding to C1 are C0, C2, C7, C8, C9, C10 and C11.
  • Then C0 forms the clustering subspaces B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6) and B(0,7) with C1, C2, C3, C4, C5, C6 and C7 respectively.
  • Likewise, C1 forms the clustering subspaces B(1,0), B(1,2), B(1,7), B(1,8), B(1,9), B(1,10) and B(1,11) with C0, C2, C7, C8, C9, C10 and C11 respectively.
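The enumeration of clustering subspaces B(i, j) from the neighbor lists can be sketched as below. The function name `build_subspaces` and the representation of a subspace as an `(i, j)` tuple are assumptions for illustration.

```python
def build_subspaces(neighbors):
    """Return one subspace id (i, j) for every center i and each of its neighbors j."""
    return [(i, j) for i, js in neighbors.items() for j in js]

# Neighbor lists for C0 and C1, as in the example above.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    1: [0, 2, 7, 8, 9, 10, 11],
}
subspaces = build_subspaces(neighbors)
```

In this sketch B(0,1) and B(1,0) are kept as distinct subspaces, matching the way the example lists both.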
  • Step 206: Store the to-be-stored data in the to-be-stored data set according to the clustering subspace.
  • After the corresponding neighbor cluster center point has been determined for each cluster center point and the space formed by the cluster center point and the corresponding neighbor cluster center point has been determined as a clustering subspace, the to-be-stored data in the to-be-stored data set is stored according to the clustering subspace.
  • Each to-be-stored data item in the data set needs to be stored in its corresponding clustering subspace to facilitate subsequent retrieval and query. Therefore, for each to-be-stored data item, a calculation is needed to determine which clustering subspace to store it in.
  • The specific implementation process of storing the to-be-stored data in the to-be-stored data set may be as follows:
  • According to the first data to be stored and the first target cluster center point, determine the first target neighbor cluster center point corresponding to the first target cluster center point;
  • Store the first data to be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • The first data to be stored may be any data in the data set to be stored; the above steps are performed once for each to-be-stored data item so as to determine the clustering subspace in which to store it. That is, each to-be-stored data item in the data set serves once as the above-mentioned first to-be-stored data.
  • To determine the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set, in a specific implementation the second distance between the first to-be-stored data and each cluster center point of the plurality of cluster center points may be calculated; the second distances are then sorted from near to far, the first N second distances in the ordering are selected, where N is the second preset value, and the cluster center points corresponding to the selected second distances are determined as the first target cluster center point corresponding to the first data to be stored.
  • The closest clustering subspace can be selected, or the two closest clustering subspaces can be selected. Therefore, when determining the first target cluster center point corresponding to the first data to be stored, the second preset value can be used for screening; it can be set in advance, such as 1, 2, 3, 4, etc.
  • Suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set to be stored are C0, C1, C2, ..., C7. Calculate the distances between X1 and each of C0, C1, C2, ..., C7, and assume the distance between X1 and C0 is 1, between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7.
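The screening by second preset value can be sketched with the second distances from the example above. The function name `nearest_centers` and the choice of preset values 1 and 2 are assumptions for illustration.

```python
def nearest_centers(second_distances, preset):
    """Sort centers by second distance (near to far) and keep the first `preset`."""
    ranked = sorted(second_distances, key=second_distances.get)
    return ranked[:preset]

# Second distances between X1 and C0..C7, as in the example above.
second_distances = {"C0": 1, "C1": 1.5, "C2": 2, "C3": 3,
                    "C4": 4, "C5": 5, "C6": 6, "C7": 7}

one = nearest_centers(second_distances, 1)  # assumed second preset value of 1
two = nearest_centers(second_distances, 2)  # assumed second preset value of 2
```

With a preset value of 1 only C0 is kept; with 2, C0 and C1 are kept.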
  • The first target cluster center point closest to the first data to be stored may also be screened not by the second preset value but by setting a distance threshold.
  • In that case, the specific implementation process of determining the first target cluster center point corresponding to the first to-be-stored data in the to-be-stored data set may be as follows:
  • The cluster center point corresponding to a second distance smaller than the second distance threshold is determined as the first target cluster center point corresponding to the first data to be stored.
  • The second distance threshold may be set in advance.
  • Suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set to be stored are C0, C1, C2, ..., C7. Calculate the distances between X1 and each of C0, C1, C2, ..., C7, and assume the distance between X1 and C0 is 1, between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7.
  • To determine the first target neighbor cluster center point corresponding to the first target cluster center point, in a specific implementation the neighbor cluster center points corresponding to the first target cluster center point may be obtained first; the third distance between the first data to be stored and each of those neighbor cluster center points is then calculated, the third distances are sorted from near to far, the first N third distances in the ordering are selected, where N is the third preset value, and the neighbor cluster center points corresponding to the selected third distances are determined as the first target neighbor cluster center point corresponding to the first target cluster center point.
  • For each to-be-stored data item, the first target neighbor cluster center point corresponding to its first target cluster center point is also determined, that is, the neighbor cluster center point that corresponds to the first target cluster center point and is closest to the first data to be stored.
  • Screening may be performed with a third preset value, which can be set in advance, such as 1, 2, 3, 4, and so on.
  • Suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7. Calculate the distances between X1 and each of C1, C2, ..., C7 (if they have been calculated before, the results can be reused directly). Assume the distance between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7. Sort these 7 distances from near to far; assuming the first third distance (1.5) is selected, C1 is determined as the first target neighbor cluster center point corresponding to C0.
  • The corresponding target neighbor cluster center points are determined according to the above method.
  • The first target neighbor cluster center point that corresponds to the first target cluster center point and is closest to the first to-be-stored data may also be screened not by the third preset value but by setting a distance threshold. In that case, the specific implementation process of determining the first target neighbor cluster center point corresponding to the first target cluster center point, according to the first data to be stored and the first target cluster center point, may be as follows:
  • The neighbor cluster center point corresponding to a third distance smaller than the third distance threshold is determined as the first target neighbor cluster center point corresponding to the first target cluster center point.
  • The third distance threshold may be set in advance.
  • Suppose the first data to be stored is X1, the first target cluster center point corresponding to X1 is C0, and the neighbor cluster center points corresponding to C0 are C1, C2, ..., C7. Calculate the distances between X1 and each of C1, C2, ..., C7, and assume the distance between X1 and C1 is 1.5, between X1 and C2 is 2, between X1 and C3 is 3, between X1 and C4 is 4, between X1 and C5 is 5, between X1 and C6 is 6, and between X1 and C7 is 7.
  • Assuming the distance between X1 and C1 (1.5) is less than the third distance threshold, C1 is determined as the first target neighbor cluster center point corresponding to C0.
  • The corresponding target neighbor cluster center points are determined according to the above method.
  • the first data to be stored can be stored in the first target cluster.
  • the first data to be stored is X1
  • the first target cluster center point corresponding to X1 is C0
  • the first target neighbor cluster center point is C1
  • X1 is stored in the cluster subspace formed by C0 and C1. in B(0,1).
  • other to-be-stored data in the to-be-stored data set are stored in sequence according to the above method.
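The storage assignment described above can be sketched as follows. This is a minimal illustration, not the specification's implementation: it assumes Euclidean distance, assumes the second and third preset values are both 1 (one target center and one target neighbor center per data point), and uses hypothetical 2-D coordinates and neighbor lists. The names `nearest` and `assign_subspace` are introduced here for illustration only.

```python
import math

def nearest(point, candidates, centers):
    """Return the id in candidates whose center is closest to point."""
    return min(candidates, key=lambda cid: math.dist(point, centers[cid]))

def assign_subspace(point, centers, neighbors):
    """Pick B(i, j): i is the closest center, j the closest of i's neighbor centers."""
    i = nearest(point, centers.keys(), centers)
    j = nearest(point, neighbors[i], centers)
    return (i, j)

# Hypothetical 2-D cluster center points, for illustration only.
centers = {0: (0.0, 0.0), 1: (2.0, 0.0), 2: (0.0, 3.0), 3: (-4.0, 0.0)}
# Hypothetical neighbor cluster center points of each center.
neighbors = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1], 3: [0, 2]}

x1 = (0.6, 0.2)
print(assign_subspace(x1, centers, neighbors))  # (0, 1)
```

With these assumed coordinates, X1 lands in B(0,1), mirroring the example in the text.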
  • all the encoded to-be-stored data will be stored in an inverted structure according to the belonging clustering subspace B(i, j) and then written to the index file.
  • the storage in the inverted structure refers to arranging and storing each data to be stored according to the number (id) of the clustering subspace, and writing it into the index file.
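The inverted structure can be pictured as a mapping from each clustering subspace id to the data stored in it, ordered by subspace id. The sketch below keeps the structure in memory only; the on-disk layout of the index file is not described in the text, so it is left out. The function name `build_inverted_index` and the sample assignments are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(assignments):
    """Group data ids by subspace id (i, j) and order the groups by subspace id."""
    postings = defaultdict(list)
    for data_id, subspace in assignments:
        postings[subspace].append(data_id)
    return dict(sorted(postings.items()))

# Hypothetical assignments of data ids to clustering subspaces B(i, j).
assignments = [("X1", (0, 1)), ("X2", (2, 0)), ("X3", (0, 1)), ("X4", (0, 2))]
index = build_inverted_index(assignments)
for subspace, data_ids in index.items():
    print(subspace, data_ids)
```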
  • the clustering subspace can be searched according to the following steps 208-212.
  • Step 208 From the plurality of cluster center points, determine a second target cluster center point corresponding to the data to be queried.
  • a second target cluster center point close to the data to be queried needs to be determined from a plurality of cluster center points.
  • Sort the fourth distances from nearest to farthest, select the fourth preset value number of top-ranked fourth distances, and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
  • for the data to be queried, the distance between it and each cluster center point Ci is calculated so as to determine the cluster center points close to the data to be queried, that is, the fourth preset value number of second target cluster center points.
  • the fourth preset value can be set in advance, such as 2, 3, or 4. After the calculated fourth distances are sorted from nearest to farthest, the fourth preset value is used to select multiple second target cluster center points close to the data to be queried.
  • for example, the cluster center points are C0, C1, C2, ..., C7. For the data q to be queried, the distances between q and C0, C1, C2, ..., C7 are calculated. Assume the distance between q and C0 is 0.5, the distance between q and C1 is 3, the distance between q and C2 is 1, the distance between q and C3 is 0.8, the distance between q and C4 is 3.2, the distance between q and C5 is 5, the distance between q and C6 is 7, and the distance between q and C7 is 5.5. Assuming that the fourth preset value is 3, the second target cluster center points corresponding to q are determined to be C0, C2, and C3.
  • alternatively, the multiple target cluster center points close to the data to be queried need not be selected by the fourth preset value; instead, the corresponding target cluster center points may be selected by setting a distance threshold.
  • the second target cluster center point corresponding to the data to be queried is determined, and the specific implementation process may be as follows:
  • a fourth distance smaller than the fourth distance threshold is determined, and the cluster center point corresponding to the fourth distance smaller than the fourth distance threshold is determined as the second target cluster center point corresponding to the data to be queried.
  • the fourth distance threshold may be set in advance.
  • for example, the cluster center points are C0, C1, C2, ..., C7. For the data q to be queried, the distances between q and C0, C1, C2, ..., C7 are calculated. Assume the distance between q and C0 is 0.5, the distance between q and C1 is 3, the distance between q and C2 is 1, the distance between q and C3 is 0.8, the distance between q and C4 is 3.2, the distance between q and C5 is 5, the distance between q and C6 is 7, and the distance between q and C7 is 5.5. Assuming that the fourth distance threshold is 1.5, the second target cluster center points corresponding to q are determined to be C0, C2, and C3.
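Both selection rules for the second target cluster center points (the fourth preset value and the fourth distance threshold) can be checked against the numbers in the examples. The sketch below uses the distances given in the text; the function names `select_top_k` and `select_within` are introduced here for illustration.

```python
import heapq

def select_top_k(distances, k):
    """Return the k center ids with the smallest distance, nearest first."""
    return heapq.nsmallest(k, distances, key=distances.get)

def select_within(distances, threshold):
    """Return the center ids whose distance is smaller than threshold."""
    return [cid for cid, d in distances.items() if d < threshold]

# Fourth distances between q and C0..C7, taken from the example in the text.
q_distances = {"C0": 0.5, "C1": 3, "C2": 1, "C3": 0.8,
               "C4": 3.2, "C5": 5, "C6": 7, "C7": 5.5}

print(select_top_k(q_distances, 3))     # ['C0', 'C3', 'C2']
print(select_within(q_distances, 1.5))  # ['C0', 'C2', 'C3']
```

Both rules yield the same set {C0, C2, C3} for these numbers; only the ordering of the result differs.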
  • Step 210 Determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point.
  • after the second target cluster center point corresponding to the data to be queried is determined from the plurality of cluster center points, the target cluster subspace corresponding to the second target cluster center point is further determined according to the data to be queried and the second target cluster center point.
  • multiple clustering subspaces where the data to be queried may be stored can be selected according to the data to be queried in the previous step and the center point of the second target cluster.
  • the specific implementation process is as follows:
  • Obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each neighbor cluster center point as a second target neighbor cluster center point corresponding to the second target cluster center point;
  • the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point is determined as the target cluster subspace.
  • each second target cluster center point will have a corresponding neighboring cluster center point. Therefore, for each second target cluster center point , the corresponding second target neighbor cluster center point should be determined, and then the corresponding target cluster subspace should be determined.
  • for example, the second target cluster center points corresponding to the data to be queried are C0, C2, and C3. The neighbor cluster center points corresponding to C0 are C1, C2, ..., C7, so the second target neighbor cluster center points corresponding to the second target cluster center point C0 are C1, C2, ..., C7, and the target cluster subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), B(0,7);
  • the neighbor cluster center points corresponding to C2 are C0, C1, C3, C12, C13, C14, C15, so the second target neighbor cluster center points corresponding to the second target cluster center point C2 are C0, C1, C3, C12, C13, C14, C15, and the target cluster subspaces corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), B(2,15);
  • the neighbor cluster center points corresponding to C3 are C0, C2, C4, C16, C17, C18, C19, so the second target neighbor cluster center points corresponding to the second target cluster center point C3 are C0, C2, C4, C16, C17, C18, C19, and the target cluster subspaces corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), B(3,19).
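Enumerating the target cluster subspaces amounts to pairing each second target cluster center point with every one of its neighbor cluster center points. A small sketch using the neighbor lists from the example (the function name `target_subspaces` is hypothetical):

```python
def target_subspaces(target_centers, neighbors):
    """Return every subspace id (i, j) with i a target center and j a neighbor of i."""
    return [(i, j) for i in target_centers for j in neighbors[i]]

# Neighbor cluster center points of C0, C2 and C3, as listed in the example.
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    2: [0, 1, 3, 12, 13, 14, 15],
    3: [0, 2, 4, 16, 17, 18, 19],
}

subspaces = target_subspaces([0, 2, 3], neighbors)
print(len(subspaces))  # 21
print(subspaces[:3])   # [(0, 1), (0, 2), (0, 3)]
```

Three target centers with seven neighbors each give 21 candidate subspaces, which the next step narrows down.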
  • Step 212 According to the data to be queried and the target clustering subspace, determine a search space corresponding to the data to be queried.
  • the search space refers to the space in which the data to be queried is finally queried.
  • the search space corresponding to the data to be queried is determined according to the data to be queried and the target clustering subspace, and the specific implementation process may be as follows:
  • Sort the fifth distances from nearest to farthest, select the fifth preset value number of top-ranked fifth distances, and determine the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  • the midpoint between the second target cluster center point and the second target neighbor cluster center point can be determined as the target point; the sixth distance between the target point and the data to be queried is then calculated and determined as the fifth distance between the target clustering subspace and the data to be queried. That is, the distance from a target clustering subspace to the data to be queried is represented by the distance from the mean point (midpoint) of the cluster center point Ci and its neighbor cluster center point Cj to the data to be queried.
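The midpoint rule can be written compactly. As a sketch, assuming a Euclidean metric (the text does not fix the distance function) and using $m_{ij}$, $d_5$ and $d_6$ as shorthand introduced here for the target point, the fifth distance and the sixth distance:

```latex
m_{ij} = \frac{C_i + C_j}{2}, \qquad
d_5\bigl(B(i,j),\, q\bigr) = d_6\bigl(m_{ij},\, q\bigr) = \lVert q - m_{ij} \rVert_2
```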
  • multiple target clustering subspaces may be obtained by the above steps. For each target clustering subspace, the distance between it and the data to be queried needs to be determined, so as to decide whether to determine the target clustering subspace as the search space in which the data to be queried is queried.
  • for example, the target clustering subspaces corresponding to C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6), B(0,7); those corresponding to C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14), B(2,15); and those corresponding to C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18), B(3,19). The distance between each of these target clustering subspaces and the data q to be queried is calculated, and the nearest target clustering subspaces are used as the final search space.
  • for example, B(0,2), B(0,3), B(2,0), and B(3,0) are selected as the final search space, and the data q to be queried is queried in this search space.
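Selecting the final search space can be sketched as ranking the candidate subspaces by the distance from the query to the midpoint of their two center points. The coordinates, candidate list, and the value 3 used for the fifth preset value below are hypothetical, Euclidean distance is assumed, and `select_search_space` is a name introduced here.

```python
import math

def midpoint(a, b):
    """Return the midpoint of two points given as coordinate tuples."""
    return tuple((x + y) / 2 for x, y in zip(a, b))

def select_search_space(query, subspaces, centers, k):
    """Keep the k subspaces B(i, j) whose Ci/Cj midpoint is closest to query."""
    def fifth_distance(subspace):
        i, j = subspace
        return math.dist(query, midpoint(centers[i], centers[j]))
    # sorted() is stable, so subspaces with equal distance keep their input order.
    return sorted(subspaces, key=fifth_distance)[:k]

# Hypothetical 2-D cluster center points and candidate target subspaces.
centers = {0: (0.0, 0.0), 1: (4.0, 0.0), 2: (0.0, 4.0), 3: (-4.0, 0.0)}
candidates = [(0, 1), (0, 2), (0, 3), (1, 0), (1, 2)]

q = (0.8, 0.5)
print(select_search_space(q, candidates, centers, 3))  # [(0, 1), (1, 0), (0, 2)]
```

Because B(i,j) and B(j,i) share the same midpoint, they always rank together under this rule, which is consistent with the example above selecting both B(0,2) and B(2,0).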
  • the data storage method provided in this specification clusters the data set to be stored and determines multiple cluster center points; then, for each cluster center point among them, determines the corresponding neighbor cluster center points according to the cluster center point, and determines the space formed by the cluster center point and a corresponding neighbor cluster center point as a clustering subspace; then, according to the clustering subspaces, stores the data to be stored in the data set to be stored.
  • when querying, the second target cluster center point corresponding to the data to be queried can be determined from the plurality of cluster center points; the target clustering subspace corresponding to the second target cluster center point is then determined according to the data to be queried and the second target cluster center point; and the search space corresponding to the data to be queried is further determined according to the data to be queried and the target clustering subspace, and the search is performed in that search space.
  • in this way, the data to be stored is not stored directly in the first-level space centered on a cluster center point; instead, the corresponding neighbor cluster center point is further determined according to the cluster center point, and the data to be stored is then stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point. The clustering algorithm and the idea of the nearest neighbor graph are thus combined: on the basis of K-means clustering, the neighbor graph relationship of the second-level subspaces is introduced, and the first-level space is further divided into finer second-level clustering subspaces, thereby improving the retrieval accuracy of the entire index.
  • in addition, compared with a two-level hierarchical clustering structure, this two-level subspace division saves the cost of multi-level clustering and builds the index faster; compared with a single-level K-means clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
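The neighbor relationship among cluster center points that underlies this two-level division can be sketched as follows: for each cluster center point, the other center points are ranked by distance and the closest few are kept as its neighbor cluster center points. The coordinates are hypothetical, Euclidean distance is assumed, `k` stands in for the first preset value, and `neighbor_centers` is a name introduced here.

```python
import math

def neighbor_centers(centers, k):
    """For each center id, return the ids of its k nearest other centers."""
    result = {}
    for cid, point in centers.items():
        others = [other for other in centers if other != cid]
        others.sort(key=lambda other: math.dist(point, centers[other]))
        result[cid] = others[:k]
    return result

# Hypothetical 2-D cluster center points produced by a prior clustering step.
centers = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (0.0, 2.0), 3: (5.0, 5.0)}

print(neighbor_centers(centers, 2))
# {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [2, 1]}
```

Only distances between center points are computed here, so this step adds no second round of clustering over the stored data.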
  • when the data to be stored is clustered by the K-means clustering algorithm alone, 8 different K-means cluster center points C0-C7 are determined, and each cluster center point delimits one Voronoi space.
  • when the data q to be queried lies at a space boundary, the multiple spaces of C0, C2, and C3 all have to be searched.
  • by contrast, after determining the 8 different K-means cluster center points C0-C7, the data storage method provided in this specification further determines the neighbor cluster center points corresponding to C0-C7, so that the first-level space is divided a second time into second-level clustering subspaces.
  • when the same data q to be queried is queried, the method provided in this specification only needs to search the clustering subspaces B(0,2), B(0,3), B(2,0), and B(3,0), which greatly narrows the search range and improves search efficiency and accuracy.
  • FIG. 4 shows a flowchart of a data query method provided according to an embodiment of the present specification, which specifically includes the following steps:
  • Step 402 From the plurality of cluster center points, determine the second target cluster center point corresponding to the data to be queried. It should be noted that, the specific implementation process of step 402 is the same as the specific implementation process of the above-mentioned step 208, and details are not described herein again in this specification.
  • Step 404 Determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point.
  • It should be noted that the specific implementation process of step 404 is the same as that of step 210 above, and details are not described herein again in this specification.
  • Step 406 Determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • It should be noted that the specific implementation process of step 406 is the same as that of step 212 above, and details are not described herein again in this specification.
  • the data query method provided in this specification can determine the second target cluster center point corresponding to the data to be queried from a plurality of cluster center points, then determine the target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, and further determine the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, so that the search is performed in the search space.
  • the clustering algorithm is combined with the idea of the nearest neighbor graph, that is, on the basis of K-means clustering, the neighbor graph relationship of the second-level subspaces is introduced, and the first-level space is further divided into finer second-level clustering subspaces, which narrows the search range when the data to be queried is subsequently queried, thereby improving the retrieval efficiency and accuracy of the entire index.
  • FIG. 5 shows a schematic structural diagram of a data storage apparatus provided by an embodiment of this specification.
  • the device includes:
  • the first determining module 502 is configured to cluster the data set to be stored, and determine a plurality of cluster center points;
  • the second determination module 504 is configured to, for each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as the cluster subspace;
  • the storage module 506 is configured to store the to-be-stored data in the to-be-stored data set according to the cluster subspace.
  • the second determining module 504 is further configured to:
  • Sort the first distances from nearest to farthest, select the first preset value number of top-ranked first distances, and determine the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
  • the storage module 506 is further configured to:
  • determine, according to the first data to be stored and the first target cluster center point, the first target neighbor cluster center point corresponding to the first target cluster center point;
  • store the first data to be stored in the clustering subspace formed by the first target cluster center point and the first target neighbor cluster center point.
  • the storage module 506 is further configured to:
  • Sort the second distances from nearest to farthest, select the second preset value number of top-ranked second distances, and determine the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
  • the storage module 506 is further configured to:
  • Sort the third distances from nearest to farthest, select the third preset value number of top-ranked third distances, and determine the neighbor cluster center points corresponding to the selected third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.
  • the apparatus further includes:
  • a third determination module configured to determine a second target cluster center point corresponding to the data to be queried from among the plurality of cluster center points;
  • a fourth determining module configured to determine a target clustering subspace corresponding to the second target clustering center point according to the data to be queried and the second target clustering center point;
  • the fifth determining module is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • with the data storage apparatus provided in this specification, the data to be stored is not stored directly in the first-level space centered on a cluster center point; instead, the corresponding neighbor cluster center point is further determined according to the cluster center point, and the data to be stored is then stored in the clustering subspace formed by the cluster center point and the corresponding neighbor cluster center point. The clustering algorithm and the idea of the nearest neighbor graph are thus combined: on the basis of K-means clustering, the neighbor graph relationship of the second-level subspaces is introduced, and the first-level space is further divided into finer second-level clustering subspaces, thereby improving the retrieval accuracy of the entire index.
  • in addition, compared with a two-level hierarchical clustering structure, this two-level subspace division saves the cost of multi-level clustering and builds the index faster; compared with a single-level K-means clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
  • the above is a schematic solution of a data storage device according to this embodiment. It should be noted that the technical solution of the data storage device and the technical solution of the above-mentioned data storage method belong to the same concept; for details not described in the technical solution of the data storage device, refer to the description of the technical solution of the above-mentioned data storage method.
  • FIG. 6 shows a schematic structural diagram of a data query apparatus provided by an embodiment of the present specification.
  • the device includes:
  • the third determination module 602 is configured to determine the second target cluster center point corresponding to the data to be queried from the plurality of cluster center points;
  • the fourth determination module 604 is configured to determine the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point;
  • the fifth determination module 606 is configured to determine a search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace.
  • the third determining module 602 is further configured to:
  • Sort the fourth distances from nearest to farthest, select the fourth preset value number of top-ranked fourth distances, and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
  • the fourth determining module 604 is further configured to:
  • Obtain each neighbor cluster center point corresponding to the second target cluster center point, and determine each neighbor cluster center point as a second target neighbor cluster center point corresponding to the second target cluster center point;
  • the cluster subspace formed by the second target cluster center point and the second target neighbor cluster center point is determined as the target cluster subspace.
  • the fifth determining module 606 is further configured to:
  • Sort the fifth distances from nearest to farthest, select the fifth preset value number of top-ranked fifth distances, and determine the target clustering subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  • the fifth determining module 606 is further configured to:
  • determine the midpoint between the second target cluster center point and the second target neighbor cluster center point as the target point;
  • calculate the sixth distance between the target point and the data to be queried, and determine the sixth distance as the fifth distance between the target clustering subspace and the data to be queried.
  • the data query device provided in this specification can determine the second target cluster center point corresponding to the data to be queried from a plurality of cluster center points, then determine the target clustering subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point, and further determine the search space corresponding to the data to be queried according to the data to be queried and the target clustering subspace, so that the search is performed in the search space.
  • the clustering algorithm is combined with the idea of the nearest neighbor graph, that is, on the basis of K-means clustering, the neighbor graph relationship of the second-level subspaces is introduced, and the first-level space is further divided into finer second-level subspaces, which narrows the search range when the data to be queried is queried, thereby improving the retrieval efficiency and accuracy of the entire index.
  • the above is a schematic solution of a data query apparatus according to this embodiment. It should be noted that the technical solution of the data query device and the technical solution of the above-mentioned data query method belong to the same concept; for details not described in the technical solution of the data query device, refer to the description of the technical solution of the above-mentioned data query method.
  • FIG. 7 shows a structural block diagram of a computing device 700 according to an embodiment of the present specification.
  • Components of the computing device 700 include, but are not limited to, memory 710 and processor 720 .
  • the processor 720 is connected with the memory 710 through the bus 730, and the database 750 is used for storing data.
  • Computing device 700 also includes access device 740 that enables computing device 700 to communicate via one or more networks 760 .
  • Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 740 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • the above-mentioned components of computing device 700 and other components not shown in FIG. 7 may also be connected to each other, such as through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 7 is only for the purpose of example, rather than limiting the scope of the present specification. Those skilled in the art can add or replace other components as required.
  • Computing device 700 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smart phones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 700 may also be a mobile or stationary server.
  • the processor 720 is configured to execute the following computer-executable instructions to implement the following method:
  • Cluster the data set to be stored, and determine multiple cluster center points;
  • For each cluster center point in the plurality of cluster center points, determine the corresponding neighbor cluster center point according to the cluster center point, and determine the space formed by the cluster center point and the corresponding neighbor cluster center point as a clustering subspace;
  • according to the clustering subspace, store the to-be-stored data in the to-be-stored data set.
  • the above is a schematic solution of a computing device according to this embodiment. It should be noted that the technical solution of the computing device and the technical solution of the above-mentioned data storage method belong to the same concept. For details not described in detail in the technical solution of the computing device, refer to the description of the technical solution of the above-mentioned data storage method.
  • FIG. 8 shows a structural block diagram of a computing device 800 according to an embodiment of the present specification.
  • Components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820 .
  • the processor 820 is connected with the memory 810 through the bus 830, and the database 850 is used for saving data.
  • Computing device 800 also includes access device 840 that enables computing device 800 to communicate via one or more networks 860 .
  • Examples of such networks include a public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet.
  • Access device 840 may include one or more of any type of network interface, wired or wireless (e.g., a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and the like.
  • the above-mentioned components of computing device 800 and other components not shown in FIG. 8 may also be connected to each other, such as through a bus. It should be understood that the structural block diagram of the computing device shown in FIG. 8 is only for the purpose of example, rather than limiting the scope of this specification. Those skilled in the art can add or replace other components as required.
  • Computing device 800 may be any type of stationary or mobile computing device, including mobile computers or mobile computing devices (e.g., tablet computers, personal digital assistants, laptop computers, notebook computers, netbooks, etc.), mobile phones (e.g., smart phones), wearable computing devices (e.g., smart watches, smart glasses, etc.) or other types of mobile devices, or stationary computing devices such as desktop computers or PCs.
  • Computing device 800 may also be a mobile or stationary server.
  • the processor 820 is configured to execute the following computer-executable instructions to implement the following method:
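  • from a plurality of cluster center points, a second target cluster center point corresponding to the data to be queried is determined;
  • according to the data to be queried and the second target cluster center point, a target clustering subspace corresponding to the second target cluster center point is determined;
  • according to the data to be queried and the target clustering subspace, a search space corresponding to the data to be queried is determined.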
  • An embodiment of the present specification further provides a computer-readable storage medium, which stores computer instructions, and when the instructions are executed by a processor, is used to implement the operation steps of the above data storage method.
  • An embodiment of the present specification further provides a computer-readable storage medium, which stores computer instructions, which, when executed by a processor, are used to implement the operation steps of the above data query method.
  • the computer instructions include computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like.
  • the computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, computer memory, read-only memory (ROM), random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

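A data storage method and apparatus, and a data query method and apparatus. The data storage method includes: clustering a data set to be stored and determining a plurality of cluster center points (202); for each cluster center point among the plurality of cluster center points, determining a corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a cluster subspace (204); and storing the data to be stored in the data set to be stored according to the cluster subspace (206). The data storage method combines a clustering algorithm with the idea of a nearest neighbor graph: on the basis of first-level clustering, it introduces the neighbor graph relationship of second-level subspaces and further divides the first-level space into finer second-level cluster subspaces, thereby improving the retrieval accuracy of the entire index.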

Description

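Data storage method and apparatus, and data query method and apparatus
This application claims priority to Chinese patent application No. 202011035973.0, filed on September 27, 2020 and entitled "Data storage method and apparatus, and data query method and apparatus", the entire contents of which are incorporated herein by reference.
Technical Field
This specification relates to the technical field of data processing, and in particular to a data storage method and apparatus and a data query method and apparatus.
Background
With the rapid development of computer and network technologies, large amounts of data are generated, bringing enormous pressure on data storage and data query. The k-means clustering algorithm clusters data according to similarity distance, thereby dividing all of the data into multiple spaces; during a data query, the query vector only needs to search the data points that lie in the same space as itself to ensure high accuracy.
However, when the query vector falls on the boundary of multiple spaces, the multiple spaces adjacent to the query vector must all be searched to ensure accuracy. Moreover, because of the very high dimensionality of the vectors, each space is adjacent to a great many other spaces, which aggravates the problem. A simpler and more convenient method for data storage and data query operations or processing is therefore needed.
Summary
In view of this, embodiments of this specification provide a data storage method. This specification also relates to a data storage apparatus, a data query method and apparatus, two computing devices, and two computer-readable storage media, to address technical defects in the prior art.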
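According to a first aspect of the embodiments of this specification, a data storage method is provided, the method including:
clustering a data set to be stored, and determining a plurality of cluster center points;
for each cluster center point among the plurality of cluster center points, determining a corresponding neighbor cluster center point according to the cluster center point, and determining the space formed by the cluster center point and the corresponding neighbor cluster center point as a cluster subspace;
storing the data to be stored in the data set to be stored according to the cluster subspace.
Optionally, determining the corresponding neighbor cluster center point according to the cluster center point includes:
determining the cluster center point as a first cluster center point, and determining the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
calculating a first distance between the first cluster center point and each second cluster center point;
sorting the first distances from nearest to farthest, selecting the first preset value number of top-ranked first distances, and determining the second cluster center points corresponding to the selected first distances as the neighbor cluster center points corresponding to the first cluster center point.
Optionally, storing the data to be stored in the data set to be stored according to the cluster subspace includes:
determining a first target cluster center point corresponding to first data to be stored in the data set to be stored;
determining, according to the first data to be stored and the first target cluster center point, a first target neighbor cluster center point corresponding to the first target cluster center point;
storing the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighbor cluster center point.
Optionally, determining the first target cluster center point corresponding to the first data to be stored in the data set to be stored includes:
calculating a second distance between the first data to be stored and each cluster center point among the plurality of cluster center points;
sorting the second distances from nearest to farthest, selecting the second preset value number of top-ranked second distances, and determining the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
Optionally, determining, according to the first data to be stored and the first target cluster center point, the first target neighbor cluster center point corresponding to the first target cluster center point includes:
obtaining the neighbor cluster center points corresponding to the first target cluster center point;
calculating a third distance between the first data to be stored and each of the neighbor cluster center points;
sorting the third distances from nearest to farthest, selecting the third preset value number of top-ranked third distances, and determining the neighbor cluster center points corresponding to the selected third distances as the first target neighbor cluster center point corresponding to the first target cluster center point.
Optionally, after storing the data to be stored in the data set to be stored according to the cluster subspace, the method further includes:
determining, from the plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.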
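According to a second aspect of the embodiments of this specification, a data query method is provided, the method comprising:
determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
Optionally, determining, from the plurality of cluster center points, the second target cluster center point corresponding to the data to be queried comprises:
calculating a fourth distance between the data to be queried and each of the plurality of cluster center points;
sorting the fourth distances from near to far, selecting the fourth preset number of top-ranked fourth distances, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center point corresponding to the data to be queried.
Optionally, determining, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point comprises:
obtaining the neighboring cluster center points corresponding to the second target cluster center point, and determining these neighboring cluster center points as the second target neighboring cluster center points corresponding to the second target cluster center point;
determining the cluster subspace formed by the second target cluster center point and a second target neighboring cluster center point as the target cluster subspace.
Optionally, determining, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried comprises:
calculating a fifth distance between the target cluster subspace and the data to be queried;
sorting the fifth distances from near to far, selecting the fifth preset number of top-ranked fifth distances, and determining the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
Optionally, calculating the fifth distance between the target cluster subspace and the data to be queried comprises:
determining the midpoint of the second target cluster center point and the second target neighboring cluster center point as a target point;
calculating a sixth distance between the target point and the data to be queried, and determining the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.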
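According to a third aspect of the embodiments of this specification, a data storage apparatus is provided, the apparatus comprising:
a first determining module, configured to cluster a data set to be stored to determine a plurality of cluster center points;
a second determining module, configured to, for each of the plurality of cluster center points, determine the corresponding neighboring cluster center points according to the cluster center point, and determine the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
a storage module, configured to store the data to be stored in the data set to be stored according to the cluster subspaces.
According to a fourth aspect of the embodiments of this specification, a data query apparatus is provided, the apparatus comprising:
a third determining module, configured to determine, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
a fourth determining module, configured to determine, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
a fifth determining module, configured to determine, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.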
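According to a fifth aspect of the embodiments of this specification, a computing device is provided, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
clustering a data set to be stored to determine a plurality of cluster center points;
for each of the plurality of cluster center points, determining the corresponding neighboring cluster center points according to the cluster center point, and determining the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
storing the data to be stored in the data set to be stored according to the cluster subspaces.
According to a sixth aspect of the embodiments of this specification, a computing device is provided, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
According to a seventh aspect of the embodiments of this specification, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed by a processor, implement the steps of the data storage method.
According to an eighth aspect of the embodiments of this specification, a computer-readable storage medium is provided, which stores computer-executable instructions that, when executed by a processor, implement the steps of the data query method.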
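The data storage method provided in this specification clusters the data set to be stored to determine a plurality of cluster center points; then, for each of these cluster center points, determines the corresponding neighboring cluster center points and determines the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace; and then stores the data to be stored according to the cluster subspaces. In this way, once the cluster center points are determined, the data to be stored is not stored directly in the first-layer space centered on a cluster center point. Instead, the neighboring cluster center points close to that cluster center point are further determined, and the data is stored in a cluster subspace formed by the cluster center point and a corresponding neighboring cluster center point. This combines the clustering algorithm with the neighbor-graph idea: on top of the first-layer clustering, a neighbor-graph relationship among second-layer subspaces is introduced, and each first-layer space is further divided into second-layer cluster subspaces, which improves the retrieval accuracy of the entire index. Moreover, compared with a two-layer hierarchical clustering structure, this second-layer subspace partitioning avoids the overhead of multi-layer clustering and builds the index faster; compared with a single-layer clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.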
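Brief Description of the Drawings
Fig. 1 is a structural diagram of an ANN index based on K-means clustering according to an embodiment of this specification;
Fig. 2 is a flowchart of a data storage method according to an embodiment of this specification;
Fig. 3 is a structural diagram of an index combining K-means clustering with a neighbor graph according to an embodiment of this specification;
Fig. 4 is a flowchart of a data query method according to an embodiment of this specification;
Fig. 5 is a schematic structural diagram of a data storage apparatus according to an embodiment of this specification;
Fig. 6 is a schematic structural diagram of a data query apparatus according to an embodiment of this specification;
Fig. 7 is a structural block diagram of a computing device according to an embodiment of this specification;
Fig. 8 is a structural block diagram of another computing device according to an embodiment of this specification.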
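Detailed Description
Many specific details are set forth in the following description to facilitate a full understanding of this specification. However, this specification can be implemented in many ways other than those described here, and those skilled in the art can make similar generalizations without departing from its spirit; this specification is therefore not limited by the specific implementations disclosed below.
The terms used in one or more embodiments of this specification are for the purpose of describing particular embodiments only and are not intended to limit the one or more embodiments. The singular forms "a", "said" and "the" used in one or more embodiments of this specification and in the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" used in one or more embodiments of this specification refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, and so on may be used in one or more embodiments of this specification to describe various pieces of information, the information should not be limited by these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of one or more embodiments of this specification, a first may also be called a second, and similarly a second may also be called a first. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon" or "in response to determining".
First, the terms involved in one or more embodiments of this specification are explained.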
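ANN retrieval: approximate nearest neighbor search exploits the property that, as the data volume grows, data tends to form clustered distributions. It classifies or encodes the data in a database by cluster analysis; for a target data item, it predicts the data category to which the item belongs from its features and returns part or all of that category as the retrieval result. In other words, a pre-built index is used to quickly retrieve the N nearest neighboring data items (i.e., vectors) in a high-dimensional space, but the result is only approximately accurate and absolute accuracy cannot be guaranteed. The core idea of ANN retrieval is to search for data items that are likely to be neighbors rather than being limited to returning only the most likely items, improving retrieval efficiency at the cost of an acceptable loss of precision.
Vector index: a class of index structures dedicated to providing ANN retrieval over high-dimensional vector data.
K-means clustering algorithm: an iterative cluster analysis algorithm. It randomly selects K objects as the initial cluster center points, then calculates the distance between each object and each cluster center point and assigns each object to the cluster center point nearest to it; a cluster center point together with the objects assigned to it represents one cluster. The K-means clustering algorithm originates from a vector quantization method in signal processing and is also a classic cluster analysis method in data mining.
Voronoi space: a partition of a hyperspace, characterized in that any position within a subspace is closest to the center point of that subspace and relatively farther from the center points of adjacent subspaces, and each subspace contains exactly one center point.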
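To make the K-means and Voronoi definitions above concrete, the following is a minimal sketch of a Lloyd-style K-means loop. It is not taken from the specification: the function name `kmeans`, the fixed iteration count, and the mean-recomputation step are assumptions based on the standard textbook algorithm. Assigning each point to its nearest center is what induces the Voronoi partition described above.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd-style K-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly pick K objects as initial cluster center points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            # assign each object to its nearest center (squared Euclidean distance)
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            groups[nearest].append(p)
        # recompute each center as the mean of its group; keep the old center if the group is empty
        centers = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers
```

For instance, `kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)` settles on one center per visible group. The distance metric is an assumption here; the specification does not fix one.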
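Next, the basic idea of the data storage method and the data query method provided in this specification is briefly described:
The k-means clustering algorithm clusters data by similarity distance and thereby partitions all the data into multiple spaces; during a query, the query vector only needs to search the data points in the same space as itself. However, for a vector index that partitions space purely with the K-means clustering algorithm, retrieval accuracy degrades severely when the data clusters poorly. In addition, when the query vector falls on the boundary between multiple spaces, the system must search several spaces adjacent to the query vector to maintain accuracy, and because the vectors are of very high dimension, each space is adjacent to a great many other spaces, which aggravates the problem.
For example, Fig. 1 is a structural diagram of an ANN index based on K-means clustering. As shown in Fig. 1, the data to be stored is clustered with the K-means clustering algorithm, yielding eight different K-means cluster center points C0 to C7, each of which defines a Voronoi space. When the data to be queried q lies on a space boundary, the spaces C0, C2 and C3 all have to be searched, which greatly reduces retrieval efficiency.
Therefore, to improve retrieval efficiency and accuracy, this specification proposes a data storage method and apparatus and a data query method and apparatus. After the data set to be stored is clustered and a plurality of cluster center points are determined, the corresponding neighboring cluster center points are determined for each of the cluster center points, and the space formed by a cluster center point and a corresponding neighboring cluster center point is determined as a cluster subspace; the data to be stored is then stored in the cluster subspaces. This combines the clustering algorithm with the neighbor-graph idea: on top of K-means clustering, a neighbor-graph relationship among second-layer subspaces is introduced, and each first-layer space Ci is further divided into second-layer cluster subspaces, which improves the retrieval efficiency and accuracy of the entire index.
This specification provides a data storage method, and also relates to a data storage apparatus, a data query method and apparatus, two computing devices, and two computer-readable storage media, which are described in detail one by one in the following embodiments.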
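Fig. 2 shows a flowchart of a data storage method according to an embodiment of this specification, which specifically includes the following steps:
Step 202: cluster the data set to be stored to determine a plurality of cluster center points.
Specifically, the data set to be stored is the set of all data to be stored, and it includes a plurality of data items to be stored.
In practice, the data set to be stored may be clustered with the K-means clustering algorithm to determine the plurality of cluster center points. In a specific implementation, K data items may be randomly selected from the data set to be stored and taken as the K cluster center points.
Step 204: for each of the plurality of cluster center points, determine the corresponding neighboring cluster center points according to the cluster center point, and determine the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace.
Specifically, once the data set to be stored has been clustered and the plurality of cluster center points determined, the corresponding neighboring cluster center points are determined for each cluster center point, and the space formed by the cluster center point and a corresponding neighboring cluster center point is determined as a cluster subspace.
In practice, for each cluster center point Ci, the n neighboring cluster center points close to it are found; each cluster center point Ci and its n neighboring cluster center points can form multiple cluster subspaces.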
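In one or more implementations of this embodiment, determining the corresponding neighboring cluster center points according to the cluster center point may be implemented as follows:
take the cluster center point as the first cluster center point, and take the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
calculate the first distance between the first cluster center point and each second cluster center point;
sort the first distances from near to far, select the first preset number of top-ranked first distances, and determine the second cluster center points corresponding to the selected first distances as the neighboring cluster center points of the first cluster center point.
It should be noted that, for each cluster center point Ci, several neighboring cluster center points close to it have to be found. The distances between Ci and the other cluster center points therefore need to be calculated so that the nearest ones, namely the first preset number of neighboring cluster center points, can be selected. The first preset number can be set in advance, for example 2, 3 or 4; after the calculated first distances are sorted from near to far, the first preset number is used to select the neighboring cluster center points close to the cluster center point.
For example, suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is taken as the first cluster center point and C1, C2, ..., Ck as the second cluster center points, and the distances between C0 and each of C1, C2, ..., Ck are calculated. Suppose the distance between C0 and each of C1, C2, ..., C7 is a, the distance between C0 and each of C8, C9, ..., C20 is b, and the distance between C0 and each of C21, C22, ..., Ck is c, where a < b < c, and the first preset number is 7. The neighboring cluster center points of C0 are then C1, C2, ..., C7. The same steps are performed for C1, giving C0, C2, C7, C8, C9, C10 and C11 as its neighboring cluster center points; for C2, giving C0, C1, C3, C12, C13, C14 and C15; and so on, until the neighboring cluster center points of Ck are determined.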
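The neighbor-selection steps above can be sketched as follows. This is an illustrative sketch rather than the applicant's implementation: the function name `neighbor_centers`, the use of Euclidean distance via `math.dist`, and the representation of centers as coordinate tuples are all assumptions.

```python
import math

def neighbor_centers(centers, n):
    """For each center index i, return the indices of its n nearest other centers."""
    neighbors = {}
    for i, ci in enumerate(centers):
        # first distances: from center i (first cluster center point) to every other center
        dists = [(math.dist(ci, cj), j) for j, cj in enumerate(centers) if j != i]
        dists.sort()  # near to far
        # keep the first preset number (n) of nearest centers
        neighbors[i] = [j for _, j in dists[:n]]
    return neighbors
```

With four hypothetical centers on a line, `neighbor_centers([(0, 0), (1, 0), (3, 0), (7, 0)], 2)` gives center 0 the neighbors `[1, 2]` and center 3 the neighbors `[2, 1]`. The relation is not necessarily symmetric: center 3 lists center 1 as a neighbor, but center 1 does not list center 3.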
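In one or more implementations of this embodiment, instead of using the first preset number to select the neighboring cluster center points close to a cluster center point, a distance threshold may be set to select the corresponding neighboring cluster center points. In this case, for each of the plurality of cluster center points, determining the corresponding neighboring cluster center points according to the cluster center point may be implemented as follows:
take the cluster center point as the first cluster center point, and take the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
calculate the first distance between the first cluster center point and each second cluster center point;
identify the first distances that are smaller than a first distance threshold, and determine the second cluster center points corresponding to those first distances as the neighboring cluster center points of the first cluster center point.
The first distance threshold can be set in advance.
For example, suppose the determined cluster center points are C0, C1, C2, ..., Ck. For C0, C0 is taken as the first cluster center point and C1, C2, ..., Ck as the second cluster center points, and the distances between C0 and each of C1, C2, ..., Ck are calculated. Suppose the distance between C0 and each of C1, C2, ..., C7 is 1, the distance between C0 and each of C8, C9, ..., C20 is 2, and the distance between C0 and each of C21, C22, ..., Ck is 3, and the first distance threshold is 1.5. The neighboring cluster center points of C0 are then C1, C2, ..., C7. The same steps are performed for C1, C2, ..., Ck to determine their neighboring cluster center points.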
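The threshold variant differs from the preset-number variant only in the selection rule. A minimal sketch, again assuming Euclidean distance and a hypothetical function name `neighbor_centers_by_threshold`:

```python
import math

def neighbor_centers_by_threshold(centers, threshold):
    """For each center index i, return every other center closer than the first distance threshold."""
    return {
        i: [
            j
            for j, cj in enumerate(centers)
            if j != i and math.dist(ci, cj) < threshold
        ]
        for i, ci in enumerate(centers)
    }
```

One design consequence worth noting: with a threshold, the number of neighbors per center is no longer fixed. An isolated center can end up with no neighbors at all, and therefore with no cluster subspace, so an implementation would likely need a fallback for that case; the specification does not discuss this.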
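In practice, once the neighboring cluster center points of every cluster center point have been obtained, the result is called the neighbor-graph relationship among the cluster center points. Each cluster center point Ci and one of its neighboring cluster center points Cj form a cluster subspace B(i,j); that is, each cluster center point Ci and its several neighboring cluster center points can form multiple cluster subspaces.
For example, suppose the neighboring cluster center points of C0 are C1, C2, ..., C7, and the neighboring cluster center points of C1 are C0, C2, C7, C8, C9, C10 and C11. Then C0 forms the cluster subspaces B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6) and B(0,7) with C1 to C7 respectively, and C1 forms the cluster subspaces B(1,0), B(1,2), B(1,7), B(1,8), B(1,9), B(1,10) and B(1,11) with C0, C2, C7, C8, C9, C10 and C11 respectively.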
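The enumeration of cluster subspaces from the neighbor graph can be sketched as below. Representing B(i,j) as an ordered pair `(i, j)` is an assumption made for illustration; it matches the example above, where B(0,1) and B(1,0) are listed as separate subspaces.

```python
def build_subspaces(neighbors):
    """Enumerate cluster subspaces B(i, j) as ordered (i, j) pairs from a neighbor graph."""
    return [(i, j) for i, js in neighbors.items() for j in js]

# Neighbor lists for C0 and C1 taken from the example above
neighbors = {
    0: [1, 2, 3, 4, 5, 6, 7],
    1: [0, 2, 7, 8, 9, 10, 11],
}
subspaces = build_subspaces(neighbors)
```

Here `subspaces` contains 14 pairs, including both `(0, 1)` and `(1, 0)`.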
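Step 206: store the data to be stored in the data set to be stored according to the cluster subspaces.
Specifically, once the neighboring cluster center points have been determined for each cluster center point and the space formed by a cluster center point and a corresponding neighboring cluster center point has been determined as a cluster subspace, the data to be stored in the data set is stored according to the cluster subspaces.
In practice, every data item in the data set to be stored needs to be stored in the appropriate cluster subspace to facilitate subsequent retrieval, so a calculation is required for each data item to determine which cluster subspace it is stored in.
In one or more implementations of this embodiment, storing the data to be stored in the data set according to the cluster subspaces may be implemented as follows:
determine the first target cluster center point corresponding to the first data to be stored in the data set to be stored;
determine, according to the first data to be stored and the first target cluster center point, the first target neighboring cluster center point corresponding to the first target cluster center point;
store the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighboring cluster center point.
Specifically, the first data to be stored may be any data item in the data set to be stored. Every data item in the data set goes through the above steps so that its cluster subspace is determined and it is stored; that is, every data item in the data set serves once as the first data to be stored.
In practice, for each data item to be stored, the cluster center point Ci closest to it is computed, then the neighboring cluster center point Cj closest to the data item is picked from the neighboring cluster center points of Ci, and finally the data item is encoded and written into the storage space of B(i,j). It should be noted that PQ (product quantization) encoding may be used, as may other encoding schemes; this specification places no restriction on this.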
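To determine the first target cluster center point corresponding to the first data to be stored, a second distance between the first data to be stored and each of the plurality of cluster center points may be calculated; the second distances are then sorted from near to far, the second preset number of top-ranked second distances are selected, and the cluster center points corresponding to the selected second distances are determined as the first target cluster center point corresponding to the first data to be stored.
Specifically, when inserting the first data to be stored, it must be determined which cluster subspace it is inserted into. The single closest cluster subspace may be chosen, or the two closest cluster subspaces. The second preset number is therefore used for selection when determining the first target cluster center point; it can be set in advance, for example 1, 2, 3 or 4.
For example, suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set are C0, C1, C2, ..., C7. The distances between X1 and each of C0, C1, C2, ..., C7 are calculated. Suppose the distances from X1 to C0, C1, C2, C3, C4, C5, C6 and C7 are 1, 1.5, 2, 3, 4, 5, 6 and 7 respectively. These eight distances are sorted from near to far; if the single top-ranked second distance (1) is selected, C0 is determined as the first target cluster center point corresponding to X1. The first target cluster center points of the other data items {X2, X3, ..., Xm} in the data set are determined in the same way.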
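In one or more implementations of this embodiment, instead of using the second preset number to select the first target cluster center point closest to the first data to be stored, a distance threshold may be set to select it. In this case, determining the first target cluster center point corresponding to the first data to be stored may be implemented as follows:
calculate the second distance between the first data to be stored and each of the plurality of cluster center points, identify the second distances that are smaller than a second distance threshold, and determine the cluster center points corresponding to those second distances as the first target cluster center point corresponding to the first data to be stored.
Specifically, the second distance threshold can be set in advance.
For example, suppose the data set to be stored is {X1, X2, X3, ..., Xm}, the first data to be stored is X1, and the cluster center points determined from the data set are C0, C1, C2, ..., C7. The distances between X1 and each of C0, C1, C2, ..., C7 are calculated. Suppose the distances from X1 to C0, C1, C2, C3, C4, C5, C6 and C7 are 1, 1.5, 2, 3, 4, 5, 6 and 7 respectively, and the second distance threshold is 1.3. Only the distance between X1 and C0 (1) is smaller than the second distance threshold, so C0 is determined as the first target cluster center point corresponding to X1. The first target cluster center points of the other data items {X2, X3, ..., Xm} in the data set are determined in the same way.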
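To determine, according to the first data to be stored and the first target cluster center point, the first target neighboring cluster center point corresponding to the first target cluster center point, the neighboring cluster center points of the first target cluster center point may first be obtained; a third distance between the first data to be stored and each of these neighboring cluster center points is then calculated, the third distances are sorted from near to far, the third preset number of top-ranked third distances are selected, and the neighboring cluster center points corresponding to the selected third distances are determined as the first target neighboring cluster center point corresponding to the first target cluster center point.
Specifically, after the first target cluster center point has been determined for every data item in the data set to be stored, the first target neighboring cluster center point must also be determined for each of them, that is, the neighboring cluster center point that belongs to the first target cluster center point and is nearest to the first data to be stored. The third preset number may be used for selection; it can be set in advance, for example 1, 2, 3 or 4.
For example, suppose the first data to be stored is X1, its first target cluster center point is C0, and the neighboring cluster center points of C0 are C1, C2, ..., C7. The distances between X1 and each of C1, C2, ..., C7 are calculated (if they were calculated earlier, the results can simply be reused). Suppose the distances from X1 to C1, C2, C3, C4, C5, C6 and C7 are 1.5, 2, 3, 4, 5, 6 and 7 respectively. These seven distances are sorted from near to far; if the single top-ranked third distance (1.5) is selected, C1 is determined as the first target neighboring cluster center point corresponding to C0. The target neighboring cluster center points for the target cluster center points of the other data items in the data set are determined in the same way.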
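In one or more implementations of this embodiment, instead of using the third preset number to select the first target neighboring cluster center point that belongs to the first target cluster center point and is nearest to the first data to be stored, a distance threshold may be set to select it. In this case, determining the first target neighboring cluster center point according to the first data to be stored and the first target cluster center point may be implemented as follows:
first obtain the neighboring cluster center points of the first target cluster center point, then calculate the third distance between the first data to be stored and each of these neighboring cluster center points, identify the third distances that are smaller than a third distance threshold, and determine the neighboring cluster center points corresponding to those third distances as the first target neighboring cluster center point corresponding to the first target cluster center point.
Specifically, the third distance threshold can be set in advance.
For example, suppose the first data to be stored is X1, its first target cluster center point is C0, and the neighboring cluster center points of C0 are C1, C2, ..., C7. The distances between X1 and each of C1, C2, ..., C7 are calculated. Suppose the distances from X1 to C1, C2, C3, C4, C5, C6 and C7 are 1.5, 2, 3, 4, 5, 6 and 7 respectively, and the third distance threshold is 1.8. Only the distance between X1 and C1 (1.5) is smaller than the third distance threshold, so C1 is determined as the first target neighboring cluster center point corresponding to C0. The target neighboring cluster center points for the target cluster center points of the other data items in the data set are determined in the same way.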
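In practice, once the first target cluster center point and the first target neighboring cluster center point corresponding to the first data to be stored have been determined through the above steps, the first data to be stored can be stored in the cluster subspace formed by the first target cluster center point and the first target neighboring cluster center point.
For example, suppose the first data to be stored is X1, its first target cluster center point is C0, and its first target neighboring cluster center point is C1. X1 is then stored in the cluster subspace B(0,1) formed by C0 and C1. The other data items in the data set to be stored are stored one by one in the same way.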
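The assignment of one data item to its cluster subspace can be sketched as follows, assuming that the second and third preset numbers are both 1 (as in the X1 example) and that distances are Euclidean. The function name `assign_subspace` and the coordinates in the usage note are hypothetical.

```python
import math

def assign_subspace(x, centers, neighbors):
    """Return the (i, j) pair identifying the cluster subspace B(i, j) that x is stored in."""
    # first target cluster center point: the center nearest to x
    i = min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
    # first target neighboring cluster center point: nearest to x among C_i's neighbors
    j = min(neighbors[i], key=lambda c: math.dist(x, centers[c]))
    return (i, j)
```

For example, with centers `[(0, 0), (4, 0), (0, 4)]` and each center having the other two as neighbors, the point `(1, 0.2)` is nearest to center 0 and, among that center's neighbors, nearest to center 1, so it is assigned to `(0, 1)`. Because the third distances are a subset of the second distances already computed, an implementation could reuse them, as the text above points out.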
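In practice, after all the data items in the data set have been processed, all the encoded data items are stored in an inverted structure according to the cluster subspace B(i,j) to which they belong and then written into the index file. Inverted-structure storage means that the data items are arranged and stored by the identifier (id) of their cluster subspace and written into the index file.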
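A minimal in-memory sketch of the inverted layout described above, assuming each subspace id is the pair `(i, j)` and each stored entry is simply an item id; the PQ encoding and the on-disk index file format are not specified in the source and are left out here.

```python
from collections import defaultdict

def build_inverted_index(assignments):
    """Group item ids by the cluster subspace they belong to.

    assignments: iterable of (item_id, (i, j)) pairs.
    Returns a dict ordered by subspace id: {(i, j): [item_id, ...]}.
    """
    inverted = defaultdict(list)
    for item_id, subspace in assignments:
        inverted[subspace].append(item_id)
    # sort by subspace id so that each subspace's items are stored contiguously
    return dict(sorted(inverted.items()))
```

At query time only the lists of the selected subspaces need to be read, which is what makes the inverted layout convenient for this scheme.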
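It should be noted that after all the data items in the data set have been stored according to steps 202 to 206 above, a data item to be queried can be searched for in the cluster subspaces according to steps 208 to 212 below.
Step 208: determine, from the plurality of cluster center points, the second target cluster center point corresponding to the data to be queried.
In practice, for each data item to be queried, the second target cluster center points close to it must be determined from the plurality of cluster center points, which may be implemented as follows:
calculate the fourth distance between the data to be queried and each of the plurality of cluster center points;
sort the fourth distances from near to far, select the fourth preset number of top-ranked fourth distances, and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
It should be noted that, for each data item to be queried, its distance to every cluster center point Ci is calculated to determine the cluster center points close to it, namely the fourth preset number of second target cluster center points. The fourth preset number can be set in advance, for example 2, 3 or 4; after the calculated fourth distances are sorted from near to far, the fourth preset number is used to select the second target cluster center points close to the data to be queried.
For example, as shown in Fig. 3, the cluster center points are C0, C1, C2, ..., C7. For the data to be queried q, the distances between q and each of C0, C1, C2, ..., C7 are calculated. Suppose the distances from q to C0, C1, C2, C3, C4, C5, C6 and C7 are 0.5, 3, 1, 0.8, 3.2, 5, 7 and 5.5 respectively, and the fourth preset number is 3. The second target cluster center points corresponding to q are then C0, C2 and C3.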
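The selection in the example above can be checked with a few lines. The distance values are those given in the example; the function name `nearest_centers` and the dictionary representation are illustrative assumptions.

```python
# Fourth distances from the query q to each cluster center point (values from the example above)
fourth_distances = {"C0": 0.5, "C1": 3, "C2": 1, "C3": 0.8,
                    "C4": 3.2, "C5": 5, "C6": 7, "C7": 5.5}

def nearest_centers(distances, preset_number):
    """Return the preset_number center names with the smallest distances, near to far."""
    return sorted(distances, key=distances.get)[:preset_number]

second_targets = nearest_centers(fourth_distances, 3)
print(second_targets)  # ['C0', 'C3', 'C2']
```

Sorted from near to far the result is C0, C3, C2, which is the same set as the C0, C2 and C3 named in the example.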
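In one or more implementations of this embodiment, instead of using the fourth preset number to select the target cluster center points close to the data to be queried, a distance threshold may be set to select them. In this case, determining the second target cluster center points corresponding to the data to be queried from the plurality of cluster center points may be implemented as follows:
calculate the fourth distance between the data to be queried and each of the plurality of cluster center points;
identify the fourth distances that are smaller than a fourth distance threshold, and determine the cluster center points corresponding to those fourth distances as the second target cluster center points corresponding to the data to be queried.
The fourth distance threshold can be set in advance.
For example, as shown in Fig. 3, the cluster center points are C0, C1, C2, ..., C7. For the data to be queried q, the distances between q and each of C0, C1, C2, ..., C7 are calculated. Suppose the distances from q to C0, C1, C2, C3, C4, C5, C6 and C7 are 0.5, 3, 1, 0.8, 3.2, 5, 7 and 5.5 respectively, and the fourth distance threshold is 1.5. The second target cluster center points corresponding to q are then C0, C2 and C3.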
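Step 210: determine, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point.
Specifically, once the second target cluster center point corresponding to the data to be queried has been determined from the plurality of cluster center points, the target cluster subspace corresponding to the second target cluster center point is determined according to the data to be queried and the second target cluster center point.
In practice, the cluster subspaces in which the data to be queried may be stored can be picked out based on the data to be queried and the second target cluster center points from the previous step, which may be implemented as follows:
obtain the neighboring cluster center points corresponding to the second target cluster center point, and determine these neighboring cluster center points as the second target neighboring cluster center points corresponding to the second target cluster center point;
determine the cluster subspace formed by the second target cluster center point and a second target neighboring cluster center point as a target cluster subspace.
It should be noted that there may be several second target cluster center points, each with its own neighboring cluster center points. For every second target cluster center point, the corresponding second target neighboring cluster center points must therefore be determined, and from them the corresponding target cluster subspaces.
For example, suppose the second target cluster center points corresponding to the data to be queried are C0, C2 and C3. The neighboring cluster center points of C0 are C1, C2, ..., C7, so the second target neighboring cluster center points of C0 are C1, C2, ..., C7, and the target cluster subspaces of C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6) and B(0,7). The neighboring cluster center points of C2 are C0, C1, C3, C12, C13, C14 and C15, so these are its second target neighboring cluster center points, and the target cluster subspaces of C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14) and B(2,15). The neighboring cluster center points of C3 are C0, C2, C4, C16, C17, C18 and C19, so these are its second target neighboring cluster center points, and the target cluster subspaces of C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18) and B(3,19).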
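Step 212: determine, according to the data to be queried and the target cluster subspaces, the search space corresponding to the data to be queried.
Specifically, the search space is the space in which the data to be queried is finally searched for.
In one or more implementations of this embodiment, determining the search space corresponding to the data to be queried according to the data to be queried and the target cluster subspaces may be implemented as follows:
calculate the fifth distance between each target cluster subspace and the data to be queried;
sort the fifth distances from near to far, select the fifth preset number of top-ranked fifth distances, and determine the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
Here, the midpoint of the second target cluster center point and the second target neighboring cluster center point may be determined as the target point; the sixth distance between the target point and the data to be queried is then calculated and taken as the fifth distance between the target cluster subspace and the data to be queried. In other words, the distance from a target cluster subspace to the data to be queried is represented by the distance from the mean center point of the cluster center point Ci and the neighboring cluster center point Cj to the data to be queried.
It should be noted that the above steps may yield several target cluster subspaces. For each of them, its distance to the data to be queried must be determined in order to decide whether that target cluster subspace becomes part of the search space for the data to be queried.
For example, the target cluster subspaces of C0 are B(0,1), B(0,2), B(0,3), B(0,4), B(0,5), B(0,6) and B(0,7); those of C2 are B(2,0), B(2,1), B(2,3), B(2,12), B(2,13), B(2,14) and B(2,15); and those of C3 are B(3,0), B(3,2), B(3,4), B(3,16), B(3,17), B(3,18) and B(3,19). The distances between the data to be queried q and all of these target cluster subspaces are calculated, and the several nearest target cluster subspaces are selected as the final search space. As shown in Fig. 3, B(0,2), B(0,3), B(2,0) and B(3,0) are selected as the final search space, and the data to be queried q is searched for in this search space.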
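Steps 208 to 212 can be combined into one short sketch. It assumes Euclidean distance, preset-number selection at both stages, and the midpoint rule described above; the function name `search_space`, its parameter names, and the coordinates in the usage note are hypothetical.

```python
import math

def search_space(q, centers, neighbors, n_centers, n_subspaces):
    """Return the n_subspaces cluster subspaces (i, j) nearest to the query q."""
    # step 208: second target cluster center points = the n_centers centers nearest to q
    targets = sorted(range(len(centers)), key=lambda c: math.dist(q, centers[c]))[:n_centers]
    scored = []
    for i in targets:
        # step 210: every neighbor j of a target center i yields a target cluster subspace B(i, j)
        for j in neighbors[i]:
            # step 212: the fifth distance is the distance from q to the midpoint of C_i and C_j
            mid = tuple((a + b) / 2 for a, b in zip(centers[i], centers[j]))
            scored.append((math.dist(q, mid), (i, j)))
    scored.sort()  # near to far; ties are broken by the (i, j) pair
    return [subspace for _, subspace in scored[:n_subspaces]]
```

With centers `[(0, 0), (4, 0), (0, 4), (10, 10)]`, neighbor lists `{0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [1, 2]}` and the query `(1.5, 0.5)`, calling `search_space(q, centers, neighbors, 2, 2)` selects `(0, 1)` and `(1, 0)`. Under the midpoint rule B(i,j) and B(j,i) have the same distance to the query, which is consistent with Fig. 3 selecting B(0,2) together with B(2,0) and B(0,3) together with B(3,0).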
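The data storage method provided in this specification clusters the data set to be stored to determine a plurality of cluster center points; then, for each of these cluster center points, determines the corresponding neighboring cluster center points and determines the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace; and then stores the data to be stored according to the cluster subspaces. Later, when data needs to be queried, the second target cluster center point corresponding to the data to be queried is determined from the plurality of cluster center points; the target cluster subspace corresponding to the second target cluster center point is then determined according to the data to be queried and the second target cluster center point; and the search space corresponding to the data to be queried is further determined according to the data to be queried and the target cluster subspace, and the search is performed in that search space.
In this way, once the cluster center points are determined, the data to be stored is not stored directly in the first-layer space centered on a cluster center point. Instead, the neighboring cluster center points close to that cluster center point are further determined, and the data is stored in a cluster subspace formed by the cluster center point and a corresponding neighboring cluster center point. This combines the clustering algorithm with the neighbor-graph idea: on top of K-means clustering, a neighbor-graph relationship among second-layer subspaces is introduced, and each first-layer space is further divided into finer second-layer cluster subspaces, which improves the retrieval accuracy of the entire index. Moreover, compared with a two-layer hierarchical clustering structure, this second-layer subspace partitioning avoids the overhead of multi-layer clustering and builds the index faster; compared with a single-layer K-means clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
Next, the beneficial effects of the data storage method provided in this specification are illustrated with reference to Fig. 1 and Fig. 3:
Suppose the data to be queried is q. With the prior art shown in Fig. 1, the data to be stored is clustered purely with the K-means clustering algorithm, yielding eight different K-means cluster center points C0 to C7, each of which defines a Voronoi space. When q lies on a space boundary, the spaces C0, C2 and C3 all have to be searched. With the data storage method provided in this specification, as shown in Fig. 3, after the eight K-means cluster center points C0 to C7 are determined, the neighboring cluster center points of C0 to C7 are further determined, and the space is partitioned a second time into second-layer cluster subspaces. For the same data to be queried q, the method only needs to search the cluster subspaces B(0,2), B(0,3), B(2,0) and B(3,0), which greatly reduces the search range and thus improves search efficiency and accuracy.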
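Fig. 4 shows a flowchart of a data query method according to an embodiment of this specification, which specifically includes the following steps:
Step 402: determine, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried. It should be noted that step 402 is implemented in the same way as step 208 above, and the details are not repeated here.
Step 404: determine, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point.
It should be noted that step 404 is implemented in the same way as step 210 above, and the details are not repeated here.
Step 406: determine, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried.
It should be noted that step 406 is implemented in the same way as step 212 above, and the details are not repeated here.
The data query method provided in this specification determines, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried; then determines the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point; and further determines the search space corresponding to the data to be queried according to the data to be queried and the target cluster subspace, and searches in that search space. This combines the clustering algorithm with the neighbor-graph idea: on top of K-means clustering, a neighbor-graph relationship among second-layer subspaces is introduced, and each first-layer space is further divided into finer second-layer cluster subspaces. When data is subsequently queried, the search range can be narrowed, which improves the retrieval efficiency and accuracy of the entire index.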
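Corresponding to the above method embodiments, this specification also provides an embodiment of a data storage apparatus. Fig. 5 shows a schematic structural diagram of a data storage apparatus according to an embodiment of this specification. As shown in Fig. 5, the apparatus includes:
a first determining module 502, configured to cluster a data set to be stored to determine a plurality of cluster center points;
a second determining module 504, configured to, for each of the plurality of cluster center points, determine the corresponding neighboring cluster center points according to the cluster center point, and determine the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
a storage module 506, configured to store the data to be stored in the data set to be stored according to the cluster subspaces.
In one or more implementations of this embodiment, the second determining module 504 is further configured to:
take the cluster center point as the first cluster center point, and take the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
calculate the first distance between the first cluster center point and each second cluster center point;
sort the first distances from near to far, select the first preset number of top-ranked first distances, and determine the second cluster center points corresponding to the selected first distances as the neighboring cluster center points of the first cluster center point.
In one or more implementations of this embodiment, the storage module 506 is further configured to:
determine the first target cluster center point corresponding to the first data to be stored in the data set to be stored;
determine, according to the first data to be stored and the first target cluster center point, the first target neighboring cluster center point corresponding to the first target cluster center point;
store the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighboring cluster center point.
In one or more implementations of this embodiment, the storage module 506 is further configured to:
calculate the second distance between the first data to be stored and each of the plurality of cluster center points;
sort the second distances from near to far, select the second preset number of top-ranked second distances, and determine the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
In one or more implementations of this embodiment, the storage module 506 is further configured to:
obtain the neighboring cluster center points corresponding to the first target cluster center point;
calculate the third distance between the first data to be stored and each of these neighboring cluster center points;
sort the third distances from near to far, select the third preset number of top-ranked third distances, and determine the neighboring cluster center points corresponding to the selected third distances as the first target neighboring cluster center point corresponding to the first target cluster center point.
In one or more implementations of this embodiment, the apparatus further includes:
a third determining module, configured to determine, from the plurality of cluster center points, the second target cluster center point corresponding to the data to be queried;
a fourth determining module, configured to determine, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point;
a fifth determining module, configured to determine, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried.
In this specification, once the cluster center points are determined, the data to be stored is not stored directly in the first-layer space centered on a cluster center point. Instead, the neighboring cluster center points close to that cluster center point are further determined, and the data is stored in a cluster subspace formed by the cluster center point and a corresponding neighboring cluster center point. This combines the clustering algorithm with the neighbor-graph idea: on top of K-means clustering, a neighbor-graph relationship among second-layer subspaces is introduced, and each first-layer space is further divided into finer second-layer cluster subspaces, which improves the retrieval accuracy of the entire index. Moreover, compared with a two-layer hierarchical clustering structure, this second-layer subspace partitioning avoids the overhead of multi-layer clustering and builds the index faster; compared with a single-layer K-means clustering structure, it improves retrieval accuracy without introducing additional index construction or storage overhead.
The above is an illustrative solution of a data storage apparatus of this embodiment. It should be noted that the technical solution of the data storage apparatus is based on the same concept as that of the data storage method described above; for details not described in the technical solution of the data storage apparatus, refer to the description of the technical solution of the data storage method.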
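Corresponding to the above method embodiments, this specification also provides an embodiment of a data query apparatus. Fig. 6 shows a schematic structural diagram of a data query apparatus according to an embodiment of this specification. As shown in Fig. 6, the apparatus includes:
a third determining module 602, configured to determine, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried;
a fourth determining module 604, configured to determine, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point;
a fifth determining module 606, configured to determine, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried.
In one or more implementations of this embodiment, the third determining module 602 is further configured to:
calculate the fourth distance between the data to be queried and each of the plurality of cluster center points;
sort the fourth distances from near to far, select the fourth preset number of top-ranked fourth distances, and determine the cluster center points corresponding to the selected fourth distances as the second target cluster center points corresponding to the data to be queried.
In one or more implementations of this embodiment, the fourth determining module 604 is further configured to:
obtain the neighboring cluster center points corresponding to the second target cluster center point, and determine these neighboring cluster center points as the second target neighboring cluster center points corresponding to the second target cluster center point;
determine the cluster subspace formed by the second target cluster center point and a second target neighboring cluster center point as the target cluster subspace.
In one or more implementations of this embodiment, the fifth determining module 606 is further configured to:
calculate the fifth distance between the target cluster subspace and the data to be queried;
sort the fifth distances from near to far, select the fifth preset number of top-ranked fifth distances, and determine the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
In one or more implementations of this embodiment, the fifth determining module 606 is further configured to:
determine the midpoint of the second target cluster center point and the second target neighboring cluster center point as the target point;
calculate the sixth distance between the target point and the data to be queried, and determine the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.
The data query apparatus provided in this specification determines, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried; then determines the target cluster subspace corresponding to the second target cluster center point according to the data to be queried and the second target cluster center point; and further determines the search space corresponding to the data to be queried according to the data to be queried and the target cluster subspace, and searches in that search space. This combines the clustering algorithm with the neighbor-graph idea: on top of K-means clustering, a neighbor-graph relationship among second-layer subspaces is introduced, and each first-layer space is further divided into finer second-layer subspaces. When data is queried, the search range can be narrowed, which improves the retrieval efficiency and accuracy of the entire index.
The above is an illustrative solution of a data query apparatus of this embodiment. It should be noted that the technical solution of the data query apparatus is based on the same concept as that of the data query method described above; for details not described in the technical solution of the data query apparatus, refer to the description of the technical solution of the data query method.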
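Fig. 7 shows a structural block diagram of a computing device 700 according to an embodiment of this specification.
The components of the computing device 700 include, but are not limited to, a memory 710 and a processor 720. The processor 720 is connected to the memory 710 via a bus 730, and a database 750 is used to store data.
The computing device 700 also includes an access device 740, which enables the computing device 700 to communicate via one or more networks 760. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.
In an embodiment of this specification, the above components of the computing device 700 and other components not shown in Fig. 7 may also be connected to one another, for example via a bus. It should be understood that the structural block diagram of the computing device shown in Fig. 7 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch or smart glasses) or another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 700 may also be a mobile or stationary server.
The processor 720 is configured to execute the following computer-executable instructions to implement the following method:
clustering a data set to be stored to determine a plurality of cluster center points;
for each of the plurality of cluster center points, determining the corresponding neighboring cluster center points according to the cluster center point, and determining the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
storing the data to be stored in the data set to be stored according to the cluster subspaces.
The above is an illustrative solution of a computing device of this embodiment. It should be noted that the technical solution of the computing device is based on the same concept as that of the data storage method described above; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the data storage method.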
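Fig. 8 shows a structural block diagram of a computing device 800 according to an embodiment of this specification.
The components of the computing device 800 include, but are not limited to, a memory 810 and a processor 820. The processor 820 is connected to the memory 810 via a bus 830, and a database 850 is used to store data.
The computing device 800 also includes an access device 840, which enables the computing device 800 to communicate via one or more networks 860. Examples of these networks include the public switched telephone network (PSTN), a local area network (LAN), a wide area network (WAN), a personal area network (PAN), or a combination of communication networks such as the Internet. The access device 840 may include one or more of any type of wired or wireless network interface (for example, a network interface card (NIC)), such as an IEEE 802.11 wireless local area network (WLAN) interface, a Worldwide Interoperability for Microwave Access (Wi-MAX) interface, an Ethernet interface, a universal serial bus (USB) interface, a cellular network interface, a Bluetooth interface, a near field communication (NFC) interface, and so on.
In an embodiment of this specification, the above components of the computing device 800 and other components not shown in Fig. 8 may also be connected to one another, for example via a bus. It should be understood that the structural block diagram of the computing device shown in Fig. 8 is for illustrative purposes only and does not limit the scope of this specification. Those skilled in the art may add or replace other components as needed.
The computing device 800 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (for example, a tablet computer, a personal digital assistant, a laptop computer, a notebook computer, a netbook), a mobile phone (for example, a smartphone), a wearable computing device (for example, a smart watch or smart glasses) or another type of mobile device, or a stationary computing device such as a desktop computer or PC. The computing device 800 may also be a mobile or stationary server.
The processor 820 is configured to execute the following computer-executable instructions to implement the following method:
determining, from a plurality of cluster center points, the second target cluster center point corresponding to the data to be queried;
determining, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point;
determining, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried.
The above is an illustrative solution of a computing device of this embodiment. It should be noted that the technical solution of the computing device is based on the same concept as that of the data query method described above; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the data query method.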
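An embodiment of this specification also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the operation steps of the data storage method described above.
An embodiment of this specification also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the operation steps of the data query method described above.
The above is an illustrative solution of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium is based on the same concept as those of the data storage method and the data query method described above; for details not described in the technical solution of the storage medium, refer to the descriptions of the technical solutions of the data storage method and the data query method.
Specific embodiments of this specification have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
The computer instructions include computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content of the computer-readable medium may be appropriately added to or reduced according to the requirements of legislation and patent practice in a jurisdiction; for example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art should be aware that this specification is not limited by the described order of actions, because according to this specification some steps may be performed in other orders or simultaneously. Those skilled in the art should also be aware that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily all required by this specification.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of other embodiments.
The preferred embodiments of this specification disclosed above are only intended to help explain this specification. The optional embodiments do not describe all the details exhaustively, nor do they limit the invention to the specific implementations described. Obviously, many modifications and variations can be made based on the content of this specification. These embodiments were selected and described in detail in order to better explain the principles and practical applications of this specification, so that those skilled in the art can understand and use it well. This specification is limited only by the claims and their full scope and equivalents.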

Claims (17)

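  1. A data storage method, the method comprising:
    clustering a data set to be stored to determine a plurality of cluster center points;
    for each of the plurality of cluster center points, determining the corresponding neighboring cluster center points according to the cluster center point, and determining the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
    storing the data to be stored in the data set to be stored according to the cluster subspaces.
  2. The data storage method according to claim 1, wherein determining the corresponding neighboring cluster center points according to the cluster center point comprises:
    taking the cluster center point as a first cluster center point, and taking the cluster center points other than the first cluster center point among the plurality of cluster center points as second cluster center points;
    calculating a first distance between the first cluster center point and each second cluster center point;
    sorting the first distances from near to far, selecting the first preset number of top-ranked first distances, and determining the second cluster center points corresponding to the selected first distances as the neighboring cluster center points of the first cluster center point.
  3. The data storage method according to claim 1, wherein storing the data to be stored in the data set to be stored according to the cluster subspaces comprises:
    determining a first target cluster center point corresponding to first data to be stored in the data set to be stored;
    determining, according to the first data to be stored and the first target cluster center point, a first target neighboring cluster center point corresponding to the first target cluster center point;
    storing the first data to be stored in the cluster subspace formed by the first target cluster center point and the first target neighboring cluster center point.
  4. The data storage method according to claim 3, wherein determining the first target cluster center point corresponding to the first data to be stored in the data set to be stored comprises:
    calculating a second distance between the first data to be stored and each of the plurality of cluster center points;
    sorting the second distances from near to far, selecting the second preset number of top-ranked second distances, and determining the cluster center points corresponding to the selected second distances as the first target cluster center point corresponding to the first data to be stored.
  5. The data storage method according to claim 3, wherein determining, according to the first data to be stored and the first target cluster center point, the first target neighboring cluster center point corresponding to the first target cluster center point comprises:
    obtaining the neighboring cluster center points corresponding to the first target cluster center point;
    calculating a third distance between the first data to be stored and each of these neighboring cluster center points;
    sorting the third distances from near to far, selecting the third preset number of top-ranked third distances, and determining the neighboring cluster center points corresponding to the selected third distances as the first target neighboring cluster center point corresponding to the first target cluster center point.
  6. The data storage method according to claim 1, further comprising, after storing the data to be stored in the data set to be stored according to the cluster subspaces:
    determining, from the plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.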
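  7. A data query method, the method comprising:
    determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  8. The data query method according to claim 7, wherein determining, from the plurality of cluster center points, the second target cluster center point corresponding to the data to be queried comprises:
    calculating a fourth distance between the data to be queried and each of the plurality of cluster center points;
    sorting the fourth distances from near to far, selecting the fourth preset number of top-ranked fourth distances, and determining the cluster center points corresponding to the selected fourth distances as the second target cluster center point corresponding to the data to be queried.
  9. The data query method according to claim 7, wherein determining, according to the data to be queried and the second target cluster center point, the target cluster subspace corresponding to the second target cluster center point comprises:
    obtaining the neighboring cluster center points corresponding to the second target cluster center point, and determining these neighboring cluster center points as the second target neighboring cluster center points corresponding to the second target cluster center point;
    determining the cluster subspace formed by the second target cluster center point and a second target neighboring cluster center point as the target cluster subspace.
  10. The data query method according to claim 9, wherein determining, according to the data to be queried and the target cluster subspace, the search space corresponding to the data to be queried comprises:
    calculating a fifth distance between the target cluster subspace and the data to be queried;
    sorting the fifth distances from near to far, selecting the fifth preset number of top-ranked fifth distances, and determining the target cluster subspaces corresponding to the selected fifth distances as the search space corresponding to the data to be queried.
  11. The data query method according to claim 10, wherein calculating the fifth distance between the target cluster subspace and the data to be queried comprises:
    determining the midpoint of the second target cluster center point and the second target neighboring cluster center point as a target point;
    calculating a sixth distance between the target point and the data to be queried, and determining the sixth distance as the fifth distance between the target cluster subspace and the data to be queried.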
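  12. A data storage apparatus, the apparatus comprising:
    a first determining module, configured to cluster a data set to be stored to determine a plurality of cluster center points;
    a second determining module, configured to, for each of the plurality of cluster center points, determine the corresponding neighboring cluster center points according to the cluster center point, and determine the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
    a storage module, configured to store the data to be stored in the data set to be stored according to the cluster subspaces.
  13. A data query apparatus, the apparatus comprising:
    a third determining module, configured to determine, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    a fourth determining module, configured to determine, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    a fifth determining module, configured to determine, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  14. A computing device, comprising:
    a memory and a processor;
    wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
    clustering a data set to be stored to determine a plurality of cluster center points;
    for each of the plurality of cluster center points, determining the corresponding neighboring cluster center points according to the cluster center point, and determining the space formed by the cluster center point and a corresponding neighboring cluster center point as a cluster subspace;
    storing the data to be stored in the data set to be stored according to the cluster subspaces.
  15. A computing device, comprising:
    a memory and a processor;
    wherein the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the following method:
    determining, from a plurality of cluster center points, a second target cluster center point corresponding to data to be queried;
    determining, according to the data to be queried and the second target cluster center point, a target cluster subspace corresponding to the second target cluster center point;
    determining, according to the data to be queried and the target cluster subspace, a search space corresponding to the data to be queried.
  16. A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the data storage method according to any one of claims 1 to 6.
  17. A computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the data query method according to any one of claims 7 to 11.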
PCT/CN2021/119760 2020-09-27 2021-09-23 Data storage method and apparatus, and data query method and apparatus WO2022063150A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011035973.0 2020-09-27
CN202011035973.0A CN113297331B (zh) 2020-09-27 2020-09-27 Data storage method and apparatus, and data query method and apparatus

Publications (1)

Publication Number Publication Date
WO2022063150A1 true WO2022063150A1 (zh) 2022-03-31

Family

ID=77318246

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/119760 WO2022063150A1 (zh) 2020-09-27 2021-09-23 Data storage method and apparatus, and data query method and apparatus

Country Status (2)

Country Link
CN (1) CN113297331B (zh)
WO (1) WO2022063150A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113297331B (zh) * 2020-09-27 2022-09-09 阿里云计算有限公司 数据存储方法及装置、数据查询方法及装置
CN115357609B (zh) * 2022-10-24 2023-01-13 深圳比特微电子科技有限公司 物联网数据的处理方法、装置、设备和介质

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049514A (zh) * 2012-12-14 2013-04-17 杭州淘淘搜科技有限公司 一种基于分层聚类的均衡图像聚类方法
CN103324705A (zh) * 2013-06-17 2013-09-25 中国科学院深圳先进技术研究院 大规模向量场数据处理方法
CN105912611A (zh) * 2016-04-05 2016-08-31 中国科学技术大学 一种基于cnn的快速图像检索方法
US20180032579A1 (en) * 2016-07-28 2018-02-01 Fujitsu Limited Non-transitory computer-readable recording medium, data search method, and data search device
CN107818147A (zh) * 2017-10-19 2018-03-20 大连大学 基于Voronoi图的分布式时空索引***
CN109271427A (zh) * 2018-10-17 2019-01-25 辽宁大学 一种基于近邻密度和流形距离的聚类方法
CN110909197A (zh) * 2019-11-04 2020-03-24 深圳力维智联技术有限公司 一种高维特征的处理方法和装置
CN111310809A (zh) * 2020-02-04 2020-06-19 重庆亿创西北工业技术研究院有限公司 一种数据聚类方法、装置、计算机设备和存储介质
CN113297331A (zh) * 2020-09-27 2021-08-24 阿里云计算有限公司 数据存储方法及装置、数据查询方法及装置

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868414B (zh) * 2016-05-03 2019-03-26 湖南工业大学 一种聚类分离的分布式索引方法
CN108629345B (zh) * 2017-03-17 2021-07-30 北京京东尚科信息技术有限公司 高维图像特征匹配方法和装置
CN110889424B (zh) * 2018-09-11 2023-06-30 阿里巴巴集团控股有限公司 向量索引建立方法及装置和向量检索方法及装置

Also Published As

Publication number Publication date
CN113297331B (zh) 2022-09-09
CN113297331A (zh) 2021-08-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21871519

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21871519

Country of ref document: EP

Kind code of ref document: A1