CN117235080A - Index generation and similarity search method and device for large-scale high-dimensional data - Google Patents

Index generation and similarity search method and device for large-scale high-dimensional data

Info

Publication number
CN117235080A
CN117235080A (application CN202311525045.6A)
Authority
CN
China
Prior art keywords
index
tree
leaf
index tree
given query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311525045.6A
Other languages
Chinese (zh)
Inventor
张军
周进登
崔永花
沈晓琦
冯桂亚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
93184 Unit Of Chinese Pla
Original Assignee
93184 Unit Of Chinese Pla
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 93184 Unit Of Chinese Pla filed Critical 93184 Unit Of Chinese Pla
Priority to CN202311525045.6A priority Critical patent/CN117235080A/en
Publication of CN117235080A publication Critical patent/CN117235080A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application relates to the technical field of data processing, and particularly discloses an index generation and similarity search method and device for large-scale high-dimensional data, wherein the index generation and similarity search method for large-scale high-dimensional data comprises the following steps: generating an index tree from the existing data sequences, and writing the information of the index tree into related files; if a given query is received, traversing the index tree to obtain a plurality of candidate categories adjacent to the given query; and processing the plurality of candidate categories in parallel with multiple threads to obtain the target result closest to the given query. The method solves the technical problem that existing approaches cannot perform similarity search over large-scale high-dimensional data both efficiently and accurately.

Description

Index generation and similarity search method and device for large-scale high-dimensional data
Technical Field
The application relates to the technical field of data processing, in particular to an index generation and similarity search method and device for large-scale high-dimensional data.
Background
Today, high-dimensional data is ubiquitous, with data volumes often reaching the TB level and feature dimensions reaching several thousand. Extracting valuable information from such large-scale data requires sophisticated data analysis methods and techniques, and doing so efficiently remains a challenging task; this has given rise to the research area of scalable data science. Similarity search is a key technique in this field. Similarity search finds the data points in a data set that are closest to the queried data, effectively improving data processing speed, and can be applied to tasks such as missing-data completion, information retrieval, information classification and outlier detection.
For the similarity search problem, early work directly applied brute-force search, but computing the similarity between the query vector and every remaining point makes the resource consumption prohibitive. Various approximation methods have therefore been proposed, trading some accuracy for computational efficiency. These methods can be divided into two classes according to the guarantees they provide: approximate solutions that provide no quality guarantee on the result are ng-approximate solutions, while solutions that can provide quality guarantees are δ-ε-approximate solutions, where ε denotes the approximation error and δ denotes the probability that the approximation error is not exceeded. Among the methods computing approximate solutions, tree-based methods index most efficiently in search scenarios other than ng-approximation, so their application scenarios are limited to a certain extent. Graph-based methods have shown good performance on the ng-approximate problems of current application scenarios, but building the graph structure consumes a significant amount of time and space resources when processing large data sets.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides an index generation and similarity search method and device for large-scale high-dimensional data, which are used for at least solving the technical problem that similarity search over large-scale high-dimensional data cannot be performed both efficiently and accurately by existing methods.
According to an aspect of the embodiment of the present application, there is provided an index generation and similarity search method for large-scale high-dimensional data, including: generating an index tree by using the existing data sequence, and writing the information of the index tree into a related file; traversing an index tree to obtain a plurality of candidate categories adjacent to a given query if the given query is received; and processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
Preferably, the generating the index tree using the existing data sequence includes: dividing the existing data sequence by using a hierarchical tree structure to form a plurality of leaves; the graph structure is built in parallel in each leaf of the hierarchical tree structure, generating an index tree.
Preferably, the dividing the existing data sequence by using the hierarchical tree structure to form a plurality of leaves includes: reading the data sequences from disk into memory in batches by using the hierarchical tree structure, and inserting the data cached in memory into the tree structure; and traversing the binary index tree with threads to determine the leaf corresponding to each data sequence.
Preferably, the constructing the graph structure in parallel in each leaf of the hierarchical tree structure to generate the index tree includes: in the case where the master coordinator thread initializes a plurality of leaf coordinator threads, reading leaf data with each leaf coordinator thread and inserting the leaf data into the graph structure; an index tree is generated using the respective graph structures of all the leaves.
Preferably, the writing the information of the index tree to a related file includes: writing the information of the index tree into HTree, LRDFile and LSDFile files, wherein the HTree file is used for storing the generated tree structure, the LRDFile is used for storing the original data sequences, and the LSDFile is used for storing summary information.
Preferably, traversing the index tree results in a plurality of candidate categories adjacent to the given query, including: and traversing the index tree, and searching a plurality of candidate leaves with similarity meeting preset conditions with the given query.
Preferably, the processing the candidate categories in parallel by using a plurality of threads to obtain the target result closest to the given query includes: performing beam search on a plurality of leaves in parallel by using a plurality of threads to obtain the target result closest in distance to the given query.
According to another aspect of the embodiment of the present application, there is also provided an index generation and similarity search device for large-scale high-dimensional data, including: the index generation module is used for generating an index tree by utilizing the existing data sequence and writing the information of the index tree into related files; the query reply module is used for traversing the index tree to obtain a plurality of candidate categories adjacent to the given query under the condition that the given query is received; and processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
According to yet another aspect of the embodiments of the present application, there is also provided a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the above-described index generation and similarity search method for large-scale high-dimensional data when run.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the index generation and similarity search method for large-scale high-dimensional data described above by using the computer program.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a flow diagram of an alternative index generation and similarity search method for large-scale high-dimensional data according to an embodiment of the present application;
FIG. 2 is a flow diagram of an alternative method for index generation and similarity search for large-scale high-dimensional data in accordance with an embodiment of the present application;
fig. 3 is a schematic diagram of HNSW according to an embodiment of the present application;
FIG. 4 is a schematic illustration of an EAPCA index generation flow in accordance with an embodiment of the present application;
FIG. 5 is a schematic illustration of an EAPCA query reply flow in accordance with an embodiment of the application;
FIG. 6 is a schematic diagram of an alternative index generation and similarity search device for large-scale high-dimensional data according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, a technical solution in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, shall fall within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to one aspect of the embodiment of the application, an index generation and similarity search method for large-scale high-dimensional data is provided, and is widely applied to index generation and similarity search for large-scale high-dimensional data.
As an optional implementation manner, as shown in fig. 1, the method for generating the index and searching the similarity for the large-scale high-dimensional data includes:
s102, generating an index tree by using the existing data sequence, and writing information of the index tree into a related file;
s104, under the condition that a given query is received, traversing the index tree to obtain a plurality of candidate categories adjacent to the given query;
s106, processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
As an alternative embodiment, generating an index tree using an existing data sequence includes: dividing the existing data sequence by using a hierarchical tree structure to form a plurality of leaves; the graph structure is built in parallel in each leaf of the hierarchical tree structure, generating an index tree.
As an alternative embodiment, the partitioning of the existing data sequence using a hierarchical tree structure to form a plurality of leaves includes: reading the data sequences from disk into memory in batches by using the hierarchical tree structure, and inserting the data cached in memory into the tree structure; and traversing the binary index tree with threads to determine the leaf corresponding to each data sequence.
As an alternative embodiment, constructing the graph structure in parallel in each leaf of the hierarchical tree structure, generating the index tree, includes: in the case where the master coordinator thread initializes a plurality of leaf coordinator threads, reading leaf data with each leaf coordinator thread and inserting the leaf data into the graph structure; an index tree is generated using the respective graph structures of all the leaves.
As an alternative embodiment, writing the information of the index tree to the related files includes: writing the information of the index tree into HTree, LRDFile and LSDFile files, wherein the HTree file is used to store the generated tree structure, the LRDFile is used to store the original data sequences, and the LSDFile is used to store summary information.
As an alternative embodiment, traversing the index tree to obtain a plurality of candidate categories adjacent to a given query includes: traversing the index tree, and searching a plurality of candidate leaves with similarity meeting preset conditions with the given query.
As an alternative embodiment, processing multiple candidate categories in parallel with multiple threads to obtain a target result closest to a given query includes: performing beam search on the plurality of leaves in parallel by using a plurality of threads to obtain the target result closest in distance to the given query.
Specifically, the index generation and similarity search method for large-scale high-dimensional data aims to obtain good performance both in index construction and in in-memory approximate search. The application flow, which is not limited to the one shown in fig. 2, adopts the following steps:
step one: an index tree is generated. Reading in the existing data sequence by utilizing index generation;
step two: and writing the index tree into the related file. Writing the related information of the generated index tree into related files, wherein the related files are not limited to three files which do not comprise HTree, LRDFile and LSDFile;
step three: a given query is obtained.
Step four: traversing the index tree to get nearest neighbors to a given queryAnd a result.
Step five: the candidate classes are processed in parallel by multiple threads, and the result nearest to the given query is taken as the target class.
The index generation and similarity search method mainly comprises, but is not limited to, two parts: index generation and query reply. The index generation specifically includes the following parts:
in a first step, the existing dataset is partitioned using DLS (deep level search, hierarchical tree structure) to form a plurality of clusters, i.e., leaves. Existing data sequencesNot limited to ordered point sets, where +.>,/>Wherein->Representing the length of the data sequence. The DLS will read in the data sequence and generate a tree structure in which each leaf is considered a cluster. In the process of generating its index structure by DLS, a double-cache mechanism is employed. DLS reads data from disk into memory in batches, as wellData that has been previously cached in memory is inserted into the tree structure. Each of the memories inserts data vectors into the tree structure in parallel. The individual threads running in parallel traverse the binary index tree for the appropriate leaf, following the partitioning strategy of the accessed node at the point in time when each of the binary trees selects either the left node or the right node. Eventually, all data has its own EAPCA classification, and the data vectors in each node are partitioned using the same policy.
In a second step, graph structures are constructed in parallel in each leaf from the first step: one HNSW (hierarchical navigable small world) graph is built in each leaf in parallel. First, the master coordinator thread initializes a number of leaf coordinator threads (LeafCoordinator). Each leaf coordinator selects one leaf for which a graph structure has not yet been constructed and creates a number of leaf worker threads (LeafWorker). Both the leaf coordinator thread and the leaf worker threads participate in building the graph within their corresponding leaf: each of them reads one data sequence from the dataset and inserts it into the graph. After the graph is built, the leaf coordinator writes the graph into its own disk file and selects a new leaf to process. Once the graphs of all index leaves have been constructed, index generation terminates.
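The per-leaf parallel graph construction can be sketched with a thread pool playing the leaf-coordinator role. A full HNSW is out of scope here, so each "graph" is a brute-force kNN adjacency list standing in for it; `build_leaf_graph` and `build_all_graphs` are illustrative names, not the patent's implementation.

```python
# Sketch of the second stage: build one graph per leaf in parallel.
# A brute-force kNN adjacency list stands in for the HNSW graph.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def build_leaf_graph(points, k=2):
    """Adjacency list: each point links to its k nearest leaf-mates."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)        # a point is not its own neighbor
    return {i: list(np.argsort(d[i])[:k]) for i in range(len(points))}

def build_all_graphs(leaves, n_coordinators=4, k=2):
    # Each pool worker plays the LeafCoordinator role for one leaf at a time.
    with ThreadPoolExecutor(max_workers=n_coordinators) as pool:
        return list(pool.map(lambda pts: build_leaf_graph(pts, k), leaves))
```

Because each leaf's graph is independent, the leaves parallelize trivially; this is the property the master-coordinator/leaf-coordinator design exploits.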
More specifically, the generation process of the DLS tree is introduced. Each node N in the DLS tree comprises: the number of all data sequences stored in the leaf descendants of N (if N is a leaf, this is the number of all data sequences in N); the segmentation SG_N = {r_1, ..., r_m} of N, where r_i is the right endpoint of the i-th segment sg_i and 0 = r_0 < r_1 < ... < r_m = L, with L the length of the data series; and the summary information Z_N = {z_1, ..., z_m} of the node, where z_i = (μ_i^min, μ_i^max, σ_i^min, σ_i^max) is the summary information of segment sg_i, recording the minimum and maximum of the mean and of the standard deviation of the series over that segment.
A leaf node is associated with file positions posLRD and posLSD: posLRD indicates the position of the leaf node's raw data in the related file, and posLSD indicates the position of the leaf node's summaries in the related file. An internal node contains pointers to its child nodes and a partitioning policy. When a leaf node N exceeds its capacity th (the leaf threshold), it splits into two nodes N_L and N_R; these become leaf nodes, and N becomes an internal node. In particular, DLS utilizes a splitting strategy that allows the summary information of a node to be refined along both the horizontal (H-split) and the vertical (V-split) dimension. A node N splits by selecting one of its m segments sg_i and distributing its series between N_L and N_R according to the mean (or standard deviation) of each series over sg_i.
In a horizontal H-split, N_L and N_R have the same segments as N, while in a vertical V-split the child nodes have one additional segment. When partitioning with the mean in an H-split over segment sg_i, a data sequence whose mean over sg_i lies in the range [μ_i^min, (μ_i^min + μ_i^max)/2) is stored in N_L, while one whose mean lies in [(μ_i^min + μ_i^max)/2, μ_i^max] is stored in N_R. The standard deviation can also be used for partitioning in an H-split, in exactly the manner just described, with the mean replaced by the value of the standard deviation. A V-split first splits sg_i into two new segments and then applies an H-split on one of the segments.
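The mean-based H-split above can be sketched directly: compute each series' mean over the chosen segment, cut the node's mean range at its midpoint, and route each series to the left or right child accordingly. The function name `h_split` is an illustrative assumption.

```python
# Sketch of the mean-based H-split: the node's mean range over one chosen
# segment is cut at its midpoint, and each series goes to the left or right
# child depending on which half its segment mean falls into.
import numpy as np

def h_split(series_list, seg_start, seg_end):
    means = [np.mean(s[seg_start:seg_end]) for s in series_list]
    mu_min, mu_max = min(means), max(means)
    mid = (mu_min + mu_max) / 2                      # midpoint of the mean range
    left = [s for s, m in zip(series_list, means) if m < mid]
    right = [s for s, m in zip(series_list, means) if m >= mid]
    return left, right
```

Swapping `np.mean` for `np.std` in the first line yields the standard-deviation variant of the H-split described above.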
The workflow of the index generation method in the DLS tree, as shown in fig. 3, includes three stages: reading, inserting and flushing. One thread acts as the coordinator and a plurality of InsertWorker threads act as workers, and the raw data is processed based on a double-buffering scheme called DBuffer. DLS uses DBuffer to interleave the I/O cost of reading raw data from disk with the CPU-intensive operations of inserting into the index tree.
During the read phase, the coordinator loads the data set into memory in batches, reading one batch of data from the original file into the first part of the DBuffer. When the first part of the DBuffer is full, it is handed over for insertion, and the next incoming data is stored into the second part of the DBuffer so that reading can proceed in advance.
In the insert phase, the InsertWorker threads read unprocessed data sequences from the buffer part handed over by the coordinator and insert them into the tree. Each InsertWorker traverses the index tree to find the appropriate leaf node, selecting the left or the right child at each visited node (each node has two branches) according to the partitioning on that node. Each leaf in the tree holds an array of pointers to its raw data stored in the HBuffer; each InsertWorker has its own area in the HBuffer and records the data series it inserts there. When the number of full areas reaches the flush threshold, the flush phase is executed.
In the flush phase, the data in the HBuffer is flushed to disk. This is accomplished by designating one of the insert workers as the FlushCoordinator, which ranks above the other workers and takes on the flush task. Once the FlushCoordinator appears, the remaining insert workers wait until it completes the corresponding work. The two roles are carefully synchronized during this process: the insert workers must notify the FlushCoordinator when their responsible areas are full, insertion is temporarily suspended through this synchronization, and once the waiting workers transition back from the waiting state after the flush phase completes, multithreaded insertion resumes in step. The SBuffer is then ready for reading.
In the index write phase, the node summaries are computed in parallel and three files, HTree, LRDFile and LSDFile, are created. HTree stores the generated tree structure. LRDFile stores the raw data sequences of each leaf node of the tree for algorithm traversal. Summary information is computed for each leaf to improve pruning and stored in an in-memory array; LSDFile stores this in-memory array of summary information after each execution completes. When the HBuffer can accommodate the entire data set, these operations require no disk access.
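The three-file layout of the index write phase can be sketched as follows. The on-disk formats chosen here (JSON for the tree, flat binary for the series and summaries) are assumptions for illustration; the patent does not specify the encodings.

```python
# Sketch of the index-write stage: tree structure, raw series, and per-leaf
# summaries go to three separate files, mirroring HTree / LRDFile / LSDFile.
# The concrete encodings (JSON, raw float64) are assumptions.
import json
import pathlib
import numpy as np

def write_index(outdir, tree, leaves, summaries):
    out = pathlib.Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "HTree").write_text(json.dumps(tree))       # generated tree structure
    np.concatenate(leaves).tofile(out / "LRDFile")     # raw data series, leaf by leaf
    np.asarray(summaries).tofile(out / "LSDFile")      # in-memory summary array
```

Keeping the summaries in a separate, small file is what lets a later query load only LSDFile (plus HTree) for pruning, touching LRDFile solely for the leaves that survive.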
The pseudo code of the index generation method run in the coordinator (Algorithm 1) is as follows:
Input: dataset size n, number of worker threads nw, batch size b, worker labels, and the threshold th used to force the asynchronous threads to end simultaneously.
Initialize the set of Workers;
define the trigger barrier;
define cur as 0;
read a batch of data from the file into DBuffer[cur];
define rest as n minus the batch size;
for i from 0 to nw - 1, execute in a loop:
each Worker in the Worker set performs an insert job on DBuffer[cur];
define next as 1 - cur;
while rest > 0, execute in a loop:
define next as 1 - cur;
read the next batch of data from the file into DBuffer[next];
define rest as rest minus the batch size;
define cur as next;
the coordinator reaches the synchronization barrier;
assign DBuffer[cur] to the Workers;
wait for the remaining insert Workers to complete their corresponding tasks.
The pseudo code of a new-data insert worker (InsertWorker, Algorithm 2) is as follows:
Input: worker label id.
Obtain the label of the root node;
define the trigger barrier;
define the integer rest as the number of sequences remaining in the current buffer part;
initialize the floating-point bookkeeping variables;
while rest is not 0, execute in a loop:
if the buffer part contains at least one unprocessed slot, execute:
define the next unprocessed slot;
while unprocessed sequences remain in the slot, execute in a loop:
define the next data sequence s;
insert the sequence s into the tree from the root node (Algorithm 3);
define rest as rest - 1;
the insert worker then performs:
if this worker has been designated as the FlushCoordinator, then:
execute the flush coordinator flow on its area (Algorithm 4);
otherwise:
execute the flush worker flow (Algorithm 5);
define the next buffer part to be processed.
The pseudo code of inserting a data sequence into a node (Algorithm 3) is as follows:
Input: data sequence s, its summary, and the target node N.
Add s to node N;
acquire the lock in N;
while N is not a leaf, execute in a loop:
release the lock in N;
set N to the child node chosen for s by the partitioning policy, and add s to it;
acquire the lock in N;
use s to update the summary information of N;
add s to this thread's area of the HBuffer, and add to N a pointer to the position of s in the HBuffer;
if node N is full, then:
take the currently optimal partitioning strategy as the strategy of node N;
generate the child nodes of N according to the partitioning strategy;
if the flush process has been triggered, fetch all the data sequences held in memory and on disk;
distribute the data sequences to the two child nodes, and update the summary information of the corresponding nodes;
N is no longer identified as a leaf;
release the lock in N.
The pseudo code of the flush coordinator (FlushCoordinator, Algorithm 4) is as follows:
Input: coordinator label id.
Initialize the integer count of flushed areas;
initialize the integer count of synchronized workers;
define the trigger barrier;
for each insert worker w in the worker set, execute:
while w has not yet reached the synchronization point, execute in a loop:
wait;
while unflushed data remains in the HBuffer area of w, execute in a loop:
flush it to disk;
if the HBuffer area of w is full, or the batch has ended, then:
mark the area of w as flushed;
release the workers blocked on the barrier in the HBuffer.
The pseudo code of a flush worker (FlushWorker, Algorithm 5) is as follows:
Input: worker label id, its HBuffer area, and the trigger barrier.
A worker node's cache space being full means that the cache space holds at least one full batch of sequences;
if the worker's area in the HBuffer is full:
compute the number of sequences to be flushed;
define the notification to the FlushCoordinator;
block on the barrier in the HBuffer;
define the resumed state;
if the flush phase has completed, then:
resume insertion from the blocks in the HBuffer.
The query reply specifically comprises the following parts:
first, traversing DLS tree to findAre relative to->Neighbor nodes (query statement, demand) that are currently optimal (best-so-far, bsf) and will then be selected as +.>Candidate leaves or clusters. Traversing DLS tree already loaded in advance into memory, looking for if +.>To the leaf to which the data set should have been assigned. Performing a bundle search on the HNSW map corresponding to this leaf, return and +.>Is->The neighbor vector is returned as the first current best bsf. At the same time use the->Personal->The current optimal bsf reply and EAPCA classification method continues to traverse the index tree to finish pruning and screening of the nodes. The algorithm returns after that, as candidate cluster, one contains +.>Leaves, use->Is->The ascending order of the distances is arranged.
Specifically, for a data sequence s with an m-segmentation SG = {r_1, ..., r_m}, its corresponding EAPCA (Extended APCA) representation is denoted {(μ_1, σ_1, r_1), ..., (μ_m, σ_m, r_m)}, where μ_i denotes the mean of the i-th segment, σ_i denotes the standard deviation of the i-th segment, and r_i denotes the right endpoint of the i-th segment.
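The EAPCA representation just defined can be computed directly: for each segment, take the mean, the standard deviation, and the segment's right endpoint. The function name `eapca` is an illustrative assumption.

```python
# Sketch of the EAPCA representation defined above: each segment of a series
# is summarized by its mean, standard deviation, and right endpoint.
import numpy as np

def eapca(series, right_ends):
    """Return [(mean_i, std_i, r_i)] for segments ending at each right endpoint."""
    reps, start = [], 0
    for r in right_ends:
        seg = series[start:r]
        reps.append((float(np.mean(seg)), float(np.std(seg)), r))
        start = r
    return reps
```

For example, the series (1, 1, 3, 5) with right endpoints (2, 4) yields the two triples (1.0, 0.0, 2) and (4.0, 1.0, 4).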
In particular, the specific expression of the Euclidean distance is as follows:
ED(x, y) = sqrt( (x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_d - y_d)^2 )    (1)
where x and y represent two data sequences satisfying the dimension agreement |x| = |y| = d.
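Assuming the distance in formula (1) is the standard Euclidean distance between two equal-length sequences (consistent with the dimension-agreement condition stated above), it can be written out as:

```python
# Euclidean distance between two equal-length sequences, per formula (1).
import math

def euclidean(x, y):
    assert len(x) == len(y), "sequences must satisfy the dimension agreement"
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

For instance, `euclidean([0, 0], [3, 4])` gives the familiar 3-4-5 result.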
In a second step, the candidate leaves from the first step are searched in parallel using beam search (BeamSearch). Multiple threads process the candidate clusters in parallel, starting from the leaf with the smallest lower-bound distance to q. Each leaf graph is handled by only one thread, while the same thread can handle multiple leaves, one at a time. A thread searches one leaf and returns k bsf answers. Each thread maintains a local priority queue storing the k bsf answers of the leaves it has processed, together with a local kth-bsf distance, i.e., the Euclidean distance between the query q and its k-th best answer. The threads use a readers-writer lock to access the global kth-bsf distance, i.e., the Euclidean distance between q and the global k-th best answer; the global kth-bsf is updated whenever a thread finds a better local kth-bsf. Once a thread completes the search task in a leaf, its existing priority queue is used to warm up the search task in the new leaf, using the current k minimum distances. The search terminates either after all candidate clusters have been processed, or when the lower-bound distance of the next leaf to be processed is greater than the global kth-bsf distance. This is valid because the lower-bound distance has the lower-bounding property: the Euclidean distance between any two points in the original high-dimensional space is greater than or equal to their lower-bound distance. In addition, the set of candidate leaves is arranged in ascending order of their lower-bound distances. When all threads terminate, the final result is computed by aggregating the answers returned from all local priority queues and returning the k answers with the shortest Euclidean distance to q.
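The parallel candidate-leaf search and its lower-bound pruning can be sketched as follows: worker threads take leaves in ascending lower-bound order, maintain a shared best-so-far top-k under a lock, and stop once the next leaf's lower bound exceeds the global kth-bsf distance. A brute-force scan of each leaf stands in for the per-leaf beam search, a single mutex for the readers-writer lock, and all names are illustrative assumptions.

```python
# Sketch of the parallel leaf search: threads consume leaves in ascending
# lower-bound order and prune once a lower bound exceeds the global kth-bsf.
import heapq
import threading
import numpy as np

def parallel_leaf_search(leaves_with_lb, query, k=1, n_threads=2):
    leaves = sorted(leaves_with_lb, key=lambda t: t[0])   # (lower_bound, points)
    results, pos = [], 0                                  # global top-k as a max-heap
    lock = threading.Lock()
    def kth_bsf():
        return -results[0][0] if len(results) == k else float("inf")
    def worker():
        nonlocal pos
        while True:
            with lock:
                if pos >= len(leaves) or leaves[pos][0] > kth_bsf():
                    return                                # exhausted or pruned
                _, pts = leaves[pos]
                pos += 1
            dists = np.linalg.norm(np.asarray(pts) - query, axis=1)
            with lock:                                    # merge into the global top-k
                for d, p in zip(dists, pts):
                    if len(results) < k:
                        heapq.heappush(results, (-d, tuple(p)))
                    elif d < -results[0][0]:
                        heapq.heapreplace(results, (-d, tuple(p)))
    ts = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return sorted((-nd, p) for nd, p in results)
```

The pruning test is sound only because the lower bound never exceeds the true Euclidean distance, which is exactly the lower-bounding property invoked above.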
In the embodiment of the application, data is read from the on-disk data set using a two-stage cache; during index generation, hierarchical navigable small world (HNSW) graphs are adopted, while coordinator and worker nodes enable parallel operation of the algorithm, combining the advantages of the existing tree-based and graph-based methods; and the index tree is traversed using a beam search method, enabling parallelism and accelerating the search process.
In the embodiment of the application, the two-stage cache method makes the DLS tree construction more efficient. A large memory buffer (HBuffer) is allocated at the start of index creation and released after all the sequences have been inserted, which is faster than each leaf pre-allocating its own memory buffer and releasing it on splitting. This reduces the number of system calls and memory-management issues, especially when splits are frequent at the start of index generation (when a program issues a large number of memory release operations, the memory can be retained by the process for later reuse). The hierarchical processing method greatly accelerates the model and avoids the excessive resource consumption of traversing all data points in a large data set.
It should be noted that, for simplicity of description, the foregoing method embodiments are all described as a series of acts, but it should be understood by those skilled in the art that the present application is not limited by the order of acts described, as some steps may be performed in other orders or concurrently in accordance with the present application. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present application.
According to another aspect of the embodiment of the present application, there is also provided an index generation and similarity search apparatus for large-scale high-dimensional data for implementing the above-mentioned index generation and similarity search method for large-scale high-dimensional data. As shown in fig. 6, the apparatus includes:
the index generation module 602 is configured to generate an index tree using an existing data sequence, and write information of the index tree into a related file;
a query reply module 604, configured to traverse the index tree to obtain a plurality of candidate categories adjacent to the given query if the given query is received; multiple candidate categories are processed in parallel by multiple threads to obtain a target result closest to a given query.
Optionally, the index generation module 602 generates an index tree using an existing data sequence, including: dividing the existing data sequence by using a hierarchical tree structure to form a plurality of leaves; the graph structure is built in parallel in each leaf of the hierarchical tree structure, generating an index tree.
Optionally, the index generation module 602 divides the existing data sequence using a hierarchical tree structure to form a plurality of leaves, including: reading the data sequences from disk into memory in batches using the hierarchical tree structure, and inserting the data cached in memory into the tree structure; and traversing the binary index tree with a thread to determine the leaf corresponding to each data sequence, so that the existing data sequence is partitioned into a plurality of leaves.
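The leaf lookup for an inserted sequence can be sketched as a descent through the binary index tree. The node fields (`split_dim`, `split_value`) are assumed names for whatever split test the tree uses; this is an illustration, not the patent's structure.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexNode:
    """Illustrative binary index-tree node; field names are assumptions."""
    split_dim: int = 0
    split_value: float = 0.0
    left: Optional["IndexNode"] = None
    right: Optional["IndexNode"] = None

def route_to_leaf(root, sequence):
    """Walk the binary index tree from the root, choosing a child at
    each internal node by a split test, until a leaf is reached."""
    node = root
    while node.left is not None:          # internal node: descend
        if sequence[node.split_dim] <= node.split_value:
            node = node.left
        else:
            node = node.right
    return node                           # leaf that will hold the sequence
```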
Optionally, the index generation module 602 constructs a graph structure in parallel in each leaf of the hierarchical tree structure, and generates an index tree, including: in the case where the master coordinator thread initializes a plurality of leaf coordinator threads, reading leaf data with each leaf coordinator thread and inserting the leaf data into the graph structure; an index tree is generated using the respective graph structures of all the leaves.
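The per-leaf parallelism can be sketched with a thread pool. As a hedge: the brute-force kNN graph below is only a stand-in for the real per-leaf HNSW build, and all names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor
from math import dist  # Python 3.8+

def build_leaf_graph(points, num_edges=2):
    """Illustrative stand-in for the per-leaf graph build: link each
    point to its nearest neighbours.  A real system would construct an
    HNSW graph here; this brute-force version just shows the shape."""
    graph = {}
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]))
        graph[i] = neighbours[:num_edges]
    return graph

def build_index(leaves, num_workers=4):
    """The master coordinator hands each leaf to a worker thread; the
    finished index combines the tree with one graph per leaf."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(build_leaf_graph, leaves))
```

Because the leaves partition the data, the graph builds are independent and no synchronization between workers is needed.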
Optionally, the index generation module 602 writes information of the index tree to the related files, including: writing the information of the index tree into HTree, LRDFile and LSDFile files, wherein the HTree file is used to store the generated tree structure, LRDFile is used to store the original data sequences, and LSDFile is used to store summary information.
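The three-file layout can be sketched as below. The encodings chosen here (JSON for the tree and summaries, packed floats for the raw sequences) are illustrative assumptions, not the patent's on-disk format; only the file names come from the text above.

```python
import json
import os
import struct

def write_index_files(directory, tree, sequences, summaries):
    """Persist the index as three files: HTree (tree topology),
    LRDFile (raw data sequences), LSDFile (summary information)."""
    with open(os.path.join(directory, "HTree"), "w") as f:
        json.dump(tree, f)                    # generated tree structure
    with open(os.path.join(directory, "LRDFile"), "wb") as f:
        for seq in sequences:                 # raw float sequences
            f.write(struct.pack(f"{len(seq)}f", *seq))
    with open(os.path.join(directory, "LSDFile"), "w") as f:
        json.dump(summaries, f)               # summary information
```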
Optionally, the query reply module 604 traverses the index tree to obtain a plurality of candidate categories adjacent to the given query, including: traversing the index tree, and searching a plurality of candidate leaves with similarity meeting preset conditions with the given query.
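The candidate-leaf collection can be sketched as a best-first traversal ranked by a lower-bound distance. The one-dimensional `summary` field and the `lower_bound` function are assumptions standing in for the real summarization; the point is the traversal order.

```python
import heapq
from dataclasses import dataclass
from typing import Optional

@dataclass
class TreeNode:
    """Illustrative node: `summary` stands in for the node's
    lower-dimensional summary (field names are assumptions)."""
    summary: float
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

def lower_bound(node, query):
    # assumed summary distance that lower-bounds the true distance
    return abs(node.summary - query)

def collect_candidate_leaves(root, query, num_candidates):
    """Best-first traversal of the index tree ranked by lower-bound
    distance to the query; stops once num_candidates leaves are found.
    Leaves come out in ascending lower-bound order, which is the order
    the parallel search phase expects."""
    tiebreak = 0                      # keeps heap entries comparable
    frontier = [(lower_bound(root, query), tiebreak, root)]
    leaves = []
    while frontier and len(leaves) < num_candidates:
        lb, _, node = heapq.heappop(frontier)
        if node.left is None and node.right is None:
            leaves.append((lb, node))  # leaf: record with its bound
            continue
        for child in (node.left, node.right):
            if child is not None:
                tiebreak += 1
                heapq.heappush(
                    frontier, (lower_bound(child, query), tiebreak, child))
    return leaves
```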
Optionally, the query reply module 604 processes the plurality of candidate categories in parallel using multiple threads to obtain the target result closest to the given query, including: performing beam search on the plurality of candidate leaves in parallel using multiple threads to obtain the target result closest to the given query.
According to still another aspect of the embodiment of the present application, there is further provided an electronic device, which may be a terminal device or a server, for implementing the above-described index generation and similarity search method for large-scale high-dimensional data. As shown in fig. 7, the electronic device comprises a memory 702 and a processor 704, the memory 702 storing a computer program, the processor 704 being arranged to perform the steps of any of the method embodiments described above by means of the computer program.
Alternatively, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of the computer network.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, generating an index tree by using an existing data sequence, and writing information of the index tree into a related file;
s2, under the condition that a given query is received, traversing an index tree to obtain a plurality of candidate categories adjacent to the given query;
s3, processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
Alternatively, it will be appreciated by those skilled in the art that the structure shown in fig. 7 is merely illustrative, and the electronic device may be any terminal device; fig. 7 does not limit the structure of the electronic device described above. For example, the electronic device may also include more or fewer components (e.g., network interfaces, etc.) than shown in fig. 7, or have a different configuration than shown in fig. 7.
The memory 702 may be used to store software programs and modules, such as program instructions/modules corresponding to the index generation and similarity search method and apparatus in the embodiment of the present application, and the processor 704 executes the software programs and modules stored in the memory 702 to perform various functional applications and data processing, that is, to implement the index generation and similarity search method for large-scale high-dimensional data. The memory 702 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 702 may further include memory remotely located relative to the processor 704, which may be connected to the terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. As an example, as shown in fig. 7, the memory 702 may include, but is not limited to, the index generation module 602 and the query reply module 604 of the above-described index generation and similarity search device for large-scale high-dimensional data. In addition, other module units of the index generation and similarity search device for large-scale high-dimensional data may also be included, but are not limited to these, and are not described in detail in this example.
Optionally, the transmission device 706 is used to receive or transmit data via a network. Specific examples of the network described above may include wired networks and wireless networks. In one example, the transmission device 706 includes a network adapter (Network Interface Controller, NIC) that may be connected to other network devices and routers via a network cable to communicate with the internet or a local area network. In one example, the transmission device 706 is a Radio Frequency (RF) module that is configured to communicate wirelessly with the internet.
In addition, the electronic device further includes: a display 708 and a connection bus 710 for connecting the various modular components of the electronic device.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system, and the blockchain system may be a distributed system formed by connecting the plurality of nodes through a network communication. Among them, the nodes may form a Peer-To-Peer (P2P) network, and any type of computing device, such as a server, a terminal, etc., may become a node in the blockchain system by joining the Peer-To-Peer network.
According to one aspect of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The computer instructions are read from the computer-readable storage medium by a processor of a computer device, and executed by the processor, cause the computer device to perform the methods provided in various alternative implementations of the index generation and similarity search aspects described above for large-scale, high-dimensional data. Wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described computer-readable storage medium may be configured to store a computer program for executing the steps of:
s1, generating an index tree by using an existing data sequence, and writing information of the index tree into a related file;
s2, under the condition that a given query is received, traversing an index tree to obtain a plurality of candidate categories adjacent to the given query;
s3, processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
Alternatively, in this embodiment, it will be understood by those skilled in the art that all or part of the steps in the methods of the above embodiments may be performed by a program for instructing a terminal device to execute the steps, where the program may be stored in a computer readable storage medium, and the storage medium may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
The integrated units in the above embodiments, if implemented in the form of software functional units and sold or used as independent products, may be stored in the above-described computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to perform all or part of the steps of the method described in the embodiments of the present application.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In several embodiments provided by the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary; for example, the division into units is merely a logical function division, and another division may be used in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed between the parts may be through some interfaces, units or modules, and may be in electrical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The foregoing is merely a preferred embodiment of the present application and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present application, which are intended to be comprehended within the scope of the present application.

Claims (10)

1. An index generation and similarity search method for large-scale high-dimensional data is characterized by comprising the following steps:
generating an index tree by utilizing the existing data sequence, and writing information of the index tree into a related file;
traversing an index tree to obtain a plurality of candidate categories adjacent to a given query if the given query is received;
and processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
2. The method of claim 1, wherein generating an index tree using the existing data sequence comprises:
dividing the existing data sequence by using a hierarchical tree structure to form a plurality of leaves;
the graph structure is built in parallel in each leaf of the hierarchical tree structure, generating an index tree.
3. The method of claim 2, wherein partitioning the existing data sequence using the hierarchical tree structure to form a plurality of leaves comprises:
reading the data sequences into the memory from the disk in batches by utilizing the hierarchical tree structure, and inserting the data cached in the memory into the tree structure;
traversing the binary index tree by using a thread to determine a leaf corresponding to the data sequence; the existing data sequence is partitioned using a hierarchical tree structure to form a plurality of leaves.
4. The method of claim 2, wherein constructing the graph structure in parallel in each leaf of the hierarchical tree structure generates the index tree, comprising:
in the case where the master coordinator thread initializes a plurality of leaf coordinator threads, reading leaf data with each leaf coordinator thread and inserting the leaf data into the graph structure;
an index tree is generated using the respective graph structures of all the leaves.
5. The method of claim 1, wherein writing information of the index tree to a related file comprises:
writing the information of the index tree into HTree, LRDFile and LSDFile files, wherein the HTree file is used for storing the generated tree structure, LRDFile is used for storing the original data sequences, and LSDFile is used for storing summary information.
6. The method of claim 1, wherein traversing the index tree results in a plurality of candidate categories adjacent to the given query, comprising:
and traversing the index tree, and searching a plurality of candidate leaves with similarity meeting preset conditions with the given query.
7. The method of claim 6, wherein the processing multiple candidate categories in parallel with multiple threads to obtain a target result closest to the given query comprises:
and performing beam search on a plurality of leaves in parallel by using a plurality of threads to obtain a target result closest to the given query distance.
8. An index generation and similarity search device for large-scale high-dimensional data, comprising:
the index generation module is used for generating an index tree by utilizing the existing data sequence and writing information of the index tree into related files;
the query reply module is used for traversing the index tree to obtain a plurality of candidate categories adjacent to the given query under the condition that the given query is received; and processing a plurality of candidate categories in parallel by using a plurality of threads to obtain a target result closest to the given query.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a stored program, wherein the program, when run, performs the method of any one of claims 1 to 7.
10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method according to any of the claims 1 to 7 by means of the computer program.
CN202311525045.6A 2023-11-16 2023-11-16 Index generation and similarity search method and device for large-scale high-dimensional data Pending CN117235080A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311525045.6A CN117235080A (en) 2023-11-16 2023-11-16 Index generation and similarity search method and device for large-scale high-dimensional data


Publications (1)

Publication Number Publication Date
CN117235080A true CN117235080A (en) 2023-12-15

Family

ID=89084854

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311525045.6A Pending CN117235080A (en) 2023-11-16 2023-11-16 Index generation and similarity search method and device for large-scale high-dimensional data

Country Status (1)

Country Link
CN (1) CN117235080A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110008256A (en) * 2019-04-09 2019-07-12 杭州电子科技大学 It is a kind of to be navigated the approximate KNN searching method of worldlet figure based on layering
US20190286730A1 (en) * 2018-03-14 2019-09-19 Colossio, Inc. Sliding window pattern matching for large data sets
CN113590889A (en) * 2021-07-30 2021-11-02 深圳大学 Method and device for constructing metric space index tree, computer equipment and storage medium
US20220300528A1 (en) * 2019-08-12 2022-09-22 Universität Bern Information retrieval and/or visualization method


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Ilias Azizi et al.: "ELPIS: Graph-Based Similarity Search for Scalable Data Science", ACM Digital Library, https://dl.acm.org/doi/10.14778/3583140.3583166, 1 February 2023 (2023-02-01), pages 1548-1552 *
Karima Echihabi et al.: "Hercules Against Data Series Similarity Search", arXiv.org, 26 December 2022 (2022-12-26), pages 1-12 *
Wang Fei et al.: "MTSAX: A New Multivariate Trajectory Indexing Method", Computer Engineering (《计算机工程》), vol. 44, no. 5, pages 1-6 *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination