CN111259193B - Feature retrieval system based on cluster filtering and application method thereof

Info

Publication number: CN111259193B
Application number: CN202010047322.7A
Authority: CN
Inventors: 关喜记; 黄松钦; 董振江; 江盛欣; 劳定雄; 汪刚; 刘双广
Original assignee: Gosuncn Technology Group Co Ltd
Current assignee: Gosuncn Technology Group Co Ltd
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2023-08-25
Anticipated expiration: 2040-01-16
Also published as: CN111259193A

Abstract

The invention belongs to the technical field of video investigation, and particularly relates to a feature retrieval system based on cluster filtering and an application method thereof. During feature retrieval, target features are filtered by cluster category, which reduces unnecessary memory accesses to feature data and improves retrieval efficiency. In addition, a distributed system for target feature retrieval solves a series of problems such as slow loading of massive feature data, inconsistent feature data, and re-clustering of target features.

Description

Feature retrieval system based on cluster filtering and application method thereof

Technical Field

The invention belongs to the technical field of video investigation, and particularly relates to a feature retrieval system based on cluster filtering and an application method thereof.

Background

With the wide application of deep learning in video recognition, the technology has advanced greatly; for example, the accuracy of face recognition and license plate recognition now exceeds 99%. As a result, sectors such as government, public security and finance increasingly apply face recognition, pedestrian re-identification and license plate recognition, and the artificial intelligence market for video recognition is increasingly active. In particular, the rollout of a large number of Safe City and Smart City projects in recent years has generated a large amount of structured video feature data, which provides the data basis for subsequent feature retrieval. How to quickly find related targets that appear in video within massive structured video data has become an urgent problem in the security field.

Prior art scheme 1: an open-source Elasticsearch engine is used. The feature data are first filtered by the search conditions to form a target feature library, and a 1:N search is then run between the feature to be searched and that library. The main flow is as follows. The preprocessed structured feature data are stored in Elasticsearch nodes together with information such as the timestamp and camera channel, and the Elasticsearch cluster keeps the feature data balanced across nodes. An Elasticsearch query node receives feature search requests from clients and distributes them to the Elasticsearch nodes. On receiving a search request, each Elasticsearch node reads feature data from the server disk according to the search conditions to form a target feature library, uses a feature comparison algorithm to calculate the similarity between the feature to be searched and every feature in the library, and returns its TopN to the query node. The query node gathers the TopN results of all nodes, sorts them by similarity, takes the overall TopN, and returns the final result to the client.

Prior art scheme 2: a distributed retrieval scheme based on memory technology stores hot feature data in server memory and runs the 1:N retrieval between the feature to be searched and all target feature libraries directly in memory. The main flow is as follows. Node nodes are provided; on startup, each Node pulls the target feature data that belong to it from a database and stores them in server memory to form its target feature library. While running, each Node periodically and incrementally queries new target features from the database and stores them in server memory, and periodically deletes outdated feature data from the in-memory feature library. A Cluster node is provided to distribute search requests and to summarize the TopN results of the Node nodes. The Cluster node receives a feature retrieval request and distributes it to all Node nodes. Each Node traverses its target feature library according to the search conditions and outputs a TopN result. The Cluster node merges the TopN results of all Node nodes, sorts them by similarity, and finally outputs the TopN to the client.
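The full traversal performed by each Node in scheme 2 can be illustrated with a minimal sketch. The record layout (`id`, `channel`, `timestamp`, `vector`), the function names and the use of cosine similarity are assumptions made for illustration; the source does not specify the comparison algorithm.

```python
import heapq
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def full_scan_topn(library: List[Dict], query: Sequence[float],
                   channels: Set[int], t_start: int, t_end: int,
                   n: int) -> List[Tuple[float, str]]:
    """Traverse every record, apply the search conditions, keep the TopN."""
    scored = []
    for rec in library:  # every record in memory is visited, matching or not
        if rec["channel"] in channels and t_start <= rec["timestamp"] <= t_end:
            scored.append((cosine(query, rec["vector"]), rec["id"]))
    return heapq.nlargest(n, scored)
```

The loop visits the whole library even when most records fail the filter, which is the memory-access cost the later drawbacks paragraph attributes to this scheme.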

Prior art scheme 3: a retrieval scheme based on GPU technology. Its main business flow is the same as that of the distributed retrieval scheme based on memory technology; the difference is that the target feature data are stored in the video memory of the GPU instead of the memory of the server. The scheme exploits the strong floating-point and highly concurrent computing capability of the GPU and can greatly increase the speed of feature comparison, thereby improving the retrieval efficiency for massive feature data.

Drawbacks of prior art scheme 1: this scheme uses the open-source Elasticsearch (ES) search engine. ES provides no interface for secondary processing of the TopN results after retrieval, so its source code has to be modified. Secondly, the target feature library for feature retrieval often holds millions of features or more; if the feature data are not in the system cache they must be read from disk, which consumes heavy I/O and reduces retrieval efficiency. Finally, ES is troublesome to operate and maintain: recovering an ES cluster after a failure requires reading a large amount of feature data back into memory, which seriously slows ES down.

Drawbacks of prior art scheme 2: this scheme uses distributed retrieval based on memory technology and is limited first of all by the server's bus bandwidth and memory access speed, because it must traverse all target feature data in memory and only then select, according to the filter conditions, the target features to compare. Secondly, its distributed framework is relatively simple and does not properly solve problems such as feature data synchronization and consistency.

Drawbacks of prior art scheme 3: this scheme uses distributed retrieval based on GPU hardware, which greatly increases feature retrieval speed but also increases hardware cost. A GPU with 8 GB of video memory costs far more than ordinary server memory, and the amount of feature data it can hold is limited. For example, when each target feature is about 700 bytes, a graphics card with 8 GB of video memory can store about 10 million features. At 15 million features per day, one such card holds about 0.67 days of feature data, and 45 cards are needed to hold one month of feature data. With the CPU scheme, 256 GB of memory can store about 400 million features, so only two 256 GB memory modules are needed to hold one month of feature data, and the hardware cost is far lower than that of the GPU scheme.
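The capacity figures above can be checked with a short calculation. This sketch assumes decimal gigabytes and a 30-day month, and it takes the rounded figure of 10 million features per 8 GB card from the text; the function names are illustrative only.

```python
import math

FEATURE_BYTES = 700            # approximate size of one target feature
FEATURES_PER_DAY = 15_000_000  # daily volume quoted in the text
RETENTION_DAYS = 30            # assumption: one month taken as 30 days


def raw_capacity(gigabytes: int) -> int:
    """Features that fit in the given capacity, using decimal gigabytes."""
    return gigabytes * 10**9 // FEATURE_BYTES


def devices_needed(features_per_device: int) -> int:
    """Devices required to hold one month of features."""
    total = FEATURES_PER_DAY * RETENTION_DAYS
    return math.ceil(total / features_per_device)


gpu_rounded = 10_000_000                    # rounded figure used in the text
days_per_gpu = gpu_rounded / FEATURES_PER_DAY
gpu_cards = devices_needed(gpu_rounded)
ram_modules = devices_needed(raw_capacity(256))

print(raw_capacity(8), round(days_per_gpu, 2), gpu_cards, ram_modules)
```

The unrounded division gives slightly more than 11 million features for 8 GB and roughly 366 million for 256 GB, so the rounded figures in the text are on the conservative side for the GPU and slightly generous for the memory modules; the count of two modules holds either way.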

Disclosure of Invention

In view of the problems in the prior art, the invention provides a feature retrieval system based on cluster filtering and an application method thereof, which can identify a target quickly and improve working efficiency.

The invention is realized by the following technical scheme:

the feature retrieval system based on cluster filtering is characterized by comprising a retrieval cluster, wherein a Master and a plurality of Slave nodes are arranged in the retrieval cluster, and the Master is used for providing distribution of feature retrieval requests and summarizing TopN results returned by the Slave nodes; the Slave node is used for providing feature retrieval service; the Slave node is also provided with a memory database, and the memory database is used for storing target characteristic data in groups by using cluster IDs (ClusterIDs); the target feature data comprise historical feature data loaded from a backup file and newly added real-time feature data loaded from a distributed database;

the feature retrieval system also comprises a backup service, wherein the backup service is used for backing up the snapshot feature data of the distributed database, the backup file is stored in a network shared disk, and the snapshot feature data are backed up on a daily basis;

the feature retrieval system further comprises a clustering service connected with the distributed database and used for grouping the target feature data; when the number of snapshot feature data in the distributed database exceeds a threshold value, a re-clustering function is started and the cluster ID of the target feature data is recalculated.

Further, the feature retrieval system at least comprises retrieval of face features, vehicle features and pedestrian features.

Further, the Master is also used for dynamically adding or deleting Slave nodes.

Further, when a Slave node is newly added, the Master distributes part of target feature data of other Slave nodes to the newly added Slave node.

Further, when the Slave node is deleted, the Master distributes the target feature data of the offline Slave node to other Slave nodes, so as to ensure the integrity of the target feature data in the memory database.

Further, after the backup service finishes the re-backup of the historical characteristic data, all the Slave nodes are notified to reload the target characteristic data.

An application method based on a feature retrieval system comprises a memory feature management method and a target feature retrieval method; the memory characteristic management method comprises the following steps:

S101: the Master dynamically allocates the feature range that each Slave node needs to load according to the number of Slave nodes;

S102: each Slave node loads the historical feature data from the backup file, filters the features according to its own range, and periodically and incrementally queries newly added real-time feature data from the distributed database;

S103: when a new Slave node is added, the Master distributes part of the target feature data of the other Slave nodes to the new Slave node according to a load-balancing scheduling algorithm, so that the target feature data of all Slave nodes remain relatively balanced;

S104: when a Slave node goes offline, the Master distributes all target feature data of the offline Slave node to the other online Slave nodes, so that the target feature data in the memory database remain complete and balanced;

S105: the backup service backs up the current day's historical feature data on a schedule, at midnight every day;

S106: the clustering service decides whether to re-cluster all target feature data according to a preset re-clustering strategy and the volume of snapshot target features;

S107: after the clustering service completes the re-clustering, it sends a request to the backup service to back up the historical feature data again;

S108: after the backup service finishes the re-backup of the historical feature data, it sends a request for reloading the target feature data to the Master;

S109: the Master broadcasts the received request for reloading the target feature data to all Slave nodes;

S110: each Slave node clears all feature data in its memory database and reloads the historical feature data from the backup file.
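Steps S101, S103 and S104 can be sketched as a range allocation that the Master recomputes whenever the set of Slave nodes changes. The source does not name a hash function or a scheduling algorithm, so the CRC32 hash space, the even split and the names `allocate_ranges` and `owner` are assumptions for illustration.

```python
import zlib
from typing import Dict, List, Tuple

HASH_SPACE = 2**32  # assumption: 32-bit hash values (CRC32)


def allocate_ranges(slaves: List[str]) -> Dict[str, Tuple[int, int]]:
    """Split the hash space into contiguous half-open ranges, one per Slave."""
    if not slaves:
        raise ValueError("at least one Slave node is required")
    n = len(slaves)
    bounds = [HASH_SPACE * i // n for i in range(n + 1)]
    return {name: (bounds[i], bounds[i + 1])
            for i, name in enumerate(sorted(slaves))}


def owner(feature_id: str, ranges: Dict[str, Tuple[int, int]]) -> str:
    """Return the Slave whose range contains the hash of this feature ID."""
    h = zlib.crc32(feature_id.encode("utf-8"))
    for name, (lo, hi) in ranges.items():
        if lo <= h < hi:
            return name
    raise LookupError("hash value not covered by any range")
```

Calling `allocate_ranges` again after a Slave joins or leaves yields ranges that still cover the whole hash space, which corresponds to keeping the in-memory data complete and balanced. An even re-split moves more data than strictly necessary; a production scheduler might prefer a scheme that relocates only part of each range.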

Further, the target feature retrieval method comprises the following steps:

S201: the Slave node loads feature data from the backup file according to its Hash Value range;

S202: the Slave node incrementally loads real-time feature data from the distributed database according to its Hash Value range;

S203: the Slave node stores all feature data in groups according to cluster ID, forming the memory database of the Slave node;

S204: after receiving a feature retrieval request, the Slave node calculates the cluster ID of the feature data to be retrieved;

S205: the Slave node selects the target feature data whose cluster ID matches that of the feature data to be retrieved, performs feature comparison, calculates the similarity, and returns a TopN result to the Master;

S206: the Master gathers the TopN results of all Slave nodes, sorts them by similarity, and outputs the final TopN to the client.
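Steps S203 to S205 on a single Slave node can be sketched as follows. The sketch assumes that features are L2-normalized vectors, that similarity is a dot product, and that a query's cluster ID is the ID of the most similar cluster center; none of these choices is stated in the source, and `SlaveStore` is a hypothetical name.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def nearest_cluster(vec: Vector, centroids: Dict[int, Vector]) -> int:
    """Cluster ID whose center is most similar to the vector (assumed rule)."""
    return max(centroids, key=lambda cid: dot(vec, centroids[cid]))


class SlaveStore:
    """In-memory database of one Slave node, grouped by ClusterID (S203)."""

    def __init__(self, centroids: Dict[int, Vector]) -> None:
        self.centroids = centroids
        self.groups: Dict[int, List[Tuple[str, Vector]]] = defaultdict(list)

    def add(self, feature_id: str, vec: Vector, cluster_id: int) -> None:
        self.groups[cluster_id].append((feature_id, vec))

    def search(self, query: Vector, n: int) -> List[Tuple[float, str]]:
        cid = nearest_cluster(query, self.centroids)   # S204
        candidates = self.groups.get(cid, [])          # S205: one group only
        scored = ((dot(query, vec), fid) for fid, vec in candidates)
        return heapq.nlargest(n, scored)
```

Only the group that matches the query's cluster ID is read, so the number of comparisons depends on the size of that group rather than on the size of the whole library.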

Further, in step S106, when the target feature data needs to be re-clustered, the clustering service extracts the real-time feature data from the distributed database to re-calculate the cluster ID of the target feature data, and updates the cluster ID field and the cluster algorithm version number of the memory database.

Further, in step S110, the feature data of the current day are queried incrementally from the distributed database and the clustering algorithm version number of each record is checked for consistency; if the version numbers are inconsistent, the record has not yet been clustered and is not loaded into the in-memory database.
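The version check described for step S110 can be sketched as a filter applied to the incrementally queried records. The record fields and the name `loadable` are assumptions; the source only states that records with an inconsistent clustering algorithm version number are not loaded.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FeatureRecord:
    feature_id: str
    cluster_id: int
    algo_version: int  # clustering algorithm version stored with the record


def loadable(records: Iterable[FeatureRecord],
             current_version: int) -> List[FeatureRecord]:
    """Keep only records clustered with the current algorithm version."""
    return [r for r in records if r.algo_version == current_version]
```

Records left out by this filter would be picked up by a later incremental query once the clustering service has recalculated their cluster IDs, assuming the Slave node queries again on its usual schedule.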

Compared with the prior art, the invention has at least the following beneficial effects or advantages. Historical feature data are loaded from the backup file, which reduces the number of accesses to the distributed database and speeds up the loading of target features. In addition, feature data are stored in groups by cluster and target feature data are then filtered by cluster, which narrows the retrieval range, reduces accesses to in-memory features, and increases the overall feature retrieval speed.

1. Compared with prior art scheme 1, the invention adopts a distributed scheme based on memory technology and compares feature data directly in memory without reading them from disk. The feature comparison speed of the invention is therefore higher than that of scheme 1.

2. Compared with prior art scheme 2, the invention groups all target features with a mature clustering algorithm, so that retrieval only needs to compare target features within the same group. This reduces the size of the target feature library to be compared and the amount of memory data accessed, and increases the overall feature comparison speed. The feature comparison speed of the invention is therefore higher than that of scheme 2.

3. Compared with prior art scheme 3, the feature comparison speed of the invention is lower for a target feature library of the same size. However, the hardware cost of the invention is far lower than that of scheme 3, the overall retrieval efficiency can be improved by adding Slave nodes, that is, by adding servers, and the number of features the invention can store is far greater than that of scheme 3. The invention is therefore superior to scheme 3 in terms of hardware cost.

Drawings

The invention will be described in further detail with reference to the accompanying drawings;

FIG. 1 is a schematic structural diagram of a specific embodiment of a feature retrieval system based on cluster filtering and a method of application thereof;

FIG. 2 is a flow chart of target feature retrieval.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The invention groups the features of a massive target feature library by applying a mature target clustering algorithm (person clustering, vehicle clustering and the like). During feature retrieval, target features are filtered by cluster category, which reduces unnecessary memory accesses to feature data and improves retrieval efficiency. In addition, a self-developed distributed feature retrieval system solves a series of problems such as slow loading of massive feature data, inconsistent feature data, and re-clustering of target features.

Referring to FIG. 1, the feature retrieval system based on cluster filtering in this embodiment comprises a retrieval cluster 110. A Master 111 is arranged in the retrieval cluster 110 to distribute feature retrieval requests and to summarize the TopN results returned by the Slave nodes. A plurality of Slave nodes 112 are also arranged in the retrieval cluster 110; each Slave node 112 provides a feature retrieval service and has an in-memory database that stores target feature data. There may be one or more Slave nodes 112, and they can be added or deleted dynamically.

The embodiment is provided with a backup service 113 for backing up the snapshot feature information of the distributed database, wherein the feature backup file is stored in a network shared disk, and the feature data backup is performed in units of days.

The present embodiment is provided with a clustering service 114. As the snapshot feature data in the distributed database keep growing, the center points of the feature clustering algorithm change, and the clustering service 114 then re-clusters the existing target feature data.
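A minimal sketch of the re-clustering decision and the recalculation of cluster IDs follows. The threshold test mirrors the description of the clustering service; the nearest-center assignment by squared Euclidean distance, the dictionary row layout and the function names are assumptions, since the source does not specify the clustering algorithm.

```python
from typing import Dict, List, Sequence


def should_recluster(feature_count: int, threshold: int) -> bool:
    """Re-clustering starts once the snapshot feature count exceeds the threshold."""
    return feature_count > threshold


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def recluster(rows: List[dict], centroids: Dict[int, Sequence[float]],
              new_version: int) -> None:
    """Reassign each row to its nearest center and stamp the new version."""
    for row in rows:
        vec = row["vector"]
        row["cluster_id"] = min(
            centroids, key=lambda cid: squared_distance(vec, centroids[cid]))
        row["algo_version"] = new_version
```

Stamping every recalculated row with the new version number is what allows a Slave node to tell clustered records from records that still carry an old cluster ID.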

The business process of the invention comprises two parts: memory feature management and feature retrieval. The memory feature management flow specifically comprises the following steps:

S101: the Master 111 dynamically allocates the feature range that each Slave node needs to load according to the number of Slave nodes;

S102: each Slave node loads the historical feature data from the feature backup file on the network shared disk, filters the features according to its own range, and then periodically and incrementally queries newly added target features from the distributed database. All target features are grouped by cluster ID (ClusterID), and each Slave node holds groups for all ClusterIDs;

S103: when a new Slave node is added, the Master distributes part of the target features of the other Slave nodes to the new Slave node according to a load-balancing scheduling algorithm, so that the target features of all Slave nodes remain relatively balanced;

S104: when a Slave node goes offline, the Master distributes all feature data of the offline node to the other online Slave nodes, so that the target feature data in memory remain complete and balanced;

S105: the feature backup service 113 backs up the current day's feature data at midnight;

S106: the clustering service 114 then decides whether to re-cluster all feature data according to the preset re-clustering strategy and the volume of snapshot target features. When the features need to be re-clustered, the clustering service pulls the target feature data from the distributed database, recalculates the ClusterID of each target feature, and updates the ClusterID field and the clustering algorithm version number in the database;

S107: after the clustering service 114 completes the re-clustering, it sends a request to the backup service 113 to back up the historical feature data again;

S108: after the backup service 113 finishes backing up the historical feature data, it sends a request for reloading the target feature data to the Master;

S109: the Master 111 then broadcasts the request to reload the target feature data to all Slave nodes;

S110: each Slave node empties its in-memory feature data and reloads the historical feature data from the backup file. The feature data of the current day are queried incrementally from the distributed database and the clustering algorithm version numbers are checked for consistency; if they are inconsistent, the record has not yet been clustered and is not loaded into the feature memory.
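The hand-off in steps S107 to S110 can be sketched with plain in-process objects. This is a simplified illustration under stated assumptions: real services would communicate over the network, each Slave node would keep only the features in its own range, and the class and method names here are hypothetical.

```python
from typing import List


class Slave:
    def __init__(self, name: str) -> None:
        self.name = name
        self.memory: List[dict] = []

    def reload(self, backup_file: List[dict]) -> None:
        self.memory.clear()              # S110: empty the in-memory features
        self.memory.extend(backup_file)  # then reload history from the backup


class Master:
    def __init__(self, slaves: List[Slave]) -> None:
        self.slaves = slaves

    def broadcast_reload(self, backup_file: List[dict]) -> None:
        for slave in self.slaves:        # S109: broadcast to all Slave nodes
            slave.reload(backup_file)


class BackupService:
    def __init__(self, master: Master) -> None:
        self.master = master
        self.backup_file: List[dict] = []

    def rebackup(self, database_rows: List[dict]) -> None:
        self.backup_file = list(database_rows)            # back up again
        self.master.broadcast_reload(self.backup_file)    # S108: ask Master


def on_recluster_finished(backup: BackupService,
                          database_rows: List[dict]) -> None:
    """S107: the clustering service asks the backup service to back up again."""
    backup.rebackup(database_rows)
```

The chain keeps every Slave node from serving features whose cluster IDs predate the re-clustering, because the reload only happens after the new backup exists.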

The target feature retrieval process (as shown in fig. 2) specifically comprises the following steps:

S201: the Slave node loads feature data from the backup file according to its Hash Value range;

S202: the Slave node incrementally loads feature data from the distributed database according to its Hash Value range;

S203: the Slave node stores all target features in groups according to ClusterID, forming the feature memory database of the Slave node;

S204: after receiving a feature retrieval request, the Slave node calculates the ClusterID of the feature to be retrieved;

S205: the Slave node selects the target feature library that matches the ClusterID of the feature to be retrieved, performs feature comparison, calculates the similarity, and returns a TopN result to the Master;

S206: the Master 111 gathers the TopN results of all Slave nodes, sorts them by similarity, and outputs the final TopN to the client.
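The summarizing step S206 on the Master can be sketched as a merge of the per-Slave result lists. The `(similarity, feature_id)` tuple layout and the name `merge_topn` are assumptions for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

Hit = Tuple[float, str]  # (similarity, feature_id)


def merge_topn(per_slave: Iterable[List[Hit]], n: int) -> List[Hit]:
    """Merge the TopN lists of all Slave nodes into one global TopN."""
    all_hits = (hit for hits in per_slave for hit in hits)
    return heapq.nlargest(n, all_hits)
```

For example, `merge_topn([[(0.9, "a"), (0.5, "b")], [(0.95, "c")], []], 2)` returns `[(0.95, "c"), (0.9, "a")]`. The sketch assumes each feature lives on exactly one Slave node, so no de-duplication is performed.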

The present invention also provides a computer-readable storage medium having stored thereon a computer program, wherein the program when executed by a processor implements the steps of a method of applying a cluster-filtering-based feature retrieval system.

The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the method of applying the cluster filtering based feature retrieval system when executing the program.

The foregoing embodiments have been provided for the purpose of illustrating the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the foregoing embodiments are merely illustrative of the present invention and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made without departing from the spirit and scope of the present invention also fall within the scope of the present invention.

Claims

1. The feature retrieval system based on cluster filtering is characterized by comprising a retrieval cluster, wherein a Master and a plurality of Slave nodes are arranged in the retrieval cluster, and the Master is used for providing distribution of feature retrieval requests and summarizing TopN results returned by the Slave nodes; the Slave node is used for providing feature retrieval service; the Slave node is also provided with a memory database, and the memory database is used for storing target characteristic data in groups by using cluster IDs (ClusterIDs); the target feature data comprise historical feature data loaded from a backup file and newly added real-time feature data loaded from a distributed database;

2. The feature retrieval system based on cluster filtering of claim 1, wherein the feature retrieval system comprises at least retrieval of face features, vehicle features and pedestrian features.

3. The feature retrieval system based on cluster filtering of claim 1, wherein the Master is further configured to dynamically add or delete Slave nodes.

4. The feature retrieval system based on cluster filtering of claim 3, wherein when a Slave node is newly added, the Master distributes part of target feature data of other Slave nodes to the newly added Slave node.

5. The feature retrieval system based on cluster filtering of claim 4, wherein when a Slave node is deleted, the Master distributes the target feature data of the Slave node which is offline to other Slave nodes, so as to ensure the integrity of the target feature data in the memory database.

6. The feature retrieval system based on cluster filtering of claim 5, wherein after the backup service completes the re-backup of the historical feature data, all Slave nodes are notified to reload the target feature data.

7. An application method based on the feature retrieval system as recited in claim 1, wherein the application method includes a memory feature management method and a target feature retrieval method; the memory characteristic management method comprises the following steps:

8. The application method according to claim 7, wherein the target feature retrieval method comprises the steps of:

9. The application method according to claim 7, wherein in step S106, when the target feature data needs to be re-clustered, the clustering service extracts the real-time feature data from the distributed database to re-calculate the cluster ID of the target feature data, and updates the cluster ID field and the cluster algorithm version number of the in-memory database.

10. The application method according to claim 9, wherein in step S110, the feature data of the current day are queried incrementally from the distributed database and the clustering algorithm version numbers are checked for consistency; if they are inconsistent, the record has not yet been clustered and is not loaded into the in-memory database.