CN115238006A

CN115238006A - Retrieval data synchronization method, device, equipment and computer storage medium

Info

Publication number: CN115238006A
Application number: CN202210899362.3A
Authority: CN
Inventors: 蒋楠; 严石伟
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2022-07-28
Filing date: 2022-07-28
Publication date: 2022-10-25

Abstract

The application discloses a retrieval data synchronization method, apparatus, device and computer storage medium, which can be applied to various scenarios such as cloud technology, artificial intelligence, intelligent transportation and assisted driving. After the feature management cluster performs a write operation, the updated features are synchronized to the feature retrieval cluster by write diffusion, so that the feature retrieval cluster updates its own database accordingly. To guarantee the reliability of synchronization, each retrieval node also subscribes to the write operation transactions of the feature management database, and the operation transaction information serves as a fallback strategy, thereby guaranteeing reliable synchronization between the feature management database and the feature retrieval database.

Description

Retrieval data synchronization method, device, equipment and computer storage medium

Technical Field

The application relates to the technical field of computers, in particular to the technical field of retrieval, and provides a retrieval data synchronization method, a retrieval data synchronization device, retrieval data synchronization equipment and a computer storage medium.

Background

Artificial Intelligence (AI) technology is widely applied in the field of content understanding; for example, it plays an important role in picture content auditing. Picture content auditing refers to checking whether an uploaded picture contains illegal content. The most common approach is to extract features of known illegal content, store those features in a database, and then extract features from each uploaded picture and retrieve matching features from the database.

A typical retrieval system includes three parts, namely picture input, Software Development Kit (SDK) processing and a retrieval service. The SDK processing mainly includes picture size control, encoding, feature extraction and the like. The retrieval service is the core of the entire retrieval system: it manages the feature database, for example by adding, deleting or modifying features, and it also performs feature retrieval, that is, feature matching against the features in the feature database to output retrieval results.

Therefore, the whole retrieval system currently shares a single retrieval library. Under highly concurrent retrieval requests, the retrieval pressure is high and latency increases; and once the retrieval library suffers a network or disk failure, the retrieval system becomes unavailable outright, which reduces the efficiency and availability of content auditing and retrieval.

Disclosure of Invention

The embodiments of the application provide a retrieval data synchronization method, apparatus, device and computer storage medium, which are used to improve the reliability of a retrieval system and thereby improve retrieval efficiency and availability.

In one aspect, a retrieval data synchronization method is provided, which is applied to a feature management cluster included in a retrieval system, the retrieval system further including a feature retrieval cluster composed of at least one retrieval node; the method comprises the following steps:

in response to a write operation request sent by a client, performing the corresponding write operation on the associated feature management database, and recording the corresponding operation transaction information in an operation log of the feature management database;

initiating a write diffusion request to each of the at least one retrieval node, wherein the write diffusion request carries feature information corresponding to the write operation, so that each retrieval node synchronously updates its associated feature retrieval database based on the feature information; and

when each retrieval node has subscribed to the write operation transactions of the feature management database, sending the operation transaction information to each retrieval node, so that a retrieval node that did not successfully perform the synchronous update performs a synchronization operation on its feature retrieval database based on the operation transaction information.
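The management-side steps above can be illustrated with a minimal in-memory sketch. All class, method and field names here (`RetrievalNodeStub`, `FeatureManagementCluster`, `handle_write`, `txn_id`) are hypothetical and do not come from the application; the network is simulated by a `reachable` flag, and only a feature-adding write is shown.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalNodeStub:
    """Hypothetical stand-in for a retrieval node reached over the network."""
    name: str
    reachable: bool = True
    features: dict = field(default_factory=dict)  # feature retrieval database

    def write_diffusion(self, feature_id, vector):
        if not self.reachable:
            raise ConnectionError(f"{self.name} unreachable")
        self.features[feature_id] = vector


@dataclass
class FeatureManagementCluster:
    nodes: list
    database: dict = field(default_factory=dict)       # feature management database
    operation_log: list = field(default_factory=list)  # operation transaction records

    def handle_write(self, feature_id, vector):
        # Step 1: perform the write and record the operation transaction.
        self.database[feature_id] = vector
        txn = {"txn_id": len(self.operation_log) + 1, "op": "add",
               "feature_id": feature_id, "vector": vector}
        self.operation_log.append(txn)
        # Step 2: write diffusion to every retrieval node; a failure on one
        # node does not roll back the write or block the other nodes.
        failed = []
        for node in self.nodes:
            try:
                node.write_diffusion(feature_id, vector)
            except ConnectionError:
                failed.append(node.name)
        # Step 3 (pushing txn to subscribed nodes) is left to a separate
        # relay that reads operation_log; it is not modelled here.
        return txn, failed
```

In this sketch the operation log entry exists whether or not diffusion succeeded, which is what lets the logged transaction act as the fallback for nodes listed in `failed`.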

In one aspect, a retrieval data synchronization method is provided, which is applied to any retrieval node in a feature retrieval cluster included in a retrieval system, the retrieval system further including a feature management cluster; the method comprises the following steps:

receiving a write diffusion request sent by the feature management cluster, wherein the write diffusion request carries feature information corresponding to a write operation performed by the feature management cluster on an associated feature management database, and the write operation is triggered by a write operation request sent by a client;

synchronously updating the associated feature retrieval database based on the feature information; and

receiving operation transaction information sent by the feature management cluster, wherein the operation transaction information records transaction information corresponding to the write operation;

and if it is determined that the synchronous update was not successful, performing a synchronization operation on the feature retrieval database based on the operation transaction information.
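The node-side steps above can be sketched as follows, assuming each write carries a transaction identifier that the node uses to track a per-request task state. The names `RetrievalNode`, `task_state`, `txn_id` and the `simulate_failure` flag are illustrative assumptions, not details from the application.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalNode:
    database: dict = field(default_factory=dict)    # feature retrieval database
    task_state: dict = field(default_factory=dict)  # txn_id -> "done" or "failed"

    def on_write_diffusion(self, txn_id, feature_id, vector, simulate_failure=False):
        """Apply a write diffusion request and record its task state."""
        if simulate_failure:  # stands in for a timeout, crash or disk error
            self.task_state[txn_id] = "failed"
            return False
        self.database[feature_id] = vector
        self.task_state[txn_id] = "done"
        return True

    def on_transaction_info(self, txn):
        """Fallback path: apply the logged transaction only if diffusion did not."""
        if self.task_state.get(txn["txn_id"]) == "done":
            return False  # already synchronized by write diffusion; nothing to do
        self.database[txn["feature_id"]] = txn["vector"]
        self.task_state[txn["txn_id"]] = "done"
        return True
```

Because the fallback checks the task state first, receiving the same transaction information more than once does not apply the write twice.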

In one aspect, a retrieval data synchronization apparatus is provided, which is applied to a feature management cluster included in a retrieval system, the retrieval system further including a feature retrieval cluster composed of at least one retrieval node; the apparatus comprises:

a feature management unit, configured to, in response to a write operation request sent by a client, perform the corresponding write operation on the associated feature management database, and record the corresponding operation transaction information in an operation log of the feature management database;

a write diffusion unit, configured to initiate a write diffusion request to each of the at least one retrieval node, wherein the write diffusion request carries the feature information corresponding to the write operation, so that each retrieval node synchronously updates its associated feature retrieval database based on the feature information; and

a transaction pushing unit, configured to send the operation transaction information to each retrieval node when each retrieval node has subscribed to the write operation transactions of the feature management database, so that a retrieval node that did not successfully perform the synchronous update performs a synchronization operation on its feature retrieval database based on the operation transaction information.

Optionally, the apparatus further includes a configuration unit, configured to:

configuring a transaction outbox corresponding to the feature management database, wherein the transaction outbox comprises a message relay module and a transaction message queue, and the message relay module subscribes to updates of the operation log;

receiving a subscription request sent by the at least one retrieval node, wherein the subscription request is used for requesting a subscription to the write operation transactions of the feature management database;

and in response to each subscription request, configuring the subscription of the corresponding retrieval node to the transaction message queue.

Optionally, the transaction pushing unit is specifically configured to:

acquiring the operation transaction information from the operation log through the message relay module, and writing the operation transaction information into the transaction message queue;

and acquiring the operation transaction information from the transaction message queue based on the subscription of each retrieval node to the transaction message queue, and respectively sending the operation transaction information to each retrieval node.
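A minimal in-memory sketch of the transaction outbox described above is given here. It assumes the operation log is a shared Python list and, as a simplification, models the transaction message queue as one `deque` per subscribed node; a real deployment would more likely use a publish-subscribe system such as Kafka. The names `TransactionOutbox`, `relay` and `poll` are hypothetical.

```python
from collections import deque


class TransactionOutbox:
    def __init__(self, operation_log):
        self.operation_log = operation_log  # list appended to by the write path
        self._relayed = 0                   # index of the next log entry to relay
        self.subscribers = {}               # node name -> that node's queue

    def subscribe(self, node_name):
        """Configure a retrieval node's subscription to the transaction queue."""
        self.subscribers[node_name] = deque()

    def relay(self):
        """Message relay: copy log entries not yet relayed into every queue."""
        new_entries = self.operation_log[self._relayed:]
        for entry in new_entries:
            for queue in self.subscribers.values():
                queue.append(entry)
        self._relayed = len(self.operation_log)
        return len(new_entries)

    def poll(self, node_name):
        """Return the next operation transaction for a node, or None if empty."""
        queue = self.subscribers[node_name]
        return queue.popleft() if queue else None
```

Tracking `_relayed` as an offset into the log means each committed transaction is relayed once, in log order, regardless of how often `relay` is called.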

Optionally, the apparatus further includes an initial synchronization unit, configured to:

receiving an initial synchronization request sent by a newly added retrieval node in the feature retrieval cluster, wherein the initial synchronization request carries a target synchronization mode determined by the newly added retrieval node;

if the target synchronization mode is a synchronization mode based on an operation log, sending the operation log to the newly added retrieval node;

if the target synchronization mode is a synchronization mode based on a database snapshot, determining whether the resource state of its own device meets the resource requirement of that synchronization mode;

if yes, executing snapshot persistence operation to generate a database snapshot of the feature management database, and sending the database snapshot to the newly-added retrieval node;

and if not, modifying the target synchronization mode into a synchronization mode based on the operation log, and sending the operation log to the newly-added retrieval node.
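The mode negotiation above can be sketched as a single function. The resource check is reduced to a comparison of free memory against an estimated snapshot cost; both quantities, and the names `handle_initial_sync`, `free_memory_mb` and `snapshot_cost_mb`, are assumptions made for illustration, since the application does not specify which resources are checked.

```python
import copy


def handle_initial_sync(requested_mode, database, operation_log,
                        free_memory_mb, snapshot_cost_mb):
    """Return (effective_mode, payload) for a newly added retrieval node."""
    if requested_mode not in ("snapshot", "operation_log"):
        raise ValueError(f"unknown synchronization mode: {requested_mode}")
    mode = requested_mode
    # Downgrade to log-based synchronization when the device cannot afford
    # a snapshot (hypothetical memory-based resource check).
    if mode == "snapshot" and free_memory_mb < snapshot_cost_mb:
        mode = "operation_log"
    if mode == "snapshot":
        return mode, copy.deepcopy(database)  # point-in-time copy of the features
    return mode, list(operation_log)          # full log for sequential replay
```

Returning the effective mode alongside the payload lets the new node tell whether it received a snapshot to load or a log to replay.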

Optionally, the initial synchronization unit is further configured to:

in the process of executing the snapshot persistence operation, if a write operation request sent by a client is received, caching the received write operation request;

after sending the database snapshot to the newly added retrieval node:

and sending the cached write operation request to the newly added retrieval node, so that the newly added retrieval node performs corresponding write operation based on the received write operation request.
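The caching behaviour during snapshot persistence might look like the sketch below. It assumes the management side also applies the cached writes to its own database once the snapshot is finished, which the text above does not state explicitly; `SnapshotCoordinator` and its methods are hypothetical names.

```python
import copy


class SnapshotCoordinator:
    def __init__(self, database):
        self.database = database  # feature management database
        self.snapshotting = False
        self.pending = []         # write requests cached during the snapshot

    def begin_snapshot(self):
        """Start snapshot persistence and return the point-in-time copy."""
        self.snapshotting = True
        return copy.deepcopy(self.database)

    def handle_write(self, feature_id, vector):
        if self.snapshotting:
            self.pending.append((feature_id, vector))  # cache, do not apply yet
            return "cached"
        self.database[feature_id] = vector
        return "applied"

    def finish_snapshot(self):
        """Apply cached writes locally (assumption) and return them so they
        can be forwarded to the newly added retrieval node."""
        self.snapshotting = False
        cached, self.pending = self.pending, []
        for feature_id, vector in cached:
            self.database[feature_id] = vector
        return cached
```

A new node that loads the snapshot and then applies the returned cached writes in order ends up with the same features as the management database.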

Optionally, the feature management unit is specifically configured to:

determining a target operation type corresponding to the write operation based on the write operation request;

if the target operation type is a feature deletion operation, deleting the feature indicated by the write operation request from the feature management database;

if the target operation type is a feature modification operation, extracting features of the content carried by the write operation request, and updating corresponding features in the feature management database according to the obtained features;

and if the target operation type is a feature adding operation, extracting features of the content carried by the write operation request, and adding the obtained features into the feature management database.
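The dispatch on operation type can be sketched as below. `extract_features` is a deliberately trivial placeholder for a trained feature extraction model, and the request layout (`op`, `feature_id`, `content`) is an assumed format, not one defined by the application.

```python
def extract_features(content: bytes) -> list:
    """Hypothetical placeholder: a real system would run a trained model here."""
    return [float(len(content)), float(sum(content) % 251)]


def apply_write(database: dict, request: dict) -> None:
    """Apply one write operation request to the feature management database."""
    op = request["op"]
    feature_id = request["feature_id"]
    if op == "delete":
        # Deletion needs no feature extraction, only the feature identifier.
        database.pop(feature_id, None)
    elif op == "modify":
        if feature_id not in database:
            raise KeyError(f"cannot modify unknown feature {feature_id!r}")
        database[feature_id] = extract_features(request["content"])
    elif op == "add":
        database[feature_id] = extract_features(request["content"])
    else:
        raise ValueError(f"unsupported operation type: {op!r}")
```

Only the add and modify branches invoke feature extraction, which matches the observation that deletions carry no content to extract from.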

In one aspect, a retrieval data synchronization apparatus is provided, which is applied to any retrieval node in a feature retrieval cluster included in a retrieval system, the retrieval system further including a feature management cluster; the apparatus comprises:

a receiving unit, configured to receive a write diffusion request sent by the feature management cluster, where the write diffusion request carries feature information corresponding to a write operation of the feature management cluster on an associated feature management database, and the write operation is triggered by a write operation request sent by a client;

a write diffusion execution unit, configured to synchronously update the associated feature retrieval database based on the feature information; and

the receiving unit is further configured to receive operation transaction information sent by the feature management cluster, where the operation transaction information records transaction information corresponding to the write operation;

and a transactional synchronization unit, configured to perform a synchronization operation on the feature retrieval database based on the operation transaction information if it is determined that the synchronous update was not successful.

Optionally, the write diffusion execution unit is further configured to:

updating the task state of the write operation request at the retrieval node based on the synchronous updating result;

the transactional synchronization unit is specifically configured to:

and if the task state indicates that the write operation request is not completed at the retrieval node, determining that the synchronous updating is not successful.

Optionally, the apparatus further includes an initial synchronization unit, configured to:

in response to an indication that the retrieval node joins the feature retrieval cluster, initiating an initial synchronization request to the feature management cluster, wherein the initial synchronization request is used for requesting that the full set of features in the feature management database be synchronized to the feature retrieval database according to a target synchronization mode;

receiving an operation log returned by the feature management cluster, and sequentially performing synchronization operations on the feature retrieval database according to each piece of operation transaction information recorded in the operation log; or,

receiving a database snapshot returned by the feature management cluster, and loading the database snapshot into the node's own feature retrieval database.

Optionally, the retrieval system further includes a feature extraction cluster; the initial synchronization unit is specifically configured to:

determining whether the resource utilization rate of the feature extraction cluster is greater than a preset threshold;

if the resource utilization rate is greater than the preset threshold, determining that the target synchronization mode is the synchronization mode based on a database snapshot;

if the resource utilization rate is not greater than the preset threshold, determining that the target synchronization mode is the synchronization mode based on an operation log;

and initiating the initial synchronization request to the feature management cluster based on the determined target synchronization mode.
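The threshold rule above reduces to a small function. The default threshold of 0.8 is an arbitrary illustrative value; the application only speaks of a preset threshold. One plausible reading, not stated outright, is that log-based synchronization replays write requests and therefore re-runs feature extraction, so a busy feature extraction cluster favours loading a snapshot instead.

```python
def choose_target_sync_mode(extraction_utilization: float,
                            threshold: float = 0.8) -> str:
    """Pick the target synchronization mode for a newly added retrieval node.

    extraction_utilization: resource utilization of the feature extraction
    cluster, as a fraction in [0, 1]. The 0.8 default is an assumption.
    """
    if extraction_utilization > threshold:
        return "snapshot"       # extraction cluster is busy: avoid replaying writes
    return "operation_log"      # extraction cluster has headroom: replay the log
```

Note that a utilization exactly equal to the threshold selects the operation-log mode, matching the "not greater than" wording of the text.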

Optionally, the initial synchronization unit is specifically configured to:

performing deserialization processing on the operation log to obtain a write operation request corresponding to each operation transaction information;

carrying out validity check on each write operation request, and filtering write operation requests which fail to pass the check;

carrying out duplicate removal processing on each write operation request after filtering;

and executing the write operation indicated by each write operation request after the duplication removal so as to synchronize the full quantity of the features in the feature management database to the feature retrieval database.
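The replay pipeline above (deserialize, validate, deduplicate, execute) can be sketched as follows. It assumes the operation log is serialized as one JSON object per line and that entries carry a `txn_id` usable as a deduplication key; both are assumptions, as is storing a precomputed `vector` instead of re-running feature extraction.

```python
import json

VALID_OPS = {"add", "modify", "delete"}


def replay_operation_log(raw_lines, database):
    """Replay a serialized operation log into a feature retrieval database.

    Returns the number of write operations actually applied.
    """
    seen = set()
    applied = 0
    for line in raw_lines:
        # 1. Deserialize; entries that cannot be parsed are dropped.
        try:
            request = json.loads(line)
        except json.JSONDecodeError:
            continue
        # 2. Validity check: required fields and a known operation type.
        if not isinstance(request, dict):
            continue
        if request.get("op") not in VALID_OPS:
            continue
        if "txn_id" not in request or "feature_id" not in request:
            continue
        if request["op"] != "delete" and "vector" not in request:
            continue
        # 3. Deduplicate on the (assumed) transaction identifier.
        if request["txn_id"] in seen:
            continue
        seen.add(request["txn_id"])
        # 4. Execute the write in log order.
        if request["op"] == "delete":
            database.pop(request["feature_id"], None)
        else:
            database[request["feature_id"]] = request["vector"]
        applied += 1
    return applied
```

Processing entries strictly in log order matters here: an add followed by a delete of the same feature must leave the feature absent.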

In one aspect, a data retrieval system is provided, which includes a feature management cluster and a feature retrieval cluster, wherein the feature retrieval cluster includes at least one retrieval node;

the feature management cluster, in response to a write operation request sent by a client, performs the corresponding write operation on the associated feature management database, and records the corresponding operation transaction information in an operation log of the feature management database; and initiates a write diffusion request to each of the at least one retrieval node, so that each retrieval node synchronously updates its associated feature retrieval database based on the feature information carried by the write diffusion request;

when each retrieval node has subscribed to the write operation transactions of the feature management database, the feature management cluster also sends the operation transaction information to the at least one retrieval node, so that a retrieval node that did not successfully perform the synchronous update performs a synchronization operation on its feature retrieval database based on the operation transaction information.

Optionally, the data retrieval system further includes an access device;

the access device receives an operation request sent by a client, and when the operation request is judged to be a write operation request, the write operation request is distributed to the feature management cluster, or when the operation request is judged to be a retrieval operation request, the retrieval operation request is distributed to the feature retrieval cluster.
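The routing performed by the access device can be sketched as below. The request `type` values and the round-robin choice among retrieval nodes are assumptions for illustration; the application only states that write requests go to the feature management cluster and retrieval requests to the feature retrieval cluster.

```python
import itertools

WRITE_TYPES = {"add", "modify", "delete"}


class AccessDevice:
    def __init__(self, management_target, retrieval_targets):
        self.management_target = management_target
        # Round-robin over retrieval nodes (an assumed balancing policy).
        self._retrieval_cycle = itertools.cycle(retrieval_targets)

    def route(self, request: dict) -> str:
        """Return the name of the target that should handle the request."""
        if request["type"] in WRITE_TYPES:
            return self.management_target
        if request["type"] == "search":
            return next(self._retrieval_cycle)
        raise ValueError(f"unknown request type: {request['type']!r}")
```

Keeping the write and read paths apart at the entry point is what allows the retrieval nodes to be scaled independently of the management cluster.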

In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.

In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the above-described methods.

In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.

In the retrieval data synchronization method, apparatus, device and computer storage medium provided by the embodiments of the application, the retrieval system is divided into a feature management cluster and a feature retrieval cluster. The feature management cluster is responsible for database-related write operations, while the feature retrieval cluster implements the retrieval function, and each cluster maintains its own associated database. This avoids the problems that arise when the whole retrieval system shares one retrieval library, namely high retrieval pressure and increased latency under highly concurrent retrieval requests, and outright unavailability of the retrieval system once the retrieval library suffers a network or disk failure. The reliability of the retrieval system is thereby improved, and with it the retrieval efficiency and availability. Meanwhile, after the feature management cluster performs a write operation, the updated features are synchronized to the feature retrieval cluster by write diffusion, so that the feature retrieval cluster updates its own database accordingly. To guarantee the reliability of synchronization, in the embodiments of the application each retrieval node also subscribes to the write operation transactions of the feature management database. Once a write operation on the feature management database has been performed and written to the operation log, the corresponding operation transaction information is triggered and sent to each retrieval node. When the synchronous update fails, the operation transaction information serves as a fallback strategy and the synchronization operation is performed again, which guarantees reliable synchronization between the feature management database and the feature retrieval database and further improves the reliability of the retrieval system.

Drawings

In order to more clearly illustrate the technical solutions in the embodiments or related technologies, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.

Fig. 1 is a schematic architecture diagram of a 1:N retrieval system;

Fig. 2 is a schematic diagram of an application scenario provided in an embodiment of the present application;

Fig. 3 is a schematic architecture diagram of a retrieval system provided in an embodiment of the present application;

Fig. 4 is a schematic diagram of a read-write separation CQRS framework provided in an embodiment of the present application;

Fig. 5 is a schematic flowchart of initial synchronization provided in an embodiment of the present application;

Fig. 6 is a schematic flowchart of a retrieval data synchronization method provided in an embodiment of the present application;

Fig. 7 is a schematic flowchart of a write diffusion process provided in an embodiment of the present application;

Fig. 8 is a schematic diagram of a mechanism combining write diffusion with a transaction outbox provided in an embodiment of the present application;

Fig. 9 is a schematic diagram of an application in a picture content auditing scenario provided in an embodiment of the present application;

Fig. 10 is a schematic structural diagram of a retrieval data synchronization apparatus provided in an embodiment of the present application;

Fig. 11 is a schematic structural diagram of another retrieval data synchronization apparatus provided in an embodiment of the present application;

Fig. 12 is a schematic structural diagram of a computer device provided in an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clearly understood, the technical solutions in the embodiments of the present application will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

For the convenience of understanding the technical solutions provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained first:

Operation log: used to record all operation transaction information performed in the database. A database transaction is a sequence of database operations that access and possibly modify various data items; these operations are either all performed or all not performed, and together form an indivisible unit of work. A transaction consists of all database operations performed between the beginning and the end of the transaction. In a database system, a transaction is a discrete unit of work, which may be an operation that modifies a feature in the database, adds a feature, and so on. For example, binlog is a binary operation log that records the operations by which the retrieval service modifies the retrieval library, including adding, deleting and modifying features.

Kafka: a high-throughput distributed publish-subscribe message queue system.

Command Query Responsibility Segregation (CQRS) architecture: an architectural pattern for read-write separation, which separates commands that change the state of a model from queries of the state of the model. The retrieval system of the embodiments of the present application is built on the idea of CQRS.

Write operation: an operation performed on features in the database, including feature addition, feature deletion and feature modification.

Write diffusion: in the embodiments of the present application, write diffusion means that when a feature in the feature management database is updated, the update is diffused to the feature retrieval databases. For example, when a new feature is added to the feature management database, the added feature should also be written to the feature retrieval databases, so that the feature management database and the feature retrieval databases stay synchronized and retrieval remains correct.

Transaction outbox: also called application events. It refers to performing the normal update and inserting a message into a dedicated outbox table in the database within the same transaction, so that subscribed recipients can receive the message and learn of the update performed by the transaction. After a write operation is performed on the feature management database, the related transaction is written to the operation log. Because this relies on the strong transactional guarantees of the database, once the operation log has been written, the feature management cluster has successfully completed all of the write operations. The feature retrieval cluster is therefore informed of the write operations executed by the feature management cluster through transactional messages, and when write diffusion is unsuccessful, the transactional messages serve as a fallback strategy that guarantees consistency between the feature management database and the feature retrieval databases.

Database snapshot: a snapshot is a fully available copy of a given data set that includes an image of the corresponding data at some point in time (the point in time at which the copy began). The snapshot may be a copy of the data it represents or may be a replica of the data. In practice, a snapshot is a reference mark or pointer to data stored in the storage device and represents the state of the data at a certain moment. The core of its working principle is to establish a list of pointers indicating the addresses from which data is read, to provide an image of the data at that instant, and to copy data when the data is changed. For a database, all data may reside in memory; if the database crashes, all of that data is lost, so a mechanism must be provided to ensure that data is not lost because of a failure. This is called a persistence mechanism, and a snapshot is one such mechanism. In other words, a snapshot can be understood as a full backup of the database's data at a certain moment.

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.

The artificial intelligence technology is a comprehensive subject and relates to the field of extensive technology, namely the technology of a hardware level and the technology of a software level. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning, automatic driving, intelligent traffic and the like.

With the research and progress of artificial intelligence technology, it has been studied and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare, smart customer service, the internet of vehicles and intelligent transportation.

The scheme provided by the embodiment of the application relates to management of the features extracted by adopting the artificial intelligence technology and a retrieval process realized on the basis of the features. Specifically, a feature extraction model of contents such as images, audios, videos, or texts, which are obtained by training through AI technologies such as a computer vision technology, a speech processing technology, a natural language processing technology, and machine learning/deep learning, may be adopted to extract features of contents to be extracted, and then the features are stored in a database, so that a related retrieval process may be subsequently implemented based on the features.

Take picture content auditing, a common content auditing scenario, as an example. Its main function is to judge whether a picture contains illegal content. The database therefore generally holds features corresponding to already identified illegal content, and feature matching is used to judge whether a picture contains the corresponding features and hence illegal content. Generally, a 1:N retrieval system is used. Fig. 1 is a schematic architecture diagram of a typical 1:N retrieval system, which comprises three parts, namely picture input, Software Development Kit (SDK) processing and a retrieval service. The picture input part receives the picture to be written or retrieved; the SDK processing mainly implements picture processing (including picture size control), image frame detection, encoding, feature extraction and the like; the retrieval service is the core of the whole retrieval system and mainly covers feature management, that is, write operations such as adding, deleting and modifying feature information, as well as feature retrieval, which mainly performs feature matching and outputs retrieval scores.

The whole retrieval system reuses one retrieval library, the retrieval pressure is large and the time consumption is increased when high concurrent retrieval requests occur, and once a network or disk failure occurs in the database, the retrieval system is directly unavailable, so that the real-time performance of content audit is influenced; and the feature management and feature retrieval modules are arranged in the same service, so that the coupling is high, the mutual influence is high, the development is not facilitated, for example, when the feature management function needs to be developed or optimized, the feature retrieval function is influenced, and even the function of the whole retrieval system needs to be suspended. From the view of the resource consumption of the device, if the subsequent capacity expansion caused by the increase of the retrieval request needs to consider the resources required by the feature management module and the feature retrieval module at the same time, the problem of resource waste caused by the invalid capacity expansion of the resources required by the feature management module is obvious.

In practice, it has been found that in scenarios such as content auditing, write operations are low-frequency operations in the auditing system; typically a batch of features is written to the library together only when the retrieval system is idle. Retrieval, by contrast, is the main function of the auditing system and is a high-frequency operation.

In the method of the embodiment of the present application, a feature management cluster is deployed independently for write operations such as addition, deletion, and modification, and a feature retrieval cluster is deployed independently for read operations such as retrieval, so that the high availability of retrieval is not affected. The feature management cluster and the feature retrieval cluster are physically isolated, and data consistency between them is ensured through a synchronization mechanism.

Furthermore, a corresponding synchronization mechanism is proposed, which may specifically include an initial synchronization and a dynamic synchronization.

Initial synchronization is the basic module of the retrieval library synchronization mechanism. Based on the resource utilization state, either a synchronization mode based on the operation log or a synchronization mode based on a database snapshot can be selected dynamically. Initial synchronization is generally used when a retrieval node first joins the feature retrieval cluster.

Dynamic synchronization is the core module of the retrieval library synchronization mechanism and mainly comprises two strategies: write diffusion and a transaction outbox. Write diffusion is a real-time update strategy: after the feature management cluster completes a write operation request such as an addition, deletion, or modification, it updates the state of the feature retrieval database associated with each retrieval node in the feature retrieval cluster. Once the write operation has successfully updated the feature management database, the feature management cluster initiates write operations to the plurality of retrieval nodes. To avoid blocking the write operation and to preserve real-time performance, write diffusion is performed asynchronously, so it may fail. The transaction outbox is a fallback strategy for the case where a write diffusion request fails to update a retrieval node: operation transaction information is generated by subscribing to the database operation log of the feature management cluster and is written to a message queue, and each retrieval node in the feature retrieval cluster obtains the operation transaction information by subscribing to the message queue, thereby ensuring consistency between the feature management database and the feature retrieval database.
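The interplay of the two strategies can be illustrated with a minimal in-memory sketch. Everything here is hypothetical and not taken from the application: the class names, the tuple layout of a transaction, the per-node queue standing in for a message queue subscription, and the use of a transaction ID to make replay idempotent. Write diffusion is shown as a direct call for brevity, whereas the text describes it as asynchronous.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class RetrievalNode:
    """Hypothetical retrieval node holding its own feature retrieval database."""
    name: str
    reachable: bool = True
    db: dict = field(default_factory=dict)
    applied: set = field(default_factory=set)  # transaction IDs already applied

    def apply(self, txn):
        """Apply a transaction once; replays of the same ID are ignored."""
        txn_id, op, feature_id, value = txn
        if txn_id in self.applied:
            return
        if op == "delete":
            self.db.pop(feature_id, None)
        else:  # "add" or "modify"
            self.db[feature_id] = value
        self.applied.add(txn_id)

    def receive_diffusion(self, txn):
        """Entry point for write diffusion; fails when the node is unreachable."""
        if not self.reachable:
            raise ConnectionError(self.name)
        self.apply(txn)


class ManagementCluster:
    """Hypothetical feature management side with an outbox queue per subscriber."""

    def __init__(self, nodes):
        self.db = {}
        self.nodes = nodes
        self.outbox = {node.name: queue.Queue() for node in nodes}
        self.next_id = 0

    def write(self, op, feature_id, value=None):
        self.next_id += 1
        txn = (self.next_id, op, feature_id, value)
        # 1. Update the feature management database first.
        if op == "delete":
            self.db.pop(feature_id, None)
        else:
            self.db[feature_id] = value
        # 2. Transaction outbox: publish the transaction for every subscriber.
        for q in self.outbox.values():
            q.put(txn)
        # 3. Write diffusion: best effort, failures are tolerated.
        for node in self.nodes:
            try:
                node.receive_diffusion(txn)
            except ConnectionError:
                pass
        return txn

    def drain_outbox(self, node):
        """Let a node consume its subscription and catch up on missed writes."""
        q = self.outbox[node.name]
        while not q.empty():
            node.apply(q.get())
```

In this sketch a node that missed a diffusion request converges as soon as it consumes its queue, and a node that already received the write skips the duplicate, which is one simple way to obtain the eventual consistency described above.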

Application scenarios to which the technical solution of the embodiment of the present application can be applied are briefly described below. It should be noted that the application scenarios described below are only used to illustrate the embodiment of the present application and are not limiting. In a specific implementation, the technical solution provided by the embodiment of the present application can be applied flexibly according to actual needs.

The method is suitable for highly available retrieval scenarios with few writes and many reads, such as picture auditing. With this scheme, a public cloud picture auditing system can retrieve information with better real-time performance, higher availability, and higher concurrency, so that a more reasonable and safer operation strategy can be formulated from audit results that are retrieved in time.

The scheme provided by the embodiment of the present application can be applied to various scenarios, including but not limited to cloud technology, artificial intelligence, intelligent traffic, and driving assistance. For example, it is applicable to most retrieval scenarios with few writes and many reads, such as picture content auditing for faces or identical pictures, public cloud picture content auditing, sound feature retrieval, and fingerprint feature retrieval, which are not enumerated one by one here. Fig. 2 is a schematic diagram of an application scenario provided in the embodiment of the present application, which may include a terminal device 101, an access device 102, a feature management cluster 103, and a feature retrieval cluster 104.

The terminal device 101 may be, for example, a mobile phone, a tablet computer (PAD), a laptop, a desktop computer, a smart wearable device, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, or an aircraft. A target application that initiates write operation requests or feature retrieval requests may be installed on the terminal device 101; for example, the target application may be a video application, an instant messaging application, a picture application, a music application, or a browser. The application involved in the embodiment of the present application may be a software client, or a client such as a web page or an applet, and the server is the background server corresponding to the software, web page, or applet. The specific type of client is not limited.

The access device 102 is configured to implement functions of parameter checking, task distribution, and the like for the received task request, that is, to distribute the write operation request to the feature management cluster and distribute the feature retrieval request to the feature retrieval cluster.

The feature management cluster 103 implements write operations such as adding, deleting, and modifying features. The feature retrieval cluster 104 performs feature matching calculation and then returns retrieval scores and hit results. The feature retrieval cluster 104 comprises a plurality of retrieval nodes; when the access device distributes a feature retrieval request, the request is routed to different retrieval nodes through load balancing, which supports horizontal scaling of retrieval capacity and ensures high availability and high concurrency for massive numbers of retrieval requests.
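The task distribution performed by the access device can be sketched as follows. This is an illustrative sketch only: the request layout (a dictionary with an `op` field), the handler callables, and the choice of round-robin as the load balancing policy are assumptions, since the text does not specify them.

```python
import itertools


class AccessDevice:
    """Hypothetical dispatcher: writes go to management, retrievals to a node."""

    WRITE_OPS = {"add", "delete", "modify"}

    def __init__(self, management_handler, retrieval_handlers):
        if not retrieval_handlers:
            raise ValueError("at least one retrieval node is required")
        self._management = management_handler
        # Round-robin is one simple load balancing policy among many.
        self._round_robin = itertools.cycle(retrieval_handlers)

    def dispatch(self, request):
        # Parameter check: reject anything that is not a known operation.
        op = request.get("op")
        if op in self.WRITE_OPS:
            return self._management(request)
        if op == "retrieve":
            return next(self._round_robin)(request)
        raise ValueError(f"unsupported operation: {op!r}")
```

Adding a retrieval node in this sketch only means adding one more handler to the rotation; the management handler is untouched, which mirrors the independent scaling of the two clusters.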

Taking a public cloud picture content auditing scene as an example, the feature management database and the feature retrieval database are mainly used for storing image features of illegal contents, for example, corresponding image features of illegal faces, articles and the like may be included, when write operation requests such as addition, deletion and modification need to be performed on the databases, the access device 102 may acquire the corresponding write operation requests and distribute the corresponding write operation requests to the feature management cluster 103, the feature management cluster 103 may perform corresponding write operations, update the feature management database, and synchronize the updated contents to each retrieval node of the feature retrieval cluster 104 to update the respective feature retrieval databases of the retrieval nodes.

In a public cloud picture content auditing scenario, the feature management operations of the feature management cluster 103 are low-frequency compared with the retrieval operations of the feature retrieval cluster 104; they mainly add features and require few resources. The retrieval query rate, measured in queries per second (QPS), can reach the tens of thousands, so the requirements on high availability, high concurrency, and real-time performance of the feature retrieval cluster 104 are high. These requirements can be met by using distributed retrieval nodes with load balancing.

Similarly, in other similar scenarios, the functions of the devices are implemented similarly, and thus are not described in detail again.

The access device 102, the feature management cluster 103, and the feature retrieval cluster 104 may be implemented by a server device with a certain computing capability, may be independent physical servers, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be cloud servers providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content Delivery Networks (CDNs), big data platforms, and artificial intelligence platforms, but are not limited thereto.

It should be noted that, in order to increase the reliability of the retrieval system and avoid the retrieval system being unusable due to a device failure, the feature management cluster 103 and the feature retrieval cluster 104 are different physical devices and are physically isolated from each other.

In this embodiment of the present application, the retrieval system may further include a feature extraction cluster (not shown in fig. 2) for implementing a feature extraction related function, which may provide a feature extraction function triggered by processes such as a write operation request, a database synchronization, and a feature retrieval request.

The feature management cluster 103 and each retrieval node in the feature retrieval cluster 104 may include one or more processors, memory, input/output (I/O) interfaces, and the like, and may also be configured with respective databases for storing the extracted features. The memory of each device may store program instructions of the retrieval data synchronization method provided in the embodiment of the present application. When executed by the corresponding processor, the program instructions implement the steps of the retrieval data synchronization method provided in the embodiment of the present application, so as to achieve data synchronization between the feature management database and the feature retrieval database.

The devices may be connected directly or indirectly through one or more networks. The network may be a wired network or a wireless network; for example, the wireless network may be a mobile cellular network, a Wireless Fidelity (Wi-Fi) network, or another possible network, which is not limited in this embodiment.

It should be noted that, where possible, the feature management cluster 103 and the access device 102 may be implemented by the same device, and similarly, the function of the feature extraction cluster may also be implemented by the feature management cluster 103 or the access device 102, which is not limited in this embodiment of the present application.

Fig. 3 is a schematic structural diagram of a retrieval system provided in an embodiment of the present application, where the retrieval system includes three parts, namely a retrieval access module, a feature management cluster, and a feature retrieval cluster.

(1) The retrieval access module is used for realizing the functions of parameter verification, task distribution and the like of the received task request, namely distributing the write operation request to the feature management cluster and distributing the feature retrieval request to the feature retrieval cluster.

(2) The feature management cluster comprises functions such as feature management, library management, write diffusion management, and initial synchronization. Feature management performs write operations on the feature management database in response to write operation requests; library management provides management functions for the feature management database; write diffusion management and initial synchronization are the functions responsible for synchronization between the feature management database and the feature retrieval database.

(3) The feature retrieval cluster comprises functions of feature retrieval, synchronization management and the like, wherein the feature retrieval refers to a corresponding retrieval process executed aiming at a feature retrieval request, namely, the feature matching calculation is completed and then a retrieval score and a hit result are returned, and the synchronization management is a synchronization process executed for keeping synchronization between a feature management database and a feature retrieval database.

In one possible implementation, the retrieval system may be a distributed retrieval system based on a read-write separation framework following Command Query Responsibility Segregation (CQRS). Fig. 4 is a schematic diagram of the read-write separation CQRS framework.

CQRS is a system pattern that separates command responsibility from query responsibility: commands that change the state of the model are separated from queries of that state. In the embodiment of the present application, feature management and feature retrieval are therefore implemented by different physical devices. The command end, that is, the feature management cluster, focuses on optimizing the writing of feature data: when a write operation request is received, the write operation is performed by the feature management service of the feature management cluster, and the feature management database is updated. The query end, that is, the feature retrieval cluster, focuses on optimizing the reading of feature data: when a feature retrieval request is received, the retrieval is performed by the feature retrieval service of the corresponding retrieval node, and the retrieval result is returned. The command end and the query end are deployed independently and physically isolated, and a data synchronization mechanism ensures the real-time performance and eventual consistency of the retrieval data.

In a possible application scenario, the relevant data (such as feature data and the like) and the model parameters involved in the embodiment of the present application may be stored by using a cloud storage (cloud storage) technology. The distributed cloud storage system refers to a storage system which integrates a large number of storage devices (or called storage nodes) of different types in a network through application software or application interfaces to cooperatively work through functions of cluster application, grid technology, distributed storage file systems and the like, and provides data storage and service access functions to the outside.

In a possible application scenario, retrieval nodes may be deployed in each region to reduce the communication delay of retrieval, or, to balance load, different retrieval nodes may serve terminal devices in different regions. For example, a terminal device located at site A establishes a communication connection with a retrieval node serving site A, and a terminal device located at site B establishes a communication connection with a retrieval node serving site B. The plurality of retrieval nodes and the feature management cluster may form a data sharing system, so that data sharing is implemented through a blockchain.

Each retrieval node or feature management cluster in the data sharing system has a node identifier corresponding to the retrieval node or feature management cluster, and each retrieval node or feature management cluster in the data sharing system may store node identifiers of other retrieval nodes or feature management clusters in the data sharing system, so that a generated block is subsequently broadcast to other retrieval nodes in the data sharing system according to the node identifiers of other retrieval nodes, for example, when a write operation request occurs, the feature management cluster may perform feature synchronization in this way.

Of course, the method provided in the embodiment of the present application is not limited to the application scenario shown in fig. 2 or the architecture of fig. 3, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited thereto. Functions that can be implemented by each device or module of the application scenario shown in fig. 2 or fig. 3 will be described together in the subsequent method embodiment, and are not described in detail here.

Based on the read-write separation CQRS framework, operations such as addition, deletion, and modification are deployed in the feature management cluster, and operations such as retrieval are deployed in the feature retrieval cluster. The two clusters are physically isolated and may run on heterogeneous machines, which avoids the situation where a device failure makes the data storage of the whole retrieval system unreliable and the whole retrieval system unavailable, and which meets the requirement of highly concurrent retrieval under limited resources. When retrieval requests increase, only the feature retrieval cluster needs to be expanded; the feature management cluster does not, so invalid capacity expansion of the feature management cluster is avoided. Synchronization of the feature management database and the feature retrieval database is achieved between the two clusters through a synchronization process. On this basis, the embodiment of the present application provides a retrieval data synchronization method for the above retrieval system, which implements synchronization between the feature management database and the feature retrieval database. This synchronization includes an initial synchronization process when a retrieval node is first added and a dynamic synchronization process while the retrieval node is in use, which are described below.

Fig. 5 is a schematic flow chart of the initial synchronization provided in the embodiment of the present application. Initial synchronization applies when a retrieval node first joins the feature retrieval cluster: its associated feature retrieval database does not yet contain any features, so all features in the feature management database need to be synchronized into it. Initial synchronization is therefore a full synchronization of the feature data, and comprises the following steps:

Step 501: A new retrieval node is added to the feature retrieval cluster, and the initial synchronization process starts.

When the retrieval demand on the feature retrieval cluster increases, retrieval nodes can be added conveniently to raise the capacity of the whole retrieval system for highly concurrent retrieval requests. After a newly added retrieval node has successfully joined the feature retrieval cluster, an initial synchronization process is required.

When the newly added retrieval node successfully joins the feature retrieval cluster, it can be notified that it has joined, and in response to this indication it initiates an initial synchronization request to the feature management cluster.

In a possible implementation, the feature retrieval cluster may have a management module. When a new retrieval node is needed, an application is submitted to the management module; when the deployment work for the new retrieval node is completed, the management module may send the new retrieval node an indication that it has successfully joined the feature retrieval cluster.

In a possible implementation, the work of adding the new retrieval node can be completed with the assistance of relevant personnel, and when the deployment work for the newly added retrieval node is completed, the initial synchronization process of the newly added retrieval node can be initiated.

For example, the newly added retrieval node may provide a display interface through which the relevant personnel can initiate the initial synchronization process. When the newly added retrieval node receives the corresponding operation, it has received the indication that the retrieval node has joined the feature retrieval cluster.

Alternatively, the relevant personnel can initiate the initial synchronization process from a management page of the retrieval system. When the newly added retrieval node receives the trigger generated by that operation, it has likewise received the indication that the retrieval node has joined the feature retrieval cluster.

Step 502: The newly added retrieval node determines a target synchronization mode.

In the embodiment of the present application, the synchronization mode may be one of the following:

(1) Synchronization mode based on operation log

The operation log records all operations on the feature management database, so executing the same operations according to the operation transaction information in the operation log produces the same result as in the feature management database.

For example, when 10 new features are added to the feature management database, the operation log records the operation transaction information corresponding to those additions. After the newly added retrieval node acquires the operation log and reads the operation transaction information, it can perform the same write operations again to add the 10 features to its associated feature retrieval database.

However, a write operation may involve feature extraction, which consumes feature extraction resources, so it is necessary to determine whether the feature extraction resources meet the requirement of the synchronization mode based on the operation log.

Specifically, the retrieval system may further include a feature extraction cluster, which is responsible for the feature extraction involved in write operations, feature retrieval, and similar processes. Because the synchronization mode based on the operation log needs to re-extract features, its resource demand on the feature extraction cluster is relatively high. The newly added retrieval node can therefore judge whether the current state of the feature extraction cluster meets the requirement of this synchronization mode.

(2) Synchronization mode based on database snapshot

When this synchronization mode is adopted, the feature management cluster persists a snapshot of the feature management database and sends the snapshot to the newly added retrieval node, which loads the snapshot after receiving it to complete the initial synchronization. The synchronization mode based on the database snapshot may increase the Central Processing Unit (CPU) load of the feature management cluster, so it places requirements on the device state of the feature management cluster.

Of course, other synchronization methods may also be adopted, and the embodiment of the present application does not limit this.

In a specific implementation, the newly added retrieval node may obtain the resource utilization rate of the feature extraction cluster and determine whether it is greater than a preset threshold. If the resource utilization rate is greater than the preset threshold, the current load of the feature extraction cluster is high, and adopting the synchronization mode based on the operation log would further increase the pressure on the feature extraction cluster; to preserve the computing resources required by online retrieval, another synchronization mode may be adopted. Otherwise, when the resource utilization rate is not greater than the preset threshold, the synchronization mode based on the operation log may be adopted.

Taking a picture content auditing scenario as an example, the audited objects are mainly pictures, so the Graphics Processing Unit (GPU) is the main resource consumed when the feature extraction cluster extracts features. The newly added retrieval node can therefore obtain the GPU utilization rate of the feature extraction cluster. When the GPU load is too high, online computing resources are in short supply, and another synchronization mode can be adopted so that online computing is not affected; when the GPU load is not high, the synchronization mode based on the operation log can be adopted.
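The selection rule described in the two paragraphs above can be written as a small function. The names and the default threshold of 0.8 are assumptions for illustration; the text only requires comparing the utilization against a preset threshold and falling back to "another" mode, for which this sketch uses the snapshot mode because it is the only alternative the text describes.

```python
from enum import Enum


class SyncMode(Enum):
    OPERATION_LOG = "operation_log"
    SNAPSHOT = "snapshot"


def choose_initial_sync_mode(extraction_utilization: float,
                             threshold: float = 0.8) -> SyncMode:
    """Pick the target synchronization mode on the newly added retrieval node.

    extraction_utilization is the resource (for example GPU) utilization of the
    feature extraction cluster, expressed as a fraction between 0 and 1.
    """
    if not 0.0 <= extraction_utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    if extraction_utilization > threshold:
        # Replaying the log would re-extract features and add to the load.
        return SyncMode.SNAPSHOT
    return SyncMode.OPERATION_LOG
```

Note that a utilization exactly equal to the threshold still selects the operation log, matching the "not greater than" wording in the text.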

It should be noted that the example here has the newly added retrieval node specify the target synchronization mode for the initial synchronization in advance; in practical applications, the feature management cluster may instead make the decision after receiving the initial synchronization request.

Step 503: The newly added retrieval node initiates an initial synchronization request to the feature management cluster.

In this embodiment of the present application, the initial synchronization request instructs the feature management cluster to synchronize the full set of features in the feature management database to the feature retrieval database according to the determined target synchronization mode.

Step 504: the feature management cluster determines a target synchronization mode indicated by the initial synchronization request.

Step 505: If the target synchronization mode is the synchronization mode based on the operation log, the feature management cluster sends the operation log to the newly added retrieval node.

After receiving the initial synchronization request, the feature management cluster obtains the target synchronization mode indicated in the request. When the target synchronization mode is the synchronization mode based on the operation log, the feature management cluster sends the current operation log of its associated feature management database to the newly added retrieval node.

Step 506: The newly added retrieval node performs synchronization operations on the feature retrieval database, in order, according to the operation transaction information recorded in the operation log.

The operation log records the operation transaction information of the feature management database in order, so when the newly added retrieval node executes the entries in the same order, it becomes synchronized with the feature management database.

In a possible implementation, the operation log may be a binlog that records, in order, the write operation requests corresponding to the write operations performed on the feature management database. To save bandwidth, the sequence of write operation requests in the binlog may be serialized into binary form. After receiving the binlog, the newly added retrieval node deserializes it to obtain the write operation request corresponding to each piece of operation transaction information, for example the request structure of each write operation request, which includes the type of write operation to be executed and the write operation object. In a picture content auditing scenario the write operation object may be a picture, and in a text content auditing scenario it may be a text.
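The serialize-then-deserialize round trip can be sketched with a simple length-prefixed binary framing. This framing is an assumption made for illustration and is not the actual binlog format used by any particular database; the field names `op`, `id`, and `object` are likewise hypothetical.

```python
import json
import struct


def serialize_log(requests):
    """Encode ordered write requests as length-prefixed binary records."""
    out = bytearray()
    for req in requests:
        payload = json.dumps(req, sort_keys=True).encode("utf-8")
        # 4-byte big-endian length header followed by the record body.
        out += struct.pack(">I", len(payload)) + payload
    return bytes(out)


def deserialize_log(blob):
    """Decode the records in their original order."""
    requests, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        if offset + length > len(blob):
            raise ValueError("truncated operation log")
        requests.append(json.loads(blob[offset:offset + length].decode("utf-8")))
        offset += length
    return requests
```

Because records are decoded in the order they were written, the receiving node can replay them sequentially, which is what the ordered execution in Step 506 relies on.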

To avoid wasting resources on invalid and repeated requests, the newly added retrieval node can perform a validity check on each write operation request, filter out the requests that fail the check, and deduplicate the remaining requests. It then executes the corresponding write operation for each request left after deduplication, writing each feature into its associated feature retrieval database so that the features in the feature management database are synchronized to the feature retrieval database.

Specifically, deduplication can remove repeated copies of the same write operation request and can also merge write operations on the same feature, reducing the number of write operations to be performed. For example, if a feature is added to the database and then deleted, the result is equivalent to the feature never having existed, so the addition and deletion need not actually be performed and the final state of the database is the same. Similarly, if a feature is added and then modified, the database only retains the feature as modified, so only one write with the modified content needs to be executed. On this basis, operations on features with the same identification (ID) in the database can be merged, reducing the amount of data processed by write operations as far as possible.
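A minimal sketch of the filter, deduplicate, and merge steps follows. It assumes the target feature retrieval database starts empty, as it does in initial synchronization, so each surviving feature ID collapses into a single `add`. The request fields (`request_id`, `op`, `id`, `object`) and the validity rules are hypothetical.

```python
VALID_OPS = {"add", "modify", "delete"}


def is_valid(req):
    """Hypothetical validity check: known operation, non-empty ID, payload present."""
    return (
        isinstance(req, dict)
        and req.get("op") in VALID_OPS
        and isinstance(req.get("id"), str)
        and req["id"] != ""
        and (req["op"] == "delete" or req.get("object") is not None)
    )


def compact(requests):
    """Filter, deduplicate, and merge an ordered list of write requests."""
    seen_request_ids = set()
    final = {}  # feature ID -> latest write object; dicts keep insertion order
    for req in requests:
        if not is_valid(req):
            continue  # drop requests that fail the validity check
        request_id = req.get("request_id")
        if request_id is not None:
            if request_id in seen_request_ids:
                continue  # drop repeated copies of the same request
            seen_request_ids.add(request_id)
        if req["op"] == "delete":
            final.pop(req["id"], None)  # add followed by delete cancels out
        else:
            final[req["id"]] = req["object"]  # add then modify keeps the last value
    return [{"op": "add", "id": fid, "object": obj} for fid, obj in final.items()]
```

For a log containing a duplicated add of feature `a`, an add and delete of feature `b`, and a later modify of `a`, the function returns a single `add` for `a` carrying the modified object, so only one feature extraction is needed instead of four.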

After the initial synchronization is completed, the newly added retrieval node can clean up data such as operation logs.

Step 507: if the target synchronization mode is a synchronization mode based on the database snapshot, the feature management cluster determines whether the resource state of the self device meets the resource requirement of the synchronization mode.

In this embodiment of the present application, because the synchronization mode based on the database snapshot may increase the CPU load and disk pressure of the feature management cluster, the feature management cluster needs to determine whether the resource state of its own device (for example, the CPU and disk state) meets the resource requirement of this synchronization mode, so that online requests are not affected, and it decides whether to change the target synchronization mode based on the result.

Step 508: if so, the feature management cluster executes snapshot persistence operation to generate a database snapshot of the feature management database.

Step 509: The feature management cluster sends the database snapshot to the newly added retrieval node.

Step 510: The newly added retrieval node loads the database snapshot into its own feature retrieval database.

Step 511: If not, the feature management cluster changes the target synchronization mode to the synchronization mode based on the operation log.

Step 512: The feature management cluster sends the operation log to the newly added retrieval node.

If the feature management cluster determines that its CPU load is too high, it changes the target synchronization mode to the synchronization mode based on the operation log and, at the same time, reduces the concurrency of the feature extraction cluster to minimize the impact on online services. If it determines that the CPU load is normal, it carries out the database snapshot process to complete the initial synchronization.
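The decision made by the feature management cluster in Steps 507 to 512 can be sketched as follows. The load limit and the concurrency values are placeholders chosen for illustration, and `SyncPlan` and `plan_initial_sync` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncPlan:
    mode: str                    # "snapshot" or "operation_log"
    extraction_concurrency: int  # concurrency granted to feature extraction


def plan_initial_sync(requested_mode, cpu_load, disk_load, *,
                      load_limit=0.8, normal_concurrency=8,
                      reduced_concurrency=2):
    """Confirm or change the requested mode on the feature management side."""
    if requested_mode not in ("operation_log", "snapshot"):
        raise ValueError(f"unknown synchronization mode: {requested_mode!r}")
    overloaded = cpu_load > load_limit or disk_load > load_limit
    if requested_mode == "snapshot" and overloaded:
        # Snapshot persistence would add CPU and disk pressure, so fall back to
        # the operation log and throttle feature extraction during the replay.
        return SyncPlan("operation_log", reduced_concurrency)
    return SyncPlan(requested_mode, normal_concurrency)
```

Lowering the extraction concurrency in the fallback branch reflects the trade-off in the text: the replay takes longer, but online retrieval keeps most of the extraction capacity.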

In the embodiment of the present application, when the feature management cluster determines that its CPU load is too high, it may also negotiate with the newly added retrieval node to determine whether the initial synchronization can be delayed. If so, a new initial synchronization time may be determined; for example, a time when the retrieval system is idle may be selected, and this time may be obtained from statistics on historical data.

In the embodiment of the present application, when the feature management database is large, the initial synchronization process may take a long time, and new write operation requests may be received while it is running. For these newly received write operation requests, when the synchronization mode based on the operation log is adopted, the operation transaction information corresponding to the requests can continue to be sent to the newly added retrieval node.

When a synchronization mode based on the database snapshot is adopted, in the process of executing the snapshot persistence operation, if a write operation request sent by the client is received, an incremental synchronization mode can be adopted, namely the received write operation request is cached, and the cached write operation request is sent to the newly-added retrieval node, so that the newly-added retrieval node performs corresponding write operation based on the received write operation request.
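The incremental synchronization described above may be sketched as follows, assuming for illustration that the feature management database is an in-memory dictionary keyed by feature ID. The class `FeatureManagementDB` and the function `load_on_new_node` are hypothetical names; a real deployment would rely on the snapshot facility of the underlying database.

```python
import copy


class FeatureManagementDB:
    """Toy stand-in for the feature management database (assumption)."""

    def __init__(self):
        self.features = {}
        self._snapshotting = False
        self._cached_writes = []

    def apply(self, op, feature_id, data=None):
        if op == "delete":
            self.features.pop(feature_id, None)
        else:  # "add" or "modify"
            self.features[feature_id] = data
        if self._snapshotting:
            # Writes arriving during the snapshot are cached for later replay.
            self._cached_writes.append((op, feature_id, data))

    def begin_snapshot(self):
        self._snapshotting = True
        self._cached_writes = []
        return copy.deepcopy(self.features)

    def end_snapshot(self):
        self._snapshotting = False
        cached, self._cached_writes = self._cached_writes, []
        return cached


def load_on_new_node(snapshot, cached_writes):
    """Load the snapshot, then replay the cached writes in arrival order."""
    retrieval_db = dict(snapshot)
    for op, feature_id, data in cached_writes:
        if op == "delete":
            retrieval_db.pop(feature_id, None)
        else:
            retrieval_db[feature_id] = data
    return retrieval_db
```

Replaying the cached requests after the snapshot is loaded brings the newly added retrieval node to the same state as the feature management database at the end of the snapshot.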

In the embodiment of the application, the initial synchronization handles the synchronization of the full retrieval data when a newly added retrieval node joins the feature retrieval cluster for the first time. Once the newly added retrieval node has joined the feature retrieval cluster and completed the initial synchronization, it can be put into use; during subsequent use, the difference data between the feature management cluster and the feature retrieval cluster needs to be synchronized in real time, and the eventual consistency of the retrieval library needs to be ensured. Based on this, the embodiment of the application also introduces dynamic synchronization, which mainly comprises write diffusion and a transaction outbox strategy, wherein the write diffusion ensures that most write operation requests of the feature management cluster are updated to the feature retrieval cluster in time.

Fig. 6 is a schematic flow chart of a retrieved data synchronization method according to an embodiment of the present application.

Step 601: the feature management cluster receives a write operation request sent by a client.

In the embodiment of the application, when the access layer of the retrieval system determines that the received operation request is a write operation request, the write operation request is distributed to the feature management cluster. The access layer can carry out validity check on the write operation request, and the feature management cluster can also carry out secondary check on the distributed write operation request, judge whether the write operation request is an invalid request, ignore the write operation request when the write operation request is the invalid request, end the process, and execute the subsequent process to execute the write operation when the write operation request is the valid request.

Step 602: the feature management cluster performs the corresponding write operation on the associated feature management database.

Step 603: the feature management cluster records corresponding operation transaction information in an operation log of the feature management database.

In the embodiment of the present application, the high-guarantee transaction characteristics of the database are relied on. That is, the corresponding operation transaction information is written into the operation log only when the feature management cluster has successfully completed all write operations indicated in the write operation request; if the write operations fail or are only partially completed, nothing is written into the operation log. Consequently, once operation transaction information is recorded in the operation log, the feature management cluster must have successfully completed all write operations indicated in the write operation request.

Specifically, the operation transaction information may include all transaction information corresponding to the write operation, including but not limited to a request structure of the write operation request, process information of executing the write operation, and the like.

Step 604: the feature management cluster initiates a write diffusion request to each of the at least one retrieval node.

In the embodiment of the application, the write diffusion request is triggered automatically: after the client sends a write operation request to the feature management cluster, and the feature management cluster completes feature management, storage into the retrieval library and appending to the service binlog, the write diffusion request is initiated. The write diffusion request sent by the feature management cluster carries feature information corresponding to the write operation; for a deletion operation, the feature information may be a feature ID, and for a modification or addition operation, the feature information may include the modified or added feature data.

In a possible implementation manner, in order to avoid network delay or time consumption caused by processing of the search node, the write-diffusion request in the embodiment of the present application is an asynchronous write manner, that is, after the feature management cluster initiates write diffusion, a response is directly returned without waiting for a processing result of the search node.
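A minimal sketch of such an asynchronous write, assuming for illustration that each retrieval node is reachable through a local object and that a thread pool stands in for the network layer. `RetrievalNode`, `on_write_diffusion` and `diffuse_write` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor, wait


class RetrievalNode:
    """Toy retrieval node that records the requests it receives."""

    def __init__(self, name):
        self.name = name
        self.received = []

    def on_write_diffusion(self, request):
        self.received.append(request)


# Shared pool standing in for the asynchronous transport (assumption).
_pool = ThreadPoolExecutor(max_workers=4)


def diffuse_write(nodes, request):
    """Submit the request to every node and return without waiting.

    The returned futures are only a convenience for tests and monitoring;
    the feature management cluster does not block on them.
    """
    return [_pool.submit(node.on_write_diffusion, request) for node in nodes]
```

Because the caller does not wait for the futures, a failure at a retrieval node is not observed here, which is why a separate fallback is needed.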

Step 605: each retrieval node synchronously updates its associated feature retrieval database based on the feature information.

In the embodiment of the present application, the synchronous update refers to an operation performed directly on a feature in the feature retrieval database, such as adding, modifying, or deleting.

Specifically, when the write diffusion request received by the retrieval node carries modified feature data, the retrieval node replaces the corresponding original feature data with the received feature data; when the write diffusion request received by the retrieval node carries newly added feature data, the retrieval node adds the received feature data into the feature retrieval database; and when the write diffusion request received by the retrieval node carries the ID of a deleted feature, the retrieval node looks up and deletes the feature data with that ID.
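The three cases above may be sketched as follows, assuming for illustration that the feature retrieval database is a dictionary keyed by feature ID and that the write diffusion request is a dictionary with the hypothetical fields `op`, `feature_id` and `feature`.

```python
def apply_write_diffusion(retrieval_db: dict, request: dict) -> None:
    """Apply one write diffusion request to the feature retrieval database."""
    op = request["op"]
    if op == "delete":
        # Deletion carries only the feature ID.
        retrieval_db.pop(request["feature_id"], None)
    elif op in ("add", "modify"):
        # Addition inserts the feature; modification replaces the original.
        retrieval_db[request["feature_id"]] = request["feature"]
    else:
        raise ValueError(f"unknown operation type: {op!r}")
```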

Step 606: the feature management cluster sends the operation transaction information to each retrieval node.

In the embodiment of the present application, considering that write diffusion may fail under the influence of various factors, a more reliable synchronization manner needs to be adopted as a fallback strategy in order to avoid inconsistency between the feature management database and the feature retrieval database.

Therefore, relying on the high-guarantee transaction characteristics of the database, each retrieval node can learn of the write operations executed on the feature management database by subscribing to the write operation transactions of the feature management database. That is, when each retrieval node has subscribed to the write operation transactions of the feature management database, the feature management cluster sends operation transaction information to each retrieval node once the operation log of the feature management database is updated.

In a possible implementation manner, the above process may be implemented by using a transaction outbox mechanism. As shown in fig. 6, the operation log of the feature management database may be subscribed to, so that when the operation log is updated, the corresponding operation transaction information is written into a transaction message queue; each retrieval node also subscribes to the topic of the transaction message queue, so that when the transaction message queue is updated, the operation transaction information is sent to each retrieval node.
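The relay from the operation log to the transaction message queue may be sketched as follows. This is an in-process illustration only: the operation log is modeled as a Python list that is polled, whereas a real system would subscribe to the binlog of the database and use a persistent message queue. `TransactionMessageQueue`, `MessageRelay` and the topic name are hypothetical.

```python
import queue


class TransactionMessageQueue:
    """Toy topic-based queue; each subscriber gets its own inbox."""

    def __init__(self):
        self._subscribers = {}  # topic -> list of per-subscriber queues

    def subscribe(self, topic):
        inbox = queue.Queue()
        self._subscribers.setdefault(topic, []).append(inbox)
        return inbox

    def publish(self, topic, message):
        for inbox in self._subscribers.get(topic, []):
            inbox.put(message)


class MessageRelay:
    """Forwards new operation-log entries to a topic of the queue."""

    def __init__(self, operation_log, mq, topic):
        self._log = operation_log
        self._mq = mq
        self._topic = topic
        self._next = 0  # index of the first entry not yet forwarded

    def poll(self):
        """Forward entries appended since the last poll; return their count."""
        new_entries = self._log[self._next:]
        for entry in new_entries:
            self._mq.publish(self._topic, entry)
        self._next += len(new_entries)
        return len(new_entries)
```

Each retrieval node would call `subscribe` once and then consume its inbox; the relay keeps a cursor so that an entry is forwarded only once.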

Step 607: the retrieval node determines whether the synchronous update was successful.

If the result of step 607 is yes, the operation transaction information is ignored and not executed.

Step 608: if the result of step 607 is no, the retrieval node performs a synchronization operation on the feature retrieval database based on the operation transaction information.

In the embodiment of the present application, since the operation transaction information is updated substantially at the same time as the write-diffusion request, the idempotent characteristic of the database may be adopted to avoid repeatedly executing the update.

In a possible implementation manner, since the write operation request needs to update the associated database in both the feature management cluster and the feature retrieval cluster, the corresponding task states of the write operation request can be maintained in the feature management cluster and the retrieval nodes. For example, in the feature management cluster, when the write operation process corresponding to the write operation request is completed, the corresponding task state is updated to be completed, otherwise, the corresponding task state is updated to be not completed; similarly, in each search node, when the write operation process corresponding to the write operation request is completed, the corresponding task state is updated to be completed, otherwise, the corresponding task state is updated to be not completed.

Then, after the retrieval node executes the synchronous update based on the write diffusion request, the task state of the write operation request at the retrieval node may be updated based on the result of the synchronous update: when the synchronous update succeeds, the task state is updated to completed, and when it fails, the task state is updated to not completed. When the operation transaction information is subsequently received, the retrieval node determines whether the synchronous update was successful based on the task state. If the task state indicates that the write operation request is still not completed at the retrieval node, the synchronous update is determined to be unsuccessful, and a synchronization operation may then be performed on the feature retrieval database based on the operation transaction information, that is, the feature retrieval database is updated based on the features in the operation transaction information, or the features are re-extracted based on the request structure of the write operation request in the operation transaction information and the feature retrieval database is updated accordingly.

In a possible implementation manner, when updating the database based on the operation transaction information, it may be determined whether the same update is performed, that is, the operation object, the operation content, the write operation request, and the like are all consistent, and if so, it indicates that the synchronous update is completed, and then the repeated update is not required; if the same update has not been made, the update is performed on the feature search database.
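The task-state check and the fallback of steps 607 and 608 may be sketched as follows. The class `RetrievalNodeState`, the `request_id` field and the `fail` flag (which simulates a failed write diffusion) are hypothetical and serve only the illustration.

```python
class RetrievalNodeState:
    """Toy retrieval node keeping a task state per write operation request."""

    def __init__(self):
        self.db = {}
        self.task_state = {}  # request_id -> "completed" / "not_completed"

    def _apply(self, txn):
        if txn["op"] == "delete":
            self.db.pop(txn["feature_id"], None)
        else:
            self.db[txn["feature_id"]] = txn["feature"]

    def on_write_diffusion(self, txn, fail=False):
        try:
            if fail:
                raise IOError("simulated write diffusion failure")
            self._apply(txn)
            self.task_state[txn["request_id"]] = "completed"
        except IOError:
            self.task_state[txn["request_id"]] = "not_completed"

    def on_transaction_info(self, txn):
        """Return True if the fallback update was executed, False if ignored."""
        if self.task_state.get(txn["request_id"]) == "completed":
            return False  # synchronous update already succeeded: ignore
        self._apply(txn)
        self.task_state[txn["request_id"]] = "completed"
        return True
```

A write diffusion request that never arrived leaves no task state at all, so the fallback is also executed in that case.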

In the embodiment of the present application, refer to fig. 7, which is a schematic flow chart of a write diffusion process.

Step 701: the feature management cluster receives a write operation request.

Step 702: the feature management cluster determines, based on the write operation request, the target operation type corresponding to the write operation, so as to execute the corresponding processing flow according to the operation type.

Step 703: if the target operation type is a feature deletion operation, the feature management cluster deletes the feature indicated by the write operation request from the feature management database.

For example, if the write request indicates to delete a feature with an ID of "XXX", the feature management cluster searches for the feature according to the ID and deletes the feature.

Step 704: if the target operation type is a feature modification operation or a feature addition operation, the feature management cluster performs feature extraction on the content carried by the write operation request. The feature management cluster can request the feature extraction cluster to extract the features of the carried content.

Step 705: the feature management cluster determines whether feature extraction was successful.

Step 706: if the feature extraction is successful, the feature management cluster determines the operation type again. If the feature extraction fails, the process ends.

Step 707: for a feature addition operation, the feature management cluster constructs a unique ID for the feature data.

Step 708: the feature management cluster stores the extracted features in the feature management database.

For example, taking a picture content audit scene as an example, if the write operation request indicates that the features corresponding to the carried image are added to the feature management database, the feature extraction cluster is called, feature extraction is performed on the image carried by the write operation request, and an ID is assigned to the feature, so as to add corresponding features in the feature management database.

Step 709: for a feature modification operation, the feature management cluster updates the corresponding feature in the feature management database with the obtained feature.

For example, taking a picture content audit scene as an example, if the write operation request indicates that the feature with the ID of "XXX" is modified to the feature corresponding to the carried image, the feature extraction cluster is invoked to perform feature extraction on the image carried by the write operation request, and replace the original feature with the ID of "XXX" in the feature management database.

Step 710: the feature management cluster serializes the write operation request and writes it to the operation log.

Step 711: the feature management cluster determines whether the write to the operation log failed.

Step 712: if the write is successful, the feature management cluster asynchronously sends write diffusion requests to all retrieval nodes of the feature retrieval cluster, so that all the retrieval nodes receive the write diffusion requests and execute the synchronous update.

Step 713: if the write fails, the feature management cluster deletes the feature from the feature management database. That is, if writing the binlog fails, the current write operation has failed, and the feature management database needs to be rolled back to its state before the operation. The feature management cluster can delete the features from the feature management database in a lazy deletion manner.
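Steps 701 to 713 may be sketched as follows. The sketch makes several assumptions that are not part of the embodiment: the feature management database is a dictionary, the operation log is any object with an `append` method, `extract` returns `None` when feature extraction fails, and the rollback restores a copy of the database rather than using lazy deletion. The function name `handle_write` and the request fields are hypothetical.

```python
import json
import uuid


def handle_write(request, db, binlog, extract, diffuse):
    """Return True if the write was committed and diffused, else False."""
    op = request["op"]                                # step 702
    previous = dict(db)  # undo copy for rollback (assumption)
    if op == "delete":                                # step 703
        feature_id, feature = request["feature_id"], None
        db.pop(feature_id, None)
    else:
        feature = extract(request["content"])         # step 704
        if feature is None:                           # steps 705 and 706
            return False
        if op == "add":
            feature_id = uuid.uuid4().hex             # step 707
        else:
            feature_id = request["feature_id"]
        db[feature_id] = feature                      # steps 708 and 709
    record = {"op": op, "feature_id": feature_id, "feature": feature}
    try:
        binlog.append(json.dumps(record))             # step 710
    except OSError:                                   # steps 711 and 713
        db.clear()
        db.update(previous)
        return False
    diffuse(record)                                   # step 712
    return True
```

The write diffusion is issued only after the operation log entry is written, which matches the ordering of steps 710 to 712.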

In the embodiment of the present application, referring to fig. 8, a schematic diagram of a write diffusion combining transaction outbox mechanism provided in the embodiment of the present application is shown.

In the embodiment of the present application, referring to fig. 8, when there is a write operation, the feature management service of the feature management cluster performs the corresponding persistent update on the associated feature management database, and when the update is successful, each retrieval node of the feature retrieval cluster is synchronously updated in a write diffusion manner. However, the write diffusion process is an asynchronous operation: after the write diffusion request is sent, the feature management cluster does not wait for the result returned by each retrieval node, but returns the result of the current write operation to the front end. It therefore cannot be guaranteed that every retrieval node writes successfully, and as shown in fig. 8, the feature update of a certain retrieval node may fail. In order to ensure data consistency, a transaction outbox mechanism is introduced as a fallback strategy; based on the highly available transaction characteristics of the database and the persistence of messages in the message queue, it can be well guaranteed that a write request is eventually written into the database of each retrieval node.

In this embodiment of the present application, before the dynamic synchronization process is executed, a transaction outbox corresponding to the feature management database needs to be configured. As shown in fig. 8, the transaction outbox mainly includes two modules, namely a message relay module (message relay) and a transaction message queue (message queue). The message relay module is responsible for subscribing to the operation log, such as the binlog, of the feature management cluster database, that is, subscribing to updates of the operation log. After the feature management cluster successfully completes the successive operations of feature extraction, database writing and the like, operation transaction information is written into the operation log, and the message relay module that has subscribed to it acquires and caches the message and then writes the message into a specific topic of the message queue. The feature management cluster also receives subscription requests sent by the retrieval nodes, where a subscription request is used for requesting to subscribe to the write operation transactions of the feature management database, and configures the subscription of each retrieval node to the transaction message queue in response to each subscription request. That is, each retrieval node also subscribes to the topic of the message queue, and once the message queue is updated, the corresponding operation transaction information is acquired and applied to the database of the node, thereby achieving data consistency.

Furthermore, the message relay module can acquire operation transaction information from the operation log and write it into the transaction message queue; the operation transaction information is then acquired from the transaction message queue based on the subscription of each retrieval node to the transaction message queue and sent to each retrieval node respectively. The message relay module subscribes to the operation log of the database mainly in order to rely on the high-guarantee transaction characteristics of the database: once an entry is written to the operation log, the feature management cluster must have successfully completed all write operations of the current write operation request. The message queue mainly provides system decoupling and message persistence to avoid message loss. After receiving operation transaction information from the message queue, each retrieval node needs to apply the updates in the same order as the messages of the feature management cluster, so as to avoid disorder, and data duplication is avoided based on an update idempotency mechanism.
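One possible way to combine in-order application with idempotency is sketched below. It assumes that every piece of operation transaction information carries a monotonically increasing sequence number starting at 1; this sequence number, the class `OrderedApplier` and its fields are hypothetical and are not specified by the embodiment.

```python
class OrderedApplier:
    """Applies transactions strictly in sequence order, each at most once."""

    def __init__(self):
        self.db = {}
        self.last_seq = 0   # highest sequence number already applied
        self._pending = {}  # out-of-order transactions awaiting predecessors

    def _apply(self, txn):
        if txn["op"] == "delete":
            self.db.pop(txn["feature_id"], None)
        else:
            self.db[txn["feature_id"]] = txn["feature"]

    def receive(self, seq, txn):
        if seq <= self.last_seq or seq in self._pending:
            return  # duplicate delivery: ignore (idempotency)
        self._pending[seq] = txn
        # Drain every transaction whose predecessors have all been applied.
        while self.last_seq + 1 in self._pending:
            self.last_seq += 1
            self._apply(self._pending.pop(self.last_seq))
```

With this sketch, a modification that arrives before the addition it depends on is held back until the addition has been applied.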

Fig. 9 is a schematic diagram of an application, taking a picture content audit scene as an example. When a new retrieval node joins the feature retrieval cluster, the features in the feature management database are synchronized to the feature retrieval database associated with that retrieval node in the initial synchronization mode provided by the embodiment of the application. The administrator client of the picture content auditing platform initiates a write operation request based on an uploaded picture corresponding to identified illegal content, so that the feature management cluster performs feature extraction on the picture and writes the extracted features into the feature management database; after the features are successfully stored, each retrieval node of the feature retrieval cluster is synchronously updated through the write diffusion and transaction outbox mechanism provided by the embodiment of the application. Furthermore, when a user client uploads a picture whose content needs to be audited, for example to check whether it contains illegal articles, persons or human faces, the feature retrieval cluster receives a corresponding feature retrieval request, calls the feature extraction cluster to extract the features of the picture to be audited, performs retrieval in the feature retrieval database, and outputs a retrieval result that helps determine whether the audit is passed; the result is returned to the user client.

To sum up, the embodiment of the present application provides a distributed retrieval data synchronization scheme combining write diffusion and a transaction outbox mechanism. The scheme can be applied to public cloud picture content audit scenes based on CV technology to provide more real-time, highly available and highly concurrent retrieval, so that a more reasonable and safe operation strategy can be formulated according to audit results retrieved in time. It can also be applied to other similar retrieval scenes, such as feature retrieval services for other types of features like sound and fingerprints, and is applicable to hardware platforms including PCs, servers and the like, thereby greatly improving the availability of the audit retrieval system. By introducing a read-write-separated CQRS architecture, stable operation of the system under high concurrency is largely ensured. With the dynamic synchronization mechanism combining write diffusion and the transaction outbox mechanism, the success rate of real-time write diffusion updates is more than 99% and the synchronization delay is low, which ensures the real-time performance of retrieval; as a fallback for write diffusion failures, the transaction outbox, relying on the high-guarantee transaction characteristics of the database and the persistence of data in the message queue, achieves eventual consistency of the retrieval data of the two clusters.

Referring to fig. 10, based on the same inventive concept, an embodiment of the present application further provides a search data synchronization apparatus 100, which is applied to a feature management cluster included in a search system, where the search system further includes a feature search cluster composed of at least one search node; the device includes:

a feature management unit 1001, configured to perform corresponding write operation on an associated feature management database in response to a write operation request sent by a client, and record corresponding operation transaction information in an operation log of the feature management database;

a write diffusion unit 1002, configured to initiate a write diffusion request to each of the at least one retrieval node, where the write diffusion request carries feature information corresponding to the write operation, so that each retrieval node synchronously updates its associated feature retrieval database based on the feature information; and

a transaction pushing unit 1003, configured to, when each search node subscribes to a write operation transaction of the feature management database, send operation transaction information to each search node, respectively, so that each search node performs a synchronization operation on the feature search database based on the operation transaction information when the synchronization update performed by each search node is unsuccessful.

Optionally, the apparatus further comprises a configuration unit 1004, configured to:

receiving a subscription request sent by at least one retrieval node, wherein the subscription request is used for requesting to subscribe the write operation transaction of the characteristic management database;

and configuring the subscription of each retrieval node to the transaction message queue in response to each subscription request.

Optionally, the transaction pushing unit 1003 is specifically configured to:

acquiring operation transaction information from the operation log through a message relay module, and writing the operation transaction information into a transaction message queue;

and acquiring operation transaction information from the transaction message queue based on the subscription of each retrieval node to the transaction message queue, and respectively sending the operation transaction information to each retrieval node.

Optionally, the apparatus further includes an initial synchronization unit 1005 configured to:

if the target synchronization mode is a synchronization mode based on the operation log, sending the operation log to the newly added retrieval node;

if the target synchronization mode is a synchronization mode based on the database snapshot, determining whether the resource state of the device meets the resource requirement of that synchronization mode;

if yes, executing a snapshot persistence operation to generate a database snapshot of the feature management database, and sending the database snapshot to the newly added retrieval node;

and if not, modifying the target synchronization mode into a synchronization mode based on the operation log, and sending the operation log to the newly added retrieval node.

Optionally, the initial synchronization unit 1005 is further configured to:

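in the process of executing the snapshot persistence operation, if a write operation request sent by a client is received, caching the received write operation request;

and after the database snapshot is sent to the newly added retrieval node, sending the cached write operation request to the newly added retrieval node, so that the newly added retrieval node performs the corresponding write operation based on the received write operation request.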

Optionally, the feature management unit 1001 is specifically configured to:

if the target operation type is the characteristic deleting operation, deleting the characteristic indicated by the writing operation request from the characteristic management database;

if the target operation type is the feature modification operation, extracting features of the content carried by the write operation request, and updating corresponding features in a feature management database according to the obtained features;

and if the target operation type is the feature adding operation, extracting features of the content carried by the write operation request, and adding the obtained features into a feature management database.

The device may be configured to execute the method executed by the feature management cluster in each embodiment of the present application, and therefore, for functions and the like that can be realized by each functional module of the device, reference may be made to the description of the foregoing embodiment, which is not repeated here.

Referring to fig. 11, based on the same inventive concept, an embodiment of the present application further provides a search data synchronization apparatus 110, which is applied to any search node included in a feature search cluster included in a search system, where the search system further includes a feature management cluster; the device includes:

a receiving unit 1101, configured to receive a write diffusion request sent by a feature management cluster, where the write diffusion request carries feature information corresponding to a write operation of the feature management cluster on an associated feature management database, and the write operation is triggered by a write operation request sent by a client;

a write diffusion execution unit 1102, configured to perform a synchronous update on the associated feature retrieval database based on the feature information; and

the receiving unit 1101 is further configured to receive operation transaction information sent by the feature management cluster, where the operation transaction information records transaction information corresponding to a write operation;

and a transactional synchronization unit 1103, configured to, if it is determined that the synchronization update is not successful, perform a synchronization operation on the feature search database based on the operation transaction information.

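Optionally, the write diffusion execution unit 1102 is further configured to:

update, based on the result of the synchronous update, the task state of the write operation request at the retrieval node;

the transactional synchronization unit 1103 is specifically configured to:

determine, when the task state indicates that the write operation request is not completed at the retrieval node, that the synchronous update is unsuccessful.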

Optionally, the apparatus further includes an initial synchronization unit 1104 configured to:

in response to an indication that the retrieval node joins the feature retrieval cluster, initiating an initial synchronization request to the feature management cluster, wherein the initial synchronization request is used for instructing the feature management cluster to synchronize the full set of features in the feature management database to the feature retrieval database according to a target synchronization mode;

receiving an operation log returned by the feature management cluster, and sequentially performing synchronization operations on the feature retrieval database according to the operation transaction information recorded in the operation log; or,

receiving a database snapshot returned by the feature management cluster, and loading the database snapshot into the feature retrieval database of the retrieval node.

Optionally, the search system further includes a feature extraction cluster; the initial synchronization unit 1104 is specifically configured to:

determining whether the resource utilization rate of the feature extraction cluster is greater than a preset threshold value;

if the resource utilization rate is larger than a preset threshold value, determining that the target synchronization mode is a database snapshot-based synchronization mode;

if the resource utilization rate is not larger than the preset threshold value, determining that the target synchronization mode is a synchronization mode based on the operation log;

and initiating an initial synchronization request to the feature management cluster based on the determined target synchronization mode.
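The selection of the target synchronization mode by the retrieval node may be sketched as follows. The function name and the 0.7 threshold are hypothetical. One plausible reading, offered here only as an assumption, is that replaying the operation log may involve re-extracting features, so a busy feature extraction cluster favors the snapshot mode.

```python
def choose_target_sync_mode(extraction_utilization: float,
                            threshold: float = 0.7) -> str:
    """Pick the target synchronization mode for the initial synchronization."""
    if extraction_utilization > threshold:
        # Utilization above the preset threshold: database snapshot mode.
        return "snapshot"
    # Otherwise: operation log mode.
    return "oplog"
```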

Optionally, the initial synchronization unit 1104 is specifically configured to:

carrying out validity check on each write operation request, and filtering the write operation requests which cannot pass the check;

The device may be configured to execute a method executed by a retrieval node in a feature retrieval cluster in each embodiment of the present application, and therefore, for functions and the like that can be realized by each functional module of the device, reference may be made to the description of the foregoing embodiment, which is not repeated herein.

Referring to fig. 12, based on the same technical concept, an embodiment of the present application further provides a computer device. In one embodiment, the computer device may be a node in the feature management cluster or a retrieval node of the feature retrieval cluster shown in fig. 1, and as shown in fig. 12, the computer device includes a memory 1201, a communication module 1203, and one or more processors 1202.

A memory 1201 for storing computer programs executed by the processor 1202. The memory 1201 may mainly include a storage program area and a storage data area, where the storage program area may store an operating system, a program required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.

The memory 1201 may be a volatile memory, such as a random-access memory (RAM); the memory 1201 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 1201 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 1201 may also be a combination of the above memories.

The processor 1202 may include one or more Central Processing Units (CPUs), a digital processing unit, and the like. The processor 1202 is configured to implement the above-described retrieved data synchronization method when calling a computer program stored in the memory 1201.

The communication module 1203 is used for communicating with the terminal device and other servers.

In the embodiment of the present application, the specific connection medium between the memory 1201, the communication module 1203 and the processor 1202 is not limited. In fig. 12, the memory 1201 and the processor 1202 are connected by a bus 1204, which is depicted by a thick line; the connection manner between the other components is merely illustrative and not limiting. The bus 1204 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is depicted in fig. 12, but this does not mean that there is only one bus or only one type of bus.

The memory 1201 stores a computer storage medium, and the computer storage medium stores computer-executable instructions for implementing the retrieval data synchronization method according to the embodiments of the present application. The processor 1202 is configured to execute the retrieval data synchronization method of the above embodiments.
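As a rough illustration of the kind of method such instructions implement, the following Python sketch models the management-side flow summarized in the abstract: a write is applied to the feature management database, recorded as operation transaction information, and pushed to every retrieval node by write diffusion, while a node whose update fails later catches up from the recorded transactions. All names (`WriteOp`, `RetrievalNode`, `FeatureManager`, `catch_up`), the sequence-number check, and the in-memory dictionaries standing in for databases are assumptions made for illustration, not details taken from the application.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WriteOp:
    """One write-operation transaction (hypothetical record layout)."""
    seq: int            # position in the operation log, starting at 1
    kind: str           # "add" or "delete"
    feature_id: str
    vector: tuple = ()


@dataclass
class RetrievalNode:
    """Stand-in for a retrieval node and its feature retrieval database."""
    name: str
    healthy: bool = True
    db: dict = field(default_factory=dict)
    applied_seq: int = 0    # highest log sequence number applied so far

    def apply(self, op: WriteOp) -> None:
        if op.kind == "add":
            self.db[op.feature_id] = op.vector
        elif op.kind == "delete":
            self.db.pop(op.feature_id, None)
        else:
            raise ValueError(f"unknown operation type: {op.kind}")
        self.applied_seq = op.seq

    def on_write_diffusion(self, op: WriteOp) -> bool:
        """Handle a write diffusion request; False means the update failed."""
        # Refuse out-of-order updates so that a gap is never silently skipped.
        if not self.healthy or op.seq != self.applied_seq + 1:
            return False
        self.apply(op)
        return True

    def catch_up(self, log: list) -> None:
        """Fallback path: replay the subscribed transactions that were missed."""
        for op in log:
            if op.seq > self.applied_seq:
                self.apply(op)


@dataclass
class FeatureManager:
    """Stand-in for the feature management cluster and its database."""
    nodes: list
    db: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def write(self, kind: str, feature_id: str, vector=()) -> list:
        """Apply a write, log it, diffuse it; return names of nodes that failed."""
        if kind not in ("add", "delete"):
            raise ValueError(f"unknown operation type: {kind}")
        op = WriteOp(len(self.log) + 1, kind, feature_id, tuple(vector))
        if kind == "add":
            self.db[feature_id] = op.vector
        else:
            self.db.pop(feature_id, None)
        self.log.append(op)     # operation transaction information
        return [n.name for n in self.nodes if not n.on_write_diffusion(op)]
```

In this sketch a node that reports failure stays behind until it calls `catch_up` with the log, after which it can accept write diffusion requests again.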

In some possible embodiments, aspects of the retrieval data synchronization method provided by the present application may also be implemented in the form of a program product that includes program code. When the program product runs on a computer device, the program code causes the computer device to perform the steps of the retrieval data synchronization method according to the various exemplary embodiments of the present application described above in this specification; for example, the computer device may perform the steps of the embodiments.
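One step that such program code could carry out on a newly joined retrieval node, as recited in the claims, is initial synchronization: choosing a snapshot-based mode when a resource utilization rate exceeds a preset threshold, or otherwise replaying the operation log after removing duplicate write requests. The Python sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the log-replay branch as the alternative to the snapshot mode is assumed, and "deduplication" is taken to mean keeping only the last write for each feature ID.

```python
SNAPSHOT = "snapshot"
OPERATION_LOG = "operation_log"


def choose_sync_mode(resource_utilization: float, threshold: float) -> str:
    """Pick the snapshot-based mode above the threshold.

    Falling back to operation-log replay otherwise is an assumption
    of this sketch.
    """
    return SNAPSHOT if resource_utilization > threshold else OPERATION_LOG


def deduplicate(log: list) -> list:
    """Keep only the last write per feature ID, preserving log order.

    Each log entry is a (kind, feature_id, vector) tuple. The
    last-write-wins rule is an assumed reading of "deduplication".
    """
    last_index = {}
    for index, (_kind, feature_id, _vector) in enumerate(log):
        last_index[feature_id] = index
    return [entry for index, entry in enumerate(log)
            if last_index[entry[1]] == index]


def replay(log: list, db: dict = None) -> dict:
    """Apply the deduplicated writes to a feature retrieval database."""
    db = {} if db is None else db
    for kind, feature_id, vector in deduplicate(log):
        if kind == "add":
            db[feature_id] = vector
        elif kind == "delete":
            db.pop(feature_id, None)
        else:
            raise ValueError(f"unknown operation type: {kind}")
    return db
```

Because each write either sets or removes a single feature, the final state depends only on the last write per feature ID, so replaying the deduplicated log yields the same database as replaying every entry.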

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

The program product of embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a computing device. However, the program product of the present application is not limited thereto; in the context of the present application, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electromagnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user computing device, partly on the user computing device, as a stand-alone software package, partly on the user computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).

It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more of the units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided so as to be embodied by a plurality of units.

Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.

While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims

1. A retrieval data synchronization method, applied to a feature management cluster included in a retrieval system, the retrieval system further comprising a feature retrieval cluster composed of at least one retrieval node; the method comprising:

respectively initiating a write diffusion request to the at least one retrieval node, wherein the write diffusion request carries feature information corresponding to the write operation, so that each retrieval node synchronously updates its associated feature retrieval database based on the feature information; and

when each retrieval node subscribes to the write operation transactions of the feature management database, respectively sending the operation transaction information to each retrieval node, so that any retrieval node that fails to perform the synchronous update synchronizes with the feature management database based on the operation transaction information.

2. The method of claim 1, wherein before the operation transaction information is sent to each retrieval node when each retrieval node subscribes to a write operation transaction of the feature management database, the method further comprises:

configuring a transaction outbox corresponding to the management database, wherein the transaction outbox comprises a message relay module and a transaction message queue, and the message relay module subscribes to updates of the operation log;

3. The method of claim 2, wherein the sending the operation transaction information to each retrieval node when each retrieval node subscribes to a write operation transaction of the feature management database comprises:

4. The method of claim 1, wherein the method further comprises:

5. The method of claim 4, wherein the method further comprises:

6. The method of any one of claims 1 to 5, wherein in response to a write operation request sent by a client, performing a corresponding write operation on a feature management database corresponding to the client comprises:

if the target operation type is a feature deletion operation, deleting the feature indicated by the write operation request from the feature management database;

7. A retrieval data synchronization method, applied to any retrieval node included in a feature retrieval cluster included in a retrieval system, the retrieval system further comprising a feature management cluster; the method comprising:

8. The method of claim 7, wherein after synchronously updating the associated feature retrieval database based on the feature information, the method further comprises:

determining that the synchronization update was unsuccessful, comprising:

9. The method of claim 7, wherein prior to receiving a write diffusion request sent by the feature management cluster, the method further comprises:

10. The method of claim 9, wherein the retrieval system further comprises a feature extraction cluster; and the initiating an initial synchronization request to the feature management cluster in response to an indication that the current retrieval node joins the feature retrieval cluster comprises:

if the resource utilization rate is determined to be greater than the preset threshold value, determining that a target synchronization mode is a database snapshot-based synchronization mode;

11. The method of claim 9, wherein the receiving the operation log returned by the feature management cluster, and sequentially performing a synchronization operation on the feature retrieval database according to the operation transaction information recorded by the operation log comprises:

and executing the write operation indicated by each write operation request remaining after deduplication, so as to synchronize the full set of features in the feature management database to the feature retrieval database.

12. A retrieval data synchronization device, applied to a feature management cluster included in a retrieval system, the retrieval system further comprising a feature retrieval cluster composed of at least one retrieval node; the device comprising:

13. A retrieval data synchronization device, applied to any retrieval node included in a feature retrieval cluster included in a retrieval system, the retrieval system further comprising a feature management cluster; the device comprising:

14. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor,

the processor when executing the computer program realizes the steps of the method of any one of claims 1 to 6 or 7 to 11.

15. A computer storage medium having computer program instructions stored thereon, wherein,

the computer program instructions, when executed by a processor, implement the steps of the method of any one of claims 1 to 6 or 7 to 11.

16. A computer program product comprising computer program instructions, characterized in that,