CN111143343B

CN111143343B - Efficient data deletion method and system based on source-end deduplication

Info

Publication number: CN111143343B
Application number: CN201911374951.4A
Authority: CN
Inventors: 周建华; 张有成; 姚崎; 丁红; 李海鹏; 许萍萍
Original assignee: Aerospace One System Jiangsu Information Technology Co ltd
Current assignee: Aerospace One System Jiangsu Information Technology Co ltd
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2023-12-15
Anticipated expiration: 2039-12-27
Also published as: CN111143343A

Abstract

The application discloses an efficient data deletion method based on source-end deduplication. During backup, the data stream of the source end is split into data blocks, and a fingerprint is computed for each block and compared against the existing fingerprints; if a fingerprint does not exist, the block is new, so the corresponding data block is transmitted to a container on the server end for storage and that container is marked 1. Once a container is full, it is written into a data file and a new container is created. When a backup set expires it is cleaned automatically, and its guid object records are cleaned. During idle time outside the normal service window, preset loop deletion logic cleans the data blocks and fingerprints of containers marked 0, where a mark of 0 indicates that the data blocks and fingerprints in the container are not referenced and can be cleaned. The advantages are that the application uses marking, so the statistical logic is simpler, the cleaning logic is not affected by the size of the deduplication library, and the method is more efficient.

Description

Efficient data deletion method and system based on source-end deduplication

Technical Field

The application relates to a method and a system for efficiently deleting data based on source end deduplication, and belongs to the technical field of data protection.

Background

Source-end deduplication is widely used in data protection products because it reduces transmission bandwidth and storage space. For convenience, the following convention is used: deduplicated source-end data is stored in a deduplication library, which comprises a deduplication fingerprint library and a deduplication database. The fingerprint library stores the index information of the data blocks, and the deduplication database stores the data blocks themselves. Data deduplicated at the source end has the following characteristic: each data block stored in the deduplication database is unique within the whole library, and most data blocks are shared by multiple data sources; it is this sharing that reduces storage space. The same characteristic makes deletion considerably more complex, because data in the deduplication database cannot be cleaned as easily as ordinary data.

Existing methods take one of two approaches. The first records a reference count for each data block: the count is incremented for each repeated block during backup and decremented for each contained block during deletion, and a block whose count reaches 0 can be cleaned and its storage space released. As the deduplication library grows, this approach has a large impact on backup and deletion performance. The second approach is centralized cleaning, executed at a specific point in time: all data files still in use are marked, and because data files are much coarser than data blocks the statistics are fast; the unused data files and their fingerprints are then deleted to release space. Its drawback is the coarse granularity, so the space actually released is limited.
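For comparison, the first prior-art approach (per-block reference counting) can be sketched roughly as follows. This is an illustration only, not taken from any product: the class `RefCountStore`, its methods, and the use of SHA-256 as the fingerprint are all assumptions. The point it shows is that every backup and every deletion has to touch one counter per block.

```python
import hashlib
from collections import Counter


class RefCountStore:
    """Hypothetical per-block reference counting (the first prior-art scheme)."""

    def __init__(self):
        self.blocks = {}           # fingerprint -> block bytes (each stored once)
        self.refcount = Counter()  # fingerprint -> number of references

    def backup(self, blocks):
        """Store one backup and return its index (list of fingerprints)."""
        index = []
        for block in blocks:
            fp = hashlib.sha256(block).hexdigest()  # assumed fingerprint function
            self.blocks.setdefault(fp, block)       # unique blocks are stored once
            self.refcount[fp] += 1                  # one counter update per block
            index.append(fp)
        return index

    def delete_backup(self, index):
        """Drop one backup; free every block whose count reaches 0."""
        for fp in index:
            self.refcount[fp] -= 1
            if self.refcount[fp] == 0:
                del self.refcount[fp]
                del self.blocks[fp]
```

Both loops run over every block of the backup, which is why the cost of this scheme grows with the amount of data backed up and deleted.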

Disclosure of Invention

The application aims to overcome the shortcomings of existing source-end deduplication technology, in which the uniqueness of deduplicated data makes deletion logically complex and inefficient and prevents space from being released quickly, and provides an efficient data deletion method and system based on source-end deduplication.

To solve the above technical problems, the application provides an efficient data deletion method based on source-end deduplication, characterized in that, during backup, the data stream of the source end is split into data blocks, fingerprints are computed and compared, and if a fingerprint does not exist, indicating a new block, the corresponding data block is transmitted to a container on the server end for storage and the corresponding container is marked 1; after a container is full it is written into a data file and a new container is created, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of fixed-size data files, and each data file comprises a plurality of containers;

when a backup set expires it is cleaned automatically, and its guid object records are cleaned;

and, during idle time outside the normal service window, the data blocks and fingerprints of containers marked 0 are cleaned using preset loop deletion logic, wherein a container marked 0 is one whose data blocks and fingerprints are not referenced and can be cleaned.

Further, the container is fixed in size.

Further, the marking process of each container is as follows:

determining a backup set, wherein the backup set comprises an object library and a deduplication library, the object library stores object files, the object files store object records and index data of the objects, the deduplication library stores data files, and the data files store information of each data block contained in the objects;

and acquiring the referenced object file, reading index data in the object file according to the unique identifier of the object, finding a corresponding container according to the fingerprint in the index data, and marking the corresponding container record with a mark 1.
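The marking process can be illustrated with a minimal sketch. The function name `mark_referenced_containers` and the in-memory dictionaries standing in for the object library, the fingerprint library and the container records are hypothetical; the source does not specify any on-disk layout.

```python
def mark_referenced_containers(referenced_guids, object_files,
                               fingerprint_to_container, container_marks):
    """Mark with 1 every container that holds a block of a referenced object.

    referenced_guids         -- unique identifiers of the objects still referenced
    object_files             -- guid -> index data (list of fingerprints)
    fingerprint_to_container -- fingerprint -> container id
    container_marks          -- container id -> mark, updated in place
    """
    for guid in referenced_guids:
        index = object_files[guid]                  # read the object's index data
        for fp in index:
            cid = fingerprint_to_container.get(fp)  # find the container by fingerprint
            if cid is not None:
                container_marks[cid] = 1            # the container is still in use
    return container_marks
```

For example, with `object_files = {"g1": ["fa", "fb"]}` and both fingerprints stored in container 0, only container 0 receives the mark 1.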

Further, the loop delete logic is to:

S1, during backup, for each referenced data block, marking the container in which the data block is located as 1, and marking the corresponding object record as 1 to indicate that it has been checked;

S2, traversing the object records and finding the objects marked 0, where a mark of 0 indicates that the object record has not yet been checked; locating, according to the index information recorded in the object file, the container in the deduplication library that stores each corresponding data block, and marking the container corresponding to each fingerprint as 1;

S3, traversing the container records in the deduplication library, cleaning the data blocks and fingerprints in each container marked 0, and then marking the container state as 2 to indicate that the container has been cleaned and can be reused;

S4, resetting to 0 the container records in the deduplication library that are marked 1, and setting the marks of all object records in the object library to 0;

S5, collecting all containers marked 2 in the deduplication library, and preferentially reusing the collected containers when new data needs to be stored;

S6, executing steps S1-S5 cyclically at a set period.
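One pass of steps S2 to S5 can be sketched as below, assuming that S1 has already set marks during backup. The `Container` and `ObjectRecord` classes, the `cleanup_cycle` function and the in-memory `fingerprint_db` dictionary are hypothetical stand-ins; only the mark values 0, 1 and 2 come from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    blocks: dict = field(default_factory=dict)  # fingerprint -> block bytes
    mark: int = 0                               # 0 unreferenced, 1 in use, 2 cleaned


@dataclass
class ObjectRecord:
    index: list    # fingerprints of the blocks that make up the object
    mark: int = 0  # 1 = already checked during backup, 0 = not yet checked


def cleanup_cycle(objects, containers, fingerprint_db):
    """Run S2-S5 once and return the ids of the reusable containers."""
    # S2: objects not checked during backup still keep their containers alive.
    for obj in objects.values():
        if obj.mark == 0:
            for fp in obj.index:
                containers[fingerprint_db[fp]].mark = 1
    # S3: clean every container nobody marked, then flag it as reusable.
    for container in containers.values():
        if container.mark == 0:
            for fp in container.blocks:
                fingerprint_db.pop(fp, None)  # drop the fingerprint record
            container.blocks.clear()          # drop the data blocks
            container.mark = 2
    # S4: reset the marks for the next period.
    for container in containers.values():
        if container.mark == 1:
            container.mark = 0
    for obj in objects.values():
        obj.mark = 0
    # S5: collect the cleaned containers for reuse.
    return [cid for cid, c in containers.items() if c.mark == 2]
```

Expired objects are simply absent from `objects`, so their containers receive no mark in S1 or S2 and are cleaned in S3. S6 would call `cleanup_cycle` on a timer.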

An efficient data deletion system based on source-end deduplication comprises a container determining module, a backup set cleaning module and a deleting module;

the container determining module is used for splitting the data stream of the source end into data blocks during backup, computing and comparing fingerprints, and, if a fingerprint does not exist, indicating a new block, transmitting the corresponding data block to a container on the server end for storage and marking the corresponding container as 1; after a container is full it is written into a data file and a new container is created, wherein a container comprises a plurality of data blocks, the deduplication library comprises a plurality of fixed-size data files, and each data file comprises a plurality of containers;

the backup set cleaning module is used for automatically cleaning the backup set after the backup set expires, and simultaneously deleting the guid object records;

and the deleting module is used for cleaning the data block and the fingerprint of the container marked as 0 by utilizing a preset loop deleting logic in idle time outside the normal service window period, wherein the container marked as 0 indicates that the data block and the fingerprint in the container are not referenced and can be cleaned.

Further, the size of the container determined by the container determining module is fixed.

Further, the container determining module comprises a backup set determining module and a container marking module;

the backup set determining module is used for determining a backup set, the backup set comprises an object library and a deduplication library, the object library stores object files, the object files store object records and index data of objects, the deduplication library stores data files, and the data files store information of each data block contained in the objects;

the container marking module is used for acquiring the referenced object file, reading index data in the object file according to the unique identifier of the object, finding a corresponding container according to the fingerprint in the index data, and marking the corresponding container record with a mark 1.

Further, the cleaning module comprises a backup module, a first traversing module, a second traversing module, an initializing module, a collecting module and a circulating module;

the backup module is used for marking a container where a corresponding data block is located as 1 for the referenced data block in the backup process, and marking a corresponding object record as 1 to represent that the referenced data block is checked;

the first traversing module is used for traversing the object records and finding the objects marked 0, where a mark of 0 indicates that the object record has not yet been checked, locating, according to the index information recorded in the object file, the container in the deduplication library that stores each corresponding data block, and marking the container corresponding to each fingerprint as 1;

the second traversing module is used for traversing the container records in the deduplication library, cleaning the data blocks and fingerprints in each container marked 0, and then marking the container state as 2 to indicate that the container has been cleaned and can be reused;

the initialization module is used for resetting to 0 the container records in the deduplication library that are marked 1, and setting the marks of all object records in the object library to 0;

the collection module is used for collecting all containers marked as 2 in the deduplication library, and the collected containers are preferentially selected to be reused when new data needs to be stored;

the circulation module is used for circularly executing the processes of the backup module, the first traversing module, the second traversing module, the initializing module and the collecting module in a set period.

The application has the beneficial effects that:

the application can periodically execute cleaning in the background without affecting normal backup and recovery services, and the cleaned space can be reused, which in effect releases space. The application uses marking, so the statistical logic is simpler, the cleaning logic is not affected by the size of the deduplication library, and the method is more efficient.

Drawings

FIG. 1 is a schematic flow diagram of marking the containers that are still being referenced;

FIG. 2 is a schematic flow diagram of cleaning containers.

Detailed Description

The application is further described below with reference to the accompanying drawings. The following examples are only for more clearly illustrating the technical aspects of the present application, and are not intended to limit the scope of the present application.

The deletion logic of the application uses container marks to distinguish containers whose data blocks and fingerprints can be cleaned from containers that are still in use, so the key is how to mark the containers in the deduplication library quickly. Deleting fingerprints is simple: the container to be deleted is read, the fingerprints stored in it are parsed, and the corresponding records are deleted from the fingerprint library. If marking were done at each cleanup, the fingerprints in the index files of all objects in the object library would have to be looked up once to determine which containers are still in use, which takes a long time. The application instead moves this stage forward into the backup stage: fingerprints are compared during backup anyway, so it is only necessary to mark the container that is hit at that moment, which has little impact on the backup logic. Because a backup set is valid only for a limited time, only a small number of objects need to be looked up at cleanup time to determine which containers can be cleaned.

The containers of the application have a relatively fine granularity. A container is a logical concept for managing a collection of data blocks; containers are of a fixed size (every container is the same size), and a data file may contain multiple containers. Because containers are of a fixed size, a cleaned container can be reused to store new data. In addition, container-level cleaning can be performed during the idle time of normal business and has almost no impact on it.
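Why a fixed size makes reuse straightforward can be shown with a small allocator sketch. The `ContainerPool` class is hypothetical, and the assumption that containers are laid out consecutively inside each data file is made only for this illustration.

```python
class ContainerPool:
    """Hypothetical allocator for fixed-size containers inside fixed-size data files."""

    def __init__(self, container_size, containers_per_file):
        self.container_size = container_size
        self.containers_per_file = containers_per_file
        self.marks = []  # marks[cid]: 0 unreferenced, 1 in use, 2 cleaned
        self.free = []   # ids of cleaned containers awaiting reuse

    def release(self, cid):
        """Record that container `cid` has been cleaned (mark 2)."""
        self.marks[cid] = 2
        self.free.append(cid)

    def allocate(self):
        """Return a container id for new data, preferring cleaned containers."""
        if self.free:
            cid = self.free.pop()
        else:
            cid = len(self.marks)  # no cleaned container: grow the library
            self.marks.append(0)
        self.marks[cid] = 1
        return cid

    def location(self, cid):
        """Map a container id to (data file number, byte offset in that file)."""
        file_no, slot = divmod(cid, self.containers_per_file)
        return file_no, slot * self.container_size
```

Since every slot has the same size, any cleaned slot can hold any new container, so no compaction or best-fit search is needed.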

The deduplication library comprises a plurality of fixed-size data files, each data file comprises a plurality of fixed-size containers, and each container comprises a plurality of data blocks. During backup, the data stream of the source end is split into data blocks, fingerprints are computed and compared, and if a fingerprint does not exist, indicating a new block, the corresponding data block is transmitted to a container on the server end for storage and the corresponding container is marked 1; after the container is full it is written into the data file and a new container is created.
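The backup path can be sketched as follows. Several simplifications are assumptions of this sketch rather than statements of the source: fixed-length chunking, SHA-256 fingerprints, a container capacity counted in blocks, and the source end and server end collapsed into one hypothetical `DedupStore` class.

```python
import hashlib


class DedupStore:
    """Hypothetical in-memory deduplication library with container marks."""

    def __init__(self, blocks_per_container=4):
        self.blocks_per_container = blocks_per_container
        self.fingerprints = {}  # fingerprint -> container id
        self.data_file = {}     # container id -> blocks of a full (written) container
        self.marks = {}         # container id -> mark
        self.open_id = 0        # id of the container currently being filled
        self.open_blocks = []

    def _store_new_block(self, fp, block):
        self.open_blocks.append(block)
        self.fingerprints[fp] = self.open_id
        self.marks[self.open_id] = 1
        if len(self.open_blocks) == self.blocks_per_container:
            # The container is full: write it to the data file, open a new one.
            self.data_file[self.open_id] = self.open_blocks
            self.open_id += 1
            self.open_blocks = []

    def backup(self, stream, block_size=4):
        """Split `stream` into blocks, store only new ones, return the index."""
        index = []
        for start in range(0, len(stream), block_size):
            block = stream[start:start + block_size]
            fp = hashlib.sha256(block).hexdigest()
            cid = self.fingerprints.get(fp)
            if cid is None:
                self._store_new_block(fp, block)  # fingerprint absent: new block
            else:
                self.marks[cid] = 1               # hit: the container is in use
            index.append(fp)
        return index
```

In a real deployment only the new blocks would cross the network, since the fingerprint comparison happens before transmission.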

Containers are used as the unit because the deduplication library holds fingerprints in container-sized batches in order to exploit the locality of the data. The principle of locality is that if a data block is used, its neighboring data blocks are also likely to be used. Using the container as the update unit of the cache therefore effectively improves the cache hit rate. The delete function can exploit the same principle: if a block needs to be cleaned, its neighboring data blocks will very likely need to be cleaned as well.
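The idea of using the container as the cache update unit can be illustrated with a small sketch. The `ContainerFingerprintCache` class, its least-recently-used eviction policy and its counters are assumptions made for this illustration; the source only states that the cache is updated one container at a time.

```python
from collections import OrderedDict


class ContainerFingerprintCache:
    """Hypothetical fingerprint cache that loads one whole container per miss."""

    def __init__(self, fingerprint_db, container_fingerprints, capacity=2):
        self.fingerprint_db = fingerprint_db                  # fp -> container id
        self.container_fingerprints = container_fingerprints  # container id -> set of fps
        self.capacity = capacity                              # cached containers at most
        self.cached = OrderedDict()                           # container id -> set of fps
        self.hits = 0
        self.index_lookups = 0

    def lookup(self, fp):
        """Return the container id holding `fp`, or None if the block is new."""
        cid = next((c for c, fps in self.cached.items() if fp in fps), None)
        if cid is not None:
            self.cached.move_to_end(cid)  # keep recently used containers
            self.hits += 1
            return cid
        self.index_lookups += 1           # fall back to the full fingerprint library
        cid = self.fingerprint_db.get(fp)
        if cid is None:
            return None
        # Load every fingerprint of that container, not just the one requested.
        self.cached[cid] = self.container_fingerprints[cid]
        if len(self.cached) > self.capacity:
            self.cached.popitem(last=False)  # evict the least recently used container
        return cid
```

After one miss on a container, lookups for its neighboring blocks are served from the cache, which is the locality effect described above.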

The backup set is stored in the background in the form of objects and consists mainly of two parts: an object library, whose object files store the object records and the index data of the objects, and a deduplication library, whose data files store the information of each data block contained in the objects. Reading a backup set first accesses the object library and then finds the corresponding data blocks in the deduplication library according to the index information recorded in the object file.
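The read path can be sketched in a few lines. The function `read_object` and the dictionaries used for the object library, the fingerprint library and the containers are hypothetical stand-ins for the on-disk structures.

```python
def read_object(guid, object_library, fingerprint_db, containers):
    """Rebuild an object's bytes from its index data.

    object_library -- guid -> index data (ordered list of fingerprints)
    fingerprint_db -- fingerprint -> container id
    containers     -- container id -> {fingerprint: block bytes}
    """
    index = object_library[guid]         # 1. access the object library first
    data = bytearray()
    for fp in index:
        cid = fingerprint_db[fp]         # 2. locate the container by fingerprint
        data += containers[cid][fp]      # 3. fetch the block from the deduplication library
    return bytes(data)
```

A fingerprint may appear more than once in an index; the block is stored once and read each time it is referenced.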

As shown in fig. 1, a referenced object file is acquired, index data of objects in the object file is read according to guid (unique object identifier), a corresponding container is found according to fingerprints in the index data, and a corresponding container record is marked with a mark 1.

The method specifically comprises the following steps:

1. During the backup process, for each referenced data block, the container in which the data block is located is marked 1, indicating that the container holds a data block that is in use. The corresponding object record is also marked 1, indicating that it has been checked.

2. Traversing the object records, finding out the objects marked as 0, finding out the position of the container stored by the corresponding data block in the deduplication library according to the index information of the records in the object file, and marking the container corresponding to the fingerprint.

3. Traversing the container records in the deduplication library; a container marked 0 has no referenced fingerprints and can be cleaned, so the fingerprints in the container are cleaned and the container state is then marked 2, indicating that the container has been cleaned and can be reused.

4. Initializing the flag bits: the container records in the deduplication library that are marked 1 are set to 0, and the marks of all object records in the object library are set to 0.

5. All containers marked 2 in the deduplication library are collected, and the collected containers are preferably selected for reuse when new data needs to be stored.

As shown in fig. 2, the steps of cleaning using the idle time outside the normal service window period are:

Collection of the ids of the containers to be cleaned begins, and whether a container needs cleaning is judged from its mark. If cleaning is needed, it is judged whether the service is idle; if not, the process waits 1 s and judges again. Once the service is idle, the write cache is cleaned and its second-level cache is closed. Idleness is then judged again, waiting 1 s between checks until the service is idle. One container id is taken out and the container state is checked again; if cleaning is still needed, the cleaning is executed and the mark is cleared. Idleness is judged once more, again waiting 1 s between checks; when the service is idle, the second-level cache in the write cache is reopened, a bloom filter is initialized, and the process ends.
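The idle-time flow of fig. 2 can be approximated by the sketch below. The function names, the injected `is_idle`, `clean_container` and `sleep` callables, the loop over all collected ids, and the choice to set the mark to 2 after cleaning (following step 3 above) are assumptions; the cache and bloom filter steps are indicated only as comments because the source gives no detail on them.

```python
import time


def wait_until_idle(is_idle, sleep=time.sleep, interval=1.0):
    """Poll `is_idle`, sleeping `interval` seconds between checks."""
    while not is_idle():
        sleep(interval)


def clean_during_idle(marks, is_idle, clean_container, sleep=time.sleep):
    """Clean containers marked 0, yielding to normal service whenever it is busy."""
    pending = [cid for cid, mark in marks.items() if mark == 0]  # collect ids
    if not pending:
        return []
    wait_until_idle(is_idle, sleep)
    # Placeholder: clean the write cache and close the second-level cache here.
    cleaned = []
    for cid in pending:
        wait_until_idle(is_idle, sleep)
        if marks[cid] == 0:       # re-check: a backup may have marked it meanwhile
            clean_container(cid)  # remove the blocks and fingerprints
            marks[cid] = 2        # assumed: cleaned containers become reusable
            cleaned.append(cid)
    wait_until_idle(is_idle, sleep)
    # Placeholder: reopen the second-level cache and initialize the bloom filter.
    return cleaned
```

Passing `sleep` as a parameter keeps the sketch testable without real delays; in production the default `time.sleep` with the 1 s interval from the text would apply.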

This process is executed cyclically in the background at a set period. A backup set in a data protection product has a life cycle: it is cleaned automatically when the life cycle expires and can also be cleaned manually. As objects in the object library are replaced, the cleaning logic cleans the containers that have expired and are no longer used, and the reclaimed containers then store new data, so that space is recycled and the storage footprint is in effect reduced.

The application also provides an efficient data deletion system based on source-end deduplication, which comprises a container determining module, a backup set cleaning module and a deleting module;

The size of the container determined by the container determining module is fixed.

The container determining module comprises a backup set determining module and a container marking module;

The cleaning module comprises a backup module, a first traversing module, a second traversing module, an initializing module, a collecting module and a circulating module;

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is merely a preferred embodiment of the present application, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present application, and such modifications and variations should also be regarded as being within the scope of the application.

Claims

1. A method for efficiently deleting data based on source end deduplication is characterized in that,

in the backup process, splitting a data stream of a source end into data blocks, computing and comparing fingerprints, and, if a fingerprint does not exist, indicating a new block, transmitting the corresponding data block to a container on a server end for storage and marking the corresponding container as 1; writing the container into a data file after the container is full and creating a new container, wherein the container comprises a plurality of data blocks, a deduplication library comprises a plurality of fixed-size data files, and each data file comprises a plurality of containers;

during idle time outside the normal service window, cleaning the data blocks and fingerprints of containers marked 0 by using preset loop deletion logic, wherein a container marked 0 is one whose data blocks and fingerprints are not referenced and can be cleaned;

the process of marking each of the containers is as follows:

acquiring a referenced object file, reading index data in the object file according to a unique identifier of the object, finding a corresponding container according to fingerprints in the index data, and marking a corresponding container record with a mark 1;

the loop deletion logic is as follows:

S3, traversing the container records in the deduplication library, cleaning the data blocks and fingerprints thereof in the container marked with 0, and marking the container state as 2, wherein the container is cleaned for recycling;

S6, circularly executing the steps S1-S5 in a set period.

2. The efficient source deduplication-based data deletion method of claim 1, wherein the container is fixed in size.

3. An efficient data deletion system based on source-end deduplication, characterized by comprising a container determining module, a backup set cleaning module and a deleting module;

the deleting module is used for cleaning, during idle time outside the normal service window, the data blocks and fingerprints of containers marked 0 by using preset loop deletion logic, wherein a container marked 0 is one whose data blocks and fingerprints are not referenced and can be cleaned;

the container marking module is used for acquiring the referenced object file, reading index data in the object file according to the unique identifier of the object, finding a corresponding container according to the fingerprint in the index data, and marking the corresponding container record with a mark 1;

the second traversing module traverses the container records in the deduplication library, cleans up the data blocks and fingerprints thereof in the container marked with 0, and then marks the container state as 2, which represents that the container has been cleaned up for reuse;

4. The source deduplication-based data efficient deletion system of claim 3, wherein the size of the container determined by the container determination module is fixed.