CN116932281A - Similar data detection method and computing device based on two-dimensional locality - Google Patents

Similar data detection method and computing device based on two-dimensional locality

Info

Publication number
CN116932281A
Authority
CN
China
Prior art keywords
container
data
data block
block
dense
Prior art date
Legal status
Pending
Application number
CN202310811168.XA
Other languages
Chinese (zh)
Inventor
张宇成
夏昕
邱际光
张旭
涂旭青
于程远
周金平
Current Assignee
Nanchang University
Original Assignee
Nanchang University
Priority date
Filing date
Publication date
Application filed by Nanchang University
Priority to CN202310811168.XA
Publication of CN116932281A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 11/00 Error detection; Error correction; Monitoring
    • G06F 11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F 11/14 Error detection or correction of the data by redundancy in operation
    • G06F 11/1402 Saving, restoring, recovering or retrying
    • G06F 11/1446 Point-in-time backing up or restoration of persistent data
    • G06F 11/1448 Management of the data involved in backup or backup restore
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/13 File access structures, e.g. distributed indices
    • G06F 16/137 Hash-based
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/10 File systems; File servers
    • G06F 16/17 Details of further file system functions
    • G06F 16/174 Redundancy elimination performed by the file system
    • G06F 16/1744 Redundancy elimination performed by the file system using compression, e.g. sparse files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/44 Arrangements for executing specific programs
    • G06F 9/455 Emulation; Interpretation; Software simulation, e.g. virtualisation or emulation of application or operating system execution engines
    • G06F 9/45533 Hypervisors; Virtual machine monitors
    • G06F 9/45558 Hypervisor-specific management and integration aspects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application provides a similar data detection method based on two-dimensional locality, comprising the following steps: initializing an empty hash table to record the container reference amount of each container referenced by the data blocks in the backup; acquiring a data block that has undergone data deduplication and delta compression; if the data block is a duplicate and duplicates a delta block, updating the hash table with the reference block of that delta block; if the data block is a duplicate but does not duplicate a delta block, updating the hash table with the stored data block it duplicates; if the data block is not a duplicate, judging whether it has been delta compressed: if so, updating the hash table with its reference block; if not, updating the hash table with the data block itself; finally, collecting the container numbers whose reference amount exceeds a preset container reference amount threshold and writing them into a new file. The application can detect most similar data blocks and ensures that the detected similar data blocks have high similarity.

Description

Similar data detection method and computing device based on two-dimensional locality
Technical Field
The present application belongs to the field of computer storage, and in particular relates to a similar data detection method, computing device and storage medium based on two-dimensional locality.
Background
Data backup is a common method of data protection. Since backup files of adjacent versions tend to differ only slightly, backup data sets contain a large amount of redundancy. Backup systems typically use redundancy elimination techniques to remove this redundancy and reduce storage overhead; the most commonly used techniques are data deduplication and delta compression.
Data deduplication breaks a backup file into data blocks, detects identical blocks, and prevents identical blocks from being stored repeatedly. However, deduplication degrades the restore performance of the backup system; data rewriting techniques improve restore performance by storing a small number of duplicate data blocks, and the duplicate blocks that are rewritten are called "fragmented blocks".
Delta compression detects and removes redundancy between similar data blocks. Given two similar data blocks A and B, where A is the target block to be compressed, delta compression detects the data shared by A and B as well as the data present in A but not in B, and encodes A into a delta block using "copy" and "insert" instructions, respectively; B is called the "reference block". Since part of the data has been replaced by copy instructions, the delta block is much smaller than the original block A, so storing the delta block instead of A reduces storage overhead.
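The copy/insert encoding described above can be illustrated with a minimal sketch (Python; a naive greedy encoder for illustration only — production delta encoders use rolling hashes, and the names and anchor size here are our own assumptions, not the application's):

```python
# A naive greedy "copy/insert" delta encoder, for illustration only.
ANCHOR = 8  # anchor size in bytes (an illustrative choice)

def delta_encode(target: bytes, reference: bytes):
    """Encode `target` as ("copy", offset, length) and ("insert", literal)
    instructions against `reference`."""
    index = {}
    for off in range(len(reference) - ANCHOR + 1):
        index.setdefault(reference[off:off + ANCHOR], off)  # first occurrence wins

    ops, i, literal = [], 0, bytearray()
    while i < len(target):
        ref_off = index.get(target[i:i + ANCHOR])
        if ref_off is None:
            literal.append(target[i])       # no match: extend the insert run
            i += 1
            continue
        if literal:                         # flush pending literal bytes
            ops.append(("insert", bytes(literal)))
            literal = bytearray()
        n = ANCHOR                          # greedily extend the match
        while (i + n < len(target) and ref_off + n < len(reference)
               and target[i + n] == reference[ref_off + n]):
            n += 1
        ops.append(("copy", ref_off, n))
        i += n
    if literal:
        ops.append(("insert", bytes(literal)))
    return ops

def delta_decode(ops, reference: bytes) -> bytes:
    """Rebuild the target block from the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += reference[op[1]:op[1] + op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Round-tripping `delta_decode(delta_encode(A, B), B)` reproduces A, and the delta stays small whenever A and B share long runs of identical bytes.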
The basic storage unit of the backup system using the redundancy elimination technique is a fixed-length container, and a non-duplicate data block and a delta block subjected to delta compression are written into the container. When a container is full, it is stored in the system. At the same time, the system opens a new empty container to receive the non-duplicate data blocks and delta blocks that are subsequently stored.
When data deduplication and delta compression are used together for redundancy elimination, the backup system first eliminates duplicate data blocks with deduplication, then searches the system for data blocks similar to each remaining non-duplicate block; if one is found, delta compression is performed. The basic flow is shown in figure 1. To find similar blocks for a non-duplicate block, the system uses a similar data detection technique: it computes one or more characteristic values for each data block and detects similar blocks by matching characteristic values. The characteristic values are typically weak hashes of the block content, and two blocks sharing one or more equal characteristic values are considered similar in content. Which stored data blocks can be detected as the reference blocks required for delta compression depends on the characteristic value indexing method.
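As a concrete (and purely illustrative) example of characteristic value computation, the following sketch derives a few weak-hash features of a block: for each of several fixed linear maps, it takes the maximum transformed rolling checksum over all windows of the block, so two blocks with largely identical content are likely to share at least one value. The scheme, constants, and window size are assumptions for illustration, not the application's actual method.

```python
import zlib

def characteristic_values(block: bytes, window: int = 16):
    """Illustrative weak-hash characteristic values of a data block."""
    assert len(block) >= window, "block must span at least one window"
    MOD = 1 << 32
    maps = [(0x9E3779B1, 7), (0x85EBCA77, 11), (0xC2B2AE3D, 13)]  # arbitrary constants
    window_hashes = [zlib.adler32(block[i:i + window])
                     for i in range(len(block) - window + 1)]
    # For each linear map, keep the maximum transformed window hash.
    return tuple(max((a * h + b) % MOD for h in window_hashes) for a, b in maps)
```

A small edit changes only a few windows, so the per-map maxima usually survive, and a match on any value flags the two blocks as candidates for delta compression.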
One common characteristic value indexing method is full data indexing, i.e., indexing the characteristic values of all data blocks in the system. This approach faces two problems. The first is where to store the characteristic value index. Placing the index in memory limits the scalability of the system: once the number of data blocks grows beyond a certain point, the index no longer fits in memory. Placing the index on disk results in slow queries. The second problem is that self-similar data blocks with a low degree of similarity may be detected as similar data blocks. Similar data blocks can also exist within a single backup; these are called self-similar data blocks. For a given data block, the block with the highest similarity to it is usually a block in the previous or an earlier backup, because the former is very likely inherited from the latter through one or more small modifications; in contrast, self-similar data blocks within the same backup have no inheritance relationship and are therefore less similar. However, the full data index indexes all stored data blocks, including self-similar ones, and the characteristic values of self-similar blocks may overwrite the characteristic values of more similar blocks from earlier backups, so that self-similar blocks are detected as the reference blocks for delta compression.
An illustrative example of this problem is given in figure 2. In fig. 2, each data block is represented by a capital letter, and identical letters represent identical data blocks. Backup N contains 4 data blocks, where data block D1 is generated by a minor modification of data block D in backup N-1, so D1 is similar to D. In addition, X in backup N is also similar to D1, i.e., X is a self-similar data block of D1, and D has a higher similarity to D1 than X does. After the system processes backup N-1, the characteristic value index contains the characteristic value of D. However, after the system processes data block X in backup N, since X is similar to D, the characteristic value of X overwrites that of D, so X is detected as a similar data block of D1 and used as the reference block for delta compressing D1.
Another common characteristic value indexing method is the latest-backup index, which indexes only the characteristic values of the data blocks in the last backup and of the data blocks already processed in the current backup; if a data block in the last backup has been delta compressed, its reference block is indexed instead. This approach has two problems:
The first problem is the same as that faced by the full data index: when self-similar data blocks exist in the data set, a self-similar data block with lower similarity may be detected as the reference block, causing a loss of compression ratio.
The second problem is that cross-version similar data blocks cannot be detected. Similar data blocks may appear across versions, whereas the latest-backup index only detects the data blocks in the last backup or the reference blocks of delta blocks in the last backup, thereby missing similar data blocks that appear across versions. An illustrative example of this problem is given in figure 3. Backup 1 contains 6 non-duplicate data blocks, which are written to containers 1 and 2 after the backup completes. Backup 2 contains 5 non-duplicate data blocks and 1 duplicate data block, where A1 is the result of a minor modification of A in backup 1, so A1 and A are similar. When processing backup 2, the characteristic values of the data blocks in backup 1 are indexed, so A is detected as a similar data block of A1, and the system delta compresses A1 into the difference block d(A1). After backup 2 completes, its data blocks and delta block are written into container 3. The first 4 data blocks of backup 3 are inherited across versions from backup 1, where A2, C2 and D2 are generated by small modifications of A, C and D in backup 1, respectively, and are therefore similar to A, C and D; the last 2 data blocks of backup 3 are inherited from backup 2, where J2 is generated by a small modification of J, so J2 is similar to J. Since the latest-backup index only indexes the characteristic values of the data blocks in backup 2 (i.e., B, G, H, I and J) and of the reference block of the difference block (i.e., A), only A is detected as a similar data block of A2, while C and D, which appear across versions, cannot be detected as similar data blocks of C2 and D2, so C2 and D2 cannot be delta compressed.
Because of data sharing among files, a backup system using redundancy elimination cannot directly delete invalid files; space can only be reclaimed by garbage collecting containers that contain no valid data. When the available storage space in the system falls below a threshold, the backup system performs garbage collection. During garbage collection, the system scans all existing containers, marks the valid data blocks and delta blocks, repacks this valid data into new containers, stores the new containers in the system, and deletes containers that contain no valid data blocks or delta blocks.
Disclosure of Invention
In view of the above technical problems in the prior art, the embodiments of the present application aim to provide a similar data detection method and computing device based on two-dimensional locality, which can detect most similar data blocks while ensuring that the detected similar data blocks have high similarity, suitable for data backup computing devices that use both data deduplication and delta compression for redundancy elimination.
A first aspect of an embodiment of the present application provides a method for detecting similar data based on two-dimensional locality, including the steps of:
s1: initializing an empty hash Table Table-RT, wherein the hash Table records the container reference quantity of a container related to a data block in the backup, the key code of the Table-RT is a container number, and the hash value is the container reference quantity corresponding to the container number;
when a data block is used for updating a Table-RT, the embodiment of the application searches the Table-RT for the container number of the container where the data block is located; if not found, a new record is inserted, wherein the key code of the new record is the container number of the container where the data block is located, and the value is the length of the data block, namely the container reference amount; if so, accumulating the length of the data block to the searched value;
s2: acquiring a data block C subjected to data deduplication and delta compression processing, and judging whether the data block C is repeated or not;
s21: if the data block C is repeated, judging whether the data block C is repeated with the difference block;
s211: if the data block C is repeated with the difference block, updating the Table-RT by using a reference block of the difference block;
s212: if the data block C is not repeated with the difference block, updating the Table-RT by using the data block which is repeated with the data block C in the system;
s22: if the data block C is not repeated, judging whether the data block C is subjected to differential compression;
s221: if the data block C is subjected to differential compression, updating the Table-RT by using a reference block of the data block C;
s222: if the data block C does not do differential compression, updating the Table-RT by using the data block C;
s3: judging whether the data block C is the last data block in the backup data;
s31: if the data block C is the last data block in the backup data, executing S4;
s32: if the data block C is not the last data block in the backup data, executing S2;
s4: counting the container numbers in the Table-RT whose reference amount is larger than a preset container reference amount threshold THRE-Size, and writing these container numbers into a new file, denoted F-Dense-CTN.
As a further improvement of the embodiment of the present application, in step S4, except for the first backup, the F-Dense-CTN file generated by the previous backup is deleted before each new F-Dense-CTN file is generated.
Preferably, when the data rewriting technique is not used, the container reference amount threshold (THRE-Size) described in step S4 is set to 0; when the data rewriting technique is used, the THRE-Size may be set to a larger value according to the parameters of the data rewriting technique; for example, when the container size is 4MB, the THRE-Size may be set to 2MB.
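The Table-RT bookkeeping of steps S1 and S4 reduces to a few lines; a minimal Python sketch (function and variable names are ours, for illustration):

```python
def update_table_rt(table_rt, container_id, block_len):
    """S1's update rule: look up the container number; insert a new record
    with the block length, or accumulate the length into the existing value."""
    table_rt[container_id] = table_rt.get(container_id, 0) + block_len

def dense_containers(table_rt, thre_size):
    """S4: container numbers whose reference amount exceeds THRE-Size,
    ready to be written to the F-Dense-CTN file."""
    return sorted(cid for cid, amount in table_rt.items() if amount > thre_size)
```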
As a second aspect, the embodiment of the present application further provides a similar data delta compression method, including the steps of:
s5: reading the F-Dense-CTN file generated by the last backup, and constructing a lookup table TAB-Dense-CTN by using the container number in the F-Dense-CTN file;
s6: initializing an empty Cache Base-Cache, wherein the Cache Base-Cache is used for storing potential similar data blocks;
s7: acquiring a data block D which is subjected to data de-duplication processing but not yet subjected to differential compression, and judging whether the data block D is repeated or not;
s71: if the data block D is repeated, further judging whether the container number CID of the container where the data block D is repeated in the system exists in the TAB-Dense-CTN, and if so, inserting all the data blocks in the container with the container number CID and the characteristic values of the data blocks into the Base-Cache:
s72: if the data block D is not repeated, calculating the characteristic value of the data block D, and searching similar data blocks of the data block D in the Base-Cache; if found, reading a similar data block of the data block D, and performing differential compression on the data block D by taking the similar data block of the data block D as a reference block;
s8: judging whether the data block D is the last data block in the backup data stream or not:
s81: if the data block D is the last data block in the backup data stream, the similar data searching and delta compression are finished;
s82: otherwise, S7 is performed.
Preferably, the Base-Cache in step S6 is a queue, each element of which is a hash table whose key is a data block's characteristic value and whose value is the corresponding data block. Each element in the Base-Cache (i.e., each hash table) contains all the data blocks of one container together with their characteristic values. When searching the Base-Cache for similar data blocks, the elements are searched one by one from the first element (i.e., the first hash table) until a match is found or the last element has been searched.
Preferably, when the data blocks of a container and their characteristic values are inserted into the Base-Cache in step S71 and an existing element of the Base-Cache must be replaced, a least-recently-used (LRU) replacement strategy is adopted.
As a third aspect of the embodiments of the present application, the embodiments of the present application provide a garbage recycling method, the method including the steps of:
s9: initializing a lookup table GC-Record;
s10: initializing an empty Container Open-Container in the memory, for recording from which containers valid data are copied to which new containers; the key of Open-Container is the Container number of the Container where the valid data is located before being copied to the new Container, and the value is the Container number of the corresponding new Container;
s11: traversing all effective files in the system, and marking data blocks or difference blocks corresponding to the files as effective data;
s12: traversing all containers in the system, counting the proportion of effective data in each container, and counting the containers with the proportion of the effective data lower than a preset threshold; s13: obtaining a Container (P-Container) with a valid data duty ratio lower than a preset threshold, counting the total amount of the contained valid data, and judging whether the existing data amount of the Open-Container is over the Container capacity after adding the valid data in the Container:
s1301: if yes, storing the Open-Container into the system, creating a new empty Open-Container in the memory, copying the valid data into it, and adding a record to the GC-Record with the container number of the P-Container as the key and the container number of the new Open-Container as the value;
s1302: if not, copying the valid data into the Open-Container, and adding a record to the GC-Record with the container number of the P-Container as the key and the container number of the Open-Container as the value;
s14: repeating S13 until no container whose valid data proportion is below the preset threshold can be found in the system;
s15: reclaiming all containers that do not contain valid data;
s16: traversing all F-Dense-CTN files and updating each record in them, specifically: for a record in an F-Dense-CTN file (i.e., a container number, denoted Dense-CID), determine whether the Dense-CID is present in the GC-Record; if not, the Dense-CID need not be updated; if it is, obtain the corresponding value in the GC-Record and determine whether that value already exists in the F-Dense-CTN file; if it exists, delete the Dense-CID from the file; if not, replace the Dense-CID with the value.
Preferably, the GC-Record described in S9 is a hash table whose key (a container number) is an integer and whose value is an array of length 2; the value is an array of length 2 because the valid data of one container may be copied into two new containers.
Preferably, the value obtained by looking up the Dense-CID in the GC-Record as described in S16 is an array containing 2 elements, and thus may contain two container numbers; if so, both container numbers are updated into the F-Dense-CTN file.
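Steps S12 through S16 can be sketched as follows (Python; the container and block representations, capacities, and names are our illustrative assumptions, not the application's data structures). The first function repacks sparse containers and builds the GC-Record; the second rewrites the container numbers of an F-Dense-CTN file per S16.

```python
CONTAINER_CAP = 4 * 1024 * 1024   # assumed fixed container capacity (4 MB)

def garbage_collect(containers, valid, threshold=0.5, cap=CONTAINER_CAP):
    """Sketch of S12-S15: `containers` maps container id -> list of
    (block_id, size); `valid` is the set of block ids marked live in S11.
    Returns (new_containers, gc_record), where gc_record maps each repacked
    old container id to the new container id(s) now holding its live data."""
    gc_record, new_containers = {}, {}
    open_cid = max(containers, default=0) + 1        # fresh container numbers
    open_size, open_blocks = 0, []
    for cid, blocks in sorted(containers.items()):
        live = [(b, s) for b, s in blocks if b in valid]
        total = sum(s for _, s in blocks)
        if total == 0 or sum(s for _, s in live) / total >= threshold:
            continue                                 # enough live data; leave it alone
        for b, s in live:
            if open_size + s > cap:                  # S1301: seal, open a new container
                new_containers[open_cid] = open_blocks
                open_cid, open_size, open_blocks = open_cid + 1, 0, []
            open_blocks.append((b, s))
            open_size += s
            dests = gc_record.setdefault(cid, [])    # at most two destinations
            if open_cid not in dests:
                dests.append(open_cid)
    if open_blocks:
        new_containers[open_cid] = open_blocks       # seal the last open container
    return new_containers, gc_record

def update_dense_file(dense_cids, gc_record):
    """Sketch of S16: rewrite an F-Dense-CTN file's container numbers."""
    present, updated = set(dense_cids), []
    for cid in dense_cids:
        if cid not in gc_record:
            updated.append(cid)                      # container untouched by GC
            continue
        for new_cid in gc_record[cid]:               # one or two new containers
            if new_cid not in present:               # already listed -> drop old id
                present.add(new_cid)
                updated.append(new_cid)
    return updated
```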
The number of similar data blocks detected by the embodiment of the present application exceeds that of the latest-backup index method and approaches that of the full data index method, while the space occupied by the characteristic value index is far smaller than that of the full data index. The latest-backup index cannot detect cross-version similar data blocks, but the proposed similar data detection method can, as long as they are stored in the same container as the data blocks, or the reference blocks of delta blocks, of a previous version, and that container contains data blocks of the current backup.
For successive versions of a backup, the redundant data appear in approximately the same order, a property known as "redundancy locality". Redundancy locality can be further divided into "logical redundancy locality" and "physical redundancy locality": logical redundancy locality is the redundancy locality before redundancy elimination, which manifests in the input backup data stream; physical redundancy locality is the redundancy locality after redundancy elimination, which manifests in the containers. The method provided by the embodiment of the present application exploits both logical and physical redundancy locality.
When self-similar data blocks exist in the data set, the similar data blocks detected by the method provided by the embodiment of the present application have higher similarity than those detected by the prior art. The proposed method does not index the characteristic values of data blocks already processed in the current backup, thereby preventing self-similar data blocks with low similarity from being detected as reference blocks.
Drawings
The above features, technical characteristics, advantages and implementations of the present application will be further described below in a clear and easily understood manner with reference to the accompanying drawings and preferred embodiments.
FIG. 1 is a basic flow of backup using both data deduplication and delta compression;
FIG. 2 is an illustration of a prior art characteristic value indexing method detecting self-similar data blocks as reference blocks;
FIG. 3 is an illustration of the prior art latest-backup indexing method missing cross-version similar data blocks;
FIG. 4 is a diagram illustrating an embodiment of a method for detecting similar data and compressing similar data delta according to the present application;
FIG. 5 is an illustration of updating an F-Dense-CTN file in the garbage collection method according to the embodiment of the present application;
FIG. 6 is a similarity comparison of similar data blocks found by the prior art method and the proposed method;
fig. 7 is a comparison of the number of similar data blocks found by the prior art method and the method proposed by the present application.
Detailed Description
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following description will explain the specific embodiments of the present application with reference to the accompanying drawings. It is evident that the drawings in the following description are only examples of the application, from which other drawings and other embodiments can be obtained by a person skilled in the art without inventive effort.
For simplicity, each drawing schematically shows only the parts relevant to the present application; they do not represent the actual structure of a product. In addition, to simplify the drawings for ease of understanding, where several components share the same structure or function, only one of them may be drawn or labeled. Herein, "a" covers not only the case of "exactly one" but also that of "more than one".
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
In addition, in the description of the present application, step numbers and the like are used only to distinguish the description, and are not to be construed as indicating or implying or necessarily performing sequentially.
A first aspect of an embodiment of the present application provides a method for detecting similar data based on two-dimensional locality, including the steps of:
s1: initializing an empty hash Table Table-RT, wherein the hash Table records the container reference quantity of a container related to a data block in the backup, the key code of the Table-RT is a container number, and the hash value is the container reference quantity corresponding to the container number;
when a data block is used for updating a Table-RT, the embodiment of the application searches the Table-RT for the container number of the container where the data block is located; if not found, a new record is inserted, wherein the key code of the new record is the container number of the container where the data block is located, and the value is the length of the data block, namely the container reference amount; if so, accumulating the length of the data block to the searched value;
s2: acquiring a data block C subjected to data deduplication and delta compression processing, and judging whether the data block C is repeated or not;
s21: if the data block C is repeated, judging whether the data block C is repeated with the difference block;
s211: if the data block C is repeated with the difference block, updating the Table-RT by using a reference block of the difference block;
s212: if the data block C is not repeated with the difference block, updating the Table-RT by using the data block which is repeated with the data block C in the system;
s22: if the data block C is not repeated, judging whether the data block C is subjected to differential compression;
s221: if the data block C is subjected to differential compression, updating the Table-RT by using a reference block of the data block C;
s222: if the data block C does not do differential compression, updating the Table-RT by using the data block C;
s3: judging whether the data block C is the last data block in the backup data;
s31: if the data block C is the last data block in the backup data, executing S4;
s32: if the data block C is not the last data block in the backup data, executing S2;
s4: counting the container numbers in the Table-RT whose reference amount is larger than a preset container reference amount threshold THRE-Size, and writing these container numbers into a new file, denoted F-Dense-CTN.
As a further improvement of the embodiment of the present application, in step S4, except for the first backup, the F-Dense-CTN file generated by the previous backup is deleted before each new F-Dense-CTN file is generated.
Preferably, when the data rewriting technique is not used, the container reference amount threshold (THRE-Size) described in step S4 is set to 0; when the data rewriting technique is used, the THRE-Size may be set to a larger value according to the parameters of the data rewriting technique; for example, when the container size is 4MB, the THRE-Size may be set to 2MB.
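The per-block dispatch of step S2 (cases S211 through S222) can be sketched as follows. The block representation (a dict with hypothetical fields) and all names are illustrative assumptions, not the application's actual data structures:

```python
THRE_SIZE = 2 * 1024 * 1024   # e.g. half of an assumed 4 MB container

def detect_dense_containers(blocks, thre_size=THRE_SIZE):
    """Sketch of S1-S4. Each block is a dict with hypothetical fields:
      duplicate:    True if deduplication found an identical stored block
      dup_is_delta: the identical stored block is itself a delta block
      delta_ref / dup_block / ref_block / self: (container_id, length) of
      the block whose container is credited, per cases S211-S222."""
    table_rt = {}                            # S1: container id -> bytes referenced
    for blk in blocks:                       # S2/S3: one pass over the backup
        if blk["duplicate"]:
            key = "delta_ref" if blk["dup_is_delta"] else "dup_block"  # S211 / S212
        else:
            key = "ref_block" if blk.get("ref_block") else "self"      # S221 / S222
        cid, length = blk[key]
        table_rt[cid] = table_rt.get(cid, 0) + length
    # S4: containers referenced by more than THRE-Size bytes are "dense"
    return sorted(cid for cid, amount in table_rt.items() if amount > thre_size)
```

The returned container numbers are what would be written to the F-Dense-CTN file for the next backup.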
As a second aspect, the embodiment of the present application further provides a similar data delta compression method, including the steps of:
s5: reading the F-Dense-CTN file generated by the last backup, and constructing a lookup table TAB-Dense-CTN by using the container number in the F-Dense-CTN file;
s6: initializing an empty Cache Base-Cache, wherein the Cache Base-Cache is used for storing potential similar data blocks;
s7: acquiring a data block D which is subjected to data de-duplication processing but not yet subjected to differential compression, and judging whether the data block D is repeated or not;
s71: if the data block D is repeated, further judging whether the container number CID of the container where the data block D is repeated in the system exists in the TAB-Dense-CTN, and if so, inserting all the data blocks in the container with the container number CID and the characteristic values of the data blocks into the Base-Cache:
s72: if the data block D is not repeated, calculating the characteristic value of the data block D, and searching similar data blocks of the data block D in the Base-Cache; if found, reading a similar data block of the data block D, and performing differential compression on the data block D by taking the similar data block of the data block D as a reference block;
s8: judging whether the data block D is the last data block in the backup data stream or not:
s81: if the data block D is the last data block in the backup data stream, the similar data searching and delta compression are finished;
s82: otherwise, S7 is performed.
Preferably, the Base-Cache in step S6 is a queue; each element of the queue is a hash table whose keys are data block feature values and whose values are the corresponding data blocks. Each element of Base-Cache (i.e., each hash table) contains all the data blocks of one container together with their feature values. When Base-Cache is searched for a similar data block, the elements are examined one by one, starting from the first hash table, until a similar block is found or the last element has been examined.
Preferably, when the data blocks of a container and their feature values are inserted into Base-Cache in step S71 and an element of Base-Cache must be replaced, a least recently used (LRU) replacement policy is adopted.
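Under these two preferences, Base-Cache can be sketched as a bounded `OrderedDict` of per-container hash tables with LRU eviction. The capacity parameter and the exact refresh-on-hit behavior are assumptions for illustration; the patent only states that the replacement policy is LRU.

```python
from collections import OrderedDict

# Sketch of the Base-Cache of S6/S71: a bounded queue whose elements are
# per-container hash tables ({feature value: data block}), with the least
# recently used container table evicted when the queue is full.
class BaseCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.tables = OrderedDict()  # container number -> {feature: block}

    def insert(self, cid, feature_table):
        if cid in self.tables:
            self.tables.move_to_end(cid)     # already cached: refresh recency
            return
        if len(self.tables) >= self.capacity:
            self.tables.popitem(last=False)  # evict least recently used table
        self.tables[cid] = feature_table

    def lookup(self, feature):
        # S72: examine element by element, first hash table first, until a hit
        for cid, table in self.tables.items():
            if feature in table:
                self.tables.move_to_end(cid)  # a hit refreshes recency (LRU)
                return table[feature]
        return None
```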
Fig. 4 shows an illustrative example of the similar data detection and delta compression methods according to an embodiment of the present application. Backup 1 contains 6 non-duplicate data blocks, which are stored in container 1 and container 2; both container numbers are written into the F-Dense-CTN file. Backup 2 contains 5 non-duplicate data blocks and 1 duplicate data block, where A1 results from a minor modification of A in backup 1, so A1 and A are similar. When backup 2 is deduplicated, the duplicate data block B lies in container 1, and container 1 exists in the F-Dense-CTN file generated by backup 1, so the feature values of all data blocks in container 1 are indexed; A is therefore detected as a similar data block of A1, and the system delta-compresses A1 into a difference block d(A1). After backup 2 finishes, the data blocks and the difference block are written into container 3, the F-Dense-CTN file generated by backup 1 is deleted, and container 1 and container 3 are written into a new F-Dense-CTN file. The first 4 data blocks of backup 3 are inherited from backup 1, where A2, C2, and D2 result from minor modifications of A, C, and D in backup 1, respectively, and are thus similar to them; the last 2 data blocks of backup 3 are inherited from backup 2, where J2 results from a minor modification of J in backup 2, so J2 is similar to J.
When backup 3 is deduplicated, the duplicate data block B lies in container 1, and container 1 exists in the F-Dense-CTN file generated by backup 2, so the feature values of all data blocks in container 1 are indexed; likewise, the feature values of all data blocks in container 3 are indexed. The similar data blocks of A2, C2, D2, and J2 in backup 3 can therefore all be detected, and these blocks can be delta-compressed into d(A2), d(C2), d(D2), and d(J2); among them, C2 and D2 are similar data blocks that the most-recent-backup indexing method cannot detect.
As a third aspect of the embodiments of the present application, a garbage collection method is provided, comprising the following steps:
S9: initializing a lookup table GC-Record for recording from which containers valid data is copied to which new containers; a key of GC-Record is the container number of the container holding the valid data before it is copied, and the value is the container number of the corresponding new container;
S10: initializing an empty container Open-Container in memory;
S11: traversing all valid files in the system and marking the data blocks or difference blocks referenced by these files as valid data;
S12: traversing all containers in the system, computing the proportion of valid data in each container, and identifying the containers whose valid data proportion is below a preset threshold;
S13: obtaining a container whose valid data proportion is below the preset threshold (denoted P-Container), counting the total amount of valid data it contains, and determining whether adding this valid data to Open-Container would make the amount of data in Open-Container exceed the container capacity:
S1301: if yes, storing Open-Container into the system, creating a new empty Open-Container in memory, copying the valid data into the new Open-Container, and adding a record to GC-Record with the container number of P-Container as the key and the container number of Open-Container as the value;
S1302: if not, copying the valid data into Open-Container and adding a record to GC-Record with the container number of P-Container as the key and the container number of Open-Container as the value;
S14: repeating S13 until no container whose valid data proportion is below the preset threshold can be found in the system;
S15: reclaiming all containers that contain no valid data;
S16: traversing all F-Dense-CTN files and updating each record in them as follows: for a record in an F-Dense-CTN file (i.e., a container number, denoted Dense-CID), determine whether the Dense-CID exists in GC-Record; if not, the Dense-CID needs no update; if it does, obtain the corresponding value from GC-Record and determine whether that value already exists in the F-Dense-CTN file; if it exists, delete the Dense-CID from the file; if not, replace the Dense-CID with the value.
Preferably, the GC-Record described in S9 is a hash table whose key (a container number) is an integer and whose value is an array of length 2; the value is an array of length 2 because the valid data of one container may be copied into two new containers.
Preferably, the value obtained by looking up a Dense-CID in GC-Record as described in S16 is an array of 2 elements and may therefore contain two container numbers; if it does, both container numbers are processed one by one when updating the F-Dense-CTN file.
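Steps S9 through S15 can be sketched as the following packing loop. The representations are illustrative assumptions, not from the patent: each container is a dict with an `id` and a list of `(size, is_valid)` entries, and new container numbers are simply allocated sequentially.

```python
# Sketch of S9-S15: valid data from sparse containers (valid proportion below
# valid_threshold) is packed into open containers; GC-Record maps each old
# container number to the new container(s) its valid data was copied into.
def collect_garbage(containers, capacity, valid_threshold):
    gc_record = {}                          # S9: old cid -> [new cid, ...]
    next_cid = max(c["id"] for c in containers) + 1
    open_cid, open_fill = next_cid, 0       # S10: one open container in memory
    surviving = []
    for ctn in containers:                  # S12: find the sparse containers
        valid_sizes = [s for s, ok in ctn["blocks"] if ok]
        total = sum(s for s, _ in ctn["blocks"])
        if total and sum(valid_sizes) / total >= valid_threshold:
            surviving.append(ctn["id"])     # dense enough: keep as-is
            continue
        for size in valid_sizes:            # S13: copy the valid data out
            if open_fill + size > capacity: # S1301: seal and open a new one
                surviving.append(open_cid)
                next_cid += 1
                open_cid, open_fill = next_cid, 0
            open_fill += size               # S1302: append to the open container
            new_cids = gc_record.setdefault(ctn["id"], [])
            if open_cid not in new_cids:
                new_cids.append(open_cid)   # record the old -> new mapping
    if open_fill:
        surviving.append(open_cid)          # flush the last open container
    return surviving, gc_record             # S15: sparse containers reclaimed
```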
Fig. 5 shows an example of updating an F-Dense-CTN file in the garbage collection method according to an embodiment of the present application. During garbage collection, the valid data contained in containers 52, 53, 55, 67, and 68 is copied out and written into new containers 117 and 118, and GC-Record records this information. When the F-Dense-CTN file is updated, 52 is replaced with 117, because 52 appears in GC-Record and the corresponding value 117 is not yet present in the F-Dense-CTN file. When 67 is updated, its corresponding value contains two container numbers, 117 and 118, which are processed one by one: while processing 117, 67 is deleted because 117 is already present in the file; while processing 118, 118 is added because it is not yet present. Finally, 69 does not appear in GC-Record and therefore needs no update. The updated F-Dense-CTN file thus contains the three container numbers 117, 118, and 69, as shown in FIG. 5.
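The S16 update rule fits in a few lines; the function below replays the FIG. 5 walk-through (52 replaced by 117, 67 handled via 117 and 118, 69 untouched). Representing the F-Dense-CTN file as a plain list of container numbers is an assumption for illustration.

```python
# Sketch of S16: for each Dense-CID found in GC-Record, drop the old number
# and add each new container number that is not already in the file.
def update_dense_ctn(dense_cids, gc_record):
    result = list(dense_cids)
    for cid in dense_cids:
        if cid not in gc_record:
            continue                    # not moved by GC: keep as-is
        result.remove(cid)              # the old number is always dropped
        for new_cid in gc_record[cid]:  # a value holds up to two containers
            if new_cid not in result:   # already in the file: nothing to add,
                result.append(new_cid)  # otherwise replace Dense-CID with it
    return result
```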
Table 1 illustrates the backup data sets used.
TABLE 1
FIG. 6 shows the similarity of the data blocks detected by the most-recent-backup indexing method, the full-data indexing method, and the present application on the public WEB dataset, which contains a large number of self-similar data blocks. As the results in FIG. 6 show, the similar data blocks detected by the present application have higher similarity than those detected by the existing methods.
Fig. 7 shows a comparison of the number of similar data blocks found by the present application on 4 public backup datasets (RDB, WEB, CHROME, and SYN). As the results in the figure show, the present application finds more similar data blocks than the most-recent-backup indexing method and nearly as many as the full-data indexing method; meanwhile, the space overhead of the feature value index proposed by the present application is far smaller than that of the full-data indexing method.
In another aspect, embodiments of the present application further provide a computer-readable storage medium storing computer code which, when executed, performs the methods of the preceding embodiments.
In one aspect, an embodiment of the present application further provides a computer apparatus, including:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method as described in the previous embodiments.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, reference may be made to the related descriptions of the other embodiments.
It should be noted that the above embodiments can be freely combined as needed. The foregoing is merely a preferred embodiment of the present application; modifications and adaptations made by those skilled in the art without departing from the principles of the present application shall also fall within the scope of protection of the present application.

Claims (7)

1. A similar data detection method based on two-dimensional locality, characterized by comprising the following steps:
S1: initializing an empty hash table Table-RT that records the container reference amount of each container referenced by the data blocks of the backup, wherein a key of Table-RT is a container number and the value is the container reference amount corresponding to that container number;
S2: obtaining a data block C that has undergone deduplication and delta compression processing, and determining whether data block C is a duplicate;
S21: if data block C is a duplicate, determining whether data block C duplicates a difference block;
S211: if data block C duplicates a difference block, updating Table-RT with the reference block of that difference block;
S212: if data block C does not duplicate a difference block, updating Table-RT with the stored data block that data block C duplicates;
S22: if data block C is not a duplicate, determining whether data block C has undergone delta compression;
S221: if data block C has undergone delta compression, updating Table-RT with the reference block of data block C;
S222: if data block C has not undergone delta compression, updating Table-RT with data block C itself;
S3: determining whether data block C is the last data block in the backup data;
S31: if data block C is the last data block in the backup data, executing S4;
S32: if data block C is not the last data block in the backup data, executing S2;
S4: collecting from Table-RT the container numbers whose reference amount exceeds a preset container reference amount threshold THRE-Size, writing these container numbers into a new file, and denoting this new file as F-Dense-CTN.
2. The similar data detection method based on two-dimensional locality according to claim 1, further comprising:
when a data block is used to update Table-RT, searching Table-RT for the container number of the container holding that data block; if it is not found, inserting a new record whose key is the container number of the container holding the data block and whose value is the length of the data block, i.e., the container reference amount; if it is found, adding the length of the data block to the found value.
3. The similar data detection method based on two-dimensional locality according to claim 1, further comprising:
S5: reading the F-Dense-CTN file generated by the previous backup, and building a lookup table TAB-Dense-CTN from the container numbers in that file;
S6: initializing an empty cache Base-Cache for storing potentially similar data blocks;
S7: obtaining a data block D that has undergone deduplication but not yet delta compression, and determining whether data block D is a duplicate;
S71: if data block D is a duplicate, further determining whether the container number CID of the container holding the stored copy of data block D exists in TAB-Dense-CTN, and if so, inserting all data blocks in the container numbered CID, together with their feature values, into Base-Cache;
S72: if data block D is not a duplicate, computing the feature value of data block D and searching Base-Cache for a similar data block; if one is found, reading that similar data block and delta-compressing data block D using it as the reference block;
S8: determining whether data block D is the last data block in the backup data stream:
S81: if data block D is the last data block in the backup data stream, similar data detection and delta compression are finished;
S82: otherwise, executing S7.
4. The similar data detection method based on two-dimensional locality according to claim 3, wherein the Base-Cache is a queue, each element of the queue is a hash table whose keys are data block feature values and whose values are the corresponding data blocks; each hash table in Base-Cache contains all the data blocks of one container together with their feature values, and when Base-Cache is searched for a similar data block, the elements are examined one by one, starting from the first hash table, until a similar block is found or the last element has been examined.
5. The similar data detection method based on two-dimensional locality according to claim 1, further comprising:
S9: initializing a lookup table GC-Record for recording from which containers valid data is copied to which new containers, wherein a key of GC-Record is the container number of the container holding the valid data before it is copied and the value is the container number of the corresponding new container;
S10: initializing an empty container Open-Container in memory;
S11: traversing all valid files in the system and marking the data blocks or difference blocks referenced by these files as valid data;
S12: traversing all containers in the system, computing the proportion of valid data in each container, and identifying the containers whose valid data proportion is below a preset threshold;
S13: obtaining a container whose valid data proportion is below the preset threshold, denoted P-Container, counting the total amount of valid data it contains, and determining whether adding this valid data to Open-Container would make the amount of data in Open-Container exceed the container capacity:
S1301: if yes, storing Open-Container into the system, creating a new empty Open-Container in memory, copying the valid data into the new Open-Container, and adding a record to GC-Record with the container number of P-Container as the key and the container number of Open-Container as the value;
S1302: if not, copying the valid data into Open-Container and adding a record to GC-Record with the container number of P-Container as the key and the container number of Open-Container as the value;
S14: repeating S13 until no container whose valid data proportion is below the preset threshold can be found in the system;
S15: reclaiming all containers that contain no valid data;
S16: traversing all F-Dense-CTN files and updating each record in them as follows: for a record in an F-Dense-CTN file, i.e., a container number, denoted Dense-CID, determining whether the Dense-CID exists in GC-Record; if not, the Dense-CID needs no update; if it does, obtaining the corresponding value from GC-Record and determining whether that value already exists in the F-Dense-CTN file; if it exists, deleting the Dense-CID from the file; if not, replacing the Dense-CID with the value; preferably, the GC-Record described in S9 is a hash table whose key is an integer and whose value is an array of length 2, because the valid data of one container may be copied into two new containers.
6. A computer readable storage medium storing computer code which, when executed, performs the method of any one of claims 1 to 5.
7. A computer device, the computer device comprising:
one or more processors;
a memory for storing one or more computer programs;
the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1 to 5.
CN202310811168.XA 2023-07-04 2023-07-04 Similar data detection method and computing device based on two-dimensional locality Pending CN116932281A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310811168.XA CN116932281A (en) 2023-07-04 2023-07-04 Similar data detection method and computing device based on two-dimensional locality


Publications (1)

Publication Number Publication Date
CN116932281A true CN116932281A (en) 2023-10-24

Family

ID=88387151



Similar Documents

Publication Publication Date Title
US10656859B2 (en) Efficient deduplication for storage systems
US9594674B1 (en) Method and system for garbage collection of data storage systems using live segment records
US9715505B1 (en) Method and system for maintaining persistent live segment records for garbage collection
US9430164B1 (en) Memory efficient sanitization of a deduplicated storage system
US9043334B2 (en) Method and system for accessing files on a storage system
CN108319602B (en) Database management method and database system
US9424185B1 (en) Method and system for garbage collection of data storage systems
US9367448B1 (en) Method and system for determining data integrity for garbage collection of data storage systems
US20120159098A1 (en) Garbage collection and hotspots relief for a data deduplication chunk store
US8560500B2 (en) Method and system for removing rows from directory tables
US7853598B2 (en) Compressed storage of documents using inverted indexes
US11507539B2 (en) Apparatus and method for storing received data blocks as deduplicated data blocks
US10061693B2 (en) Method of generating secondary index and apparatus for storing secondary index
US20080222219A1 (en) Method and apparatus for efficiently merging, storing and retrieving incremental data
CN112131140B (en) SSD-based key value separation storage method supporting efficient storage space management
US11841826B2 (en) Embedded reference counts for file clones
CN111104377B (en) File management method, electronic device and computer readable storage medium
CN114610708A (en) Vector data processing method and device, electronic equipment and storage medium
CN110837555A (en) Method, equipment and storage medium for removing duplicate and screening of massive texts
CN112416879B (en) NTFS file system-based block-level data deduplication method
CN107301019B (en) Garbage recycling method combining reference time chart and container bit table
CN116932281A (en) Similar data detection method and computing device based on two-dimensional locality
CN113821476B (en) Data processing method and device
WO2020058300A1 (en) Efficient in-memory multi-version concurrency control for a trie data structure based database
Li et al. Improving the Restore Performance via Physical-Locality Middleware for Backup Systems

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination