WO2014067063A1 - Duplicate data retrieval method and device - Google Patents

Duplicate data retrieval method and device

Info

Publication number
WO2014067063A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
hash value
data packet
block
hash
Prior art date
Application number
PCT/CN2012/083740
Other languages
French (fr)
Chinese (zh)
Inventor
覃强
Original Assignee
华为技术有限公司
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to PCT/CN2012/083740 (WO2014067063A1)
Priority to CN201280001989.7A (CN103189867B)
Publication of WO2014067063A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/174: Redundancy elimination performed by the file system
    • G06F16/1748: De-duplication implemented within the file system, e.g. based on file segments
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/14: Details of searching files based on file metadata
    • G06F16/148: File search processing
    • G06F16/152: File search processing using file content signatures, e.g. hash values

Definitions

  • The present invention relates to storage technologies, and in particular to a method and device for duplicate data retrieval.
  • Background
  • Deduplication (de-duplication) is a data reduction technology designed to reduce the storage capacity used in storage systems or the amount of data transmitted over a network. It is widely used in data backup and WAN data transmission scenarios.
  • The process of deduplication is as follows: the input data is divided into blocks, the hash value of each block is calculated, and the calculated hash value is looked up in the single-instance library to determine whether the block is a duplicate block. If the block is a duplicate, the block and its hash value are not stored in the single-instance library again, which reduces the amount of data stored.
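The baseline process described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the fixed block size, the choice of SHA-1, and the names `deduplicate`, `store`, and `recipe` are all assumptions made for the example.

```python
import hashlib


def deduplicate(data: bytes, block_size: int = 4):
    """Split data into fixed-size blocks and keep one copy of each distinct block."""
    store = {}   # stands in for the single-instance library: hash -> block
    recipe = []  # ordered block hashes, enough to rebuild the input
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:
            # Only blocks whose hash is not yet in the library are stored.
            store[digest] = block
        recipe.append(digest)
    return store, recipe


store, recipe = deduplicate(b"abcdabcdwxyz", block_size=4)
restored = b"".join(store[digest] for digest in recipe)
```

Every block costs one lookup in the library in this baseline, which is the per-block query cost the grouping scheme in this document aims to reduce.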
  • Embodiments of the present invention provide a method and device for duplicate data retrieval, which improve the efficiency of duplicate block queries and the overall performance of data deduplication.
  • A first aspect provides a method for duplicate data retrieval, including:
  • For a first data packet in the at least one data packet, performing a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet.
  • The hash value of the second data packet is obtained by performing a similarity hash operation on the data blocks in the second data packet.
  • The first data packet is any one of the at least one data packet. If the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, duplicate block retrieval is performed on the data blocks in the first data packet.
  • The method further includes: if the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, storing the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and storing the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • Grouping the at least two data blocks to obtain the at least one data packet includes: forming hash data to be blocked from the hash value of each of the at least two data blocks; using the length of the hash value of any one of the data blocks as the sliding step, performing block processing on the hash data to be blocked with a blocking algorithm to obtain at least one hash value block; and taking the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • Performing the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: performing a hash operation on each data block in the first data packet to obtain the hash value of each data block; replacing each 0 in the hash value of each data block with -1; adding the corresponding bits of the hash values of all the data blocks in the first data packet; mapping each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and using the resulting binary value as the hash value of the first data packet.
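The similarity hash operation described above can be sketched as follows, under stated assumptions: block hashes are held as Python integers, `block_hash` truncates SHA-1 to 64 bits purely for illustration, and the names `block_hash` and `packet_simhash` are hypothetical. The source does not name a hash function or a hash length.

```python
import hashlib

HASH_BITS = 64  # assumed hash length; the source does not fix one


def block_hash(block: bytes) -> int:
    """Illustrative per-block hash: the first 64 bits of SHA-1."""
    return int.from_bytes(hashlib.sha1(block).digest()[:8], "big")


def packet_simhash(block_hashes, bits: int = HASH_BITS) -> int:
    """Combine per-block hashes into one similarity hash for the packet."""
    totals = [0] * bits
    for value in block_hashes:
        for pos in range(bits):
            # A 1 bit contributes +1; a 0 bit is treated as -1.
            totals[pos] += 1 if (value >> pos) & 1 else -1
    result = 0
    for pos, total in enumerate(totals):
        if total > 0:  # sums <= 0 map to 0, so the bit stays cleared
            result |= 1 << pos
    return result


# Three 4-bit hashes: column sums are (+1, -3, +1, -1) from high to low bit.
example = packet_simhash([0b1010, 0b1000, 0b0011], bits=4)  # 0b1010
```

Because each output bit is a majority vote over the blocks in the packet, changing a few blocks flips only a few bits, which is what makes the bitwise comparison of packet hashes meaningful.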
  • The data storage space includes a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • Performing duplicate block retrieval on the data blocks in the first data packet includes: obtaining, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and loading the data blocks in the storage area numbered n and the hash values of those data blocks into memory, where n is an integer greater than or equal to 0; and comparing each data block in the first data packet with the data block having the same hash value in the storage area numbered n, to complete the duplicate block retrieval for the data blocks in the first data packet.
  • The method further includes: when the data blocks in the storage area numbered n and their hash values are loaded into memory, also loading the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, comparing each data block in the first data packet with the data block having the same hash value in the storage area numbered n includes: comparing each data block in the first data packet with the data block having the same hash value in the storage areas numbered n and (n+1), to complete the duplicate block retrieval for the data blocks in the first data packet.
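The retrieval step over storage areas n and (n+1) might look like the sketch below. The in-memory dictionary `storage_areas`, the helper names, and the use of SHA-1 are assumptions for illustration; in the patent the storage areas live in the data storage space and are loaded into memory on demand.

```python
import hashlib


def block_hash(block: bytes) -> str:
    return hashlib.sha1(block).hexdigest()


def load_areas(storage_areas, n, prefetch_next=False):
    """Load the {hash: block} entries of area n, and optionally area n + 1."""
    loaded = dict(storage_areas.get(n, {}))
    if prefetch_next:
        loaded.update(storage_areas.get(n + 1, {}))
    return loaded


def retrieve_duplicates(packet_blocks, loaded):
    """Split a packet into blocks already stored and blocks that are new."""
    duplicates, new_blocks = [], []
    for block in packet_blocks:
        stored = loaded.get(block_hash(block))
        # Blocks with the same hash are compared before being called duplicates.
        if stored is not None and stored == block:
            duplicates.append(block)
        else:
            new_blocks.append(block)
    return duplicates, new_blocks


storage_areas = {
    0: {block_hash(b"aaaa"): b"aaaa"},
    1: {block_hash(b"bbbb"): b"bbbb"},
}
packet = [b"aaaa", b"bbbb", b"cccc"]
only_n = retrieve_duplicates(packet, load_areas(storage_areas, 0))
with_next = retrieve_duplicates(packet, load_areas(storage_areas, 0, prefetch_next=True))
```

With only area 0 loaded, `b"bbbb"` is treated as new; also loading area 1 lets the same pass find it as a duplicate.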
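  • Obtaining, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: obtaining, from the hash value storage table, a hash value whose number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, as the first hash value.
  • Obtaining, from the hash value storage table, the hash value whose number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to the preset number includes: obtaining the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and using a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.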
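Counting identical bits at corresponding positions is equivalent to bounding the Hamming distance, the measure named in the third aspect of this document. The sketch below assumes hashes are non-negative integers of a known bit length; `matching_bits`, `find_similar`, and the table layout are hypothetical names chosen for the example.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def matching_bits(a: int, b: int, bits: int) -> int:
    """Number of bit positions in which a and b agree."""
    return bits - hamming_distance(a, b)


def find_similar(table, packet_hash: int, min_matching: int, bits: int):
    """Return stored hashes sharing at least min_matching bits with packet_hash."""
    return [h for h in table if matching_bits(h, packet_hash, bits) >= min_matching]


# Hypothetical table: stored packet hash -> packet identifier.
table = {0b1010: "packet-1", 0b0101: "packet-2"}
candidates = find_similar(table, 0b1000, min_matching=3, bits=4)  # [0b1010]
```

For a hash of `bits` bits, requiring at least `min_matching` identical bits is the same as requiring a Hamming distance of at most `bits - min_matching`.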
  • A second aspect provides a duplicate data retrieval device, including:
  • a block obtaining module, configured to perform block processing on received data to obtain at least two data blocks;
  • a packet obtaining module, configured to group the at least two data blocks obtained by the block obtaining module to obtain at least one data packet, where each data packet includes at least one data block; and a hash calculation module, configured to, for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold;
  • where the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet, the hash value of the second data packet is obtained by performing a similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and
  • a duplicate retrieval module, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
  • The duplicate data retrieval device further includes: a storage module, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • The packet obtaining module is specifically configured to: form hash data to be blocked from the hash value of each of the at least two data blocks; using the length of the hash value of any one of the data blocks as the sliding step, perform block processing on the hash data to be blocked with a blocking algorithm to obtain at least one hash value block; and take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • To perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet, the hash calculation module is specifically configured to: perform a hash operation on each data block in the first data packet to obtain the hash value of each data block; replace each 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all the data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas;
  • the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located;
  • the duplicate retrieval module is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and the hash values of those data blocks into memory, where n is an integer greater than or equal to 0; and compare each data block in the first data packet with the data block having the same hash value in the storage area numbered n, to complete the duplicate block retrieval for the data blocks in the first data packet.
  • The duplicate retrieval module is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory;
  • in that case, the duplicate retrieval module is specifically configured to compare each data block in the first data packet with the data block having the same hash value in the storage areas numbered n and (n+1), to complete the duplicate block retrieval for the data blocks in the first data packet.
  • To obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold, the hash calculation module is specifically configured to obtain, from the hash value storage table, a hash value whose number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, and to use that hash value as the first hash value.
  • A third aspect provides a duplicate data retrieval device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory communicate with each other through the bus;
  • the communication interface is configured to receive data;
  • the processor is configured to execute a program;
  • the memory is configured to store the program;
  • the program is configured to: perform block processing on the data received by the communication interface to obtain at least two data blocks; group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block; and, for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold.
  • The program is further configured to: when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • To group the at least two data blocks and obtain the at least one data packet, the program is specifically configured to: form hash data to be blocked from the hash value of each of the at least two data blocks; using the length of the hash value of any one of the data blocks as the sliding step, perform block processing on the hash data to be blocked with a blocking algorithm to obtain at least one hash value block; and take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • To perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet, the program is specifically configured to: perform a hash operation on each data block in the first data packet to obtain the hash value of each data block; replace each 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all the data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • To perform duplicate block retrieval on the data blocks in the first data packet, the program is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and the hash values of those data blocks into memory, where n is an integer greater than or equal to 0; and compare each data block in the first data packet with the data block having the same hash value in the storage area numbered n, to complete the duplicate block retrieval for the data blocks in the first data packet.
  • The program is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory.
  • In that case, the program is specifically configured to compare each data block in the first data packet with the data block having the same hash value in the storage areas numbered n and (n+1), to complete the duplicate block retrieval for the data blocks in the first data packet.
  • To obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold, the program is specifically configured to obtain, from the hash value storage table, a hash value whose number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, as the first hash value.
  • To obtain the hash value whose number of identical bits at corresponding positions is greater than or equal to the preset number, the program is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold as the first hash value.
  • A fourth aspect provides a computer program product including a computer-readable storage medium for storing a program, the program including:
  • a block obtaining unit, configured to perform block processing on received data to obtain at least two data blocks;
  • a packet obtaining unit, configured to group the at least two data blocks obtained by the block obtaining unit to obtain at least one data packet, where each data packet includes at least one data block; and a hash calculation unit, configured to, for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet;
  • the hash value of the second data packet is obtained by performing a similarity hash operation on the data blocks in the second data packet;
  • the first data packet is any one of the at least one data packet; and
  • a duplicate retrieval unit, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
  • The program further includes: a storage unit, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • The packet obtaining unit is specifically configured to: form hash data to be blocked from the hash value of each of the at least two data blocks; using the length of the hash value of any one of the data blocks as the sliding step, perform block processing on the hash data to be blocked with a blocking algorithm to obtain at least one hash value block; and take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
  • To perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet, the hash calculation unit is specifically configured to: perform a hash operation on each data block in the first data packet to obtain the hash value of each data block; replace each 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all the data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas;
  • the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located;
  • the duplicate retrieval unit is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n and the hash values of those data blocks into memory, where n is an integer greater than or equal to 0; and compare each data block in the first data packet with the data block having the same hash value in the storage area numbered n, to complete the duplicate block retrieval for the data blocks in the first data packet.
  • The duplicate retrieval unit is further configured to, when loading the data blocks in the storage area numbered n and their hash values into memory, also load the data blocks in the storage area numbered (n+1) and their hash values into memory;
  • in that case, the duplicate retrieval unit is specifically configured to compare each data block in the first data packet with the data block having the same hash value in the storage areas numbered n and (n+1), to complete the duplicate block retrieval for the data blocks in the first data packet.
  • To obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold, the hash calculation unit is specifically configured to obtain, from the hash value storage table, a hash value whose number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, and to use that hash value as the first hash value.
  • According to the method and device for duplicate data retrieval provided by the embodiments of the present invention, the received data is first divided into blocks and the blocks are then grouped. A similarity hash operation is performed on the data blocks in a data packet to obtain the hash value of the data packet. A first hash value whose similarity to the hash value of the data packet is greater than or equal to a preset first similarity threshold is then obtained from the hash values, held in the hash value storage table, of the data packets already stored in the data storage space. It is then determined whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to a preset second similarity threshold. If it is, the data blocks in the data packet are likely, to a certain extent, to be duplicate blocks, and duplicate block retrieval is performed on them.
  • Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing duplicate block retrieval per data packet also reduces the number of duplicate block retrievals, that is, the number of interactions with the disk, which improves the efficiency of duplicate block queries and therefore the overall performance of deduplication.
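The two-threshold decision summarized above can be sketched as follows. Several details are assumptions not stated in the source: similarity is measured as the fraction of identical bits, the most similar candidate is chosen as the first hash value, and a packet with no candidate at all is simply stored. The names `similarity` and `decide` are hypothetical.

```python
def similarity(a: int, b: int, bits: int) -> float:
    """Fraction of bit positions in which two hashes agree (assumed measure)."""
    return (bits - bin(a ^ b).count("1")) / bits


def decide(packet_hash, table, first_threshold, second_threshold, bits):
    """Return ("retrieve", packet_id) or ("store", None) for one data packet."""
    best = None  # (stored_hash, similarity) of the best candidate so far
    for stored_hash in table:
        score = similarity(stored_hash, packet_hash, bits)
        if score >= first_threshold and (best is None or score > best[1]):
            best = (stored_hash, score)
    if best is not None and best[1] >= second_threshold:
        # Similar enough: run duplicate block retrieval against that packet.
        return "retrieve", table[best[0]]
    # Otherwise the packet's blocks and hashes are stored as new data.
    return "store", None


table = {0b11110000: 3, 0b00001111: 7}  # stored packet hash -> packet id
near = decide(0b11110001, table, 0.5, 0.75, bits=8)
far = decide(0b11000011, table, 0.5, 0.75, bits=8)
```

Only packets that pass the second threshold trigger block-level comparison, so the number of disk interactions scales with the number of similar packets rather than the number of blocks.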
  • FIG. 1 is a flowchart of a method for duplicate data retrieval according to an embodiment of the present invention;
  • FIG. 2 is a schematic diagram of a similarity hash operation process according to an embodiment of the present invention;
  • FIG. 3 is a schematic structural diagram of a duplicate data retrieval device according to an embodiment of the present invention;
  • FIG. 4 is a schematic structural diagram of a duplicate data retrieval device according to another embodiment of the present invention;
  • FIG. 6 is a schematic structural diagram of a computer program product according to an embodiment of the present invention.
  • FIG. 1 is a flowchart of a method for duplicate data retrieval according to an embodiment of the present invention. As shown in FIG. 1, the method of this embodiment includes:
  • Step 101: Perform block processing on the received data to obtain at least two data blocks.
  • The executor of this embodiment may be a duplicate data retrieval device. In implementation, the device may be any device with computing capability, for example a server or computer in a data backup environment, or a terminal, gateway, or base station in a WAN data transmission scenario.
  • After receiving the data to be stored, the duplicate data retrieval device first divides the data into blocks to obtain at least two data blocks.
  • The duplicate data retrieval device may perform block processing on the data using, for example but not limited to, a fixed-sized partition (FSP) algorithm, a content-defined chunking (CDC) algorithm, which produces variable-size blocks, or a sliding block algorithm.
  • The size of a data block depends on the blocking algorithm used and on the requirements of the actual application; the embodiments of the present invention do not limit the specific value.
  • The process of performing block processing on data with these blocking algorithms is prior art and is not described in detail here.
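For orientation only, the difference between fixed-size and content-defined blocking can be sketched as below. This is not taken from the patent, which leaves these algorithms to the prior art: the byte-sum fingerprint is a toy stand-in for the rolling hash a real content-defined chunker would use, and all names and parameter values are assumptions.

```python
def fsp_chunks(data: bytes, size: int):
    """Fixed-size partition: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes, window=4, divisor=16, min_size=4, max_size=64):
    """Content-defined chunking: cut where a fingerprint of recent bytes matches."""
    chunks, start = [], 0
    for i in range(len(data)):
        size = i - start + 1
        if size < min_size:
            continue
        # Toy fingerprint: sum of the last `window` bytes of the current chunk.
        fingerprint = sum(data[max(start, i - window + 1):i + 1])
        if fingerprint % divisor == divisor - 1 or size >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes form the last chunk
    return chunks


fixed = fsp_chunks(b"abcdefghij", 4)
variable = cdc_chunks(bytes(range(256)) * 2)
```

Because the cut points in `cdc_chunks` depend on the bytes themselves, an insertion near the start of the data tends to shift only nearby boundaries, whereas it shifts every boundary in `fsp_chunks`.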
  • Step 102: Group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block.
  • After performing block processing on the data to obtain the data blocks, the duplicate data retrieval device groups the obtained data blocks to obtain data packets; the number of data packets may be smaller than the number of data blocks.
  • Grouping divides the obtained data blocks into different data packets, and the specific grouping manner can vary.
  • The duplicate data retrieval device may divide the data blocks in order into at least one data packet, on the principle that each data packet includes the same number of data blocks.
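Equal-count grouping is the simplest of the grouping manners mentioned. A minimal sketch, assuming the blocks are held in a list and that a shorter final packet is acceptable when the count does not divide evenly (the source does not say how a remainder is handled); `group_fixed` is a hypothetical name:

```python
def group_fixed(blocks, per_packet: int):
    """Group consecutive blocks so that each packet holds per_packet blocks."""
    return [blocks[i:i + per_packet] for i in range(0, len(blocks), per_packet)]


packets = group_fixed([b"b0", b"b1", b"b2", b"b3", b"b4"], per_packet=2)
# [[b"b0", b"b1"], [b"b2", b"b3"], [b"b4"]]
```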
  • The duplicate data retrieval device may alternatively apply a blocking algorithm to the divided data blocks to obtain at least one data packet.
  • This implementation includes: forming hash data to be blocked from the hash value of each of the at least two data blocks; and, using the length of the hash value of any one of the at least two data blocks (the hash values of all data blocks have the same length) as the sliding step, performing block processing on the hash data to be blocked with the blocking algorithm to obtain at least one hash value block.
  • The sliding step is the minimum sliding distance when sliding over the hash data to be blocked; a hash value block produced by the blocking algorithm can be obtained by sliding one or more times.
  • Each hash value block is composed of one or more complete hash values. If the sliding distance of a hash value block obtained by the blocking algorithm is several sliding steps (that is, the block is obtained after several slides), the hash value block is composed of several hash values. If the sliding distance is one sliding step (that is, the block is obtained after one slide), the hash value block is composed of one hash value.
  • The data blocks corresponding to the hash values in the same hash value block are then taken as one data packet, so that at least one data packet is obtained. With this grouping manner, the end position of each data packet is the end position of a data block, so the division into data packets is more accurate.
  • The process of performing block processing on the hash data to be blocked with the blocking algorithm is similar to that of existing blocking algorithms and is not described again.
  • Forming the hash data to be blocked from the hash value of each of the at least two data blocks includes: calculating the hash value of each of the at least two data blocks and concatenating these hash values to form the hash data to be blocked.
  • Each data packet is composed of consecutive data blocks.
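Grouping by blocking the concatenated hash data can be sketched as follows. The boundary rule (last byte of the current hash divisible by `divisor`), the `max_hashes` cap, the use of SHA-1, and the function name are all assumptions; the source only requires that the sliding step equal the hash length so that every cut falls between complete hash values.

```python
import hashlib

HASH_LEN = 20  # SHA-1 digest length in bytes, used as the sliding step


def group_by_hash_blocking(blocks, divisor=4, max_hashes=8):
    """Group consecutive data blocks by cutting their concatenated hash data."""
    # Concatenate the per-block hashes to form the hash data to be blocked.
    hash_data = b"".join(hashlib.sha1(block).digest() for block in blocks)
    packets, current = [], []
    for index, offset in enumerate(range(0, len(hash_data), HASH_LEN)):
        window = hash_data[offset:offset + HASH_LEN]  # one complete hash value
        current.append(blocks[index])
        # Assumed boundary rule: cut after this hash if its last byte matches.
        if window[-1] % divisor == 0 or len(current) >= max_hashes:
            packets.append(current)
            current = []
    if current:
        packets.append(current)
    return packets


blocks = [bytes([i]) * 4 for i in range(20)]
packets = group_by_hash_blocking(blocks)
```

Because the window advances one whole hash at a time, a packet always ends at the end of a data block and always holds consecutive blocks, matching the two properties stated above.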
  • The number of data blocks included in each data packet may be the same or different. The number of data blocks in a data packet may be determined according to the actual application; the embodiments of the present invention do not limit the specific value.
  • Performing duplicate block retrieval per data packet reduces the number of duplicate block retrievals and the interaction with the disk, and improves the efficiency of duplicate block retrieval.
  • Step 103: For a first data packet in the at least one data packet, perform a similarity hash (also called similar hash or simhash) operation on the data blocks in the first data packet to obtain a hash value of the first data packet.
  • If the similarity between the hash value of the first data packet and the first hash value is greater than or equal to the preset second similarity threshold, duplicate block retrieval is performed on the data blocks in the first data packet.
  • This embodiment is described by taking any one of the data packets as an example. For ease of distinction it is referred to as the first data packet; that is, the first data packet may be any one of the at least one data packet obtained above.
  • The hash value storage table stores a correspondence between the hash value of each second data packet currently stored in the data storage space and that second data packet. For ease of distinction and description, data packets already stored in the data storage space are referred to as second data packets.
  • The hash value of a second data packet stored in the hash value storage table is calculated in the same way as the hash value of the first data packet in this embodiment; that is, it is also obtained by performing the similarity hash operation on the data blocks in the second data packet. The data blocks corresponding to the stored hash values do not duplicate one another; that is, the data blocks in a second data packet have been determined not to be duplicate blocks.
  • The data storage space is the storage space used to store data blocks, and may be a hard disk, a magnetic disk, or the like.
  • The hash value storage table in this embodiment is much smaller and can therefore be kept in memory, which improves the efficiency of querying the hash value storage table and further improves the efficiency of duplicate block retrieval.
  • The hash value storage table is not limited to being stored in memory and may also be stored on a disk or another storage device, but it is preferably stored in memory. After the duplicate data retrieval device obtains the data packets, it performs the same processing on each data packet. In this embodiment, the first data packet is taken as an example, and the duplicate data retrieval device performs the following processing on the first data packet:
  • A similarity hash operation is performed on the data blocks in the first data packet to obtain the hash value of the first data packet.
  • The principle of similarity hashing is that the higher the similarity between two inputs, the greater the similarity of the calculated hash values, and vice versa.
  • In other words, the similarity hash operation is an operation under which inputs with higher similarity yield hash values with higher similarity.
  • One method for performing the similarity hash operation on the first data packet includes: hashing each data block in the first data packet to obtain a hash value of each data block; representing the hash value of each data block in binary form and converting each bit, so that a bit with value 0 is replaced by -1 and a bit with value 1 remains unchanged; and then accumulating the converted hash values.
  • Specifically, the corresponding bits of the converted hash values are added; each sum greater than 0 is mapped to 1 and each sum less than or equal to 0 is mapped to 0, and the resulting binary value is the hash value of the first data packet.
  • For example, the first data packet includes n data blocks, namely the first data block to the nth data block, and each data block is hashed to obtain a hash value in binary form. FIG. 2 shows that the binary hash values of the first data block, the second data block, and the nth data block are 100110, 110000, and 001001, respectively. Every 0 in the hash value of each data block is then replaced with -1.
  • After the replacement, the hash values of the first data block, the second data block, and the nth data block are (1, -1, -1, 1, 1, -1), (1, 1, -1, -1, -1, -1), and (-1, -1, 1, -1, -1, 1), respectively. Adding the corresponding bits of the converted hash values of all n data blocks finally yields 13, 18, -22, -5, -2, 5. Each value greater than 0 is mapped to 1 and each value less than or equal to 0 is mapped to 0, giving the binary value 110001, which is the hash value of the first data packet.
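  • The bit-wise accumulation just described can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and deriving the per-block bit string from the leading bits of a SHA-1 digest is an assumption, since the text does not fix the block hash function.

```python
import hashlib


def block_hash_bits(block: bytes, nbits: int = 64) -> str:
    """Binary hash string of one data block (leading SHA-1 bits; an assumption)."""
    value = int.from_bytes(hashlib.sha1(block).digest(), "big") >> (160 - nbits)
    return format(value, f"0{nbits}b")


def column_sums(bit_strings):
    """Replace 0 with -1, keep 1, and add the corresponding bits of all hashes."""
    sums = [0] * len(bit_strings[0])
    for bits in bit_strings:
        for i, bit in enumerate(bits):
            sums[i] += 1 if bit == "1" else -1
    return sums


def sums_to_hash(sums) -> str:
    """Map each sum greater than 0 to 1 and each sum less than or equal to 0 to 0."""
    return "".join("1" if s > 0 else "0" for s in sums)


def packet_hash(bit_strings) -> str:
    """Similarity hash of a data packet from the binary hashes of its blocks."""
    return sums_to_hash(column_sums(bit_strings))


# The accumulated sums from the example in the text map to the packet hash 110001.
print(sums_to_hash([13, 18, -22, -5, -2, 5]))
```

  • Note that the sums 13, 18, -22, -5, -2, 5 in the text are accumulated over all n data blocks; the three hash values shown explicitly would on their own sum to 1, -1, -1, -1, -1, -1.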
  • Other similarity hash operations, such as a perceptual hash algorithm, may also be used to perform the similarity hash operation on the data blocks in the first data packet in this embodiment.
  • The principle of the perceptual hash operation is to generate a "fingerprint" string for each picture and then compare the fingerprints of different pictures; the more similar the fingerprints, the more similar the pictures. Applied to the duplicate data retrieval method of this embodiment, the principle is to calculate a hash value for each data packet and then compare the hash values of different data packets: the higher the similarity between two hash values, the more data blocks in the two data packets are likely to be duplicates (that is, the greater the similarity between the two data packets).
  • A sufficiently high similarity indicates that the data blocks in the data packet are, to a large extent, duplicate blocks; performing duplicate block retrieval only in that case improves the performance of duplicate block retrieval.
  • How the method of this embodiment improves the performance of duplicate block retrieval is explained below.
  • A hash value in the hash value storage table whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold is obtained and recorded as the first hash value.
  • If several hash values have a similarity greater than or equal to the preset first similarity threshold, several hash values may be obtained, each of which is a first hash value. If exactly one hash value has a similarity greater than or equal to the preset first similarity threshold, that hash value is taken as the first hash value; that is, one first hash value is obtained.
  • The hash value in the hash value storage table that has the greatest similarity to the hash value of the first data packet may also be obtained as the first hash value, but the embodiment is not limited thereto.
  • One way to obtain a hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold is as follows: the duplicate data retrieval device obtains, from the hash value storage table, a hash value for which the number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number, and uses it as the first hash value.
  • The number of identical bits at corresponding positions of two hash values represents the similarity of the two hash values: the more identical bits at corresponding positions, the higher the similarity, and vice versa.
  • The preset number here corresponds to the preset first similarity threshold.
  • One implementation in which the duplicate data retrieval device obtains, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to the preset number includes:
  • The duplicate data retrieval device obtains the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and uses each hash value in the table whose Hamming distance is less than or equal to the preset Hamming distance threshold as a first hash value. A smaller Hamming distance indicates a higher degree of repetition between the first data packet and the second data packet corresponding to that hash value.
  • Besides the Hamming distance, other parameters capable of representing the similarity of two hash values may be used.
  • The preset Hamming distance threshold here corresponds to the preset number above.
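  • As a rough sketch of the lookup described above, the following code selects candidate first hash values by Hamming distance. The representation of the hash value storage table as a dictionary from packet hash (as an integer) to a storage area number, and the names `find_first_hashes` and `max_distance`, are assumptions for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")


def matching_bits(a: int, b: int, width: int) -> int:
    """Number of identical bits at corresponding positions of two width-bit hashes."""
    return width - hamming_distance(a, b)


def find_first_hashes(packet_hash: int, table: dict, max_distance: int) -> list:
    """Return stored hashes within `max_distance` of `packet_hash`, closest first.

    `table` is a hypothetical in-memory hash value storage table.
    """
    candidates = (h for h in table if hamming_distance(packet_hash, h) <= max_distance)
    return sorted(candidates, key=lambda h: hamming_distance(packet_hash, h))
```

  • For hash values of a fixed width, requiring at least `k` identical bits is the same as requiring a Hamming distance of at most `width - k`, which is why the Hamming distance threshold and the preset number are interchangeable.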
  • The duplicate data retrieval device compares the similarity between the hash value of the first data packet and the first hash value with a preset second similarity threshold to determine whether duplicate block retrieval needs to be performed on the first data packet. If the similarity is greater than or equal to the second similarity threshold, the degree of repetition between the first data packet and the second data packet corresponding to the first hash value is very high, and it can be determined that the two share many duplicate blocks, so duplicate block retrieval needs to be performed on the first data packet.
  • The second similarity threshold may be a threshold on the number of identical bits.
  • Accordingly, comparing the similarity between the hash value of the first data packet and the first hash value with the preset second similarity threshold may be: the duplicate data retrieval device determines whether the number of identical bits at corresponding positions of the hash value of the first data packet and the first hash value is greater than or equal to the preset threshold on the number of identical bits.
  • The second similarity threshold is greater than or equal to the first similarity threshold.
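  • The two-threshold decision can be sketched as below. This is a simplified illustration, not the claimed method: it picks the single most similar candidate as the first hash value (one of the options the text allows), and it treats the case where no stored hash reaches the first threshold as "no retrieval needed", which is an assumption. All names are hypothetical.

```python
def similarity(a: int, b: int, width: int) -> int:
    """Similarity measured as the number of identical bits at corresponding positions."""
    return width - bin(a ^ b).count("1")


def should_deduplicate(packet_hash, stored_hashes, width, first_threshold, second_threshold):
    """Return (first_hash, needs_retrieval) for one data packet."""
    if second_threshold < first_threshold:
        raise ValueError("second threshold must be >= first threshold")

    # Candidates: stored hashes that reach the first similarity threshold.
    candidates = [h for h in stored_hashes
                  if similarity(packet_hash, h, width) >= first_threshold]
    if not candidates:
        return None, False  # assumption: treat the packet as new data

    # Use the most similar candidate as the first hash value.
    first_hash = max(candidates, key=lambda h: similarity(packet_hash, h, width))
    return first_hash, similarity(packet_hash, first_hash, width) >= second_threshold
```

  • With 6-bit hashes, a packet hash of 110001, and stored hashes 110011 and 000001, thresholds of 4 and 5 select 110011 and request duplicate block retrieval, while raising the second threshold to 6 selects the same hash but skips retrieval.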
  • The data storage space includes multiple storage areas. Each storage area has a number, and the storage areas are used in ascending order of number.
  • The hash value storage table also stores the correspondence between the hash value of each second data packet and the number of the storage area in which that second data packet is located, so the storage area holding the second data packet corresponding to a given hash value can be determined from this correspondence.
  • The process of performing duplicate block retrieval on the first data packet may be: the duplicate data retrieval device obtains the number n of the storage area corresponding to the first hash value from the hash value storage table, and loads the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; it then compares the data blocks in the first data packet with the data blocks in the storage area numbered n that have the same hash value, to complete duplicate block retrieval on the data blocks in the first data packet.
  • Further, comparing the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n may be: comparing the data blocks in the first data packet with the data blocks having the same hash value in the storage areas numbered n and (n+1), to complete duplicate block retrieval on the data blocks in the first data packet.
  • This comparison may proceed as follows. First, the hash value of each data block in the first data packet is compared with the hash values in the storage areas numbered n and (n+1) to find the hash values in the first data packet that also occur in those storage areas; such a hash value is referred to as a second hash value. Then, the data block corresponding to the second hash value in the first data packet is compared with the data block corresponding to the second hash value in the storage area numbered n or (n+1), to complete duplicate block retrieval on the data blocks in the first data packet.
  • The storage area numbered (n+1) is the storage area that follows the storage area numbered n; that is, after the storage area numbered n is filled, data continues to be written to the storage area numbered (n+1). Data received next may duplicate data in the storage area that follows the one corresponding to the first hash value (that is, the storage area numbered (n+1)). Therefore, loading the contents of the storage area corresponding to the first hash value (numbered n) and of the next storage area into memory at one time helps improve the efficiency of subsequent duplicate block retrieval and thus the overall efficiency of duplicate block retrieval.
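  • The retrieval against storage areas n and (n+1) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: each storage area is modeled as a dictionary from block hash to block bytes, SHA-1 stands in for the block hash function, and the name `find_duplicates` is hypothetical.

```python
import hashlib


def find_duplicates(packet_blocks, areas, n):
    """Split a packet's blocks into duplicates and new blocks.

    `areas` is a list indexed by storage area number; each entry maps a
    block hash to the stored block (a hypothetical in-memory model).
    """
    # Load area n and, if it exists, the following area n + 1 in one pass.
    loaded = {}
    for number in (n, n + 1):
        if number < len(areas):
            loaded.update(areas[number])

    duplicates, new_blocks = [], []
    for block in packet_blocks:
        h = hashlib.sha1(block).digest()
        stored = loaded.get(h)  # a hit here is a "second hash value"
        # Compare the block contents as well, not only the hashes.
        if stored is not None and stored == block:
            duplicates.append(block)
        else:
            new_blocks.append(block)
    return duplicates, new_blocks
```

  • Comparing hashes first keeps the byte-level comparison limited to blocks whose hashes already match.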
  • A preferred way of storing data in the storage areas is as follows: data blocks are stored together in one storage area in the order in which they are received, and when that storage area is full, subsequently received data blocks are stored in the next storage area.
  • Each storage area is a storage space of a certain size, for example, but not limited to, 64 MB.
  • Each storage area stores both the data blocks and the hash values of the data blocks; the specific storage manner is not limited.
  • A preferred layout of a storage area is as follows: the storage area is divided into two parts. One part is a data segment area, which stores the data blocks. The other part is a metadata area, which stores the metadata corresponding to the data blocks in the data segment area. The metadata includes the hash value of the data block, the length of the data block, the length of the data segment, a check code, and so on.
  • The embodiments of the present invention mainly use the hash value of the data block in the metadata.
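  • The storage area layout and the fill-then-advance policy can be modeled with a small sketch. Everything beyond what the text states is an assumption: the class and field names are hypothetical, SHA-1 and CRC32 stand in for the block hash and the check code, and an `offset` field is added so a block can be located inside the data segment.

```python
import hashlib
import zlib
from dataclasses import dataclass, field


@dataclass
class BlockMeta:
    block_hash: bytes  # hash value of the data block
    offset: int        # position in the data segment (added for illustration)
    length: int        # length of the data block
    checksum: int      # check code; CRC32 is an assumption


@dataclass
class StorageArea:
    number: int
    capacity: int = 64 * 1024 * 1024  # 64 MB, the example size from the text
    data_segment: bytearray = field(default_factory=bytearray)
    metadata: list = field(default_factory=list)

    def has_room(self, block: bytes) -> bool:
        return len(self.data_segment) + len(block) <= self.capacity

    def append(self, block: bytes) -> BlockMeta:
        meta = BlockMeta(hashlib.sha1(block).digest(), len(self.data_segment),
                         len(block), zlib.crc32(block))
        self.data_segment += block
        self.metadata.append(meta)
        return meta


class DataStorageSpace:
    """Storage areas used in ascending order of number."""

    def __init__(self, capacity: int = 64 * 1024 * 1024):
        self.capacity = capacity
        self.areas = [StorageArea(0, capacity)]

    def store(self, block: bytes) -> int:
        """Store a block in the current area, moving to the next area when full."""
        area = self.areas[-1]
        if not area.has_room(block):
            area = StorageArea(area.number + 1, self.capacity)
            self.areas.append(area)
        area.append(block)
        return area.number
```

  • The sketch does not handle a single block larger than a storage area; such a block is simply written to a fresh area.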
  • If the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, the degree of repetition between the first data packet and the second data packet corresponding to the first hash value is not high, and it can be determined that the two share no duplicate blocks or very few. For example, only one or two data blocks in the first data packet may duplicate data blocks in the second data packet corresponding to the first hash value. To improve overall performance, the data blocks in the first data packet may be treated as new data, that is, stored directly into the data storage space without duplicate block retrieval. Further, if the data storage space includes multiple storage areas, the duplicate data retrieval device can store the data blocks in the first data packet and the hash values of those data blocks directly into the storage area currently in use.
  • In summary, the duplicate data retrieval method of this embodiment first chunks and groups the received data, performs a similarity hash operation on the data blocks in a data packet to obtain the hash value of the data packet, and then obtains, from the hash values of stored data packets recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold. It then determines whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold; if so, the data blocks in the data packet are largely duplicate blocks, and duplicate block retrieval is performed. Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing duplicate block retrieval per data packet also reduces the number of duplicate block retrievals and hence the frequency of disk interaction, which improves duplicate block query efficiency and the overall performance of deduplication.
  • FIG. 3 is a schematic structural diagram of a duplicate data retrieval device according to an embodiment of the present invention.
  • The duplicate data retrieval device in this embodiment may, in a specific implementation, be any device having computing and storage capabilities, for example, a server or a computer in a data backup environment, or a terminal, a gateway, or a base station in a WAN data transmission scenario. The embodiments of the present invention do not limit the specific implementation of the duplicate data retrieval device.
  • The device in this embodiment includes: a block obtaining module 31, a packet obtaining module 32, a hash calculation module 33, and a duplicate retrieval module 34.
  • The block obtaining module 31 is configured to chunk the received data to obtain at least two data blocks.
  • The packet obtaining module 32 is connected to the block obtaining module 31 and is configured to group the at least two data blocks obtained by the block obtaining module 31 to obtain at least one data packet, where each data packet includes at least one data block.
  • The hash calculation module 33 is connected to the packet obtaining module 32 and is configured to: for a first data packet in the at least one data packet obtained by the packet obtaining module 32, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain a first hash value in the hash value storage table whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold.
  • The hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet. The hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet. The first data packet is any one of the at least one data packet.
  • The duplicate retrieval module 34 is connected to the hash calculation module 33 and is configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation module 33 is greater than or equal to a preset second similarity threshold.
  • The duplicate data retrieval device of this embodiment further includes a storage module 35.
  • The storage module 35 is connected to the hash calculation module 33 and is configured to: when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation module 33 is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • The hash calculation module 33 performs the same actions for each data packet.
  • The packet obtaining module 32 is specifically configured to: form the hash data to be chunked from the hash value of each of the at least two data blocks obtained by the block obtaining module 31; use the length of the hash value of any one of the at least two data blocks as the sliding step size and chunk the hash data to be chunked with the chunking algorithm to obtain at least one hash value block; and take the data blocks corresponding to the hash values in the same hash value block as one data packet, thereby obtaining at least one data packet.
  • That the hash calculation module 33 is configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the hash calculation module 33 is specifically configured to hash each data block in the first data packet to obtain the hash value of each data block; replace every 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit whose sum is greater than 0 to 1 and each bit whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas.
  • The hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • The duplicate retrieval module 34 is specifically configured to: obtain the number n of the storage area corresponding to the first hash value from the hash value storage table, and load the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n, to complete duplicate block retrieval on the data blocks in the first data packet.
  • The duplicate retrieval module 34 is further configured to: when loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, also load the data blocks and the hash values of the data blocks in the storage area numbered (n+1) into memory.
  • In this case, that the duplicate retrieval module 34 is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n includes: the duplicate retrieval module 34 is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage areas numbered n and (n+1), to complete duplicate block retrieval on the data blocks in the first data packet.
  • That the hash calculation module 33 is configured to obtain a first hash value in the hash value storage table whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: the hash calculation module 33 is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at corresponding positions with the hash value of the first data packet is greater than or equal to a preset number.
  • That the hash calculation module 33 is specifically configured to obtain such a hash value as the first hash value includes: the hash calculation module 33 is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and use each hash value in the hash value storage table whose Hamming distance is less than or equal to the preset Hamming distance threshold as the first hash value.
  • The functional modules of the duplicate data retrieval device provided by this embodiment can be used to execute the process of the duplicate data retrieval method shown in FIG. 1.
  • Their specific working principles are not described again here. For details, refer to the description of the method embodiments.
  • The duplicate data retrieval device provided in this embodiment first chunks and groups the received data, performs a similarity hash operation on the data blocks in a data packet to obtain the hash value of the data packet, and then obtains, from the hash values of stored data packets recorded in the hash value storage table, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold. It then determines whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold; if so, the data blocks in the data packet are largely duplicate blocks, and duplicate block retrieval is performed. Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing duplicate block retrieval per data packet also reduces the number of duplicate block retrievals and hence the frequency of disk interaction, which improves duplicate block query efficiency and the overall performance of deduplication.
  • FIG. 5 is a schematic structural diagram of a repeated data retrieval device according to another embodiment of the present invention.
  • The duplicate data retrieval device in this embodiment may, in a specific implementation, be any device having computing and storage capabilities, for example, a server or a computer in a data backup environment, or a terminal, a gateway, or a base station in a WAN data transmission scenario. The embodiments of the present invention do not limit the specific implementation of the duplicate data retrieval device.
  • The duplicate data retrieval device of this embodiment includes a processor 51, a memory 52, a communication interface 53, and a bus.
  • The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like.
  • The bus can be divided into an address bus, a data bus, a control bus, and the like. For ease of representation, only one thick line is shown in FIG. 5, but this does not mean that there is only one bus or one type of bus.
  • The communication interface 53 is configured to receive data.
  • The processor 51 is configured to execute a program.
  • The program may include program code, and the program code includes computer operating instructions.
  • The processor 51 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
  • The memory 52 is configured to store the program.
  • The memory 52 may include high-speed RAM and may also include non-volatile memory, such as at least one disk memory.
  • The program may be specifically configured to: chunk the data received by the communication interface 53 to obtain at least two data blocks; group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block; for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain a first hash value in the hash value storage table whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between the hash value of a second data packet already stored in the data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and, if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, perform duplicate block retrieval on the data blocks in the first data packet.
  • The program stored in the memory 52 is further configured to: when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • That the program stored in the memory 52 is configured to group the at least two data blocks to obtain at least one data packet includes: the program is specifically configured to form the hash data to be chunked from the hash value of each of the at least two data blocks; use the length of the hash value of any one data block as the sliding step size and chunk the hash data to be chunked with the chunking algorithm to obtain at least one hash value block; and take the data blocks corresponding to the hash values in the same hash value block as one data packet.
  • That the program stored in the memory 52 is configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the program is specifically configured to hash each data block in the first data packet to obtain the hash value of each data block; replace every 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit whose sum is greater than 0 to 1 and each bit whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • The data storage space includes a plurality of storage areas, and the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located.
  • That the program stored in the memory 52 is configured to perform duplicate block retrieval on the data blocks in the first data packet includes: the program is specifically configured to obtain the number n of the storage area corresponding to the first hash value from the hash value storage table, and load the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n, to complete duplicate block retrieval on the data blocks in the first data packet.
  • The program stored in the memory 52 is further configured to: when loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, also load the data blocks and the hash values of the data blocks in the storage area numbered (n+1) into memory.
  • In this case, that the program is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area numbered n includes: the program is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage areas numbered n and (n+1), to complete duplicate block retrieval on the data blocks in the first data packet.
  • The program stored in the memory 52 being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold includes: the program is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of bits matching the hash value of the first data packet at corresponding positions is greater than or equal to a preset number.
  • Obtaining such a hash value as the first hash value includes: the program is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
  • The duplicate data retrieval device provided by this embodiment of the present invention can be used to execute the process of the duplicate data retrieval method shown in FIG. 1.
  • The specific working principle is not described again here. For details, refer to the description of the method embodiment.
  • The duplicate data retrieval device provided in this embodiment first divides the received data into data blocks and groups the data blocks into data packets, and performs a similarity hash operation on the data blocks in a data packet to obtain a hash value of the data packet.
  • It then obtains, from the hash value storage table, which stores the hash values of the data packets already stored in the data storage space, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold, and performs duplicate block retrieval on the data blocks in the data packet when the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold.
  • Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing duplicate block retrieval per data packet also reduces the number of duplicate block retrievals, that is, the frequency of disk interaction, which helps improve duplicate block query efficiency and thereby the overall performance of deduplication.
  • An embodiment of the invention provides a computer program product comprising a computer readable storage medium for storing a program.
  • the program includes:
  • the block obtaining unit 81 is configured to perform block processing on the received data to obtain at least two data blocks.
  • the packet obtaining unit 82 is connected to the block obtaining unit 81 and configured to group at least two data blocks acquired by the block obtaining unit 81 to acquire at least one data packet, and each data packet includes at least one data block.
  • The hash calculation unit 83 is connected to the packet obtaining unit 82 and is configured to, for a first data packet in the at least one data packet obtained by the packet obtaining unit 82, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold.
  • The hash value storage table stores a correspondence between a hash value of a second data packet that has already been stored in the data storage space and the second data packet, where the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet.
  • The first data packet is any one of the at least one data packet.
  • The duplicate retrieval unit 84 is connected to the hash calculation unit 83 and is configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
  • The duplicate data retrieval device of this embodiment further includes a storage unit 85.
  • The storage unit 85 is connected to the hash calculation unit 83 and is configured to, when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation unit 83 is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
  • The hash calculation unit 83, the duplicate retrieval unit 84, and the storage unit 85 perform the same operations for each data packet.
  • The packet obtaining unit 82 is specifically configured to form to-be-chunked hash data from the hash values of the at least two data blocks obtained by the block obtaining unit 81; perform chunking on the to-be-chunked hash data using a chunking algorithm, with the length of the hash value of one data block as the sliding step, to obtain at least one hash value chunk; and use the data blocks corresponding to the hash values that belong to the same hash value chunk as one data packet, thereby obtaining the at least one data packet.
  • The hash calculation unit 83 being configured to perform a similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet includes: the hash calculation unit 83 is specifically configured to hash each data block in the first data packet to obtain a hash value of each data block; replace every 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit whose sum is greater than 0 to 1 and each bit whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
  • the data storage space includes a plurality of storage areas.
  • the hash value storage table further stores a correspondence between a hash value of the second data packet and a number of the storage area where the second data packet is located.
  • The duplicate retrieval unit 84 is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; load the data blocks in the storage area corresponding to the number n, together with their hash values, into the memory; and compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area corresponding to the number n, to complete duplicate block retrieval for the data blocks in the first data packet.
  • The duplicate retrieval unit 84 is further configured to load the data blocks in the storage area corresponding to the number (n+1), together with their hash values, into the memory at the same time as those of the storage area corresponding to the number n are loaded.
  • Based on this, the duplicate retrieval unit 84 being specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area corresponding to the number n includes: the duplicate retrieval unit 84 is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage areas corresponding to the number n and the number (n+1), to complete duplicate block retrieval for the data blocks in the first data packet.
  • The hash calculation unit 83 being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold includes: the hash calculation unit 83 is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of bits matching the hash value of the first data packet at corresponding positions is greater than or equal to a preset number.
  • Obtaining such a hash value as the first hash value includes: the hash calculation unit 83 is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
  • The duplicate data retrieval device provided by this embodiment of the present invention can be used to execute the process of the duplicate data retrieval method shown in FIG. 1. The specific working principle is not described again here. For details, refer to the description of the method embodiment.
  • The duplicate data retrieval device provided in this embodiment first divides the received data into data blocks and groups the data blocks into data packets, and performs a similarity hash operation on the data blocks in a data packet to obtain a hash value of the data packet.
  • It then obtains, from the hash value storage table, which stores the hash values of the data packets already stored in the data storage space, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold, and performs duplicate block retrieval on the data blocks in the data packet when the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold.
  • Because the hash value storage table stores the correspondence between the hash values of the data packets already stored in the data storage space and those data packets, and the number of data packets is relatively small, querying the hash value storage table is efficient. Performing duplicate block retrieval per data packet also reduces the number of duplicate block retrievals, that is, the frequency of disk interaction, which helps improve duplicate block query efficiency and thereby the overall performance of deduplication.
  • the aforementioned program can be stored in a computer readable storage medium.
  • When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disk.

Abstract

Provided are a duplicate data retrieval method and device. The method comprises: segmenting received data to acquire at least two data segments; grouping the at least two data segments to obtain at least one data grouping; and, for each data grouping, performing a similarity hash algorithm on the data segments in the data grouping to acquire a hash value of the data grouping, and acquiring, from a hash value storage table, a first hash value whose similarity to the hash value of the data grouping is greater than or equal to a preset first similarity threshold; if the similarity between the hash value of the data grouping and the first hash value is greater than or equal to a preset second similarity threshold, duplicate segment retrieval is performed on the data segments in the data grouping. The technical solution of the present invention increases the search efficiency for duplicate segments, improving the overall performance of the duplicate data deletion technique.

Description

Duplicate data retrieval method and device

Technical Field

The present invention relates to storage technologies, and in particular to a duplicate data retrieval method and device.

Background
Data deduplication (De-duplication) is a data reduction technique that aims to reduce the storage capacity used in a storage system or the amount of data transmitted over a network. It is widely used in data backup and wide area network (WAN) data transmission scenarios. The deduplication process is as follows: the input data is divided into blocks, a hash value is calculated for each block, and the calculated hash value is looked up in a single-instance repository to determine whether the block is a duplicate block. If it is a duplicate block, the block and its hash value are not stored in the single-instance repository again, which reduces the amount of data.
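The baseline process described above can be sketched as follows. This is a minimal illustration only: fixed-size chunking, SHA-1 as the block hash, and an in-memory dict standing in for the single-instance repository are all assumptions, since the text does not specify any of them.

```python
import hashlib


def split_fixed(data: bytes, size: int = 4) -> list[bytes]:
    """Divide the input into fixed-size blocks (assumed chunking scheme)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def deduplicate(data: bytes, repository: dict[bytes, bytes], size: int = 4) -> int:
    """Store only blocks whose hash is not yet in the repository.

    Returns the number of newly stored blocks.
    """
    stored = 0
    for block in split_fixed(data, size):
        digest = hashlib.sha1(block).digest()  # hash value of the block
        if digest in repository:
            continue  # duplicate block: do not store it again
        repository[digest] = block
        stored += 1
    return stored
```

In a real system the repository lookup is the step that hits the disk, which is the bottleneck the following paragraph describes.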
The single-instance repository is usually too large to fit entirely in memory, so it is usually kept on disk. Determining whether a block is a duplicate block therefore requires frequent disk access. Because disk access is slow, duplicate block queries are inefficient, which degrades the overall performance of deduplication.

Summary of the Invention
Embodiments of the present invention provide a duplicate data retrieval method and device to improve duplicate block query efficiency and the overall performance of deduplication.
A first aspect provides a duplicate data retrieval method, including:
performing block processing on received data to obtain at least two data blocks;
grouping the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block;
for a first data packet in the at least one data packet, performing a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between a hash value of a second data packet that has already been stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and
if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, performing duplicate block retrieval on the data blocks in the first data packet.
In a first possible implementation of the first aspect, the duplicate data retrieval method further includes: if the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, storing the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and storing the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
With reference to the first aspect or the first possible implementation of the first aspect, in a second possible implementation of the first aspect, grouping the at least two data blocks to obtain at least one data packet includes: forming to-be-chunked hash data from the hash value of each of the at least two data blocks; performing chunking on the to-be-chunked hash data using a chunking algorithm, with the length of the hash value of any one data block as the sliding step, to obtain at least one hash value chunk; and using the data blocks corresponding to the hash values that belong to the same hash value chunk as one data packet.
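One way to picture this grouping step is the sketch below. Because the sliding step equals the length of one block hash, every candidate boundary falls between two hashes, so the loop simply visits the hashes one at a time. The boundary rule (hash value divisible by `divisor`) and the `max_group` cap are hypothetical stand-ins for the unspecified chunking algorithm.

```python
def group_blocks(block_hashes: list[bytes], divisor: int = 4,
                 max_group: int = 64) -> list[list[int]]:
    """Group block indices by chunking the sequence of block hashes.

    A group is closed after any hash whose integer value is divisible by
    `divisor` (assumed content-defined boundary rule) or once it reaches
    `max_group` members.
    """
    groups: list[list[int]] = []
    current: list[int] = []
    for index, digest in enumerate(block_hashes):
        current.append(index)
        at_boundary = int.from_bytes(digest, "big") % divisor == 0
        if at_boundary or len(current) >= max_group:
            groups.append(current)
            current = []
    if current:  # trailing hashes form the last group
        groups.append(current)
    return groups
```

With a content-defined rule of this kind, the same run of block hashes tends to produce the same groups even when unrelated data is inserted earlier in the stream, which is what makes group-level hashes comparable across backups.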
With reference to the first aspect or the first or second possible implementation of the first aspect, in a third possible implementation of the first aspect, performing the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: hashing each data block in the first data packet to obtain a hash value of each data block; replacing every 0 in the hash value of each data block with -1; adding the corresponding bits of the hash values of all data blocks in the first data packet; mapping each bit whose sum is greater than 0 to 1 and each bit whose sum is less than or equal to 0 to 0; and using the resulting binary value as the hash value of the first data packet.
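The bit-voting rule in this implementation can be sketched as follows. The function name and the representation of block hashes as Python integers of a fixed bit width are assumptions made for illustration.

```python
def packet_simhash(block_hashes: list[int], bits: int) -> int:
    """Combine per-block hashes into one similarity hash for the packet.

    Each 1 bit votes +1 and each 0 bit votes -1 (the "replace 0 with -1"
    step); a position whose total is greater than 0 becomes 1, otherwise 0.
    """
    sums = [0] * bits
    for value in block_hashes:
        for position in range(bits):
            sums[position] += 1 if (value >> position) & 1 else -1
    result = 0
    for position, total in enumerate(sums):
        if total > 0:  # ties (total == 0) map to 0, as in the text
            result |= 1 << position
    return result
```

For example, the 4-bit hashes 1100, 1010 and 1001 agree only on the top bit, so the packet hash is 1000. Packets that share most of their blocks produce hashes that differ in few bit positions.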
With reference to the first aspect or any one of the first to third possible implementations of the first aspect, in a fourth possible implementation of the first aspect, the data storage space includes a plurality of storage areas, and the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located;
performing duplicate block retrieval on the data blocks in the first data packet includes: obtaining, from the hash value storage table, the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; loading the data blocks in the storage area corresponding to the number n, together with their hash values, into memory; and comparing the data blocks in the first data packet with the data blocks having the same hash value in the storage area corresponding to the number n, to complete duplicate block retrieval for the data blocks in the first data packet.
With reference to the fourth possible implementation of the first aspect, in a fifth possible implementation of the first aspect, the method further includes: while loading the data blocks in the storage area corresponding to the number n and their hash values into memory, also loading the data blocks in the storage area corresponding to the number (n+1) and their hash values into memory;
comparing the data blocks in the first data packet with the data blocks having the same hash value in the storage area corresponding to the number n, to complete duplicate block retrieval for the data blocks in the first data packet, includes: comparing the data blocks in the first data packet with the data blocks having the same hash value in the storage areas corresponding to the number n and the number (n+1), to complete duplicate block retrieval for the data blocks in the first data packet.
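The storage-area lookup in the fourth and fifth implementations can be sketched as below. The data layout is an assumption: each storage area is modeled as a dict from block hash to block data, `areas` maps area numbers to those dicts, and "loading into memory" is modeled by copying the entries into a local cache.

```python
def find_duplicate_blocks(packet: list[tuple[bytes, bytes]],
                          areas: dict[int, dict[bytes, bytes]],
                          n: int) -> list[int]:
    """Return indices of packet blocks already present in area n or n+1.

    `packet` is a list of (block_hash, block_data) pairs.
    """
    cache: dict[bytes, bytes] = {}
    for number in (n, n + 1):  # load area n and, alongside it, area n+1
        cache.update(areas.get(number, {}))
    duplicates = []
    for index, (digest, data) in enumerate(packet):
        stored = cache.get(digest)
        # Compare the blocks that share a hash value to confirm the match.
        if stored is not None and stored == data:
            duplicates.append(index)
    return duplicates
```

Only the two selected areas are read, so the cost of the lookup is bounded by their size rather than by the size of the whole data storage space.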
With reference to the first aspect or any one of the first to fifth possible implementations of the first aspect, in a sixth possible implementation of the first aspect, obtaining, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: obtaining, as the first hash value, a hash value in the hash value storage table for which the number of bits matching the hash value of the first data packet at corresponding positions is greater than or equal to a preset number.
With reference to the sixth possible implementation of the first aspect, in a seventh possible implementation of the first aspect, obtaining, as the first hash value, a hash value in the hash value storage table for which the number of matching bits is greater than or equal to the preset number includes: obtaining the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and using, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
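A minimal sketch of this lookup follows, assuming hash values are held as Python integers and the hash value storage table is a plain list; both function names are illustrative. For hashes of a fixed width b, requiring at least k matching bits is the same as requiring a Hamming distance of at most b - k, which is why the two implementations are interchangeable.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def find_similar_hashes(packet_hash: int, table: list[int],
                        max_distance: int) -> list[int]:
    """Return table entries within max_distance bits of packet_hash."""
    return [stored for stored in table
            if hamming_distance(packet_hash, stored) <= max_distance]
```

A linear scan like this is affordable because the table holds one entry per data packet rather than one per data block.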
A second aspect provides a duplicate data retrieval device, including:
a block obtaining module, configured to perform block processing on received data to obtain at least two data blocks;
a packet obtaining module, configured to group the at least two data blocks obtained by the block obtaining module to obtain at least one data packet, where each data packet includes at least one data block;
a hash calculation module, configured to, for a first data packet in the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between a hash value of a second data packet that has already been stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and
a duplicate retrieval module, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
In a first possible implementation of the second aspect, the duplicate data retrieval device further includes: a storage module, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
With reference to the second aspect or the first possible implementation of the second aspect, in a second possible implementation of the second aspect, the packet obtaining module is specifically configured to form to-be-chunked hash data from the hash value of each of the at least two data blocks; perform chunking on the to-be-chunked hash data using a chunking algorithm, with the length of the hash value of any one data block as the sliding step, to obtain at least one hash value chunk; and use the data blocks corresponding to the hash values that belong to the same hash value chunk as one data packet.
With reference to the second aspect or the first or second possible implementation of the second aspect, in a third possible implementation of the second aspect, the hash calculation module being configured to perform the similarity hash operation on the data blocks in the first data packet and obtain the hash value of the first data packet includes:
the hash calculation module is specifically configured to hash each data block in the first data packet to obtain a hash value of each data block; replace every 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit whose sum is greater than 0 to 1 and each bit whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
With reference to the second aspect or any one of the first to third possible implementations of the second aspect, in a fourth possible implementation of the second aspect, the data storage space includes a plurality of storage areas, and the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area where the second data packet is located;
the duplicate retrieval module is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; load the data blocks in the storage area corresponding to the number n, together with their hash values, into memory; and compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area corresponding to the number n, to complete duplicate block retrieval for the data blocks in the first data packet.
With reference to the fourth possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the duplicate retrieval module is further configured to, while loading the data blocks in the storage area corresponding to the number n and their hash values into memory, also load the data blocks in the storage area corresponding to the number (n+1) and their hash values into memory;
the duplicate retrieval module being specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage area corresponding to the number n, to complete duplicate block retrieval for the data blocks in the first data packet, includes: the duplicate retrieval module is specifically configured to compare the data blocks in the first data packet with the data blocks having the same hash value in the storage areas corresponding to the number n and the number (n+1), to complete duplicate block retrieval for the data blocks in the first data packet.
With reference to the second aspect or any one of the first to fifth possible implementations of the second aspect, in a sixth possible implementation of the second aspect, the hash calculation module being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: the hash calculation module is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of bits matching the hash value of the first data packet at corresponding positions is greater than or equal to a preset number.
With reference to the sixth possible implementation of the second aspect, in a seventh possible implementation of the second aspect, the hash calculation module being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of matching bits is greater than or equal to the preset number includes: the hash calculation module is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
A third aspect provides a duplicate data retrieval device, including a processor, a communication interface, a memory, and a bus, where the processor, the communication interface, and the memory communicate with one another through the bus;
the communication interface is configured to receive data;
the processor is configured to execute a program;
the memory is configured to store the program;
where the program is configured to: perform block processing on the data received by the communication interface to obtain at least two data blocks; group the at least two data blocks to obtain at least one data packet, each data packet including at least one data block; for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between a hash value of a second data packet already stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, perform duplicate block retrieval on the data blocks in the first data packet.
In a first possible implementation of the third aspect, the program is further configured to: when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
With reference to the third aspect or the first possible implementation of the third aspect, in a second possible implementation of the third aspect, the program being configured to group the at least two data blocks to obtain at least one data packet includes: the program is specifically configured to form to-be-blocked hash data from the hash value of each of the at least two data blocks; perform block processing on the to-be-blocked hash data by using a blocking algorithm, with the length of the hash value of any one of the data blocks as the sliding step, to obtain at least one hash value block; and use the data blocks corresponding to the hash values that belong to a same hash value block as one data packet.
With reference to the third aspect or the first or second possible implementation of the third aspect, in a third possible implementation of the third aspect, the program being configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the program is specifically configured to perform a hash operation on each data block in the first data packet to obtain a hash value of each data block; replace each 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
With reference to the third aspect or any one of the first to third possible implementations of the third aspect, in a fourth possible implementation of the third aspect, the data storage space includes a plurality of storage areas, and the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located;
the program performing duplicate block retrieval on the data blocks in the first data packet includes: the program is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n, together with the hash values of those data blocks, into memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks in the storage area numbered n that have the same hash values, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the fourth possible implementation of the third aspect, in a fifth possible implementation of the third aspect, the program is further configured to: while loading the data blocks in the storage area numbered n and the hash values of those data blocks into memory, also load the data blocks in the storage area numbered (n+1) and the hash values of those data blocks into memory;
the program being specifically configured to compare the data blocks in the first data packet with the data blocks in the storage area numbered n that have the same hash values, to complete the duplicate block retrieval on the data blocks in the first data packet, includes: the program is specifically configured to compare the data blocks in the first data packet with the data blocks in the storage areas numbered n and (n+1) that have the same hash values, to complete the duplicate block retrieval on the data blocks in the first data packet.
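The region loading and comparison described above can be sketched as follows. This is a minimal illustration rather than the device's actual implementation: the dictionary-based `storage` layout and the names `load_regions` and `find_duplicates` are hypothetical, and the byte-for-byte confirmation of a hash match is one possible reading of "compare".

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the data storage space:
# storage area number -> {block hash: block bytes}.
Region = Dict[bytes, bytes]


def load_regions(storage: Dict[int, Region], n: int) -> Region:
    """Load storage area n and, if it exists, storage area n + 1 into one table."""
    loaded: Region = {}
    for number in (n, n + 1):
        loaded.update(storage.get(number, {}))
    return loaded


def find_duplicates(blocks: List[Tuple[bytes, bytes]], loaded: Region) -> List[bool]:
    """For each (hash, data) block, report whether it duplicates a loaded block."""
    result = []
    for block_hash, data in blocks:
        stored = loaded.get(block_hash)
        # An equal hash only nominates a candidate; the bytes are compared to confirm.
        result.append(stored is not None and stored == data)
    return result
```

Loading area (n+1) together with area n presumably helps when a data packet's blocks straddle two adjacent areas; the source does not state the motivation, so treat that reading as an assumption.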
With reference to the third aspect or any one of the first to fifth possible implementations of the third aspect, in a sixth possible implementation of the third aspect, the program being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: the program is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to a preset number, and to use that hash value as the first hash value.
With reference to the sixth possible implementation of the third aspect, in a seventh possible implementation of the third aspect, the program being specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to the preset number, and to use that hash value as the first hash value, includes: the program is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
A fourth aspect provides a computer program product, including a computer-readable storage medium configured to store a program, where the program includes:
a block obtaining unit, configured to perform block processing on received data to obtain at least two data blocks;
a packet obtaining unit, configured to group the at least two data blocks obtained by the block obtaining unit to obtain at least one data packet, each data packet including at least one data block;
a hash calculation unit, configured to: for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet, and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, where the hash value storage table stores a correspondence between a hash value of a second data packet already stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and
a duplicate retrieval unit, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
In a first possible implementation of the fourth aspect, the program further includes a storage unit, configured to: when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
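The two-threshold decision made by the duplicate retrieval unit and the storage unit can be sketched as below. This is only an illustration under stated assumptions: similarity is measured as the number of matching bit positions (one of the options the text allows), the most similar stored hash is chosen as the first hash value, and a packet with no candidate at all is treated as new. The function names and the `"retrieve"`/`"store"` return values are hypothetical.

```python
from typing import Iterable, Optional


def matching_bits(a: int, b: int, width: int) -> int:
    """Count the bit positions (out of width) at which a and b agree."""
    mask = (1 << width) - 1
    return width - bin((a ^ b) & mask).count("1")


def select_first_hash(packet_hash: int, stored_hashes: Iterable[int],
                      first_threshold: int, width: int) -> Optional[int]:
    """Return the most similar stored hash that reaches first_threshold, if any."""
    best, best_score = None, -1
    for stored in stored_hashes:
        score = matching_bits(packet_hash, stored, width)
        if score >= first_threshold and score > best_score:
            best, best_score = stored, score
    return best


def handle_packet(packet_hash: int, stored_hashes: Iterable[int],
                  first_threshold: int, second_threshold: int, width: int) -> str:
    """Decide between duplicate block retrieval and storing the packet as new."""
    first_hash = select_first_hash(packet_hash, stored_hashes, first_threshold, width)
    if first_hash is not None and matching_bits(packet_hash, first_hash, width) >= second_threshold:
        return "retrieve"  # run duplicate block retrieval on the packet's blocks
    # Assumption: no candidate, or a candidate below the second threshold,
    # means the packet is stored and its hash is registered in the table.
    return "store"
```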
With reference to the fourth aspect or the first possible implementation of the fourth aspect, in a second possible implementation of the fourth aspect, the packet obtaining unit is specifically configured to form to-be-blocked hash data from the hash value of each of the at least two data blocks; perform block processing on the to-be-blocked hash data by using a blocking algorithm, with the length of the hash value of any one of the data blocks as the sliding step, to obtain at least one hash value block; and use the data blocks corresponding to the hash values that belong to a same hash value block as one data packet.
With reference to the fourth aspect or the first or second possible implementation of the fourth aspect, in a third possible implementation of the fourth aspect, the hash calculation unit being configured to perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the hash calculation unit is specifically configured to perform a hash operation on each data block in the first data packet to obtain a hash value of each data block; replace each 0 in the hash value of each data block with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
With reference to the fourth aspect or any one of the first to third possible implementations of the fourth aspect, in a fourth possible implementation of the fourth aspect, the data storage space includes a plurality of storage areas, and the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located;
the duplicate retrieval unit is specifically configured to obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and load the data blocks in the storage area numbered n, together with the hash values of those data blocks, into memory, where n is an integer greater than or equal to 0; and compare the data blocks in the first data packet with the data blocks in the storage area numbered n that have the same hash values, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the fourth possible implementation of the fourth aspect, in a fifth possible implementation of the fourth aspect, the duplicate retrieval unit is further configured to: while loading the data blocks in the storage area numbered n and the hash values of those data blocks into memory, also load the data blocks in the storage area numbered (n+1) and the hash values of those data blocks into memory;
the duplicate retrieval unit being specifically configured to compare the data blocks in the first data packet with the data blocks in the storage area numbered n that have the same hash values, to complete the duplicate block retrieval on the data blocks in the first data packet, includes: the duplicate retrieval unit is specifically configured to compare the data blocks in the first data packet with the data blocks in the storage areas numbered n and (n+1) that have the same hash values, to complete the duplicate block retrieval on the data blocks in the first data packet.
With reference to the fourth aspect or any one of the first to fifth possible implementations of the fourth aspect, in a sixth possible implementation of the fourth aspect, the hash calculation unit being configured to obtain, from the hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: the hash calculation unit is specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to a preset number, and to use that hash value as the first hash value.
With reference to the sixth possible implementation of the fourth aspect, in a seventh possible implementation of the fourth aspect, the hash calculation unit being specifically configured to obtain, from the hash value storage table, a hash value for which the number of bits that match the hash value of the first data packet at corresponding positions is greater than or equal to the preset number, and to use that hash value as the first hash value, includes: the hash calculation unit is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use, as the first hash value, a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
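The Hamming-distance selection described in the seventh possible implementation can be sketched as follows. The function names are hypothetical, hash values are assumed to be held as non-negative Python integers of equal bit width, and the linear scan over every stored hash simply mirrors the wording of the text rather than any optimized lookup.

```python
from typing import Iterable, List


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two non-negative integers differ."""
    return bin(a ^ b).count("1")


def candidates_within(packet_hash: int, stored_hashes: Iterable[int],
                      max_distance: int) -> List[int]:
    """Stored hashes whose Hamming distance to packet_hash is at most max_distance."""
    return [h for h in stored_hashes if hamming_distance(packet_hash, h) <= max_distance]
```

For hashes of a fixed width w, requiring at least k matching bits is the same condition as requiring a Hamming distance of at most w - k, which is why the sixth and seventh implementations describe the same selection in two ways.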
According to the duplicate data retrieval method and device provided in the embodiments of the present invention, received data is first divided into data blocks, and the data blocks are then grouped into data packets. A similarity hash operation is performed on the data blocks in a data packet to obtain a hash value of the data packet. A first hash value is then obtained from the hash value storage table, which stores the hash values of the data packets already stored in the data storage space, where the similarity between the first hash value and the hash value of the data packet is greater than or equal to a preset first similarity threshold. It is then determined whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to a preset second similarity threshold. If it is, the data blocks in the data packet are, to a large extent, duplicate blocks, and duplicate block retrieval is performed on them. Because the hash value storage table stores the correspondence between the data packets already stored in the data storage space and their hash values, and the number of data packets is relatively small, querying the hash value storage table is efficient. In addition, performing duplicate block retrieval per data packet reduces the number of duplicate block retrievals, that is, the number of interactions with the disk, which improves duplicate block query efficiency and therefore the overall performance of the deduplication technology.
BRIEF DESCRIPTION OF THE DRAWINGS
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Apparently, the accompanying drawings in the following description show some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.
FIG. 1 is a flowchart of a duplicate data retrieval method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a similarity hash operation process according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a duplicate data retrieval device according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a duplicate data retrieval device according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of a duplicate data retrieval device according to still another embodiment of the present invention; and
FIG. 6 is a schematic structural diagram of a computer program product according to an embodiment of the present invention.
DETAILED DESCRIPTION
To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the following clearly and completely describes the technical solutions in the embodiments of the present invention with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
FIG. 1 is a flowchart of a duplicate data retrieval method according to an embodiment of the present invention. As shown in FIG. 1, the method of this embodiment includes:
Step 101: Perform block processing on received data to obtain at least two data blocks.
This embodiment may be executed by a duplicate data retrieval device. In terms of implementation form, the device may be any device with computing capability, for example, a server or a computer in a data backup environment, or a terminal, a gateway, or a base station in a wide area network data transmission scenario.
After receiving the data to be stored, the duplicate data retrieval device first divides the data into blocks to obtain at least two data blocks. Optionally, the duplicate data retrieval device may perform the block processing by using a blocking algorithm, which may be, but is not limited to, the fixed-sized partition (FSP) algorithm, the content-defined chunking (CDC) algorithm, or the sliding block algorithm. The size of a data block depends on the blocking algorithm used and on the requirements of the actual application; the embodiments of the present invention do not limit its specific value. The process of dividing data into blocks by using these blocking algorithms belongs to the prior art and is not described in detail here.
Step 102: Group the at least two data blocks to obtain at least one data packet, where each data packet includes at least one data block.
After dividing the data into data blocks, the duplicate data retrieval device groups the obtained data blocks to obtain data packets. The number of data packets may be less than the number of data blocks. The grouping simply assigns the obtained data blocks to different data packets, and it can be done in several ways. For example, the duplicate data retrieval device may divide the data blocks in sequence so that every data packet includes the same number of data blocks, forming at least one data packet.
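The equal-count grouping mentioned above can be sketched in a few lines. The name `group_blocks` and the parameter `blocks_per_packet` are hypothetical; the source does not say how a final, shorter run of blocks is handled, so this sketch assumes it simply forms a smaller last packet.

```python
from typing import List


def group_blocks(blocks: List[bytes], blocks_per_packet: int) -> List[List[bytes]]:
    """Split consecutive data blocks into packets of blocks_per_packet blocks each."""
    if blocks_per_packet <= 0:
        raise ValueError("blocks_per_packet must be positive")
    # Assumption: a trailing remainder becomes a final, smaller packet.
    return [blocks[i:i + blocks_per_packet]
            for i in range(0, len(blocks), blocks_per_packet)]
```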
For another example, the duplicate data retrieval device may apply a blocking algorithm a second time, to the data blocks already obtained, to get at least one data packet. This implementation includes: forming to-be-blocked hash data from the hash value of each of the at least two data blocks obtained above; and performing block processing on the to-be-blocked hash data by using a blocking algorithm, with the length of the hash value of any one of the at least two data blocks (the hash values of all data blocks have the same length) as the sliding step, to obtain at least one hash value block. The sliding step is the minimum distance moved in one slide over the to-be-blocked hash data, and a hash value block may be obtained after one or more slides. Because the sliding step used by the blocking algorithm is measured in units of the hash value length, every hash value block consists of one or more complete hash values. If the sliding distance that yields a hash value block is several sliding steps (that is, several slides), the hash value block consists of several hash values; if the sliding distance is one sliding step (that is, one slide), the hash value block consists of one hash value. After the hash value blocks are obtained, the data blocks corresponding to the hash values that belong to a same hash value block are used as one data packet, which yields at least one data packet. With this grouping manner, the end position of each data packet is also the end position of a data block, so the data packets are divided more accurately. The process of performing block processing on the to-be-blocked hash data is similar to that of an existing blocking algorithm and is not described again. Forming the to-be-blocked hash data from the hash value of each of the at least two data blocks includes: calculating the hash value of each of the at least two data blocks, and concatenating these hash values to form the to-be-blocked hash data.
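The hash-based grouping above can be sketched as follows. Several details are assumptions made only for illustration: SHA-1 is used as the per-block hash, the boundary rule (cut when the hash under the window is divisible by `divisor`, or when `max_blocks` is reached) stands in for whatever blocking algorithm is actually applied, and all names are hypothetical. What the sketch does preserve from the text is that the window advances by exactly one hash length per slide, so every hash value block holds whole hashes.

```python
import hashlib
from typing import List

HASH_LEN = 20  # SHA-1 digest length in bytes; also used as the sliding step


def block_hash(block: bytes) -> bytes:
    """Per-block hash (SHA-1 is an assumption; the source does not name one here)."""
    return hashlib.sha1(block).digest()


def group_by_hash_chunking(blocks: List[bytes], divisor: int = 4,
                           max_blocks: int = 8) -> List[List[bytes]]:
    """Group consecutive data blocks by chunking their concatenated hashes."""
    # Concatenate the per-block hashes into the to-be-blocked hash data.
    hash_data = b"".join(block_hash(b) for b in blocks)
    packets: List[List[bytes]] = []
    current: List[bytes] = []
    # Slide over the hash data one whole hash at a time.
    for index, offset in enumerate(range(0, len(hash_data), HASH_LEN)):
        window = hash_data[offset:offset + HASH_LEN]
        current.append(blocks[index])
        # Hypothetical content-defined boundary rule, plus a size cap.
        if int.from_bytes(window, "big") % divisor == 0 or len(current) == max_blocks:
            packets.append(current)
            current = []
    if current:
        packets.append(current)  # trailing blocks form the last packet
    return packets
```

Because the boundary depends only on block hashes, identical runs of blocks tend to produce identical packet boundaries, which is presumably what makes the resulting packet hashes comparable across data streams.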
The data blocks in each data packet are consecutive; that is, each data packet consists of consecutive data blocks.
The data packets may include the same number of data blocks or different numbers of data blocks. The number of data blocks in a data packet may be determined by the actual application; the embodiments of the present invention do not limit its specific value.
After the grouping, duplicate block retrieval can be performed per data packet, which reduces the number of duplicate block retrievals and the interaction with the disk, and improves duplicate block retrieval efficiency.
Step 103: For a first data packet of the at least one data packet, perform a similarity hash (simhash) operation on the data blocks in the first data packet to obtain a hash value of the first data packet; obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold; and if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, perform duplicate block retrieval on the data blocks in the first data packet.
Because every data packet is processed in the same way, this embodiment is described by using any one of the data packets as an example. For ease of distinction, it is referred to as the first data packet; that is, the first data packet may be any one of the at least one data packet obtained above. The hash value storage table stores a correspondence between the hash values of the second data packets currently stored in the data storage space and those second data packets. For ease of distinction and description, a data packet currently stored in the data storage space is referred to as a second data packet. The hash value of a second data packet stored in the hash value storage table is calculated in the same way as the hash value of the first data packet in this embodiment; that is, it is also obtained by performing the similarity hash operation on the data blocks in the second data packet. In addition, the data blocks corresponding to these hash values do not duplicate one another; that is, the data blocks in a second data packet have been determined not to be duplicate blocks. The data storage space is the storage space used to store data blocks, and may be a hard disk, a magnetic disk, or the like.
Optionally, because the hash value storage table stores the correspondence between the data packets already stored in the data storage space and their hash values, and the number of data packets is not greater than the number of data blocks, when the number of data packets is less than the number of data blocks, the hash value storage table of this embodiment is much smaller than a hash table that stores an entry for every data block. It can therefore be stored in memory, which improves the efficiency of querying the hash value storage table and further improves duplicate block retrieval efficiency. The hash value storage table is not limited to being stored in memory; it may also be stored on a disk or another storage device, but it is preferably stored in memory.
After obtaining the data packets, the duplicate data retrieval device processes each data packet in the same way. Taking the first data packet as an example, the duplicate data retrieval device performs the following processing on the first data packet:
First, a similarity hash operation is performed on the data blocks in the first data packet to obtain the hash value of the first data packet. The principle of similarity hashing is that the more similar two data blocks are, the more similar their hash values are, and vice versa. A similarity hash operation is any operation that gives more similar data more similar hash values. For example, one method of performing the similarity hash operation on the first data packet includes: performing a hash operation on each data block in the first data packet to obtain the hash value of each data block; representing each of these hash values in binary and converting every bit, where in a specific implementation a bit with value 0 is replaced with -1 and a bit with value 1 is left unchanged; and then accumulating the converted hash values, where in a specific implementation the corresponding bits of all converted hash values are added, each position whose sum is greater than 0 is mapped to 1, and each position whose sum is less than or equal to 0 is mapped to 0. The binary value obtained in this way is used as the hash value of the first data packet.

A preferred implementation of the similarity hash operation is described with reference to FIG. 2. As shown in FIG. 2, the first data packet includes n data blocks, namely the first data block through the nth data block, and a hash operation on each data block yields a hash value in binary form. FIG. 2 shows that the binary hash values of the first, second, and nth data blocks are 100110, 110000, and 001001, respectively. Replacing each 0 with -1 gives the converted hash values (1, -1, -1, 1, 1, -1), (1, 1, -1, -1, -1, -1), and (-1, -1, 1, -1, -1, 1), respectively. Adding the corresponding positions of the converted hash values of all n data blocks finally gives the result 13, 18, -22, -5, -2, 5. Mapping each value greater than 0 to 1 and each value less than or equal to 0 to 0 gives the binary value 110001, which is the hash value of the first data packet.
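The similarity hash described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `bits_from_sums` and `packet_simhash` are hypothetical, and block hashes are assumed to be given as equal-length strings of `0` and `1` characters. The per-block hash function itself is not modelled here.

```python
from typing import Sequence


def bits_from_sums(sums: Sequence[int]) -> str:
    """Map each per-position sum to a bit: greater than 0 -> 1, otherwise 0."""
    return "".join("1" if s > 0 else "0" for s in sums)


def packet_simhash(block_hashes: Sequence[str]) -> str:
    """Combine equal-length binary block hashes into one packet-level hash."""
    if not block_hashes:
        raise ValueError("a data packet must contain at least one data block")
    width = len(block_hashes[0])
    if any(len(h) != width for h in block_hashes):
        raise ValueError("all block hashes must have the same bit length")

    # Per-position accumulator: a 1 bit contributes +1, a 0 bit contributes -1.
    sums = [0] * width
    for h in block_hashes:
        for i, bit in enumerate(h):
            sums[i] += 1 if bit == "1" else -1
    return bits_from_sums(sums)


# The per-position sums quoted for FIG. 2 map to the packet hash 110001.
print(bits_from_sums([13, 18, -22, -5, -2, 5]))  # -> 110001
```

Note that FIG. 2 lists only three of the n block hashes, so calling `packet_simhash` on those three alone does not reproduce the sums 13, 18, -22, -5, -2, 5; the final mapping step is what the printed example checks.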
Besides the method above, another similarity hash operation, for example a perceptual hash algorithm, may be used to perform the similarity hash operation on the data blocks in the first data packet. The principle of perceptual hashing is to generate a "fingerprint" string for each picture and then compare the fingerprints of different pictures; the more similar the fingerprints, the more similar the pictures. Applied to the duplicate data retrieval method of this embodiment, the principle is to calculate one hash value for each data packet and then compare the hash values of different data packets: the more similar two hash values are, the more data blocks in the two data packets are likely to be duplicates (that is, the more similar the two data packets are).
By introducing the similarity hash operation, this embodiment makes full use of the property that the more similar two hash values are, the more similar the corresponding data packets are. Comparing the calculated hash value of a data packet with the hash values of existing data packets indicates, to a certain extent, how likely the data blocks in that data packet are to duplicate data blocks already stored in the data storage space. The more similar the calculated hash value is to the hash value of an existing data packet, the more likely the data blocks in the data packet are duplicates. If it is then determined, based on the hash value of the data packet, that the data packet needs duplicate block retrieval, the data blocks in the data packet are to a large extent duplicate blocks, so performing duplicate block retrieval at this point improves retrieval performance. The following uses a comparison to show that the method of this embodiment can improve the performance of duplicate block retrieval.
Next, after calculating the hash value of the first data packet, the duplicate data retrieval device compares it with the hash values in the hash value storage table and obtains the hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold; this hash value is recorded as the first hash value. Optionally, in a specific implementation, if several hash values meet the first similarity threshold, all of them may be obtained, and each of them is a first hash value; if exactly one hash value meets the threshold, that hash value is the first hash value, that is, one first hash value is obtained. Preferably, but without limitation, the hash value in the hash value storage table that is most similar to the hash value of the first data packet may be obtained as the first hash value.

One way to obtain a hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold is as follows: the duplicate data retrieval device obtains, as the first hash value, a hash value in the hash value storage table for which the number of bit positions that match the hash value of the first data packet is greater than or equal to a preset number. In this implementation, the number of matching bit positions represents the similarity of two hash values: the more positions at which two hash values match, the more similar they are, and vice versa. The preset number here corresponds to the preset first similarity threshold.

Further, one way to implement this is as follows: the duplicate data retrieval device obtains the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and takes the hash values in the table whose Hamming distance is less than or equal to a preset Hamming distance threshold as first hash values. The Hamming distance between the hash value of the first data packet and a hash value in the table reflects the degree of duplication between the first data packet and the second data packet corresponding to that hash value. The smaller the Hamming distance (that is, the more matching bits), the higher the degree of duplication between the first data packet and the corresponding second data packet. Besides the Hamming distance, other parameters that can represent the similarity of two hash values may also be used. The preset Hamming distance threshold here corresponds to the preset number above.
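The Hamming-distance lookup can be sketched as follows. This is a minimal sketch under stated assumptions: hash values are held as Python integers, the table is a plain dictionary from stored hash value to a packet identifier, and the names `hamming_distance` and `find_first_hashes` are hypothetical. For hash values of W bits, "at least k matching bits" is the same condition as "Hamming distance at most W - k", which is why the two thresholds in the text correspond to each other.

```python
from typing import Dict, List, Tuple


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions at which two hash values differ."""
    return bin(a ^ b).count("1")


def find_first_hashes(
    packet_hash: int, table: Dict[int, str], max_distance: int
) -> List[Tuple[int, int]]:
    """Return (distance, stored_hash) pairs within max_distance, closest first."""
    hits = [(hamming_distance(packet_hash, stored), stored) for stored in table]
    return sorted(hit for hit in hits if hit[0] <= max_distance)


# Hypothetical table: stored packet hash -> identifier of the second data packet.
table = {0b110001: "packet-A", 0b110011: "packet-B", 0b001110: "packet-C"}
candidates = find_first_hashes(0b110001, table, max_distance=1)
```

The first element of `candidates`, when the list is non-empty, is the most similar stored hash value, which matches the preferred option of taking the closest hash value as the first hash value. A linear scan is used here for clarity; it is an assumption of the sketch, not something the text prescribes.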
Next, the duplicate data retrieval device compares the similarity between the hash value of the first data packet and the first hash value with a preset second similarity threshold to determine whether the first data packet needs duplicate block retrieval. If the similarity is greater than or equal to the second similarity threshold, the degree of duplication between the first data packet and the second data packet corresponding to the first hash value is very high, and it can be determined that the two share many duplicate blocks; duplicate block retrieval therefore needs to be performed on the first data packet. Optionally, if the similarity of two hash values is represented by the number of bit positions at which they match, the second similarity threshold may be a threshold on the number of matching bits. Correspondingly, the comparison may be: the duplicate data retrieval device determines whether the number of bit positions at which the hash value of the first data packet matches the first hash value is greater than or equal to the preset matching-bit threshold. Note that the second similarity threshold is greater than or equal to the first similarity threshold.
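The second-threshold check can be illustrated with a short sketch. It assumes similarity is measured as the number of matching bit positions, that hash values are integers of a known bit width, and that the helper names are hypothetical.

```python
def matching_bits(a: int, b: int, width: int) -> int:
    """Count the bit positions (out of `width`) at which a and b agree."""
    return width - bin(a ^ b).count("1")


def needs_duplicate_retrieval(
    packet_hash: int, first_hash: int, width: int, second_threshold: int
) -> bool:
    """True if the packet is similar enough to justify duplicate block retrieval."""
    return matching_bits(packet_hash, first_hash, width) >= second_threshold


# With 6-bit hash values and a second threshold of 5 matching bits:
decision_close = needs_duplicate_retrieval(0b110001, 0b110011, 6, 5)  # 5 bits match
decision_far = needs_duplicate_retrieval(0b110001, 0b100011, 6, 5)    # 4 bits match
```

Because the second threshold is at least as large as the first, a stored hash value can qualify as a first hash value and still fail this check, in which case the packet is treated as new data.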
Optionally, the data storage space includes multiple storage areas. Each storage area has a number, and the storage areas are used one after another in ascending order of their numbers. Correspondingly, in addition to the correspondence between the hash value of each second data packet and that second data packet, the hash value storage table also stores the correspondence between the hash value of each second data packet and the number of the storage area in which that second data packet is located. On this basis, duplicate block retrieval on the first data packet may proceed as follows: the duplicate data retrieval device obtains, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and loads the data blocks and the data block hash values in storage area n into memory, where n is an integer greater than or equal to 0; it then compares the data blocks of the first data packet with the data blocks in storage area n that have the same hash value, thereby completing duplicate block retrieval for the data blocks in the first data packet.
Optionally, while the data blocks and data block hash values in storage area n are loaded into memory, the data blocks and data block hash values in storage area (n+1) are loaded into memory as well. On this basis, the comparison described above may be: comparing the data blocks of the first data packet with the data blocks in storage areas n and (n+1) that have the same hash value, thereby completing duplicate block retrieval for the data blocks in the first data packet.
This comparison may proceed as follows. First, the hash value of each data block in the first data packet is compared with the hash values in storage areas n and (n+1) to find the hash values that the first data packet has in common with these storage areas; for ease of description, each such common hash value is called a second hash value. Then, the data block corresponding to the second hash value in the first data packet is compared with the data block corresponding to the second hash value in storage area n or (n+1), thereby completing duplicate block retrieval for the data blocks in the first data packet.
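The two-stage comparison (hash match first, then block comparison) can be sketched as below. This is a simplified in-memory model, not the stored format: each storage area is assumed to be a dictionary from block hash to block bytes, SHA-1 from Python's `hashlib` stands in for the unspecified per-block hash, and `find_duplicate_blocks` is a hypothetical name.

```python
import hashlib
from typing import Dict, List

# Hypothetical model: one storage area maps a block hash to the stored block.
Region = Dict[str, bytes]


def block_hash(block: bytes) -> str:
    """Per-block hash; SHA-1 is an assumption made for this sketch."""
    return hashlib.sha1(block).hexdigest()


def find_duplicate_blocks(
    packet_blocks: List[bytes], regions: List[Region], n: int
) -> List[bool]:
    """For each block of the packet, report whether it duplicates a stored block."""
    # "Load" storage area n and, if it exists, storage area n+1.
    loaded: Region = {}
    for number in (n, n + 1):
        if 0 <= number < len(regions):
            loaded.update(regions[number])

    result = []
    for block in packet_blocks:
        stored = loaded.get(block_hash(block))  # a hit is a "second hash value"
        # Confirm the duplicate by comparing the block contents themselves.
        result.append(stored is not None and stored == block)
    return result


regions = [
    {block_hash(b"aaa"): b"aaa"},
    {block_hash(b"bbb"): b"bbb"},
    {block_hash(b"ccc"): b"ccc"},
]
flags = find_duplicate_blocks([b"aaa", b"bbb", b"ccc", b"ddd"], regions, n=0)
```

With n = 0, only storage areas 0 and 1 are consulted, so the block `b"ccc"` held in storage area 2 is not reported as a duplicate in this call.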
Because the storage areas are used one after another in ascending order of their numbers, storage area (n+1) is the storage area that follows storage area n; that is, once storage area n is full, data continues to be written to storage area (n+1). The data received next is very likely to have duplicates in the storage area that follows the one corresponding to the first hash value (that is, in storage area (n+1)). Loading the contents of both the storage area corresponding to the first hash value (storage area n) and the following storage area into memory in one operation therefore improves the efficiency of subsequent duplicate block retrieval, and in turn the overall efficiency of duplicate block retrieval.
Note that this embodiment stores data blocks and data block hash values in separate storage areas, but is not limited to this arrangement. A preferred way of storing by area is as follows: data blocks are stored together in one storage area in the order in which they are received, and when that storage area is full, subsequently received data blocks are stored in the next storage area. Each storage area is a span of storage space of a certain size, for example, but not limited to, 64 MB.
Each storage area stores both data blocks and data block hash values; the specific storage layout is not limited. In one preferred layout, a storage area is divided into two parts. One part is a data segment area, which stores the data blocks. The other part is a metadata area, which stores the metadata corresponding to the data blocks in the data segment area. The metadata includes information such as the hash value of each data block, the length of each data block, the length of the data segment, and some check codes. The duplicate data lookup of the present invention mainly uses the data block hash values in the metadata.
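The area layout and the fill-then-advance policy can be modelled with a small sketch. Everything here is illustrative: the class names `BlockMeta`, `StorageRegion`, and `RegionStore` are hypothetical, the metadata keeps only the block hash and block length (check codes and the data segment length are omitted), and the capacity is a parameter rather than a fixed 64 MB.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BlockMeta:
    block_hash: str  # hash value of the data block
    length: int      # length of the data block in bytes


@dataclass
class StorageRegion:
    number: int
    capacity: int
    data_segment: bytearray = field(default_factory=bytearray)  # data blocks
    metadata: List[BlockMeta] = field(default_factory=list)     # per-block metadata

    def has_room(self, block: bytes) -> bool:
        return len(self.data_segment) + len(block) <= self.capacity


class RegionStore:
    """Appends blocks in arrival order; opens the next area when one is full."""

    def __init__(self, region_capacity: int) -> None:
        self.region_capacity = region_capacity
        self.regions = [StorageRegion(0, region_capacity)]

    def append(self, block: bytes, block_hash: str) -> int:
        """Store one block and return the number of the area that holds it."""
        if len(block) > self.region_capacity:
            raise ValueError("block is larger than a storage area")
        current = self.regions[-1]
        if not current.has_room(block):
            current = StorageRegion(current.number + 1, self.region_capacity)
            self.regions.append(current)
        current.data_segment.extend(block)
        current.metadata.append(BlockMeta(block_hash, len(block)))
        return current.number


store = RegionStore(region_capacity=8)
placements = [store.append(b"aaaa", "h1"), store.append(b"bbbb", "h2"), store.append(b"cc", "h3")]
```

The area number returned by `append` is the value that would be recorded next to the packet hash in the hash value storage table, so that a later lookup knows which area to load.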
Optionally, if the similarity between the hash value of the first data packet obtained by the similarity hash operation in step 103 and the first hash value is less than the preset second similarity threshold, the degree of duplication between the first data packet and the second data packet corresponding to the first hash value is not very high, and it can be determined that the two share no duplicate blocks or only very few. For example, perhaps only one or two data blocks in the first data packet duplicate data blocks in the second data packet corresponding to the first hash value. To improve overall performance, the data blocks in the first data packet may be treated as new data, that is, stored directly in the data storage space without duplicate block retrieval. Further, if the data storage space includes multiple storage areas, the duplicate data retrieval device may store the data blocks in the first data packet and their hash values directly in the storage area currently in use.
As can be seen from the above, the duplicate data retrieval method of this embodiment first divides the received data into blocks and then groups the blocks into data packets. It performs a similarity hash operation on the data blocks in a data packet to obtain the hash value of the data packet, and then obtains, from the hash values stored in the hash value storage table for the data packets already stored in the data storage space, a first hash value whose similarity to the hash value of the data packet is greater than or equal to the preset first similarity threshold. It then determines whether the similarity between the hash value of the data packet and the first hash value is greater than or equal to the preset second similarity threshold. If so, the data blocks in the data packet are to a large extent duplicate blocks, and duplicate block retrieval is performed on them. Because the hash value storage table stores the correspondence between data packets already stored in the data storage space and their hash values, and the number of data packets is relatively small, lookups in the hash value storage table are efficient. In addition, performing duplicate block retrieval on the basis of data packets reduces the number of duplicate block retrievals, that is, the number of disk interactions, which improves the efficiency of duplicate block lookup and thus the overall performance of data deduplication.
FIG. 3 is a schematic structural diagram of a duplicate data retrieval device according to an embodiment of the present invention. In terms of implementation, the duplicate data retrieval device of this embodiment may be any device with computing and storage capabilities, for example a server or a computer in a data backup environment, or a terminal, a gateway, a base station, or the like in a wide area network data transmission scenario. The specific embodiments of the present invention do not limit the specific implementation of the duplicate data retrieval device. As shown in FIG. 3, the device of this embodiment includes a block obtaining module 31, a packet obtaining module 32, a hash calculation module 33, and a duplicate retrieval module 34.
The block obtaining module 31 is configured to divide the received data into blocks to obtain at least two data blocks.
The packet obtaining module 32 is connected to the block obtaining module 31 and is configured to group the at least two data blocks obtained by the block obtaining module 31 to obtain at least one data packet, where each data packet includes at least one data block.
The hash calculation module 33 is connected to the packet obtaining module 32 and is configured to, for a first data packet among the at least one data packet obtained by the packet obtaining module 32, perform a similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet, and obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold. The hash value storage table stores the correspondence between the hash values of second data packets already stored in the data storage space and those second data packets, and the hash value of a second data packet is obtained by performing the similarity hash operation on the data blocks in that second data packet. The first data packet is any one of the at least one data packet.
The duplicate retrieval module 34 is connected to the hash calculation module 33 and is configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation module 33 is greater than or equal to the preset second similarity threshold.
In an optional implementation, as shown in FIG. 4, the duplicate data retrieval device of this embodiment further includes a storage module 35. The storage module 35 is connected to the hash calculation module 33 and is configured to, when the similarity between the hash value of the first data packet and the first hash value obtained by the hash calculation module 33 is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of those data blocks in the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet in the hash value storage table.
Note that the hash calculation module 33, the duplicate retrieval module 34, and the storage module 35 perform the same actions for every data packet.
In an optional implementation, the packet obtaining module 32 is specifically configured to: form the hash data to be chunked from the hash values of the at least two data blocks obtained by the block obtaining module 31; apply a chunking algorithm to this hash data, using the length of the hash value of one data block as the sliding step, to obtain at least one hash value chunk; and treat the data blocks whose hash values belong to the same hash value chunk as one data packet, thereby obtaining at least one data packet.
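One way to picture this grouping step is the sketch below. The text does not say which chunking algorithm is applied to the sequence of block hashes, so the boundary rule here (cut after a hash whose integer value is divisible by `divisor`, or when the group reaches `max_group_size`) is purely an assumption, as are the function name and parameters. Advancing one whole hash value at a time plays the role of a sliding step equal to the hash length.

```python
from typing import List


def group_block_indices(
    block_hashes: List[str], divisor: int = 4, max_group_size: int = 64
) -> List[List[int]]:
    """Split a sequence of hex block hashes into groups of block indices.

    The cursor advances one whole hash value per step. A group boundary is
    placed after a hash whose value is divisible by `divisor` (an assumed,
    content-defined rule) or when the group reaches `max_group_size`.
    """
    groups: List[List[int]] = []
    current: List[int] = []
    for index, h in enumerate(block_hashes):
        current.append(index)
        if int(h, 16) % divisor == 0 or len(current) >= max_group_size:
            groups.append(current)
            current = []
    if current:  # trailing blocks form the last data packet
        groups.append(current)
    return groups


groups = group_block_indices(["01", "02", "04", "03", "08", "05"], divisor=4)
```

Each inner list holds the indices of the data blocks that make up one data packet. Because the boundaries depend on the hash values rather than on position, inserting a block early in the stream shifts only the groups near the insertion point under this assumed rule.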
In an optional implementation, the hash calculation module 33 being configured to perform the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet includes: the hash calculation module 33 is specifically configured to perform a hash operation on each data block in the first data packet to obtain the hash value of each data block, replace each 0 in the hash value of each data block with -1, add the corresponding bits of the hash values of all data blocks in the first data packet, map each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and use the resulting binary value as the hash value of the first data packet.
In an optional implementation, the data storage space includes multiple storage areas; correspondingly, the hash value storage table also stores the correspondence between the hash value of each second data packet and the number of the storage area in which that second data packet is located. On this basis, the duplicate retrieval module 34 is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value; load the data blocks and data block hash values in storage area n into memory, where n is an integer greater than or equal to 0; and compare the data blocks of the first data packet with the data blocks in storage area n that have the same hash value, thereby completing duplicate block retrieval for the data blocks in the first data packet.
In an optional implementation, the duplicate retrieval module 34 is further configured to load the data blocks and data block hash values in storage area (n+1) into memory while loading those in storage area n. On this basis, the duplicate retrieval module 34 being configured to compare the data blocks of the first data packet with the data blocks in storage area n that have the same hash value includes: the duplicate retrieval module 34 is specifically configured to compare the data blocks of the first data packet with the data blocks in storage areas n and (n+1) that have the same hash value, thereby completing duplicate block retrieval for the data blocks in the first data packet.
In an optional implementation, the hash calculation module 33 being configured to obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold includes: the hash calculation module 33 is specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of bit positions that match the hash value of the first data packet is greater than or equal to a preset number.
The hash calculation module 33 being configured to obtain such a hash value as the first hash value includes: the hash calculation module 33 is specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and take the hash values in the table whose Hamming distance is less than or equal to the preset Hamming distance threshold as first hash values.
The functional modules of the duplicate data retrieval device provided by this embodiment of the present invention can be used to execute the flow of the duplicate data retrieval method shown in FIG. 1. Their working principles are not repeated here; for details, see the description of the method embodiment.
The duplicate data retrieval device provided by this embodiment first partitions the received data into data blocks and then groups the blocks. It performs a similarity hash operation on the data blocks in a data group to obtain the hash value of the data group. It then obtains, from among the hash values of the data groups already stored in the data storage space (as recorded in the hash value storage table), a first hash value whose similarity to the hash value of the data group is greater than or equal to the preset first similarity threshold, and determines whether the similarity between the hash value of the data group and the first hash value is greater than or equal to the preset second similarity threshold. If it is, the data blocks in the data group are to a large extent duplicate blocks, and duplicate block retrieval is performed on them. Because the hash value storage table stores the correspondence between data groups already stored in the data storage space and their hash values, and the number of data groups is relatively small, querying the hash value storage table is efficient. In addition, performing duplicate block retrieval on the basis of data groups reduces the number of duplicate block retrievals, that is, the number of disk interactions, which helps improve the efficiency of duplicate block queries and thus the overall performance of the deduplication technique.
FIG. 5 is a schematic structural diagram of a duplicate data retrieval device according to yet another embodiment of the present invention. In terms of concrete implementation, the duplicate data retrieval device of this embodiment may be any device with computing and storage capabilities, for example a server or a computer in a data backup environment, or a terminal, gateway, or base station in a wide-area-network data transmission scenario; the specific embodiments of the present invention do not limit the concrete implementation of the duplicate data retrieval device. As shown in FIG. 5, the duplicate data retrieval device of this embodiment includes:
a processor 51, a communications interface 53, a memory 52, and a bus. The processor 51, the memory 52, and the communications interface 53 are connected by the bus and communicate with one another over it. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, the bus is drawn as a single thick line in FIG. 5, but this does not mean that there is only one bus or only one type of bus. Specifically:
The communications interface 53 is configured to receive data.
The processor 51 is configured to execute a program. Specifically, the program may include program code, and the program code includes computer operation instructions.
The processor 51 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.
The memory 52 is configured to store the program. The memory 52 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory.
The program may specifically be used to: partition the data received by the communications interface 53 to obtain at least two data blocks; group the at least two data blocks to obtain at least one data group, each data group including at least one data block; and, for a first data group among the at least one data group, perform a similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group, and obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold. The hash value storage table stores the correspondence between the hash value of a second data group already stored in the data storage space and that second data group; the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group. The first data group is any one of the at least one data group. If the similarity between the hash value of the first data group and the first hash value is greater than or equal to the preset second similarity threshold, the program performs duplicate block retrieval on the data blocks in the first data group.
In an optional implementation, the program stored in the memory 52 is further used to, when the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, store the data blocks in the first data group and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data group and the first data group into the hash value storage table.
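The two-threshold decision described above can be sketched as follows. This is an illustration under stated assumptions rather than the program of the embodiment: similarity is measured as the count of identical bits in hypothetical 8-bit hashes, the most similar candidate is picked as the first hash value, and a group with no candidate at all is stored as new data. None of these three choices is fixed by the text.

```python
from typing import Dict, Optional, Tuple

HASH_BITS = 8  # assumed hash width for this illustration


def similarity(a: int, b: int) -> int:
    """Number of identical bits between two HASH_BITS-wide hash values."""
    return HASH_BITS - bin(a ^ b).count("1")


def process_group(group_hash: int, table: Dict[int, str],
                  first_threshold: int, second_threshold: int) -> Tuple[str, Optional[int]]:
    """Return ("retrieve", first_hash) or ("store", None) for one data group."""
    # Step 1: stored hashes whose similarity reaches the first threshold.
    candidates = [h for h in table if similarity(group_hash, h) >= first_threshold]
    if candidates:
        # Assumption: use the most similar candidate as the first hash value.
        first_hash = max(candidates, key=lambda h: similarity(group_hash, h))
        # Step 2: the second threshold decides whether block-level retrieval runs.
        if similarity(group_hash, first_hash) >= second_threshold:
            return "retrieve", first_hash
    # Below the second threshold (or, by assumption, no candidate): record as new.
    table[group_hash] = "new-group"
    return "store", None


table = {0b11110000: "g1"}
print(process_group(0b11110001, table, first_threshold=6, second_threshold=7))
print(process_group(0b00001111, table, first_threshold=6, second_threshold=7))
```

In the "store" branch a real system would also write the group's data blocks and their hash values to the data storage space; the sketch only updates the hash value storage table.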
In an optional implementation, the program stored in the memory 52 groups the at least two data blocks to obtain at least one data group as follows: the hash values of the at least two data blocks together form the hash data to be partitioned; using the length of the hash value of any one data block as the sliding step, a chunking algorithm is applied to the hash data to be partitioned to obtain at least one hash-value block; and the data blocks corresponding to the hash values that belong to the same hash-value block are taken as one data group.
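The grouping step can be sketched as follows. The text does not name the chunking algorithm, so the boundary rule used here (a hash whose last four bytes are divisible by a hypothetical `divisor` closes the group, with a hypothetical `max_group` cap) is an assumption chosen only to show the mechanics; SHA-1 as the per-block hash is likewise an assumption. The point illustrated is that sliding by exactly one hash length keeps every boundary aligned with a whole data block.

```python
import hashlib
from typing import List

HASH_LEN = 20  # SHA-1 digest length in bytes, used as the sliding step


def group_blocks(blocks: List[bytes], divisor: int = 4, max_group: int = 8) -> List[List[bytes]]:
    """Partition the concatenated block hashes and group blocks accordingly."""
    # The concatenated digests form the "hash data to be partitioned".
    stream = b"".join(hashlib.sha1(b).digest() for b in blocks)
    groups: List[List[bytes]] = []
    current: List[bytes] = []
    # Slide one whole hash at a time so a boundary never splits a hash value.
    for offset in range(0, len(stream), HASH_LEN):
        window = stream[offset:offset + HASH_LEN]
        current.append(blocks[offset // HASH_LEN])
        is_boundary = int.from_bytes(window[-4:], "big") % divisor == 0
        if is_boundary or len(current) >= max_group:
            groups.append(current)
            current = []
    if current:
        groups.append(current)  # trailing blocks form the last group
    return groups


blocks = [bytes([i]) * 16 for i in range(20)]
groups = group_blocks(blocks)
print([len(g) for g in groups])
```

Because the boundary depends only on hash content, identical runs of data blocks tend to fall into identical groups across different inputs, which is what makes the group-level similarity hash meaningful.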
In an optional implementation, the program stored in the memory 52 performs the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group as follows: it performs a hash operation on each data block in the first data group to obtain the hash value of each data block; replaces each 0 in the hash value of each data block with -1; adds the hash values of all data blocks in the first data group position by position; maps each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and takes the resulting binary value as the hash value of the first data group.
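The similarity hash described above can be sketched as follows. The function name and the small 4-bit inputs are assumptions for illustration; the per-block hashes are passed in as integers so that the sketch does not commit to any particular hash function.

```python
from typing import List


def group_similarity_hash(block_hashes: List[int], bits: int) -> int:
    """Combine per-block hash values into one group hash by signed bit voting."""
    totals = [0] * bits
    for h in block_hashes:
        for pos in range(bits):
            bit = (h >> (bits - 1 - pos)) & 1
            # A 1 bit votes +1; a 0 bit is treated as -1.
            totals[pos] += 1 if bit else -1
    result = 0
    for total in totals:
        # Sums greater than 0 map to 1; sums less than or equal to 0 map to 0.
        result = (result << 1) | (1 if total > 0 else 0)
    return result


# Position-wise sums for 1100, 1010, 1001 are +3, -1, -1, -1.
print(format(group_similarity_hash([0b1100, 0b1010, 0b1001], bits=4), "04b"))
```

Changing one data block in a group flips at most a few position-wise sums, so groups that share most of their blocks end up with hash values that differ in few bits.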
In an optional implementation, the data storage space includes multiple storage areas, and the hash value storage table additionally stores the correspondence between the hash value of the second data group and the number of the storage area in which the second data group is located. On this basis, the program stored in the memory 52 performs duplicate block retrieval on the data blocks in the first data group as follows: it obtains from the hash value storage table the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0, and loads the data blocks and data block hash values in the storage area numbered n into memory; it then compares the data blocks in the first data group with the data blocks having the same hash value in the storage area numbered n, thereby completing duplicate block retrieval for the data blocks in the first data group.
Optionally, the program stored in the memory 52 is further used to load the data blocks and data block hash values in the storage area numbered (n+1) into memory at the same time as it loads the data blocks and data block hash values in the storage area numbered n. On this basis, the comparison performed by the program covers both areas: the program compares the data blocks in the first data group with the data blocks having the same hash value in the storage areas numbered n and (n+1), thereby completing duplicate block retrieval for the data blocks in the first data group.
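Duplicate block retrieval against storage areas n and (n+1) can be sketched as follows. The in-memory dictionary standing in for the data storage space, the function names, and SHA-1 as the block hash are all assumptions for illustration, and the two areas are loaded one after the other here rather than concurrently.

```python
import hashlib
from typing import Dict, List, Tuple

# Hypothetical layout: area number -> {block hash: block bytes}.
Storage = Dict[int, Dict[bytes, bytes]]


def load_area(storage: Storage, number: int) -> Dict[bytes, bytes]:
    """Load one storage area into memory; an absent area loads as empty."""
    return dict(storage.get(number, {}))


def find_duplicates(group: List[Tuple[bytes, bytes]], storage: Storage, n: int) -> List[bool]:
    """Flag each (hash, data) block of the group that already exists in area n or n+1."""
    cache: Dict[bytes, bytes] = {}
    for number in (n, n + 1):
        cache.update(load_area(storage, number))
    flags = []
    for block_hash, data in group:
        stored = cache.get(block_hash)
        # Only blocks with the same hash value are compared byte for byte.
        flags.append(stored is not None and stored == data)
    return flags


def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


storage = {3: {h(b"aaa"): b"aaa"}, 4: {h(b"bbb"): b"bbb"}}
group = [(h(b"aaa"), b"aaa"), (h(b"bbb"), b"bbb"), (h(b"ccc"), b"ccc")]
print(find_duplicates(group, storage, n=3))
```

Loading area (n+1) alongside area n costs one extra read but lets blocks that were written just past the end of area n be matched without a second lookup.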
In an optional implementation, to obtain from the hash value storage table the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold, the program stored in the memory 52 specifically takes as the first hash value a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data group is greater than or equal to a preset number.
In an optional implementation, to select a hash value whose number of identical bits at corresponding positions is greater than or equal to the preset number, the program stored in the memory 52 specifically obtains the Hamming distance between the hash value of the first data group and each hash value in the hash value storage table, and takes as the first hash value a hash value in the table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
The duplicate data retrieval device provided by this embodiment of the present invention can be used to execute the flow of the duplicate data retrieval method shown in FIG. 1. Its specific working principle is not repeated here; see the description of the method embodiments for details.
The duplicate data retrieval device provided by this embodiment first partitions the received data into data blocks and then groups the blocks. It performs a similarity hash operation on the data blocks in a data group to obtain the hash value of the data group. It then obtains, from among the hash values of the data groups already stored in the data storage space (as recorded in the hash value storage table), a first hash value whose similarity to the hash value of the data group is greater than or equal to the preset first similarity threshold, and determines whether the similarity between the hash value of the data group and the first hash value is greater than or equal to the preset second similarity threshold. If it is, the data blocks in the data group are to a large extent duplicate blocks, and duplicate block retrieval is performed on them. Because the hash value storage table stores the correspondence between data groups already stored in the data storage space and their hash values, and the number of data groups is relatively small, querying the hash value storage table is efficient. In addition, performing duplicate block retrieval on the basis of data groups reduces the number of duplicate block retrievals, that is, the number of disk interactions, which helps improve the efficiency of duplicate block queries and thus the overall performance of the deduplication technique.
An embodiment of the present invention provides a computer program product. The computer program product includes a computer-readable storage medium configured to store a program. As shown in FIG. 6, the program includes:
A block obtaining unit 81, configured to partition received data to obtain at least two data blocks.
A group obtaining unit 82, connected to the block obtaining unit 81 and configured to group the at least two data blocks obtained by the block obtaining unit 81 to obtain at least one data group, each data group including at least one data block.
A hash calculation unit 83, connected to the group obtaining unit 82 and configured to, for a first data group among the at least one data group obtained by the group obtaining unit 82, perform a similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group, and obtain from the hash value storage table a first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold. The hash value storage table stores the correspondence between the hash value of a second data group already stored in the data storage space and that second data group; the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group. The first data group is any one of the at least one data group.
A duplicate retrieval unit 84, connected to the hash calculation unit 83 and configured to perform duplicate block retrieval on the data blocks in the first data group when the similarity between the hash value of the first data group and the first hash value is greater than or equal to the preset second similarity threshold.
In an optional implementation, as shown in FIG. 6, the duplicate data retrieval device of this embodiment further includes a storage unit 85. The storage unit 85 is connected to the hash calculation unit 83 and is configured to, when the similarity between the hash value of the first data group and the first hash value obtained by the hash calculation unit 83 is less than the second similarity threshold, store the data blocks in the first data group and the hash values of those data blocks into the data storage space, and store the correspondence between the hash value of the first data group and the first data group into the hash value storage table.
It should be noted that the hash calculation unit 83, the duplicate retrieval unit 84, and the storage unit 85 perform the same operations for every data group.
In an optional implementation, the group obtaining unit 82 may specifically be configured to: form the hash data to be partitioned from the hash values of the at least two data blocks obtained by the block obtaining unit 81; apply a chunking algorithm to the hash data to be partitioned, using the length of the hash value of each of the at least two data blocks as the sliding step, to obtain at least one hash-value block; and take the data blocks corresponding to the hash values that belong to the same hash-value block as one data group, thereby obtaining at least one data group.
In an optional implementation, the hash calculation unit 83 performs the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group as follows: it performs a hash operation on each data block in the first data group to obtain the hash value of each data block; replaces each 0 in the hash value of each data block with -1; adds the hash values of all data blocks in the first data group position by position; maps each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0; and takes the resulting binary value as the hash value of the first data group.
In an optional implementation, the data storage space includes multiple storage areas; correspondingly, the hash value storage table additionally stores the correspondence between the hash value of the second data group and the number of the storage area in which the second data group is located. On this basis, the duplicate retrieval unit 84 may specifically be configured to: obtain from the hash value storage table the number n of the storage area corresponding to the first hash value, where n is an integer greater than or equal to 0; load the data blocks and data block hash values in the storage area numbered n into memory; and compare the data blocks in the first data group with the data blocks having the same hash value in the storage area numbered n, thereby completing duplicate block retrieval for the data blocks in the first data group.
In an optional implementation, the duplicate retrieval unit 84 is further configured to load the data blocks and data block hash values in the storage area numbered (n+1) into memory at the same time as it loads the data blocks and data block hash values in the storage area numbered n. On this basis, the comparison performed by the duplicate retrieval unit 84 covers both areas: the unit compares the data blocks in the first data group with the data blocks having the same hash value in the storage areas numbered n and (n+1), thereby completing duplicate block retrieval for the data blocks in the first data group.
In an optional implementation, to obtain from the hash value storage table the first hash value whose similarity to the hash value of the first data group is greater than or equal to the preset first similarity threshold, the hash calculation unit 83 may specifically take as the first hash value a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data group is greater than or equal to a preset number.
To select a hash value whose number of identical bits at corresponding positions is greater than or equal to the preset number, the hash calculation unit 83 may specifically obtain the Hamming distance between the hash value of the aforementioned data group and each hash value in the hash value storage table, and take as the first hash value a hash value in the table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
The duplicate data retrieval device provided by this embodiment of the present invention can be used to execute the flow of the duplicate data retrieval method shown in FIG. 1. Its specific working principle is not repeated here; see the description of the method embodiments for details.
The duplicate data retrieval device provided by this embodiment first partitions the received data into data blocks and then groups the blocks. It performs a similarity hash operation on the data blocks in a data group to obtain the hash value of the data group. It then obtains, from among the hash values of the data groups already stored in the data storage space (as recorded in the hash value storage table), a first hash value whose similarity to the hash value of the data group is greater than or equal to the preset first similarity threshold, and determines whether the similarity between the hash value of the data group and the first hash value is greater than or equal to the preset second similarity threshold. If it is, the data blocks in the data group are to a large extent duplicate blocks, and duplicate block retrieval is performed on them. Because the hash value storage table stores the correspondence between data groups already stored in the data storage space and their hash values, and the number of data groups is relatively small, querying the hash value storage table is efficient. In addition, performing duplicate block retrieval on the basis of data groups reduces the number of duplicate block retrievals, that is, the number of disk interactions, which helps improve the efficiency of duplicate block queries and thus the overall performance of the deduplication technique.
A person of ordinary skill in the art will understand that all or some of the steps of the foregoing method embodiments may be implemented by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the foregoing method embodiments. The storage medium includes any medium that can store program code, such as a ROM, a RAM, a magnetic disk, or an optical disc.
Finally, it should be noted that the foregoing embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, or some or all of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A duplicate data retrieval method, characterized by comprising:
partitioning received data to obtain at least two data blocks;
grouping the at least two data blocks to obtain at least one data group, each data group comprising at least one data block;
for a first data group among the at least one data group, performing a similarity hash operation on the data blocks in the first data group to obtain a hash value of the first data group, and obtaining, from a hash value storage table, a first hash value whose similarity to the hash value of the first data group is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between a hash value of a second data group already stored in a data storage space and the second data group, the hash value of the second data group is obtained by performing the similarity hash operation on the data blocks in the second data group, and the first data group is any one of the at least one data group; and
if the similarity between the hash value of the first data group and the first hash value is greater than or equal to a preset second similarity threshold, performing duplicate block retrieval on the data blocks in the first data group.
2. The duplicate data retrieval method according to claim 1, characterized by further comprising:
if the similarity between the hash value of the first data group and the first hash value is less than the second similarity threshold, storing the data blocks in the first data group and the hash values of the data blocks in the first data group into the data storage space, and storing the correspondence between the hash value of the first data group and the first data group into the hash value storage table.
3. The duplicate data retrieval method according to claim 1 or 2, characterized in that grouping the at least two data blocks to obtain at least one data group comprises:
forming hash data to be partitioned from the hash value of each of the at least two data blocks; and applying a chunking algorithm to the hash data to be partitioned, using the length of the hash value of any one of the data blocks as the sliding step, to obtain at least one hash-value block; and
taking the data blocks corresponding to the hash values that belong to the same hash-value block as one data group.
4. The duplicate data retrieval method according to any one of claims 1 to 3, characterized in that performing the similarity hash operation on the data blocks in the first data group to obtain the hash value of the first data group comprises:
performing a hash operation on each data block in the first data group to obtain a hash value of each data block in the first data group; and
replacing each 0 in the hash value of each data block in the first data group with -1, adding the hash values of all data blocks in the first data group position by position, mapping each position whose sum is greater than 0 to 1 and each position whose sum is less than or equal to 0 to 0, and taking the resulting binary value as the hash value of the first data group.
5. The duplicate data retrieval method according to any one of claims 1 to 4, wherein the data storage space comprises a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located; and
performing duplicate block retrieval on the data blocks in the first data packet comprises:
obtaining, from the hash value storage table, the number n of the storage area corresponding to the first hash value, and loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and
comparing those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
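A minimal sketch of the lookup in claim 5, under an assumed in-memory model: `hash_table` maps a packet hash to a storage area number, and `regions` maps an area number to a dictionary from block hash to block. Both structures and the function name `find_duplicates` are hypothetical; the claim only requires that the table record the area number and that the area's blocks and block hashes be loaded before comparison.

```python
import hashlib


def find_duplicates(blocks, first_hash, hash_table, regions):
    """Return the blocks of the incoming packet already present in area n."""
    n = hash_table[first_hash]   # area number stored alongside the packet hash
    loaded = dict(regions[n])    # stands in for loading area n into memory
    duplicates = []
    for block in blocks:
        digest = hashlib.sha1(block).digest()
        stored = loaded.get(digest)
        # Equal hashes select candidates; the byte comparison confirms them.
        if stored is not None and stored == block:
            duplicates.append(block)
    return duplicates
```

Only one storage area is read per packet, so the cost of the block-level comparison is bounded by the size of an area rather than by the size of the whole store.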
6. The duplicate data retrieval method according to claim 5, wherein the method further comprises:
while loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, also loading the data blocks and the hash values of the data blocks in the storage area numbered (n+1) into memory; and
comparing those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet, comprises:
comparing those data blocks in the first data packet and in the storage areas numbered n and (n+1) that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
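The simultaneous loading in claim 6 might look like the sketch below, which uses two worker threads to read areas n and n+1 at the same time. The `load_region` helper and the dictionary-based `regions` store are hypothetical stand-ins for real storage I/O, and treating a missing area n+1 as empty is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
import hashlib


def load_region(regions, number):
    """Hypothetical loader: copy one area's {block_hash: block} map into memory."""
    return dict(regions.get(number, {}))


def find_duplicates_with_prefetch(blocks, n, regions):
    """Compare the incoming blocks against areas n and n+1, loaded concurrently."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_n = pool.submit(load_region, regions, n)
        fut_next = pool.submit(load_region, regions, n + 1)
        # Merge both areas; entries from area n win if a hash appears in both.
        candidates = {**fut_next.result(), **fut_n.result()}
    duplicates = []
    for block in blocks:
        stored = candidates.get(hashlib.sha1(block).digest())
        if stored is not None and stored == block:
            duplicates.append(block)
    return duplicates
```

The likely motivation is that a packet may straddle the end of area n, so having area n+1 already in memory avoids a second round of I/O for the remaining blocks.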
7. The duplicate data retrieval method according to any one of claims 1 to 6, wherein obtaining, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises:
obtaining, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to a preset number.
8. The duplicate data retrieval method according to claim 7, wherein obtaining, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to the preset number comprises:
obtaining the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and using as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
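Claims 7 and 8 state the same test in two forms: for hashes of a fixed width, the number of identical bits equals the width minus the Hamming distance. A minimal sketch, with the function names and the linear scan over the table chosen for illustration only:

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


def matching_bits(a: int, b: int, bits: int) -> int:
    """Number of identical bits at corresponding positions (claim 7)."""
    return bits - hamming_distance(a, b)


def similar_hashes(query: int, stored_hashes, max_distance: int):
    """Stored hashes within the preset Hamming distance of the query (claim 8)."""
    return [h for h in stored_hashes if hamming_distance(query, h) <= max_distance]
```

With 64-bit hashes, a preset number of 61 identical bits in claim 7 corresponds to a Hamming distance threshold of 3 in claim 8.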
9. A duplicate data retrieval device, comprising:
a block acquisition module, configured to partition received data into blocks to obtain at least two data blocks;
a packet acquisition module, configured to group the at least two data blocks obtained by the block acquisition module to obtain at least one data packet, each data packet comprising at least one data block;
a hash calculation module, configured to: for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet; and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between a hash value of a second data packet already stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and
a duplicate retrieval module, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
10. The duplicate data retrieval device according to claim 9, further comprising:
a storage module, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of the data blocks in the first data packet into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
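The branch between the duplicate retrieval module of claim 9 and the storage module of claim 10 can be sketched as follows. This is a simplification under stated assumptions: the closest stored hash stands in for the first hash value, similarity is measured as the number of identical bits, `hash_table` maps a packet hash to the list of its block hashes, and `storage` maps a block hash to a block. All of these names and structures are hypothetical.

```python
def matching_bits(a, b, bits=64):
    """Similarity measure assumed here: count of identical bit positions."""
    return bits - bin(a ^ b).count("1")


def process_group(group_hash, blocks, block_hashes, hash_table, storage,
                  second_threshold, bits=64):
    """Return "retrieve" if a sufficiently similar packet is stored, else store it."""
    # Closest stored packet hash, or None when the table is empty.
    best = max(hash_table, key=lambda h: matching_bits(group_hash, h, bits),
               default=None)
    if best is not None and matching_bits(group_hash, best, bits) >= second_threshold:
        return "retrieve"  # duplicate block retrieval would run against `best`
    # No similar packet: store the blocks with their hashes, then index the packet.
    for h, block in zip(block_hashes, blocks):
        storage[h] = block
    hash_table[group_hash] = list(block_hashes)
    return "stored"
```

A packet that falls below the threshold is written in full and becomes a comparison target for later packets, so the table grows only with packets that are not near-duplicates of stored ones.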
11. The duplicate data retrieval device according to claim 9 or 10, wherein the packet acquisition module is specifically configured to: form hash data to be partitioned from the hash value of each of the at least two data blocks; partition the hash data into blocks with a chunking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
12. The duplicate data retrieval device according to any one of claims 9 to 11, wherein the hash calculation module being configured to perform the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet comprises:
the hash calculation module being specifically configured to: perform a hash operation on each data block in the first data packet to obtain the hash value of each data block in the first data packet; replace each 0 in the hash value of each data block in the first data packet with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
13. The duplicate data retrieval device according to any one of claims 9 to 12, wherein the data storage space comprises a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located; and
the duplicate retrieval module is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value; load the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and compare those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
14. The duplicate data retrieval device according to any one of claims 9 to 13, wherein the duplicate retrieval module is further configured to, while loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, also load the data blocks and the hash values of the data blocks in the storage area numbered (n+1) into memory; and
the duplicate retrieval module being specifically configured to compare those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet, comprises:
the duplicate retrieval module being specifically configured to compare those data blocks in the first data packet and in the storage areas numbered n and (n+1) that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
15. The duplicate data retrieval device according to any one of claims 9 to 14, wherein the hash calculation module being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises:
the hash calculation module being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to a preset number.
16. The duplicate data retrieval device according to claim 15, wherein the hash calculation module being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to the preset number comprises:
the hash calculation module being specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
17. A duplicate data retrieval device, comprising a processor, a communication interface, a memory and a bus, wherein the processor, the communication interface and the memory communicate with one another through the bus;
the communication interface is configured to receive data;
the processor is configured to execute a program;
the memory is configured to store the program;
wherein the program is configured to: partition the data received by the communication interface into blocks to obtain at least two data blocks; group the at least two data blocks to obtain at least one data packet, each data packet comprising at least one data block; for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet; obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between a hash value of a second data packet already stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and, if the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold, perform duplicate block retrieval on the data blocks in the first data packet.
18. The duplicate data retrieval device according to claim 17, wherein the program is further configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of the data blocks in the first data packet into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
19. The duplicate data retrieval device according to claim 17 or 18, wherein the program being configured to group the at least two data blocks to obtain at least one data packet comprises:
the program being specifically configured to: form hash data to be partitioned from the hash value of each of the at least two data blocks; partition the hash data into blocks with a chunking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
20. The duplicate data retrieval device according to any one of claims 17 to 19, wherein the program being configured to perform the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet comprises:
the program being specifically configured to: perform a hash operation on each data block in the first data packet to obtain the hash value of each data block in the first data packet; replace each 0 in the hash value of each data block in the first data packet with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
21. The duplicate data retrieval device according to any one of claims 17 to 20, wherein the data storage space comprises a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located; and
the program being configured to perform duplicate block retrieval on the data blocks in the first data packet comprises:
the program being specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value; load the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and compare those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
22. The duplicate data retrieval device according to claim 21, wherein the program is further configured to, while loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, also load the data blocks and the hash values of the data blocks in the storage area numbered (n+1) into memory; and
the program being specifically configured to compare those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet, comprises:
the program being specifically configured to compare those data blocks in the first data packet and in the storage areas numbered n and (n+1) that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
23. The duplicate data retrieval device according to any one of claims 17 to 22, wherein the program being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises:
the program being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to a preset number.
24. The duplicate data retrieval device according to claim 23, wherein the program being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to the preset number comprises:
the program being specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
25. A computer program product, comprising a computer-readable storage medium configured to store a program, the program comprising:
a block acquisition unit, configured to partition received data into blocks to obtain at least two data blocks;
a packet acquisition unit, configured to group the at least two data blocks obtained by the block acquisition unit to obtain at least one data packet, each data packet comprising at least one data block;
a hash calculation unit, configured to: for a first data packet of the at least one data packet, perform a similarity hash operation on the data blocks in the first data packet to obtain a hash value of the first data packet; and obtain, from a hash value storage table, a first hash value whose similarity to the hash value of the first data packet is greater than or equal to a preset first similarity threshold, wherein the hash value storage table stores a correspondence between a hash value of a second data packet already stored in a data storage space and the second data packet, the hash value of the second data packet is obtained by performing the similarity hash operation on the data blocks in the second data packet, and the first data packet is any one of the at least one data packet; and
a duplicate retrieval unit, configured to perform duplicate block retrieval on the data blocks in the first data packet when the similarity between the hash value of the first data packet and the first hash value is greater than or equal to a preset second similarity threshold.
26. The computer program product according to claim 25, wherein the program further comprises:
a storage unit, configured to, when the similarity between the hash value of the first data packet and the first hash value is less than the second similarity threshold, store the data blocks in the first data packet and the hash values of the data blocks in the first data packet into the data storage space, and store the correspondence between the hash value of the first data packet and the first data packet into the hash value storage table.
27. The computer program product according to claim 25 or 26, wherein the packet acquisition unit is specifically configured to: form hash data to be partitioned from the hash value of each of the at least two data blocks; partition the hash data into blocks with a chunking algorithm, using the length of the hash value of any one data block as the sliding step, to obtain at least one hash value block; and take the data blocks corresponding to the hash values that belong to the same hash value block as one data packet.
28. The computer program product according to any one of claims 25 to 27, wherein the hash calculation unit being configured to perform the similarity hash operation on the data blocks in the first data packet to obtain the hash value of the first data packet comprises:
the hash calculation unit being specifically configured to: perform a hash operation on each data block in the first data packet to obtain the hash value of each data block in the first data packet; replace each 0 in the hash value of each data block in the first data packet with -1; add the corresponding bits of the hash values of all data blocks in the first data packet; map each bit position whose sum is greater than 0 to 1 and each bit position whose sum is less than or equal to 0 to 0; and use the resulting binary value as the hash value of the first data packet.
29. The computer program product according to any one of claims 25 to 28, wherein the data storage space comprises a plurality of storage areas; the hash value storage table further stores a correspondence between the hash value of the second data packet and the number of the storage area in which the second data packet is located; and
the duplicate retrieval unit is specifically configured to: obtain, from the hash value storage table, the number n of the storage area corresponding to the first hash value; load the data blocks and the hash values of the data blocks in the storage area numbered n into memory, where n is an integer greater than or equal to 0; and compare those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
30. The computer program product according to any one of claims 25 to 29, wherein the duplicate retrieval unit is further configured to, while loading the data blocks and the hash values of the data blocks in the storage area numbered n into memory, also load the data blocks and the hash values of the data blocks in the storage area numbered (n+1) into memory; and
the duplicate retrieval unit being specifically configured to compare those data blocks in the first data packet and in the storage area numbered n that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet, comprises:
the duplicate retrieval unit being specifically configured to compare those data blocks in the first data packet and in the storage areas numbered n and (n+1) that have the same hash value, to complete the duplicate block retrieval on the data blocks in the first data packet.
31. The computer program product according to any one of claims 25 to 30, wherein the hash calculation unit being configured to obtain, from the hash value storage table, the first hash value whose similarity to the hash value of the first data packet is greater than or equal to the preset first similarity threshold comprises:
the hash calculation unit being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to a preset number.
32. The computer program product according to claim 31, wherein the hash calculation unit being specifically configured to obtain, as the first hash value, a hash value in the hash value storage table for which the number of identical bits at positions corresponding to the hash value of the first data packet is greater than or equal to the preset number comprises:
the hash calculation unit being specifically configured to obtain the Hamming distance between the hash value of the first data packet and each hash value in the hash value storage table, and to use as the first hash value a hash value in the hash value storage table whose Hamming distance is less than or equal to a preset Hamming distance threshold.
PCT/CN2012/083740 2012-10-30 2012-10-30 Duplicate data retrieval method and device WO2014067063A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2012/083740 WO2014067063A1 (en) 2012-10-30 2012-10-30 Duplicate data retrieval method and device
CN201280001989.7A CN103189867B (en) 2012-10-30 2012-10-30 Repeating data search method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2012/083740 WO2014067063A1 (en) 2012-10-30 2012-10-30 Duplicate data retrieval method and device

Publications (1)

Publication Number Publication Date
WO2014067063A1 true WO2014067063A1 (en) 2014-05-08

Family

ID=48679810

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2012/083740 WO2014067063A1 (en) 2012-10-30 2012-10-30 Duplicate data retrieval method and device

Country Status (2)

Country Link
CN (1) CN103189867B (en)
WO (1) WO2014067063A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202212A (en) * 2016-06-28 2016-12-07 微梦创科网络科技(中国)有限公司 A kind of method and system realizing data fractionation based on data server cluster

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103189867B (en) * 2012-10-30 2016-05-25 华为技术有限公司 Repeating data search method and equipment
WO2015042909A1 (en) * 2013-09-29 2015-04-02 华为技术有限公司 Data processing method, system and client
CN103858125B (en) * 2013-12-17 2015-12-30 华为技术有限公司 Repeating data disposal route, device and memory controller and memory node
CN105843859B (en) * 2016-03-17 2019-05-24 华为技术有限公司 The method, apparatus and equipment of data processing
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN107644081A (en) * 2017-09-21 2018-01-30 锐捷网络股份有限公司 Data duplicate removal method and device
CN110134544A (en) * 2018-02-08 2019-08-16 广东亿迅科技有限公司 The method and its system of datamation backup
CN108763270A (en) * 2018-04-07 2018-11-06 长沙开雅电子科技有限公司 A kind of data de-duplication Hash table Realization of Storing
CN108875062B (en) * 2018-06-26 2021-07-23 北京奇艺世纪科技有限公司 Method and device for determining repeated video
CN109670153B (en) * 2018-12-21 2023-11-17 北京城市网邻信息技术有限公司 Method and device for determining similar posts, storage medium and terminal
CN110909019B (en) * 2019-11-14 2022-04-08 湖南赛吉智慧城市建设管理有限公司 Big data duplicate checking method and device, computer equipment and storage medium
CN113472609B (en) * 2020-05-25 2024-03-19 汪永强 Data repeated sending marking system for wireless communication
CN114064621B (en) * 2021-10-28 2022-07-15 江苏未至科技股份有限公司 Method for judging repeated data
CN114817230A (en) * 2022-06-29 2022-07-29 深圳市乐易网络股份有限公司 Data stream filtering method and system

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882141A (en) * 2009-05-08 2010-11-10 北京众志和达信息技术有限公司 Method and system for implementing repeated data deletion
CN101887457A (en) * 2010-07-02 2010-11-17 杭州电子科技大学 Content-based copy image detection method
CN102467572A (en) * 2010-11-17 2012-05-23 英业达股份有限公司 Data block inquiring method for supporting data de-duplication program
WO2012092212A2 (en) * 2010-12-28 2012-07-05 Microsoft Corporation Using index partitioning and reconciliation for data deduplication
US20120233135A1 (en) * 2011-01-17 2012-09-13 Quantum Corporation Sampling based data de-duplication
CN103189867A (en) * 2012-10-30 2013-07-03 华为技术有限公司 Duplicated data search method and equipment

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7469241B2 (en) * 2004-11-30 2008-12-23 Oracle International Corporation Efficient data aggregation operations using hash tables
US9245007B2 (en) * 2009-07-29 2016-01-26 International Business Machines Corporation Dynamically detecting near-duplicate documents
CN102622365B (en) * 2011-01-28 2015-04-29 北京百度网讯科技有限公司 Judging system and judging method for web page repeating

Also Published As

Publication number Publication date
CN103189867B (en) 2016-05-25
CN103189867A (en) 2013-07-03

Similar Documents

Publication Publication Date Title
WO2014067063A1 (en) Duplicate data retrieval method and device
US10592348B2 (en) System and method for data deduplication using log-structured merge trees
US10938961B1 (en) Systems and methods for data deduplication by generating similarity metrics using sketch computation
US9569357B1 (en) Managing compressed data in a storage system
US10303797B1 (en) Clustering files in deduplication systems
US9851917B2 (en) Method for de-duplicating data and apparatus therefor
US9298726B1 (en) Techniques for using a bloom filter in a duplication operation
JP6110517B2 (en) Data object processing method and apparatus
US10678654B2 (en) Systems and methods for data backup using data binning and deduplication
US10152389B2 (en) Apparatus and method for inline compression and deduplication
WO2013086969A1 (en) Method, device and system for finding duplicate data
JP2012525633A5 (en)
WO2017020576A1 (en) Method and apparatus for file compaction in key-value storage system
JP2013514558A (en) Storage system
WO2014094479A1 (en) Method and device for deleting duplicate data
US10838923B1 (en) Poor deduplication identification
CN108415671B (en) Method and system for deleting repeated data facing green cloud computing
CN103152430B (en) A kind of reduce the cloud storage method that data take up room
US10339124B2 (en) Data fingerprint strengthening
WO2014201696A1 (en) File reading method, storage device and reading system
CN108027713A (en) Data de-duplication for solid state drive controller
CN106980680B (en) Data storage method and storage device
US11366790B2 (en) System and method for random-access manipulation of compacted data files
US20220156233A1 (en) Systems and methods for sketch computation
WO2014117729A1 (en) Scalable data deduplication

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 12887672; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 12887672; Country of ref document: EP; Kind code of ref document: A1)