CN116601593A - Data compression device, data storage device and method for data compression and data de-duplication - Google Patents

Data compression device, data storage device and method for data compression and data de-duplication

Info

Publication number
CN116601593A
CN116601593A CN202080107962.0A CN202080107962A CN116601593A CN 116601593 A CN116601593 A CN 116601593A CN 202080107962 A CN202080107962 A CN 202080107962A CN 116601593 A CN116601593 A CN 116601593A
Authority
CN
China
Prior art keywords
data
data block
block
compressed
controller
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202080107962.0A
Other languages
Chinese (zh)
Inventor
阿萨夫·纳塔逊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Publication of CN116601593A

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3091 Data deduplication
    • H03M7/3095 Data deduplication using variable length segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0602 Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608 Saving storage space on storage systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0628 Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638 Organizing or formatting or addressing of data
    • G06F3/064 Management of blocks
    • G06F3/0641 De-duplication techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06 Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601 Interfaces specially adapted for storage systems
    • G06F3/0668 Interfaces specially adapted for storage systems adopting a particular infrastructure
    • G06F3/067 Distributed or networked storage systems, e.g. storage area networks [SAN], network attached storage [NAS]
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00 Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30 Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/60 General implementation details not specific to a particular type of compression
    • H03M7/6017 Methods or arrangements to increase the throughput
    • H03M7/6029 Pipelining

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a data compression device comprising a controller configured to: receive a data object to be compressed; indicate one or more data blocks in the data object; determine a hash value for each of the one or more data blocks; subsequently compress the data object; generate a header element that includes the hash value of each of the one or more data blocks; and append the header element to the compressed data object, wherein the header element is configured to indicate the one or more data blocks in the compressed data object.

Description

Data compression device, data storage device and method for data compression and data de-duplication
Technical Field
The present invention relates generally to the field of data compression and deduplication; and more particularly, to a data compression apparatus, a data storage apparatus, a method for data compression, and a method for storing data objects.
Background
In general, data backup is used to protect and recover data in the event of data loss in a primary storage system (e.g., a host server). For security reasons, a separate backup system or storage system is widely used to store backups of the data present in the primary storage system. Over time, the storage space of a conventional storage system is consumed by changing data and by the large volume of new data. This is not ideal because it reduces the performance of the storage system. Furthermore, the cost of data storage, including all associated costs such as storage hardware, remains a burden. Storage systems therefore widely employ deduplication to eliminate duplicate or redundant data without compromising the fidelity of the original data. In addition, data compression is widely used to store data on a storage system in a space-efficient format. In a storage system, data is typically stored in the form of data blocks.
Conventional systems typically perform both deduplication and compression. However, data compression is a local feature that works better on large blocks of data. Deduplication, on the other hand, is a global feature whose effectiveness drops when the data block size is too large, because the probability of finding similar blocks is then low. In other words, there is a tradeoff between deduplication and data compression: if the block size is large, the compression ratio is better but the deduplication rate is lower. Conventional systems are therefore unable to perform deduplication and compression of data reliably.
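The compression side of this tradeoff can be made concrete with a small experiment. The sketch below is illustrative only: it assumes zlib as the compressor and a made-up, repetitive sample payload, neither of which comes from the description. It compares the total compressed size when the same data is compressed in small blocks, in large blocks, and as a whole.

```python
import zlib

def per_block_size(data: bytes, block_size: int) -> int:
    """Total compressed size when each block is compressed on its own."""
    return sum(len(zlib.compress(data[i:i + block_size]))
               for i in range(0, len(data), block_size))

# Deterministic, repetitive sample data (hypothetical log-like records).
data = b"".join(f"record {i % 50}: status=ok\n".encode() for i in range(5000))

small_blocks = per_block_size(data, 256)    # small blocks: each one compresses poorly
large_blocks = per_block_size(data, 65536)  # large blocks: better ratio per block
whole_object = len(zlib.compress(data))     # one pass over the whole object

print(small_blocks, large_blocks, whole_object)
```

Small blocks give the compressor little history to match against, so their total is the largest of the three; the deduplication side of the tradeoff is not measured here.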
Thus, in light of the foregoing discussion, there is a need to overcome the above-described shortcomings associated with conventional methods and systems for deduplication and data compression.
Disclosure of Invention
The present invention aims to provide a data compression device, a data storage device, a method for data compression and a method for storing data objects. It seeks to address the existing problems of inefficient compression techniques and of compression ratios and deduplication rates that work against each other. An object of the invention is to provide a solution that at least partly overcomes the problems encountered in the prior art, and to provide an improved data compression device, data storage device, method for data compression and method for storing data objects that improve the performance and efficiency of deduplication and compression.
The object of the invention is achieved by the solution provided in the attached independent claims. Advantageous implementations of the invention are further defined in the dependent claims.
In one aspect, the present invention provides a data compression device comprising a controller configured to: receive a data object to be compressed; indicate one or more data blocks in the data object; determine a hash value for each of the one or more data blocks; subsequently compress the data object; generate a header element that includes the hash value for each of the one or more data blocks; and append the header element to the compressed data object, wherein the header element is configured to indicate the one or more data blocks in the compressed data object.
The data compression device of the invention achieves efficient, deduplication-friendly data compression. Accordingly, the data compression device compresses the data object at a high compression ratio without greatly affecting the deduplication rate of the data object. The data compression device may compress the data object as a whole rather than compressing each block of the data object individually. A controller of the data compression device receives the data object. The controller also indicates one or more data blocks in the data object, calculates a hash value for each of the data blocks, and then compresses the entire data object.
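A minimal sketch of this flow follows, under several assumptions that the description does not fix: fixed-size blocks, SHA-256 as the hash, zlib as the compressor, and a header that records block offsets in the uncompressed object (a simplification, since the description has the header indicate blocks in the compressed object). The function name `compress_with_header` is hypothetical.

```python
import hashlib
import zlib

def compress_with_header(data: bytes, block_size: int = 4096):
    """Hash each block first, then compress the whole object in one pass."""
    # Indicate blocks as (start, end) offsets into the uncompressed object.
    blocks = [(start, min(start + block_size, len(data)))
              for start in range(0, len(data), block_size)]
    # Hash values are computed on the uncompressed block contents.
    header = [{"start": s, "end": e,
               "hash": hashlib.sha256(data[s:e]).hexdigest()}
              for s, e in blocks]
    # The object is compressed as a whole, not block by block.
    payload = zlib.compress(data)
    return header, payload
```

Because the hashes are taken before compression, a later deduplication step can compare blocks without decompressing the payload.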
In one implementation, the controller is further configured to indicate one or more data blocks in the data object such that at least two of the one or more data blocks have different sizes.
The different sizes of the one or more data blocks allow the controller to obtain a high compression ratio of the data object while maintaining a high deduplication rate of the data object.
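One common way to obtain blocks of different sizes is to place boundaries according to the content itself. The sketch below is only an illustration of that general idea; the description does not say how block sizes are chosen, and the boundary test (sum of a sliding window masked against a constant), the parameter values, and the function name are all assumptions.

```python
def variable_size_blocks(data: bytes, window: int = 16,
                         mask: int = 0x3F, min_size: int = 32):
    """Return (start, end) offsets chosen by content, not by a fixed stride."""
    blocks, start = [], 0
    for i in range(len(data)):
        # Enforce a minimum block size before testing for a boundary.
        if i + 1 - start < min_size:
            continue
        # Hypothetical boundary test on the last `window` bytes.
        if sum(data[i + 1 - window:i + 1]) & mask == 0:
            blocks.append((start, i + 1))
            start = i + 1
    # Whatever remains after the last boundary forms the final block.
    if start < len(data):
        blocks.append((start, len(data)))
    return blocks
```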
In one implementation, the controller is further configured to determine whether a repeated data sequence extends from a first data block to a second data block, the first data block including a first portion of the repeated data sequence and the second data block including a second portion of the repeated data sequence; and if so: adjust the first data block to include the repeated data sequence; and adjust the second data block to include the repeated data sequence.
In one case, when a data object is divided into a first data block and a second data block, a repeated data sequence spans the boundary between the first data block and the second data block, and the repeated data sequence is treated as two separate repeated data sequences. Adjusting the first data block and the second data block in this way may slightly reduce the compression of the data blocks, but the data compression device still obtains a better compression ratio than compressing each block individually. Advantageously, this adjustment can be added to almost every compression algorithm and is therefore easy to implement.
In one implementation, the header element further includes a start indication for each data block and an end indication for each data block.
The header element thus records the start and end positions of each block of the compressed data object, from which the length of each block can be determined.
In one implementation, the controller is further configured to: receive a second data object; indicate one or more data blocks in the second data object; compress the second data object, wherein the compressed second data object is arranged to indicate the one or more data blocks; determine whether a compressed data block in the second data object corresponds to a compressed data block in the data object; and if so, replace one of the data blocks with a reference to the other data block.
The controller receives the second data object and compresses it such that the second data object is stored in the data compression device in a space-efficient manner. The controller also determines whether a compressed block of the second data object is identical to a compressed block of the data object. If so, the controller removes the duplicate compressed block of the second data object. Advantageously, the controller may thereby store the second data object in the data compression device in a space-efficient manner.
In one implementation, the controller is further configured to: generate a hash value for each of the data blocks in the second data object; and determine, based on the hash value generated for a data block, whether the compressed data block in the second data object corresponds to a compressed data block in the data object.
The hash value identifies a data block of the second data object. The hash value of a data block of the second data object is compared with the hash value of a compressed data block in the data object to prevent the data block of the second data object from being stored in duplicate in the data compression device.
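The comparison can be sketched as a lookup of the second object's block hashes among those of the first object. This is a simplified illustration assuming fixed-size blocks and SHA-256; the function names are hypothetical.

```python
import hashlib

def block_hashes(data: bytes, block_size: int):
    """SHA-256 digest of every fixed-size block of `data`."""
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def find_duplicates(first: bytes, second: bytes, block_size: int = 8):
    """Map block index in `second` to the index of a matching block in `first`."""
    known = {}
    for idx, digest in enumerate(block_hashes(first, block_size)):
        known.setdefault(digest, idx)  # keep the earliest occurrence
    return {idx: known[digest]
            for idx, digest in enumerate(block_hashes(second, block_size))
            if digest in known}
```

For example, with `first = b"AAAAAAAABBBBBBBBCCCCCCCC"` and `second = b"XXXXXXXXBBBBBBBBAAAAAAAA"`, blocks 1 and 2 of the second object match blocks 1 and 0 of the first.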
In one implementation, the controller is further configured to replace one compressed data block with another compressed data block by replacing the larger of the compressed data blocks with a reference to the smaller of the compressed data blocks.
The controller is further configured to determine whether the compressed data block of the second data object is smaller or larger than the corresponding compressed data block of the first data object. The controller replaces the larger compressed data block with a reference to the smaller compressed data block. In other words, the controller stores the compressed data block that occupies less memory space in the data compression device.
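As a rough sketch of this choice, the code below keeps whichever of two equivalent compressed blocks is smaller and turns the other into a reference. The `StoredBlock` record, the block names, and the dictionary-based store are assumptions made for illustration; both entries are assumed to hold a payload when `keep_smaller` is called.

```python
import zlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class StoredBlock:
    payload: Optional[bytes]   # compressed bytes, or None if this entry is a reference
    ref: Optional[str] = None  # name of the block that holds the payload

def keep_smaller(store: dict, a: str, b: str) -> None:
    """Replace the larger of two equivalent compressed blocks with a reference."""
    if len(store[a].payload) <= len(store[b].payload):
        keep, drop = a, b
    else:
        keep, drop = b, a
    store[drop] = StoredBlock(payload=None, ref=keep)

# Same content compressed two ways: level 0 (stored) and level 9.
raw = b"x" * 1000 + bytes(range(256))
store = {
    "obj1/blk0": StoredBlock(zlib.compress(raw, 0)),
    "obj2/blk0": StoredBlock(zlib.compress(raw, 9)),
}
keep_smaller(store, "obj1/blk0", "obj2/blk0")
```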
In one implementation, the controller is further configured to replace one compressed data block with another compressed data block by: determining which compressed data block has a higher decompression speed; and replacing the compressed data block with the lower decompression speed with a reference to the compressed data block with the higher decompression speed.
The controller is further configured to determine whether the compressed data blocks of the second data object are decompressed faster or slower than the corresponding compressed data blocks of the first data object. The controller replaces the data block that is slower to decompress with a reference to the faster one. In other words, the controller stores the compressed data blocks that require less time to decompress, thus reducing the time required for decompression.
In one implementation, the controller is further configured to replace one compressed data block with another compressed data block by: determining a compression dependency of the compressed data blocks in the data object and a compression dependency of the compressed data blocks in the second data object; and determining which data block to replace based on the compression dependencies.
By using the compression dependency, a data block having a lower compression dependency can be replaced, so that the decompression process can be simplified.
In one implementation, the controller is further configured to: obtain a compressed data element that includes a data object indicating one or more data blocks in the compressed data object; determine whether a compression dependency exists between a first data block and a second data block in the data object; and if so, decompress the second data block before decompressing the first data block.
The controller determines a dependency between a first data block and a second data block of the data object to enable the controller to decompress the complete data elements together instead of decompressing the individual data blocks.
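The ordering constraint ("decompress the second data block before the first") amounts to a topological ordering of the blocks by their dependencies. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the block names and the shape of the `depends_on` mapping are hypothetical.

```python
from graphlib import TopologicalSorter

def decompression_order(depends_on: dict) -> list:
    """Order blocks so each block comes after the blocks it depends on.

    `depends_on` maps a block name to the set of blocks that must be
    decompressed before it.
    """
    return list(TopologicalSorter(depends_on).static_order())

# block1 depends on block2; block3 depends on block1.
order = decompression_order({
    "block1": {"block2"},
    "block2": set(),
    "block3": {"block1"},
})
```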
In one implementation, the controller is further configured to determine that the second data block is included in the second data object and, in response, obtain the second data object.
By obtaining and subsequently decompressing the second data object, a second data block may be obtained. Advantageously, storing data objects in this manner enables space-efficient data storage without significantly affecting its processing time.
In one implementation, the one or more data blocks in the compressed data object include one or more compressed data blocks, and wherein a remaining portion of the one or more data blocks in the compressed data object are uncompressed, wherein the controller is configured to decompress the one or more compressed data blocks.
In one case, the data object obtained by the controller may include one or more data blocks, wherein some of the data blocks are compressed and the remainder of the data blocks are uncompressed. In this case, the controller is able to decompress the entire data object instead of decompressing individual data blocks of the data object.
In another aspect, the present invention provides a data storage device. The data storage device comprises a memory and a controller, wherein the memory is configured to store a data object and the controller is configured to: receive a data element to be stored, wherein the data element comprises a data object and a header element, the data object indicates one or more data blocks in the data object, and the header element comprises a hash value for each of the one or more data blocks; for each data block, determine whether a reference data block exists for the data block based on the hash value of the data block; and if so, replace one of the data blocks with a reference to the other data block.
The data storage device can deduplicate compressed data elements while achieving a high deduplication rate. The controller calculates a hash value of each of the first block and the second block. The hash values identify the one or more data blocks in the data object and are compared with the hash values of data blocks of other data elements to prevent the one or more data blocks of the data object from being stored in duplicate in the data compression device. The controller is configured to search for a match between the hash value of the one or more data blocks in the data object and the data blocks stored in the data storage device.
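A toy version of this lookup is sketched below: an in-memory store keyed by block hash that keeps one copy of each distinct block and records every object as a list of hashes. The `BlockStore` class, its method names, and the use of SHA-256 are illustrative assumptions, and compression is left out to keep the deduplication step visible.

```python
import hashlib

def sha256_hex(block: bytes) -> str:
    """Hash used as the block identifier in this sketch."""
    return hashlib.sha256(block).hexdigest()

class BlockStore:
    """Toy in-memory store that keeps one copy of each distinct block."""

    def __init__(self):
        self.blocks = {}   # hash -> block bytes (the reference data blocks)
        self.objects = {}  # object name -> ordered list of block hashes

    def put(self, name: str, blocks: list, hashes: list) -> int:
        """Store an object and return how many of its blocks were new."""
        new = 0
        for block, digest in zip(blocks, hashes):
            if digest not in self.blocks:
                self.blocks[digest] = block
                new += 1
        # Known blocks are represented only by their hash, i.e. a reference.
        self.objects[name] = list(hashes)
        return new

    def get(self, name: str) -> bytes:
        """Reassemble an object from its referenced blocks."""
        return b"".join(self.blocks[d] for d in self.objects[name])
```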
In another aspect, the present invention provides a data device. The data device comprises a data compression device and a data storage device.
The data device of this aspect achieves all the advantages and effects of the data compression device and the data storage device of the present invention.
In another aspect, the present invention provides a method for data compression. The method includes: receiving a data object to be compressed; indicating one or more data blocks in the data object; determining a hash value for each of the one or more data blocks; subsequently compressing the data object; generating a header element that includes the hash value of each of the one or more data blocks; and appending the header element to the compressed data object, wherein the header element is configured to indicate the one or more data blocks in the compressed data object.
The method of this aspect achieves all the advantages and effects of the data compression device of the present invention.
In one implementation, the invention provides a computer readable medium carrying computer instructions that, when loaded into and executed by a controller of a data compression device, enable the data compression device to implement the method.
The computer-readable medium carrying computer instructions of this aspect achieves all the advantages and effects of the data compression apparatus and method for data compression of the present invention.
In another aspect, the present invention provides a method for storing a data object. The method includes: receiving a data element to be stored, wherein the data element includes a data object and a header element, the data object indicates one or more data blocks in the data object, and the header element includes a hash value for each of the one or more data blocks; for each data block, determining whether a reference data block exists for the data block based on the hash value of the data block; and if so, replacing one of the data blocks with a reference to the other data block.
The method of this aspect achieves all the advantages and effects of the data storage device of the present invention.
In one implementation, the invention provides a computer-readable medium carrying computer instructions that, when loaded into and executed by a controller of a data storage device, enable the data storage device to implement the method.
The computer readable medium carrying computer instructions of this aspect achieves all the advantages and effects of the data storage device and method for storing data objects of the present invention.
It should be noted that all devices, elements, circuits, units and means described in the present application may be implemented in software or hardware elements or any type of combination thereof. All steps performed by the various entities described in the present application and functions to be performed by the various entities described are intended to mean that the respective entities are adapted to perform the respective steps and functions. Although in the following description of specific embodiments, specific functions or steps performed by external entities are not reflected in the description of specific detailed elements of the entity performing the specific steps or functions, it should be clear to a skilled person that these methods and functions may be implemented by corresponding hardware or software elements or any combination thereof. It will be appreciated that features of the application are susceptible to being combined in various combinations without departing from the scope of the application as defined by the accompanying claims.
Additional aspects, advantages, features and objects of the application will become apparent from the accompanying drawings and detailed description of illustrative implementations which are explained in connection with the following appended claims.
Drawings
The foregoing summary, as well as the following detailed description of illustrative embodiments, is better understood when read in conjunction with the appended drawings. For the purpose of illustrating the application, there is shown in the drawings exemplary constructions of the application. However, the application is not limited to the specific methods and instrumentalities disclosed herein. Moreover, those skilled in the art will appreciate that the drawings are not drawn to scale. Wherever possible, like elements are designated by like numerals.
Embodiments of the invention will now be described, by way of example only, with reference to the following figures, in which:
FIG. 1 is a block diagram of various exemplary components of a data device according to an embodiment of the present invention;
FIG. 2A is a block diagram of various exemplary components of a data compression device according to an embodiment of the present invention;
FIG. 2B is a block diagram of various exemplary components of a data storage device according to an embodiment of the present invention;
FIGS. 3A, 3B, and 3C are exemplary illustrations of various operations for compressing and decompressing data elements according to embodiments of the present invention;
FIG. 4 is an exemplary illustration of decompression of data elements of another embodiment of the present invention;
FIG. 5 is an exemplary illustration of deduplication of data elements of an embodiment of the present invention;
FIG. 6 is a flow chart of a method for data compression in accordance with an embodiment of the present invention;
FIG. 7 is a flow chart of a method for storing data objects according to an embodiment of the invention.
In the drawings, the underlined numbers are used to denote items where the underlined numbers are located or items adjacent to the underlined numbers. The non-underlined numbers relate to items identified by lines associating the non-underlined numbers with the items. When a number is not underlined and has an associated arrow, the number without the underline is used to identify the general item to which the arrow points.
Detailed Description
The following detailed description describes embodiments of the invention and the manner in which the embodiments may be practiced. While some modes for carrying out the invention have been disclosed, those skilled in the art will recognize that other embodiments for carrying out or practicing the invention may also exist.
FIG. 1 is a block diagram of various exemplary components of a data device according to another embodiment of the present invention. Referring to FIG. 1, a data device 100 is shown. The data device 100 includes a data compression device 102 and a data storage device 104. A network 106 is also shown.
In the data device 100, the data objects are compressed using a deduplication friendly compression algorithm for data objects. In other words, the data device 100 can achieve a better compression ratio than conventional systems without compromising the deduplication rate. The data device 100 is capable of compressing the entire data object together rather than compressing individual data blocks individually. Furthermore, the data device 100 is capable of decompressing the entire compressed data object together by determining dependencies between one or more data blocks of the data object, rather than decompressing individual data blocks individually. In addition, the data device 100 is capable of deduplicating compressed data objects at a high deduplication rate.
The data compression device 102 may comprise suitable logic, circuitry, interfaces and/or code that may be operable to compress and decompress data objects. The data compression device 102 may compress a data object as a whole rather than compressing each block of the data object individually. In addition, the data compression device 102 compresses the data object at a high compression ratio without having a significant impact on the deduplication rate of the data object. Furthermore, the data compression device 102 may decompress the entire data object together, rather than decompressing each block separately. The data compression device 102 may decompress the data object by determining dependencies between different blocks in the data object. Examples of the data compression device 102 include, but are not limited to, a server, a block storage-based computing device in a computer cluster (e.g., a massively parallel computer cluster), or a supercomputer. Various exemplary components of the data compression device 102 are described in detail with reference to FIG. 2A.
The data storage device 104 refers to a secondary storage device for backup. The data storage device 104 may comprise suitable logic, circuitry, interfaces and/or code that may be operable to process storage of data objects. The data storage device 104 may deduplicate the compressed data elements while achieving a high deduplication rate. The data storage device 104 determines a hash value for each block of the data object. The data storage device 104 determines whether a reference data block exists for a data block based on the hash value of the data block; if so, one of the data blocks is replaced with a reference to the other data block. Examples of the data storage device 104 include, but are not limited to, secondary storage servers, block storage-based computing devices in a computer cluster (e.g., a massively parallel computer cluster), block storage-based electronic devices, or supercomputers. Various exemplary components of the data storage device 104 are described in detail with reference to FIG. 2B.
The network 106 includes a medium (e.g., a communication channel) through which the data compression device 102 may communicate with the data storage device 104. Examples of the network 106 include, but are not limited to, a computer network in a computer cluster, a local area network (LAN), a cellular network, a wireless sensor network (WSN), a cloud network, a vehicle-to-network (V2N) network, a metropolitan area network (MAN), and/or the Internet.
FIG. 2A is a block diagram of various exemplary components of a data compression device in accordance with another embodiment of the present invention. Referring to FIG. 2A, a data compression device 102 is shown. The data compression device 102 includes a controller 202. In one implementation, the data compression device 102 also includes a memory 204 and a network interface 206.
The controller 202 is configured to execute instructions to control compression of data objects in the data compression device 102. Examples of the controller 202 include, but are not limited to, microprocessors, microcontrollers, complex instruction set computing (CISC) microprocessors, reduced instruction set computing (RISC) microprocessors, very long instruction word (VLIW) microprocessors, central processing units (CPUs), state machines, data processing units, and other processors or control circuits. Further, the controller 202 may refer to one or more separate processors, processing devices, or processing units that are part of a machine, such as the data compression device 102.
The memory 204 is the hardware or physical memory of the data compression device 102. The memory 204 is used to store instructions executable by the controller 202. The memory 204 may also include other known components for reading and writing data (not shown for simplicity), such as disk heads. Examples of implementations of the memory 204 may include, but are not limited to, hard disk drives (HDDs), solid-state drives (SSDs), backup storage disks, block storage units, or other computer storage media. The memory 204 may store an operating system and/or other program products (including one or more operating algorithms) to operate the data compression device 102.
The network interface 206 is an arrangement of interconnected programmable and/or non-programmable components that facilitate data transfer between one or more electronic devices. The network interface 206 may support the Internet Small Computer Systems Interface (iSCSI), Fibre Channel, or Fibre Channel over Ethernet (FCoE) protocols. The network interface 206 may also support communication protocols for one or more of a peer-to-peer network, a hybrid peer-to-peer network, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), all or a portion of a public network (e.g., the global computer network known as the Internet), a private network, or any other communication system or systems at one or more locations. In addition, the network interface 206 supports wired or wireless communications that may be performed via any number of known protocols, including but not limited to Internet Protocol (IP), Wireless Access Protocol (WAP), Frame Relay, or Asynchronous Transfer Mode (ATM). The network interface 206 may also employ and support any other suitable protocol using voice, video, data, or a combination thereof.
In operation, the controller 202 is configured to receive a data object to be compressed. The controller 202 receives data objects, for example, via the network interface 206. The controller 202 receives the data objects to compress the data objects such that the data objects are stored in the data compression device 102 in a space efficient manner.
The controller 202 is also operable to indicate one or more data blocks in the data object. The controller 202 indicates that the data object includes data blocks, e.g., a first data block and a second data block. The block sizes of the first data block and the second data block are determined such that the data object is compressed at a sufficient compression ratio without compromising deduplication of each data block of the data object.
According to one embodiment, the controller 202 is further configured to indicate one or more data blocks in the data object such that at least two of the one or more data blocks have different sizes. The controller 202 indicates that the sizes of each of the first data block and the second data block of the data object are different from each other. In other words, the data object comprises one or more data blocks of variable block size.
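The description does not name the algorithm that places the block boundaries. As a minimal sketch, assuming a content-defined boundary rule based on a rolling byte sum (the function name `indicate_blocks` and the parameters `window`, `mask`, `min_size`, and `max_size` are hypothetical and not taken from the description), variable-size blocks could be indicated as follows:

```python
from __future__ import annotations


def indicate_blocks(data: bytes, window: int = 16, mask: int = 0x3F,
                    min_size: int = 32, max_size: int = 512) -> list[tuple[int, int]]:
    """Return (start, end) offsets of variable-size blocks; end is exclusive.

    Assumed rule: a block ends where the sum of the last `window` bytes has
    all `mask` bits set, subject to a minimum and a maximum block size.
    """
    blocks: list[tuple[int, int]] = []
    start = 0
    rolling = 0  # sum of the most recent `window` bytes
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]
        size = i + 1 - start
        at_boundary = (rolling & mask) == mask
        if (at_boundary and size >= min_size) or size >= max_size:
            blocks.append((start, i + 1))
            start = i + 1
    if start < len(data):
        blocks.append((start, len(data)))  # trailing block, may be short
    return blocks
```

Because the boundary depends on the content rather than on a fixed offset, inserting bytes near the start of a data object tends to shift only nearby boundaries, which is the usual reason for choosing variable block sizes in deduplication.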
The controller 202 is also configured to determine a hash value for each of the one or more data blocks. The controller 202 calculates a hash value of each of the first data block and the second data block using a hash function. The hash value refers to a fixed size value representing the original data of the first data block and the second data block. Examples of hash functions include, but are not limited to, SHA-1, SHA-2, or MD5 algorithms. Advantageously, the hash value is capable of identifying the first data block and the second data block, and comparing the hash value of the first data block and the second data block with hash values of other data blocks of other data elements to prevent duplication of the first data block and the second data block in the data compression device 102.
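As an illustration only, per-block hash values can be computed with SHA-256 (a member of the SHA-2 family mentioned above) from Python's standard `hashlib` module. The function name `block_hashes` and the `(start, end)` block representation are assumptions made for this sketch.

```python
import hashlib


def block_hashes(data, blocks):
    """Return one hex digest per block; `blocks` holds (start, end) offsets."""
    return [hashlib.sha256(data[start:end]).hexdigest() for start, end in blocks]


# Two blocks with identical raw content receive the same hash value,
# which is what allows a duplicate to be detected without comparing bytes.
sample = b"abcdabcdxyz!"
digests = block_hashes(sample, [(0, 4), (4, 8), (8, 12)])
duplicate_found = digests[0] == digests[1]
```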
The controller 202 is also used to subsequently compress the data object. The controller 202 is configured to compress the data blocks (i.e., the first data block and the second data block) of the data object using a compression algorithm. The controller 202 compresses the data object as a single entity. In other words, the controller 202 is able to compress the entire data object, and therefore, each block (i.e., the first data block and the second data block) need not be compressed separately. Examples of compression algorithms may include, but are not limited to, Lempel-Ziv 77 (LZ77), LZR, Lempel-Ziv-Storer-Szymanski (LZSS), and the Lempel-Ziv Markov chain algorithm (LZMA).
According to one embodiment, the controller 202 is further configured to determine whether a repeated data sequence extends from a first data block to a second data block, the first data block comprising a first portion of the repeated data sequence and the second data block comprising a second portion of the repeated data sequence; and if so: adjust the first data block to include a repeated data sequence, and adjust the second data block to include a repeated data sequence. In one case, when the data object is divided into a first data block and a second data block, the repeated data sequence spans the boundary between the first data block and the second data block. In other words, the first data block comprises one portion (i.e., the first portion) of the repeated data sequence and the second data block comprises the other portion (i.e., the second portion). In this case, the cut point between the first data block and the second data block is given the property that the end of a block at the cut point is also the end of an encoded sequence. In other words, the repeated data sequence is treated as two separate repeated data sequences, one in each block. Thus, the first data block and the second data block are adjusted such that each comprises a complete repeated data sequence. This adjustment can be added to almost every compression algorithm and is therefore easy to implement. Although the adjustment may slightly reduce the compression of the data blocks, the compression ratio is still better than that obtained by compressing each block alone.
The controller 202 is further configured to generate a header element, wherein the header element includes a hash value for each of the one or more data blocks, and append the header element to the compressed data object, wherein the header element is configured to indicate the one or more data blocks in the compressed data object. The header element refers to supplemental data including information about each block of the data object. The header element is typically added at the beginning of the compressed data object, i.e. at the beginning of the first data block. The header element includes a list of each block of compressed data elements and its corresponding hash value.
According to one embodiment, the header element further comprises a start indication for each data block and an end indication for each data block. The header element also includes the start and end positions of each block of the compressed data object. Notably, the start and end positions of each block of the compressed data object are used to determine the length of each block of the compressed data object.
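The description does not fix a byte layout for the header element. The following sketch assumes one possible layout (a 4-byte big-endian length, then a JSON list of entries, then the compressed data object); the names `BlockEntry`, `build_header`, `attach_header`, and `split_element` are hypothetical.

```python
from __future__ import annotations

import json
import struct
from dataclasses import asdict, dataclass


@dataclass
class BlockEntry:
    digest: str  # hash value of the block
    start: int   # offset of the block's first byte in the compressed data object
    end: int     # offset one past the block's last byte


def build_header(digests: list[str], segments: list[bytes]) -> list[BlockEntry]:
    """Create one entry per block from its hash and its compressed bytes."""
    entries, offset = [], 0
    for digest, segment in zip(digests, segments):
        entries.append(BlockEntry(digest, offset, offset + len(segment)))
        offset += len(segment)
    return entries


def attach_header(entries: list[BlockEntry], segments: list[bytes]) -> bytes:
    """Place the serialized header in front of the compressed data object."""
    header = json.dumps([asdict(e) for e in entries]).encode("utf-8")
    return struct.pack(">I", len(header)) + header + b"".join(segments)


def split_element(element: bytes) -> tuple[list[BlockEntry], bytes]:
    """Recover the header entries and the compressed data object."""
    (header_len,) = struct.unpack(">I", element[:4])
    entries = [BlockEntry(**e) for e in json.loads(element[4:4 + header_len])]
    return entries, element[4 + header_len:]
```

With such a layout, the length of each block follows directly as `end - start`, and a reader can slice a single block out of the compressed data object without scanning it.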
According to one embodiment, the controller 202 is further configured to: receive a second data object; indicate one or more data blocks in the second data object; compress the second data object, wherein the compressed data object is arranged to indicate the one or more data blocks; determine whether a compressed data block in the second data object corresponds to a compressed data block in the data object; and if so, replace one data block with the other data block. For example, the controller 202 receives the second data object via the network interface 206. The controller 202 compresses the second data object so that the second data object may be stored in a space efficient manner. Examples of the second data object are similar to the examples of the data object. The controller 202 also indicates that the second data object includes data blocks, e.g., a third data block and a fourth data block. The block sizes of the third data block and the fourth data block are determined such that the second data object is compressed at a sufficient compression ratio without compromising deduplication of each data block of the second data object. The controller 202 is further configured to compress the data blocks of the second data object (i.e., the third data block and the fourth data block) using a compression algorithm such that the second data object may be stored in a space efficient format. Examples of compression algorithms may include, but are not limited to, Lempel-Ziv 77 (LZ77), LZR, Lempel-Ziv-Storer-Szymanski (LZSS), and the Lempel-Ziv Markov chain algorithm (LZMA). In storing the second data object, the controller 202 also determines whether any data block of the second data object is similar to a compressed block of the data object (i.e., the first data block or the second data block).
If a compressed block of the second data object is similar to a compressed block of the data object, the controller 202 removes the duplicate compressed block of the second data object. The controller 202 determines whether the compressed data blocks in the second data object are similar to the compressed data blocks in the data object by comparing their respective hash values. The controller 202 may maintain a pointer to the data block of the data object that holds the data of the removed data block of the second data object. The pointer may locate the data block on the data storage device 104. It should be appreciated that compression and similarity determination between two data objects may be performed at different locations, wherein the data compression device 102 compresses the data objects and the controller 202 determines the similarity between the data objects in the different compressed blocks when the compressed data blocks are stored at the data storage device 104.
According to one embodiment, the controller 202 is further configured to: generating a hash value for each of the data blocks in the second data object; based on the hash value generated for the data block, it is determined whether the compressed data block in the second data object corresponds to the compressed data block in the data object. The controller 202 calculates a hash value of each of the third data block and the fourth data block using a hash function. The hash value refers to a fixed size value representing the original data of the third data block and the fourth data block. Examples of hash functions include, but are not limited to, SHA-1, SHA-2, or MD5 algorithms. Advantageously, the hash value is capable of identifying the third data block and the fourth data block, and comparing the hash values of the third data block and the fourth data block with the hash values of the first data block and the second data block to prevent duplication of the third data block and the fourth data block in the data compression device 102.
According to one embodiment, the controller 202 is further configured to replace one compressed data block with another compressed data block by replacing a larger one of the compressed data blocks with a reference data block of a smaller one of the compressed data blocks. The controller 202 is further configured to determine whether the compressed data block of the second data object is smaller or larger than the corresponding compressed data block of the first data object. The controller 202 replaces the larger compressed data block with the smaller compressed data block. In other words, the controller 202 stores compressed data blocks that occupy less memory space in the data compression device 102. For example, if the hash value of the third data block of the second data object matches the hash value of the first data block, the controller 202 determines which of the third data block and the first data block is smaller. If the first data block is smaller than the third data block, the controller 202 stores the compressed first data block. The controller 202 also adds a pointer to the first data block to locate the first data block on the data compression device 102 without the need to store the third data block again. In another case, if the hash value of the third data block of the second data object does not match the hash value of the first data block, the controller 202 adds the second data object to the data compression device 102.
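The size-based replacement rule can be sketched as follows. This is a simplified illustration that assumes each compressed copy can be decoded on its own and that matching hash values imply identical raw data; the function name `deduplicate_by_size` and the use of a dictionary keyed by hash value are assumptions of the sketch.

```python
from __future__ import annotations


def deduplicate_by_size(store: dict[str, bytes], digest: str, compressed: bytes) -> str:
    """Keep only the smaller compressed copy for `digest`.

    Returns the key that serves as the reference (pointer) to the stored copy.
    """
    existing = store.get(digest)
    if existing is None or len(compressed) < len(existing):
        store[digest] = compressed  # first copy, or a smaller one replaces the larger
    return digest
```

A caller that receives the same hash value twice ends up with a single stored copy, the smaller of the two, and two references to it.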
According to one embodiment, the controller 202 is further configured to replace one compressed data block with another compressed data block by: determining which compressed data block has a higher decompression speed; and replacing the compressed data block with the lower decompression speed with the reference data block of the compressed data block with the higher decompression speed. The controller 202 is further configured to determine whether the compressed data blocks of the second data object are decompressed faster or slower than the corresponding compressed data blocks of the first data object. The controller 202 replaces the slower data block to be decompressed. In other words, the controller 202 stores compressed data blocks that require less time to decompress. For example, if the hash value of the third data block of the second data object matches the hash value of the first data block, the controller 202 determines which of the third data block and the first data block is decompressed faster. If the first data block is decompressed faster than the third data block, the controller 202 stores the compressed first data block.
According to one embodiment, the controller 202 is further configured to replace one compressed data block with another compressed data block by: determining a compression dependency relationship of the compressed data blocks in the data object and a compression dependency relationship of the compressed data blocks in the second data object; it is determined which data block to replace based on the compression dependency. As previously described, decompressing a given data block in a given data object may require decompressing another data object because a given data block may refer to a data block in another data object. Thus, the compression dependency of each of the data blocks in the data object and the second data object is determined to identify the data block having the lowest compression dependency. Here, the data block with the lowest compression dependency requires that the least number of data blocks in other data objects be decompressed before the data block is decompressed. Then, the data block with the lowest compression dependency is replaced.
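One way to quantify the compression dependency described above is to count, for a given block, how many blocks in other data objects must be decompressed first. The sketch below assumes blocks are identified by `(object_id, block_index)` tuples and that direct dependencies are given as a mapping; both the representation and the names `cross_object_dependencies` and `least_dependent` are hypothetical.

```python
def cross_object_dependencies(block, depends_on):
    """Count blocks in other data objects that `block` transitively depends on."""
    seen = set()
    stack = list(depends_on.get(block, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(depends_on.get(dep, ()))
    return sum(1 for dep in seen if dep[0] != block[0])


def least_dependent(candidates, depends_on):
    """Return the candidate with the lowest cross-object dependency count."""
    return min(candidates, key=lambda b: cross_object_dependencies(b, depends_on))
```

For two blocks with matching hash values, `least_dependent` identifies the copy that can be read with the fewest additional decompressions in other data objects.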
According to one embodiment, the controller 202 is further configured to: receiving a compressed data element for decompression, wherein the compressed data element comprises a data object indicating one or more data blocks in the compressed data object; determining whether a compression dependency exists between a first data block and a second data block in the data object; if so, the second data block is decompressed before the first data block is decompressed. For example, the controller 202 receives the compressed data elements via the network interface 206. The compressed data elements include one or more compressed data blocks, e.g., a first data block and a second data block. The controller 202 determines a dependency relationship between a first data block and a second data block of the data object to decompress the complete data element. Compression dependencies are how different blocks of a data object depend on each other for decompression by the controller 202. For example, if the first data block shows a dependency on the second data block, the second data block is decompressed before the first data block is decompressed. This is further explained in fig. 3, for example.
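The ordering rule (a block is decompressed only after the blocks it depends on) amounts to a depth-first traversal of the dependency relation. The sketch below assumes the dependencies are acyclic and supplied as a mapping from a block to the blocks it depends on; the function name `decompression_order` is hypothetical.

```python
def decompression_order(target, depends_on):
    """Return the blocks to decompress, prerequisites first, ending with `target`."""
    order, visited = [], set()

    def visit(block):
        if block in visited:
            return
        visited.add(block)
        for dep in depends_on.get(block, ()):
            visit(dep)  # decompress everything this block refers to first
        order.append(block)

    visit(target)
    return order


# The first data block depends on the second, so the second comes first.
example = decompression_order("first", {"first": ["second"]})
```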
According to one embodiment, the controller 202 is further configured to determine that the second data block is included in the second data object and, in response, obtain the second data object. Notably, the second data block has a compression dependency on one of the data blocks of the second data object (e.g., the first data block). Thus, the second data object needs to be decompressed to obtain the first data block therefrom and subsequently the second data block.
According to one embodiment, the one or more data blocks in the compressed data object comprise one or more compressed data blocks, and wherein a remaining portion of the one or more data blocks in the compressed data object are uncompressed, wherein the controller 202 is configured to decompress the one or more compressed data blocks. In one case, the data objects obtained by the controller 202 may include one or more data blocks, where some of the data blocks are compressed and the remainder of the data blocks are uncompressed. In this case, the controller 202 is able to decompress the entire data object instead of decompressing individual data blocks of the data object. The controller 202 decompresses the complete data object by determining the dependencies between each block of the data object. This is further explained in fig. 4, for example.
FIG. 2B is a block diagram of various exemplary components of a data storage device according to another embodiment of the present invention. Referring to FIG. 2B, a data storage device 104 is shown. The data storage 104 includes a memory 208 and a controller 210. In one implementation, the data storage device 104 also includes a network interface 212.
The controller 210 is configured to execute instructions to control the storage of data objects in the data storage 104. Examples of controller 210 include, but are not limited to, microprocessors, microcontrollers, complex instruction set computing (complex instruction set computing, CISC) microprocessors, reduced instruction set computing (reduced instruction set computing, RISC) microprocessors, very long instruction word (very long instruction word, VLIW) microprocessors, central processing units (central processing unit, CPUs), state machines, data processing units, and other processors or control circuits. Further, the controller 210 may refer to one or more separate processors, processing devices, processing units that are part of a machine, such as the data storage 104.
The memory 208 is used to store data objects. Memory 208 herein refers to hardware or physical memory used to store data objects in the data storage device 104. The memory 208 is also used to store instructions executable by the controller 210. Examples of implementations of memory 208 may include, but are not limited to, hard disk drives (HDDs), solid-state drives (SSDs), backup storage disks, block storage units, or other computer storage media. The memory 208 may store an operating system and/or other program products (including one or more operating algorithms) to operate the data storage device 104.
Network interface 212 is an arrangement of interconnected programmable and/or non-programmable components that facilitate the transfer of data between one or more electronic devices. The network interface 212 may support communication protocols for the Internet Small Computer Systems Interface (iSCSI), Fibre Channel, or Fibre Channel over Ethernet (FCoE) protocols. Network interface 212 may also support communication protocols for one or more of a peer-to-peer network, a hybrid peer-to-peer network, a local area network (LAN), a metropolitan area network (MAN), a wide area network (WAN), all or a portion of a public network (e.g., a global computer network known as the internet), a private network, or any other communication system or systems for one or more locations. In addition, network interface 212 supports wired or wireless communications that may be performed via any number of known protocols, including, but not limited to, Internet Protocol (IP), Wireless Access Protocol (WAP), frame relay, or Asynchronous Transfer Mode (ATM). In addition, the network interface 212 may also employ and support any other suitable protocol that uses voice, video, data, or a combination thereof.
In operation, the controller 210 is configured to: receiving a data element to be stored, wherein the data element comprises a data object and a header element, and wherein the data object indicates one or more data blocks in the data object and the header element comprises a hash value for each of the one or more data blocks; for each data block, determining whether the data block has a reference data block based on the hash value of the data block; if the result of the determination is yes, one data block is replaced with a reference data block of another data block. For example, the controller 210 receives data elements via the network interface 212 to be stored in the memory 208 of the data storage 104. For example, the data compression device 102 compresses the received data elements. The received data elements are compressed such that the controller 210 can deduplicate the data elements at a high deduplication rate. In other words, the received compressed data elements have a deduplication friendly format. The data elements include data objects and header elements. The data object includes one or more data blocks, e.g., a first data block and a second data block. Further, the controller 210 calculates a hash value of each of the first block and the second block using a hash algorithm. Examples of hash functions include, but are not limited to, SHA-1, SHA-2, or MD5. Advantageously, the hash value is capable of identifying the first data block and the second data block, and comparing the hash value of the first data block and the second data block with hash values of other data blocks of other data elements to prevent duplication of the first data block and the second data block in the data compression device 102. The header element refers to supplemental data including information about each block of the data object. The header element is typically added at the beginning of the compressed data object, i.e. at the beginning of the first data block. 
The header element includes a list of each block of compressed data elements and its corresponding hash value for deduplication. The header element also includes a start and end position of each block of the compressed data object to determine a length of each block of the compressed data object. The controller 210 is configured to search for a match between the hash values of the first data block and the second data block and the data blocks stored in the data storage 104. If there is a match, this means that the one data block is already stored on the data storage 104. Accordingly, the controller 210 removes the data block whose hash value matches the other data blocks stored in the data storage 104. Accordingly, the controller 210 is able to store data objects in the data storage 104 in a space efficient manner. In addition, the controller 210 replaces the data block whose hash value matches the other data block with the reference data block of the other data block. The reference data block may locate another data block on the data storage device 104. The reference data block does not require repeated storage of the one data block and repetition is significantly reduced. For example, the deduplication of data blocks is further explained in FIG. 5.
In another case, if the hash value of the one data block does not match the hash value of another data block stored in the data storage 104, it means that the one data block is not stored on the data storage 104. Thus, the one data block will be stored to the data storage 104. To read and decompress the one data block from the data storage 104 after the deduplication, the controller 210 first decompresses the other data block, and then decompresses the one data block. This is further explained in fig. 5, for example.
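The two cases above (hash match and no match) can be sketched with a small in-memory store. The class name `DedupStore` and its methods are hypothetical, block payloads are treated as opaque bytes, and compression dependencies between blocks are deliberately ignored in this simplified illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DedupStore:
    blocks: dict[str, bytes] = field(default_factory=dict)        # hash -> stored block
    elements: dict[str, list[str]] = field(default_factory=dict)  # element id -> block hashes

    def put(self, element_id: str, header_hashes: list[str], payloads: list[bytes]) -> int:
        """Store a data element; return the number of blocks newly written."""
        written = 0
        for digest, payload in zip(header_hashes, payloads):
            if digest not in self.blocks:  # no match: the block is stored
                self.blocks[digest] = payload
                written += 1
            # match: nothing is written, the hash acts as the reference
        self.elements[element_id] = list(header_hashes)
        return written

    def get(self, element_id: str) -> list[bytes]:
        """Resolve every reference of an element back to its stored block."""
        return [self.blocks[d] for d in self.elements[element_id]]
```

In this sketch the hash values come from the header element, so the store does not need to recompute them before looking for a match.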
Fig. 3A, 3B, and 3C are exemplary illustrations of various operations for compressing and decompressing data elements according to embodiments of the present invention. Fig. 3A to 3C are described in connection with the elements in fig. 1, 2A and 2B. Referring to FIG. 3A, a data object 302 is shown. The controller 202 receives the data object 302 for compression.
Referring to FIG. 3B, a data object 302 is shown in which one or more variable-size data blocks are indicated. The controller 202 indicates the data object 302 with a first data block 304, a second data block 306, and a third data block 308 using a variable block algorithm.
Referring to fig. 3C, a compressed data element 310 is shown. The controller 202 compresses the data object 302 using a compression algorithm to obtain the compressed data element 310. The compressed data element 310 includes a header element 312 and a data object 314. The data object 314 includes a compressed first data block 316, a compressed second data block 318, and a compressed third data block 320. With a better compression ratio than conventional compression devices, the controller 202 is able to compress the entire data object 302 without having to compress each of the first data block 304, the second data block 306, and the third data block 308 separately. Further, the arrows show the compression dependency of the compressed first data block 316, the compressed second data block 318, and the compressed third data block 320 in the compressed data element 310. For example, the arrows indicate that in order to decompress the compressed third data block 320, the compressed first data block 316 and the compressed second data block 318 must be decompressed first.
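The behaviour shown in fig. 3C can be approximated with Python's standard `zlib` module, used here only as a stand-in since the description does not name a library. In this sketch the whole data object goes through one compressor, and a `Z_SYNC_FLUSH` at each block boundary yields one byte-aligned compressed segment per block while keeping the history that later blocks may refer to. The function names are hypothetical.

```python
import zlib


def compress_as_single_stream(blocks):
    """Compress all blocks with one compressor; return one segment per block."""
    compressor = zlib.compressobj()
    segments = []
    for i, block in enumerate(blocks):
        last = i == len(blocks) - 1
        # Z_SYNC_FLUSH ends the segment on a byte boundary but keeps the
        # compressor's history, so later blocks can reference earlier ones.
        mode = zlib.Z_FINISH if last else zlib.Z_SYNC_FLUSH
        segments.append(compressor.compress(block) + compressor.flush(mode))
    return segments


def decompress_in_order(segments):
    """Decompress segments front to back, as the dependency arrows require."""
    decompressor = zlib.decompressobj()
    return [decompressor.decompress(segment) for segment in segments]
```

When blocks share content, the segments produced this way are together smaller than the same blocks compressed one by one, at the cost that a later segment cannot be decoded without the earlier ones.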
Fig. 4 is an exemplary illustration of decompression of data elements of an embodiment of the present invention. Fig. 4 is described in connection with the elements of fig. 1, 2A and 2B. Referring to fig. 4, a data element 402 is shown. The data element 402 includes a header element 404 and a data object 406. The data object 406 includes a compressed first data block 408, an uncompressed second data block 410, and a compressed third data block 412. The controller 202 treats the uncompressed second data block 410 as if it had already been decompressed by the controller 202. Accordingly, the controller 202 is configured to decompress the compressed first data block 408 and the compressed third data block 412 as parts of one complete data object, rather than decompressing the compressed first data block 408 and the compressed third data block 412 separately.
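A rough analogue of the fig. 4 case can be built with raw DEFLATE streams in Python's `zlib` module, again as an assumed stand-in rather than the method of the description. A block that arrives uncompressed is used directly as history: the decompressor for a later segment is primed with the preceding plain data through the `zdict` argument. The function names and the `WINDOW` constant are introduced for this sketch.

```python
import zlib

WINDOW = 32 * 1024  # DEFLATE back-references reach at most 32 KiB


def compress_raw_segments(blocks):
    """One raw DEFLATE stream, split into one segment per block."""
    compressor = zlib.compressobj(wbits=-15)
    segments = []
    for i, block in enumerate(blocks):
        mode = zlib.Z_FINISH if i == len(blocks) - 1 else zlib.Z_SYNC_FLUSH
        segments.append(compressor.compress(block) + compressor.flush(mode))
    return segments


def decompress_segment(segment, preceding_plaintext):
    """Decompress one segment given the plain data that precedes it."""
    history = preceding_plaintext[-WINDOW:]
    if history:
        # Prime the window with data that is already available uncompressed.
        decompressor = zlib.decompressobj(wbits=-15, zdict=history)
    else:
        decompressor = zlib.decompressobj(wbits=-15)
    return decompressor.decompress(segment)
```

For a data object whose second block is received uncompressed, the first segment is decompressed normally, the second block is taken as is, and the third segment is decompressed with the first two blocks as its history.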
FIG. 5 is an exemplary illustration of deduplication of data elements of an embodiment of the present invention. Fig. 5 is described in connection with the elements of fig. 1, 2A and 2B. Referring to fig. 5, a first data element 502 is shown. The first data element 502 is divided into a first data block 504, a second data block 506 and a third data block 508. Further, the controller 210 compresses the first data element 502 to obtain a compressed first data element 510. The compressed first data element 510 includes a first header element 512 and a first data object 514. The first header element 512 includes a hash value for each of the first data block 504, the second data block 506, and the third data block 508. The first data object 514 includes a compressed first data block 516, a compressed second data block 518, and a compressed third data block 520.
Referring to fig. 5, a second data element 522 is shown. The second data element 522 is divided into a fourth data block 524 and a fifth data block 526. The controller 210 compresses the second data element 522 to obtain a compressed second data element 528. The compressed second data element 528 includes a second header element 530 and a second data object 532. The second header element 530 includes hash values of the fourth data block 524 and the fifth data block 526. The second data object 532 includes a compressed fourth data block 534 and a compressed fifth data block 536.
In fig. 5, the fourth data block 524 is the same as the second data block 506. Thus, the controller 210 removes the fourth data block 524 from the second data element 522 and replaces it with a reference data block for the second data block 506. The reference data block may locate the second data block 506 on the data storage device 104. With the reference data block, the fourth data block 524 need not be stored again, and repetition is significantly reduced. It is noted that the compressed fifth data block 536 has a compression dependency on the compressed fourth data block 534, which is stored as a reference data block for the second data block 506. Thus, to read the compressed second data element 528 after the deduplication, the controller 210 first retrieves the fourth data block 524, which is stored in the data storage device 104 as a reference data block for the second data block 506. Accordingly, the controller 210 first reads the compressed first data element 510, specifically, the first header element 512 of the compressed first data element 510. The controller 210 then decompresses the compressed first data element 510 to obtain the second data block 506. Once the second data block 506 has been decompressed, the reference stored in place of the compressed fourth data block 534 can be resolved. Subsequently, the value of the fourth data block 524 is obtained and the compressed fifth data block 536 is decompressed.
Fig. 6 is a flow chart of a method 600 for data compression in accordance with an embodiment of the present invention. The method 600 is performed at a data compression device, for example, in fig. 2A. The method 600 includes steps 602 through 610.
At step 602, the method 600 includes receiving a data object to be compressed. For example, the controller 202 receives data objects via the network interface 206. The controller 202 receives the data objects to compress the data objects such that the data objects are stored in the data compression device 102 in a space efficient manner.
At step 604, method 600 further includes indicating one or more data blocks in the data object. The controller 202 indicates that the data object includes data blocks, e.g., a first data block and a second data block.
At step 606, method 600 further includes determining a hash value for each of the one or more data blocks. The controller 202 calculates a hash value of each of the first data block and the second data block using a hash function. The hash values identify the first data block and the second data block and are compared with hash values of other data blocks of other data elements to prevent duplication of the first data block and the second data block in the data compression device 102.
At step 608, the method 600 further includes subsequently compressing the data object. The controller 202 is configured to compress each of the data blocks (i.e., the first data block and the second data block) of the data object using a compression algorithm. The controller 202 compresses the data objects into a single entity. In other words, the controller 202 is able to compress the entire data object, and therefore, each block (i.e., the first data block and the second data block) need not be compressed separately.
The compression has the property that it compresses data into a single object, resulting in a higher compression ratio. The data can nevertheless be decompressed even if only part of the compressed object is received in compressed format and the rest is received in uncompressed format; i.e., if some of the data blocks are presented in compressed format and some in uncompressed format, the system is able to decompress the complete object.
At step 610, the method 600 further includes generating a header element, wherein the header element includes a hash value for each of the one or more data blocks, and appending the header element to the compressed data object, wherein the header element is configured to indicate the one or more data blocks in the compressed data object. The header element refers to supplemental data including information about each block of the data object. The header element is typically added at the beginning of the compressed data object, i.e. at the beginning of the first data block. The header element includes a list of each block of compressed data elements and its corresponding hash value.
Notably, the present invention provides a computer readable medium carrying computer instructions that, when loaded into and executed by the controller 202 of the data compression device 102, enable the data compression device 102 to implement the method 600. The computer readable medium may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. In yet another aspect, the present invention provides a computer program for performing the method 600 by an apparatus (e.g., the data compression device 102). In another aspect, the present invention provides a computer program product comprising a non-transitory computer readable storage medium having computer readable instructions executable by a processor to perform method 600. Examples of implementations of the non-transitory computer readable storage medium include, but are not limited to, electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), read-only memory (ROM), hard disk drive (HDD), flash memory, Secure Digital (SD) card, solid-state drive (SSD), computer readable storage medium, and/or CPU cache.
FIG. 7 is a flow chart of a method 700 for storing a data object according to an embodiment of the invention. Method 700 is performed at a data storage device, for example, as shown in FIG. 2B. Method 700 includes steps 702 and 704.
In step 702, the method 700 includes receiving a data element to be stored, wherein the data element includes a data object and a header element, and wherein the data object indicates one or more data blocks in the data object and the header element includes a hash value for each of the one or more data blocks. For example, the controller 210 receives data elements via the network interface 212 to be stored in the memory 208 of the data storage 104. For example, the data compression device 102 compresses the received data elements. The received data elements are compressed such that the controller 210 can deduplicate the data elements at a high deduplication rate. The data elements include data objects and header elements. The data object includes one or more data blocks, e.g., a first data block and a second data block. Further, the controller 210 calculates a hash value of each of the first block and the second block using a hash algorithm.
At step 704, the method 700 further comprises: for each data block, determining whether the data block has a reference data block based on the hash value of the data block; if the result of the determination is yes, one data block is replaced with a reference data block of another data block. The controller 210 is configured to search for a match between the hash values of the first data block and the second data block and the hash values of the data blocks stored in the data storage device 104. A match means that the data block is already stored on the data storage device 104. Accordingly, the controller 210 removes the data block whose hash value matches that of another data block stored in the data storage device 104, and replaces it with the reference data block of the other data block. The reference data block can be used to locate the other data block on the data storage device 104. The controller 210 is thereby able to store data objects in the data storage device 104 in a space efficient manner. In one embodiment, the data block is replaced if no dependency exists. In some embodiments, the data block is not replaced if a dependency exists. Notably, if the controller 210 does not find a match for a data block, the controller 210 is operable to store the compressed data block on the data storage device 104.
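The hash-based lookup and replacement of step 704 can be sketched as follows. This is a simplified illustration under stated assumptions: blocks are held in an in-memory dictionary keyed by hash value, the "reference data block" is modelled as the hash value itself, and the compression-dependency check is omitted. The class `BlockStore` and its methods are hypothetical names.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class BlockStore:
    """Hypothetical in-memory stand-in for the data storage device."""
    blocks: dict = field(default_factory=dict)  # hash value -> block content

    def store_object(self, header_entries: list) -> list:
        """Store (hash, block) pairs; return a layout of 'stored' or 'ref' entries."""
        layout = []
        for digest, block in header_entries:
            if digest in self.blocks:
                # Match found: keep only a reference to the stored block.
                layout.append(("ref", digest))
            else:
                # No match: store the block itself.
                self.blocks[digest] = block
                layout.append(("stored", digest))
        return layout

    def load_object(self, layout: list) -> bytes:
        """Rebuild an object by following stored entries and references."""
        return b"".join(self.blocks[digest] for _, digest in layout)


def with_hashes(blocks: list) -> list:
    """Pair each block with its hash value, as a header element would provide."""
    return [(hashlib.sha256(b).hexdigest(), b) for b in blocks]


store = BlockStore()
layout_a = store.store_object(with_hashes([b"AAAA", b"BBBB"]))
layout_b = store.store_object(with_hashes([b"BBBB", b"CCCC"]))
print(layout_b[0][0], len(store.blocks))
```

In this sketch the second object's first block is recorded only as a reference, so three unique blocks are stored for four received blocks, and both objects can still be rebuilt from their layouts.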
The present invention provides a computer readable medium carrying computer instructions that, when loaded into and executed by the controller 210 of the data storage device 104, enable the data storage device 104 to implement the method 700. The computer readable medium may include, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. In yet another aspect, the present invention provides a computer program for performing the method 700 by an apparatus (e.g., the data storage device 104). In another aspect, the present invention provides a computer program product comprising a non-transitory computer readable storage medium having computer readable instructions executable by a processor to perform method 700. Examples of implementations of the non-transitory computer readable storage medium include, but are not limited to, electrically erasable programmable read-only memory (EEPROM), random access memory (RAM), read-only memory (ROM), hard disk drive (HDD), flash memory, Secure Digital (SD) card, solid-state drive (SSD), computer readable storage medium, and/or CPU cache.
Modifications may be made to the embodiments of the invention described above without departing from the scope of the invention, as defined in the accompanying claims. Expressions such as "comprising," "including," "combining," "having," "being" and the like, which are used to describe and claim the present invention, are intended to be interpreted in a non-exclusive manner, i.e., to allow for items, components or elements that are not explicitly described to exist as well. Reference to the singular is also to be construed to relate to the plural. The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments, and/or as excluding combinations of features of other embodiments. The word "optionally" as used herein means "provided in some embodiments and not provided in other embodiments." It is appreciated that certain features of the invention, which are, for clarity, described in the context of separate embodiments, may also be provided in combination in a single embodiment. Conversely, various features of the invention, which are, for brevity, described in the context of a single embodiment, may also be provided separately or in any suitable combination or as in any other described embodiment of the invention.

Claims (18)

1. A data compression device (102), comprising a controller (202), the controller (202) being configured to:
receiving a data object (302, 314, 406, 514, 532) to be compressed;
indicating one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the data object (302, 314, 406, 514, 532);
determining a hash value for each of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526);
subsequently, compressing the data object (302, 314, 406, 514, 532);
generating a header element (312, 404, 512, 530), wherein the header element comprises the hash value of each of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) and appending the header element (312, 404, 512, 530) to a compressed data object (510, 528), wherein the header element (312, 404, 512, 530) is arranged to indicate the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the compressed data object (510, 528).
2. The data compression device (102) of claim 1, wherein the controller (202) is further configured to indicate the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the data object (302, 314, 406, 514, 532) such that at least two of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) have different sizes.
3. The data compression device (102) of claim 1 or 2, wherein the controller (202) is further configured to:
determining that a repeated data sequence extends from a first data block (304, 306, 308, 410, 504, 506, 508, 524, 526) to a second data block (304, 306, 308, 410, 504, 506, 508, 524, 526), wherein the first data block (304, 306, 308, 410, 504, 506, 508, 524, 526) comprises a first portion of the repeated data sequence and the second data block (304, 306, 308, 410, 504, 506, 508, 524, 526) comprises a second portion of the repeated data sequence; if so, then:
adjusting the first data block (304, 306, 308, 410, 504, 506, 508, 524, 526) to include the repeated data sequence;
the second data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is adjusted to include the repeated data sequence.
4. The data compression device (102) of any of the preceding claims, wherein the header element (312, 404, 512, 530) further comprises a start indication for each data block (304, 306, 308, 410, 504, 506, 508, 524, 526) and an end indication for each data block (304, 306, 308, 410, 504, 506, 508, 524, 526).
5. The data compression device (102) of any one of the preceding claims, wherein the controller (202) is further configured to:
receiving a second data object;
indicating one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the second data object;
compressing the data object (302, 314, 406, 514, 532), wherein the compressed data object is arranged to indicate the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526);
determining whether a compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) in the second data object corresponds to a compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) in the data object (302, 314, 406, 514, 532); if the determination is yes, one data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is replaced with another data block (304, 306, 308, 410, 504, 506, 508, 524, 526).
6. The data compression device (102) of claim 5, wherein the controller (202) is further configured to:
generating a hash value for each of the data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the second data object (302, 314, 406, 514, 532);
based on the hash value generated for the data block (304, 306, 308, 410, 504, 506, 508, 524, 526), it is determined whether the compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) in the second data object (302, 314, 406, 514, 532) corresponds to the compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) in the data object (302, 314, 406, 514, 532).
7. The data compression device (102) of claim 5 or 6, wherein the controller (202) is further configured to replace one compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) with another compressed data block by replacing a larger one of the compressed data blocks (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) with a reference data block of a smaller one of the compressed data blocks (316, 318, 320, 408, 412, 516, 518, 520, 534, 536).
8. The data compression device (102) of claim 5, 6 or 7, wherein the controller (202) is further configured to replace one compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) with another compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) by:
determining which compressed data blocks (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) are decompressed faster;
replacing the compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) with a reference data block of the compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) having the higher decompression speed.
9. The data compression device (102) of claim 5, 6, 7 or 8, wherein the controller (202) is further configured to replace one compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) with another compressed data block (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) by:
determining a compression dependency of the compressed data blocks (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) in the data objects (302, 314, 406, 514, 532) and a compression dependency of the compressed data blocks (316, 318, 320, 408, 412, 516, 518, 520, 534, 536) in the second data objects (302, 314, 406, 514, 532);
determining which data block (304, 306, 308, 410, 504, 506, 508, 524, 526) to replace based on the compression dependency.
10. The data compression device (102) of any one of the preceding claims, wherein the controller (202) is further configured to:
receiving a compressed data element (510, 528) for decompression, wherein the compressed data element (510, 528) comprises a data object (302, 314, 406, 514, 532) indicating one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) of the compressed data object (302, 314, 406, 514, 532);
determining whether a compression dependency exists between a first data block (304, 306, 308, 410, 504, 506, 508, 524, 526) and a second data block (304, 306, 308, 410, 504, 506, 508, 524, 526) in the data object (302, 314, 406, 514, 532); if the determination is yes, then:
the second data block (304, 306, 308, 410, 506, 508, 524, 526) is decompressed before the first data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is decompressed.
11. The data compression device (102) of claim 10, wherein the controller (202) is further configured to determine that the second data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is included in a second data object (302, 314, 406, 514, 532) and, in response, obtain the second data object (302, 314, 406, 514, 532).
12. The data compression device (102) of claim 10 or 11, wherein one or more of the data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) of the compressed data object (302, 314, 406, 514, 532) are compressed data blocks (316, 318, 320, 408, 412, 516, 518, 520, 534, 536), and a remainder of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) of the compressed data object (302, 314, 406, 514, 532) are uncompressed, the controller (202) being configured to decompress the one or more compressed data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526).
13. A data storage device (104) comprising a memory (208) and a controller (210), wherein the memory (208) is configured to store data objects (302, 314, 406, 514, 532), and the controller (210) is configured to:
receiving a data element (310, 402, 502, 522) to be stored, wherein the data element (310, 402, 502, 522) comprises a data object (302, 314, 406, 514, 532) and a header element, and wherein the data object indicates one or more data blocks in the data object, and the header element (310, 402, 502, 522) comprises a hash value of each of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526);
for each data block (304, 306, 308, 410, 504, 506, 508, 524, 526), determining whether a reference data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is present for the data block (304, 306, 308, 410, 504, 506, 508, 524, 526) based on the hash value of the data block (304, 306, 308, 410, 504, 506, 508, 524, 526); if the determination is yes, one data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is replaced with a reference data block of another data block (304, 306, 308, 410, 504, 506, 508, 524, 526).
14. A data device (100) characterized by comprising a data compression device (102) according to any one of claims 1 to 13 and a data storage device (104) according to claim 13.
15. A method (600) for data compression, the method (600) comprising:
receiving a data object (302, 314, 406, 514, 532) to be compressed;
indicating one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the data object (302, 314, 406, 514, 532);
determining a hash value for each of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526);
subsequently, compressing the data object (302, 314, 406, 514, 532);
generating a header element, wherein the header element comprises the hash value of each of the one or more data blocks, and appending the header element (312, 404, 512, 530) to a compressed data object (302, 314, 406, 514, 532), wherein the header element (312, 404, 512, 530) is arranged to indicate the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the compressed data object (302, 314, 406, 514, 532).
16. A computer readable medium carrying computer instructions, characterized in that when loaded into and executed by a controller (202) of a data compression device (102), the instructions enable the data compression device (102) to implement the method (600) according to claim 15.
17. A method (700) for storing a data object, the method (700) comprising:
receiving a data element (310, 402, 502, 522) to be stored, wherein the data element (310, 402, 502, 522) comprises a data object (302, 314, 406, 514, 532) and a header element (312, 404, 512, 530), the data object (302, 314, 406, 514, 532) indicating one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526) in the data object (302, 314, 406, 514, 532), and the header element (312, 404, 512, 530) comprises a hash value of each of the one or more data blocks (304, 306, 308, 410, 504, 506, 508, 524, 526);
for each data block (304, 306, 308, 410, 504, 506, 508, 524, 526), determining whether a reference data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is present for the data block (304, 306, 308, 410, 504, 506, 508, 524, 526) based on the hash value of the data block (304, 306, 308, 410, 504, 506, 508, 524, 526); if the determination is yes, one data block (304, 306, 308, 410, 504, 506, 508, 524, 526) is replaced with a reference data block of another data block (304, 306, 308, 410, 504, 506, 508, 524, 526).
18. A computer readable medium carrying computer instructions which, when loaded into and executed by a controller (210) of a data storage device (104), enable the data storage device (104) to implement the method (700) according to claim 17.
CN202080107962.0A 2020-12-21 2020-12-21 Data compression device, data storage device and method for data compression and data de-duplication Pending CN116601593A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/EP2020/087369 WO2022135657A1 (en) 2020-12-21 2020-12-21 Data compression arrangement, data storage arrangement and methods for compression and deduplication of data

Publications (1)

Publication Number Publication Date
CN116601593A true CN116601593A (en) 2023-08-15

Family

ID=74141536

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080107962.0A Pending CN116601593A (en) 2020-12-21 2020-12-21 Data compression device, data storage device and method for data compression and data de-duplication

Country Status (2)

Country Link
CN (1) CN116601593A (en)
WO (1) WO2022135657A1 (en)


Also Published As

Publication number Publication date
WO2022135657A1 (en) 2022-06-30


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination