CN107682016A - Data compression method, data decompression method and related system

Publication number
CN107682016A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710884914.2A
Other languages
Chinese (zh)
Other versions
CN107682016B (en
Inventor
韩子衿
夏文
吴大立
古亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201710884914.2A
Publication of CN107682016A
Application granted
Publication of CN107682016B
Legal status: Active

Classifications

    • H: ELECTRICITY
    • H03: ELECTRONIC CIRCUITRY
    • H03M: CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00: Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30: Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a data compression method, a data decompression method, and a related system. After the original data is divided into multiple data blocks, similar data blocks are migrated and reorganized so that redundant data can be eliminated, which improves the compression ratio. The compression method includes: dividing the original data into multiple data blocks; detecting the similarity of the data blocks; migrating and reorganizing similar data blocks in turn to generate reorganized data; and compressing the reorganized data to generate compressed data. The embodiments also provide a data decompression method and a related system, for improving the compression ratio of data.

Description

Data compression method, data decompression method and related system
Technical field
The present invention relates to the field of computer data processing, and in particular to a data compression method, a data decompression method, and a related system.
Background
Data compression means reducing the volume of data without losing useful information, in order to reduce storage space and improve transmission, storage, and processing efficiency. Equivalently, it is a technique that reorganizes data according to some algorithm to reduce redundancy and storage space.
Current data compression techniques are broadly divided into lossy and lossless compression. Most existing lossless compression techniques are derived from the dictionary coding techniques LZ77 and LZ78. Dictionary coding mainly relies on a "sliding window" cache: the current character sequence is matched against the strings cached in the sliding window and, if it repeats, is represented by a relatively short code, which eliminates redundancy at the string level.
In existing lossless compression, however, the size of the sliding window is the main limit on finding redundant data. On the one hand, a larger sliding window makes redundant data easier to find, so more redundancy can be eliminated; but as the window grows, the time spent matching and searching for redundant strings also increases exponentially, so most compression algorithms cap the window size (for example, the maximum sliding window of bzip2 is 900 KB). On the other hand, if the sliding window is too small, redundant data that falls in different windows cannot be eliminated because the copies are too far apart, so a large amount of redundancy remains in the storage system. In addition, string matching on non-redundant data is itself time-consuming and lowers the compression speed of the storage system.
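The window limitation described above can be reproduced with a small experiment. The sketch below uses Python's standard `zlib` module (DEFLATE, which looks back at most 32 KiB) purely as a stand-in for a window-limited compressor; the block and filler sizes are arbitrary values chosen for illustration and do not come from the text.

```python
import random
import zlib

WINDOW = 32 * 1024  # DEFLATE (zlib) can reference at most the previous 32 KiB


def random_bytes(n: int, seed: int) -> bytes:
    """Deterministic pseudo-random bytes, which are essentially incompressible."""
    rng = random.Random(seed)
    return bytes(rng.getrandbits(8) for _ in range(n))


block = random_bytes(16 * 1024, seed=1)     # a 16 KiB block that occurs twice
filler = random_bytes(2 * WINDOW, seed=2)   # 64 KiB of unrelated data

# Same content, different layout:
far = block + filler + block    # second copy lies 80 KiB back, outside the window
near = block + block + filler   # second copy lies 16 KiB back, inside the window

far_size = len(zlib.compress(far, 9))
near_size = len(zlib.compress(near, 9))
print("copies far apart :", far_size, "bytes")
print("copies adjacent  :", near_size, "bytes")
```

With the copies adjacent, the second copy can be encoded as back-references; with the copies far apart, it has to be stored again in full. Moving similar blocks next to each other before compression is aimed at exactly this case.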
Summary of the invention
Embodiments of the invention provide a data compression method, a data decompression method, and a related system. After the original data is divided into multiple data blocks, similar data blocks are migrated and reorganized to eliminate redundant data, which solves the problem in traditional compression techniques that redundancy between distant data cannot be eliminated because of the limited sliding window size.
One aspect of the invention provides a data compression method, including:
dividing the original data into multiple data blocks;
detecting the similarity of the multiple data blocks;
migrating and reorganizing similar data blocks in turn to generate reorganized data;
compressing the reorganized data to generate compressed data.
Optionally, after the original data is divided into multiple data blocks and before the similarity of the data blocks is detected, the method further includes:
recording the order, offset, and block length of the multiple data blocks to generate an original file spectrum.
Optionally, after similar data blocks are migrated and reorganized in turn to generate the reorganized data, and before the reorganized data is compressed, the method further includes:
updating the offsets of the multiple data blocks according to the reorganized data to obtain a new original file spectrum;
compressing the new original file spectrum to generate a compressed file spectrum.
Optionally, migrating and reorganizing similar data blocks in turn to generate reorganized data includes:
migrating and reorganizing similar data blocks in turn to generate multiple similarity linked lists;
reading the corresponding data block contents from the original data according to the similarity linked lists to generate the reorganized data.
Optionally, detecting the similarity of the multiple data blocks includes:
detecting the similarity of the multiple data blocks by the super-feature method, Simhash, or Minhash.
Another aspect of the invention provides a data decompression method, including:
decompressing the compressed data and the compressed file spectrum to obtain the reorganized data and the new original file spectrum, respectively;
reading the multiple data blocks from the reorganized data according to the order, offset, and block length of the data blocks recorded in the new original file spectrum;
writing the multiple data blocks in turn, in the order recorded in the new original file spectrum, to obtain the original data.
The invention also provides a data compression system, including:
a blocking unit, configured to divide the original data into multiple data blocks;
a detection unit, configured to detect the similarity of the multiple data blocks;
a reorganization unit, configured to migrate and reorganize similar data blocks in turn to generate reorganized data;
a compression unit, configured to compress the reorganized data to generate compressed data.
The invention also provides a data decompression system, including:
a decompression unit, configured to decompress the compressed data and the compressed file spectrum to obtain the reorganized data and the new original file spectrum, respectively;
a reading unit, configured to read the multiple data blocks from the reorganized data according to the offset and block length of the data blocks recorded in the new original file spectrum;
a writing unit, configured to write the multiple data blocks in turn, in the order recorded in the new original file spectrum, to obtain the original data.
The invention also provides a computer device including a processor which, when executing a computer program stored in a memory, implements the following steps:
dividing the original data into multiple data blocks;
detecting the similarity of the multiple data blocks;
migrating and reorganizing similar data blocks in turn to generate reorganized data;
compressing the reorganized data to generate compressed data.
The invention also provides a computer device including a processor which, when executing a computer program stored in a memory, implements the following steps:
decompressing the compressed data and the compressed file spectrum to obtain the reorganized data and the new original file spectrum, respectively;
reading the multiple data blocks from the reorganized data according to the offset and block length of the data blocks recorded in the new original file spectrum;
writing the multiple data blocks in turn, in the order recorded in the new original file spectrum, to obtain the original data.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
dividing the original data into multiple data blocks;
detecting the similarity of the multiple data blocks;
migrating and reorganizing similar data blocks in turn to generate reorganized data;
compressing the reorganized data to generate compressed data.
The invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the following steps:
decompressing the compressed data and the compressed file spectrum to obtain the reorganized data and the new original file spectrum, respectively;
reading the multiple data blocks from the reorganized data according to the offset and block length of the data blocks recorded in the new original file spectrum;
writing the multiple data blocks in turn, in the order recorded in the new original file spectrum, to obtain the original data.
As can be seen from the above technical solutions, the embodiments of the invention have the following advantages:
In the invention, the original data is divided into multiple data blocks, the similarity of the blocks is detected, similar blocks are migrated and reorganized to generate reorganized data, and the reorganized data is then compressed to obtain compressed data. Because similar data blocks are migrated and reorganized so that they end up next to one another, as much of the redundancy among them as possible can be eliminated. This solves the problem in traditional data compression that redundancy between distant data cannot be eliminated because of the limited sliding window size.
Brief description of the drawings
Fig. 1 is a process diagram of the data compression method;
Fig. 2 is a schematic diagram of one embodiment of the data compression method in the embodiments of the invention;
Fig. 3 is a schematic diagram of another embodiment of the data compression method in the embodiments of the invention;
Fig. 4 is a schematic diagram of the organization of the file spectrum;
Fig. 5 is a process diagram of the data decompression method;
Fig. 6 is a schematic diagram of one embodiment of the data decompression method in the embodiments of the invention;
Fig. 7 is a schematic diagram of one embodiment of the data compression system in the embodiments of the invention;
Fig. 8 is a schematic diagram of another embodiment of the data compression system in the embodiments of the invention;
Fig. 9 is a schematic diagram of one embodiment of the data decompression system in the embodiments of the invention.
Detailed description
Embodiments of the invention provide a data compression method, a data decompression method, and a related system. After the original data is divided into multiple data blocks, similar data blocks are migrated and reorganized to eliminate redundant data. This solves the problem in traditional compression techniques that redundancy between distant data cannot be eliminated because of the limited sliding window size, and it also improves the compression ratio of the data.
To make this document easier to understand, the technical terms used in it are first explained as follows:
Data chunking: Data chunking divides a file into multiple data blocks using a chunking algorithm. The choice of algorithm affects not only chunking speed but also, considerably, how well similar data blocks are detected. Existing chunking algorithms follow two basic strategies: fixed-length chunking and content-based chunking. Fixed-length chunking marks cut boundaries by position; it is simple to implement and fast, but because of the boundary-shift problem its redundancy detection is not ideal. Content-based chunking determines chunk boundaries from the local content of the data stream, which effectively solves the boundary-shift problem and divides the stream into variable-length data blocks. By comparison, content-based chunking adapts better to workloads whose content is modified frequently and finds more redundant data, and it is widely used in storage systems based on data deduplication.
Similarity detection: Similarity detection identifies data blocks whose content is highly similar, so that similarity redundancy in the storage system can be found and eliminated. Storage systems generally judge the similarity between files by comparing representative fingerprints of the files. Commonly used methods include similarity detection based on super-feature values, Simhash, and Minhash.
Data migration: Data migration changes the order of part of the data in a file so that similar data is clustered, thereby improving how well the file compresses. Data migration also provides a mechanism for recovering the metadata, and its basic unit is the data block. After the file is divided into multiple data blocks, similar blocks are identified by a similarity detection method and then moved so that they are physically adjacent, which makes the file data more compressible.
Data compression: Data compression is a mainstream technique for eliminating redundant data, mainly by encoding. Without losing the original information, the original content is transformed so that repeated byte sequences are represented by codes with fewer bytes, which eliminates part of the redundancy. Claude Elwood Shannon (1916-2001) first proposed the concept of "information entropy": any information contains redundancy, and the amount of redundancy is related to the probability of occurrence, or uncertainty, of each symbol (digit, letter, or word) in the information. Shannon's entropy theory laid the theoretical foundation of data compression. With the continuing growth of electronic digital information, data compression has gradually developed into lossless compression, lossy compression, and so on. Most existing lossless compression techniques are derived from the dictionary coding techniques LZ77 and LZ78. Dictionary coding mainly relies on a "sliding window" cache: the current character sequence is matched against the strings cached in the sliding window and, if it repeats, is represented by a relatively short code, which eliminates redundancy at the string level.
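As a small illustration of the entropy idea mentioned above, the following sketch computes the Shannon entropy of a byte string, H = -Σ p·log2(p), in bits per byte. Treating each byte as one symbol is an assumption made here for simplicity; the function name is illustrative only.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, using byte frequencies as probabilities."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)  # occurrences of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy(b"aaaaaaaa"))        # a single repeated symbol carries no information
print(shannon_entropy(b"abababab"))        # two equally likely symbols
print(shannon_entropy(bytes(range(256))))  # every byte value equally likely
```

Low entropy indicates high redundancy under this per-byte model; it does not capture repeated long sequences, which is what dictionary coders and the block reorganization described later target.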
To aid understanding, Fig. 1 gives a process diagram of the data compression method. The data compression method of the invention is described below with reference to Fig. 1. Referring to Fig. 2, one embodiment of the data compression method in the embodiments of the invention includes:
201. Divide the original data into multiple data blocks;
It can be understood that data compression eliminates redundant data without losing the original data, so as to reduce storage space and speed up data transmission.
The invention is based on the idea of clustering similar data together so that as much redundant data as possible is eliminated. To cluster similar data, the original data must first be chunked so that the resulting blocks can be compared for similarity.
Data chunking divides the original data into multiple data blocks using a chunking algorithm. The average chunk granularity is about 8 KB (it can also be set to 4 KB or 16 KB as required), and the chunking algorithm may be content-based or fixed-length.
Fixed-length chunking marks cut boundaries by position; it is simple to implement and fast, but because of the boundary-shift problem its redundancy detection is mediocre. Content-based chunking determines chunk boundaries from the local content of the data stream, which effectively solves the boundary-shift problem and divides the stream into variable-length data blocks. By comparison, content-based chunking adapts better to workloads whose content is modified frequently and finds more redundant data.
202. Detect the similarity of the multiple data blocks;
After chunking, the original data forms multiple data blocks, and the data compression system performs similarity detection on them. Many similarity detection algorithms exist, for example the super-feature method, Simhash, and Minhash.
How these algorithms are used to detect the similarity of the data blocks is described in detail in the embodiments below.
Note that the similarity detection algorithms in this embodiment include, but are not limited to, the algorithms above; no specific limitation is imposed here.
203. Migrate and reorganize similar data blocks in turn to generate reorganized data;
After similarity detection, similar data blocks are clustered and reorganized, forming multiple similarity linked lists. The data compression system reads the corresponding data blocks from the original data in turn according to the similarity linked lists and then writes the blocks it has read in turn, which generates the reorganized data.
How the data blocks form similarity linked lists after similarity detection, and how the reorganized data is obtained from those lists, is described in detail in the embodiments below.
204. Compress the reorganized data to generate compressed data.
After the data blocks have been assembled into the reorganized data, the data compression system compresses the reorganized data with a traditional compression method, so that as much redundancy as possible is removed from the similar data blocks and the compression ratio of the original data increases.
In the invention, the original data is divided into multiple data blocks, the similarity of the blocks is detected, similar blocks are migrated and reorganized to generate reorganized data, and the reorganized data is then compressed to obtain compressed data. Because similar data blocks are migrated and reorganized so that they end up next to one another, as much of the redundancy among them as possible can be eliminated. This solves the problem in traditional data compression that redundancy between distant data cannot be eliminated because of the limited sliding window size.
Building on the embodiment of Fig. 2, the data compression method in the embodiments of the invention is described in detail below. Referring to Fig. 3, another embodiment of the data compression method in the embodiments of the invention includes:
301. Divide the original data into multiple data blocks;
To cluster and reorganize similar data as the invention intends, the original data must be chunked to obtain multiple data blocks. Data chunking divides the original data into multiple data blocks using a chunking algorithm. The average chunk granularity is about 8 KB (it can also be set to 4 KB or 16 KB as required), and the chunking algorithm may be content-based or fixed-length.
Specifically, the content and characteristics of content-based and fixed-length chunking are described in detail in step 201 of the Fig. 2 embodiment and are not repeated here.
302. Record the order, offset, and block length of the multiple data blocks to generate the original file spectrum;
So that the compressed data can later be restored to the original data, the data compression system records the order and offset of the data blocks in the original data and the block length of each block. The order is used to restore each block's position in the original data, and the offset and block length are used to read out each block's content accurately. The file that records the order, offset, and block length of the data blocks is called the original file spectrum.
Fig. 4 is a schematic diagram of the organization of the file spectrum and gives an example of the original file spectrum for a file named TEST. The file spectrum mainly consists of file information (file size, file name length, file name, and number of data blocks) and data block metadata (the offset and block length of each data block).
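One possible in-memory and on-disk form of such a file spectrum is sketched below. The field names, field widths, and byte layout are hypothetical, since Fig. 4 is not reproduced here; only the kinds of fields (file information plus per-block offset and length) follow the description.

```python
import struct
from dataclasses import dataclass
from typing import List, Tuple

HEADER = "<QHI"   # hypothetical layout: file size, file name length, block count
ENTRY = "<QI"     # hypothetical layout: block offset, block length


@dataclass
class FileSpectrum:
    file_size: int
    file_name: str
    blocks: List[Tuple[int, int]]  # (offset, length), listed in original block order

    def to_bytes(self) -> bytes:
        """Serialize file information followed by one entry per data block."""
        name = self.file_name.encode("utf-8")
        out = struct.pack(HEADER, self.file_size, len(name), len(self.blocks)) + name
        for offset, length in self.blocks:
            out += struct.pack(ENTRY, offset, length)
        return out

    @classmethod
    def from_bytes(cls, raw: bytes) -> "FileSpectrum":
        """Parse the layout written by to_bytes()."""
        file_size, name_len, count = struct.unpack_from(HEADER, raw, 0)
        pos = struct.calcsize(HEADER)
        name = raw[pos:pos + name_len].decode("utf-8")
        pos += name_len
        blocks = []
        for _ in range(count):
            blocks.append(struct.unpack_from(ENTRY, raw, pos))
            pos += struct.calcsize(ENTRY)
        return cls(file_size, name, blocks)


# A file named TEST made of three blocks of 1 KB, 2 KB and 3 KB.
spec = FileSpectrum(6144, "TEST", [(0, 1024), (1024, 2048), (3072, 3072)])
raw = spec.to_bytes()
print(len(raw), FileSpectrum.from_bytes(raw))
```

The position of an entry in `blocks` encodes the block's order in the original data, so the order does not need a separate field in this sketch.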
303. Detect the similarity of the multiple data blocks;
After chunking, the original data forms multiple data blocks, and the data compression system performs similarity detection on them. Many similarity detection algorithms exist, for example the super-feature method, Simhash, and Minhash.
The super-feature method is used as an illustration below. Suppose there are N data blocks. Applying one hash algorithm to each of the N blocks yields N hash values, which serve as N super-feature values. To improve how well similar blocks are recognized, several hash algorithms are applied to each block, so that each block corresponds to multiple super-feature values and, accordingly, to multiple super-feature index entries. Each super-feature value of each block is then compared with the super-feature values of the other blocks; if two blocks are found to share an identical super-feature value, the two blocks are considered similar data blocks.
Note that this embodiment detects the similarity of the data blocks with a global super-feature index, which widens the scope of similarity detection and improves the detection of similar data blocks. The similarity detection in this embodiment may also use Simhash or Minhash; the specific detection algorithm is not limited here.
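The text does not spell out how each hash algorithm is made tolerant of small edits. A common construction, used here as an assumption rather than as the patent's exact method, hashes every sliding window of a block once, takes the maximum under several linear transforms as features, and hashes groups of features into super-feature values. The window size, number of features, and group size below are illustrative.

```python
import hashlib
import random
import zlib
from typing import Dict, List

WINDOW = 32        # bytes per sliding window (assumed)
N_FEATURES = 6     # features extracted per block (assumed)
GROUP = 2          # features combined into one super-feature (assumed)

_rng = random.Random(42)
# Odd multipliers make each transform a permutation of the 32-bit hash values.
TRANSFORMS = [(_rng.getrandbits(32) | 1, _rng.getrandbits(32)) for _ in range(N_FEATURES)]


def super_features(block: bytes) -> List[bytes]:
    """Return the super-feature values of one data block."""
    last = max(1, len(block) - WINDOW + 1)
    hashes = [zlib.crc32(block[i:i + WINDOW]) for i in range(last)]
    feats = [max((a * h + b) & 0xFFFFFFFF for h in hashes) for a, b in TRANSFORMS]
    result = []
    for g in range(0, N_FEATURES, GROUP):
        packed = b"".join(f.to_bytes(4, "big") for f in feats[g:g + GROUP])
        # Prefix with the group number so values from different groups never collide.
        result.append(bytes([g]) + hashlib.sha1(packed).digest()[:8])
    return result


def find_similar(blocks: List[bytes]) -> Dict[int, int]:
    """Map a block index to the earliest block sharing a super-feature value."""
    index: Dict[bytes, int] = {}       # global super-feature index
    similar_to: Dict[int, int] = {}
    for i, block in enumerate(blocks):
        for sf in super_features(block):
            if sf in index and i not in similar_to:
                similar_to[i] = index[sf]
            index.setdefault(sf, i)
    return similar_to


rng = random.Random(7)
base = bytes(rng.getrandbits(8) for _ in range(4096))
other = bytes(rng.getrandbits(8) for _ in range(4096))
edited = bytearray(base)
edited[100] ^= 0xFF                    # a one-byte modification of `base`
print(find_similar([base, other, bytes(edited)]))
```

A small edit changes only the windows that overlap it, so most maxima, and therefore most super-feature values, survive; unrelated blocks are very unlikely to share one.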
304. Migrate and reorganize similar data blocks in turn to generate multiple similarity linked lists;
In step 303, if the data compression system finds data blocks with an identical super-feature value, it adds these blocks in turn to the corresponding similarity linked list and records the offset and block length of each block in the list, thereby obtaining multiple similarity linked lists.
As shown in Fig. 1, data blocks A, C, and F are similar, so they are recorded as similarity linked list 1; data blocks B, D, and E are similar, so they are recorded as similarity linked list 2. If a data block shares no super-feature value with any other block, a new similarity linked list is created to hold it.
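The grouping step can be sketched as follows, using plain Python lists in place of linked lists. The function name and the placeholder feature strings are hypothetical; each block is represented only by its index and its super-feature values, and the offsets and lengths would come from the file spectrum.

```python
from typing import Dict, Hashable, List, Sequence


def build_similarity_lists(features: Sequence[Sequence[Hashable]]) -> List[List[int]]:
    """Group block indices so that blocks sharing a super-feature value sit in one list."""
    owner: Dict[Hashable, int] = {}   # super-feature value -> index of the list holding it
    lists: List[List[int]] = []
    for block_id, feats in enumerate(features):
        hit = next((owner[f] for f in feats if f in owner), None)
        if hit is None:               # no earlier block shares a value: start a new list
            hit = len(lists)
            lists.append([])
        lists[hit].append(block_id)
        for f in feats:
            owner.setdefault(f, hit)
    return lists


# Blocks A..F (indices 0..5) with made-up super-feature values.
features = [
    ["x1", "x2"],  # A
    ["y1", "y2"],  # B
    ["x1", "z"],   # C shares x1 with A
    ["y2", "w"],   # D shares y2 with B
    ["w", "q"],    # E shares w with D
    ["z", "k"],    # F shares z with C
]
names = "ABCDEF"
for lst in build_similarity_lists(features):
    print([names[i] for i in lst])
```

With these inputs the two lists correspond to A, C, F and B, D, E, matching the Fig. 1 example.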
305. Read the corresponding data block contents from the original data according to the similarity linked lists to generate the reorganized data;
After similarity detection of the data blocks yields multiple similarity linked lists, the data compression system traverses each similarity linked list in turn, reads the content of each block from the original data according to the order, offset, and block length recorded in the list, and writes the blocks to a file in turn to generate the reorganized data.
As shown in Fig. 1, according to the order, offset, and block length of each block recorded in similarity linked list 1, the contents of data blocks A, C, and F are read from the original data in turn and written to the file in turn. Likewise, according to similarity linked list 2, the contents of data blocks B, D, and E are read from the original data in turn and written to the file. Proceeding in this way through the similarity linked lists in order, the content of each block is read and written in turn to generate the reorganized data, shown in Fig. 1 as A, C, F, B, D, E.
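A minimal sketch of this traversal, assuming the blocks are given as `(offset, length)` pairs in original order and the similarity lists hold block indices (both representations are illustrative choices):

```python
from typing import List, Tuple


def reorganize(data: bytes, blocks: List[Tuple[int, int]],
               lists: List[List[int]]) -> Tuple[bytes, List[int]]:
    """Concatenate block contents list by list; also return the new block order."""
    out = bytearray()
    new_order: List[int] = []
    for similar in lists:              # traverse each similarity list in turn
        for block_id in similar:       # blocks keep their order within a list
            offset, length = blocks[block_id]
            out += data[offset:offset + length]
            new_order.append(block_id)
    return bytes(out), new_order


# Six two-byte blocks standing in for A..F.
data = b"AABBCCDDEEFF"
blocks = [(i * 2, 2) for i in range(6)]
reorganized, new_order = reorganize(data, blocks, [[0, 2, 5], [1, 3, 4]])
print(reorganized, new_order)
```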
306. Update the offsets of the multiple data blocks according to the reorganized data to obtain the new original file spectrum;
After the reorganized data is generated, the position of each data block has changed, so the offset of each block changes accordingly. As shown in Fig. 1, suppose that in the original data block A is 1 KB, block B is 2 KB, and block C is 3 KB. The offset of block C in the original data is then the sum of the lengths of blocks A and B, that is, 3 KB. In the reorganized data block C directly follows block A, so the offset of block C becomes the length of block A, that is, 1 KB. So that the original data can later be restored from the original file spectrum and the reorganized data, the data compression system must update the offsets of the data blocks in the original file spectrum according to the reorganized data. For ease of description, the updated original file spectrum is called the new original file spectrum.
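The offset update in this example can be worked through in a few lines. The function name is illustrative, and the block order A, C, B is assumed for the three blocks mentioned in the example.

```python
from typing import List


def updated_offsets(lengths: List[int], new_order: List[int]) -> List[int]:
    """Offsets of each block in the reorganized data, indexed by original block number."""
    offsets = [0] * len(lengths)
    position = 0
    for block_id in new_order:         # walk the blocks in their reorganized order
        offsets[block_id] = position
        position += lengths[block_id]
    return offsets


KB = 1024
lengths = [1 * KB, 2 * KB, 3 * KB]                        # blocks A, B, C
original = updated_offsets(lengths, [0, 1, 2])            # layout A, B, C
reorganized = updated_offsets(lengths, [0, 2, 1])         # layout A, C, B
print("original offsets   :", original)     # C starts after A and B
print("reorganized offsets:", reorganized)  # C starts right after A
```

The list stays indexed by original block number, so the original order is preserved while each entry now points into the reorganized data.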
307. Compress the new original file spectrum to obtain the compressed file spectrum;
After the new original file spectrum is obtained, it is compressed to obtain the compressed file spectrum. To make later decompression convenient, the compressed file spectrum can be stored in association with the compressed data produced later.
308. Compress the reorganized data to obtain the compressed data;
After the reorganized data is obtained in step 305, similar data blocks have been clustered together, so compression can eliminate as much of the redundancy among similar blocks as possible and yields smaller compressed data.
Further, the invention solves the problem in traditional compression methods that redundancy between data that is too far apart cannot be eliminated because of the limited sliding window size.
Note that step 308 may also be performed before step 307; that is, no order is imposed between steps 307 and 308. For convenience in practice, steps 307 and 308 may also be merged, so that the new original file spectrum and the reorganized data are compressed at the same time to obtain the compressed data and the compressed file spectrum.
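One way to compress the two parts and store them together is sketched below. The container layout (a 4-byte length prefix followed by the two compressed parts), the JSON encoding of the spectrum, and the use of `zlib` as the "traditional" compressor are all assumptions for illustration.

```python
import json
import struct
import zlib
from typing import Tuple


def pack(reorganized: bytes, spectrum: dict) -> bytes:
    """Compress the spectrum and the reorganized data and store them in one container."""
    comp_spec = zlib.compress(json.dumps(spectrum).encode("utf-8"))
    comp_data = zlib.compress(reorganized, 9)
    # Hypothetical layout: [spectrum length][compressed spectrum][compressed data]
    return struct.pack("<I", len(comp_spec)) + comp_spec + comp_data


def unpack(container: bytes) -> Tuple[bytes, dict]:
    """Inverse of pack(): return the reorganized data and the spectrum."""
    (spec_len,) = struct.unpack_from("<I", container, 0)
    spectrum = json.loads(zlib.decompress(container[4:4 + spec_len]))
    reorganized = zlib.decompress(container[4 + spec_len:])
    return reorganized, spectrum


spectrum = {"name": "TEST", "size": 12,
            "blocks": [[0, 2], [6, 2], [2, 2], [8, 2], [10, 2], [4, 2]]}
container = pack(b"AACCFFBBDDEE", spectrum)
print(unpack(container))
```

Since the two parts are compressed independently, their order of compression does not matter, which is consistent with the remark that steps 307 and 308 can be swapped or merged.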
In the invention, the original data is divided into multiple data blocks, the similarity of the blocks is detected, similar blocks are migrated and reorganized to generate reorganized data, and the reorganized data is then compressed to obtain compressed data. Because similar data blocks are migrated and reorganized so that they end up next to one another, as much of the redundancy among them as possible can be eliminated. This solves the problem in traditional data compression that redundancy between distant data cannot be eliminated because of the limited sliding window size.
The data compression method of the invention is described above; the data decompression method of the invention is described below. Referring to Fig. 6, one embodiment of the data decompression method in the embodiments of the invention includes:
601. Decompress the compressed data and the compressed file spectrum to obtain the reorganized data and the new original file spectrum, respectively;
Based on the embodiment of Fig. 3, once the compressed data and the compressed file spectrum have been obtained, the data decompression system must decompress both in order to restore the original data. Decompression yields the reorganized data and the new original file spectrum. Fig. 5 is a process diagram of the data decompression method.
As shown in Fig. 5, decompressing the compressed data and the compressed file spectrum yields the reorganized data and the new original file spectrum.
602nd, order, skew and the block length of multiple data blocks of record are composed according to new original, respectively from recombination data Read multiple data blocks;
After compressed data and compressed file spectrum decompression, recombination data and new original spectrum are obtained, wherein, new original spectrum note The order and block length of each data block in former data, and skew of each data block in recombination data are recorded.So data decompression The order of each data block, block length in the former data that system records in being composed according to new original, and each data block is in recombination data In skew, read out the content of each data block according to the order of former data block from recombination data successively.
As shown in figure 5, order A, B, C, D, E, F of the multiple data blocks recorded in being composed according to new original, and each data Skew and block length of the block in recombination data, it is multiple according to being recorded in former data from recombination data A, D, F, B, C, E respectively The order of data block reads out the content of each data block.
603. According to the order of the multiple data blocks recorded in the new original spectrum, write the multiple data blocks in sequence to obtain the original data.
After the data decompression system has read the content of each data block from the recombination data in step 602, in the order recorded for the original data, it writes the block contents out in that same order, which restores the original data.
It should be noted that if the data is stored on a magnetic disk, the output is written sequentially, but reading the blocks from the recombination data in original-data order is non-sequential. Reading the block contents therefore incurs a certain amount of disk I/O overhead, which can shorten the life of the disk. If the magnetic disk is replaced with an SSD, which supports random reads and random writes, the problem of high disk I/O overhead is avoided.
In the present invention, the decompression method mirrors the compression method: the compressed data and the compressed file spectrum are decompressed to obtain the recombination data and the new original spectrum; the multiple data blocks are read from the recombination data according to the order, offset, and block length recorded in the new original spectrum; and the data blocks are then written in sequence, which restores the original data.
The data compression method of the present invention has been described above; the data compression system of the present invention is described below. Referring to Fig. 7, one embodiment of a data compression system in the embodiments of the present invention includes:
Blocking unit 701, configured to divide the original data into multiple data blocks;
Detection unit 702, configured to detect the similarity of the multiple data blocks;
Recombination unit 703, configured to migrate and recombine similar data blocks in sequence to generate recombination data;
Compression unit 704, configured to compress the recombination data to generate compressed data.
It should be noted that the function of each unit in this embodiment is similar to that described in the embodiment of Fig. 2 and is not repeated here.
In the present invention, the blocking unit 701 divides the original data into multiple data blocks, the detection unit 702 detects the similarity of the multiple data blocks, the recombination unit 703 migrates and recombines similar data blocks to generate recombination data, and the compression unit 704 compresses the recombination data to obtain compressed data. Because similar data blocks are migrated and placed together, the redundancy among them can be eliminated as far as possible. This solves the problem in conventional data compression that redundancy between data located far apart cannot be eliminated because of the limited size of the sliding window.
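The sliding-window limitation mentioned above can be observed with a general-purpose compressor. The following is a minimal sketch, assuming DEFLATE (via Python's `zlib`, whose window is 32 KiB) as the compressor; the patent does not name a specific compression algorithm, and the block and filler sizes are arbitrary choices for illustration.

```python
import random
import zlib

rng = random.Random(0)


def random_bytes(n: int) -> bytes:
    """Incompressible filler, generated deterministically."""
    return bytes(rng.getrandbits(8) for _ in range(n))


block = random_bytes(8 * 1024)     # a block that occurs twice in the data
filler = random_bytes(64 * 1024)   # unrelated data between the two copies

# Copies 72 KiB apart: the second copy lies outside the 32 KiB window.
far_apart = block + filler + block
# Copies placed next to each other, as recombination would arrange them.
adjacent = block + block + filler

size_far = len(zlib.compress(far_apart))
size_adjacent = len(zlib.compress(adjacent))
print(size_far, size_adjacent)
```

Both inputs contain exactly the same bytes; only the block order differs. The adjacent layout compresses to a smaller size because the second copy can be encoded as back-references to the first.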
For ease of understanding, the data compression system in the embodiments of the present invention is described in detail below. Referring to Fig. 8, another embodiment of the data compression system in the embodiments of the present invention includes:
Blocking unit 801, configured to divide the original data into multiple data blocks;
Detection unit 802, configured to detect the similarity of the multiple data blocks;
Recombination unit 803, configured to migrate and recombine similar data blocks in sequence to generate recombination data;
First compression unit 804, configured to compress the recombination data to generate compressed data.
Further, the data compression system also includes:
First generation unit 805, configured to record the order, offset, and block length of the multiple data blocks to generate the original spectrum;
Updating unit 806, configured to update the offsets of the multiple data blocks according to the recombination data to obtain the new original spectrum;
Second compression unit 807, configured to compress the new original spectrum to generate the compressed file spectrum.
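The work of units 805 to 807 can be sketched as follows. This is an illustration under stated assumptions, not the patent's format: the dictionary field names, the function names, and the use of JSON plus `zlib` to serialize and compress the spectrum are all hypothetical.

```python
import json
import zlib


def build_recipe(block_lengths):
    """Original spectrum: order, offset and length of each block in the original data."""
    recipe, offset = [], 0
    for order, length in enumerate(block_lengths):
        recipe.append({"order": order, "offset": offset, "length": length})
        offset += length
    return recipe


def update_offsets(recipe, new_order):
    """New original spectrum: order and length unchanged, offsets point into the recombination data.

    `new_order` lists the block orders in the sequence in which the blocks
    appear in the recombination data.
    """
    by_order = {entry["order"]: dict(entry) for entry in recipe}
    offset = 0
    for order in new_order:
        by_order[order]["offset"] = offset
        offset += by_order[order]["length"]
    return [by_order[entry["order"]] for entry in recipe]


def compress_recipe(recipe):
    """Compressed file spectrum (JSON + zlib are stand-ins chosen for this sketch)."""
    return zlib.compress(json.dumps(recipe).encode("utf-8"))


original_recipe = build_recipe([3, 3, 3, 3, 3, 3])                  # blocks A..F
new_recipe = update_offsets(original_recipe, [0, 3, 5, 1, 2, 4])    # A, D, F, B, C, E
print([e["offset"] for e in new_recipe])  # [0, 9, 12, 3, 15, 6]
```

Only the offset column changes between the two spectra, which is why the description speaks of updating the offsets rather than regenerating the spectrum.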
The recombination unit 803 includes:
First generation module 8031, configured to migrate and recombine similar data blocks in sequence to generate multiple similar linked lists;
Second generation module 8032, configured to read the corresponding data block contents from the original data according to the multiple similar linked lists to generate the recombination data.
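The two modules can be sketched as a grouping step followed by a copy step. In this sketch Python lists stand in for the similar linked lists, and the per-block `features` values are assumed to come from a similarity detector; the function names and the sample data are hypothetical.

```python
def build_similar_lists(features):
    """Group block indices whose feature values match, keeping first-seen order."""
    lists = {}
    for index, feature in enumerate(features):
        lists.setdefault(feature, []).append(index)
    return list(lists.values())


def recombine(original, recipe, similar_lists):
    """Walk each list and copy the corresponding block contents out of the original data."""
    out = bytearray()
    for chain in similar_lists:
        for index in chain:
            offset, length = recipe[index]
            out += original[offset:offset + length]
    return bytes(out)


original = b"AAABBBCCCDDDEEEFFF"
recipe = [(i * 3, 3) for i in range(6)]      # (offset, length) of blocks A..F
features = ["x", "y", "y", "x", "y", "x"]    # assumed: A, D, F similar; B, C, E similar
lists = build_similar_lists(features)
print(lists)                                 # [[0, 3, 5], [1, 2, 4]]
print(recombine(original, recipe, lists))    # b'AAADDDFFFBBBCCCEEE'
```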
The detection unit 802 includes:
Detection module 8021, configured to detect the similarity of the multiple data blocks by the super-feature method, the Simhash method, or the Minhash method.
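As an illustration of one of the named options, the following is a minimal Minhash sketch. The signature length, the 4-byte shingle size, and the use of salted SHA-1 as the hash family are assumptions made for this example; the patent does not specify these parameters.

```python
import hashlib
import random

NUM_HASHES = 16   # signature length (assumed)
SHINGLE = 4       # bytes per shingle (assumed)


def shingles(block: bytes) -> set:
    """All overlapping SHINGLE-byte substrings of the block."""
    return {block[i:i + SHINGLE] for i in range(len(block) - SHINGLE + 1)}


def minhash(block: bytes) -> list:
    """For each salted hash function, keep the minimum hash value over all shingles."""
    grams = shingles(block)
    signature = []
    for salt in range(NUM_HASHES):
        prefix = bytes([salt])
        signature.append(min(hashlib.sha1(prefix + g).digest()[:8] for g in grams))
    return signature


def estimated_similarity(sig_a, sig_b) -> float:
    """Fraction of matching signature positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


rng = random.Random(1)
block_a = bytes(rng.getrandbits(8) for _ in range(1024))
near_copy = bytearray(block_a)
near_copy[500] ^= 0xFF                                     # one byte differs
block_c = bytes(rng.getrandbits(8) for _ in range(1024))   # unrelated block

sim_near = estimated_similarity(minhash(block_a), minhash(bytes(near_copy)))
sim_far = estimated_similarity(minhash(block_a), minhash(block_c))
print(sim_near, sim_far)
```

Blocks whose estimated similarity exceeds a chosen threshold would be treated as similar and placed in the same list; the threshold itself is a tuning parameter that the description leaves open.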
It should be noted that the functions of the above units and modules are similar to those described for the embodiment of Fig. 3 and are not repeated here.
In the present invention, the blocking unit 801 divides the original data into multiple data blocks, the detection unit 802 detects the similarity of the multiple data blocks, the recombination unit 803 migrates and recombines similar data blocks to generate recombination data, and the first compression unit 804 compresses the recombination data to obtain compressed data. Because similar data blocks are migrated and placed together, the redundancy among them can be eliminated as far as possible. This solves the problem in conventional data compression that redundancy between data located far apart cannot be eliminated because of the limited size of the sliding window.
The data compression system has been described above; the data decompression system is described below. Referring to Fig. 9, one embodiment of the data decompression system in the embodiments of the present invention includes:
Decompression unit 901, configured to decompress the compressed data and the compressed file spectrum to obtain the recombination data and the new original spectrum, respectively;
Reading unit 902, configured to read the multiple data blocks from the recombination data according to the order, offset, and block length of the multiple data blocks recorded in the new original spectrum;
Writing unit 903, configured to write the multiple data blocks in sequence according to the order of the multiple data blocks recorded in the new original spectrum to obtain the original data.
It should be noted that the function of each unit in this embodiment is similar to that described for the embodiment of Fig. 6 and is not repeated here.
In the present invention, the decompression system mirrors the compression method: the decompression unit 901 decompresses the compressed data and the compressed file spectrum to obtain the recombination data and the new original spectrum; the reading unit 902 reads the multiple data blocks from the recombination data according to the order, offset, and block length recorded in the new original spectrum; and the writing unit 903 then writes the data blocks in sequence, which restores the original data.
The data compression system and the data decompression system in the embodiments of the present invention have been described above from the perspective of modular functional entities; the computer device in the embodiments of the present invention is described below from the perspective of hardware processing:
The computer device is used to implement the functions of the data compression system side. One embodiment of the computer device in the embodiments of the present invention includes:
A processor and a memory;
The memory is used to store a computer program. When the processor executes the computer program stored in the memory, the following steps can be implemented:
Dividing the original data into multiple data blocks;
Detecting the similarity of the multiple data blocks;
Migrating and recombining similar data blocks in sequence to generate recombination data;
Compressing the recombination data to generate compressed data.
In some embodiments of the present invention, the processor may also be used to implement the following step:
Recording the order, offset, and block length of the multiple data blocks to generate the original spectrum.
In some embodiments of the present invention, the processor may also be used to implement the following steps:
Updating the offsets of the multiple data blocks according to the recombination data to obtain the new original spectrum;
Compressing the new original spectrum to generate the compressed file spectrum.
In some embodiments of the present invention, the processor may also be used to implement the following steps:
Migrating and recombining similar data blocks in sequence to generate multiple similar linked lists;
Reading the corresponding data block contents from the original data according to the multiple similar linked lists to generate the recombination data.
In some embodiments of the present invention, the processor may also be used to implement the following step:
Detecting the similarity of the multiple data blocks by the super-feature method, the Simhash method, or the Minhash method.
The computer device may also be used to implement the functions of the data decompression system side. In another embodiment of the computer device in the embodiments of the present invention, the processor implements the following steps:
Decompressing the compressed data and the compressed file spectrum to obtain the recombination data and the new original spectrum, respectively;
Reading the multiple data blocks from the recombination data according to the order, offset, and block length of the multiple data blocks recorded in the new original spectrum;
Writing the multiple data blocks in sequence according to the order of the multiple data blocks recorded in the new original spectrum to obtain the original data.
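Taken together, the compression-side and decompression-side steps listed above form a round trip, which can be sketched end to end as follows. This is an illustration only: the fixed 4-byte block size, the "same first byte" similarity test, and the JSON plus `zlib` encoding are stand-ins chosen to keep the example short, and the function names are hypothetical.

```python
import json
import zlib

BLOCK_SIZE = 4  # tiny fixed-size blocks, for illustration only


def compress(original: bytes):
    # Divide the original data into blocks; the list index is the block order.
    blocks = [original[i:i + BLOCK_SIZE] for i in range(0, len(original), BLOCK_SIZE)]
    # Stand-in similarity test: blocks starting with the same byte count as similar.
    groups = {}
    for order, block in enumerate(blocks):
        groups.setdefault(block[:1], []).append(order)
    # Place similar blocks next to each other and record their new offsets.
    recombined, recipe = bytearray(), [None] * len(blocks)
    for chain in groups.values():
        for order in chain:
            recipe[order] = [len(recombined), len(blocks[order])]  # offset, length
            recombined += blocks[order]
    # Compress the recombination data and the updated spectrum separately.
    return zlib.compress(bytes(recombined)), zlib.compress(json.dumps(recipe).encode())


def decompress(compressed_data: bytes, compressed_recipe: bytes) -> bytes:
    recombined = zlib.decompress(compressed_data)
    recipe = json.loads(zlib.decompress(compressed_recipe))
    # Read each block by offset and length, then write them in original order.
    return b"".join(recombined[off:off + length] for off, length in recipe)


data = b"aaa1bbb1aaa2bbb2aaa3bb"
packed, packed_recipe = compress(data)
assert decompress(packed, packed_recipe) == data
```

A real implementation would replace the first-byte test with one of the similarity detectors named in the description and would choose a block size suited to the data.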
It is understood that, whether on the data compression system side or on the data decompression system side, when the processor in the computer device described above executes the computer program, it can also implement the functions of the units in the corresponding device embodiments above, which are not repeated here. By way of example, the computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program in the data compression system/data decompression system. For example, the computer program may be divided into the units of the data compression system described above, each unit implementing the specific function described for the corresponding data compression system.
The computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, or a cloud server. The computer device may include, but is not limited to, a processor and a memory. Those skilled in the art will understand that the processor and memory are merely examples of components of the computer device and do not limit it; the computer device may include more or fewer components, combine certain components, or use different components. For example, the computer device may also include input/output devices, network access devices, buses, and the like.
The processor may be a central processing unit (Central Processing Unit, CPU), or another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer device and connects the various parts of the whole computer device through various interfaces and lines.
The memory may be used to store the computer program and/or modules. The processor implements the various functions of the computer device by running or executing the computer program and/or modules stored in the memory and by calling the data stored in the memory. The memory may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created through use of the terminal, and the like. In addition, the memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or other volatile solid-state storage device.
The present invention also provides a computer-readable storage medium used to implement the functions of the data compression system side. A computer program is stored on the medium, and when the computer program is executed by a processor, the processor can be used to perform the following steps:
Dividing the original data into multiple data blocks;
Detecting the similarity of the multiple data blocks;
Migrating and recombining similar data blocks in sequence to generate recombination data;
Compressing the recombination data to generate compressed data.
In some embodiments of the present invention, when the computer program stored on the computer-readable storage medium is executed by a processor, the processor can specifically be used to perform the following steps:
Recording the order, offset, and block length of the multiple data blocks to generate the original spectrum.
Updating the offsets of the multiple data blocks according to the recombination data to obtain the new original spectrum;
Compressing the new original spectrum to generate the compressed file spectrum.
Migrating and recombining similar data blocks in sequence to generate multiple similar linked lists;
Reading the corresponding data block contents from the original data according to the multiple similar linked lists to generate the recombination data.
Detecting the similarity of the multiple data blocks by the super-feature method, the Simhash method, or the Minhash method.
The present invention also provides another computer-readable storage medium used to implement the functions of the data decompression system side. A computer program is stored on the medium, and when the computer program is executed by a processor, the processor can be used to perform the following steps:
Decompressing the compressed data and the compressed file spectrum to obtain the recombination data and the new original spectrum, respectively;
Reading the multiple data blocks from the recombination data according to the order, offset, and block length of the multiple data blocks recorded in the new original spectrum;
Writing the multiple data blocks in sequence according to the order of the multiple data blocks recorded in the new original spectrum to obtain the original data.
It is understood that if the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a corresponding computer-readable storage medium. Based on this understanding, all or part of the flow of the above corresponding method embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor it can implement the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and so on. It should be noted that the content included in the computer-readable medium may be increased or decreased as appropriate according to the requirements of legislation and patent practice in a jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, the computer-readable medium does not include electrical carrier signals and telecommunication signals.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (12)

  1. A data compression method, characterized by comprising:
    dividing original data into multiple data blocks;
    detecting the similarity of the multiple data blocks;
    migrating and recombining similar data blocks in sequence to generate recombination data;
    compressing the recombination data to generate compressed data.
  2. The method according to claim 1, characterized in that, after the dividing of the original data into multiple data blocks and before the detecting of the similarity of the multiple data blocks, the method further comprises:
    recording the order, offset, and block length of the multiple data blocks to generate an original spectrum.
  3. The method according to claim 2, characterized in that, after the migrating and recombining of similar data blocks in sequence to generate recombination data and before the compressing of the recombination data to generate compressed data, the method further comprises:
    updating the offsets of the multiple data blocks according to the recombination data to obtain a new original spectrum.
  4. The method according to any one of claims 1 to 3, characterized in that the migrating and recombining of similar data blocks in sequence to generate recombination data comprises:
    migrating and recombining the similar data blocks in sequence to generate multiple similar linked lists;
    reading the corresponding data block contents from the original data according to the multiple similar linked lists to generate the recombination data.
  5. The method according to claim 4, characterized in that the detecting of the similarity of the multiple data blocks comprises:
    detecting the similarity of the multiple data blocks by a super-feature method, a Simhash method, or a Minhash method.
  6. A data decompression method, characterized by comprising:
    decompressing compressed data and a compressed file spectrum to obtain recombination data and a new original spectrum, respectively;
    reading multiple data blocks from the recombination data according to the order, offset, and block length of the multiple data blocks recorded in the new original spectrum;
    writing the multiple data blocks in sequence according to the order of the multiple data blocks recorded in the new original spectrum to obtain original data.
  7. A data compression system, characterized by comprising:
    a blocking unit, configured to divide original data into multiple data blocks;
    a detection unit, configured to detect the similarity of the multiple data blocks;
    a recombination unit, configured to migrate and recombine similar data blocks in sequence to generate recombination data;
    a compression unit, configured to compress the recombination data to generate compressed data.
  8. A data decompression system, characterized by comprising:
    a decompression unit, configured to decompress compressed data and a compressed file spectrum to obtain recombination data and a new original spectrum, respectively;
    a reading unit, configured to read multiple data blocks from the recombination data according to the order, offset, and block length of the multiple data blocks recorded in the new original spectrum;
    a writing unit, configured to write the multiple data blocks in sequence according to the order of the multiple data blocks recorded in the new original spectrum to obtain original data.
  9. A computer device, characterized by comprising a processor, wherein the processor, when executing a computer program stored in a memory, implements the steps of the data compression method according to any one of claims 1 to 5.
  10. A computer device, characterized by comprising a processor, wherein the processor, when executing a computer program stored in a memory, implements the steps of the data decompression method according to claim 6.
  11. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the data compression method according to any one of claims 1 to 5.
  12. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the data decompression method according to claim 6.
CN201710884914.2A 2017-09-26 2017-09-26 Data compression method, data decompression method and related system Active CN107682016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710884914.2A CN107682016B (en) 2017-09-26 2017-09-26 Data compression method, data decompression method and related system

Publications (2)

Publication Number Publication Date
CN107682016A true CN107682016A (en) 2018-02-09
CN107682016B CN107682016B (en) 2021-09-17

Family

ID=61137381

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710884914.2A Active CN107682016B (en) 2017-09-26 2017-09-26 Data compression method, data decompression method and related system

Country Status (1)

Country Link
CN (1) CN107682016B (en)


Also Published As

Publication number Publication date
CN107682016B (en) 2021-09-17

Similar Documents

Publication Publication Date Title
CN107682016A (en) A kind of data compression method, data decompression method and related system
CN108427538B (en) Storage data compression method and device of full flash memory array and readable storage medium
CN107506153B (en) Data compression method, data decompression method and related system
CN106797446B (en) Historical search based on memory
CN107046812B (en) Data storage method and device
US9798731B2 (en) Delta compression of probabilistically clustered chunks of data
US9514178B2 (en) Table boundary detection in data blocks for compression
CN107305586B (en) Index generation method, index generation device and search method
CN108027713A (en) Data de-duplication for solid state drive controller
CN111125033A (en) Space recovery method and system based on full flash memory array
CN105844210B (en) Hardware efficient fingerprinting
CN103838753A (en) Storage and verification method and device for exchange codes
CN111124940B (en) Space recovery method and system based on full flash memory array
US10534755B2 (en) Word, phrase and sentence deduplication for text repositories
CN109947731A (en) Method and device for deleting repeated data
CN111124939A (en) Data compression method and system based on full flash memory array
CN111124259A (en) Data compression method and system based on full flash memory array
US20230076729A2 (en) Systems, methods and devices for eliminating duplicates and value redundancy in computer memories
CN112395275A (en) Data deduplication via associative similarity search
CN111198857A (en) Data compression method and system based on full flash memory array
US9176973B1 (en) Recursive-capable lossless compression mechanism
EP3051699B1 (en) Hardware efficient rabin fingerprints
CN114930725A (en) Capacity reduction in storage systems
CN111177092A (en) Deduplication method and device based on erasure codes
Xue et al. A comprehensive study of present data deduplication

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant