CN104951473A

CN104951473A - Method and device for compressing data

Info

Publication number: CN104951473A
Application number: CN201410125541.7A
Authority: CN
Inventors: 王新中; 于刚; 冯立峰; 陈慧德; 周敏; 纳丽铭; 刘凯; 田甲星; 韩米林; 李帆
Original assignee: China Mobile Group Ningxia Co Ltd
Current assignee: China Mobile Group Ningxia Co Ltd
Priority date: 2014-03-31
Filing date: 2014-03-31
Publication date: 2015-09-30

Abstract

The invention provides a method and device for compressing data. The method comprises: obtaining a database to be compressed; searching each column of the database for cells with identical content; and generating compressed package data from the content, the row numbers and the column number of those cells, wherein the compressed package data records the correspondence between the row numbers of same-column cells with identical content, the cell content, and the column number of those cells. The method and device are suited to compressing databases in which cells within the same column frequently share the same content.

Description

Method and device for compressing data

Technical field

The present invention relates to the field of databases, and in particular to a method and device for compressing data.

Background art

Over long periods of operation, key business systems accumulate large volumes of historical transaction data. This historical data makes the systems ever larger and ever more complex to maintain. The rapid growth of data volume has become one of the hardest problems facing IT departments, because it seriously degrades application performance, reduces application stability, and consumes substantial investment. For ever-growing data, compression is a necessary measure.

In a database, compressing tables before storing them reduces storage cost and the disk space required by all kinds of data. Traditional compression techniques mostly rely on tools such as gzip, and the compression ratio achieved on an original table is generally 3:1 to 5:1. For example, a large table TABLE_A in a database holds 280 GB before compression and 128 GB after compression with a conventional technique.

In addition, with traditional compressed storage, the compressed data must first be decompressed before it can be accessed.

With a compression ratio of only about 3:1 to 5:1, extracting this compressed data and then storing or transmitting it in decompressed form inevitably increases the load on the network and on storage. Moreover, with traditional compressed storage, data access and manipulation become slower and consume more CPU resources.

Existing compressed data storage techniques have the following defects:

1. The compression ratio is unsatisfactory, generally only about 3:1 to 5:1.

2. To access compressed data, the data must first be decompressed and extracted, and then compressed again once the operation is complete. The whole process is complicated, and individual data rows cannot be accessed quickly.

3. Decompressing, extracting, operating on and recompressing the data consumes more CPU and memory.

Take the traditional compression mode in Oracle as an example. After a table is compressed its storage footprint shrinks, but to operate on its data the rows must first be decompressed out of their compression blocks, the corresponding DML (data manipulation language) statements executed, and the rows then compressed back into the compression blocks. If DML operations are frequent, this incurs very high CPU consumption.

Summary of the invention

The invention provides a method and device for compressing data that can improve the compression ratio of databases whose cell content is highly repetitive.

In one aspect, a method for compressing data is provided, comprising:

Obtaining a database to be compressed;

Searching each column of the database to be compressed for cells with identical content;

Generating compressed package data according to the content of the cells, the row numbers of the cells and the column number of the cells, wherein the compressed package data records the correspondence between the row numbers of same-column cells with identical content, the cell content, and the column number of those cells.

Before the step of searching each column of the database to be compressed for cells with identical content, the method further comprises:

Selecting a column of data;

Sorting the database using the cell content of the selected column as the sort key.

The method further comprises:

Compressing the compressed data again.

The compressed package data comprises: each row number of the same-column cells with identical content, the cell content, and the column number of those cells; or

When the same-column cells with identical content are adjacent, the compressed package data comprises: the starting row number and ending row number of the same-column cells with identical content, the cell content, and the column number of those cells; or

When the same-column cells with identical content are adjacent, the compressed package data comprises: the starting row number of the same-column cells with identical content, the number of adjacent cells with identical content, the cell content, and the column number of those cells.

The method further comprises:

When data is searched for in the database to be compressed, performing the search in the compressed package data.

In another aspect, a device for compressing data is provided, comprising:

An acquiring unit, which obtains a database to be compressed;

A lookup unit, which searches each column of the database to be compressed for cells with identical content;

A first compression unit, which generates compressed package data according to the content of the cells, the row numbers of the cells and the column number of the cells, wherein the compressed package data records the correspondence between the row numbers of same-column cells with identical content, the cell content, and the column number of those cells.

The device further comprises:

A selection unit, which selects a column of data;

A sorting unit, which sorts the database using the cell content of the selected column as the sort key.

The device further comprises:

A second compression unit, which compresses the compressed data again.

The device further comprises:

A search unit, which, when data is searched for in the database to be compressed, performs the search in the compressed package data.

The beneficial effects of the invention are as follows:

When compressing a database, the invention compresses column by column. Within each column, the content shared by identical cells needs to be recorded only once, so the compression ratio of databases whose cell content is highly repetitive is improved.

Brief description of the drawings

Fig. 1 is a schematic flowchart of one embodiment of the method for compressing data according to the invention;

Fig. 2 is a schematic flowchart of another embodiment of the method for compressing data according to the invention;

Fig. 3 is a schematic diagram of the data format of the compressed package in an application scenario of the method for compressing data according to the invention;

Fig. 4 is a connection diagram of the device for compressing data according to the invention.

Embodiment

To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the described embodiments fall within the scope of protection of the invention.

As shown in Fig. 1, one embodiment of the method for compressing data according to the invention comprises:

Step 11: obtain the database to be compressed. For example, the following is a simple database:

Step 12: search each column of the database to be compressed for cells with identical content. In this step the search is performed column by column. For example, in column 4 the cells in rows 1, 2 and 3 have the same content.

Step 13: generate compressed package data according to the content of the cells, the row numbers of the cells and the column number of the cells, wherein the compressed package data records the correspondence between the row numbers of same-column cells with identical content, the cell content, and the column number of those cells. For example, taking the compression of column 4, the compressed package must record the correspondence between the row numbers (1, 2, 3) of the same-column cells with identical content, the cell content (Industrial and Commercial Bank), and the column number (4).

The compressed package data comprises: each row number of the same-column cells with identical content, the cell content, and the column number of those cells. For example, the compressed package data comprises: each row number (1, 2, 3) of the same-column cells with identical content, the cell content (Industrial and Commercial Bank), and the column number (4).

Or, when the same-column cells with identical content are adjacent, the compressed package data comprises: the starting row number and ending row number of the same-column cells with identical content, the cell content, and the column number of those cells. For example, the compressed package data comprises: the starting and ending row numbers (1, 3) of the same-column cells with identical content, the cell content (Industrial and Commercial Bank), and the column number (4).

Or, when the same-column cells with identical content are adjacent, the compressed package data comprises: the starting row number of the same-column cells with identical content, the number of adjacent cells with identical content, the cell content, and the column number of those cells. For example, taking the upper row as the starting row, the compressed package data comprises: the starting row number (1), the number of adjacent cells with identical content (3), the cell content (Industrial and Commercial Bank), and the column number (4). Alternatively, taking the lower row as the starting row, the compressed package data comprises: the starting row number (3), the number of adjacent cells with identical content (3), the cell content (Industrial and Commercial Bank), and the column number (4).
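The three package layouts described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `build_entries`, `as_range` and `as_run` are hypothetical, a row means one record and a column means one field, both are numbered from 1, and the sample values follow the three-record example table used for the second embodiment.

```python
from collections import defaultdict


def build_entries(table):
    """Group identical cells within each column.

    Returns a list of (column, content, rows) tuples, where rows is the
    list of 1-based row numbers whose cell in that column holds content.
    """
    entries = []
    for c in range(len(table[0])):
        groups = defaultdict(list)  # content -> row numbers, in first-seen order
        for r, record in enumerate(table, start=1):
            groups[record[c]].append(r)
        for content, rows in groups.items():
            entries.append((c + 1, content, rows))
    return entries


def as_range(rows):
    """(start row, end row) layout; only valid when the rows are adjacent."""
    if rows != list(range(rows[0], rows[-1] + 1)):
        raise ValueError("rows are not adjacent")
    return rows[0], rows[-1]


def as_run(rows):
    """(start row, count) layout; only valid when the rows are adjacent."""
    start, end = as_range(rows)
    return start, end - start + 1


# Hypothetical sample data mirroring the example table.
table = [
    ["Xiao Wang", 2000, "Male", "Industrial and Commercial Bank"],
    ["Xiao Zhang", 2010, "Female", "Industrial and Commercial Bank"],
    ["Xiao Li", 2020, "Male", "Industrial and Commercial Bank"],
]
entries = build_entries(table)
col4 = [entry for entry in entries if entry[0] == 4]
```

For the sample table, column 4 produces a single entry covering rows 1, 2 and 3, which `as_range` expresses as (1, 3) and `as_run` as starting row 1 with a count of 3. The two compact layouts raise an error for non-adjacent rows, which is why only the full row list applies to the unsorted third column.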

Optionally, the method further comprises:

Step 14: compress the compressed data again. The second compression may use the same compression methods as the prior art, further reducing the space the data occupies. Of course, the data may also be compressed multiple times.

Optionally, the method further comprises:

Step 15: when data is searched for in the database to be compressed, perform the search in the compressed package data.

For example, when searching by "cell content", "row number" or "column number", the compressed package data already contains all three, so the search can be performed directly in the compressed package data without decompressing the package.
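A search of this kind might look like the sketch below. It assumes the package is held as a list of (column, content, starting row, count) tuples, which is one of the layouts described earlier; the helper names `rows_with_content` and `cell_at` and the sample package are hypothetical.

```python
# Each entry: (column, content, starting row, number of adjacent rows).
# Hypothetical package for a table already sorted on column 3.
package = [
    (3, "Male", 1, 2),
    (3, "Female", 3, 1),
    (4, "Industrial and Commercial Bank", 1, 3),
]


def rows_with_content(package, col, content):
    """Search by cell content: return the row numbers in col holding content."""
    rows = []
    for c, value, start, count in package:
        if c == col and value == content:
            rows.extend(range(start, start + count))
    return rows


def cell_at(package, col, row):
    """Search by row and column number: return the content, or None if absent."""
    for c, value, start, count in package:
        if c == col and start <= row < start + count:
            return value
    return None
```

Both lookups read the package entries directly and never rebuild the original table, which is the property the step relies on.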

As shown in Fig. 2, another embodiment of the method for compressing data according to the invention comprises:

Step 21: obtain the database to be compressed. For example, consider the following database:

Sequence number	1	2	3	4
1	Xiao Wang	2000	Male	Industrial and Commercial Bank
2	Xiao Zhang	2010	Female	Industrial and Commercial Bank
3	Xiao Li	2020	Male	Industrial and Commercial Bank

Step 22: select a column of data. For example, sort by column 3.

Step 23: sort the database using the cell content of the selected column as the sort key. This step makes cells with identical content adjacent. The sorted database is as follows:

Sequence number	1	2	3	4
1	Xiao Wang	2000	Male	Industrial and Commercial Bank
3	Xiao Li	2020	Male	Industrial and Commercial Bank
2	Xiao Zhang	2010	Female	Industrial and Commercial Bank

Step 24: search each column of the sorted database to be compressed for cells with identical content.

Step 25: generate compressed package data according to the content of the cells, the row numbers of the cells and the column number of the cells, wherein the compressed package data records the correspondence between the row numbers of same-column cells with identical content, the cell content, and the column number of those cells. The compressed package data comprises: each row number of the same-column cells with identical content, the cell content, and the column number of those cells. Or, when the same-column cells with identical content are adjacent, the compressed package data comprises: the starting row number and ending row number of the same-column cells with identical content, the cell content, and the column number of those cells. Or, when the same-column cells with identical content are adjacent, the compressed package data comprises: the starting row number of the same-column cells with identical content, the number of adjacent cells with identical content, the cell content, and the column number of those cells.
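The effect of sorting before searching can be sketched as follows. The code assumes each record is a tuple whose index 0 is the sequence number, so that index k is column k; `sort_by_column` and `column_runs` are hypothetical names, and descending order is chosen only so that the result matches the order of the sorted table in this example.

```python
from itertools import groupby


def sort_by_column(records, col, descending=False):
    """Stable sort of the records on column col (index 0 is the sequence number)."""
    return sorted(records, key=lambda rec: rec[col], reverse=descending)


def column_runs(records, col):
    """Encode one column as (starting position, count, content) runs."""
    runs, position = [], 1
    for content, group in groupby(rec[col] for rec in records):
        count = sum(1 for _ in group)
        runs.append((position, count, content))
        position += count
    return runs


records = [
    (1, "Xiao Wang", 2000, "Male", "Industrial and Commercial Bank"),
    (2, "Xiao Zhang", 2010, "Female", "Industrial and Commercial Bank"),
    (3, "Xiao Li", 2020, "Male", "Industrial and Commercial Bank"),
]
before = column_runs(records, 3)                      # three runs of length 1
ordered = sort_by_column(records, 3, descending=True)
after = column_runs(ordered, 3)                       # two runs after sorting
```

Before sorting, column 3 needs three runs; after sorting, the two "Male" cells become adjacent and the column needs only two. The positions in the runs refer to the sorted order, so the original sequence numbers have to be kept alongside them if the original order matters.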

Optionally, the method further comprises:

Step 26: compress the compressed data again.

The method further comprises:

Step 27: when data is searched for in the database to be compressed, perform the search in the compressed package data.

An application scenario of the invention is described below. The scenario is a method of column-oriented compressed storage for database tables, which can be used with databases such as Oracle. When storing data, the invention maintains a high compression ratio while still allowing the rows and columns of the data to be accessed quickly without extra CPU consumption, achieving high compression, fast location and low resource consumption.

Based on the characteristics of the data, this scheme designs a data organization format: multi-row column compression (CMI: compressed multiple insert). The extracted data is stored as a series of compression vectors (also called units). A CMI vector contains a number of rows (for example, 10,000 rows); these rows belong to the same table or partition, but need not belong to the same data block. A CMI vector stores the row data by column: it contains a series of column compression units (CCU: column compression unit), and each CCU holds the data of one column for all of the rows.

Depending on the specific characteristics of its data, a CCU can apply suitable compression preprocessing (for example, sorting with the column as the sort key) and then compress. Because the values within a single column share more common features, they compress more easily.

For example, consider the following two rows of data:

Xiao Wang	Industrial and Commercial Bank	2000
Xiao Zhang	Industrial and Commercial Bank	2010

The traditional approach stores data by row, that is, in an order like: Xiao Wang, Industrial and Commercial Bank, 2000, Xiao Zhang, Industrial and Commercial Bank, 2010. To improve the compression ratio, the invention stores data by column, that is, in an order like: Xiao Wang, Xiao Zhang, Industrial and Commercial Bank, Industrial and Commercial Bank, 2000, 2010.
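The difference between the two storage orders can be reproduced with a few lines of Python. The variable names and the `adjacent_repeats` helper are illustrative only; the helper simply counts how many neighbouring values are equal, as a rough indicator of how much repetition a run-based compressor could exploit.

```python
rows = [
    ("Xiao Wang", "Industrial and Commercial Bank", 2000),
    ("Xiao Zhang", "Industrial and Commercial Bank", 2010),
]

# Row-oriented order: one record after another.
row_major = [cell for record in rows for cell in record]

# Column-oriented order: zip(*rows) transposes the records into columns.
column_major = [cell for column in zip(*rows) for cell in column]


def adjacent_repeats(values):
    """Count positions where a value equals the one immediately before it."""
    return sum(1 for a, b in zip(values, values[1:]) if a == b)
```

In the row-oriented order no two neighbouring values are equal, while the column-oriented order places the two identical bank names next to each other.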

The data format of the compressed package is described below. As shown in Fig. 3, the generated package is called a CMI multi-row column compression vector. A CMI vector comprises control information and a data package. The control information comprises: len (vector length), crc (check code) and flgz (flag bits). The data package (PKG) comprises a row number list and column compression units (CCU), where each CCU comprises: len (unit length), flgccu (compression flag bits) and data (unit data). The row number list comprises: objn (nominal object number, table or partition), objd (storage object number, table or partition), rowid (starting row number, optional), nrow (number of rows), ncol (number of columns, optional) and rowidlist (row number list, optional).

The data format is described in detail as follows:

(1) len (vector length): 4 bytes; indicates the length of the vector.

(2) crc (check code): 4 bytes; used to detect whether the data of this vector was corrupted during transmission or storage.

(3) flgz (flag bits): 4 bytes; the values of individual bits mark the following control information.

● Bit 0: overall compression mode of the data package (PKG)

○ Bit value 0: no third-level compression.

○ Bit value 1: the whole package is compressed with an entropy compression algorithm.

● Bit 1: whether a row number list is present

○ Bit value 0: no row number list. The row numbers are contiguous, running from the starting row number (rowid) to (rowid+nrow-1).

○ Bit value 1: a row number list is present, and all row numbers are recorded in it.

● Bits 2 to 7: number of columns (that is, the number of column compression units)

○ Value of the 6 bits is 0: the number of columns is greater than 0x3F (63), and the actual number of columns must be given separately in the PKG.

○ Value of the 6 bits is greater than 0: the value is the actual number of columns.

(4) objn: 4 bytes; the nominal object number (table or partition).

(5) objd: 4 bytes; the storage object number (table or partition).

(6) rowid: 6 or 0 bytes; the starting row number (optional).

(7) nrow: 4 bytes; the number of rows.

(8) ncol: 2 or 0 bytes; the number of columns (optional). When flgz cannot express the number of columns, it is given here as a 16-bit integer.

(9) rowid list: variable length; the row number list (optional). When bit 1 of flgz is 1, the row number list is used to express the row numbers of all rows. The list is stored compressed with a dedicated variable-length integer compression method, and its compressed length is determined by the compression algorithm.

(10) len (column compression unit length): 4 bytes; if it is 0, all rows of this unit are empty (NULL).

(11) flgccu (compression flag bits): 4 bytes.

● Bit 0: whether the column has a fixed length (0: variable length, 1: fixed length)

● Bit 1: whether sort preprocessing is applied to the column (0: not sorted, 1: sorted)

● Bits 2 to 4: first-pass compression

○ Value 0: no first-pass compression

○ Value 1: small-integer bitmap compression

(12) data: variable length; the actual compressed data of this CCU.
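The flgz layout in item (3) can be illustrated with a small encoder and decoder. The text does not say which end of the word bit 0 is, nor the byte order of the 4-byte field, so this sketch assumes that bit 0 is the least significant bit and that the field is stored little-endian; `encode_flgz`, `decode_flgz` and `pack_flgz` are hypothetical names.

```python
import struct


def encode_flgz(entropy_compressed, has_rowid_list, ncol):
    """Build the flgz value from its three documented parts.

    Assumption: bit 0 is the least significant bit.
    """
    # Bits 2 to 7 hold the column count when it fits in 6 bits (1..63);
    # a field value of 0 means the count is stored separately in ncol.
    ncol_field = ncol if 0 < ncol <= 0x3F else 0
    return int(entropy_compressed) | (int(has_rowid_list) << 1) | (ncol_field << 2)


def decode_flgz(flgz):
    """Split a flgz value back into its three documented parts."""
    return {
        "entropy_compressed": bool(flgz & 0x1),
        "has_rowid_list": bool((flgz >> 1) & 0x1),
        "ncol_field": (flgz >> 2) & 0x3F,
    }


def pack_flgz(flgz):
    """Serialize flgz as 4 bytes (little-endian byte order is an assumption)."""
    return struct.pack("<I", flgz)
```

With these assumptions, a package of 4 columns that is entropy-compressed and has contiguous row numbers is flagged as 17 (binary 010001), and a package with 100 columns stores 0 in bits 2 to 7 and carries the real count in the separate ncol field.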

A detailed description in tabular form follows:

The invention takes data that is traditionally stored by row and stores it by column. In addition, based on the characteristics of single-column data, it uses three levels of compression (two column-level passes and one overall pass) to improve the compression ratio.

First pass, column compression: a compression algorithm is selected to suit the particular characteristics of the single-column data. This pass compresses away those particular characteristics while disturbing the remaining regularities of the data as little as possible, and can even make the data more compressible, which prepares it for the general-purpose algorithm of the second pass.

Second pass, column compression: the single-column data produced by the first pass is compressed further with a general-purpose compression algorithm.

Third pass, overall compression: after the two previous passes the regularity of the data has dropped sharply, and an entropy compression algorithm can compress the now disordered data further. This pass is applied to the whole package (PKG), covering all columns.
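The three passes can be sketched as a small pipeline. The text does not name the concrete algorithms, so the choices here are stand-ins and should be read as assumptions: run-length encoding plays the role of the column-specific first pass, and Python's `zlib` (DEFLATE) stands in both for the general-purpose second pass and for the entropy pass over the whole package. The 4-byte length prefix loosely mirrors the len field of a column compression unit; all function names are hypothetical.

```python
import json
import zlib
from itertools import groupby


def first_pass(column):
    """Column-specific pass (stand-in): run-length encode as [value, count]."""
    return [[value, sum(1 for _ in group)] for value, group in groupby(column)]


def second_pass(runs):
    """General-purpose pass on one column (zlib as a stand-in)."""
    return zlib.compress(json.dumps(runs).encode("utf-8"))


def third_pass(column_units):
    """Whole-package pass over all column units (zlib as a stand-in)."""
    package = b"".join(len(u).to_bytes(4, "big") + u for u in column_units)
    return zlib.compress(package)


def compress_table(columns):
    """Apply the three passes to a table given as a list of columns."""
    return third_pass([second_pass(first_pass(col)) for col in columns])


def decompress_table(blob):
    """Reverse the three passes and return the list of columns."""
    package = zlib.decompress(blob)
    columns, pos = [], 0
    while pos < len(package):
        size = int.from_bytes(package[pos:pos + 4], "big")
        unit = package[pos + 4:pos + 4 + size]
        runs = json.loads(zlib.decompress(unit).decode("utf-8"))
        columns.append([value for value, count in runs for _ in range(count)])
        pos += 4 + size
    return columns


# Hypothetical, highly repetitive sample columns.
columns = [
    ["Industrial and Commercial Bank"] * 1000,
    ["Male"] * 600 + ["Female"] * 400,
]
blob = compress_table(columns)
```

The round trip through `decompress_table` returns the original columns. On data this repetitive the first pass already does most of the work, which matches the point that single-column data is the easiest to compress.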

As shown in Fig. 4, a device for compressing data according to the invention comprises:

An acquiring unit 31, which obtains a database to be compressed;

A lookup unit 32, which searches each column of the database to be compressed for cells with identical content;

A first compression unit 33, which generates compressed package data according to the content of the cells, the row numbers of the cells and the column number of the cells, wherein the compressed package data records the correspondence between the row numbers of same-column cells with identical content, the cell content, and the column number of those cells.

The device further comprises:

A selection unit 34, which selects a column of data;

A sorting unit 35, which sorts the database using the cell content of the selected column as the sort key.

The device further comprises:

A second compression unit 36, which compresses the compressed data again.

The device further comprises:

A search unit 37, which, when data is searched for in the database to be compressed, performs the search in the compressed package data.

Besides synchronizing compressed data, the invention can also be used for data archiving and auditing, high-speed compressed reporting databases, and in-memory databases. Compressing the data also reduces the data transmission load between production and disaster-recovery databases. The invention overcomes shortcomings of traditional compressed storage such as low compression ratio and high resource consumption, and has the following advantages:

1) Higher compression ratio: compared with traditional compression, this data compression scheme can reach compression ratios of 10:1 to 40:1.

2) Data is stored and compressed by field, and rows can be located quickly through the row number list (a row number is a physical address and is not affected by the execution plan). This avoids the CPU consumption of scanning the table, so data rows can be found and operated on quickly.

3) The invention uses existing storage space more efficiently, which reduces the disk expansion demanded by ever-growing IT system data and lowers IT costs.

4) The invention not only achieves a higher data compression ratio but also allows fast access to individual data rows, without increasing CPU consumption.

The above are only embodiments of the invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the principles of the invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the invention.

Claims

1. A method for compressing data, characterized by comprising:

Obtaining a database to be compressed;

2. The method of claim 1, characterized in that, before the step of searching each column of the database to be compressed for cells with identical content, the method further comprises:

Selecting a column of data;

3. The method of claim 1, characterized by further comprising:

Compressing the compressed data again.

4. The method of claim 1, characterized in that,

5. The method of claim 1, characterized by further comprising:

6. A device for compressing data, characterized by comprising:

An acquiring unit, which obtains a database to be compressed;

7. The device of claim 6, characterized by further comprising:

A selection unit, which selects a column of data;

8. The device of claim 6, characterized by further comprising:

9. The device of claim 6, characterized in that,

10. The device of claim 6, characterized by further comprising: