CN116594572A - Floating point number stream data compression method, device, computer equipment and medium - Google Patents

Publication number
CN116594572A
Authority
CN
China
Prior art keywords
floating point
point number
value
current
bit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310869512.0A
Other languages
Chinese (zh)
Other versions
CN116594572B (en)
Inventor
王勇
杨谕黔
于宁
唐鹏洲
王昊
姚延栋
翁岩青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Siweizongheng Data Technology Co ltd
Original Assignee
Beijing Siweizongheng Data Technology Co ltd
Application filed by Beijing Siweizongheng Data Technology Co ltd
Priority to CN202310869512.0A
Publication of CN116594572A
Application granted
Publication of CN116594572B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0628: Interfaces specially adapted for storage systems making use of a particular technique
    • G06F3/0638: Organizing or formatting or addressing of data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/0608: Saving storage space on storage systems
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00: Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/06: Digital input from, or digital output to, record carriers, e.g. RAID, emulated record carriers or networked record carriers
    • G06F3/0601: Interfaces specially adapted for storage systems
    • G06F3/0602: Interfaces specially adapted for storage systems specifically adapted to achieve a particular effect
    • G06F3/061: Improving I/O performance
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The embodiments of the application provide a floating point number streaming data compression method, a device, computer equipment and a medium, and relate to the technical field of data processing. The method comprises the following steps: establishing an index table having N buckets, each bucket having M slots; determining a key value based on the binary representation of the current floating point number; searching for a target bucket among the N buckets according to the key value; using linear search, XORing the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value, and recording the position of the base value in the current window; encoding the current floating point number according to the zero bits of the second value, obtained by XORing the current floating point number with the base value, and the position of the base value in the current window; and compressing and storing the result according to a preset storage format. This scheme achieves lossless compression and improves both the compression ratio and the decompression speed.

Description

Floating point number stream data compression method, device, computer equipment and medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a floating point number streaming data compression method, apparatus, computer device, and medium.
Background
Compression effectively reduces data volume, storage footprint, and the amount of data moved through IO (input/output), which in turn speeds up data processing. With rapid advances in data processing technology, data stream processing is becoming increasingly common. Conventional compression methods such as zstd (Zstandard, a lossless compression algorithm) target fixed data: they search globally over a fixed file or a large data block to obtain a good compression ratio. Stream data, by contrast, is generated continuously and processed as it arrives, so encoding and compression must rely on the relationship between each value and the values that preceded it.
The Gorilla algorithm, proposed by Facebook, exploits the binary similarity between adjacent values: it XORs each value with the previous one and stores the result with its leading and trailing zeros removed. This approach does not work well on real-world data, because floating point numbers have a special internal representation, and values that are close in decimal are not necessarily similar in binary.
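As an illustration of this point, the following minimal Python sketch XORs the 64-bit patterns of 1.1 and 1.2 and counts the leading and trailing zero bits of the result. The helper names (`double_bits`, `leading_zeros`, `trailing_zeros`) are chosen for this illustration only and do not come from Gorilla or from this application.

```python
import struct

def double_bits(x: float) -> int:
    """Return the IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def leading_zeros(v: int, width: int = 64) -> int:
    """Number of consecutive zero bits at the high end of a width-bit value."""
    return width - v.bit_length()

def trailing_zeros(v: int, width: int = 64) -> int:
    """Number of consecutive zero bits at the low end of a width-bit value."""
    return width if v == 0 else (v & -v).bit_length() - 1

prev, cur = 1.1, 1.2
xor = double_bits(prev) ^ double_bits(cur)
print(f"{xor:064b}")
print("leading zeros:", leading_zeros(xor), "trailing zeros:", trailing_zeros(xor))
```

Although 1.1 and 1.2 are close in decimal, their XOR has only 14 leading zero bits and no trailing zero bits, so about 50 bits would still have to be stored.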
Similarly, some systems store floating point numbers as strings to obtain a more repetitive representation, which general-purpose compression methods can then compress well. However, the strings must be converted back into floating point numbers every time the data is read and processed, which is very costly.
VictoriaMetrics takes another approach: it discards some of the precision of the floating point numbers and converts them into integers for storage. This effectively increases the compression ratio, but in many scenarios the loss of precision is unacceptable to users.
Chimp is another improvement on the Gorilla algorithm. Based on a study of a large number of open datasets, it optimizes Gorilla's bit-encoding scheme and therefore outperforms Gorilla in most cases. However, it also inherits the low decoding efficiency caused by Gorilla's bit-level encoding.
Therefore, existing floating point number streaming data compression suffers from high cost when reading and processing data, reduced storage precision, and low decoding efficiency.
Disclosure of Invention
In view of the above, the embodiments of the application provide a floating point number streaming data compression method to solve the problems of high cost when reading and processing data, reduced storage precision, and low decoding efficiency in prior-art floating point number streaming data compression. The method comprises the following steps:
establishing an index table, wherein the index table has N buckets, each bucket has M slots, and N and M are positive integers;
determining a key value serving as an index based on the binary representation of the current floating point number;
searching for a target bucket among the N buckets by hash lookup according to the key value;
using linear search, XORing the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value for encoding the current floating point number, and recording the position of the base value in the current window;
encoding the current floating point number according to the zero bits of the second value, obtained by XORing the current floating point number with the base value, and the position of the base value in the current window;
and compressing and storing the encoded floating point number according to a preset storage format.
The embodiments of the application also provide a floating point number compression device to solve the problems of high cost when reading and processing data, reduced storage precision, and low decoding efficiency in prior-art floating point number streaming data compression. The device comprises:
an index table establishing module, used for establishing an index table, wherein the index table has N buckets, each bucket has M slots, and N and M are positive integers;
a key value determining module, used for determining a key value serving as an index based on the binary representation of the current floating point number;
a target bucket searching module, used for searching for a target bucket among the N buckets by hash lookup according to the key value;
a base value searching module, used for XORing, by linear search, the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value for encoding the current floating point number, and recording the position of the base value in the current window;
an encoding module, used for encoding the current floating point number according to the zero bits of the second value, obtained by XORing the current floating point number with the base value, and the position of the base value in the current window;
and a storage module, used for compressing and storing the encoded floating point number according to a preset storage format.
The embodiments of the application also provide computer equipment comprising a memory, a processor, and a computer program stored in the memory and executable on the processor. When executing the computer program, the processor implements any of the above floating point number streaming data compression methods, so as to solve the problems of high cost when reading and processing data, reduced storage precision, and low decoding efficiency in prior-art floating point number streaming data compression.
The embodiments of the application also provide a computer readable storage medium storing a computer program for executing any of the above floating point number streaming data compression methods, so as to solve the problems of high cost when reading and processing data, reduced storage precision, and low decoding efficiency in prior-art floating point number streaming data compression.
Compared with the prior art, the beneficial effects achieved by at least one of the technical schemes adopted in the embodiments of this description include at least the following. An index table is established with N buckets, each having M slots, where N and M are positive integers. A key value serving as an index is determined from the binary representation of the current floating point number, and a target bucket is found among the N buckets by hash lookup according to the key value. By linear search, the current floating point number is XORed with the data in each slot of the target bucket in turn to obtain a plurality of first values; the data corresponding to the first value with the most zero bits is taken as the base value for encoding the current floating point number, and the position of the base value in the current window is recorded. The current floating point number is encoded according to the zero bits of the second value, obtained by XORing the current floating point number with the base value, and the position of the base value in the current window. The encoded floating point number is then compressed and stored according to a preset storage format. The application achieves lossless compression of data by means of XOR, improves the compression ratio through efficient searching of data over the history window, and adopts a simplified data representation that helps improve the decompression speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a floating point number streaming data compression method provided by an embodiment of the application;
FIG. 2 is a schematic diagram of an index table according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a coding format according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a configuration of a preset storage format according to an embodiment of the present application;
FIG. 5 is a block diagram of a computer device according to an embodiment of the present application;
FIG. 6 is a block diagram of a floating point number streaming data compression device according to an embodiment of the present application.
Detailed Description
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present application will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present application with reference to specific examples. The described embodiments are only some, not all, of the embodiments of the application. The application may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present application. It should be noted that the following embodiments and the features in the embodiments may be combined with each other where no conflict arises. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
In an embodiment of the present application, a floating point number streaming data compression method is provided. As shown in fig. 1, the method includes:
S1, establishing an index table, wherein the index table has N buckets, each bucket has M slots, and N and M are positive integers;
S2, determining a key value serving as an index based on the binary representation of the current floating point number;
S3, searching for a target bucket among the N buckets by hash lookup according to the key value;
S4, using linear search, XORing the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value for encoding the current floating point number, and recording the position of the base value in the current window;
S5, encoding the current floating point number according to the zero bits of the second value, obtained by XORing the current floating point number with the base value, and the position of the base value in the current window;
and S6, compressing and storing the encoded floating point number according to a preset storage format.
As can be seen from the flow shown in fig. 1, and in particular from step 3 and step 5, the embodiment of the present application inherits the XOR concept of Gorilla to achieve lossless compression of data, uses efficient searching of data over a history window to increase the compression ratio, and uses a simplified data representation to increase the decompression speed.
With reference to fig. 2, the following describes in detail how a key value serving as an index is determined from the binary representation of the current floating point number, that is, how a compact index speeds up the search for historical values over the history window.
As shown in the upper half of fig. 2, the continuously arriving data forms a window, and the sliding window is set to 128 values. The current floating point number could be compared with every previous value to select the optimal one, but that would take 128 comparisons per value, which is very expensive. The application therefore uses a compact in-memory index to reduce the number of comparisons while still finding a good value.
As shown in the lower half of fig. 2, an index table is first defined with N buckets, each having M slots. Here a bucket is implemented by hashing a specified column: the data under a column name is split into a group of buckets by hash value, and each bucket corresponds to a storage file under the column name; a slot is a unit that holds data. In general, M=8 achieves a good effect, and M=8 in this embodiment, as shown in fig. 2. The whole hash map is fully contiguous in memory, so the address of each bucket can be obtained by calculation alone. Based on this basic structure, the following describes in detail how a key value serving as an index is determined from the binary representation of the current floating point number and how the base value is found using the key value, by way of three variant search methods.
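The contiguous layout described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the class name `IndexTable`, the choice of N = 256 buckets, the modulo hash, and the `(position, bits)` slot contents are all hypothetical.

```python
M_SLOTS = 8       # M = 8, as in the embodiment
N_BUCKETS = 256   # assumption: one bucket per possible 8-bit key

class IndexTable:
    """N buckets of M slots kept in one flat list, standing in for contiguous memory."""

    def __init__(self, n_buckets: int = N_BUCKETS, m_slots: int = M_SLOTS):
        self.n = n_buckets
        self.m = m_slots
        # Each slot is None (empty) or a (position, bits) pair (assumed layout).
        self.slots = [None] * (n_buckets * m_slots)

    def bucket_start(self, key: int) -> int:
        # The bucket address is computed from the key, not looked up.
        return (key % self.n) * self.m

    def bucket(self, key: int) -> list:
        # Return a copy of the M slots belonging to the bucket for this key.
        start = self.bucket_start(key)
        return self.slots[start:start + self.m]

table = IndexTable()
print(table.bucket_start(0), table.bucket_start(3))
```

Because every bucket has the same size, locating a bucket costs one multiplication, and only M candidates have to be compared instead of the whole 128-value window.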
In one embodiment, the base value is found by a tail-index search on the binary representation of the current floating point number, which specifically comprises the following steps:
taking a plurality of bits at the binary tail of the current floating point number as a first key value serving as the index;
searching for a target bucket among the N buckets by hash lookup according to the first key value;
and, using linear search, XORing the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value for encoding the current floating point number, and recording the position of the base value in the current window.
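The tail-index search can be sketched as follows. This is a hedged illustration: the function names, the 8-bit key width, and the representation of a bucket as a list of `(position, bits)` pairs are assumptions made for the example, and "most zero bits" is taken literally as the count of zero bits in the XOR result.

```python
import struct

def float_bits(x: float) -> int:
    """IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def tail_key(bits: int, key_bits: int = 8) -> int:
    """First key value: the lowest key_bits bits (key width is an assumption)."""
    return bits & ((1 << key_bits) - 1)

def zero_bit_count(v: int, width: int = 64) -> int:
    """Number of zero bits in a width-bit value."""
    return width - bin(v).count("1")

def find_base(cur_bits: int, bucket: list):
    """Linear search over a bucket of (position, bits) pairs.

    Returns the pair whose XOR with cur_bits has the most zero bits,
    or None when the bucket holds no candidates.
    """
    best, best_zeros = None, -1
    for pos, cand in bucket:
        zeros = zero_bit_count(cur_bits ^ cand)
        if zeros > best_zeros:
            best, best_zeros = (pos, cand), zeros
    return best

bucket = [(0, float_bits(1.5)), (1, float_bits(3.0))]
print(find_base(float_bits(2.0), bucket))
```

In this example 1.5, 2.0 and 3.0 all have tail key 0, and 3.0 is selected as the base value for 2.0 because the two bit patterns differ in a single bit.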
In one embodiment, the base value is found by a double-index search on the binary representation of the current floating point number. The index table is similar to that of the tail-index search, and a head index is established in addition, so that the double-index search comprises both a tail-index search and a head-index search. It specifically comprises the following steps:
taking a plurality of bits at the binary tail of the current floating point number as a first key value serving as the index;
and taking a plurality of bits at the binary head of the current floating point number as a second key value serving as the index: when the floating point number is 64 bits, the second key value is the sign bit plus the 6th through 12th bits, and when the floating point number is 32 bits, the second key value is likewise the sign bit plus the 6th through 12th bits.
In a specific implementation, the double-index search proceeds as follows. First, a plurality of bits at the binary tail of the current floating point number are taken as the first key value serving as the index; a target bucket is searched for among the N buckets by hash lookup according to the first key value; using linear search, the current floating point number is XORed with the data in each slot of the target bucket in turn to obtain a plurality of first values, the data corresponding to the first value with the most zero bits is taken as the base value for encoding the current floating point number, and the position of the base value in the current window is recorded;
if no base value is found based on the first key value, a plurality of bits at the binary head of the current floating point number are taken as the second key value serving as the index: for both 64-bit and 32-bit floating point numbers, the second key value is the sign bit plus the 6th through 12th bits counted from the high-order end. These bits are selected because, across different floating point numbers, fluctuations in these bits have a larger influence on the XOR result. A target bucket is then searched for among the N buckets by hash lookup according to the second key value; using linear search, the current floating point number is XORed with the data in each slot of the target bucket in turn to obtain a plurality of first values, the data corresponding to the first value with the most zero bits is taken as the base value for encoding the current floating point number, and the position of the base value in the current window is recorded;
if no base value is found based on either the first key value or the second key value, the floating point number immediately preceding the current floating point number is used as the base value.
In one embodiment, the base value is found by a hybrid search on the binary representation of the current floating point number. In the double-index scheme, the two index tables incur a larger memory overhead and more lookups, even though the results may be good. The hybrid search variant combines the tail-index search and the double-index search: it is faster while achieving a similar effect. It comprises the following steps:
when the floating point number is 64 bits, taking the binary sign bit, the upper 7th bit, and the lower 6th bit of the current floating point number as a third key value serving as the index; when the floating point number is 32 bits, taking the binary sign bit, the 5th high bit, and the 7th low bit of the current floating point number as the third key value serving as the index;
then searching for a target bucket among the N buckets by hash lookup according to the third key value; and, using linear search, XORing the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value for encoding the current floating point number, and recording the position of the base value in the current window.
In one embodiment, in step S4, if no suitable base value is found using the key value, the floating point number immediately preceding the current floating point number is used as the base value.
Therefore, by providing three different search methods over the historical data, the embodiments of the application use a compact in-memory index to find the base value for encoding a floating point number, which reduces the number of comparisons and improves the search speed and the decompression speed. Selecting a search method suited to the characteristics of the floating point data can further improve the search speed.
In one embodiment, encoding the current floating point number specifically includes the following steps:
the position of the base value in the current window is denoted offset, whose value ranges from 0 to 2^(M-1) - 1; the second value obtained by XORing the current floating point number with the base value is denoted Xorj = Xj XOR Xi, where Xj is the current floating point number and Xi is the base value;
when every bit of Xorj is zero, only the offset is recorded during encoding;
when at least one bit of Xorj is not zero, the tail zero length L1 and the head zero length L2 of Xorj are calculated, both rounded down to whole bytes, and the total zero length of the head and tail of Xorj is L = L1 + L2, where a zero length is a number of consecutive zero bits;
if L is greater than or equal to 1, the encoding records the offset with its M-th bit set to 1, then L1 in 4 bits, then L in 4 bits, and then, in at least one byte, the part of Xorj that remains after the L1 trailing zero bytes are removed;
if L is less than 1, the first M-1 bits of the offset are all set to 1, the M-th bit of the offset is set to 1, and the original value of the current floating point number is then recorded.
Taking M = 8 as an example and referring to the coding format of fig. 3, let Xi be the base value found, a 64-bit floating point number, let Xj be the current floating point number, and let Xorj = Xj XOR Xi. The position is denoted offset, a value from 0 to 127. The cases are discussed below.
If every bit of Xorj is zero (Xorj == 0), that is, the current value is exactly equal to the value at position offset, only the offset is recorded during encoding and the flow exits; see the first row in fig. 3;
otherwise, 128 is added to the offset, that is, its 8th bit is set to 1, indicating that further information follows, as described in the following flow;
when at least one bit of Xorj is not zero, the tail zero length L1 and the head zero length L2 of Xorj are calculated, both rounded down to whole bytes, and the total zero length of the head and tail of Xorj is L = L1 + L2, where a zero length is a number of consecutive zero bits. For example, if Xorj has 10 consecutive zero bits at the head and 7 consecutive zero bits at the tail, then, since 8 bits make one byte, the 10 head bits round down to 1 byte and the 7 tail bits round down to 0 bytes, so L2 = 1, L1 = 0, and L = L1 + L2 = 1;
referring to the second row in fig. 3, if L is greater than or equal to 1, the encoding records the offset with its 8th bit set to 1, then L1 (the tail zero length) in 4 bits, then L (the total zero length) in 4 bits, and then, in at least one byte, the part of Xorj that remains after the L1 trailing zero bytes are removed (the non-zero portion of the second row of fig. 3);
referring to the third row in fig. 3, if L is less than 1, the first 7 bits of the offset are all set to 1 during encoding, forming the special TAG 127 (0x7F in the third row of fig. 3), the 8th bit of the offset is set to 1, and the original value of the current floating point number is then recorded. Since the floating point number in this embodiment is 64 bits, the original 8-byte value is recorded in this case (the non-zero portion of the third row in fig. 3).
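The three cases above can be sketched for 64-bit values with M = 8 as follows. This is an illustrative sketch, not the encoder of the application. Several details are assumptions made for the example and are not specified in the text: L1 is placed in the high nibble and L in the low nibble, the remaining bytes are written in big-endian order, both the L1 trailing and the L2 leading zero bytes are dropped (inferred from the "non-zero portion" shown in fig. 3), and offset 127 is treated as reserved whenever the flag bit is set so that it cannot be confused with the raw-value TAG.

```python
import struct

RAW_TAG = 0x7F  # all seven offset bits set: the original value follows

def float_bits(x: float) -> int:
    """IEEE 754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack(">Q", struct.pack(">d", x))[0]

def encode(cur_bits: int, base_bits: int, offset: int) -> bytes:
    """Encode one 64-bit value against its base value (sketch, M = 8)."""
    assert 0 <= offset <= 0x7F
    xor = cur_bits ^ base_bits
    if xor == 0:
        # Case 1: identical to the base value, record only the offset.
        return bytes([offset])
    raw = xor.to_bytes(8, "big")
    l2 = len(raw) - len(raw.lstrip(b"\x00"))  # head zero length in bytes
    l1 = len(raw) - len(raw.rstrip(b"\x00"))  # tail zero length in bytes
    total = l1 + l2
    if total < 1 or offset == RAW_TAG:
        # Case 3: no whole zero byte to save (or, by this sketch's own
        # convention, a reserved offset): flag + TAG, then the 8 raw bytes.
        return bytes([0x80 | RAW_TAG]) + cur_bits.to_bytes(8, "big")
    # Case 2: flag + offset, then L1 and L in 4 bits each, then the middle bytes.
    middle = raw[l2:8 - l1]
    return bytes([0x80 | offset, (l1 << 4) | total]) + middle

print(encode(float_bits(2.0), float_bits(1.5), 5).hex())
```

For 2.0 encoded against 1.5 at offset 5, the XOR keeps only two non-zero bytes, so the value takes 4 bytes in this sketch instead of 8.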
Thus, based on the encoding method of the above embodiment, for a continuously arriving column of floating point numbers (x1, x2, x3, ..., xn), the preset storage format is as shown in fig. 4. The header includes: a magic number, a version number, the original length of the data, the compressed length of the data, and the parameters used for encoding and compression ("parameters" in the figure). The header is followed by the original value of the first floating point number ("first value" in the figure) and then the encoded, compressed value of each floating point number ("encoded representation of the Xor value" in the figure). The magic number is used to verify that a compressed data block is legitimate, the version number is used to check compatibility with future versions, the original length of the data is used for decompression and verification, and the compressed length of the data is used to restore the data. The parameters used for encoding and compression generally include the compression method, the data width, which base value search method is adopted, the preprocessing parameters, whether the floating point numbers are converted, and so on. The first original value, recorded after the header, is the basis of the compression; each encoded, compressed value recorded after it is obtained by XORing the value with the value at some earlier position and encoding the XOR result.
This encoding method for floating point numbers avoids expensive bit-level encoding operations and improves decompression speed through a simplified data representation. It is a lossless, fast, high-compression-ratio method for compressing streams of floating point numbers, which reduces data volume and enables high-speed data processing.
In one embodiment, after the step of XORing, by linear search, the current floating point number with the data in each slot of the target bucket in turn to obtain a plurality of first values, taking the data corresponding to the first value with the most zero bits as the base value for encoding the current floating point number, and recording the position of the base value in the current window, the method further includes:
if the slot in which the base value is located has not yet been filled with data, or holds data that had already slid out of the window when the base value was searched, filling the current floating point number into that slot;
if the slot in which the base value is located is already filled with data, replacing the data stored earliest with the current floating point number.
The current floating point number is filled into the slot or the oldest data in the slot is replaced to update the data in the slot, so that the subsequent floating point number data is more similar to the data in the slot, and the searching speed and the coding efficiency of the floating point number data can be improved.
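The update rule above (fill while there is room, otherwise replace the data stored first) behaves like a first-in, first-out queue of fixed size. A minimal sketch, assuming a hypothetical Bucket class and a capacity of three values chosen only for the example:

```python
from collections import deque


class Bucket:
    """Hypothetical container for the values kept for one index key."""

    def __init__(self, capacity):
        # A deque with maxlen discards its oldest element when a new
        # element is appended to a full deque.
        self.values = deque(maxlen=capacity)

    def update(self, value):
        """Fill the value in, replacing the oldest entry if already full."""
        self.values.append(value)


bucket = Bucket(capacity=3)
for v in [10, 20, 30, 40]:
    bucket.update(v)
print(list(bucket.values))
```

After the fourth update the first stored value, 10, has been replaced, so the bucket holds 20, 30 and 40.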
In one embodiment, the simplest tail-indexed lookup method is described, with 32-bit floating point numbers employed for ease of description. For 10 floating point numbers of 1.1-2.0, the binary representation is as follows:
float: 1.1, binary:00111111100011001100110011001101
float: 1.2, binary:00111111100110011001100110011010
float: 1.3, binary:00111111101001100110011001100110
float: 1.4, binary:00111111101100110011001100110011
float: 1.5, binary:00111111110000000000000000000000
float: 1.6, binary:00111111110011001100110011001101
float: 1.7, binary:00111111110110011001100110011010
float: 1.8, binary:00111111111001100110011001100110
float: 1.9, binary:00111111111100110011001100110011
float: 2.0, binary:01000000000000000000000000000000
For 1.5, the tail 8 bits are all 0, so its index number is 0; at this point bucket 0 therefore holds only one value.
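The binary representations listed above and the tail index can be reproduced with a short sketch. The helper names float32_bits and tail_index are hypothetical; the 8-bit tail width follows the example.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def tail_index(bits, tail_bits=8):
    """Use the lowest tail_bits bits of the pattern as the bucket index."""
    return bits & ((1 << tail_bits) - 1)


for x in [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0]:
    bits = float32_bits(x)
    print(f"float: {x}, binary:{bits:032b}, index: {tail_index(bits)}")
```

Of the ten values, only 1.5 and 2.0 have an index of 0, which is why 2.0 finds 1.5 in bucket 0 in the example that follows.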
When 2.0 is processed, its lower 8 bits are taken in the same way; the index number is again 0, so bucket 0 is searched and the value 1.5 is found, whose recorded code number is 5. The Xorj value obtained by XORing 2.0 with 1.5 is 01111111110000000000000000000000. Counted in bytes, the head has only 1 zero bit, so the head zero length L2 is 0; the tail has 22 zero bits, so the tail zero length L1 is 2. The final encoded representation follows the second line of fig. 4; according to the code structure illustrated in fig. 4, the following is obtained:
1: the first bit, set to 1;
0000101: the seven-bit offset, indicating the 5th value in the corresponding reference window;
0010: the tail zero length L1, represented by four bits, here 2 bytes of trailing zeros;
0010: the total zero length L, represented by four bits, here 2 bytes of zeros in total;
0111111111000000: the data bits that must be retained after the 2 bytes of trailing zeros are removed from the Xorj value.
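The field values listed above can be recomputed with the following minimal sketch. The helper names are hypothetical, the sketch covers only the case in which Xorj is non-zero and L is at least 1, and it strips only the trailing zero bytes, as in the example; it is not presented as the implementation of the application.

```python
import struct


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def zero_byte_lengths(xor, width_bytes=4):
    """Return (L2, L1): whole zero bytes at the head and at the tail."""
    raw = xor.to_bytes(width_bytes, "big")
    head = len(raw) - len(raw.lstrip(b"\x00"))
    tail = len(raw) - len(raw.rstrip(b"\x00"))
    return head, tail


def encode_xor_case(current, base, offset, width_bytes=4):
    """Return the bit fields for a non-zero Xorj with L >= 1."""
    xor = float32_bits(current) ^ float32_bits(base)
    l2, l1 = zero_byte_lengths(xor, width_bytes)
    total = l1 + l2
    # Keep everything except the L1 trailing zero bytes.
    kept = xor.to_bytes(width_bytes, "big")[: width_bytes - l1]
    return [
        "1",                                      # first bit
        format(offset, "07b"),                    # seven-bit offset
        format(l1, "04b"),                        # tail zero length L1
        format(total, "04b"),                     # total zero length L
        "".join(format(b, "08b") for b in kept),  # retained data bits
    ]


print(" ".join(encode_xor_case(2.0, 1.5, offset=5)))
```

For 2.0 encoded against 1.5 at offset 5, the printed fields are 1, 0000101, 0010, 0010 and 0111111111000000, matching the list above: 32 bits in total, the same size as the original value in this particular case.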
Because the floating point number stream data compression method of the application avoids expensive bit encoding operations, its decompression speed is several times faster than that of the Gorilla method and 5-6 times faster than that of the zstd method. Its encoding speed is about the same as that of zstd, while its compression ratio is several times higher than that of Gorilla and about the same as that of zstd.
In this embodiment, a computer device is provided, as shown in fig. 5, including a memory 501, a processor 502, and a computer program stored in the memory and capable of running on the processor, where the processor implements any floating point number streaming data compression method described above when executing the computer program.
In particular, the computer device may be a computer terminal, a server or similar computing means.
In this embodiment, a computer-readable storage medium storing a computer program for executing any of the floating point number streaming data compression methods described above is provided.
In particular, computer-readable storage media, including both permanent and non-permanent, removable and non-removable media, may be used to implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer-readable storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable storage media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
Based on the same inventive concept, the embodiment of the application also provides a floating point number stream data compression device, as described in the following embodiment. Because the principle of the floating point number stream data compression device for solving the problem is similar to that of the floating point number stream data compression method, the implementation of the floating point number stream data compression device can refer to the implementation of the floating point number stream data compression method, and the repetition is omitted. As used below, the term "unit" or "module" may be a combination of software and/or hardware that implements the intended function. While the means described in the following embodiments are preferably implemented in software, implementation in hardware, or a combination of software and hardware, is also possible and contemplated.
FIG. 6 is a block diagram of a floating point number streaming data compression device according to an embodiment of the present application. As shown in FIG. 6, the device includes: an index table establishment module 601, a key value determination module 602, a target bucket search module 603, a base value search module 604, an encoding module 605, and a storage module 606. The structure is described below.
An index table establishing module 601, configured to establish an index table, where the index table is provided with N buckets, each bucket has M slots, and N and M are positive integers;
a key value determination module 602, configured to determine a key value as an index based on the binary representation of the current floating point number;
a target bucket searching module 603, configured to search for a target bucket among the N buckets by using a hash search method according to the key value;
a base value searching module 604, configured to sequentially perform exclusive-or calculation on the current floating point number and the data in each slot in the target bucket by using a linear search method to obtain a plurality of first values, use the data corresponding to the first value with the largest number of zero bits as the base value for encoding the current floating point number, and record the position of the base value in the current window;
an encoding module 605, configured to encode the current floating point number according to the zero bits of the second value obtained by performing exclusive-or calculation on the current floating point number and the base value, and according to the position of the base value in the current window;
a storage module 606, configured to compress and store the encoded floating point number according to a preset storage format.
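The cooperation of modules 601 to 604 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the application: the IndexTable class and its method names are hypothetical, the key is simply the low-order bits of the bit pattern, the hash search is reduced to a modulus, "number of zero bits" is taken to mean all zero bits of the XOR result, and positions are numbered from 1 so that they line up with the 1.1-2.0 example.

```python
import struct
from collections import deque


def float32_bits(x):
    """Return the IEEE 754 single-precision bit pattern of x as an int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


class IndexTable:
    """Hypothetical index table with N buckets of M slots each."""

    def __init__(self, n_buckets=256, m_slots=4, width=32):
        self.n_buckets = n_buckets
        self.width = width
        self.buckets = [deque(maxlen=m_slots) for _ in range(n_buckets)]
        self.next_pos = 1  # assumed numbering, first value has position 1

    def bucket_for(self, bits):
        # Key value: low-order bits of the pattern (assumed key choice).
        return self.buckets[bits % self.n_buckets]

    def find_base(self, bits):
        """Linear search of the target bucket for the best base value."""
        best = None
        for pos, candidate in self.bucket_for(bits):
            zeros = self.width - bin(bits ^ candidate).count("1")
            if best is None or zeros > best[0]:
                best = (zeros, pos, candidate)
        return None if best is None else (best[1], best[2])

    def insert(self, bits):
        """Store the value; a full bucket drops its oldest entry."""
        self.bucket_for(bits).append((self.next_pos, bits))
        self.next_pos += 1


table = IndexTable()
for x in [1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9]:
    table.insert(float32_bits(x))
pos, base = table.find_base(float32_bits(2.0))
print(pos, format(base, "032b"))
```

With the nine values 1.1 to 1.9 already inserted, the lookup for 2.0 returns position 5 and the bit pattern of 1.5, in line with the worked example in the description.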
In one embodiment, the key value determination module 602 is further configured to: use the tail bits of the binary representation of the current floating point number as a first key value for the index.
In one embodiment, the key value determination module 602 is further configured to: use a plurality of tail bits of the binary representation of the current floating point number as a first key value for the index; and use head bits of the binary representation of the current floating point number as a second key value for the index, wherein when the floating point number is 64 bits, the sign bit plus the upper 6th bit to the upper 12th bit are used as the second key value, and when the floating point number is 32 bits, the sign bit plus the upper 6th bit to the upper 12th bit are used as the second key value.
In one embodiment, the key value determination module 602 is further configured to: when the floating point number is 64 bits, use the sign bit, the upper 7th bit, and the lower 6th bit of the binary representation of the current floating point number as a third key value for the index; when the floating point number is 32 bits, use the sign bit, the upper 5th bit, and the lower 7th bit of the binary representation of the current floating point number as the third key value for the index.
In one embodiment, the encoding module 605 is further configured to: express the position of the base value in the current window as an offset, whose value ranges from 0 to (2^(M-1) - 1); and express the second value obtained by exclusive-or calculation of the current floating point number and the base value as Xorj = xj ⊕ xi, where xj is the current floating point number and xi is the base value;
when every bit of Xorj is zero, only the offset is recorded during encoding;
when at least one bit of Xorj is not zero, the zero length L1 of the tail of Xorj and the zero length L2 of the head of Xorj are calculated, where L1 and L2 are both rounded down to whole bytes, and the total zero length of the head and tail of Xorj is L = L1 + L2, where a zero length is a number of consecutive zero bits;
if L is greater than or equal to 1, the offset is recorded during encoding with the M-th bit of the offset set to 1, L1 is recorded in 4 bits, L is recorded in 4 bits, and the part of Xorj that remains after the tail zeros of length L1 are removed is recorded in at least one byte;
if L is less than 1, the first M-1 bits of the offset are all set to 1 during encoding, the M-th bit of the offset is set to 1, and then the original value of the current floating point number is recorded.
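To illustrate why this representation is lossless, the three cases above can be read back as in the following minimal sketch. Several points are assumptions for the sketch and are not stated in the application: M is taken as 8 (one leading flag bit plus a seven-bit offset, as in the 2.0 versus 1.5 example), a flag bit of 0 is taken to mark the case in which only the offset is recorded, and the window is modeled as a hypothetical dictionary from offset to bit pattern.

```python
def decode_record(bits, window, width_bytes=4):
    """Decode one record given as a string of '0'/'1' characters.

    window maps an offset to the bit pattern of an earlier value
    (a hypothetical stand-in for the reference window).
    """
    flag = bits[0]
    offset = int(bits[1:8], 2)
    if flag == "0":
        # Assumed marker for "Xorj is all zero": value equals the base value.
        return window[offset]
    if offset == 0b1111111:
        # L < 1: the original value follows the all-ones marker.
        return int(bits[8:8 + 8 * width_bytes], 2)
    # L >= 1: read L1 and L, then the retained bytes of Xorj.
    l1 = int(bits[8:12], 2)
    total = int(bits[12:16], 2)  # read for completeness; unused because only the tail is stripped here
    kept_bytes = width_bytes - l1
    kept = int(bits[16:16 + 8 * kept_bytes], 2)
    xor = kept << (8 * l1)  # restore the L1 trailing zero bytes
    return window[offset] ^ xor


window = {5: 0x3FC00000}  # bit pattern of 1.5 at offset 5
record = "1" + "0000101" + "0010" + "0010" + "0111111111000000"
print(format(decode_record(record, window), "032b"))
```

Decoding the record from the example against 1.5 gives 01000000000000000000000000000000, the bit pattern of 2.0, so the original value is recovered exactly.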
In one embodiment, the apparatus further includes a data filling module, configured to fill the current floating point number into the slot if the slot in which the base value is located is a window that is not yet filled with data or a window that slid when the base value was searched;
and, if the slot in which the base value is located is already filled with data, to replace the data that was stored in the slot first with the current floating point number.
In one embodiment, the header of the preset storage format in the storage module 606 includes: magic number, version number, original length of data, compressed length of data and parameters used for encoding and compression; the header is then followed by recording the original value of the first floating point number and the value after compression of each floating point number encoding.
The embodiment of the application realizes the following technical effects: an index table is established, wherein the index table is provided with N buckets, each bucket is provided with M slots, and N and M are positive integers; a key value is determined as an index based on the binary representation of the current floating point number; a target bucket is searched for among the N buckets by using a hash search method according to the key value; exclusive-or calculation is performed sequentially on the current floating point number and the data in each slot in the target bucket by using a linear search method to obtain a plurality of first values, the data corresponding to the first value with the largest number of zero bits is taken as the base value for encoding the current floating point number, and the position of the base value in the current window is recorded; the current floating point number is encoded according to the zero bits of the second value obtained by performing exclusive-or calculation on the current floating point number and the base value, and according to the position of the base value in the current window; and the encoded floating point number is compressed and stored according to a preset storage format. The application realizes lossless compression of data by using an exclusive-or method, improves the compression ratio through efficient search of data on a duration window, and improves decompression speed through a simplified data representation.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than what is shown or described, or they may be separately fabricated into individual integrated circuit modules, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. Thus, embodiments of the application are not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, and various modifications and variations can be made to the embodiments of the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. A floating point number streaming data compression method, comprising:
establishing an index table, wherein the index table is provided with N buckets, each bucket is provided with M slots, and N and M are positive integers;
determining a key value as an index based on the binary representation of the current floating point number;
searching for a target bucket among the N buckets by using a hash search method according to the key value;
sequentially performing exclusive-or calculation on the current floating point number and the data in each slot in the target bucket by using a linear search method to obtain a plurality of first values, taking the data corresponding to the first value with the largest number of zero bits as a base value for encoding the current floating point number, and recording the position of the base value in a current window;
encoding the current floating point number according to the zero bits of a second value obtained by performing exclusive-or calculation on the current floating point number and the base value, and according to the position of the base value in the current window;
and compressing and storing the encoded floating point number according to a preset storage format.
2. The floating point number streaming data compression method of claim 1, wherein the determining a key value as an index based on the binary representation of the current floating point number comprises:
using the tail bits of the binary representation of the current floating point number as a first key value for the index.
3. The floating point number streaming data compression method of claim 2, wherein the determining a key value as an index based on the binary representation of the current floating point number further comprises:
using head bits of the binary representation of the current floating point number as a second key value for the index, wherein when the floating point number is 64 bits, the sign bit plus the upper 6th bit to the upper 12th bit are used as the second key value, and when the floating point number is 32 bits, the sign bit plus the upper 6th bit to the upper 12th bit are used as the second key value.
4. The floating point number streaming data compression method of claim 1, wherein the determining a key value as an index based on the binary representation of the current floating point number comprises:
when the floating point number is 64 bits, using the sign bit, the upper 7th bit, and the lower 6th bit of the binary representation of the current floating point number as a third key value for the index;
when the floating point number is 32 bits, using the sign bit, the upper 5th bit, and the lower 6th bit of the binary representation of the current floating point number as the third key value for the index.
5. The floating point number stream data compression method as set forth in claim 1, wherein the encoding the current floating point number based on the bit number of zeros of the second value obtained by exclusive-or calculation of the current floating point number with the base value and the position of the base value in the current window comprises:
the position of the base value in the current window is expressed as an offset, whose value ranges from 0 to (2^(M-1) - 1), and the second value obtained by exclusive-or calculation of the current floating point number and the base value is expressed as Xorj = xj ⊕ xi, where xj is the current floating point number and xi is the base value;
when every bit of Xorj is zero, only the offset is recorded during encoding;
when at least one bit of Xorj is not zero, the zero length L1 of the tail of Xorj and the zero length L2 of the head of Xorj are calculated, where L1 and L2 are both rounded down to whole bytes, and the total zero length of the head and tail of Xorj is L = L1 + L2, where a zero length is a number of consecutive zero bits;
if L is greater than or equal to 1, the offset is recorded during encoding with the M-th bit of the offset set to 1, L1 is recorded in 4 bits, L is recorded in 4 bits, and the part of Xorj that remains after the tail zeros of length L1 are removed is recorded in at least one byte;
if L is less than 1, the first M-1 bits of the offset are all set to 1 during encoding, the M-th bit of the offset is set to 1, and then the original value of the current floating point number is recorded.
6. The floating point number stream data compression method as claimed in any one of claims 1 to 5, wherein after the step of sequentially xoring the current floating point number with the data in each slot in the target bucket by using a linear search method to obtain a plurality of first values, taking the data corresponding to the first value with the largest number of bits of zero as a base value of the current floating point number code, and recording the position of the base value in the current window, the method further comprises:
if the slot in which the base value is located is a window that is not yet filled with data, or a window that slid when the base value was searched, filling the current floating point number into the slot;
if the slot in which the base value is located is already filled with data, replacing the data that was stored in the slot first with the current floating point number.
7. The floating point number streaming data compression method as in any one of claims 1-5, wherein the header of the preset storage format includes: magic number, version number, original length of data, compressed length of data and parameters used for encoding and compression; the header is then followed by recording the original value of the first floating point number and the value after compression of each floating point number encoding.
8. A floating point number streaming data compression device, comprising:
an index table establishing module, configured to establish an index table, wherein the index table is provided with N buckets, each bucket is provided with M slots, and N and M are positive integers;
a key value determining module, configured to determine a key value as an index based on the binary representation of the current floating point number;
a target bucket searching module, configured to search for a target bucket among the N buckets by using a hash search method according to the key value;
a base value searching module, configured to sequentially perform exclusive-or calculation on the current floating point number and the data in each slot in the target bucket by using a linear search method to obtain a plurality of first values, take the data corresponding to the first value with the largest number of zero bits as the base value for encoding the current floating point number, and record the position of the base value in the current window;
an encoding module, configured to encode the current floating point number according to the zero bits of the second value obtained by performing exclusive-or calculation on the current floating point number and the base value, and according to the position of the base value in the current window;
a storage module, configured to compress and store the encoded floating point number according to a preset storage format.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the floating point number streaming data compression method of any one of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program for executing the floating point number streaming data compression method according to any one of claims 1 to 7.
CN202310869512.0A 2023-07-17 2023-07-17 Floating point number stream data compression method, device, computer equipment and medium Active CN116594572B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310869512.0A CN116594572B (en) 2023-07-17 2023-07-17 Floating point number stream data compression method, device, computer equipment and medium

Publications (2)

Publication Number Publication Date
CN116594572A true CN116594572A (en) 2023-08-15
CN116594572B CN116594572B (en) 2023-09-19

Family

ID=87594097

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310869512.0A Active CN116594572B (en) 2023-07-17 2023-07-17 Floating point number stream data compression method, device, computer equipment and medium

Country Status (1)

Country Link
CN (1) CN116594572B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030018676A1 (en) * 2001-03-14 2003-01-23 Steven Shaw Multi-function floating point arithmetic pipeline
US20090002207A1 (en) * 2004-12-07 2009-01-01 Nippon Telegraph And Telephone Corporation Information Compression/Encoding Device, Its Decoding Device, Method Thereof, Program Thereof, and Recording Medium Containing the Program
US9209833B1 (en) * 2015-06-25 2015-12-08 Emc Corporation Methods and apparatus for rational compression and decompression of numbers
US20160085509A1 (en) * 2014-09-18 2016-03-24 International Business Machines Corporation Optimized structure for hexadecimal and binary multiplier array
CN109871362A (en) * 2019-02-13 2019-06-11 北京航空航天大学 A kind of data compression method towards streaming time series data

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Sungbo Park et al.: "BCD Deduplication: Effective Memory Compression using Partial Cache-Line Deduplication", ASPLOS '21, pages 52 - 64 *
Tuomas Pelkonen et al.: "Gorilla: A Fast, Scalable, In-Memory Time Series Database", Proceedings of the VLDB Endowment, pages 1816 - 1827 *
Lai Yejing (赖叶静) et al.: "Methods and Progress in Deep Neural Network Model Compression", Journal of East China Normal University (Natural Science), pages 77 - 91 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117082154A (en) * 2023-10-16 2023-11-17 长沙瑞腾信息技术有限公司 Big data-based double-path server data storage system
CN117082154B (en) * 2023-10-16 2023-12-15 长沙瑞腾信息技术有限公司 Big data-based double-path server data storage system
CN117440154A (en) * 2023-12-21 2024-01-23 之江实验室 Depth map sequence compression method considering floating point digital splitting
CN117440154B (en) * 2023-12-21 2024-04-19 之江实验室 Depth map sequence compression method considering floating point digital splitting

Also Published As

Publication number Publication date
CN116594572B (en) 2023-09-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant