CN117254820A

CN117254820A - Data compression method, device, equipment and storage medium

Info

Publication number: CN117254820A
Application number: CN202311266832.3A
Authority: CN
Inventors: 岳博; 丁昊杰; 钟林钢; 杨飞; 任宇
Original assignee: Hangzhou Flyslice Technologies Co ltd
Current assignee: Hangzhou Flyslice Technologies Co ltd
Priority date: 2023-09-27
Filing date: 2023-09-27
Publication date: 2023-12-19

Abstract

The application provides a data compression method, a device, equipment and a storage medium, wherein the method comprises the following steps: the method comprises the steps of obtaining and traversing data to be encoded, determining a character sequence of the data to be encoded, constructing a Huffman tree by the character sequence, determining the level of each character type in the Huffman tree, determining the initial encoding of each level of the Huffman tree, generating compression encoding of each character type of the corresponding level according to the initial encoding of each level, and compiling the data to be encoded into compressed data based on the compression encoding corresponding to each character type. When the character types are encoded, the method does not need to traverse from the root node of the Huffman tree for many times, so that the Huffman encoding value of the character can be obtained only according to the initial encoding of the hierarchy where the character types are located and the position of the character in the hierarchy, the encoding efficiency of the character types is improved on the basis that the encoding still has the Huffman code characteristic, and the encoding efficiency of the data to be encoded is further improved.

Description

Data compression method, device, equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a data compression method, apparatus, device, and storage medium.

Background

With the advent of the big data age, the data information has been explosively increased, but in specific application fields such as the internet of things and artificial intelligence, the time delay required for processing the massive data is continuously improved, so how to efficiently transmit the massive data without causing a large load on a processor is a problem to be solved at present.

Compression algorithms for Huffman (Huffman) coding are commonly used in the prior art to compress data. The Huffman coding is to count the number of times of character occurrence, encode the variable word length of data processing, construct Huffman binary tree based on the number of times of character occurrence, encode from leaf node to root node of tree, the more characters appear let shorter code replace, the less times of characters are replaced by longer code at the same time, can realize the overall size reduction of the data, achieve the goal of compressing.

However, when the data to be encoded is subjected to Huffman encoding, the binary tree needs to be traversed from the root node to the leaf node, and Huffman encoding of each character is completed according to the principle that the left subtree of the binary tree is encoded to be 0 and the right subtree is encoded to be 1, so that each character needs to traverse the binary tree once to complete final encoding, and encoding efficiency is low.

Disclosure of Invention

The present invention aims to provide a data compression method, device, equipment and storage medium, which solve the problem of low coding efficiency in the prior art by adopting the existing huffman coding method.

In order to achieve the above purpose, the technical scheme adopted in the application is as follows:

in a first aspect, the present application provides a data compression method, the method comprising:

acquiring and traversing data to be encoded, and determining a character sequence of the data to be encoded;

constructing a Huffman tree from the character sequence, and determining the level of each character type in the Huffman tree;

determining initial codes of all levels of the Huffman tree, and generating compression codes of all character types of corresponding levels according to the initial codes of all levels, wherein the code length of the compression codes is positively correlated with the corresponding levels, and the compression codes are prefix codes;

and compiling the data to be coded into compressed data based on the compression coding corresponding to each character type.

Optionally, the determining the character sequence of the data to be encoded includes:

extracting effective data from the data to be coded;

Counting the effective data to obtain the counting result of each character type in the effective data, wherein the counting result comprises the following steps: the method comprises the steps of a character type and a weight of the character type, wherein the same character in effective data is the same character type, and the weight of the character type is used for representing the proportion of the character type to all characters in the effective data;

and sequencing the character types according to the weight of the character types to obtain the character sequence.

Optionally, the counting the effective data to obtain a counting result of each character type in the effective data includes:

determining the character types to be counted, wherein the character types to be counted are any character types in the effective data;

respectively determining the number of characters which are the same as the character types to be counted in the effective data, and taking the number of the characters as the weight of the character types to be counted;

and generating the statistical result according to the statistical character types and the corresponding weights.

Optionally, the sorting the character types according to the weights of the character types to obtain the character sequence includes:

Sequentially comparing each character type with the other character types except the character type in the effective data to obtain a weight comparison result;

determining a score value of each character type according to the weight comparison result;

and sorting the character types based on the score value of each character type in the effective data to form the character sequence.

Optionally, the constructing the huffman tree from the character sequence and determining a hierarchy of each character category in the huffman tree includes:

a. determining a first storage space and a second storage space, wherein the first storage space stores the character sequence;

b. integrating the storage addresses and the weights corresponding to two adjacent character types in the character sequence with minimum weights into a data set, and storing the data set in the second storage space;

c. determining the type of remaining characters in the character sequence and the remaining in the second memory space

A remainder data set, wherein the remainder character type is a character type excluding a character type of which the storage address is stored in the second storage space in the character sequence, and the remainder data set is a data set excluding a data set of which the storage address is stored in the second storage space;

d. Comparing the weight corresponding to the residual character types with the weight sum in the residual data set, integrating the storage addresses and the weight sums corresponding to the two residual character types and/or the residual data sets with the minimum numerical values into the data set, and storing the data set in the second storage space;

repeating the steps c to d until the number of the residual character types is 0 and the number of the residual data sets is 1, and completing the construction of the Huffman tree in the second storage space;

e. and determining the hierarchy of each character type in the Huffman tree according to the data set in the second storage space.

Optionally, the determining the initial encoding of each level of the huffman tree includes:

determining the number of character types of a first level higher than a target level and an initial coding value of the first level higher than the target level, wherein the target level is any level in the Huffman tree, and the initial coding value of the highest level in the Huffman tree is a preset value;

taking half of the sum of the initial coding values of the first level higher than the target level as the initial coding value of the target level;

And converting the initial coding value of the target level into a corresponding initial coding, wherein the code length of the initial coding is consistent with that of the corresponding level.

Optionally, the generating the compression codes of the character types according to the initial codes of the levels includes:

determining the number of character types of each level in the Huffman tree;

generating compression codes of character types of corresponding levels according to initial codes of all levels, and respectively endowing the compression codes of each level with the character types of the corresponding levels, wherein the number of the compression codes of each level is consistent with the number of the character types of the corresponding level.

Optionally, the generating the compression codes of the character types of the corresponding levels according to the initial codes of the levels, and assigning the compression codes of each level to the character types of the corresponding levels respectively includes:

traversing each character category in each hierarchy to acquire the appearance sequence of each character category in the corresponding hierarchy;

and determining the compression coding of the character types according to the initial coding of each level and the appearance sequence of the character types.

In a second aspect, the present application provides a data compression apparatus, the apparatus comprising:

The counting and sorting module is used for acquiring and traversing the data to be encoded and determining the character sequence of the data to be encoded;

the tree construction coding module is used for constructing the Huffman tree from the character sequence and determining the level of each character type in the Huffman tree; determining initial codes of all levels of the Huffman tree, and generating compression codes of all character types according to the initial codes of all levels, wherein the code length of the compression codes is positively correlated with the corresponding level, and the compression codes are prefix codes;

and the compiling module is used for compiling the data to be coded into compressed data based on the compression codes corresponding to the character types.

Optionally, the count ordering module is further configured to:

extracting effective data from the data to be coded;

Optionally, the count ordering module is further configured to:

Optionally, the tree building coding module is further configured to:

c. determining the types of the residual characters in the character sequence and the residual data sets in the second storage space, wherein the types of the residual characters are the types of the characters in the character sequence excluding the types of the characters with the storage addresses stored in the second storage space, and the residual data sets are the data sets in the second storage space excluding the data sets with the storage addresses stored in the second storage space;

Optionally, the tree building coding module is further configured to:

determining the number of character types of each level in the Huffman tree;

Optionally, the tree building coding module is further configured to:

In a third aspect, the present application provides an electronic device, including: a processor, a storage medium storing machine-readable instructions executable by the processor, the processor in communication with the storage medium via a bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the data compression method as described above, and a bus.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of a data compression method as described above.

The beneficial effects of this application are: when the character types are encoded, the method does not need to traverse from the root node of the Huffman tree for many times, so that the Huffman encoding of the character types can be obtained only according to the initial encoding of the hierarchy where the character types are located and the positions of the character types in the hierarchy, the encoding efficiency of the character types is improved on the basis that the encoding still has the Huffman code characteristic, and the encoding efficiency of data to be encoded is further improved.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 shows a schematic diagram of a framework structure implemented in an FPGA according to an embodiment of the present application;

FIG. 2 shows a flow chart of a data compression method provided by an embodiment of the present application;

FIG. 3 illustrates a flow chart for ordering characters provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of storing valid data according to an embodiment of the present application;

FIG. 5 shows a flowchart for obtaining statistics according to an embodiment of the present application;

FIG. 6 is a flowchart illustrating statistics of character types according to an embodiment of the present application;

FIG. 7 shows a flowchart of a sort resulting character sequence provided by an embodiment of the present application;

FIG. 8 is a schematic diagram illustrating a sort of characters based on a comparison module according to an embodiment of the present application;

FIG. 9 shows a flowchart for constructing a Huffman tree provided by embodiments of the present application;

fig. 10 is a schematic diagram illustrating a huffman tree establishment process according to an embodiment of the present application;

FIG. 11 illustrates a flow chart for determining an initial encoding provided by embodiments of the present application;

FIG. 12 illustrates a flow chart for determining compression encoding provided by embodiments of the present application;

FIG. 13 illustrates a flow chart for determining compression encoding according to yet another embodiment of the present application;

FIG. 14 illustrates a schematic diagram of an exemplary Huffman tree provided by embodiments of the present application;

fig. 15 shows a schematic structural diagram of a data compression device according to an embodiment of the present application;

fig. 16 shows a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.

In addition, the described embodiments are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.

It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but not to exclude the addition of other features.

When the Huffman tree is adopted to encode data in the prior art, the binary tree needs to be traversed from a root node to a leaf node, the Huffman encoding of each character is completed according to the principle that the encoding of the left subtree of the binary tree is 0 and the encoding of the right subtree is 1, so that each character needs to traverse the binary tree once to complete final encoding, and the encoding efficiency is low.

Based on the above problems, the present application proposes a new data compression method, after constructing a huffman tree for a character code of data, by assigning a corresponding initial code to the layer number where each character code is located, and coding according to the initial coding value of the layer number where the character code is located and the order in which the corresponding character code appears in the layer number, so that the final coding bit number and the layer number are positively correlated and conform to the prefix code rule, thereby improving the coding efficiency.

FIG. 1 is a schematic diagram of a frame structure implemented in an FPGA, where a Stream-in module pre-processes original text data and then divides the original text data into multiple sections and inputs the sections to a build-ctable module and a symbol-cnt module in parallel; the symbol-cnt module is used for counting the occurrence frequency of each character type in the data to be coded; the sort module is used for carrying out descending order or ascending order sorting based on the occurrence frequency of each character type; bld-huff-tree module is used for establishing Huffman tree and recording the layer number of each character type; the build-capable module is used for establishing Huffman codes corresponding to various character types and coding the data to be coded according to the corresponding Huffman codes.

In the process of generating Huffman codes of various character types of data to be encoded, the electronic device can implement the steps of the data compression method at the hardware level through a plurality of registers, and for convenience of explanation, registers are respectively denoted by ram0-ram9, and are described in connection with specific embodiments.

Next, first, a data compression method of the present application will be described with reference to fig. 2, where an execution subject of the method may be an electronic device, and referring to fig. 2, the method includes:

s201, acquiring and traversing data to be encoded, and determining a character sequence of the data to be encoded.

The data to be encoded may be text data to be subjected to data compression, the character sequence may be a sequence composed of character types obtained after occurrence frequency statistics and sequencing of the text data, and the same character may be represented as one character type.

Alternatively, the character types in the character sequence may be arranged according to the number of occurrences of the character types in the text data, where the arrangement order may be a descending order or an ascending order, which is not limited herein.

S202, constructing a Huffman tree from the character sequence, and determining the level of each character type in the Huffman tree.

Optionally, a huffman tree corresponding to the data to be encoded may be created according to the character sequence, the huffman tree includes a plurality of nodes, each character type in the data to be encoded may be used as a leaf node in the huffman tree, and the level of the character type in the huffman tree may be the level of the leaf node corresponding to the character type in the huffman tree.

S203, determining initial codes of all levels of the Huffman tree, and generating compression codes of all character types of corresponding levels according to the initial codes of all levels, wherein the code length of the compression codes is positively correlated with the corresponding levels, and the compression codes are prefix codes.

Optionally, each level in the huffman tree corresponds to an initial code, and the initial codes of the levels are different.

Alternatively, the compression code may be a huffman code value of a character type generated based on initial encoding of the character type, and the code length of the compression code of the character type is positively correlated with the hierarchy in which the character type is located. The compression coding may be a prefix code, i.e. a code having prefix code properties.

The prefix code characteristic is that the codes of any character type cannot be prefixes of codes of other character types, namely, the compression codes of each character type cannot be prefixes of compression codes of other character types in the application.

After the Huffman tree is constructed, the number of leaf nodes and the number of levels of each level of the Huffman tree can be obtained, initial codes of the levels are determined, and then compression codes of the character types are generated based on the levels where the character types are located.

S204, compiling the data to be coded into compressed data based on the compression codes corresponding to the character types.

After the compression codes are respectively determined for each character type, the characters of the data to be coded can be sequentially read according to the character sequence, the compression codes of the character types corresponding to the characters are read, the compression codes of the characters are spliced according to the character sequence, and therefore the data to be coded are compiled into the compression data.

It should be noted that, when the data to be encoded input by the user is a large-segment text, the present application may segment the text first to obtain a multi-segment text, then execute the step S204 described above on each segment of text to obtain compressed data of each segment of text, and then splice the compressed data of each segment together to obtain the encoding result of the whole segment of text.

Illustratively, the original text may be divided into four segments according to the size of the original text to be encoded. For example, the original text size is src_size, the first three segments of size seg_size is seg_size= (src_size+3)/4, the fourth segment of size is src_size-3×seg_size, and each segment of text is stored in four independent original rams respectively.

According to the size of each segment of data, the address of each original ram is read to obtain the data to be encoded, and because the bit width of the original text stored in the original ram is 128 bits, namely 16 bytes, the initial address of each segment can be determined after the quotient and the remainder are taken according to 16, and the compressed data corresponding to each segment of data to be encoded is stored in one target ram independently.

Since the compression coding of the previous character in the data to be coded is to be placed in front of the compression coding of the next character, and the bit width of the front compression coding and the rear compression coding is also variable, the original text can be coded according to byte sequence. And storing the compression codes by using a register, putting the compression codes after encoding into the highest bit of the register after encoding each character, splicing the current compression codes with the data in the register by shifting the hierarchy of the current character type right, shifting the lower 64 bits of the register out and storing the lower 64 bits of the register into the target ram after the accumulated compression data reach 64 bits, and reading the data in the four target rams in sequence after encoding four sections of original texts to obtain the compression data of the original texts.

In the embodiment of the application, data to be encoded is acquired and traversed, a character sequence of the data to be encoded is determined, a Huffman tree is constructed by the character sequence, the levels of each character type in the Huffman tree are determined, initial codes of each level of the Huffman tree are determined, compression codes of each character type of corresponding levels are generated according to the initial codes of each level, wherein the code length of the compression codes is positively correlated with the corresponding level, the compression codes are prefix codes, and the data to be encoded is compiled into compressed data based on the compression codes corresponding to each character type. When the character types are encoded, the method does not need to traverse from the root node of the Huffman tree for many times, so that the Huffman encoding of the character types can be obtained only according to the initial encoding of the hierarchy where the character types are located and the positions of the character types in the hierarchy, the encoding efficiency of the character types is improved on the basis that the compression encoding still has the Huffman code characteristic, and the encoding efficiency of the data to be encoded is further improved.

Further, in the step S201, the determining the character sequence of the data to be encoded, as shown in fig. 3, further includes:

s301, extracting effective data from data to be encoded.

Alternatively, the data to be encoded may include a 1-bit frame start bit, a 1-bit frame end bit, 128-bit data to be encoded and 16-bit data valid bits to be encoded, such as four inputs corresponding to the symbol-cnt module and the Stream-in module in fig. 1, where each bit valid bit controls whether 8 bits of data to be encoded is valid.

The electronic device can traverse the data to be encoded, extract the effective data according to the effective bit of 16 bits after the end bit of the 1bit frame is finished, store the effective data into ram0, count the number of effective bytes, and extract and store the effective data through shifting and splicing the effective data.

Illustratively, the process of extracting and storing valid data may be:

assuming that the bit width of the input data buffer is 256 bits, putting the effective data of the first frame into low bits; the second frame data carries out the byte number corresponding to left shift according to the effective data number of the first frame data, and so on, each next frame data carries out the left shift of the effective byte number of the previous frame data, the left shifted data and the cached data are subjected to OR, the cached 256-bit data is the latest 256-bit data, and the least significant bit is reserved for the earliest data; to avoid data overwriting, when the number of cache bytes is 16 bytes or more, the data is shifted to the right by 128 bits (16 bytes wide), while the low 128 bits data is stored in ram0.

The data format of the implementation principle of storing valid data is shown in fig. 4, where data_in is input 128-bit valid data, vld_cnt is the corresponding number of valid bytes, data_temp is cached 256-bit input data, byte_cnt is the current number of remaining bytes in the input data cache, it accumulates the vld_cnt, and when the number of accumulated bytes in the byte_cnt is greater than 16 bytes, the low 128-bit data of the data_temp is written into ram0, and meanwhile the data_temp is shifted to the right by 128 bits. If the first data t0=data0||data1 < < (b_ 0*8), "||" is bit-wise or, < < is left shift. And so on, the mth data is data_m shifted left by byte_cnt bytes and then bit-wise ored with data_temp.

S302, counting the effective data to obtain the counting result of each character type in the effective data, wherein the counting result comprises the following steps: the character type and the weight of the character type, wherein the same character in the effective data is the same character type, and the weight of the character type is used for representing the proportion of all characters in the character type occupied effective data.

After the effective data is extracted, the method can count the occurrence times of each character type in the effective data to obtain the weight of each character type.

As one possible implementation, the weight of the character type may characterize the specific gravity of the character type in the valid data.

As another possible implementation, the weight of the character type may also be the number of occurrences of the character type in the valid data.

S303, sorting the character types according to the weight of the character types to obtain a character sequence.

Alternatively, the character types may be arranged in a descending order or an ascending order when they are ordered, which is not limited herein.

By way of example, a serial full comparison ordering algorithm may be employed. Assuming that the valid data includes 256 characters, comparing each character type with 256 character types in the valid data in turn, instantiating 256 identical modules at the same time, calculating comparison results of different character types and 256 character types by each module, and after obtaining the comparison result of each character type, sorting the character types according to the comparison results to obtain a character sequence.

Next, a process of counting the valid data in the step S302 to obtain the statistics result of each character type in the valid data will be described, as shown in fig. 5, and the step includes:

S501, determining the character type to be counted, wherein the character type to be counted is any character type in the effective data.

For example, the character types may be sequentially read out from ram0, and the read character types may be used as character types to be counted.

It should be noted that, in order to further improve the efficiency of counting the weights of the character types, a plurality of character types may be read out at a time, and the character types are used as character types to be counted, and the occurrence times are counted for the character types to be counted respectively.

S502, respectively determining the number of characters which are the same as the character types to be counted in the effective data, and taking the number of characters as the weight of the character types to be counted.

When the method and the device are used for counting the characters in the effective data, the same number of counting modules can be instantiated according to the number of the types of the characters.

Referring to fig. 6, assuming that there are 256 character types in the effective data, 256 statistics modules may be instantiated to perform parallel statistics on the character types, respectively, and store the statistics result in ram1, where the data format in ram1 is { character types, weights }, and when the input is completed, statistics of the occurrence times of all character types may be completed.

By way of example, the specific procedure may be: and the input 128-bit data to be coded is stored into 16 registers of the statistics module according to the condition that each 8 bits are one character. The input 16bit valid bits respectively represent whether the input characters in the 16 registers are valid; and then comparing the 16 characters in the register with a preset input value of the module in parallel, if the comparison value is the same and the character valid bit of the register is 1, adding 1 to the corresponding count value, and after counting all data, storing output results in one ram1 respectively.

S503, generating a statistical result according to the type of the statistical character and the corresponding weight.

Alternatively, the ratio of the number of character types to the total number of characters in the valid data may be used as the statistics of the character types.

As another possible implementation manner, the number of occurrences of the character type in the valid data may be directly used as the weight of the character type, and all the character types and their corresponding weights may be used as the statistics result.

After the statistics of the respective character types are obtained, the respective character types may be ranked according to their weights to obtain the character sequence, as shown in fig. 7, and the step S303 includes:

s701, comparing weights of each character type with other character types except the character type in the effective data in sequence to obtain a weight comparison result.

Optionally, in the present application, a comparison module may be respectively instantiated for each character type, and the weight comparison is performed on the character type and the other character types except for the character type based on the comparison module in parallel, so as to obtain a weight comparison result of each character type.

S702, determining the score value of each character type according to the weight comparison result.

Alternatively, for each character type, the weight value of each character type may be compared with the remaining character types in turn, and the score value of the character to be sorted may be calculated according to the comparison result of the character type and the remaining character types.

By taking the character type as m and the weight value as m_weight as an example, the character and other character types are analogous: sequentially reading the weights corresponding to the character types, and comparing the weight values with m_weight respectively, wherein the comparison result is divided into three cases:

the m_weight value is smaller than the weight value of other read character types, and the score is 0; the m_weight value is equal to the weight value of other read character types, if the character value of the character type m is greater than the weight value and the character value of the character type equal to the weight value, the score of the character type m is 0, otherwise, the score of the character type m is 1; the m_weight value is greater than the weight value of other read character types, and the score is 1; after the comparison of the character type m with other respective character types is completed, the comparison result of the character type m with the respective character types may be added to obtain a score value of the character type m.

After determining the score value of the character class m, the score value of the character class m and the weight value of the character class may be stored in logical ram2, where ram2 is stored in the format { score value, weight value }. The data length in ram2 may be determined by a developer based on actual requirements, and the application is not limited herein.

S703, sorting the character types based on the score values of the character types in the effective data to form a character sequence.

Fig. 8 is a schematic diagram of sorting character types based on a comparison module, where after data is input and comparison results of all character types are determined by a plurality of parallel comparison modules, according to comparison score conditions, the comparison results can be written into ram3 according to the order of the score values in ram2 from small to large, so as to complete descending order, finally, each character type and its weight value are output in descending order, and the value is stored in ram3, where the ram3 data format is { character type, weight value }.

After obtaining the character sequence, the huffman tree may be constructed through the step S202 described above, as shown in fig. 9, which includes:

s901, determining a first storage space and a second storage space, wherein the first storage space stores a character sequence.

The first storage space is used for storing each character type and the weight of the character type in the character sequence, and the first storage space may be ram3, for example.

The second storage space is used for storing data information of the child nodes and the father nodes of the Huffman tree in the process of constructing the Huffman tree, and the data information comprises the corresponding relation between the child nodes and the father nodes, storage addresses, weight information and the like.

S902, integrating the storage addresses, the weights and the weights corresponding to two adjacent character types in the character sequence into a data set, and storing the data set in a second storage space.

The data set comprises a storage address of the character type and a weight sum of the two character types.

For example, the two character types with the highest or lowest (determined according to descending or ascending order) ordered addresses in ram3 after the ordered reading can be the two nodes with the smallest weight value.

Assuming that the second memory space is denoted by ram4, the sum of the memory addresses and weights of the two nodes is stored in the new ram 4. The data format in ram4 is { addr0, addr1, two address weight value sums }, addr0, addr1 are memory addresses in ram3, the corresponding data in the addresses are { character types, weight values }, weight0 and weight1 are the weight values corresponding to the two memory addresses, and the two address weight value sums are weight0+weight1.

S903, determining the types of the residual characters in the character sequence and the residual data set in the second storage space, wherein the types of the residual characters are the types of the characters in the character sequence excluding the types of the characters with the storage addresses stored in the second storage space, and the residual data set is the data set in the second storage space excluding the data set with the storage addresses stored in the second storage space.

S904, comparing the weight corresponding to the residual character types with the weight sum in the residual data set, integrating the storage addresses and the weight sums corresponding to the two residual character types and/or the residual data sets with the smallest numerical values into the data set, and storing the data set in the second storage space.

The method can sequentially read the weight values in the first storage space ram3 after sequencing, respectively compare the weight values with the weight sum values in the second storage space ram4, take the smallest two values each time, exclude the weight sum values and character types which become child nodes in the process, form new weight sum values, store the new weight sum values into ram4, and add 1 to the address of each time a data set is written, so that the cycle iterates.

It should be noted that, in the process of establishing the huffman tree, if the weight value in the character sequence is equal to the weight sum value in ram4, the priority of the weight sum value is higher than the weight value.

The minimum two values may be the weights or the weights and the minimum two values in the residual character types in the first storage space and the residual data set in the second storage space, that is, the two values may be the weight values of the residual character types in the first storage space; or the two values are the weight sum value of the rest data set in the second storage space; or one value is the weight value of the remaining character types in the first storage space, and the other value is the weight sum value of the remaining data sets in the second storage space.

S905, repeating the steps S903 to S904 until the number of the residual character types is 0 and the number of the residual data sets is 1, and completing the construction of the Huffman tree in the second storage space.

Exemplary, as shown in fig. 10, a schematic diagram of a huffman tree establishment process is provided in the present application, weight values and weight sum values in a first storage space and a second storage space are compared by an instantiated weight comparison module, and data is added to the huffman tree according to a comparison result until the huffman tree is constructed, where the weight comparison module sequentially reads character types from a character sequence by a reading control module.

Assuming that the weight values of the respective characters in the character sequence are 1, 3, 4, 6, 7, 9, 10 from low to high, respectively, two character types with weight values of 1 and 3 are added to the huffman tree, that is, the storage address and the weight sum value 4 of the two character types in ram3 are stored as a data set to ram4, at this time, the weight values of the remaining character types in ram3 are sequentially 4, 6, 7, 9, 10, the weight sum value in the huffman tree stored in ram4 (i.e., the weight sum value of the remaining data set) is 4, then the minimum two values 4 and 4 are continuously read from ram3 and ram4, the second weight sum 8 is calculated, and storing the storage address of the character type with the weight value of 4 in ram3, the storage address of the data set with the weight and the value of 4 in ram4 and the second weight sum 8 as data sets to ram4, wherein the weight values of the rest character types in ram3 are 6, 7, 9 and 10 in sequence, the weight and the value of the rest data sets in ram4 are 8, then reading the minimum two values 6 and 7 from ram3, constructing corresponding data sets according to the weight values 6 and 7, adding the data sets to ram4, and so on until the number of the rest character types is 0 and the number of the rest data sets is 1, and completing the construction of the Huffman tree in the second storage space.

S906, determining the hierarchy of each character type in the Huffman tree according to the data set in the second storage space.

It will be appreciated that the data format in ram4 is { addrm, addrn, two address weight values and }, where, since addrm and addrn may be memory addresses of ram3 or ram4, and the addresses of ram3 are 0-255, the memory addresses of the data sets stored in ram4 may be from 256, that is, if addrm or addrn is less than 255, the memory addresses in ram3 refer to the corresponding character types, and when addrm or addrn is greater than or equal to 256, the memory addresses in ram4 refer to the corresponding data sets.

It should be noted that, each data set stores a storage address of a group of child nodes, and the storage address of each data set in ram4 is a parent node, and exemplary, the reading of a part of huffman tree constructed in the second storage space is as follows:

addrm＝311，addrn＝0，nodeNB＝316；

addrm＝312,addrn＝313,nodeNB＝317；

addrm＝314,addrn＝315,nodeNB＝318；

addrm＝316,addrn＝317,nodeNB＝319；

addrm＝318,addrn＝319,nodeNB＝320；

where, "addrm=318, addrn=319, nodenb=320" means that, among 320 storage addresses of ram4, 318 storage addresses and 319 storage addresses of ram4 are stored, which means that the 320 storage addresses are root nodes of the 318 storage addresses and the 319 storage addresses, and thus, the hierarchy of each character type in the huffman tree can be determined based on the data set stored in the second storage space.

Assuming that the above example has completed Huffman tree construction, then ram4 has the highest address (320 in this example) of layer 0, which includes the corresponding address in the data of layer 1, i.e., addresses 318, 319 are the first layer, and addresses 314, 315, 316, 317 are layer 2. Because address 316 contains address 0, address 0 is layer 3, and address 0 is less than or equal to 255, indicating that it is a ram3 address, i.e., the character type with the greatest weight after ram3 ordering is layer 3 in the constructed huffman tree. And sequentially calculating until all the character types are counted. Each character type and its corresponding layer number value (i.e., hierarchy) are stored in ram5, and the data format is { character type, layer number value }.

According to the embodiment of the invention, the Huffman tree is constructed according to the character sequence of the data to be encoded, wherein each node of the Huffman tree is searched in a memory address mapping mode, and the association between each father node and the corresponding child node can be realized, so that the child node corresponding to each father node in the Huffman tree can be determined according to the association relation, namely, the corresponding child node is found through the memory address of the father node stored in the second memory space, the layer number calculation is further carried out, and the number of character types included in each layer can be determined.

It should be noted that, while determining the levels of each leaf node in the huffman tree, the number of leaf nodes in each level may be determined, and the character types and the layer number values stored in ram5 may be sequentially read, so as to define ram6, which is used to store the number of character types, that is, the number of leaf nodes, of each layer. The layer number value read from ram5 is used as the address index of ram6, the character types of the same layer number value are accumulated and then stored in the corresponding address, and finally the number of the character types of each layer is obtained.

After constructing the huffman tree and determining the number of character types for each level, the present application may first determine an initial code for each level in the huffman tree, as shown in fig. 11, which includes:

s1101, determining the number of character types of a first level higher than a target level and an initial coding value of the first level higher than the target level, wherein the target level is any level in the Huffman tree, and the initial coding value of the highest level in the Huffman tree is a preset value.

S1102, a half of the sum of the number of character types of the higher one level of the target level and the initial code value of the higher one level of the target level is set as the initial code value of the target level.

S1103, converting the initial coding value of the target level into a corresponding initial coding, wherein the code length of the initial coding is consistent with that of the corresponding level.

As a possible implementation manner, the initial coding value of the highest level may be 0, and the initial coding value of each level in the huffman tree is determined one by one from the highest level in this application.

For example, in the present application, the initial coding value of the highest layer may be set to 0, the next highest layer initial coding value is the sum of the initial coding value of the highest layer and the number of character types of the corresponding layer in ram5 divided by 2, and so on, to obtain the initial coding value of each layer, and the initial coding value of each layer is stored in ram7, that is, the initial coding value ram7[ m ] = (ram 6[ m+1] +ram7[ m+1 ])/2, where m represents the layer. The address index of ram7 is a layer value, and the data content is an initial coding value corresponding to the layer value.

Further, the initial code value can be used to generate the initial code of the corresponding code length corresponding to the level, for example, if the initial code value is 0 and the corresponding level is 5, the initial code is 00000, i.e. after the initial code value is converted into binary, 0 is added to the high order, so that the code length is adapted to the level, and the corresponding initial code is generated.

After determining the initial codes of the respective levels, the present application may determine the compression codes of the respective character types according to the initial codes of the respective levels and the levels of the respective character types in the huffman tree, as shown in fig. 12, and the step S203 includes:

S1201, determining the number of character types of each level in the Huffman tree.

S1202, generating compression codes of character types of corresponding levels according to initial codes of the levels, and respectively endowing the compression codes of each level with the character types of the corresponding levels, wherein the number of the compression codes of each level is consistent with the number of the character types of the corresponding level.

Alternatively, each character type may correspond to a respective compression code, and the compression codes of the respective character types in the same hierarchy may be different. Therefore, the corresponding compression codes can be generated according to the number of character types of each level, namely, on the basis of the initial codes, the initial codes are accumulated for 1, the accumulation times are the number of character types in the corresponding level minus 1, each time the initial codes are accumulated to generate one compression code, the initial codes are also the compression codes, therefore, the compression codes which are generated by the initial codes and the accumulation are included, the compression codes which are the same as the number of the character types of the corresponding level are generated, and the compression codes are assigned to the corresponding character types. For example, if the initial encoding is 00000 and the number of character types of the hierarchy is 4, the initial encoding is accumulated 1 3 times to generate 00001, 00010, 00011, and thus 00000, 00001, 00010, and 00011 are the compression encoding of the hierarchy.

Further, as shown in fig. 13, the step S1202 includes:

s1301, traversing each character category in each hierarchy, and acquiring the appearance sequence of each character category in the corresponding hierarchy.

S1302, determining compression coding of character types according to initial coding of each level and appearance sequence of the character types.

Alternatively, the order in which the character types appear at the target level may characterize the position of the nodes of the character types at the target level of the Huffman tree.

In the present application, a ram8 may be defined for storing each character type and compression codes of the character type, where an address index is a character value of the character type, and a ram9 is defined for counting the number of occurrences of each level, an address index of the ram9 is a layer value, and data is the number of occurrences of the character type of the level.

As a possible implementation manner, the compression code of the character type appearing first in each level is the same as the initial code of the level, and the compression codes of the rest character types can be determined based on the initial code of the level where the character type is located and the position of the character type in the level, so that according to the number of occurrences of the character type in each level in ram9 and the initial code of the level in ram7, the character type of each level read in ram5 can be respectively encoded, so as to obtain the compression code of the character type, and the character type and the compression code thereof are stored in ram 8.

Specifically, after determining the initial encoding of each level, the data in ram5 may be sequentially read, the character type and the corresponding layer number thereof may be obtained, the character value of the character type may be used as the index value of ram8, the layer number may be used as the address index values of ram7 and ram9, and the initial encoding value of the layer number and the occurrence number of the layer number may be obtained. And adding the initial code value and the occurrence times, subtracting 1 to obtain a final Huffman code value of the character type, converting the Huffman code value into corresponding compression codes, and storing the character type and the compression codes of the character types into ram 8.

As a possible implementation manner, the compression code of the character type appearing for the first time may be set as the initial code of the level where the character type is located, and for each subsequent character type, 1 is added to the initial code in turn to obtain the compression code of each character type.

Illustratively, assume that the number of layers for which the character value 10 is obtained from ram6 is layer 9. First, 9 is used as the address index of ram7 to find the initial code value of 9 th layer, assuming that the obtained value is 56, i.e. the initial code is 000111000, and 9 is used as the address index of ram9 to obtain that this is the 4 th occurrence of 9 th layer, the Huffman code value of character value 10 is 59 (111011), and this code value is shifted to the left by 15 bits and then is compared with the 24-bit layer number or 1d8009. And finally, taking a character value 10 as an address index, and storing 1d8009 into ram8[10], wherein the first 9 bits of the 1d8009 are compression codes of the character type, namely, the coding of one character type is completed.

It should be understood that the character types of the same level may be encoded in other encoding manners, and the foregoing examples of the present application only provide one possible implementation manner, and the encoding may be performed starting from the last character type of the level, and sequentially adding 1 to the previous character type on the basis of the initial encoding, which is not limited herein.

Referring to fig. 14, an exemplary embodiment of the present invention is shown for performing huffman coding on character types in the huffman tree, wherein the highest layer is layer 4, and the initial coding value is 0, so that the compression coding of a is 0000, the compression coding of b is 0001, the initial coding value of the third layer is calculated to be (0+2)/2=1, the compression coding of C is 001, the compression coding of d is 010, the compression coding of e is 011, the initial coding value of the second layer is calculated to be (3+1)/2=2, the compression coding of F is 10, and the compression coding of g is 11.

Based on the same inventive concept, the embodiment of the present application further provides a data compression device corresponding to the data compression method, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the data compression method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is omitted.

Referring to fig. 15, a schematic diagram of a data compression device according to an embodiment of the present application is shown.

The counting and sorting module 1501 is used for acquiring and traversing the data to be encoded and determining the character sequence of the data to be encoded;

the tree construction coding module 1502 is configured to construct a huffman tree from the character sequence, and determine a hierarchy of each character type in the huffman tree; determining initial codes of all levels of the Huffman tree, and generating compression codes of all character types according to the initial codes of all levels, wherein the code length of the compression codes is positively correlated with the corresponding level, and the compression codes are prefix codes;

the compiling module 1503 is configured to compile the data to be encoded into compressed data based on the compression codes corresponding to the respective character types.

Optionally, the count ordering module 1501 is further configured to:

extracting effective data from data to be coded;

counting the effective data to obtain the counting result of each character type in the effective data, wherein the counting result comprises the following steps: the character type and the weight of the character type, wherein the same character in the effective data is the same character type, and the weight of the character type is used for representing the proportion of all characters in the character type occupation effective data;

And sequencing the character types according to the weights of the character types to obtain a character sequence.

Optionally, the count ordering module 1501 is further configured to:

determining the character types to be counted, wherein the character types to be counted are any character type in the effective data;

and generating a statistical result according to the statistical character types and the corresponding weights.

Optionally, the count ordering module 1501 is further configured to:

sequentially comparing each character type with other character types except the character type in the effective data to obtain a weight comparison result;

determining the scoring value of each character type according to the weight comparison result;

the character categories are ordered based on the score values for the character categories in the valid data to form a character sequence.

Optionally, the tree building coding module 1502 is further configured to:

a. determining a first storage space and a second storage space, wherein the first storage space stores a character sequence;

b. integrating the storage addresses and the weights corresponding to two adjacent character types in the character sequence with minimum weights into a data set, and storing the data set in a second storage space;

c. Determining the types of the residual characters in the character sequence and the residual data sets in the second storage space, wherein the types of the residual characters are the types of the characters in the character sequence except the types of the characters with the storage addresses stored in the second storage space, and the residual data sets are the data sets in the second storage space except the data sets with the storage addresses stored in the second storage space;

Optionally, the tree building coding module 1502 is further configured to:

determining the number of character types of a first level higher than a target level and an initial coding value of the first level higher than the target level, wherein the target level is any level in a Huffman tree, and the initial coding value of the highest level in the Huffman tree is a preset value;

Taking half of the sum of the number of character types of the higher one level of the target level and the initial coding value of the higher one level of the target level as the initial coding value of the target level;

Optionally, the tree building coding module 1502 is further configured to:

determining the number of character types of each level in the Huffman tree;

generating compression codes of character types of corresponding levels according to the initial codes of the levels, and respectively endowing the compression codes of each level with the character types of the corresponding levels, wherein the number of the compression codes of each level is consistent with the number of the character types of the corresponding level.

Optionally, the tree building coding module 1502 is further configured to:

the compression coding of the character types is determined based on the initial coding of each hierarchy and the order of appearance of the character types.

The process flow of each module in the apparatus and the interaction flow between the modules may be described with reference to the related descriptions in the above method embodiments, which are not described in detail herein.

According to the embodiment of the application, the number of layers of the character is obtained according to the established Huffman tree, and the initial coding of each layer is determined based on the number of layers, so that the compression coding of each character type can not be the prefix of the coding values of other characters, on the basis of meeting the Huffman tree construction principle, compared with the existing technology for coding the character based on the Huffman tree, the method and the device do not need to traverse from the root node of the Huffman tree for many times when coding the character, and only the initial coding of the layer of the character and the position of the character in the layer can be used for obtaining the compression coding of the character, so that the coding efficiency is further improved on the basis of ensuring that the codes still have the Huffman code characteristic.

The embodiment of the application also provides an electronic device, as shown in fig. 16, which is a schematic structural diagram of the electronic device provided in the embodiment of the application, including: processor 1601, memory 1602, and a bus. The memory 1602 stores machine-readable instructions executable by the processor 1601 (e.g., execution instructions corresponding to the count sorting module 1501, the tree construction encoding module 1502, and the compiling module 1503 in the apparatus of fig. 15), which when executed by the processor 1601, perform the processing of the data compression method described above when executed by the processor 1601 when the computer device is running, by communicating with the memory 1602 via a bus.

The embodiments of the present application also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the data compression method described above.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described system and apparatus may refer to corresponding procedures in the method embodiments, which are not described in detail in this application. In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, and the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, and for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, indirect coupling or communication connection of devices or modules, electrical, mechanical, or other form.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered in the protection scope of the present application.

Claims

1. A method of data compression, comprising:

and compiling the data to be coded into compressed data based on the compression codes corresponding to the character types.

2. The method of data compression according to claim 1, wherein said determining the character sequence of the data to be encoded comprises:

extracting effective data from the data to be coded;

3. The method of data compression according to claim 2, wherein the counting the valid data to obtain the result of counting each character type in the valid data includes:

4. The method of data compression according to claim 2, wherein the sorting the character categories according to their weights to obtain the character sequence includes:

sequentially comparing each character type with other character types except the character types in the effective data to obtain a weight comparison result;

5. The data compression method of claim 2, wherein constructing the huffman tree from the character sequence and determining the hierarchy of each character class in the huffman tree comprises:

6. The method of data compression of claim 1, wherein said determining initial encodings of levels of said huffman tree comprises:

7. The data compression method according to claim 1, wherein the generating compression codes of the respective character types from the initial codes of the respective layers includes:

Determining the number of character types of each level in the Huffman tree;

8. The data compression method according to claim 7, wherein the generating compression codes of character types of corresponding levels from initial codes of respective levels and assigning the compression codes of each level to the character types of the corresponding levels, respectively, comprises:

9. A data compression apparatus, comprising:

10. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing program instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is running, the processor executing the program instructions to perform the steps of the data compression method according to any one of claims 1 to 8 when executed.

11. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the data compression method according to any of claims 1 to 8.