WO2009005758A2 - System and method for compression processing within a compression engine - Google Patents


Info

Publication number
WO2009005758A2
Authority
WO
WIPO (PCT)
Prior art keywords
byte, stream, value, dictionary, code
Application number
PCT/US2008/008107
Other languages
French (fr)
Other versions
WO2009005758A3 (en)
Inventor
Robert William Laker
David T. Hass
Kaushik Kuila
Original Assignee
Rmi Corporation
Priority claimed from US11/824,501 external-priority patent/US7538695B2/en
Priority claimed from US11/849,166 external-priority patent/US7538696B2/en
Priority claimed from US12/031,524 external-priority patent/US9362948B2/en
Application filed by Rmi Corporation filed Critical Rmi Corporation
Publication of WO2009005758A2 publication Critical patent/WO2009005758A2/en
Publication of WO2009005758A3 publication Critical patent/WO2009005758A3/en

Classifications

    • H ELECTRICITY
    • H03 ELECTRONIC CIRCUITRY
    • H03M CODING; DECODING; CODE CONVERSION IN GENERAL
    • H03M7/00Conversion of a code where information is represented by a given sequence or number of digits to a code where the same, similar or subset of information is represented by a different sequence or number of digits
    • H03M7/30Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction
    • H03M7/3084Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method
    • H03M7/3086Compression; Expansion; Suppression of unnecessary data, e.g. redundancy reduction using adaptive string matching, e.g. the Lempel-Ziv method employing a sliding window, e.g. LZ77
    • H03M7/46Conversion to or from run-length codes, i.e. by representing the number of consecutive digits, or groups of digits, of the same kind by a code word and a digit indicative of that kind

Definitions

  • the present invention relates to compressing or decompressing, and more particularly to allocating resources during compression or decompression.
  • LZ77 is the common name of a lossless data compression algorithm.
  • LZ77 is used as a part of the GNU zip (gzip) DEFLATE process, as specified in RFC 1951.
  • Figure 1 illustrates a conventional compression application 10 which uses the DEFLATE process to transform a file 12 into a compressed file 14.
  • An inverse operation, denoted the INFLATE process, is used to decompress the compressed file 14 to recreate the original file 12.
  • files 12 are first compressed using LZ77, and then the resulting LZ77 code is Huffman coded to provide an even better compression performance.
  • Figure 2 illustrates a conventional LZ77 process 20.
  • In the conventional LZ77 process 20, the file 12 is read character by character.
  • the file 12 is represented by the incoming data stream 22, which is subdivided into bytes. Each byte represents one character. Each character is hashed with the preceding two characters, using a hash table 24, to provide a hash address into a dictionary.
  • the dictionary contains an index into a linked list 26, which contains a series of addresses (ending with a null address). Each address in the linked list 26 points to a place in the input stream, which is stored in a byte buffer 28, where the same sequence of three characters has occurred previously.
  • the previous characters of the input data stream 22 are copied into the byte buffer 28, and the addresses of the linked list 26 point to locations in the byte buffer 28. Typically, these addresses are valid for positional distances up to 32K characters, because the byte buffer 28 stores the previous 32K characters.
  • the input data stream is compared to the previous bytes (i.e., the bytes in the byte buffer 28 at the location pointed to by the address in the linked list 26) to determine how many bytes are similar.
  • the comparator 30 performs this comparison for each address in the series of addresses corresponding to the hash address until it finds a suitable match.
  • this process is performed serially for each address in the linked list 26 that corresponds to the hash address.
  • the serial nature of these operations affects the speed of the conventional LZ77 implementation. Additionally, the performance of the conventional LZ77 implementations is affected by the size of the linked list 26.
  • the LZ77 process 20 then encodes the distance (corresponding to the location in the byte buffer 28) and the length (corresponding to the number of similar bytes starting at the location in the byte buffer 28) of the match to derive part of the LZ77 code stream. If there is no suitable match, the current byte is output as a literal, without further encoding.
  • the LZ77 code stream is made up of encoded distance/length pairs and literals.
  • the LZ77 code stream is then supplied to a Huffman encoder for further compression.
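The conventional hash-chain flow above can be sketched in software. All names here (`hash3`, `lz77_tokens`) and the particular hash mixing are illustrative rather than taken from the patent; only the DEFLATE constants (32K window, 3-byte minimum match, 258-byte maximum match) come from the text.

```python
WINDOW = 32 * 1024   # byte buffer holds the previous 32K characters
MIN_MATCH = 3
MAX_MATCH = 258

def hash3(data, pos):
    """Hash the current character together with the two that follow it."""
    return ((data[pos] << 10) ^ (data[pos + 1] << 5) ^ data[pos + 2]) & 0x7FF

def lz77_tokens(data):
    """Yield literals and (length, distance) pairs by walking a hash chain."""
    head = {}          # hash -> most recent position (start of the chain)
    prev = {}          # position -> older position with the same hash
    tokens = []
    pos = 0
    while pos < len(data):
        best_len, best_dist = 0, 0
        if pos + MIN_MATCH <= len(data):
            h = hash3(data, pos)
            cand = head.get(h)
            # serial walk along the linked list of previous occurrences
            while cand is not None and pos - cand <= WINDOW:
                n = 0
                while (n < MAX_MATCH and pos + n < len(data)
                       and data[cand + n] == data[pos + n]):
                    n += 1
                if n > best_len:
                    best_len, best_dist = n, pos - cand
                cand = prev.get(cand)
            prev[pos] = head.get(h)
            head[h] = pos
        if best_len >= MIN_MATCH:
            tokens.append((best_len, best_dist))   # distance/length pair
            pos += best_len
        else:
            tokens.append(data[pos])               # literal byte
            pos += 1
    return tokens
```

For `b"abcabcabc"` this emits three literals followed by one (6, 3) pair, since the second "abcabc" is an overlapping copy of the first occurrence.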
  • Huffman coding is an encoding algorithm for lossless data compression.
  • Huffman coding uses a variable-length code table for encoding a source symbol such as a character in a file.
  • the variable-length code table is derived from the number of occurrences of each source symbol in the file.
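The frequency-derived code table can be illustrated with the classic Huffman construction. This is a generic textbook sketch, not the patent's hardware implementation, and `huffman_codes` is a hypothetical name.

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Map each symbol to a bit string; rarer symbols get longer codes."""
    tie = count()  # tie-breaker so the heap never compares symbols/trees
    heap = [(f, next(tie), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    if len(heap) == 1:                       # degenerate single-symbol case
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # merge the two rarest nodes
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a source symbol
            codes[node] = prefix
    walk(heap[0][2], "")
    return codes
```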
  • Figure 9 illustrates a conventional compression application 910 which uses the DEFLATE and INFLATE processes to transform between a file 912 and a compressed file 914.
  • the DEFLATE process converts the file 912 into a compressed file 914.
  • the INFLATE process is an inverse process used to decompress the compressed file 914 to recreate the original file 912.
  • files 912 are first compressed using LZ77, and then the resulting LZ77 code is Huffman coded to provide an even better compression performance.
  • the INFLATE process implements Huffman decoding to recover the LZ77 code, and then decompresses the LZ77 code to recreate the files 912.
  • Embodiments of a method are described.
  • the method is a method for DEFLATE processing within a compression engine.
  • An embodiment of the method includes hashing a plurality of characters of an input data stream to provide a hash address into a dictionary.
  • the method also includes reading a plurality of distance values in parallel from the dictionary based on the hash address.
  • the distance values are stored in the dictionary.
  • the method also includes identifying a corresponding length value for each of the plurality of distance values via a matching process.
  • the method also includes encoding the longest length value and the matching distance value as a portion of a LZ77 code stream.
  • Other embodiments of the method are also described.
  • Embodiments of an apparatus are also described.
  • the apparatus is an apparatus to implement a DEFLATE process in a compression engine.
  • An embodiment of the apparatus includes a hash table, a dictionary, comparison logic, and encoding logic.
  • the hash table is configured to hash a plurality of characters of an input data stream to provide a hash address.
  • the dictionary is coupled to the hash table.
  • the dictionary is configured to provide a plurality of distance values in parallel based on the hash address.
  • the distance values are stored in the dictionary.
  • the comparison logic is coupled to the dictionary.
  • the comparison logic is configured to identify a corresponding length value for each of the plurality of distance values.
  • the encoding logic is coupled to the comparison logic.
  • Embodiments of a method are described.
  • the method is a method for Huffman decoding within a compression engine.
  • An embodiment of the method includes receiving a compressed data stream.
  • the method also includes comparing a portion of the compressed data stream with a plurality of predetermined values using a plurality of comparators.
  • the method also includes outputting a LZ77 code value based on the portion of the compressed data stream and a comparison result from comparing the portion of the compressed data stream with the plurality of predetermined values.
  • Other embodiments of the method are also described.
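The comparator-based decoding described above resembles canonical Huffman decoding, in which the accumulated bits are compared against per-length boundary values and the comparison result helps index a symbol table. The sketch below assumes a canonical code; the function names are illustrative.

```python
def canonical_decoder(code_lengths):
    """code_lengths: {symbol: bit length} for a canonical Huffman code."""
    syms = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
    first_code, first_index, counts = {}, {}, {}
    code, prev_len = 0, None
    for i, s in enumerate(syms):
        l = code_lengths[s]
        # canonical rule: next code = (previous + 1), left-shifted on growth
        code = 0 if prev_len is None else (code + 1) << (l - prev_len)
        prev_len = l
        if l not in first_code:
            first_code[l] = code      # predetermined comparator value
            first_index[l] = i        # base index into the lookup table
        counts[l] = counts.get(l, 0) + 1
    def decode_one(bits, pos):
        value, length = 0, 0
        while True:
            value = (value << 1) | bits[pos]
            pos += 1
            length += 1
            # "comparator": is value within this length's code range?
            if length in first_code and \
               0 <= value - first_code[length] < counts[length]:
                return syms[first_index[length] + value - first_code[length]], pos
    return decode_one
```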
  • Embodiments of an apparatus are also described.
  • the apparatus is an apparatus to implement Huffman decoding in an INFLATE process in a compression engine.
  • An embodiment of the apparatus includes a bit buffer, a set of comparators, and a lookup table.
  • the bit buffer stores a portion of a compressed data stream.
  • the set of comparators compares the portion of the compressed data stream with a plurality of predetermined values.
  • the lookup table stores a plurality of LZ77 code segments and outputs one of the LZ77 code segments corresponding to an index at least partially derived from a comparison result from the set of comparators. Other embodiments of the apparatus are also described.
  • a system, method, and computer program product are provided for saving and restoring a compression/decompression state.
  • data is processed, the processing including compressing or decompressing the data. Additionally, a state of the processing is saved. Further, the state of the processing is restored.
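As a software analogy for saving and restoring a compression state, zlib's compression objects expose `copy()`, which snapshots the internal processing state so compression can later continue from either object. The patent describes a hardware engine, so this only mirrors the concept.

```python
import zlib

comp = zlib.compressobj(6)
part1 = comp.compress(b"hello " * 100)

snapshot = comp.copy()          # save the processing state

part2a = comp.flush()           # finish from the live object...
part2b = snapshot.flush()       # ...or restore and finish from the snapshot

original = b"hello " * 100
assert zlib.decompress(part1 + part2a) == original
assert zlib.decompress(part1 + part2b) == original
```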
  • Figure 1 illustrates a conventional compression application which uses the DEFLATE process.
  • Figure 2 illustrates a conventional LZ77 process.
  • Figure 3 depicts a schematic block diagram of one embodiment of a computing environment.
  • Figure 4 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module shown in Figure 3.
  • Figure 5 depicts a schematic block diagram of one embodiment of the LZ77 process that may be implemented by the LZ77 logic of the compression/decompression module shown in Figure 4.
  • Figure 6 depicts a schematic timing diagram of one embodiment of a data flow for the LZ77 process shown in Figure 5.
  • Figure 7 depicts a schematic flow chart diagram of one embodiment of a compression method that may be implemented in conjunction with the LZ77 logic of the compression/decompression engine shown in Figure 4.
  • Figure 8 depicts a schematic flow chart diagram of a more detailed embodiment of the dictionary read operation shown in the compression method of Figure 7.
  • Figure 9 illustrates a conventional compression application which uses the DEFLATE and INFLATE processes.
  • Figure 10 depicts a schematic block diagram of one embodiment of a computing environment.
  • Figure 11 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module shown in Figure 10.
  • Figure 12 depicts a schematic block diagram of one embodiment of a hardware implementation of the Huffman logic of the INFLATE pipeline of the compression/decompression module shown in Figure 11.
  • Figure 13 depicts a schematic block diagram of another embodiment of a hardware implementation of the Huffman logic of the INFLATE pipeline of the compression/decompression module shown in Figure 11.
  • Figure 14 depicts a schematic flow chart diagram of one embodiment of a Huffman decoding method that may be implemented in conjunction with the Huffman logic of the INFLATE pipeline of the compression/decompression engine shown in Figure 11.
  • Figure 15 shows a method for saving and restoring a compression/decompression state, in accordance with one embodiment.
  • Figure 16 shows a computing environment for saving and restoring a compression/decompression state, in accordance with one embodiment.
  • Figure 17 shows a more detailed diagram of the compression/decompression engine (CDE) shown in Figure 16, in accordance with one embodiment.
  • Figure 18 shows a method for saving and restoring a compression/decompression state, in accordance with another embodiment.
  • Figures 19A-19C show a method for saving and restoring a compression/decompression state, in accordance with another embodiment.
  • the described embodiments facilitate reading, in parallel, a plurality (e.g., four) of distance values from a dictionary based on a single hash address.
  • the distance values are used to compare, in parallel, a corresponding plurality of byte streams from a byte buffer with an input data stream.
  • the non-matching byte streams are dropped from consideration until a single comparison remains.
  • the last remaining byte stream is the longest matching byte stream.
  • some embodiments track the lengths of multiple byte streams and perform a priority encode to select the longest.
  • the byte stream with the shortest distance value may be chosen so that the resulting LZ77 code potentially contains less data.
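The selection rule above (the longest match wins, and on a tie the shortest distance is preferred, since shorter distances tend to yield smaller codes) can be sketched as a one-line priority selection; `select_match` is an illustrative name.

```python
def select_match(candidates):
    """candidates: list of (length, distance) pairs; None if no match."""
    if not candidates:
        return None
    # maximize length; among equal lengths, minimize distance
    return max(candidates, key=lambda ld: (ld[0], -ld[1]))
```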
  • some embodiments keep the dictionary small in size. For example, some embodiments of the dictionary have about 2K entries (e.g., based on 11- bit entry addresses). Although a smaller dictionary size may mean that more character combinations hash to the same value, the number of unusable hashes can be limited.
  • the dictionary also stores one or more characters (e.g., the first two characters) from the corresponding byte stream in the byte buffer.
  • the addresses are read out from the dictionary, the corresponding characters are compared with the input data stream, and the addresses corresponding to non-matching characters are discarded. This may limit the number of unusable hashes and decrease the time that the hardware spends comparing the byte streams from the byte buffer with the input data stream.
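The quick-reject step might look like the sketch below, where each match entry carries the first two characters of its byte-buffer string. The `(char0, char1, distance)` entry layout is an assumption for illustration only.

```python
def filter_candidates(entries, input_bytes, pos):
    """Keep only entries whose stored two characters match the input."""
    c0, c1 = input_bytes[pos], input_bytes[pos + 1]
    # candidates with non-matching stored characters are discarded before
    # any byte-buffer read is performed
    return [dist for (e0, e1, dist) in entries if (e0, e1) == (c0, c1)]
```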
  • the byte buffer is arranged to store sixteen bytes in each storage location. This allows a comparison of up to sixteen bytes per cycle (although the first and last cycles of a matching operation may compare less than sixteen bytes). By allowing comparisons of sixteen bytes at a time, match operations may be accelerated.
  • some embodiments update the dictionary whenever a literal is output or at the end of each matching operation.
  • the dictionary is not updated on every byte comparison (unlike conventional software implementations).
  • This exemplary update schedule offers acceptable performance since the dictionary uses four match positions. Also, this update schedule may save cycles where a single-ported dictionary random access memory (RAM) is implemented.
  • FIG. 3 depicts a schematic block diagram of one embodiment of a computing environment 100.
  • the illustrated computing environment 100 includes a compression/decompression engine (CDE) 102, a fast messaging network (FMN) station 104, and an I/O distributed interconnect station 106.
  • the I/O distributed interconnect station 106 is part of a high speed distributed interconnect ring which connects multiple cores, caches, and processing agents.
  • the high speed distributed interconnect ring supports simultaneous transactions among the connected components.
  • the FMN 104 provides a channel for messages directed to and from the CDE 102.
  • the messages may direct the CDE 102 to perform compression or indicate completion of a compression operation.
  • the CDE 102 is configured to compress and decompress files for transfer within the computing environment 100.
  • the CDE 102 may be implemented in other computing environments in which compressed files may be used.
  • the illustrated CDE 102 includes a bus interface unit (BIU) 108, an XLT 110, and a compression/decompression module (CDM) 112.
  • the BIU 108 provides a data interface to the I/O distributed interconnect station 106 and the I/O distributed interconnect ring.
  • the XLT 1 10 provides an interface between the BIU 108 and the CDM 112.
  • the XLT 110 uses its own direct memory access (DMA) engine to read and write data via the BIU 108, so the XLT 110 may operate autonomously from a central processing unit (CPU) coupled to the computing environment 100.
  • the CDM 112 performs compression and decompression operations for the CDE 102.
  • a more detailed embodiment of the CDM is shown in Figure 4 and described below.
  • Other embodiments of the CDE 102 may include fewer or more components. Additionally, other embodiments of the CDE 102 may implement more or less functionality than is described herein.
  • FIG. 4 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module (CDM) 112 shown in Figure 3.
  • the illustrated CDM 112 includes a DEFLATE pipeline 114 and an INFLATE pipeline 116.
  • the DEFLATE pipeline 114 is available to implement the CDE compression process, also referred to as the DEFLATE process.
  • the illustrated DEFLATE pipeline 114 includes an input buffer 118, LZ77 logic 120, Huffman logic 122, and an output buffer 124.
  • the INFLATE pipeline 116 is available to implement the CDE decompression process, also referred to as the INFLATE process.
  • the illustrated INFLATE pipeline 116 includes an input buffer 126, Huffman logic 128, LZ77 logic 130, and an output buffer 132.
  • although each pipeline is shown with individual schematic components, at least some of the components may operate in conjunction with both pipelines 114 and 116 using a single implementation.
  • Other embodiments of the CDM 112 may incorporate fewer or more components.
  • the CDM 112 supports various operating modes, including static compression, dynamic compression, and no compression.
  • a file such as the file 12 of Figure 1 may be split into blocks, and each block may use any of the three modes. Hence, the various blocks of a single file may be compressed using any combination of these three modes.
  • splitting the file into blocks is performed as a pre-process before the file is presented to the CDE 102.
  • the CDE 102 then compresses each block and uses bit-stitching to recombine the compressed blocks in the deflated bit stream.
  • the deflated bit stream is input to the CDE 102 and the CDE decompresses the blocks individually, according to the block header information within the bit stream.
  • the DEFLATE and INFLATE processes use two algorithms to achieve compression.
  • the LZ77 algorithm implemented by the LZ77 logic 120 for the DEFLATE process, creates a dictionary of strings of bytes that have occurred previously in the file.
  • the LZ77 logic 120 enforces a minimum string length (e.g., three bytes) for the byte strings in the dictionary.
  • the LZ77 logic 120 then replaces strings with a distance value (e.g., up to 32,768 bytes) and a length value (e.g., up to 258 bytes) for a matching string. If no match exists, then the incoming byte is output as a literal character.
  • the Huffman logic 122 (for the DEFLATE process) implements the Huffman algorithm to replace the literal, length, and distance codes with codes whose length depends on the frequency of occurrence of the LZ77 codes in the block. More specifically, the Huffman logic 122 implements one of three coding modes: static compression, dynamic compression, and no compression.
  • in static compression, a predefined code is used which is not necessarily ideal for the block being coded, but still typically achieves good compression. Static compression coding may be executed relatively quickly.
  • Dynamic compression coding, in contrast, may be slower since it uses two passes: one pass to create a statistics table of the frequency of occurrence of each LZ77 code and to generate an optimized Huffman code, and a second pass to make use of the Huffman code to encode the LZ77 data.
  • although dynamic coding may be slower than static coding, in some instances it also may result in a higher compression ratio.
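The two-pass flow can be sketched as follows. The second pass here uses a deliberately simple unary prefix code in place of a real length-limited DEFLATE Huffman code, just to show the pass structure; the LZ77 code tokens are invented for the example.

```python
from collections import Counter

lz77_codes = ["A", "B", "MATCH(4,9)", "A", "A"]   # hypothetical LZ77 output

# Pass 1: statistics table of the frequency of occurrence of each code.
stats = Counter(lz77_codes)

# Pass 2: assign shorter bit strings to more frequent codes and encode.
ranked = [sym for sym, _ in stats.most_common()]
codebook = {sym: "1" * i + "0" for i, sym in enumerate(ranked)}
encoded = "".join(codebook[c] for c in lz77_codes)
```

The most frequent code ("A") receives the one-bit string "0", while the rarer codes receive longer strings, mirroring how the optimized code shortens frequent LZ77 symbols.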
  • some input files, or data such as embedded image data within a file, may already be in a compressed format. As a result, the static and dynamic coding techniques of the Huffman logic 122 may be unable to compress such data further, or potentially may increase the size of the compressed data.
  • the Huffman logic 122 may implement a format without further compression (i.e., the "no compression mode"). In this mode, the data are split into blocks, with each block having up to approximately 65,535 bytes in size. The compression process also adds a header for this data type and then outputs the data stream as is.
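The no-compression mode corresponds to DEFLATE stored blocks as specified in RFC 1951: a 3-bit header (BFINAL, BTYPE=00), padding to a byte boundary, then LEN and NLEN (its one's complement) as 16-bit little-endian fields, followed by the raw bytes. A stored block holds at most 65,535 bytes, matching the block size limit above. The function name is illustrative.

```python
import struct
import zlib

def stored_blocks(data):
    """Emit a raw DEFLATE stream of stored (uncompressed) blocks."""
    out = bytearray()
    chunks = [data[i:i + 65535] for i in range(0, len(data), 65535)] or [b""]
    for i, chunk in enumerate(chunks):
        final = 1 if i == len(chunks) - 1 else 0
        out.append(final)   # BFINAL bit; BTYPE=00 leaves the rest zero
        out += struct.pack("<HH", len(chunk), len(chunk) ^ 0xFFFF)
        out += chunk        # data stream output as is
    return bytes(out)

# A raw deflate stream (wbits=-15) round-trips through zlib unchanged:
payload = b"already-compressed data passes through unchanged"
assert zlib.decompress(stored_blocks(payload), -15) == payload
```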
  • Figure 5 depicts a schematic block diagram of one embodiment of the LZ77 process that may be implemented by the LZ77 logic 120 of the compression/decompression module (CDM) 112 shown in Figure 4.
  • the illustrated LZ77 logic 120 receives an input data stream 142 and includes a hash table 144, a dictionary 146, distance logic 148, a byte buffer 150, comparison logic 152 with one or more counters 154, and encoding logic 156.
  • Other embodiments of the LZ77 logic 120 may include fewer or more components or may implement more or less functionality.
  • data are received from the XLT 110 by the input buffer 118 of the DEFLATE pipeline 114.
  • the input buffer 118 is a first-in-first-out (FIFO) buffer.
  • the data are received as 32-byte cache lines, with a byte count to indicate how many bytes are valid on the last word. Words are then written by the LZ77 logic 120 to both a 128-byte input buffer (not shown) and to the byte buffer 150.
  • the byte buffer 150 is a 32-Kbyte buffer which stores up to the last 32 Kbytes of the input data stream 142. The data stored in the byte buffer 150 are used, at least in some instances, as reference data whenever a match is being determined.
  • each character is hashed with the preceding two characters, using the hash table 144, to provide a hash address into the dictionary 146.
  • the dictionary 146 stores buffer locations for matching. In one embodiment, every 3 input bytes from the input data stream 142 are hashed to provide an 11-bit address. Based on the 11-bit hash address, the dictionary 146 may store approximately 2K entries. In each entry of the dictionary 146, up to four possible match entries are stored. In some embodiments, the hash table 144 and the dictionary 146 may be combined into a single, functional block.
  • each match entry includes a match position, a valid bit, and the first two characters of the string at the location in the byte buffer 150 indicated by the match position.
  • the inclusion of one or more characters of the string, at the location in the byte buffer 150, within the match entry allows the distance logic 148 to quickly reject one or more of the match entries if the stored characters do not match the characters from the input data stream 142.
  • only good matches (i.e., match entries with stored characters that match the characters from the input data stream 142) proceed in the depicted DEFLATE process. Other embodiments may include fewer or more match entry fields.
  • the locations in the byte buffer 150 are read.
  • the byte streams beginning at the locations in the byte buffer 150 are read 16 bytes at a time.
  • Each byte stream read from the byte buffer 150 is compared with the bytes from the input data stream 142 by the comparison logic 152.
  • interleaved reads from the byte buffer 150 allow multiple byte strings to be read and compared simultaneously or at approximately the same time.
  • up to four streams may be simultaneously read from the byte buffer 150 and compared with the input data stream 142. This comparison process continues until the longest matching byte stream from the byte buffer 150 is found.
  • the counter 154 (or multiple counters 154) are used to count the progress of each comparison between a byte stream from the byte buffer 150 and the input data stream 142.
  • the comparison logic 152 may be configured to stop any comparisons that reach a maximum count (e.g., 258 bytes). If multiple byte streams have the same length or reach the maximum count, then the comparison logic 152 may designate one of the byte streams as the best match. In another embodiment, the comparison logic 152 may determine that there are no matches and output the byte from the input data stream 142 as a literal.
  • the comparison logic 152 and the distance logic 148 provide a length value and a distance value, respectively, to the encoding logic 156.
  • the encoding logic 156 encodes the length and distance values as part of an LZ77 code stream. Additionally, the encoding logic 156 may output a special code (e.g., a decimal 256) when a block is complete. Where a special code is used, the code may occur only once within the block and is used to indicate the completion of the block.
  • the LZ77 code stream is then passed to the Huffman logic 122 of the DEFLATE pipeline 114.
  • the INFLATE LZ77 process may be implemented using similar LZ77 logic 130 with complementary functionality.
  • the LZ77 logic 130 of the INFLATE pipeline 116 receives LZ77 coded data from the Huffman logic 128 and uses the LZ77 coded data to reconstruct the original file format.
  • the LZ77 logic 130 uses the same 32-Kbyte byte buffer 150 used in the DEFLATE process.
  • the byte buffer 150 is used as the source of the strings specified by the distance and length values provided by the Huffman logic 128. Each decoded byte is output to the XLT 110 and is written to the byte buffer 150.
  • the LZ77 logic 130 provides the decompressed, reconstructed file data to the XLT 110 via the output buffer 132 and a 16-byte wide bus.
  • FIG. 6 depicts a schematic timing diagram 160 of one embodiment of a data flow for the LZ77 process shown in Figure 5.
  • the illustrated timing diagram 160 shows how read (READ) and comparison (COMP) operations may be interleaved for multiple byte streams from the byte buffer 150.
  • although the exemplary timing diagram 160 shows interleaved operations for four byte streams, other embodiments may interleave fewer or more byte streams.
  • in cycle 1, there is a dictionary lookup operation to look up four distance values (e.g., stored in the four match entries of a dictionary entry corresponding to the hash address) from the dictionary 146.
  • the comparison logic 152 reads bytes from the byte buffer 150 over the following cycles. In one embodiment, the comparison logic 152 reads a first byte for the first byte stream (i.e., byte stream "1") during cycle 2 of the timing diagram 160. In cycle 3, the comparison logic 152 reads the first byte for the second byte stream (i.e., byte stream "2"). Additionally, the comparison logic 152 compares the first byte from the first byte stream with the first byte from the input data stream 142.
  • in cycle 4, the comparison logic 152 reads the first byte for the third byte stream (i.e., byte stream "3") and compares the first byte from the second byte stream with the first byte from the input data stream 142.
  • the first bytes from the second byte stream and the input data stream 142 are not a match.
  • the second byte stream is dropped.
  • the comparison logic 152 reads the first byte for the fourth byte stream (i.e., byte stream "4") and compares the first byte from the third byte stream with the first byte from the input data stream 142. In this example, the first bytes from the third byte stream and the input data stream 142 are a match.
  • the comparison logic 152 reads the second byte for the first byte stream (i.e., byte stream "1") and compares the first byte from the fourth byte stream with the first byte from the input data stream 142. In this example, the first bytes from the fourth byte stream and the input data stream 142 are not a match. Hence, the fourth byte stream is dropped, leaving only the first and third byte streams.
  • the comparison logic 152 reads the second byte for the third byte stream (i.e., byte stream "3") and compares the second byte from the first byte stream with the second byte from the input data stream 142.
  • the second bytes from the first byte stream and the input data stream 142 are not a match.
  • the first byte stream is dropped, leaving only the third byte stream.
  • the comparison logic 152 reads the third byte for the third byte stream (i.e., byte stream "3") and compares the second byte from the third byte stream with the second byte from the input data stream 142.
  • the second bytes from the third byte stream and the input data stream 142 are not a match.
  • the third byte stream is identified as the longest matching byte stream, with a corresponding length value.
  • the comparison logic 152 may designate either the first byte stream or third byte stream as the longest matching byte stream since they have equal length values.
  • the LZ77 logic 120 may start another LZ77 process on the following cycle for the next byte in the input data stream 142.
  • the dictionary 146 may be updated at about the beginning of the depicted LZ77 process.
  • each dictionary entry operates like a 4-deep FIFO.
  • the dictionary update involves writing the first entry and setting it valid. Subsequent dictionary updates shift the entries like in a FIFO. If there are already four entries then the dictionary update may shift the oldest entry out of the dictionary to make room for the new entry.
  • an entry includes the first two characters (e.g., one byte each) that were used to compute the hash, as well as the current buffer position (e.g., fifteen bits for the block position modulo 32K) and a valid bit (e.g., for a total of 32 bits).
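A possible packing of such a 32-bit entry, plus the 4-deep FIFO-style update, is sketched below. The field order within the word is an assumption; the patent gives only the field widths (two 8-bit characters, a 15-bit position, and a valid bit).

```python
def pack_entry(char0, char1, position, valid=True):
    """Pack a match entry into 32 bits: chars in the top half,
    15-bit buffer position, and a valid bit in bit 0."""
    return (char0 << 24) | (char1 << 16) | ((position & 0x7FFF) << 1) | int(valid)

def unpack_entry(entry):
    return ((entry >> 24) & 0xFF, (entry >> 16) & 0xFF,
            (entry >> 1) & 0x7FFF, entry & 1)

def update_bucket(bucket, new_entry, depth=4):
    """Dictionary update behaves like a 4-deep FIFO: the new entry is
    written first and the oldest entry is shifted out when full."""
    return [new_entry] + bucket[:depth - 1]
```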
  • Figure 7 depicts a schematic flow chart diagram of one embodiment of a compression method 170 that may be implemented in conjunction with the LZ77 logic 120 of the compression/decompression engine (CDE) 102 shown in Figure 4.
  • although the compression method 170 is described with reference to the CDE 102 of Figure 4, other embodiments may be implemented in conjunction with other compression/decompression engines.
  • at least some of the operations of the illustrated compression method 170 may be implemented in parallel (e.g., interleaved) in order to process multiple byte streams simultaneously or at about the same time.
  • the hash table 144 reads 172 characters from the input data stream 142. In one embodiment, the hash table 144 reads the current character and the two previous characters from the input data stream 142. Alternatively, the hash table 144 may use a different combination of characters from the input data stream 142. The hash table 144 then hashes 174 the characters from the input data stream 142 to provide a hash address to the dictionary 146. Using the hash address, the dictionary 146 outputs 176 one or more (e.g., up to four) distance values. In one embodiment, the distance values are obtained simultaneously or at about the same time from the dictionary 146.
  • the distance values are obtained simultaneously or at about the same time from the dictionary 146.
  • The comparison logic 152 then obtains a corresponding number of byte streams from the byte buffer 150 using the distance values provided by the dictionary 146. Each byte stream is compared 178 with the input data stream 142 to determine if the byte streams match the input data stream 142. As explained above, if the byte streams from the byte buffer 150 do not match the input data stream 142, then the non-matching byte streams are dropped, or discarded. In one embodiment, the comparison logic 152 identifies 180 the lengths of each matching byte stream from the byte stream buffer 150. The comparison logic 152 then determines 182 if one of the byte streams is the longest matching byte stream.
  • The comparison logic 152 references the count stored by each of the counters 154 to determine the longest matching byte stream. Ultimately, the byte streams that are not the longest matching byte streams are dropped (and the corresponding length and distance values are discarded). If two or more byte streams have matching lengths that qualify as the longest length, then the comparison logic 152 identifies 184 the byte stream with the matching longest length and the shortest distance. After identifying the byte stream with the longest length or the byte stream with the matching longest length and the shortest distance, the length and distance values for the selected byte stream are encoded 186 in the LZ77 code stream. The illustrated compression method 170 then ends.
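The selection rule above can be sketched in software. The following is an illustrative model only, not the patent's comparison hardware; the function name and the (length, distance) tuple representation are assumptions.

```python
# Illustrative model of the match-selection rule: the longest matching byte
# stream wins, and ties in length are broken by the shortest distance.
def select_match(candidates):
    """candidates: list of (length, distance) tuples for matching byte streams."""
    if not candidates:
        return None
    # Selection key: longest length first; among equal lengths, smallest distance.
    return min(candidates, key=lambda m: (-m[0], m[1]))

# Two streams tie at length 18; the one at distance 100 is selected.
print(select_match([(12, 4000), (18, 2500), (18, 100)]))  # (18, 100)
```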
  • As an example, the comparison logic may begin comparisons for four byte streams from the byte buffer 150. If a byte stream fails to match to the end of a 16-byte segment, then the segment is dropped. Otherwise, if the byte stream does match to the end of a 16-byte segment, then the length of the match is unknown until further matching is performed on subsequent 16-byte segments. In one embodiment, even a dropped byte stream may be the longest match even though it is not the last remaining byte stream. In this case, the counters 154 may be used to determine the longest matching byte stream.
  • As a further example, two byte streams may be compared, in which the first byte stream matches 1 byte and the second byte stream matches 15 bytes on the first 16-byte segment. On the next segment, the first and second byte streams both match 16 bytes. On the following segment, the first byte stream matches 16 bytes and the second byte stream matches 8 bytes. Since the second byte stream does not match to the end of the 16-byte segment, further matching is not performed for the second byte stream. However, the count for the second byte stream is maintained for eventual comparison with the count for the first byte stream. On the final segment, the first byte stream matches 3 bytes. In this example, the second byte stream is dropped before the first byte stream, but is nevertheless the longest matching byte stream.
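Assuming the per-segment counts in this example accumulate as described (1 + 16 + 16 + 3 bytes for the first stream versus 15 + 16 + 8 bytes for the second), the counter-based selection can be modeled with a short sketch; the function name is hypothetical.

```python
# Toy model of the counters 154: each stream keeps its matched-byte count even
# after being dropped, and the final winner is the stream with the largest count.
def pick_longest(counts):
    """counts: final matched-byte count for each (possibly dropped) byte stream."""
    best = max(range(len(counts)), key=lambda i: counts[i])
    return best, counts[best]

# First stream: 1 + 16 + 16 + 3 = 36 bytes; second stream: 15 + 16 + 8 = 39.
# The second stream was dropped earlier, yet it holds the longest match.
print(pick_longest([1 + 16 + 16 + 3, 15 + 16 + 8]))  # (1, 39)
```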
  • Figure 8 depicts a schematic flow chart diagram of a more detailed embodiment of the dictionary read operation 176 shown in the compression method 170 of Figure 7.
  • Although the dictionary read operation 176 is described with reference to the CDE 102 of Figure 4, other embodiments may be implemented in conjunction with other compression/decompression engines. Also, it should be noted that at least some of the operations of the illustrated dictionary read operation 176 may be implemented in parallel (e.g., interleaved) in order to process multiple distance values and/or byte streams simultaneously or at about the same time.
  • Each of the match entries in a dictionary entry may include one or more initial characters from the byte streams stored in the corresponding locations in the byte buffer 150. The initial byte stream characters stored in the dictionary 146 are read 188 and compared 190 by the distance logic 148 with the corresponding bytes from the input data stream. If the dictionary characters do not match the corresponding input bytes, then the distance logic 148 discards the corresponding distance value so that the comparison logic 152 does not consume any time or resources trying to compare the non-matching byte stream with the input data stream 142. The illustrated dictionary read operation 176 then ends.
  • Embodiments of the methods, operations, functions, and/or logic may be implemented in software, firmware, hardware, or some combination thereof. Additionally, some embodiments of the methods, operations, functions, and/or logic may be implemented using a hardware or software representation of one or more algorithms related to the operations described above. To the degree that an embodiment may be implemented in software, the methods, operations, functions, and/or logic are stored on a computer-readable medium and accessible by a computer processor.
  • Embodiments of the invention also may involve a number of functions to be performed by a computer processor such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor.
  • The microprocessor may be a specialized or dedicated microprocessor that is configured to perform particular tasks by executing machine-readable software code that defines the particular tasks.
  • The microprocessor also may be configured to operate and communicate with other devices such as direct memory access modules, memory storage devices, Internet-related hardware, and other devices that relate to the transmission of data.
  • The software code may be configured using software formats such as Java, C++, XML (Extensible Mark-up Language) and other languages that may be used to define functions that relate to operations of devices required to carry out the functional operations described herein.
  • The code may be written in different forms and styles, many of which are known to those skilled in the art. Different code formats, code configurations, styles and forms of software programs and other means of configuring code to define the operations of a microprocessor may be implemented.
  • The memory/storage device where data is stored may be a separate device that is external to the processor, or may be configured in a monolithic device, where the memory or storage device is located on the same integrated circuit, such as components connected on a single substrate.
  • Cache memory devices are often included in computers for use by the CPU or GPU as a convenient storage location for information that is frequently stored and retrieved.
  • A persistent memory is also frequently used with such computers for maintaining information that is frequently retrieved by a central processing unit, but that is not often altered within the persistent memory, unlike the cache memory.
  • Main memory is also usually included for storing and retrieving larger amounts of information such as data and software applications configured to perform certain functions when executed by the central processing unit.
  • These memory devices may be configured as random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), flash memory, and other memory storage devices that may be accessed by a central processing unit to store and retrieve information.
  • Embodiments may be implemented with various memory and storage devices, as well as any commonly used protocol for storing and retrieving information to and from these memory devices respectively.
  • At least some of the described embodiments implement logic to facilitate all or part of an INFLATE process to decompress a compressed file. More specifically, at least one embodiment uses a set of comparators to determine a code length of an incoming code and an adder (or other addition logic) to determine an index into a lookup table (LUT) that is based on a random access memory (RAM).
  • Some embodiments of the Huffman logic are faster than a ternary content-addressable memory (CAM) and consume less circuit area. Some embodiments also facilitate Huffman coding, as explained in more detail below, and may use fewer comparators (e.g., 2 comparators).
  • A preamble sequence provides the code length of each of the codes. The number of codes of each length is then tabulated and used to determine the starting code for each code length. A RAM index is derived by subtracting the starting code and adding the number of all shorter codes; these two constants may be pre-computed together to allow a single addition operation. In this way, the codes may be stored in a contiguous manner within the RAM-based LUT.
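A software sketch of this index derivation, assuming DEFLATE-style (RFC 1951) canonical code assignment; the function names are illustrative, not from the patent.

```python
# Compute, for each code length, a single pre-computed offset that folds
# together "subtract the starting code" and "add the number of shorter codes",
# so the decoder derives a contiguous RAM index with one arithmetic operation.
def build_offsets(count):
    """count[l] = number of codes of bit length l; count[0] must be 0."""
    code, shorter, offsets = 0, 0, {}
    for bits in range(1, len(count)):
        code = (code + count[bits - 1]) << 1  # canonical start code (RFC 1951)
        if count[bits]:
            offsets[bits] = code - shorter    # start code minus shorter-code count
        shorter += count[bits]
    return offsets

def lut_index(code_value, length, offsets):
    return code_value - offsets[length]

# 10 codes of length 4, 16 of length 6, 16 of length 7:
offs = build_offsets([0, 0, 0, 0, 10, 0, 16, 16])
print(offs)  # {4: 0, 6: 30, 7: 86}
```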
  • The largest lookup value may be for a 286-entry literal/length code. Such a code would have 9 code bits together with an associated extra data size, which can vary from 0 to 5 bits, depending on the code. The extra data size is stored alongside the LZ77 code word, so the total size of the RAM LUT for this exemplary code is 286 locations, each of 12 bits: 9 bits for the code value and 3 bits for the extra data size.
  • FIG. 10 depicts a schematic block diagram of one embodiment of a computing environment 1000.
  • The illustrated computing environment 1000 includes a compression/decompression engine (CDE) 1002, a fast messaging network (FMN) station 1004, and an input-output (I/O) distributed interconnect station 1006.
  • An exemplary embodiment of the CDE 1002 is described in more detail below.
  • The I/O distributed interconnect station 1006 is part of a high speed distributed interconnect ring which connects multiple cores, caches, and processing agents. The high speed distributed interconnect ring supports simultaneous transactions among the connected components.
  • The FMN 1004 provides a channel for messages directed to and from the CDE 1002. For example, the messages may direct the CDE 1002 to perform compression or indicate completion of a compression operation.
  • The CDE 1002 is configured to compress files for transfer via the BIU 1008 and to decompress compressed files received via the BIU 1008. Other embodiments of the CDE 1002 may be implemented in other computing environments in which compressed files may be used.
  • The illustrated CDE 1002 includes a bus interface unit (BIU) 1008, a translate block (XLT) 1010, and a compression/decompression module (CDM) 1012.
  • The BIU 1008 provides a data interface to the I/O distributed interconnect station 1006 and the I/O distributed interconnect ring.
  • The XLT 1010 provides an interface between the BIU 1008 and the CDM 1012.
  • The XLT 1010 uses its own direct memory access (DMA) engine to read and write data via the BIU 1008, so the XLT 1010 may operate autonomously from a central processing unit (CPU) coupled to the computing environment 1000.
  • The CDM 1012 performs compression and decompression operations for the CDE 1002.
  • A more detailed embodiment of the CDM 1012 is shown in Figure 11 and described below.
  • Other embodiments of the CDE 1002 may include fewer or more components. Additionally, other embodiments of the CDE 1002 may implement more or less functionality than is described herein.
  • FIG. 11 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module (CDM) 1012 shown in Figure 10.
  • The illustrated CDM 1012 includes a DEFLATE pipeline 1114 and an INFLATE pipeline 1116.
  • The DEFLATE pipeline 1114 is available to implement the CDE compression process, also referred to as the DEFLATE process.
  • The illustrated DEFLATE pipeline 1114 includes an input buffer 1118, LZ77 logic 1120, Huffman logic 1122, and an output buffer 1124.
  • The INFLATE pipeline 1116 is available to implement the CDE decompression process, also referred to as the INFLATE process.
  • The illustrated INFLATE pipeline 1116 includes an input buffer 1126, Huffman logic 1128, LZ77 logic 1130, and an output buffer 1132. Although each pipeline is shown with individual schematic components, at least some of the components may operate in conjunction with both pipelines 1114 and 1116 using a single implementation. Other embodiments of the CDM 1012 may incorporate fewer or more components.
  • For both the DEFLATE pipeline 1114 and the INFLATE pipeline 1116, the CDM supports various operating modes, including static compression, dynamic compression, and no compression. A file such as the file 912 of Figure 9 may be split into blocks, and each block may use any of the three modes. Hence, the various blocks of a single file may be compressed using any combination of these three modes.
  • Splitting the file into blocks is performed as a pre-process before the file is presented to the CDE 1002. The CDE 1002 then compresses each block and uses bit-stitching to recombine the compressed blocks in the deflated bit stream. For decompression, the deflated bit stream is input to the CDE 1002 and the CDE decompresses the blocks individually, according to the block header information within the bit stream.
  • The DEFLATE and INFLATE processes use two algorithms to achieve compression. The LZ77 algorithm, implemented by the LZ77 logic 1120 for the DEFLATE process, creates a dictionary of strings of bytes that have occurred previously in the file. The LZ77 logic 1120 enforces a minimum string length (e.g., three bytes) for the byte strings in the dictionary. The LZ77 logic 1120 then replaces strings with a distance value (e.g., up to 32,768 bytes) and a length value (e.g., up to 258 bytes) for a matching string. If no match exists, then the incoming byte is output as a literal character.
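The behavior just described can be illustrated with a naive reference encoder. This sketch searches the window linearly; it stands in for, and is much slower than, the hash-table/dictionary hardware described elsewhere in this document. The names and output tuple format are assumptions.

```python
# Minimal LZ77-style encoder: emit (distance, length) for matches of at least
# MIN_MATCH bytes within the window, otherwise emit the byte as a literal.
MIN_MATCH = 3

def lz77_encode(data, window=32768, max_len=258):
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):  # linear search of the window
            n = 0
            while n < max_len and i + n < len(data) and data[j + n] == data[i + n]:
                n += 1  # overlapping matches (j + n >= i) are allowed in LZ77
            if n > best_len:
                best_len, best_dist = n, i - j
        if best_len >= MIN_MATCH:
            out.append(("match", best_dist, best_len))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

print(lz77_encode(b"abcabcabc"))
# [('lit', 97), ('lit', 98), ('lit', 99), ('match', 3, 6)]
```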
  • The Huffman logic 1122 (for the DEFLATE process) implements the Huffman algorithm to replace the literal, length, and distance codes with codes whose length depends on the frequency of occurrence of the LZ77 codes in the block. More specifically, the Huffman logic 1122 implements one of three coding modes: static compression, dynamic compression, and no compression.
  • In static compression, a predefined code is used which is not necessarily ideal for the block being coded, but still typically achieves good compression. Static compression coding may be executed relatively quickly. Dynamic compression coding, in contrast, may be slower since it uses two passes: one pass to create a statistics table of the frequency of occurrence of each LZ77 code and to generate an optimized Huffman code, and a second pass to make use of the Huffman code to encode the LZ77 data. Although dynamic coding may be slower than static coding, in some instances it also may result in a higher compression ratio.
  • The Huffman logic 1122 outputs a serial bit stream which is sent a byte at a time to the XLT 1010. The bit stream is packed with zeroes at the end of the file in order to finish on a byte boundary. The maximum transfer rate is approximately 3.2 Gbps at 400 MHz, although other levels of performance may be achieved using other systems.
  • The Huffman logic 1122 parses the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. More specifically, a static LUT is built upon initialization, and the LUT is used to provide a Huffman code for every literal, length, or distance subsequently presented to it. In some embodiments, there are 30 distance codes, each having 5 bits. Additionally, literal and length codes may be part of the same 286-entry LUT (refer to Figure 12) or part of a separate LUT (refer to Figure 13). In some embodiments, each literal and length code is 7, 8, or 9 bits in size. Furthermore, many of the length and distance codes may have extra data which follows directly after the code word, which provides a range of possible lengths or distances. The extra bits are also used to define an exact length or distance. However, the number of extra bits is a function of the code, with longer codes having more extra data values. The Huffman logic 1122 then outputs the deflated block, including the compressed data and other symbols.
  • The Huffman logic 1122 implements multiple phases. In one embodiment, two phases are implemented for dynamic Huffman coding. In the first phase (also referred to as the first pass), the Huffman logic 1122 gathers statistics for each of 286 literal/length codes. The Huffman logic 1122 also gathers statistics for each of 30 distance codes. In the second phase (also referred to as the second pass), several operations are implemented. In one embodiment, a literal and length heap is built, and a literal and length Huffman tree is built. Then the literal and length Huffman code is generated. Similar heap, tree, and code generation operations are also implemented for the corresponding distance value and the bit length.
  • The Huffman logic 1122 outputs the bit length code sizes, the literal/length codes using the bit length code, and the distance code using the bit length code.
  • The Huffman logic 1122 parses the literal/length and distance Huffman codes, replacing code lengths and repetition counts with equivalent bit length Huffman codes.
  • Additionally, the Huffman logic 1122 may parse the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. The output literal/length codes and distance codes are also referred to as the output bit stream.
  • Finally, the Huffman logic 1122 may implement a format without further compression (i.e., the "no compression mode"). In this mode, the data are split into blocks, with each block being up to 65,535 bytes in size. The compression process also adds a header for this data type and then outputs the data stream as is.
  • The INFLATE process is the reverse of the DEFLATE process. Although the INFLATE process is less complicated than the DEFLATE process (e.g., there is no need to choose a decoding or decompression mode), other complications may arise in the INFLATE process. For example, some embodiments of the INFLATE process are configured to process any valid compressed file, including the possibility of unlimited block size, as well as distances and lengths up to the maximums specified in the industry standards.
  • The Huffman logic 1128 receives data from the XLT 1010 via the input buffer 1126. The Huffman logic 1128 operates in a single phase, regardless of whether the data are statically or dynamically encoded. The LUT is programmed during initialization. A set of comparators is used to determine the length of each incoming literal/length code, which may be, for example, 7, 8, or 9 bits. In one embodiment, the distance codes are all 5 bits in length. However, other embodiments may use different bit lengths for the literal/length and distance codes. An offset is then added to the code to put it into the correct range within the LUT. The output of the LUT provides both the value of the code and the length of any extra data that is appended.
  • The Huffman logic 1128 reads and stores the size (e.g., 1-7 bits) of each bit length code and determines the sum of codes of each size for the bit length codes. The Huffman logic 1128 also determines the start code for each code size for the bit length codes. The Huffman logic 1128 then writes the bit length LUT.
  • The Huffman logic 1128 uses the bit length LUT to read and store the size (e.g., 1-15 bits) of each literal/length code and determines the sum of codes of each size for the literal/length codes. The Huffman logic 1128 also determines the start code for each code size for the literal/length codes. The Huffman logic 1128 then writes the literal/length LUT.
  • The Huffman logic 1128 also uses the bit length LUT to read and store the size (e.g., 1-15 bits) of each distance code and to determine the sum of codes of each size for the distance codes. The Huffman logic 1128 also determines the start code for each code size of the distance codes. The Huffman logic 1128 then writes the distance LUT.
  • Like the static LUT, a set of comparators is used to determine the size of each incoming literal/length and distance code. In one embodiment, 14 comparators are used because the size may vary from 1 to 15 bits. Other embodiments may use other quantities of comparators. The output of the LUT gives both the value of the code and the length of any extra data that is appended.
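These table-building steps can be modeled in software. The sketch below assumes DEFLATE-style canonical codes (shorter codes first, symbols in ascending order within a length); the names are illustrative, not from the patent.

```python
# From per-symbol code lengths: count codes of each size, derive each size's
# start code, and lay out a contiguous LUT mapping index -> symbol value.
def build_decode_lut(lengths):
    """lengths[sym] = code size in bits for symbol sym (0 = symbol unused)."""
    max_bits = max(lengths)
    count = [0] * (max_bits + 1)
    for l in lengths:
        if l:
            count[l] += 1
    start, code = [0] * (max_bits + 1), 0
    for bits in range(1, max_bits + 1):
        code = (code + count[bits - 1]) << 1  # canonical start code per size
        start[bits] = code
    lut = []  # all shorter codes first, then symbols in ascending order
    for bits in range(1, max_bits + 1):
        lut.extend(sym for sym, l in enumerate(lengths) if l == bits)
    return start, lut

# RFC 1951's worked example: symbols A..H with code lengths 3,3,3,3,3,2,4,4.
start, lut = build_decode_lut([3, 3, 3, 3, 3, 2, 4, 4])
print(start[2], start[3], start[4])  # 0 2 14
```

The index of a code value c with size s is then c - start[s] plus the number of shorter codes, and lut[index] is the decoded symbol.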
  • FIG. 12 depicts a schematic block diagram of one embodiment of a hardware implementation of the Huffman logic 1128 of the INFLATE pipeline 1116 of the compression/decompression module (CDM) 1012 shown in Figure 11.
  • The Huffman logic 1128 receives compressed data from the XLT 1010 and generates LZ77 length and distance code values to send to the LZ77 logic 1130.
  • The illustrated Huffman logic 1128 includes a bit buffer 1242, a set of comparators 1244, a bit selector 1246, an index adder 1248, a LUT 1250, and a shift adder 1252.
  • Other embodiments of the Huffman logic 1128 may include fewer or more components or may implement more or less functionality.
  • The bit buffer 1242 receives the compressed data from the XLT 1010. For example, the bit buffer 1242 may receive the compressed data via the input buffer 1126 of the INFLATE pipeline 1116. In one embodiment, the compressed data in the bit buffer 1242 does not include header information, which is previously stripped and processed separately. The bit buffer 1242 stores multiple bits (e.g., at least 15 bits) of the compressed data prior to sending the bits to the comparators 1244 and the index adder 1248. In some embodiments, the bit buffer 1242 may write the bits (e.g., 4 bytes at a time) to a scratch buffer (not shown) before the bits are sent to the comparators 1244 and/or the index adder 1248.
  • The set of comparators 1244 compares the buffered bits to a plurality of different preloaded values. For example, some embodiments use 14 different preloaded values, although other embodiments may use fewer or more values and corresponding comparators.
  • As an example, a dynamic code may include a total of 10 codes of length 4, 16 codes of length 6, and 16 codes of length 7. The shortest codes are numbered first, starting from 0, so they are codes 0000 thru 1001. The start code for the next code length is derived by multiplying the next code (1010) by 2 (for length 5 codes) and by 2 again to give a length 6 code. Hence, the starting code for the codes of length 6 is 101000. So the length 6 codes are numbered 101000 thru 110111.
  • The length 7 codes start at 1110000 and are numbered 1110000 thru 1111111.
  • The decimal equivalents for these code ranges are 0 thru 9 for the length 4 codes, 40 thru 55 for the length 6 codes, and 112 thru 127 for the length 7 codes.
  • Within the LUT, these codes are stored in indices 0 thru 41 because there are 42 codes. The length 4 codes are straightforward: the start code is 0 and the number of shorter codes is 0, so the index offset is 0. The length 6 codes have a start code of 40 and 10 shorter codes, so an offset of 30 is subtracted to get the index for the length 6 codes. The length 7 codes have a start code of 112 and 26 shorter codes, so an offset of 86 is subtracted to get the index for the length 7 codes.
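Checking the arithmetic of this example: the offset for each code length is its start code minus the number of shorter codes, which maps the 42 codes onto the contiguous index range 0 thru 41.

```python
# Offsets for the example: lengths 4, 6, and 7 have start codes 0, 40, and 112,
# with 0, 10, and 26 shorter codes respectively.
start = {4: 0, 6: 40, 7: 112}
shorter = {4: 0, 6: 10, 7: 26}
offset = {l: start[l] - shorter[l] for l in start}
print(offset)  # {4: 0, 6: 30, 7: 86}

# The last length-7 code (decimal 127) lands at the last index: 127 - 86 = 41.
assert 127 - offset[7] == 41
```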
  • The largest triggered comparator 1244 gives the code length. It should also be noted that codes with lengths greater than the longest code in use are compared with a 16-bit value so that they are not triggered.
  • The result of the comparison provides a bit selection value that is input to the bit selector 1246, which selects one or more values to add to the value of the bits in the bit buffer 1242. In one embodiment, the bit selector 1246 selects an index offset to add to the bits in the bit buffer 1242.
  • In another embodiment, the bit selector 1246 selects a start value to add to the bits in the bit buffer 1242. It should also be noted that multiple parameters may be selected, either independently or as a pre-computed sum value, to be added to the bits in the bit buffer 1242. As an example, the bit selector may select 1 of 15 pre-computed values using a bit selection value between 1 and 15 (e.g., using a 4-bit value). Other embodiments may use a different number of possible values. The index adder 1248 then adds the selected values to the bits in the bit buffer 1242. In some embodiments, the bit selector 1246 and the index adder 1248 may be combined into a single functional block.
  • The resulting index from the index adder 1248 is then used to look up corresponding LZ77 length or distance code values in the LUT 1250. In one embodiment, the LZ77 length and distance code values are output in an alternating manner. The LUT 1250 also may be used to look up literal values in a manner substantially similar to the length values. In one embodiment, the LUT 1250 is a 286 x 9 RAM, and the output LZ77 code value is a 9-bit code. In another embodiment, the LUT 1250 is a 286 x 13 RAM in order to accommodate a 9-bit code value and 4 bits of extra data size. In a further embodiment, the LUT 1250 is a 320 x 9 or 320 x 13 RAM to combine the length and distance code values in the LUT 1250.
  • The bit selection value from the set of comparators 1244 is also sent to the shift adder 1252 to add a shift offset to the bit selection value. In one embodiment, the shift offset is the extra data size from the LUT 1250, which is described above. The shift adder 1252 adds the extra data size to the code length from the comparators 1244 to get the total shift amount to read the next code in the data stream. In this way, the result of the shift addition is used to indicate the next variable-length Huffman code in the bit buffer 1242. Additionally, shifting the location of the bit buffer 1242 skips over one or more extra bits that are not needed for the lookup operations.
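A toy model of this bit-buffer behavior follows. LSB-first bit order (as in DEFLATE) is assumed for simplicity; the hardware's internal ordering may differ, and the class name is hypothetical.

```python
class BitBuffer:
    """Peek bits for the comparators, then shift past the decoded code and its
    extra data so the next variable-length code is exposed."""
    def __init__(self, data):
        self.data, self.pos = data, 0

    def peek(self, n):
        # Extract n bits starting at self.pos (LSB-first).
        byte, bit = divmod(self.pos, 8)
        window = int.from_bytes(self.data[byte:byte + 4], "little") >> bit
        return window & ((1 << n) - 1)

    def shift(self, code_length, extra_bits):
        # Total shift from the shift adder: code length plus extra data size.
        self.pos += code_length + extra_bits

bb = BitBuffer(bytes([0b10110100, 0xFF]))
print(bb.peek(3))  # 4
bb.shift(3, 2)     # consume a 3-bit code followed by 2 extra bits
print(bb.peek(3))  # 5
```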
  • Figure 13 depicts a schematic block diagram of another embodiment of a hardware implementation of the Huffman logic 1128 of the INFLATE pipeline 1116 of the compression/decompression module (CDM) 1012 shown in Figure 11.
  • The Huffman logic 1128 shown in Figure 13 is substantially similar to the Huffman logic 1128 shown in Figure 12. However, the Huffman logic 1128 of Figure 13 includes a demultiplexor 1354 and uses multiple LUTs 1356 and 1358 to look up the LZ77 length and distance code values, instead of using a single, combined LUT 1250.
  • The demultiplexor 1354 receives the index from the index adder 1248 and directs the index to either a length LUT 1356 or a distance LUT 1358, depending on a control signal. In one embodiment, the demultiplexor 1354 alternates sending the index to the length LUT 1356 and the distance LUT 1358. Thus, one index value is used to look up the LZ77 length (or literal) code value in the length LUT 1356, and the next index value is used to look up the corresponding LZ77 distance code value. Other embodiments may implement other combinations of demultiplexors 1354 and LUTs 1250, 1356, and 1358.
  • Figure 14 depicts a schematic flow chart diagram of one embodiment of a Huffman decoding method 1470 that may be implemented in conjunction with the Huffman logic 1128 of the INFLATE pipeline of the compression/decompression engine (CDE) 1002 shown in Figure 10.
  • Although the Huffman decoding method 1470 is described with reference to the CDE 1002 of Figure 10, other embodiments may be implemented in conjunction with other compression/decompression engines. Also, at least some of the operations of the illustrated Huffman decoding method 1470 may be implemented in parallel (e.g., interleaved) or in another order.
  • In the illustrated Huffman decoding method 1470, the Huffman logic 1128 receives 1472 the compressed data stream, and the bit buffer 1242 reads 1474 a number of bits from the variable-length bit stream. In one embodiment, the Huffman logic 1128 then determines 1476 if the next LZ77 code segment is a length code segment. If so, then the comparators 1244 compare 1478 the bits from the bit buffer 1242 with a plurality of predetermined values. In this way, the comparators 1244 identify 1480 the bit length of the corresponding LZ77 code segment.
  • The bit selector 1246 selects 1482 values corresponding to the LZ77 code segment and computes 1484 the LUT index for the next LZ77 code segment. The index is then used to look up 1486 the value of the corresponding LZ77 code segment in the LUT 1250, which outputs 1488 the LZ77 code segment for processing by the LZ77 logic 1130. Additionally, the shift adder 1252 determines 1490 if the next bits are extra bits and, if so, shifts 1492 the buffer location for the next bit buffer read operation. The illustrated Huffman decoding method 1470 then ends.
  • Figure 15 shows a method 1500 for saving and restoring a compression/decompression state, in accordance with one embodiment. As shown, data is processed, the processing including compressing or decompressing the data. See operation 1502.
  • In the context of the present description, compressing refers to any act of compressing data. For example, the compressing may include, but is not limited to, implementing lossless data compression algorithms such as Lempel-Ziv algorithms (e.g., LZ77, LZ78, etc.), Lempel-Ziv-Welch (LZW) algorithms, and Burrows-Wheeler transforms (BWT), implementing lossy data compression algorithms, and/or any other compression that meets the above definition.
  • decompressing refers to any act of decompressing the data.
  • a state of the processing is saved. See operation 1504.
  • a state refers to a condition of the processing.
  • the state may include information associated with a status of the processing.
  • FIG. 16 shows a computing environment 1600 for saving and restoring a compression/decompression state, in accordance with one embodiment.
  • the computing environment 1600 includes a compression/decompression block (CDB) 1602, a fast messaging network (FMN) station 1604, and an input-output (I/O) distributed interconnect station 1606.
  • the I/O distributed interconnect station 1606 may be part of a high speed distributed interconnect ring which connects multiple cores, caches, and processing agents.
  • the high speed distributed interconnect ring may support simultaneous transactions among the connected components.
  • the FMN 1604 provides a channel for messages directed to and from the CDB 1602.
  • the messages may direct the CDB 1602 to perform compression/decompression or indicate completion of a compression/decompression operation.
  • the CDB 1602 is configured to compress files for transfer via a bus interface unit (BIU) 1608 and to decompress compressed files received via the BIU 1608.
  • other embodiments of the CDB 1602 may be implemented in other computing environments in which compressed files may be used.
  • the CDB 1602 may include the BIU 1608, a translate block (XLT) 1610, and a compression/decompression engine (CDE) 1612.
  • the BIU 1608 may provide a data interface to the I/O distributed interconnect station 1606 and the I/O distributed interconnect ring.
  • the XLT 1610 may provide an interface between the BIU 1608 and the CDE 1612.
  • the XLT 1610 may use its own direct memory access (DMA) engine to read and write data via the BIU 1608, such that the XLT 1610 may operate autonomously from a central processing unit (CPU) coupled to the computing environment 1600.
  • the CDE 1612 may perform compression and decompression operations for the CDB 1602. It should be noted that, in various other embodiments, the CDB 1602 may include fewer or more components. Additionally, other embodiments of the CDB 1602 may implement more or less functionality than is described herein.
  • software may provide free descriptors to the CDB 1602 (e.g. at system start-up).
  • free descriptors refer to any descriptor associated with a free page in memory.
  • these free descriptors may be in a buffer such as a FIFO buffer (e.g. a free descriptor pool FIFO buffer).
  • this buffer may hold a varying number of descriptors.
  • the buffer may hold up to eight descriptors on chip and have the ability to be extended into memory.
  • This spill region in the memory may be configured in the case that the CDB 1602 is initialized with more than eight descriptors.
  • the message may first be decoded and a list of data pointers may be retrieved from memory.
  • the first pointer in the list may point to a scratch page.
  • this scratch page may be at least 1 Kbyte and be used by the CDB 1602 to store intermediate results of a compression/decompression process.
  • a save and restore feature may be implemented, which allows the CDB 1602 to store the intermediate state of a first file being processed and work on a second file.
  • the state may be restored by using the restore feature.
  • the XLT 1610 may then walk down the list of data pointers fetching the data and sending it to the CDE 1612.
  • the CDE 1612 may then perform the transformations on the data (e.g. compressions/decompressions) and return the data back to the XLT 1610.
  • the XLT 1610 may then pop two free descriptors from the buffer. In this case, the first free descriptor may be used for forming a return list. Additionally, the second descriptor may be the location where the transformed data is written.
  • the return message may be formed and sent to a "return bucket."
  • the return bucket is a field in the message that was sent from the CPU to the CDB 1602. Return buckets may be associated with CPUs in a multiprocessor system.
  • Figure 17 shows a more detailed diagram of the compression/decompression engine 1612 shown in Figure 16, in accordance with one embodiment.
  • the CDE 1612 may include a deflate pipeline 1614 and an inflate pipeline 1616.
  • the deflate pipeline 1614 may be utilized to implement the compression process performed by the CDB 1602, also referred to as the deflate process.
  • the deflate pipeline 1614 may include an input buffer 1618, LZ77 logic 1620, Huffman logic 1622, and an output buffer 1624.
  • the inflate pipeline 1616 may be utilized to implement the decompression process performed by the CDB 1602, also referred to as the inflate process.
  • the inflate pipeline 1616 may include an input buffer 1626, Huffman logic 1628, LZ77 logic 1630, and an output buffer 1632.
  • although each pipeline is shown with individual schematic components, at least some of the components may operate in conjunction with both pipelines 1614 and 1616 using a single implementation.
  • Other embodiments of the CDE 1612 may include fewer or more components.
  • the CDE 1612 may support various operating modes, including static compression, dynamic compression, and no compression.
  • a file (e.g. a data file, etc.) may be split into blocks, and each block may use any of the three modes.
  • the various blocks of a single file may be compressed using any combination of these three modes.
  • splitting the file into blocks may be performed as a pre-process before the file is presented to the CDB 1602 and/or the CDE 1612.
  • the CDB 1602 may then compress each block and use bit-stitching to recombine the compressed blocks in the deflated bit stream.
  • the deflated bit stream may be input to the CDB 1602 and the CDB 1602 may decompress the blocks individually.
  • the CDB 1602 may decompress the blocks individually according to block header information within the bit stream.
  • the deflate and inflate processes may each use two algorithms to achieve compression.
  • an LZ77 algorithm implemented by the LZ77 logic 1620 for the deflate process, may create a dictionary of strings of bytes that have occurred previously in the file.
  • the LZ77 logic 1620 may enforce a minimum string length (e.g. three bytes, etc.) for the byte strings in the dictionary.
  • the LZ77 logic 1620 may then replace strings with a distance value (e.g. up to 32,768 bytes, etc.) and a length value (e.g. up to 258 bytes, etc.) for a matching string. If no match exists, then the incoming byte may be output as a literal character.
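The string-replacement step described above can be shown with a short sketch. This is illustrative software only, not the parallel hardware implementation the patent describes: strings of at least three bytes that occurred earlier within the window are replaced by a (distance, length) pair, and unmatched bytes are emitted as literals.

```python
# Illustrative LZ77-style encoder: a brute-force window search for clarity,
# enforcing a minimum match length of 3 and a maximum of 258 (as in DEFLATE).
def lz77_encode(data, window=32768, min_match=3, max_match=258):
    out = []          # list of ('lit', byte) or ('match', distance, length)
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - window)
        for j in range(start, i):
            length = 0
            # Comparing past i is valid: it models overlapping matches.
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length >= min_match and length > best_len:
                best_len, best_dist = length, i - j
        if best_len:
            out.append(('match', best_dist, best_len))
            i += best_len
        else:
            # No suitable match: emit the byte as a literal.
            out.append(('lit', data[i]))
            i += 1
    return out
```

For example, `lz77_encode(b"abcabcabc")` yields three literals followed by a single overlapping match of distance 3 and length 6.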
  • the Huffman logic 1622 may implement a Huffman algorithm to replace the literal, length, and distance codes with codes whose length depends on the frequency of occurrence of the LZ77 codes in the block.
  • the Huffman logic 1622 may implement one of three coding modes.
  • the Huffman logic 1622 may implement static compression, dynamic compression, and no compression.
  • a predefined code may be used. Static compression coding may be executed relatively quickly. Dynamic compression coding may use one pass to create a statistics table of the frequency of occurrence of each LZ77 code and to generate an optimized Huffman code, and a second pass to make use of the Huffman code to encode the LZ77 data. In this way, a high compression ratio may be achieved.
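The statistics pass described above feeds a standard Huffman construction. As a hedged illustration (the textbook heap-based algorithm, not the engine's circuitry), code lengths can be derived from symbol frequencies; the canonical DEFLATE code is then generated from these lengths.

```python
import heapq
import itertools

def huffman_code_lengths(freqs):
    """Return a {symbol: code_length} map built from a frequency table."""
    counter = itertools.count()   # tie-breaker so heap tuples always compare
    heap = [(f, next(counter), sym) for sym, f in freqs.items() if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:            # degenerate single-symbol alphabet
        return {heap[0][2]: 1}
    while len(heap) > 1:          # repeatedly merge the two rarest nodes
        f1, _, a = heapq.heappop(heap)
        f2, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(counter), (a, b)))
    lengths = {}
    def walk(node, depth):        # a leaf at depth d receives a d-bit code
        if isinstance(node, tuple):
            walk(node[0], depth + 1)
            walk(node[1], depth + 1)
        else:
            lengths[node] = depth
    _, _, root = heap[0]
    walk(root, 0)
    return lengths
```

The resulting lengths satisfy the Kraft equality (the code is complete), which is what permits the canonical reconstruction used on the inflate side.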
  • the Huffman logic 1622 may output a serial bit stream which may be sent a byte at a time to the XLT 1610.
  • the bit stream may be packed with zeroes at the end of the file in order to finish on a byte boundary.
  • the maximum transfer rate may be approximately 3.2 Gbps at 400 MHz, although other levels of performance may be achieved using other systems.
  • the Huffman logic 1622 may parse the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. More specifically, a static lookup table (LUT) may be built upon initialization, where the LUT may be used to provide a Huffman code for every literal, length, or distance subsequently presented to it. In some embodiments, there may be thirty distance codes, each having five bits.
  • literal and length codes may be part of the same LUT (e.g. a 286-entry LUT, etc.) or part of a separate LUT.
  • each literal and length code may be seven, eight, or nine bits in size.
  • many of the length and distance codes may have extra data which follows directly after the code word, which provides a range of possible lengths or distances. The extra bits may also be used to define an exact length or distance. However, the number of extra bits may be a function of the code, with longer codes having more extra data values.
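The extra-bits scheme described above is made concrete in RFC 1951: each length code covers a range of values, and the extra bits select the exact length within that range. The base values and extra-bit counts below are taken from the DEFLATE specification (codes 257-284, with code 285 as the special case for length 258); the helper function is illustrative.

```python
# Per RFC 1951: base length and number of extra bits for codes 257..284.
LENGTH_BASE = [3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 17, 19, 23, 27, 31,
               35, 43, 51, 59, 67, 83, 99, 115, 131, 163, 195, 227]
LENGTH_EXTRA = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2,
                3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5]

def length_to_code(length):
    """Return (code, extra_bit_count, extra_value) for a match length 3..258."""
    if length == 258:
        return 285, 0, 0               # dedicated zero-extra-bit code
    for idx in range(len(LENGTH_BASE) - 1, -1, -1):
        if length >= LENGTH_BASE[idx]:
            return 257 + idx, LENGTH_EXTRA[idx], length - LENGTH_BASE[idx]
    raise ValueError("length out of range")
```

For instance, lengths 3 through 10 map one-to-one onto codes 257-264 with no extra data, while length 20 maps to code 269 with two extra bits selecting within the range 19-22.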
  • the Huffman logic 1622 may then output the deflated block, including the compressed data and other symbols.
  • the Huffman logic 1622 may implement multiple phases. For example, two phases may be implemented for dynamic Huffman coding. In the first phase (i.e. the first pass), the Huffman logic 1622 may gather statistics for each literal/length code (e.g. 286 codes in a 286-entry LUT, etc.). The Huffman logic 1622 may also gather statistics for each distance code (e.g. 30 distance codes, etc.).
  • a literal and length heap may be built, and a literal and length Huffman tree may be built. Further, the literal and length Huffman code may be generated. Similar heap, tree, and code generation operations may also be implemented for the corresponding distance value and the bit length.
  • the Huffman logic 1622 may output the bit length code sizes, the literal/length codes using the bit length code, and the distance code using the bit length code.
  • the Huffman logic 1622 may parse the literal/length and distance Huffman codes, replacing code lengths and repetition counts with equivalent bit length Huffman codes.
  • the Huffman logic 1622 may parse the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits.
  • the output literal/length codes and distance codes are also referred to as the output bit stream.
  • the Huffman logic 1622 may implement a format without further compression (i.e. the "no compression mode"). In this mode, the data may be split into blocks, with each block being up to a defined number of bytes in size (e.g. 65,535 bytes). The compression process may also add a header for this data type and output the data stream as configured.
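For reference, the stored-block format this mode corresponds to in RFC 1951 frames each block with its length LEN and the ones' complement NLEN. A minimal sketch, assuming the block header starts byte-aligned (so the 3-bit header occupies a whole byte):

```python
import struct

def stored_block(data, final=True):
    """Frame `data` as a raw-DEFLATE stored ("no compression") block."""
    assert len(data) <= 65535, "stored blocks are limited to 65,535 bytes"
    # First byte: BFINAL flag in bit 0, BTYPE=00 (stored) in bits 1-2.
    header = bytes([0x01 if final else 0x00])
    ln = len(data)
    # LEN and NLEN (its ones' complement) as little-endian 16-bit fields.
    return header + struct.pack('<HH', ln, ln ^ 0xFFFF) + data
```

Any RFC 1951 decoder accepts such a block; for example `zlib.decompressobj(-15)` (raw DEFLATE) inflates `stored_block(b"hello")` back to `b"hello"`.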
  • the inflate process is the reverse of the deflate process.
  • additional functionality may be implemented in one process and not the other.
  • some embodiments of the inflate process may be configured to process any valid compressed file, including the possibility of unlimited block size, as well as distances and lengths up to the maximums specified in the industry standards.
  • the Huffman logic 1628 may receive data from the XLT 1610 via the input buffer 1626.
  • the Huffman logic 1628 may operate in a single phase, regardless of whether the data is statically or dynamically encoded.
  • the LUT may be programmed during initialization.
  • a set of comparators may be used to determine the length of each incoming literal/length code, which may be a specified number of bits (e.g. 7, 8, or 9 bits, etc.).
  • the distance codes may all be five bits in length. However, other embodiments may use different bit lengths for the literal/length and/or distance codes.
  • An offset may then be added to the code to put it into the correct range within the LUT.
  • the output of the LUT may provide both the value of the code and the length of any extra data that is appended.
  • the dynamic LUT may be programmed on demand.
  • the Huffman logic 1628 may read and store the size (e.g. 1-7 bits, etc.) of each bit length code and determine the sum of codes of each size for the bit length codes. The Huffman logic 1628 may also determine the start code for each code size for the bit length codes. The Huffman logic 1628 may then write the bit length LUT.
  • the Huffman logic 1628 may read and store the size (e.g. 1-15 bits, etc.) of each literal/length code and determine the sum codes of each size for the literal/length codes. The Huffman logic 1628 may also determine the start code for each code size for the literal/length codes. The Huffman logic 1628 may then write the literal/length LUT.
  • the Huffman logic 1628 may also use the bit length LUT to read and store the size (e.g. 1-15 bits, etc.) of each distance code and to determine the sum of codes of each size for the distance codes.
  • the Huffman logic 1628 may also determine the start code for each code size of the distance codes and then write the distance LUT.
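The count-per-size and start-code steps described above correspond to the canonical Huffman construction of RFC 1951 section 3.2.2: count the codes of each size, compute a starting code per size, then hand out consecutive codes to symbols in order. A compact sketch:

```python
def canonical_codes(code_lengths):
    """Return [(code, length) or None] for each symbol, per RFC 1951 3.2.2."""
    max_len = max(code_lengths)
    bl_count = [0] * (max_len + 1)        # number of codes of each size
    for l in code_lengths:
        if l:
            bl_count[l] += 1
    next_code = [0] * (max_len + 1)       # start code for each size
    code = 0
    for bits in range(1, max_len + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = []
    for l in code_lengths:                # assign consecutive codes in order
        if l:
            codes.append((next_code[l], l))
            next_code[l] += 1
        else:
            codes.append(None)            # symbol not used
    return codes
```

Running this on the RFC's own example lengths (3,3,3,3,3,2,4,4) reproduces the codes 010, 011, 100, 101, 110, 00, 1110, 1111.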
  • comparators may be used to determine the size of each incoming literal/length and distance code. In one embodiment, fourteen comparators may be used, in the case that the size varies from 1 to 15 bits. Other embodiments may use other quantities of comparators.
  • the output of the LUT may give both the value of the code and the length of any extra data that is appended. Together the code and extra data may be used to recover the length or distance value. In some embodiments, literals may be treated like lengths but have no extra data. In this way, the original LZ77 sequence may be recovered and output to the LZ77 logic 1630 of the inflate pipeline 1616.
  • the LZ77 logic 1630 may then reconstruct the original file data and send the original file data via the output buffer 1632 to the XLT 1610.
  • the LZ77 logic 1630 may use the same buffer (e.g. a 32Kbyte buffer, etc.) used by the LZ77 logic 1620 of the deflate pipeline 1614.
  • the LZ77 logic 1630 may use the buffer as the source of the strings specified by the distance and length codes provided by the Huffman logic 1628 of the inflate pipeline 1616.
  • Each decoded byte may be output to the XLT 1610 and written to the buffer. In this manner, the previous set number of bytes of data (e.g. 32Kbytes of data, etc.) may always be available for reference.
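The reconstruction step can be sketched as follows (an illustrative software model, not the hardware buffer logic): literals are appended directly, and each distance/length pair copies bytes from previously decoded output, one byte at a time so that overlapping copies (distance less than length) behave correctly.

```python
def lz77_decode(tokens):
    """Rebuild the original bytes from ('lit', b) / ('match', dist, len) tokens."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == 'lit':
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                # Byte-at-a-time copy from the sliding window of output;
                # handles the overlapping case (dist < length) correctly.
                out.append(out[-dist])
    return bytes(out)
```

For example, three literals `a b c` followed by a match of distance 3 and length 6 expand to `abcabcabc`.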
  • the CDB 1602 may be designed to perform either deflate or inflate sequentially, not both at once. To make it easier to deal with long files, which could otherwise become a bottleneck, the CDB 1602 architecture may allow a "context save" to be performed, followed at some arbitrary time later by a "context restore" in order to continue (or complete) the stream.
  • the CPU may divide the blocks and decide the type of encoding to use for each block (e.g. static, dynamic, or non-compressed). The CPU may also decide where to switch to another stream. In one embodiment, the switch may occur at a block boundary. In this way, the CDB 1602 and/or the XLT 1610 may save a bit position and partial byte data for bit-stitching the blocks together.
  • a save operation during the deflate process may be performed at block boundaries of the file. For dynamic blocks, saves may always occur at the end of the second pass. In another embodiment, a save and restore during the deflate process may always occur on block boundaries, with no special processing by the CDE 1612 being implemented.
  • context may be saved for reloading the dynamic Huffman tables, re-initializing the buffer (e.g. the 32K-byte buffer), and for continuing the decoding of the bit stream from where the decoding was interrupted.
  • the inflate pipeline 1616 may send any Huffman preamble it receives to the XLT 1610, to be stored in a scratch page for later recovery.
  • the only difference between the bit stream that the inflate pipeline 1616 receives and the output bit stream is that the output preamble may start out byte aligned.
  • the output preamble may be partial data because the save may occur while the preamble is being received. In that situation, a bit in a context word may indicate that the preamble was partial. In this case, the feedback data may be sent via the output buffer 1632.
  • the XLT 1610 may save pointers to the previous 32K-bytes of the inflated buffer data in the scratch page so that on restore the data may be retrieved to allow the buffer to be re-initialized. Additionally, an 8-byte context word may be created which allows the input stream state to be recovered, such that the inflate process may later continue from where the process was interrupted. In this case, the CPU may not have any information about the incoming stream, with the exception of the amount of data that has been received.
  • the save logic may maintain a save buffer (e.g. a 6-byte buffer, etc.) which holds the last N bytes (e.g. 6 bytes, etc.) of incoming inflate data and a counter which determines how many of these bits (e.g. 48 bits, etc.) have been committed by being sent to the LZ77 logic 1630 and accepted.
  • sending a length code without a distance code may be of no value to the LZ77 logic 1630.
  • the save count may include both code lengths until the distance code is accepted.
  • the LZ77 logic 1630 may not attempt the look-up and all the remaining bits may be saved.
  • the total number of saved bits may include 15 bits for the length code, 15 bits for the distance code, a maximum of 5 bits of length extra data and a maximum of 13 bits of distance extra data, less one bit (thereby preventing the look up) for a total of 47 bits.
  • a save during the inflate process may be performed each time a dynamic inflate preamble is received.
  • the preamble data may be sent to the pipeline output buffer 1632, where the data is sent back to the XLT 1610.
  • the saved preamble may be byte-aligned at the beginning, whether or not the incoming data is aligned.
  • When a save is requested, two cases may be handled. For example, if the save request follows the preamble, then the XLT 1610 has already received and stored the entire preamble (i.e. a full save). If a save is requested during the preamble, then a partial save results. In either case, the CDB 1602 may send a context word (e.g. an 8-byte context word) which contains any fragmentary data remaining (up to 47 bits), the length of the fragmentary data (as a bit count), the state of the inflate pipeline 1616, the type of data (dynamic, static, or non-compressed), and a bit to distinguish the full and partial save types.
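The text specifies the fields of the 8-byte context word but not their layout. The packing below is purely hypothetical, shown only to illustrate that the listed fields (up to 47 fragment bits, a bit count, a 2-bit data type, and a full/partial flag) fit comfortably in 64 bits; the field positions are invented for this sketch.

```python
# Hypothetical 64-bit layout (invented for illustration):
#   bits  0-46: fragmentary data (up to 47 bits)
#   bits 47-52: bit count (0..47 fits in 6 bits)
#   bits 53-54: data type (dynamic / static / non-compressed)
#   bit     55: full (0) vs partial (1) save
def pack_context(fragment_bits, bit_count, data_type, partial):
    assert 0 <= bit_count <= 47 and fragment_bits < (1 << 47)
    word = fragment_bits
    word |= bit_count << 47
    word |= (data_type & 0x3) << 53
    word |= (1 if partial else 0) << 55
    return word.to_bytes(8, 'little')

def unpack_context(raw):
    word = int.from_bytes(raw, 'little')
    return (word & ((1 << 47) - 1), (word >> 47) & 0x3F,
            (word >> 53) & 0x3, (word >> 55) & 0x1)
```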
  • the restore begins by sending restore data.
  • the XLT 1610 may use a data type indication which may indicate normal, restore, or warm-up data.
  • the restore data may be treated much like normal data except it will be known that only the LUTs are being programmed and compressed data should not be expected. If the save was partial, the restore may run out of restore data before the LUTs are fully programmed, and may have to continue programming them when the normal data is received.
  • the XLT 1610 may then send the warm-up data to re-initialize a 32-kbyte Byte Buffer 1634.
  • although the Byte Buffer 1634 is shown as part of the CDE 1612, in another embodiment the Byte Buffer 1634 may be separate from the CDE 1612. For example, in one embodiment the Byte Buffer 1634 may be included as part of the CDB 1602.
  • the warm-up data may be sent much like the normal inflate data, however with a different data type indication.
  • the deflate pipeline 1614 will recognize this data type, and put the data in the Byte Buffer 1634 in the same way it would for normal deflate data but without doing any deflate operations.
  • the XLT 1610 may make a restore request which causes the context word (e.g. the 8-byte word) to be read back and the input buffer 1626 and an inflate state to be reset to a state that was present before the save occurred.
  • the XLT 1610 will be ready to send normal data.
  • the CDE 1612 will either continue programming the LUTs if the save was partial, or begin looking up codes if the save was full.
  • the input buffer 1626 could be empty or there may still be data waiting to be processed. However full the input buffer 1626 may be, there should still be room for the maximum 47 bits of restore data to be added without it overflowing.
  • a maximum of 8 bytes may be stored in the input buffer. This is sufficient because in the worst case the buffer will have at most 17 bits of data (i.e. 17 bits of data representing a static distance code of 5 bits plus 13 bits of extra data, less one bit, which prevents the distance code from being looked up).
  • the CDB 1602 will continue from where it was halted. It should be noted that the CDB 1602 receives only data to be deflated or the bit stream data portion of packets to be inflated. In one embodiment, header information and cyclic redundancy check (CRC) data may be stripped (e.g. by software) from inflate packets before sending the raw inflate data to the CDB 1602.
  • data may be sent to a CRC block within the CDB 1602 which computes both an Adler checksum and a CRC value. These checksums may then be written to the scratch space and may be available for software to either append to a deflate packet or compare with an inflate packet checksum.
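Both checksums named above are standard and support incremental ("running") computation, which is what makes saving a partial checksum to the scratch space workable. Python's zlib module exposes both:

```python
import zlib

data = b"compression engine"
crc = zlib.crc32(data)        # CRC-32, as used by gzip framing
adler = zlib.adler32(data)    # Adler-32, as used by zlib framing

# Incremental computation over two chunks matches the one-shot value,
# mirroring how a partial checksum written to the scratch space can be
# restored and continued later.
partial = zlib.crc32(data[:8])
assert zlib.crc32(data[8:], partial) == crc
```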
  • Figure 18 shows a method 1800 for saving and restoring a compression/decompression state, in accordance with another embodiment.
  • the present method 1800 may be implemented in the context of the functionality and architecture of Figures 15-17. Of course, however, the method 1800 may be implemented in any desired environment. Again, the aforementioned definitions may apply during the present description.
  • data is processed by compressing or decompressing the data.
  • the processing may be facilitated by communication of one or more messages 1802, including various information, from a CPU to a compression/decompression block.
  • Table 1 shows various information that may be included in the message 1802, in accordance with one embodiment.
  • the SRC ADDR field may point to a list of descriptors in memory.
  • Table 2 shows a list of descriptors in accordance with one embodiment.
  • Table 3 shows definitions for the descriptors in Table 2.
  • Type1, Type0 - Useful for deflate. For inflate, the incoming data has to be decoded to determine the block type.
  • Restore - Restore context. This may be set only on the first entry of the descriptor list.
  • the message 1802 may then be decoded and a list or data structure 1804 including data pointers may be retrieved from memory using a DMA operation. As shown, a first pointer in the data structure 1804 points to a scratch page 1806. Additionally, the data structure 1804 and/or the scratch page 1806 may include a preamble of the data. In this case, the preamble may include information associated with the data. Further, the preamble may include a last-received preamble.
  • the scratch page 1806 may be utilized to store intermediate results of the compression/decompression.
  • the scratch page may store R0-RN, which are pointers to the decompressed data output from the CDE.
  • the data structure 1804 may contain the pointer to this scratch page in the first entry of the list.
  • the state of the processing may be saved so that other data can be processed. In this case, the state of the processing may be restored once the other data is processed.
  • the state of the processing may be restored utilizing a direct memory access operation.
  • the data structure 1804 may further include error correction information.
  • the error correction information may be utilized in conjunction with processing the data.
  • the scratch page 1806 may hold data used by the compression/decompression block to store or retrieve intermediate results.
  • intermediate results may be a partial CRC, a partial Adler checksum, and/or dynamic Huffman codes.
  • the next set of descriptors following the scratch page descriptors may be associated with warm up data.
  • This data may be used by the compression/decompression block for warming up a dictionary (e.g. a buffer, etc.) associated with the block.
  • the SOD (Start of Data) bit may be set for the page where the first data starts.
  • the descriptors may have to be repeated twice. In this case, all descriptors starting with the Start of Block (SOB) and ending with End of Block (EOB) of each dynamic deflate may be repeated in the same sequence.
  • the last descriptor in the list may or may not have the SAVE bit set. If the SAVE bit is set, the intermediate results may be stored back into the SCRATCH PAGE ADDR.
  • Figures 19A-19C show a method 1900 for saving and restoring a compression/decompression state, in accordance with another embodiment.
  • the present method 1900 may be implemented in the context of the functionality and architecture of Figures 15-18. Of course, however, the method 1900 may be carried out in any desired environment. Further, the aforementioned definitions may apply during the present description.
  • data is stored in memory. See operation 1902. Further, memory is allocated for a list (i.e. a data structure). See operation 1904. In one embodiment, the memory may be allocated by a CPU.
  • the list is defined. See operation 1906.
  • defining the list may include setting a first entry in the list equal to a scratch page.
  • defining the list may include setting a plurality of entries equal to the data. For example, the list may include a first entry "0" pointing to the scratch page and a plurality of entries "1 through N" pointing to data entries.
  • the list is sent to a compression/decompression block (CDB). See operation 1908.
  • the list or information pointing to the list may be included in a message that is sent.
  • the list may be sent utilizing the CPU.
  • Once the message is sent to the CDB, it is determined whether the message is received by the CDB. See operation 1910.
  • the message may be received utilizing a variety of devices.
  • the message may be received using a bus interface unit and/or a translate block as illustrated in Figure 16.
  • the message may take different formats in various embodiments.
  • a DMA operation is performed. See operation 1912. Further, a scratch page associated with the message is read. See operation 1914. Once the scratch page is read, a context associated with the scratch page/list is sent. See operation 1916. In one embodiment, the context may be sent to a compression/decompression engine.
  • Once the context is sent, it is determined whether a preamble is present in the scratch page (e.g. see the scratch page 1806 of Figure 18). See operation 1918. If a preamble is present, the preamble is read and the preamble is sent to the compression/decompression engine. See operations 1920 and 1922. In this case, the preamble may be read using a DMA read.
  • Next, it is determined whether warm-up data is present. See operation 1924 of Figure 19B.
  • the warm-up data may refer to the entries R0-RN shown in the scratch page 1806 of Figure 18. If warm-up data is present, the warm-up data is read and the warm-up data is sent to the compression/decompression engine. See operations 1926 and 1928. In this case, the warm-up data may be read using a DMA read. Additionally, the warm-up data may include any data that is pointed to by descriptors included in the list.
  • operations 1918 through 1928 occur when a restore process is being implemented. In the case that a restore is not being implemented, such operations may be omitted. In one embodiment, a bit may be included in the message indicating whether a restore process is to be implemented.
  • new data is read and sent. See operations 1930 and 1932.
  • the new data refers to data that has not yet been read.
  • the new data may be data associated with a new message.
  • determining whether inflated data from the compression/decompression engine is present may occur in parallel with operations 1934 and 1936.
  • a free page is popped out of a buffer (e.g. a FIFO) to store the list. See operation 1940. Further, a descriptor is stored to the list. See operation 1942. In this case, the descriptor may include an address of the last free page popped from the buffer.
  • the same descriptors stored to the list are stored in the scratch page. See operation 1950. Still yet, a first cache line of the scratch page is written. See operation 1952. In this case, data written to the first line of the scratch page may include information indicating whether the list is associated with a save or restore operation. Additionally, the data written to the first line of the scratch page may include CRC data and/or an Adler checksum. Once the first cache line of the scratch page is written, the message is returned, the message pointing to the list. See operation 1954.

Abstract

An apparatus to implement a deflate process in a compression engine. An embodiment of the apparatus includes a hash table, a dictionary, comparison logic, and encoding logic. The hash table is configured to hash a plurality of characters of an input data stream to provide a hash address. The dictionary is configured to provide a plurality of distance values in parallel based on the hash address. The distance values are stored in the dictionary. The comparison logic is configured to identify a corresponding length for each matching distance value from the plurality of distance values. The encoding logic is configured to encode the longest length and the matching distance value as a portion of an LZ77 code stream.

Description

System and Method for Compression Processing Within a
Compression Engine
FIELD OF THE INVENTION
[0001] The present invention relates to compressing or decompressing, and more particularly to allocating resources during compression or decompression.
BACKGROUND
[0002] LZ77 is the common name of a lossless data compression algorithm.
LZ77 is used as a part of the GNU zip (gzip) DEFLATE process, as specified in RFC 1951. Figure 1 illustrates a conventional compression application 10 which uses the DEFLATE process to transform a file 12 into a compressed file 14. An inverse operation, denoted the INFLATE process, is used to decompress the compressed file 14 to recreate the original file 12. In the DEFLATE process, files 12 are first compressed using LZ77, and then the resulting LZ77 code is Huffman coded to provide an even better compression performance.
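The DEFLATE/INFLATE round trip described here can be exercised with any RFC 1951 implementation; for example, Python's zlib module, where a negative wbits value selects a raw DEFLATE stream with no zlib or gzip header:

```python
import zlib

# Raw DEFLATE round trip: LZ77 followed by Huffman coding, then the inverse.
original = b"the quick brown fox jumps over the lazy dog " * 100
co = zlib.compressobj(wbits=-15)           # -15: raw DEFLATE, 32K window
deflated = co.compress(original) + co.flush()
inflated = zlib.decompressobj(wbits=-15).decompress(deflated)

assert inflated == original
assert len(deflated) < len(original)       # repetitive input compresses well
```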
[0003] Figure 2 illustrates a conventional LZ77 process 20. In the conventional
LZ77 process 20, the file 12 is read character by character. In Figure 2, the file 12 is represented by the incoming data stream 22, which is subdivided into bytes. Each byte represents one character. Each character is hashed with the preceding two characters, using a hash table 24, to provide a hash address into a dictionary. In conventional software implementations of gzip, the dictionary contains an index into a linked list 26, which contains a series of addresses (ending with a null address). Each address in the linked list 26 points to a place in the input stream, which is stored in a byte buffer 28, where the same sequence of three characters has occurred previously. In the conventional LZ77 process 20, the previous characters of the input data stream 22 are copied into the byte buffer 28, and the addresses of the linked list 26 point to locations in the byte buffer 28. Typically, these addresses are valid for positional distances up to 32K characters, because the byte buffer 28 stores the previous 32K characters.
[0004] In conventional software implementations of the LZ77 process 20, the input data stream is compared to the previous bytes (i.e., the bytes in the byte buffer 28 at the location pointed to by the address in the linked list 26) to determine how many bytes are similar. The comparator 30 performs this comparison for each address in the series of addresses corresponding to the hash address until it finds a suitable match. In other words, this process is performed serially for each address in the linked list 26 that corresponds to the hash address. The serial nature of these operations affects the speed of the conventional LZ77 implementation. Additionally, the performance of the conventional LZ77 implementations is affected by the size of the linked list 26.
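The hash-table-plus-linked-list structure described above can be modeled in a few lines. The hash function here is illustrative, not the one gzip actually uses: the table head holds the most recent position for each hash, and a prev array chains back to earlier positions with the same hash, so matching walks a chain instead of scanning the whole window.

```python
def build_chains(data, hash_bits=12):
    """Index every 3-byte sequence of `data` into hash chains."""
    mask = (1 << hash_bits) - 1
    head = {}                 # hash -> most recent position with that hash
    prev = [0] * len(data)    # position -> previous position, same hash
    for i in range(len(data) - 2):
        # Illustrative 3-byte hash (gzip uses a different rolling hash).
        h = (data[i] << 10 ^ data[i + 1] << 5 ^ data[i + 2]) & mask
        prev[i] = head.get(h, -1)   # -1 plays the role of the null address
        head[h] = i
    return head, prev
```

In `b"abcXabc"` the trigram `abc` occurs at positions 0 and 4, so the head entry for its hash points at position 4 and the chain links position 4 back to position 0.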
[0005] The LZ77 process 20 then encodes the distance (corresponding to the location in the byte buffer 28) and the length (corresponding to the number of similar bytes starting at the location in the byte buffer 28) of the match to derive part of the LZ77 code stream. If there is no suitable match, the current byte is output as a literal, without further encoding. Hence, the LZ77 code stream is made up of encoded distance/length pairs and literals. The LZ77 code stream is then supplied to a Huffman encoder for further compression.
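The structure of the resulting code stream can be illustrated with a deliberately naive encoder/decoder pair; a brute-force linear scan stands in for the hash-table and linked-list lookup described above, and all names are illustrative:

```python
def lz77_encode(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Emit a list of ('lit', byte) and ('match', length, distance) codes."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # DEFLATE caps match lengths at 258 bytes.
            while i + k < len(data) and k < 258 and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= min_match:
            out.append(("match", best_len, best_dist))
            i += best_len
        else:
            out.append(("lit", data[i]))
            i += 1
    return out

def lz77_decode(codes) -> bytes:
    buf = bytearray()
    for code in codes:
        if code[0] == "lit":
            buf.append(code[1])
        else:
            _, length, dist = code
            for _ in range(length):
                buf.append(buf[-dist])  # overlapping copies are allowed
    return bytes(buf)

stream = lz77_encode(b"abcabcabcx")
assert ("match", 6, 3) in stream          # "abcabc" copied from 3 bytes back
assert lz77_decode(stream) == b"abcabcabcx"
```

Note the match with distance 3 and length 6: LZ77 copies may overlap the current position, which is how short periodic strings compress so well.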
[0006] Huffman coding is an encoding algorithm for lossless data compression.
Huffman coding uses a variable-length code table for encoding a source symbol such as a character in a file. In general, the variable-length code table is derived from the number of occurrences of each source symbol in the file.
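As a hedged sketch, such a frequency-derived code table can be built with a priority queue; DEFLATE additionally canonicalizes the codes and caps their lengths, which this toy version does not do:

```python
import heapq
from collections import Counter

def huffman_code(data: bytes) -> dict:
    """Map each symbol to a bit string; frequent symbols get shorter codes."""
    freq = Counter(data)
    if len(freq) == 1:                       # degenerate single-symbol input
        return {next(iter(freq)): "0"}
    # Heap entries carry a tie-break counter so trees are never compared.
    heap = [[n, i, [(sym, "")]] for i, (sym, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        lo = heapq.heappop(heap)             # two least-frequent subtrees
        hi = heapq.heappop(heap)
        merged = [(s, "0" + c) for s, c in lo[2]] + \
                 [(s, "1" + c) for s, c in hi[2]]
        heapq.heappush(heap, [lo[0] + hi[0], i, merged])
        i += 1
    return dict(heap[0][2])

codes = huffman_code(b"aaaabbc")
assert len(codes[ord("a")]) <= len(codes[ord("b")]) <= len(codes[ord("c")])
```

The resulting code is prefix-free: no code word is a prefix of another, so the concatenated bit stream can be decoded without separators.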
[0007] Conventional Huffman coding is used as a part of the GNU zip (gzip)
DEFLATE and INFLATE processes, as specified in RFC 1951. Figure 9 illustrates a conventional compression application 910 which uses the DEFLATE and INFLATE processes to transform between a file 912 and a compressed file 914. In particular, the DEFLATE process converts the file 912 into a compressed file 914. The INFLATE process is an inverse process used to decompress the compressed file 914 to recreate the original file 912. In the DEFLATE process, files 912 are first compressed using LZ77, and then the resulting LZ77 code is Huffman coded to provide an even better compression performance. The INFLATE process implements Huffman decoding to recover the LZ77 code, and then decompresses the LZ77 code to recreate the files 912. [0008] In conventional implementations of the INFLATE process, a series of lookups are implemented using the variable-length Huffman code values to find the LZ77 code values used in a subsequent decoding operation. These longest-prefix lookup operations are typically implemented in software using an associative array. Other conventional hardware implementations use a ternary content-addressable memory (CAM) structure. However, associative arrays and ternary CAMs have certain disadvantages. For example, ternary CAMs are relatively large so they consume a significant amount of circuit area.
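One software analogue of a comparator-based longest-prefix lookup (as opposed to an associative array or ternary CAM) exploits canonical Huffman properties: after each bit is shifted in, a magnitude comparison against precomputed per-length values decides whether a complete code has been seen. The sketch below is our own illustration, not the patent's hardware design, and assumes a well-formed bit stream:

```python
def make_decoder(lengths):
    """lengths: dict symbol -> code length for a canonical (RFC 1951) code."""
    max_len = max(lengths.values())
    count = [0] * (max_len + 1)
    for l in lengths.values():
        count[l] += 1
    first_code = [0] * (max_len + 1)
    code = 0
    for l in range(1, max_len + 1):          # RFC 1951 canonical numbering
        code = (code + count[l - 1]) << 1
        first_code[l] = code
    # Assign symbols in canonical order: by length, then symbol value.
    table, next_code = {}, first_code[:]
    for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
        l = lengths[sym]
        table[(l, next_code[l])] = sym
        next_code[l] += 1

    def decode(bits):
        out, acc, l = [], 0, 0
        for b in bits:
            acc, l = (acc << 1) | b, l + 1
            # The "comparator" step: a complete code of length l satisfies
            # first_code[l] <= acc < first_code[l] + count[l].
            if first_code[l] <= acc < first_code[l] + count[l]:
                out.append(table[(l, acc)])
                acc, l = 0, 0
        return out
    return decode

decode = make_decoder({"A": 1, "B": 2, "C": 3, "D": 3})
# Canonical codes: A=0, B=10, C=110, D=111.
assert decode([0, 1, 0, 1, 1, 0, 1, 1, 1]) == ["A", "B", "C", "D"]
```

In hardware, the per-length comparisons can run in parallel across all code lengths, which is the attraction over a serial table walk.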
[0009] During the compression and decompression of incoming data (e.g. files, etc.), conventional systems typically compress and decompress each file on a first-arrived basis. For example, such systems may allocate all available resources to compress/decompress a first incoming file until such processing is finished, after which the system may allocate all available resources to compress/decompress a second incoming file, and so forth. Sometimes, in a situation where a particular file is large, a latency for processing the same may impose an unacceptable delay in processing subsequent files. In such situations, it is often desired to pause the processing of the larger file until later, so that resources may be first allocated to other smaller files, etc. [00010] There is thus a need for addressing these and/or other issues associated with the prior art.
SUMMARY
[00011] Embodiments of a method are described. In one embodiment, the method is a method for DEFLATE processing within a compression engine. An embodiment of the method includes hashing a plurality of characters of an input data stream to provide a hash address into a dictionary. The method also includes reading a plurality of distance values in parallel from the dictionary based on the hash address. The distance values are stored in the dictionary. The method also includes identifying a corresponding length value for each of the plurality of distance values via a matching process. The method also includes encoding the longest length value and the matching distance value as a portion of a LZ77 code stream. Other embodiments of the method are also described. [00012] Embodiments of an apparatus are also described. In one embodiment, the apparatus is an apparatus to implement a DEFLATE process in a compression engine. An embodiment of the apparatus includes a hash table, a dictionary, comparison logic, and encoding logic. The hash table is configured to hash a plurality of characters of an input data stream to provide a hash address. The dictionary is coupled to the hash table. The dictionary is configured to provide a plurality of distance values in parallel based on the hash address. The distance values are stored in the dictionary. The comparison logic is coupled to the dictionary. The comparison logic is configured to identify a corresponding length value for each of the plurality of distance values. The encoding logic is coupled to the comparison logic. The encoding logic is configured to encode the longest length value and the matching distance value as a portion of a LZ77 code stream. Other embodiments of the apparatus are also described.
[00013] Other aspects and advantages of embodiments of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrated by way of example of the principles of the invention.
[00014] Embodiments of a method are described. In one embodiment, the method is a method for Huffman decoding within a compression engine. An embodiment of the method includes receiving a compressed data stream. The method also includes comparing a portion of the compressed data stream with a plurality of predetermined values using a plurality of comparators. The method also includes outputting a LZ77 code value based on the portion of the compressed data stream and a comparison result from comparing the portion of the compressed data stream with the plurality of predetermined values. Other embodiments of the method are also described. [00015] Embodiments of an apparatus are also described. In one embodiment, the apparatus is an apparatus to implement Huffman decoding in an INFLATE process in a compression engine. An embodiment of the apparatus includes a bit buffer, a set of comparators, and a lookup table. The bit buffer stores a portion of a compressed data stream. The set of comparators compares the portion of the compressed data stream with a plurality of predetermined values. The lookup table stores a plurality of LZ77 code segments and outputs one of the LZ77 code segments corresponding to an index at least partially derived from a comparison result from the set of comparators. Other embodiments of the apparatus are also described.
[00016] Other aspects and advantages of embodiments of the present invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrated by way of example of the principles of the invention.
[00017] A system, method, and computer program product are provided for saving and restoring a compression/decompression state. In operation, data is processed, the processing including compressing or decompressing the data. Additionally, a state of the processing is saved. Further, the state of the processing is restored.
BRIEF DESCRIPTION OF THE DRAWINGS
[00018] Figure 1 illustrates a conventional compression application which uses the
DEFLATE process to transform a file into a compressed file.
[00019] Figure 2 illustrates a conventional LZ77 process.
[00020] Figure 3 depicts a schematic block diagram of one embodiment of a computing environment.
[00021] Figure 4 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module shown in Figure 3.
[00022] Figure 5 depicts a schematic block diagram of one embodiment of the
LZ77 process that may be implemented by the LZ77 logic of the compression/decompression module shown in Figure 4.
[00023] Figure 6 depicts a schematic timing diagram of one embodiment of a data flow for the LZ77 process shown in Figure 5.
[00024] Figure 7 depicts a schematic flow chart diagram of one embodiment of a compression method that may be implemented in conjunction with the LZ77 logic of the compression/decompression engine shown in Figure 4.
[00025] Figure 8 depicts a schematic flow chart diagram of a more detailed embodiment of the dictionary read operation shown in the compression method of Figure
7.
[00026] Figure 9 illustrates a conventional compression application which uses the
DEFLATE and INFLATE processes to transform between a file and a compressed file. [00027] Figure 10 depicts a schematic block diagram of one embodiment of a computing environment.
[00028] Figure 11 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module shown in Figure 10.
[00029] Figure 12 depicts a schematic block diagram of one embodiment of a hardware implementation of the Huffman logic of the INFLATE pipeline of the compression/decompression module shown in Figure 11.
[00030] Figure 13 depicts a schematic block diagram of another embodiment of a hardware implementation of the Huffman logic of the INFLATE pipeline of the compression/decompression module shown in Figure 11.
[00031] Figure 14 depicts a schematic flow chart diagram of one embodiment of a
Huffman decoding method that may be implemented in conjunction with the Huffman logic of the INFLATE pipeline of the compression/decompression engine shown in Figure 11.
[00032] Figure 15 shows a method for saving and restoring a compression/decompression state, in accordance with one embodiment.
[00033] Figure 16 shows a computing environment for saving and restoring a compression/decompression state, in accordance with one embodiment.
[00034] Figure 17 shows a more detailed diagram of the compression/decompression engine (CDE) shown in Figure 16, in accordance with one embodiment.
[00035] Figure 18 shows a method for saving and restoring a compression/decompression state, in accordance with another embodiment. [00036] Figures 19A-19C show a method for saving and restoring a compression/decompression state, in accordance with another embodiment.
[00037] Throughout the description, similar reference numbers may be used to identify similar elements.
DETAILED DESCRIPTION
[00038] In the following description, specific details of various embodiments are provided. However, some embodiments may be practiced with less than all of these specific details. In other instances, certain methods, procedures, components, structures, and/or functions are described in no more detail than is necessary to enable the various embodiments of the invention, for the sake of brevity and clarity.
[00039] While many embodiments are described herein, at least some of the described embodiments facilitate reading, in parallel, a plurality (e.g., four) of distance values from a dictionary based on a single hash address. The distance values are used to compare, in parallel, a corresponding plurality of byte streams from a byte buffer with an input data stream. As mismatches are found between the byte streams and the input data stream, the non-matching byte streams are dropped from consideration until a single comparison remains. In some embodiments, the last remaining byte stream is the longest matching byte stream. Alternatively, some embodiments track the lengths of multiple byte streams and perform a priority encode to select the longest. In the event that two or more byte streams are of the same length, the byte stream with the shortest distance value may be chosen so that the resulting LZ77 code potentially contains less data. [00040] Additionally, some embodiments keep the dictionary small in size. For example, some embodiments of the dictionary have about 2K entries (e.g., based on 11- bit entry addresses). Although a smaller dictionary size may mean that more character combinations hash to the same value, the number of unusable hashes can be limited. In one embodiment, the dictionary also stores one or more characters (e.g., the first two characters) from the corresponding byte stream in the byte buffer. When the addresses are read out from the dictionary, the corresponding characters are compared with the input data stream, and the addresses corresponding to non-matching characters are discarded. This may limit the number of unusable hashes and decrease the time that the hardware spends comparing the byte streams from the byte buffer with the input data stream.
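The tie-break described above can be stated compactly; this is an illustrative sketch in which candidate (length, distance) tuples stand in for the hardware's counters and match positions:

```python
def pick_best(candidates):
    """candidates: list of (length, distance) match results; None if empty.
    Longest match wins; on equal lengths, prefer the smallest distance,
    since shorter distances tend to encode into fewer bits."""
    if not candidates:
        return None
    return max(candidates, key=lambda ld: (ld[0], -ld[1]))

assert pick_best([(5, 100), (7, 30), (7, 12), (2, 4)]) == (7, 12)
```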
[00041] In some embodiments, the byte buffer is arranged to store sixteen bytes in each storage location. This allows a comparison of up to sixteen bytes per cycle (although the first and last cycles of a matching operation may compare less than sixteen bytes). By allowing comparisons of sixteen bytes at a time, match operations may be accelerated.
[00042] Additionally, some embodiments update the dictionary whenever a literal is output or at the end of each matching operation. In some embodiments, the dictionary is not updated on every byte comparison (unlike conventional software implementations).
This exemplary update schedule offers acceptable performance since the dictionary uses four match positions. Also, this update schedule may save cycles where a single-ported dictionary random access memory (RAM) is implemented.
[00043] Figure 3 depicts a schematic block diagram of one embodiment of a computing environment 100. The illustrated computing environment 100 includes a compression/decompression engine (CDE) 102, a fast messaging network (FMN) station
104, and an input-output (I/O) distributed interconnect station 106. An exemplary embodiment of the CDE 102 is described in more detail below.
[00044] In one embodiment, the I/O distributed interconnect station 106 is part of a high speed distributed interconnect ring which connects multiple cores, caches, and processing agents. The high speed distributed interconnect ring supports simultaneous transactions among the connected components.
[00045] The FMN 104 provides a channel for messages directed to and from the
CDE 102. In some embodiments, the messages may direct the CDE 102 to perform compression or indicate completion of a compression operation.
[00046] In general, the CDE 102 is configured to compress and decompress files for transfer within the computing environment 100. Alternatively, other embodiments of the CDE 102 may be implemented in other computing environments in which compressed files may be used. The illustrated CDE 102 includes a bus interface unit
(BIU) 108, a translate block (XLT) 110, and a compression/decompression module
(CDM) 112. The BIU 108 provides a data interface to the I/O distributed interconnect station 106 and the I/O distributed interconnect ring. The XLT 110 provides an interface between the BIU 108 and the CDM 112. In one embodiment, the XLT 110 uses its own direct memory access (DMA) engine to read and write data via the BIU 108, so the XLT 110 may operate autonomously from a central processing unit (CPU) coupled to the computing environment 100. The CDM 112 performs compression and decompression operations for the CDE 102. A more detailed embodiment of the CDM is shown in Figure 4 and described below. Other embodiments of the CDE 102 may include fewer or more components. Additionally, other embodiments of the CDE 102 may implement more or less functionality than is described herein.
[00047] Figure 4 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module (CDM) 112 shown in Figure 3. The illustrated CDM 112 includes a DEFLATE pipeline 114 and an INFLATE pipeline 116. The DEFLATE pipeline 114 is available to implement the CDE compression process, also referred to as the DEFLATE process. The illustrated DEFLATE pipeline 114 includes an input buffer 118, LZ77 logic 120, Huffman logic 122, and an output buffer 124. The INFLATE pipeline 116 is available to implement the CDE decompression process, also referred to as the INFLATE process. The illustrated INFLATE pipeline 116 includes an input buffer 126, Huffman logic 128, LZ77 logic 130, and an output buffer 132. Although each pipeline is shown with individual schematic components, at least some of the components may operate in conjunction with both pipelines 114 and 116 using a single implementation. Other embodiments of the CDM 112 may incorporate fewer or more components.
[00048] For both the DEFLATE pipeline 114 and the INFLATE pipeline 116, the
CDM supports various operating modes, including static compression, dynamic compression, and no compression. A file such as the file 12 of Figure 1 may be split into blocks, and each block may use any of the three modes. Hence, the various blocks of a single file may be compressed using any combination of these three modes. [00049] For the DEFLATE process, splitting the file into blocks is performed as a pre-process before the file is presented to the CDE 102. The CDE 102 then compresses each block and uses bit-stitching to recombine the compressed blocks in the deflated bit stream. For the INFLATE process, the deflated bit stream is input to the CDE 102 and the CDE decompresses the blocks individually, according to the block header information within the bit stream.
[00050] The DEFLATE and INFLATE processes use two algorithms to achieve compression. The LZ77 algorithm, implemented by the LZ77 logic 120 for the DEFLATE process, creates a dictionary of strings of bytes that have occurred previously in the file. In one embodiment, the LZ77 logic 120 enforces a minimum string length (e.g., three bytes) for the byte strings in the dictionary. The LZ77 logic 120 then replaces strings with a distance value (e.g., up to 32,768 bytes) and a length value (e.g., up to 258 bytes) for a matching string. If no match exists, then the incoming byte is output as a literal character.
[00051] Subsequently, the Huffman logic 122 (for the DEFLATE process) implements the Huffman algorithm to replace the literal, length, and distance codes with codes whose length depends on the frequency of occurrence of the LZ77 codes in the block. More specifically, the Huffman logic 122 implements one of three coding modes: static compression, dynamic compression, and no compression. For static compression, a predefined code is used which is not necessarily ideal for the block being coded, but still typically achieves good compression. Static compression coding may be executed relatively quickly. Dynamic compression coding, in contrast, may be slower since it uses two passes — one pass to create a statistics table of the frequency of occurrence of each LZ77 code and to generate an optimized Huffman code, and a second pass to make use of the Huffman code to encode the LZ77 data. Although dynamic coding may be slower than static coding, in some instances, it also may result in a higher compression ratio. [00052] It should also be noted that some input files, or data such as embedded image data within a file, may already be in a compressed format. As a result, the static and dynamic coding techniques of the Huffman logic 122 may be unable to compress such data further, or potentially may increase the size of the compressed data. For these types of input files, the Huffman logic 122 may implement a format without further compression (i.e., the "no compression mode"). In this mode, the data are split into blocks, with each block having up to approximately 65,535 bytes in size. The compression process also adds a header for this data type and then outputs the data stream as is.
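The stored-block format used by this "no compression" mode is specified in RFC 1951; as a hedged sketch, such a block can be constructed by hand and verified against a standard inflater:

```python
import struct
import zlib

def stored_block(data: bytes, final: bool = True) -> bytes:
    # One byte-aligned header byte: BFINAL in bit 0, BTYPE=00 (stored)
    # in bits 1-2, with the remaining bits as padding.
    assert len(data) <= 0xFFFF           # up to 65,535 bytes per block
    header = b"\x01" if final else b"\x00"
    # LEN and its one's complement NLEN, both 16-bit little-endian.
    return header + struct.pack("<HH", len(data), len(data) ^ 0xFFFF) + data

payload = b"already-compressed bytes pass through unchanged"
assert zlib.decompress(stored_block(payload), -15) == payload
```

The five bytes of overhead (header, LEN, NLEN) are the price of passing incompressible data through the DEFLATE framing without expanding it further.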
[00053] Figure 5 depicts a schematic block diagram of one embodiment of the
LZ77 process that may be implemented by the LZ77 logic 120 of the compression/decompression module (CDM) 112 shown in Figure 4. The illustrated LZ77 logic 120 receives an input data stream 142 and includes a hash table 144, a dictionary 146, distance logic 148, a byte buffer 150, comparison logic 152 with one or more counters 154, and encoding logic 156. Other embodiments of the LZ77 logic 120 may include fewer or more components or may implement more or less functionality. [00054] Within the CDE 102, data are received from the XLT 110 by the input buffer 118 of the DEFLATE pipeline 114. In one embodiment, the input buffer 118 is a first-in-first-out (FIFO) buffer. In some embodiments, the data are received as 32-byte cache lines, with a byte count to indicate how many bytes are valid on the last word. Words are then written by the LZ77 logic 120 to both a 128-byte input buffer (not shown) and to the byte buffer 150. In one embodiment, the byte buffer 150 is a 32-Kbyte buffer which stores up to the last 32 Kbytes of the input data stream 142. The data stored in the byte buffer 150 are used, at least in some instances, as reference data whenever a match is being determined.
[00055] As the input data stream 142 (e.g., an input file) is read character by character, each character is hashed with the preceding two characters, using the hash table 144, to provide a hash address into the dictionary 146. The dictionary 146 stores buffer locations for matching. In one embodiment, every 3 input bytes from the input data stream 142 are hashed to provide an 11-bit address. Based on the 11-bit hash address, the dictionary 146 may store approximately 2K entries. In each entry of the dictionary 146, up to four possible match entries are stored. In some embodiments, the hash table 144 and the dictionary 146 may be combined into a single, functional block. [00056] In one embodiment, each match entry includes a match position, a valid bit, and the first two characters of the string at the location in the byte buffer 150 indicated by the match position. The inclusion of one or more characters of the string, at the location in the byte buffer 150, within the match entry allows the distance logic 148 to quickly reject one or more of the match entries if the stored characters do not match the characters from the input data stream 142. Hence, in one embodiment, only good matches (i.e., match entries with stored characters that match the characters from the input data stream 142) proceed in the depicted DEFLATE process. Other embodiments of the match entry may include fewer or more match entry fields. [00057] Using the information from the dictionary entry (including, for example, up to four match entries), the locations in the byte buffer 150 are read. In one embodiment, the byte streams beginning at the locations in the byte buffer 150 are read 16 bytes at a time. Each byte stream read from the byte buffer 150 is compared with the bytes from the input data stream 142 by the comparison logic 152.
In one embodiment, interleaved reads from the byte buffer 150 allow multiple byte streams to be read and compared simultaneously or at approximately the same time. As an example, up to four streams may be simultaneously read from the byte buffer 150 and compared with the input data stream 142. This comparison process continues until the longest matching byte stream from the byte buffer 150 is found. In one embodiment, the counter 154 (or multiple counters 154) is used to count the progress of each comparison between a byte stream from the byte buffer 150 and the input data stream 142. In another embodiment, the comparison logic 152 may be configured to stop any comparisons that reach a maximum count (e.g., 258 bytes). If multiple byte streams have the same length or reach the maximum count, then the comparison logic 152 may designate one of the byte streams as the best match. In another embodiment, the comparison logic 152 may determine that there are no matches and output the byte from the input data stream 142 as a literal.
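The character prefilter described above, which rejects match entries whose stored first two characters disagree with the input before the byte buffer is ever read, can be sketched as follows; the field names are illustrative:

```python
def candidate_positions(entries, in0, in1):
    """entries: up to four match entries read from one dictionary entry;
    each is a dict with 'valid', 'char0', 'char1', and 'pos' fields
    (field names are illustrative, not from any described register map)."""
    return [e["pos"] for e in entries
            if e["valid"] and (e["char0"], e["char1"]) == (in0, in1)]

entries = [
    {"valid": True,  "char0": "t", "char1": "h", "pos": 100},
    {"valid": True,  "char0": "t", "char1": "o", "pos": 220},  # hash collision
    {"valid": False, "char0": "t", "char1": "h", "pos": 310},  # stale entry
]
assert candidate_positions(entries, "t", "h") == [100]
```

Only the surviving positions proceed to the byte-buffer comparison, which limits the time spent on colliding hashes.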
[00058] Once a longest matching byte stream is identified, or a best match is designated, the comparison logic 152 and the distance logic 148 provide a length value and a distance value, respectively, to the encoding logic 156. In one embodiment, the encoding logic 156 encodes the length and distance values as part of an LZ77 code stream. Additionally, the encoding logic 156 may output a special code (e.g., a decimal 256) when a block is complete. Where a special code is used, the code may occur only once within the block and is used to indicate the completion of the block. The LZ77 code stream is then passed to the Huffman logic 122 of the DEFLATE pipeline 114. [00059] The INFLATE LZ77 process may be implemented using similar LZ77 logic 130 with complementary functionality. For example, the LZ77 logic 130 of the INFLATE pipeline 116 receives LZ77 coded data from the Huffman logic 128 and uses the LZ77 coded data to reconstruct the original file format. In one embodiment, the LZ77 logic 130 uses the same 32-Kbyte byte buffer 150 used in the DEFLATE process. However, in the INFLATE process the byte buffer 150 is used as the source of the strings specified by the distance and length values provided by the Huffman logic 128. Each decoded byte is output to the XLT 110 and is written to the byte buffer 150. In one embodiment, the same byte buffer 150 may be used for both the DEFLATE and INFLATE processes, which saves chip area, because the DEFLATE and INFLATE processes are not implemented simultaneously. In one embodiment, the LZ77 logic 130 provides the decompressed, reconstructed file data to the XLT 110 via the output buffer 132 and a 16-byte wide bus.
[00060] Figure 6 depicts a schematic timing diagram 160 of one embodiment of a data flow for the LZ77 process shown in Figure 5. In general, the illustrated timing diagram 160 shows how read (READ) and comparison (COMP) operations may be interleaved for multiple byte streams from the byte buffer 150. Although the exemplary timing diagram 160 shows interleaved operations for four byte streams, other embodiments may interleave fewer or more byte streams.
[00061] In cycle 1, there is a dictionary lookup operation to look up four distance values (e.g., stored in the four match entries of a dictionary entry corresponding to the hash address) from the dictionary 146. For each of the distance values, the comparison logic 152 reads bytes from the byte buffer 150 over the following cycles. In one embodiment, the comparison logic 152 reads a first byte for the first byte stream (i.e., byte stream "1") during cycle 2 of the timing diagram 160. In cycle 3, the comparison logic 152 reads the first byte for the second byte stream (i.e., byte stream "2"). Additionally, the comparison logic 152 compares the first byte from the first byte stream with the first byte from the input data stream 142. In the depicted example, the first bytes of the first byte stream and the input data stream 142 are a match. [00062] In cycle 4, the comparison logic 152 reads the first byte for the third byte stream (i.e., byte stream "3") and compares the first byte from the second byte stream with the first byte from the input byte stream 142. In this example, the first bytes from the second byte stream and the input data stream 142 are not a match. Hence, the second byte stream is dropped.
[00063] In cycle 5, the comparison logic 152 reads the first byte for the fourth byte stream (i.e., byte stream "4") and compares the first byte from the third byte stream with the first byte from the input byte stream 142. In this example, the first bytes from the third byte stream and the input data stream 142 are a match. In cycle 6, the comparison logic 152 reads the second byte for the first byte stream (i.e., byte stream "1") and compares the first byte from the fourth byte stream with the first byte from the input byte stream 142. In this example, the first bytes from the fourth byte stream and the input data stream 142 are not a match. Hence, the fourth byte stream is dropped, leaving only the first and third byte streams.
[00064] In cycle 7, the comparison logic 152 reads the second byte for the third byte stream (i.e., byte stream "3") and compares the second byte from the first byte stream with the second byte from the input byte stream 142. In this example, the second bytes from the first byte stream and the input data stream 142 are not a match. Hence, the first byte stream is dropped, leaving only the third byte stream.
[00065] In cycle 8, the comparison logic 152 reads the third byte for the third byte stream (i.e., byte stream "3") and compares the second byte from the third byte stream with the second byte from the input byte stream 142. In this example, the second bytes from the third byte stream and the input data stream 142 are not a match. However, since the third byte stream is the last byte stream, the third byte stream is identified as the longest matching byte stream, having a length value. In an alternative embodiment, the comparison logic 152 may designate either the first byte stream or third byte stream as the longest matching byte stream since they have equal length values. After identifying a best match (i.e., the longest matching byte stream), the LZ77 logic 120 may start another LZ77 process on the following cycle for the next byte in the input data stream 142. [00066] Additionally, at least some embodiments of the LZ77 logic 120 allow the dictionary 146 to be updated at about the beginning of the depicted LZ77 process. In one embodiment, each dictionary entry operates like a 4-deep FIFO. When a hash has not occurred before, the entire entry is marked invalid in a separate 2K vector stored in an external register, and the dictionary update involves writing the first entry and setting it valid. Subsequent dictionary updates shift the entries as in a FIFO. If there are already four entries, then the dictionary update may shift the oldest entry out of the dictionary to make room for the new entry. In one embodiment, an entry includes the first two characters (e.g., one byte each) that were used to compute the hash, as well as the current buffer position (e.g., fifteen bits for the block position modulo 32K) and a valid bit (e.g., for a total of 32 bits).
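The 4-deep FIFO behavior of a dictionary entry might be modeled as follows; the field names and widths follow the text (two hash characters, a 15-bit position modulo 32K, a valid bit), while the modeling itself is our own sketch:

```python
from collections import namedtuple

MatchEntry = namedtuple("MatchEntry", "char0 char1 position valid")

def update_entry(entry, char0, char1, position):
    """entry: list of up to four MatchEntry values, newest first.
    Each update pushes a new match entry to the front; once the entry
    is full, the oldest match position is shifted out the back."""
    new = MatchEntry(char0, char1, position & 0x7FFF, True)
    return ([new] + entry)[:4]

entry = []
for pos in range(5):
    entry = update_entry(entry, "a", "b", pos)
assert len(entry) == 4
assert entry[0].position == 4 and entry[-1].position == 1
```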
[00067] Figure 7 depicts a schematic flow chart diagram of one embodiment of a compression method 170 that may be implemented in conjunction with the LZ77 logic 120 of the compression/decompression engine (CDE) 102 shown in Figure 4. Although the compression method 170 is described with reference to the CDE 102 of Figure 4, other embodiments may be implemented in conjunction with other compression/decompression engines. Also, it should be noted that at least some of the operations of the illustrated compression method 170 may be implemented in parallel (e.g., interleaved) in order to process multiple byte streams simultaneously or at about the same time.
[00068] In the illustrated compression method 170, the hash table 144 reads 172 characters from the input data stream 142. In one embodiment, the hash table 144 reads the current character and the two previous characters from the input data stream 142. Alternatively, the hash table 144 may use a different combination of characters from the input data stream 142. The hash table 144 then hashes 174 the characters from the input data stream 142 to provide a hash address to the dictionary 146. Using the hash address, the dictionary 146 outputs 176 one or more (e.g., up to four) distance values. In one embodiment, the distance values are obtained simultaneously or at about the same time from the dictionary 146.
[00069] The comparison logic 152 then obtains a corresponding number of byte streams from the byte buffer 150 using the distance values provided by the dictionary 146. Each byte stream is compared 178 with the input data stream 142 to determine if the byte streams match the input data stream 142. As explained above, if the byte streams from the byte buffer 150 do not match the input data stream 142, then the non-matching byte streams are dropped, or discarded. In one embodiment, the comparison logic 152 identifies 180 the lengths of each matching byte stream from the byte stream buffer 150. The comparison logic 152 then determines 182 if one of the byte streams is the longest matching byte stream. In one embodiment, the comparison logic 152 references the count stored by each of the counters 154 to determine the longest matching byte stream. Ultimately, the byte streams that are not the longest matching byte streams are dropped (and the corresponding length and distance values are discarded). If two or more byte streams have matching lengths that qualify as the longest length, then the comparison logic 152 identifies 184 the byte stream with the matching longest length and the shortest distance. After identifying the byte stream with the longest length or the byte stream with the matching longest length and the shortest distance, the length and distance values for the selected byte stream are encoded 186 in the LZ77 code stream. The illustrated compression method 170 then ends.
[00070] As an example, the comparison logic may begin comparisons for four byte streams from the byte buffer 150. If a byte stream fails to match to the end of a 16-byte segment, then the segment is dropped. Otherwise, if the byte stream does match to the end of a 16-byte segment, then the length of the match is unknown until further matching is performed on subsequent 16-byte segments. In one embodiment, a dropped byte stream may be the longest match even though it is not the last remaining byte stream. In this case, the counters 154 may be used to determine the longest matching byte stream. As a further example, two byte streams may be compared, in which the first byte stream matches 1 byte and the second byte stream matches 15 bytes on the first 16-byte segment. On the second 16-byte segment, the first and second byte streams both match 16 bytes. On the third 16-byte segment, the first byte stream matches 16 bytes and the second byte stream matches 8 bytes. Since the second byte stream does not match to the end of the 16-byte segment, further matching is not performed for the second byte stream. However, the count for the second byte stream is maintained for eventual comparison with the count for the first byte stream. On the fourth 16-byte segment, the first byte stream matches 3 bytes. Thus, the first counter for the first byte stream counts 36 matching bytes (i.e., 1+16+16+3=36), and the second counter for the second byte stream counts 39 matching bytes (i.e., 15+16+8=39). Hence, in this example, the second byte stream is dropped before the first byte stream, but is nevertheless the longest matching byte stream.
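The counter arithmetic from this example can be checked directly; the per-segment match counts below are taken from the text:

```python
# Matching bytes observed on each 16-byte segment compare, per the example.
segment_counts_1 = [1, 16, 16, 3]   # first byte stream, matched on four segments
segment_counts_2 = [15, 16, 8]      # second byte stream, dropped on the third segment

count_1 = sum(segment_counts_1)     # value held by the first counter 154
count_2 = sum(segment_counts_2)     # value held by the second counter 154

assert (count_1, count_2) == (36, 39)
# The stream that was dropped first still holds the longest match:
assert max(count_1, count_2) == count_2
```

This is why the retained counter values, rather than the order in which streams are dropped, determine the winning match.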
[00071] Figure 8 depicts a schematic flow chart diagram of a more detailed embodiment of the dictionary read operation 176 shown in the compression method 170 of Figure 7. Although the dictionary read operation 176 is described with reference to the CDE 102 of Figure 4, other embodiments may be implemented in conjunction with other compression/decompression engines. Also, it should be noted that at least some of the operations of the illustrated dictionary read operation 176 may be implemented in parallel (e.g., interleaved) in order to process multiple distance values and/or byte streams simultaneously or at about the same time.
[00072] As explained above, each of the match entries in a dictionary entry may include one or more initial characters from the byte streams stored in the corresponding locations in the byte buffer 150. In the illustrated dictionary reading operation 176, the initial byte stream characters stored in the dictionary 146 are read 188 and compared 190 by the distance logic 148 with the corresponding bytes from the input data stream. For each non-matching initial byte stream, the distance logic 148 discards the corresponding distance value so that the comparison logic 152 does not consume any time or resources trying to compare the non-matching byte stream with the input data stream 142. The illustrated dictionary read operation 176 then ends.
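The prefilter described above might look like this in outline; storing a single initial character per match entry is an assumption (the text allows one or more initial characters):

```python
def prefilter_distances(match_entries, input_byte):
    """match_entries: (distance, stored_initial_character) pairs read from one
    dictionary entry.  Discard every distance whose stored initial character
    disagrees with the corresponding input byte, so the comparison logic
    never fetches a byte stream that cannot match."""
    return [dist for dist, first_char in match_entries if first_char == input_byte]
```

Only the surviving distances are forwarded to the comparison logic 152 for the full byte stream compare.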
[00073] It should be noted that embodiments of the methods, operations, functions, and/or logic may be implemented in software, firmware, hardware, or some combination thereof. Additionally, some embodiments of the methods, operations, functions, and/or logic may be implemented using a hardware or software representation of one or more algorithms related to the operations described above. To the degree that an embodiment may be implemented in software, the methods, operations, functions, and/or logic are stored on a computer-readable medium and accessible by a computer processor. [00074] Embodiments of the invention also may involve a number of functions to be performed by a computer processor such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor. The microprocessor may be a specialized or dedicated microprocessor that is configured to perform particular tasks by executing machine-readable software code that defines the particular tasks. The microprocessor also may be configured to operate and communicate with other devices such as direct memory access modules, memory storage devices, Internet related hardware, and other devices that relate to the transmission of data. The software code may be configured using software formats such as Java, C++, XML (Extensible Mark-up Language) and other languages that may be used to define functions that relate to operations of devices required to carry out the functional operations described herein. The code may be written in different forms and styles, many of which are known to those skilled in the art. Different code formats, code configurations, styles and forms of software programs and other means of configuring code to define the operations of a microprocessor may be implemented.
[00075] Within the different types of computers, such as computer servers, that utilize the invention, there exist different types of memory devices for storing and retrieving information while performing some or all of the functions described herein. In some embodiments, the memory/storage device where data is stored may be a separate device that is external to the processor, or may be configured in a monolithic device, where the memory or storage device is located on the same integrated circuit, such as components connected on a single substrate. Cache memory devices are often included in computers for use by the CPU or GPU as a convenient storage location for information that is frequently stored and retrieved. Similarly, a persistent memory is also frequently used with such computers for maintaining information that is frequently retrieved by a central processing unit, but that is not often altered within the persistent memory, unlike the cache memory. Main memory is also usually included for storing and retrieving larger amounts of information such as data and software applications configured to perform certain functions when executed by the central processing unit. These memory devices may be configured as random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), flash memory, and other memory storage devices that may be accessed by a central processing unit to store and retrieve information. Embodiments may be implemented with various memory and storage devices, as well as any commonly used protocol for storing and retrieving information to and from these memory devices respectively.
[00076] Although the operations of the method(s) herein are shown and described in a particular order, the order of the operations of each method may be altered so that certain operations may be performed in an inverse order or so that certain operations may be performed, at least in part, concurrently with other operations. In another embodiment, instructions or sub-operations of distinct operations may be implemented in an intermittent and/or alternating manner.
[00077] Although specific embodiments of the invention have been described and illustrated, the invention is not to be limited to the specific forms or arrangements of parts so described and illustrated. The scope of the invention is to be defined by the claims appended hereto and their equivalents.
[00078] In the following description, specific details of various embodiments are provided. However, some embodiments may be practiced with less than all of these specific details. In other instances, certain methods, procedures, components, structures, and/or functions are described in no more detail than to enable the various embodiments of the invention, for the sake of brevity and clarity.
[00079] While many embodiments are described herein, at least some of the described embodiments implement logic to facilitate all or part of an INFLATE process to decompress a compressed file. More specifically, at least one embodiment uses a set of comparators to determine a code length of an incoming code and an adder (or other addition logic) to determine an index into a lookup table (LUT) that is based on a random access memory (RAM). As a result, some embodiments of the Huffman logic are faster than a ternary CAM and consume less circuit area.
[00080] Additionally, some embodiments may be used with dynamic and/or static
Huffman coding, as explained in more detail below. For a static Huffman code, fewer comparators (e.g., 2 comparators) may be used to determine a code length, and the LUT is programmed during initialization. In contrast, for a dynamic Huffman code, a preamble sequence provides the code length of each of the codes. The number of codes of each length is then tabulated and used to determine the starting code for each code length. A RAM index is derived by subtracting the starting code and adding the number of all shorter codes, which may be pre-computed to allow a single addition operation. Thus, the codes may be stored in a contiguous manner within the RAM-based LUT. [00081] As one example, the largest lookup value may be for a 286-entry literal/length code. This LZ77 code would have 9 code bits together with an associated extra data size, which can vary from 0 to 5 bits, depending on the code. This extra data size is stored alongside the LZ77 code word, so the total size of the RAM LUT for this exemplary code is 286 locations of 12 bits each: 9 bits for the code and 3 bits for the extra data size.
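The index derivation reduces to a single addition once the per-length offset is pre-computed, as a short sketch shows (using the terms from the text):

```python
def lut_index(code: int, start_code: int, num_shorter: int) -> int:
    """RAM index for a canonical Huffman code: subtract the starting code for
    its length and add the number of all shorter codes.  Folding the two
    constants into one pre-computed offset leaves a single addition."""
    offset = num_shorter - start_code   # pre-computed once per code length
    return code + offset
```

Because codes of each length are numbered consecutively from their start code, this mapping packs all codes contiguously into the RAM-based LUT.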
[00082] Figure 10 depicts a schematic block diagram of one embodiment of a computing environment 1000. The illustrated computing environment 1000 includes a compression/decompression engine (CDE) 1002, a fast messaging network (FMN) station 1004, and an input-output (I/O) distributed interconnect station 1006. An exemplary embodiment of the CDE 1002 is described in more detail below. [00083] In one embodiment, the I/O distributed interconnect station 1006 is part of a high speed distributed interconnect ring which connects multiple cores, caches, and processing agents. The high speed distributed interconnect ring supports simultaneous transactions among the connected components.
[00084] The FMN 1004 provides a channel for messages directed to and from the
CDE 1002. In some embodiments, the messages may direct the CDE 1002 to perform compression or indicate completion of a compression operation. [00085] In general, the CDE 1002 is configured to compress files for transfer via the BIU 1008 and to decompress compressed files received via the BIU 1008. Alternatively, other embodiments of the CDE 1002 may be implemented in other computing environments in which compressed files may be used. The illustrated CDE 1002 includes a bus interface unit (BIU) 1008, a translate block (XLT) 1010, and a compression/decompression module (CDM) 1012. The BIU 1008 provides a data interface to the I/O distributed interconnect station 1006 and the I/O distributed interconnect ring. The XLT 1010 provides an interface between the BIU 1008 and the CDM 1012. In one embodiment, the XLT 1010 uses its own direct memory access (DMA) engine to read and write data via the BIU 1008, so the XLT 1010 may operate autonomously from a central processing unit (CPU) coupled to the computing environment 1000. The CDM 1012 performs compression and decompression operations for the CDE 1002. A more detailed embodiment of the CDM is shown in Figure 12 and described below. Other embodiments of the CDE 1002 may include fewer or more components. Additionally, other embodiments of the CDE 1002 may implement more or less functionality than is described herein.
[00086] Figure 11 depicts a schematic block diagram of a more detailed embodiment of the compression/decompression module (CDM) 1012 shown in Figure 10. The illustrated CDM 1012 includes a DEFLATE pipeline 1114 and an INFLATE pipeline 1116. The DEFLATE pipeline 1114 is available to implement the CDE compression process, also referred to as the DEFLATE process. The illustrated DEFLATE pipeline 1114 includes an input buffer 1118, LZ77 logic 1120, Huffman logic 1122, and an output buffer 1124. The INFLATE pipeline 1116 is available to implement the CDE decompression process, also referred to as the INFLATE process. The illustrated INFLATE pipeline 1116 includes an input buffer 1126, Huffman logic 1128, LZ77 logic 1130, and an output buffer 1132. Although each pipeline is shown with individual schematic components, at least some of the components may operate in conjunction with both pipelines 1114 and 1116 using a single implementation. Other embodiments of the CDM 1012 may incorporate fewer or more components. [00087] For both the DEFLATE pipeline 1114 and the INFLATE pipeline 1116, the CDM supports various operating modes, including static compression, dynamic compression, and no compression. A file such as the file 912 of Figure 9 may be split into blocks, and each block may use any of the three modes. Hence, the various blocks of a single file may be compressed using any combination of these three modes. [00088] For the DEFLATE process, splitting the file into blocks is performed as a pre-process before the file is presented to the CDE 1002. The CDE 1002 then compresses each block and uses bit-stitching to recombine the compressed blocks in the deflated bit stream. For the INFLATE process, the deflated bit stream is input to the CDE 1002 and the CDE decompresses the blocks individually, according to the block header information within the bit stream.
[00089] The DEFLATE and INFLATE processes use two algorithms to achieve compression. The LZ77 algorithm, implemented by the LZ77 logic 1120 for the DEFLATE process, creates a dictionary of strings of bytes that have occurred previously in the file. In one embodiment, the LZ77 logic 1120 enforces a minimum string length (e.g., three bytes) for the byte strings in the dictionary. The LZ77 logic 1120 then replaces strings with a distance value (e.g., up to 32,768 bytes) and a length value (e.g., up to 258 bytes) for a matching string. If no match exists, then the incoming byte is output as a literal character.
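A minimal, deliberately unoptimized sketch of this string replacement follows. The real LZ77 logic 1120 uses the hash dictionary rather than the linear scan shown here, and the function name and output format are ours:

```python
def lz77_encode(data, min_len=3, window=32768, max_len=258):
    """Toy LZ77 sketch: emit ('match', distance, length) for previously seen
    strings of at least min_len bytes, otherwise emit the byte as a literal."""
    out, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - window), i):   # scan the sliding window
            l = 0
            while (l < max_len and i + l < len(data)
                   and data[j + l] == data[i + l]):
                l += 1                           # overlapping matches allowed
            if l > best_len:
                best_len, best_dist = l, i - j
        if best_len >= min_len:
            out.append(('match', best_dist, best_len))
            i += best_len
        else:
            out.append(('lit', data[i]))
            i += 1
    return out
```

On the input b"abcabcabc" the encoder emits three literals followed by a single (distance 3, length 6) match, illustrating how a match may overlap its own source.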
[00090] Subsequently, the Huffman logic 1122 (for the DEFLATE process) implements the Huffman algorithm to replace the literal, length, and distance codes with codes whose length depends on the frequency of occurrence of the LZ77 codes in the block. More specifically, the Huffman logic 1122 implements one of three coding modes: static compression, dynamic compression, and no compression. For static compression, a predefined code is used which is not necessarily ideal for the block being coded, but still typically achieves good compression. Static compression coding may be executed relatively quickly. Dynamic compression coding, in contrast, may be slower since it uses two passes — one pass to create a statistics table of the frequency of occurrence of each LZ77 code and to generate an optimized Huffman code, and a second pass to make use of the Huffman code to encode the LZ77 data. Although dynamic coding may be slower than static coding, in some instances, it also may result in a higher compression ratio. Regardless of whether dynamic or static compression is implemented, the Huffman logic 1122 outputs a serial bit stream which is sent a byte at a time to the XLT 1010. In some embodiments, the bit stream is packed with zeroes at the end of the file in order to finish on a byte boundary. As one example, the maximum transfer rate is approximately 3.2 Gbps at 400 MHz, although other levels of performance may be achieved using other systems.
[00091] In a more detailed embodiment of the static compression mode, the
Huffman logic 1122 parses the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. More specifically, a static LUT is built upon initialization, and the LUT is used to provide a Huffman code for every literal, length, or distance subsequently presented to it. In some embodiments, there are 30 distance codes, each having 5 bits. Additionally, literal and length codes may be part of the same 286-entry LUT (refer to Figure 12) or part of a separate LUT (refer to Figure 13). In some embodiments, each literal and length code is 7, 8, or 9 bits in size. Furthermore, many of the length and distance codes may have extra data which follows directly after the code word, which provides a range of possible lengths or distances. The extra bits are also used to define an exact length or distance. However, the number of extra bits is a function of the code, with longer codes having more extra data values. The Huffman logic 1122 then outputs the deflated block, including the compressed data and other symbols.
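The 7-, 8-, and 9-bit literal/length sizes correspond to the fixed Huffman code defined by the DEFLATE specification (RFC 1951); the per-symbol size can be read off directly:

```python
def static_litlen_size(symbol: int) -> int:
    """Code size in bits for a literal/length symbol under DEFLATE's static
    Huffman code (RFC 1951): every code is 7, 8, or 9 bits."""
    if symbol <= 143:
        return 8        # literals 0-143
    if symbol <= 255:
        return 9        # literals 144-255
    if symbol <= 279:
        return 7        # end-of-block marker and length codes 256-279
    return 8            # length codes 280-287

STATIC_DISTANCE_SIZE = 5  # all 30 static distance codes are 5 bits
```

These fixed sizes are what allow the static LUT to be built once at initialization and reused for every block coded in static mode.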
[00092] In a more detailed embodiment of the dynamic compression mode, the
Huffman logic 1122 implements multiple phases. In one embodiment, two phases are implemented for dynamic Huffman coding. In the first phase (also referred to as the first pass), the Huffman logic 1122 gathers statistics for each of 286 literal/length codes. The Huffman logic 1122 also gathers statistics for each of 30 distance codes. In the second phase (also referred to as the second pass), several operations are implemented. In one embodiment, a literal and length heap is built, and a literal and length Huffman tree is built. Then the literal and length Huffman code is generated. Similar heap, tree and code generation operations are also implemented for the corresponding distance value and the bit length. Subsequently, the Huffman logic 1122 outputs the bit length code sizes, the literal/length codes using the bit length code, and the distance code using the bit length code. In one embodiment, the Huffman logic 1122 parses the literal/length and distance Huffman codes, replacing code lengths and repetition counts with equivalent bit length Huffman codes. Similarly, the Huffman logic 1122 may parse the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. The output literal/length codes and distance codes are also referred to as the output bit stream.
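The code generation step of the second pass can follow the canonical construction from RFC 1951, which derives each symbol's code from its code length alone. A sketch (the heap and tree building that produce the lengths are omitted):

```python
def canonical_codes(lengths):
    """Assign canonical Huffman codes from per-symbol code lengths,
    following the algorithm in RFC 1951, section 3.2.2."""
    max_len = max(lengths)
    bl_count = [0] * (max_len + 1)          # number of codes of each length
    for l in lengths:
        if l:
            bl_count[l] += 1
    next_code = [0] * (max_len + 1)
    code = 0
    for bits in range(1, max_len + 1):      # starting code for each length
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code
    codes = []
    for l in lengths:                        # hand out codes in symbol order
        if l == 0:
            codes.append(None)               # symbol unused in this block
        else:
            codes.append(next_code[l])
            next_code[l] += 1
    return codes
```

Running it on the RFC 1951 example lengths (3, 3, 3, 3, 3, 2, 4, 4) reproduces the expected codes 010, 011, 100, 101, 110, 00, 1110, 1111.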
[00093] It should also be noted that some input files, or data such as embedded image data within a file, may already be in a compressed format. As a result, the static and dynamic coding techniques of the Huffman logic 1122 may be unable to compress such data further, or potentially may increase the size of the compressed data. For these types of input files, the Huffman logic 1122 may implement a format without further compression (i.e., the "no compression mode"). In this mode, the data are split into blocks, with each block being up to 65,535 bytes in size. The compression process also adds a header for this data type and then outputs the data stream as is. [00094] In general, the INFLATE process is the reverse of the DEFLATE process.
Although some aspects of the INFLATE process are less complicated than the DEFLATE process (e.g., there is no need to choose a decoding or decompression mode), other complications may arise in the INFLATE process. For example, some embodiments of the INFLATE process are configured to process any valid compressed file, including the possibility of unlimited block size, as well as distances and lengths up to the maximums specified in the industry standards.
[00095] Within the INFLATE process, the Huffman logic 1128 receives data from the XLT 1010 via the input buffer 1126. In some embodiments, the Huffman logic 1128 operates in a single phase, regardless of whether the data are statically or dynamically encoded. For static decoding, the LUT is programmed during initialization. A set of comparators is used to determine the length of each incoming literal/length code, which may be for example 7, 8, or 9 bits. In some embodiments, the distance codes are all 5 bits in length. However, other embodiments may use different bit lengths for the literal/length and distance codes. An offset is then added to the code to put it into the correct range within the LUT. The output of the LUT provides both the value of the code and the length of any extra data that is appended.
[00096] In contrast to the LUT for the static decoding, the dynamic LUT is programmed on demand. In some embodiments, the Huffman logic 1128 reads and stores the size (e.g., 1-7 bits) of each bit length code and determines the sum of codes of each size for the bit length codes. The Huffman logic 1128 also determines the start code for each code size for the bit length codes. The Huffman logic 1128 then writes the bit length LUT.
[00097] Using the bit length LUT, the Huffman logic 1128 reads and stores the size (e.g., 1-15 bits) of each literal/length code and determines the sum of codes of each size for the literal/length codes. The Huffman logic 1128 also determines the start code for each code size for the literal/length codes. The Huffman logic 1128 then writes the literal/length LUT.
[00098] The Huffman logic 1128 also uses the bit length LUT to read and store the size (e.g., 1-15 bits) of each distance code and to determine the sum of codes of each size for the distance codes. The Huffman logic 1128 also determines the start code for each code size of the distance codes. The Huffman logic 1128 then writes the distance LUT. [00099] Like the static LUT, a set of comparators is used to determine the size of each incoming literal/length and distance code. In one embodiment, 14 comparators are used because the size may vary from 1 to 15 bits. Other embodiments may use other quantities of comparators. The output of the LUT gives both the value of the code and the length of any extra data that is appended. Together the code and extra data are used to recover the length or distance value. In some embodiments, literals are treated like lengths but have no extra data. In this way, the original LZ77 sequence is recovered and output to the LZ77 logic 1130 of the INFLATE pipeline 1116. The LZ77 logic 1130 then reconstructs the original file data and sends the original file data via the output buffer 1132 to the XLT 1010.
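Taken together, these steps amount to a canonical Huffman decode: accumulate the code one bit at a time, and at each length test whether the code falls within that length's range; if so, the contiguous LUT index is the code minus the start code plus the count of all shorter codes. A sketch (bit-at-a-time for clarity; the hardware resolves the length in a single step with its comparator bank):

```python
def decode_symbol(bits, pos, counts, max_bits=15):
    """Decode one canonical Huffman code from a list of bits starting at pos.
    counts[n] = number of codes of length n.  Returns (LUT index, new pos)."""
    code = first = index = 0
    for length in range(1, max_bits + 1):
        code = (code << 1) | bits[pos]
        pos += 1
        if code - first < counts[length]:   # code is within this length's range
            return index + (code - first), pos
        index += counts[length]             # skip this length's LUT entries
        first = (first + counts[length]) << 1
    raise ValueError("invalid Huffman code")
```

With ten 4-bit, sixteen 6-bit, and sixteen 7-bit codes, the first 6-bit code 101000 decodes to LUT index 10 and the first 7-bit code 1110000 decodes to index 26, matching the offsets of 30 and 86 discussed below.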
[000100] Figure 12 depicts a schematic block diagram of one embodiment of a hardware implementation of the Huffman logic 1128 of the INFLATE pipeline 1116 of the compression/decompression module (CDM) 1012 shown in Figure 11. In general, the Huffman logic 1128 receives compressed data from the XLT 1010 and generates LZ77 length and distance code values to send to the LZ77 logic 1130. The illustrated Huffman logic 1128 includes a bit buffer 1242, a set of comparators 1244, a bit selector 1246, an index adder 1248, a LUT 1250, and a shift adder 1252. Other embodiments of the Huffman logic 1128 may include fewer or more components or may implement more or less functionality.
[000101] In one embodiment, the bit buffer 1242 receives the compressed data from the XLT 1010. For example, the bit buffer 1242 may receive the compressed data via the input buffer 1126 of the INFLATE pipeline 1116. Additionally, in some embodiments the compressed data in the bit buffer 1242 does not include header information, which is previously stripped and processed separately. The bit buffer 1242 stores multiple bits (e.g., at least 15 bits) of the compressed data prior to sending the bits to the comparators 1244 and the index adder 1248. In some embodiments, the bit buffer 1242 may write the bits (e.g., 4 bytes at a time) to a scratch buffer (not shown) before the bits are sent to the comparators 1244 and/or the index adder 1248.
[000102] The set of comparators 1244 compares the buffered bits to a plurality of different preloaded values. For example, some embodiments use 14 different preloaded values, although other embodiments may use fewer or more values and corresponding comparators. As an example, a dynamic code may include a total of 10 codes of length 4, 16 codes of length 6, and 16 codes of length 7. The shortest codes are numbered first, starting from 0, so they are codes 0000 thru 1001. The start code for the next code length is derived by multiplying the next code (1010) by 2 (for length 5 codes) and by 2 again to give a length 6 code. Hence, the starting code for the codes of length 6 is 101000. So the length 6 codes are numbered 101000 thru 110111. Using a similar technique, the length 7 codes start at 1110000 and are numbered 1110000 thru 1111111. The decimal equivalents for these code ranges are 0 thru 9 for the length 4 codes, 40 thru 55 for the length 6 codes, and 112 thru 127 for the length 7 codes. In the lookup table these are stored in indices 0 thru 41 because there are 42 codes. The length 4 codes are straightforward: the start code is 0 and the number of shorter codes is 0, so the index offset is 0. The length 6 codes have a start code of 40 and 10 shorter codes, so an offset of 30 is subtracted to get the index for the length 6 codes. The length 7 codes have a start of 112 and 26 shorter codes, so an offset of 86 is subtracted to get the index for the length 7 codes. Using these values, the 14 comparators 1244 would be set thus: 2>=0, 3>=0, 4>=0, 5>=40, 6>=40, 7>=112, and 8>=0x8000 through 15>=0x8000. The largest triggered comparator 1244 gives the code length. It should also be noted that codes with lengths greater than the longest code in use are compared with a 16-bit value so that they are not triggered.
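The comparator settings from this example can be checked in software. Representing the comparator bank as a table and the bit buffer as a 15-bit integer is our modeling choice; the thresholds are taken from the text:

```python
NEVER = 0x8000   # a 16-bit threshold no 15-bit value can reach

# THRESHOLDS[n]: value compared against the top n bits of the bit buffer
# (10 codes of length 4, 16 of length 6, 16 of length 7, as in the text).
THRESHOLDS = {2: 0, 3: 0, 4: 0, 5: 40, 6: 40, 7: 112,
              8: NEVER, 9: NEVER, 10: NEVER, 11: NEVER,
              12: NEVER, 13: NEVER, 14: NEVER, 15: NEVER}

def code_length(bits15: int) -> int:
    """bits15: the next 15 buffered bits, MSB first, as an integer.
    The largest triggered comparator gives the code length."""
    length = 0
    for n, threshold in THRESHOLDS.items():
        if (bits15 >> (15 - n)) >= threshold:
            length = n
    return length
```

A length-4 code such as 0101, a length-6 code such as 101000, and a length-7 code such as 1110000 resolve to lengths 4, 6, and 7 respectively, regardless of the trailing buffer bits.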
[000103] The result of the comparison provides a bit selection value that is input to the bit selector 1246, which selects one or more values to add to the value of the bits in the bit buffer 1242. In one embodiment, the bit selector 1246 selects an index offset to add to the bits in the bit buffer 1242. In another embodiment, the bit selector 1246 selects a start value to add to the bits in the bit buffer 1242. It should also be noted that multiple parameters may be selected, either independently or as a pre-computed sum value, to be added to the bits in the bit buffer 1242. As an example, the bit selector may select 1 of 15 pre-computed values using a bit selection value between 1 and 15 (e.g., using a 4-bit value). Other embodiments may use a different number of possible values. The index adder 1248 then adds the selected values to the bits in the bit buffer 1242. In some embodiments, the bit selector 1246 and the index adder 1248 may be combined into a single functional block.
[000104] The resulting index from the index adder 1248 is then used to look up corresponding LZ77 length or distance code values in the LUT 1250. Where a single LUT 1250 is used for the length and distance values, the LZ77 length and distance code values are output in an alternating manner. It should also be noted that the LUT 1250 may be used to look up literal values in a manner substantially similar to the length values. In one embodiment, the LUT 1250 is a 286 x 9 RAM, and the output LZ77 code value is a 9-bit code. In another embodiment, the LUT 1250 is a 286 x 13 RAM in order to accommodate a 9-bit code value and 4 bits of extra data size. In another embodiment, the LUT 1250 is a 320 x 9 or 320 x 13 RAM to combine the length and distance code values in the LUT 1250.
[000105] In one embodiment, the bit selection value from the set of comparators 1244 is also sent to the shift adder 1252 to add a shift offset to the bit selection value. In one embodiment, the shift offset is the extra data size from the LUT 1250, which is described above. Hence, the shift adder 1252 adds the extra data size to the code length from the comparators 1244 to get the total shift amount to read the next code in the data stream. In this way, the result of the shift addition is then used to indicate the next variable-length Huffman code in the bit buffer 1242. In some embodiments, shifting the location of the bit buffer 1242 avoids one or more extra bits that are not needed for the lookup operations.
[000106] Figure 13 depicts a schematic block diagram of another embodiment of a hardware implementation of the Huffman logic 1128 of the INFLATE pipeline 1116 of the compression/decompression module (CDM) 1012 shown in Figure 11. In many aspects, the Huffman logic 1128 shown in Figure 13 is substantially similar to the Huffman logic 1128 shown in Figure 12. However, the Huffman logic 1128 of Figure 13 includes a demultiplexor 1354 and uses multiple LUTs 1356 and 1358 to look up the LZ77 length and distance code values, instead of using a single, combined LUT 1250. [000107] In one embodiment, the demultiplexor 1354 receives the index from the index adder 1248 and directs the index to either a length LUT 1356 or a distance LUT 1358, depending on a control signal. In one embodiment, the demultiplexor 1354 alternates sending the index to the length LUT 1356 and the distance LUT 1358. Thus, one index value is used to look up the LZ77 length (or literal) code value in the length LUT 1356, and the next index value is used to look up the corresponding LZ77 distance code value. Other embodiments may implement other combinations of demultiplexors 1354 and LUTs 1250, 1356, and 1358.
[000108] Figure 14 depicts a schematic flow chart diagram of one embodiment of a Huffman decoding method 1470 that may be implemented in conjunction with the Huffman logic 1128 of the INFLATE pipeline of the compression/decompression engine (CDE) 1002 shown in Figure 10. Although the Huffman decoding method 1470 is described with reference to the CDE 1002 of Figure 10, other embodiments may be implemented in conjunction with other compression/decompression engines. Also, it should be noted that at least some of the operations of the illustrated Huffman decoding method 1470 may be implemented in parallel (e.g., interleaved) or in another order. [000109] In the illustrated Huffman decoding method 1470, the Huffman logic 1128 receives 1472 the compressed data stream and the buffer 1242 reads 1474 a number of bits from the variable-length bit stream. In one embodiment, the Huffman logic 1128 then determines 1476 if the next LZ77 code segment is a length code segment. If so, then the comparators 1244 compare 1478 the bits from the bit buffer 1242 with a plurality of predetermined values. In this way, the comparators 1244 identify 1480 the bit length of the corresponding LZ77 code segment.
[000110] After identifying the bit length of the corresponding LZ77 code segment, or after determining that the next LZ77 code segment is not a length code segment, then the bit selector 1246 selects 1482 values corresponding to the LZ77 code segment and computes 1484 the LUT index for the next LZ77 code segment. The index is then used to look up 1486 the value of the corresponding LZ77 code segment in the LUT 1250, which outputs 1488 the LZ77 code segment for processing by the LZ77 logic 1130. Additionally, the shift adder 1252 determines 1490 if the next bits are extra bits and, if so, shifts 1492 the buffer location for the next bit buffer read operation. The illustrated Huffman decoding method 1470 then ends.
[000111] It should be noted that embodiments of the methods, operations, functions, and/or logic may be implemented in software, firmware, hardware, or some combination thereof. Additionally, some embodiments of the methods, operations, functions, and/or logic may be implemented using a hardware or software representation of one or more algorithms related to the operations described above. To the degree that an embodiment may be implemented in software, the methods, operations, functions, and/or logic are stored on a computer-readable medium and accessible by a computer processor. [000112] Embodiments of the invention also may involve a number of functions to be performed by a computer processor such as a central processing unit (CPU), a graphics processing unit (GPU), or a microprocessor. The microprocessor may be a specialized or dedicated microprocessor that is configured to perform particular tasks by executing machine-readable software code that defines the particular tasks. The microprocessor also may be configured to operate and communicate with other devices such as direct memory access modules, memory storage devices, Internet related hardware, and other devices that relate to the transmission of data. The software code may be configured using software formats such as Java, C++, XML (Extensible Mark-up Language) and other languages that may be used to define functions that relate to operations of devices required to carry out the functional operations described herein. The code may be written in different forms and styles, many of which are known to those skilled in the art. Different code formats, code configurations, styles and forms of software programs and other means of configuring code to define the operations of a microprocessor may be implemented.
[000113] Within the different types of computers, such as computer servers, that utilize the invention, there exist different types of memory devices for storing and retrieving information while performing some or all of the functions described herein. In some embodiments, the memory/storage device where data is stored may be a separate device that is external to the processor, or may be configured in a monolithic device, where the memory or storage device is located on the same integrated circuit, such as components connected on a single substrate. Cache memory devices are often included in computers for use by the CPU or GPU as a convenient storage location for information that is frequently stored and retrieved. Similarly, a persistent memory is also frequently used with such computers for maintaining information that is frequently retrieved by a central processing unit, but that is not often altered within the persistent memory, unlike the cache memory. Main memory is also usually included for storing and retrieving larger amounts of information such as data and software applications configured to perform certain functions when executed by the central processing unit. These memory devices may be configured as random access memory (RAM), static random access memory (SRAM), dynamic random access memory (DRAM), flash memory, and other memory storage devices that may be accessed by a central processing unit to store and retrieve information. Embodiments may be implemented with various memory and storage devices, as well as any commonly used protocol for storing and retrieving information to and from these memory devices respectively. [000114] Although the operations of the method(s) herein are shown and described in a particular order, the order of the operations of each method may be altered so that certain operations may be performed in an inverse order or so that certain operations may be performed, at least in part, concurrently with other operations. 
In another embodiment, instructions or sub-operations of distinct operations may be implemented in an intermittent and/or alternating manner.
[000115] Although specific embodiments of the invention have been described and illustrated, the invention is not to be limited to the specific forms or arrangements of parts so described and illustrated. The scope of the invention is to be defined by the claims appended hereto and their equivalents.
[000116] Figure 15 shows a method 1500 for saving and restoring a compression/decompression state, in accordance with one embodiment. As shown, data is processed, the processing including compressing or decompressing the data. See operation 1502.
[000117] In the context of the present description, compressing refers to any act of compressing data. For example, in various embodiments, the compressing may include, but is not limited to, implementing lossless data compression algorithms such as Lempel-Ziv algorithms (e.g. LZ77, LZ78, etc.), Lempel-Ziv-Welch (LZW) algorithms, and Burrows-Wheeler transforms (BWT), implementing lossy data compression algorithms, and/or any other compression that meets the above definition. Furthermore, decompressing refers to any act of decompressing the data.
[000118] As shown further, a state of the processing is saved. See operation 1504. In the context of the present description, a state refers to a condition of the processing. For example, in one embodiment, the state may include information associated with a status of the processing.
[000119] Further, the state of the processing is restored. See operation 1506. As an option, at least a portion of the saving and at least a portion of the restoring may be carried out simultaneously. [000120] More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may or may not be implemented, per the desires of the user. It should be strongly noted that the following information is set forth for illustrative purposes and should not be construed as limiting in any manner. Any of the following features may be optionally incorporated with or without the exclusion of other features described.
[000121] Figure 16 shows a computing environment 1600 for saving and restoring a compression/decompression state, in accordance with one embodiment. As shown, the computing environment 1600 includes a compression/decompression block (CDB) 1602, a fast messaging network (FMN) station 1604, and an input-output (I/O) distributed interconnect station 1606. In one embodiment, the I/O distributed interconnect station 1606 may be part of a high speed distributed interconnect ring which connects multiple cores, caches, and processing agents. The high speed distributed interconnect ring may support simultaneous transactions among the connected components.
[000122] The FMN 1604 provides a channel for messages directed to and from the CDB 1602. In some embodiments, the messages may direct the CDB 1602 to perform compression/decompression or indicate completion of a compression/decompression operation. In one embodiment, the CDB 1602 is configured to compress files for transfer via a bus interface unit (BIU) 1608 and to decompress compressed files received via the BIU 1608. Alternatively, other embodiments of the CDB 1602 may be implemented in other computing environments in which compressed files may be used.
[000123] As shown, the CDB 1602 may include the BIU 1608, a translate block (XLT) 1610, and a compression/decompression engine (CDE) 1612. The BIU 1608 may provide a data interface to the I/O distributed interconnect station 1606 and the I/O distributed interconnect ring. The XLT 1610 may provide an interface between the BIU 1608 and the CDE 1612. In one embodiment, the XLT 1610 may use its own direct memory access (DMA) engine to read and write data via the BIU 1608, such that the XLT 1610 may operate autonomously from a central processing unit (CPU) coupled to the computing environment 1600.
[000124] In one embodiment, the CDE 1612 may perform compression and decompression operations for the CDB 1602. It should be noted that, in various other embodiments, the CDB 1602 may include fewer or more components. Additionally, other embodiments of the CDB 1602 may implement more or less functionality than is described herein.
[000125] In operation, software may provide free descriptors to the CDB 1602 (e.g. at system start-up). In this case, free descriptors refer to any descriptor associated with a free page in memory. In one embodiment, these free descriptors may be in a buffer such as a FIFO buffer (e.g. a free descriptor pool FIFO buffer).
[000126] In various embodiments, this buffer may hold a varying number of descriptors. For example, in one embodiment, the buffer may hold up to eight descriptors on chip and have the ability to be extended into memory. This spill region in the memory may be configured in the case that the CDB 1602 is initialized with more than eight descriptors.
[000127] When a compression/decompression message is sent from the CPU to the CDB 1602, the message may first be decoded and a list of data pointers may be retrieved from memory. In one embodiment, the first pointer in the list may point to a scratch page. As an option, this scratch page may be at least 1 Kbyte and be used by the CDB 1602 to store intermediate results of a compression/decompression process.
[000128] In this way, a save and restore feature may be implemented, which allows the CDB 1602 to store the intermediate state of a first file being processed and work on a second file. When more data for the first file is received, the state may be restored by using the restore feature. The XLT 1610 may then walk down the list of data pointers fetching the data and sending it to the CDE 1612. [000129] The CDE 1612 may then perform the transformations on the data (e.g. compressions/decompressions) and return the data back to the XLT 1610. The XLT 1610 may then pop two free descriptors from the buffer. In this case, the first free descriptor may be used for forming a return list. Additionally, the second descriptor may be the location where the transformed data is written.
[000130] If the data does not fit into a single descriptor, then more free descriptors may be popped from the buffer and used to store the data. When all the transformed data has been written to memory, the return message may be formed and sent to a "return bucket." In this case, the return bucket is a field in the message that was sent from the CPU to the CDB 1602. Return buckets may be associated with CPUs in a multiprocessor system.
[000131] For files that do not use the save/restore functionality, software may return free descriptors back to the CDB 1602 after it has received a message and read the transformed data. It should be noted that, for files that use save/restore, free descriptors may not be sent back to the CDB 1602 until the whole file is transformed, as following segments might point back to data in previous segments.
[000132] Figure 17 shows a more detailed diagram of the compression/decompression engine 1612 shown in Figure 16, in accordance with one embodiment. As shown, the CDE 1612 may include a deflate pipeline 1614 and an inflate pipeline 1616. In one embodiment, the deflate pipeline 1614 may be utilized to implement the compression process performed by the CDB 1602, also referred to as the deflate process. As shown, the deflate pipeline 1614 may include an input buffer 1618, LZ77 logic 1620, Huffman logic 1622, and an output buffer 1624.
[000133] In another embodiment, the inflate pipeline 1616 may be utilized to implement the decompression process performed by the CDB 1602, also referred to as the inflate process. As shown, the inflate pipeline 1616 may include an input buffer 1626, Huffman logic 1628, LZ77 logic 1630, and an output buffer 1632. Although each pipeline is shown with individual schematic components, at least some of the components may operate in conjunction with both pipelines 1614 and 1616 using a single implementation. Other embodiments of the CDE 1612 may include fewer or more components.
[000134] For both the deflate pipeline 1614 and the inflate pipeline 1616, the CDE 1612 may support various operating modes, including static compression, dynamic compression, and no compression. In one embodiment, a file (e.g. a data file, etc.) may be split into blocks, where each block may use any of the three modes. Hence, the various blocks of a single file may be compressed using any combination of these three modes.
[000135] For the deflate process, splitting the file into blocks may be performed as a pre-process before the file is presented to the CDB 1602 and/or the CDE 1612. The CDB 1602 may then compress each block and use bit-stitching to recombine the compressed blocks in the deflated bit stream. For the inflate process, the deflated bit stream may be input to the CDB 1602 and the CDB 1602 may decompress the blocks individually. As an option, the CDB 1602 may decompress the blocks individually according to block header information within the bit stream.
[000136] In one embodiment, the deflate and inflate processes may each use two algorithms to achieve compression. For example, an LZ77 algorithm, implemented by the LZ77 logic 1620 for the deflate process, may create a dictionary of strings of bytes that have occurred previously in the file. In one embodiment, the LZ77 logic 1620 may enforce a minimum string length (e.g. three bytes, etc.) for the byte strings in the dictionary. The LZ77 logic 1620 may then replace strings with a distance value (e.g. up to 32,768 bytes, etc.) and a length value (e.g. up to 258 bytes, etc.) for a matching string. If no match exists, then the incoming byte may be output as a literal character. [000137] Further, the Huffman logic 1622 (for the deflate process) may implement a Huffman algorithm to replace the literal, length, and distance codes with codes whose length depends on the frequency of occurrence of the LZ77 codes in the block. In one embodiment, the Huffman logic 1622 may implement one of three coding modes. For example, the Huffman logic 1622 may implement static compression, dynamic compression, and no compression.
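The string-matching behavior described above can be sketched in software as follows. This is an illustrative greedy model, not the hardware implementation; the naive window search stands in for whatever match-search structure the engine actually uses, while the window size, minimum match length, and maximum match length follow the parameters given in the text.

```python
# Illustrative greedy LZ77 model (not the hardware implementation).
# Parameters follow the text: 32,768-byte window, 3-byte minimum
# match, 258-byte maximum match length.
WINDOW = 32768
MIN_MATCH = 3
MAX_MATCH = 258

def lz77_compress(data: bytes):
    """Return a list of ('lit', byte) and ('match', length, distance) tokens."""
    out = []
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        start = max(0, i - WINDOW)
        # Naive scan over the window; real engines use hash chains or CAMs.
        for j in range(start, i):
            k = 0
            while (k < MAX_MATCH and i + k < len(data)
                   and data[j + k] == data[i + k]):
                k += 1
            if k > best_len:
                best_len, best_dist = k, i - j
        if best_len >= MIN_MATCH:
            out.append(('match', best_len, best_dist))
            i += best_len
        else:
            # No match of at least MIN_MATCH bytes: emit a literal.
            out.append(('lit', data[i]))
            i += 1
    return out
```

Note that a match is allowed to overlap the current position (distance smaller than length), which is how runs of repeated bytes compress well.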
[000138] For static compression, a predefined code may be used. Static compression coding may be executed relatively quickly. Dynamic compression coding may use one pass to create a statistics table of the frequency of occurrence of each LZ77 code and to generate an optimized Huffman code, and a second pass to make use of the Huffman code to encode the LZ77 data. In this way, a high compression ratio may be achieved.
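The first pass of the dynamic mode, gathering frequency statistics and deriving a Huffman code from them, can be sketched as follows. This is a generic software model under stated simplifications: it derives code lengths from a conventional heap-based Huffman construction, and omits the 15-bit length limiting that DEFLATE additionally requires.

```python
import heapq
from collections import Counter

def huffman_code_lengths(symbols):
    """First pass: count symbol frequencies, then derive Huffman code
    lengths by repeatedly merging the two lightest subtrees.
    (The 15-bit length limit required by DEFLATE is omitted here.)"""
    freq = Counter(symbols)
    if len(freq) == 1:
        return {next(iter(freq)): 1}
    # Heap entries: (weight, unique tiebreak, {symbol: depth-so-far}).
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, m1 = heapq.heappop(heap)
        w2, _, m2 = heapq.heappop(heap)
        # Merging two subtrees deepens every symbol in them by one bit.
        merged = {s: d + 1 for s, d in {**m1, **m2}.items()}
        heapq.heappush(heap, (w1 + w2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]
```

The second pass would then encode the LZ77 token stream using canonical codes assigned from these lengths, which is why more frequent symbols end up with shorter codes and the block compresses better than with the static table.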
[000139] Regardless of whether dynamic or static compression is implemented, the Huffman logic 1622 may output a serial bit stream which may be sent a byte at a time to the XLT 1610. In some embodiments, the bit stream may be packed with zeroes at the end of the file in order to finish on a byte boundary. As one example, the maximum transfer rate may be approximately 3.2 Gbps at 400 MHz, although other levels of performance may be achieved using other systems.
[000140] As a more detailed example of the static compression mode, the Huffman logic 1622 may parse the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. More specifically, a static lookup table (LUT) may be built upon initialization, where the LUT may be used to provide a Huffman code for every literal, length, or distance subsequently presented to it. In some embodiments, there may be thirty distance codes, each having five bits.
[000141] Additionally, literal and length codes may be part of the same LUT (e.g. a 286-entry LUT, etc.) or part of a separate LUT. In one embodiment, each literal and length code may be seven, eight, or nine bits in size. Furthermore, many of the length and distance codes may have extra data which follows directly after the code word, which provides a range of possible lengths or distances. The extra bits may also be used to define an exact length or distance. However, the number of extra bits may be a function of the code, with longer codes having more extra data values. The Huffman logic 1622 may then output the deflated block, including the compressed data and other symbols.
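The fixed table described above, with seven-, eight-, and nine-bit literal/length codes, matches the static Huffman code defined by the DEFLATE standard (RFC 1951); building it looks roughly like this. The exact table the patent's hardware initializes is not given, so this standard table is offered as the presumed equivalent.

```python
def static_literal_length_codes():
    """Build the fixed DEFLATE literal/length Huffman table (RFC 1951).
    Returns {symbol: (code_value, code_length_in_bits)}."""
    table = {}
    for sym in range(0, 144):          # literals 0-143: 8-bit codes
        table[sym] = (0b00110000 + sym, 8)
    for sym in range(144, 256):        # literals 144-255: 9-bit codes
        table[sym] = (0b110010000 + (sym - 144), 9)
    for sym in range(256, 280):        # end-of-block + lengths: 7-bit codes
        table[sym] = (sym - 256, 7)
    for sym in range(280, 288):        # remaining length codes: 8-bit codes
        table[sym] = (0b11000000 + (sym - 280), 8)
    return table
```

The thirty distance codes mentioned in the text are simpler still: each is just its 5-bit symbol value, with the code's extra-bit count growing for larger distance ranges.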
[000142] With further reference to the dynamic compression mode, in one embodiment, the Huffman logic 1622 may implement multiple phases. For example, two phases may be implemented for dynamic Huffman coding. In the first phase (i.e. the first pass), the Huffman logic 1622 may gather statistics for each literal/length code (e.g. 286 codes in a 286-entry LUT, etc.). The Huffman logic 1622 may also gather statistics for each distance code (e.g. 30 distance codes, etc.).
[000143] In the second phase (i.e. the second pass), several operations may be implemented. In one embodiment, a literal and length heap may be built, and a literal and length Huffman tree may be built. Further, the literal and length Huffman code may be generated. Similar heap, tree, and code generation operations may also be implemented for the corresponding distance value and the bit length.
[000144] Subsequently, the Huffman logic 1622 may output the bit length code sizes, the literal/length codes using the bit length code, and the distance code using the bit length code. In one embodiment, the Huffman logic 1622 may parse the literal/length and distance Huffman codes, replacing code lengths and repetition counts with equivalent bit length Huffman codes. Similarly, the Huffman logic 1622 may parse the LZ77 compressed data, replacing symbols with equivalent Huffman codes and extra length and distance bits. The output literal/length codes and distance codes are also referred to as the output bit stream.
[000145] It should be noted that some input files, or data such as embedded image data within a file, may already be in a compressed format. As a result, the static and dynamic coding techniques of the Huffman logic 1622 may be unable to compress such data further without potentially increasing the size of the compressed data. For these types of input files, the Huffman logic 1622 may implement a format without further compression (i.e. the "no compression mode"). In this mode, the data may be split into blocks, with each block being up to a defined number of bytes in size (e.g. 65,535 bytes). The compression process may also add a header for this data type and output the data stream as configured.
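The no-compression framing described above corresponds to DEFLATE "stored" blocks, which can be sketched as follows. The header layout (BFINAL/BTYPE bits, then LEN and its one's complement NLEN) is taken from RFC 1951; the sketch assumes each block header starts byte-aligned, which is the simplest case.

```python
def stored_blocks(data: bytes) -> bytes:
    """Split data into DEFLATE 'stored' (uncompressed) blocks of at
    most 65,535 bytes each, per RFC 1951. Assumes byte alignment, so
    each block starts with BFINAL in bit 0, BTYPE=00, and zero padding."""
    MAX_STORED = 65535
    chunks = [data[i:i + MAX_STORED]
              for i in range(0, len(data), MAX_STORED)] or [b""]
    out = bytearray()
    for n, chunk in enumerate(chunks):
        final = 1 if n == len(chunks) - 1 else 0
        out.append(final)                            # BFINAL bit, BTYPE=00
        ln = len(chunk)
        out += ln.to_bytes(2, "little")              # LEN
        out += (ln ^ 0xFFFF).to_bytes(2, "little")   # NLEN (one's complement)
        out += chunk                                 # raw, uncompressed bytes
    return bytes(out)
```

Each block therefore costs five bytes of overhead, which is why already-compressed payloads come out only marginally larger rather than being expanded by a futile Huffman pass.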
[000146] In general, the inflate process is the reverse of the deflate process. However, in some cases, additional functionality may be implemented in one process and not the other. For example, some embodiments of the inflate process may be configured to process any valid compressed file, including the possibility of unlimited block size, as well as distances and lengths up to the maximums specified in the industry standards.
[000147] Within the inflate process, the Huffman logic 1628 may receive data from the XLT 1610 via the input buffer 1626. In some embodiments, the Huffman logic 1628 may operate in a single phase, regardless of whether the data is statically or dynamically encoded. For static decoding, the LUT may be programmed during initialization. As an option, a set of comparators may be used to determine the length of each incoming literal/length code, which may be a specified number of bits (e.g. 7, 8, or 9 bits, etc.).
[000148] In one embodiment, the distance codes may all be five bits in length. However, other embodiments may use different bit lengths for the literal/length and/or distance codes. An offset may then be added to the code to put it into the correct range within the LUT. The output of the LUT may provide both the value of the code and the length of any extra data that is appended.
[000149] In contrast to the LUT for the static decoding, the dynamic LUT may be programmed on demand. In some embodiments, the Huffman logic 1628 may read and store the size (e.g. 1-7 bits, etc.) of each bit length code and determine the sum of codes of each size for the bit length codes. The Huffman logic 1628 may also determine the start code for each code size for the bit length codes. The Huffman logic 1628 may then write the bit length LUT.
[000150] Using the bit length LUT, the Huffman logic 1628 may read and store the size (e.g. 1-15 bits, etc.) of each literal/length code and determine the sum of codes of each size for the literal/length codes. The Huffman logic 1628 may also determine the start code for each code size for the literal/length codes. The Huffman logic 1628 may then write the literal/length LUT.
[000151] The Huffman logic 1628 may also use the bit length LUT to read and store the size (e.g. 1-15 bits, etc.) of each distance code and to determine the sum of codes of each size for the distance codes. The Huffman logic 1628 may also determine the start code for each code size of the distance codes and then write the distance LUT.
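The three table-programming steps above (count the codes of each size, compute a start code per size, then write the LUT in symbol order) are the canonical Huffman construction from RFC 1951, which can be sketched as follows. The LUT write itself is abstracted to returning a code dictionary.

```python
def canonical_codes(code_lengths):
    """Assign canonical Huffman codes from {symbol: bit_length}, the way
    the text describes: sum the codes of each size, derive the start
    code for each size, then hand out codes in symbol order."""
    max_bits = max(code_lengths.values())
    bl_count = [0] * (max_bits + 1)
    for length in code_lengths.values():
        bl_count[length] += 1          # codes of each size
    next_code = [0] * (max_bits + 1)
    code = 0
    for bits in range(1, max_bits + 1):
        code = (code + bl_count[bits - 1]) << 1
        next_code[bits] = code         # start code for this size
    codes = {}
    for sym in sorted(code_lengths):   # symbol order within each size
        length = code_lengths[sym]
        codes[sym] = (next_code[length], length)
        next_code[length] += 1
    return codes
```

Because the code is fully determined by the lengths, the dynamic preamble only needs to transmit code lengths, which is what makes the on-demand LUT programming described above possible.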
[000152] As with the static LUT, a set of comparators may be used to determine the size of each incoming literal/length and distance code. In one embodiment, fourteen comparators may be used, in the case that the size varies from 1 to 15 bits. Other embodiments may use other quantities of comparators.
[000153] The output of the LUT may give both the value of the code and the length of any extra data that is appended. Together the code and extra data may be used to recover the length or distance value. In some embodiments, literals may be treated like lengths but have no extra data. In this way, the original LZ77 sequence may be recovered and output to the LZ77 logic 1630 of the inflate pipeline 1616.
[000154] The LZ77 logic 1630 may then reconstruct the original file data and send the original file data via the output buffer 1632 to the XLT 1610. In one embodiment, the LZ77 logic 1630 may use the same buffer (e.g. a 32Kbyte buffer, etc.) used by the LZ77 logic 1620 of the deflate pipeline 1614. In this case, the LZ77 logic 1630 may use the buffer as the source of the strings specified by the distance and length codes provided by the Huffman logic 1628 of the inflate pipeline 1616. Each decoded byte may be output to the XLT 1610 and written to the buffer. In this manner, the previous set number of bytes of data (e.g. 32Kbytes of data, etc.) may always be available for reference.
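The back-reference resolution described above can be modeled as follows. The byte-at-a-time copy is the essential detail: it makes overlapping references (distance smaller than length) reproduce runs correctly, just as the buffer-backed hardware would.

```python
class InflateWindow:
    """Model of the 32-Kbyte byte buffer used to resolve length/distance
    back-references during inflate. Output bytes are retained so the
    previous window of data is always available for reference."""
    SIZE = 32768

    def __init__(self):
        self.out = bytearray()

    def literal(self, byte: int):
        self.out.append(byte)

    def copy(self, length: int, distance: int):
        assert 1 <= distance <= min(self.SIZE, len(self.out))
        # Byte-at-a-time copy so overlapping references work.
        for _ in range(length):
            self.out.append(self.out[-distance])
```

(A real implementation would also discard output older than 32 Kbytes; retaining everything keeps the model short.)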
[000155] In one embodiment, the CDB 1602 may be designed to perform either deflate or inflate sequentially, not both at once. To make it easier to deal with long files which could provide a bottleneck, the CDB 1602 architecture may allow a "context save" to be performed, followed at some arbitrary time later by a "context restore" in order to continue (or complete) the stream.
[000156] For deflate, the CPU may divide the blocks and decide the type of encoding to use for each block (e.g. static, dynamic, or non-compressed). The CPU may also decide where to switch to another stream. In one embodiment, the switch may occur at a block boundary. In this way, the CDB 1602 and/or the XLT 1610 may save a bit position and partial byte data for bit-stitching the blocks together.
[000157] In one embodiment, a save operation during the deflate process may be performed at block boundaries of the file. For dynamic blocks, saves may always occur at the end of the second pass. In another embodiment, a save and restore during the deflate process may always occur on block boundaries, with no special processing by the CDE 1612 being implemented.
[000158] In the case that a save occurs on a block boundary, current dynamic Huffman code does not need to be reloaded. Further, as an option, distance codes may not be allowed to straddle blocks. Thus, the buffer does not have to be re-initialized with warm-up data (i.e. data for initializing the buffer). In this way, apart from some context for bit-stitching by the XLT 1610, the restore may look similar to the beginning of any other block.
[000159] With respect to implementing a save operation during the inflate process, context may be saved for reloading the dynamic Huffman tables, re-initializing the buffer (e.g. the 32K-byte buffer), and for continuing the decoding of the bit stream from where the decoding was interrupted. To enable reloading the Huffman tables in the case of a dynamic Huffman coding, the inflate pipeline 1616 may send out any Huffman preamble the inflate pipeline 1616 receives to the XLT 1610 to store in a scratch page for later recovery. In some cases, the only difference between the bit stream that the inflate pipeline 1616 receives and the output bit stream is that the output preamble may start out byte aligned.
[000160] Additionally, in some cases, the output preamble may be partial data because the save may occur while the preamble is being received. In that situation, a bit in a context word may indicate that the preamble was partial. In this case, the feedback data may be sent via the output buffer 1632.
[000161] In one embodiment, the XLT 1610 may save pointers to the previous 32K-bytes of the inflated buffer data in the scratch page so that on restore the data may be retrieved to allow the buffer to be re-initialized. Additionally, an 8-byte context word may be created which allows the input stream state to be recovered, such that the inflate process may later continue from where the process was interrupted. In this case, the CPU may not have any information about the incoming stream, with the exception of the amount of data that has been received.
[000162] Because of the Huffman coding, the CPU may not be able to distinguish the location of the block boundaries. Thus, a context save may occur anywhere in the bit stream, either in the preamble data, or in the compressed data itself. Save logic may be utilized to resolve every eventuality.
[000163] To accomplish this, the save logic may maintain a save buffer (e.g. a 6-byte buffer, etc.) which holds the last N bytes (e.g. 6 bytes, etc.) of incoming inflate data and a counter which determines how many of these bits (e.g. 48 bits, etc.) have been committed by being sent to the LZ77 logic 1630 and accepted. In some cases, sending a length code without a distance code may be of no value to the LZ77 logic 1630. Thus, if the stream stops without a complete length plus distance, then the save count may include both code lengths until the distance code is accepted.
[000164] Similarly, there may be no way of knowing whether there are sufficient bits to perform a look-up in the Huffman LUT unless the LZ77 logic 1630 has the full number of bits expected (e.g. the 7 or 15 bits expected). If the LZ77 logic 1630 does not have the full number of bits expected, the logic may not attempt the look-up and all the remaining bits may be saved. In one embodiment, the total number of saved bits may include 15 bits for the length code, 15 bits for the distance code, a maximum of 5 bits of length extra data and a maximum of 13 bits of distance extra data, less one bit (thereby preventing the look up) for a total of 47 bits.
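The 47-bit worst case stated above can be checked with simple arithmetic, using the maxima the text gives for each field:

```python
# Worst-case count of uncommitted bits that must be saved mid-symbol,
# per the text's maxima for dynamic DEFLATE codes.
MAX_LENGTH_CODE_BITS = 15     # length Huffman code
MAX_DISTANCE_CODE_BITS = 15   # distance Huffman code
MAX_LENGTH_EXTRA_BITS = 5     # extra bits after a length code
MAX_DISTANCE_EXTRA_BITS = 13  # extra bits after a distance code

# Less one bit: with one bit still missing, the final look-up cannot
# be attempted, so everything accumulated so far must be saved.
worst_case_saved_bits = (MAX_LENGTH_CODE_BITS + MAX_DISTANCE_CODE_BITS
                         + MAX_LENGTH_EXTRA_BITS + MAX_DISTANCE_EXTRA_BITS
                         - 1)
assert worst_case_saved_bits == 47
```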
[000165] In one embodiment, a save during the inflate process may be performed each time a dynamic inflate preamble is received. In this case, the preamble data may be sent to the pipeline output buffer 1632, where the data is sent back to the XLT 1610. As an option, the saved preamble may be byte-aligned at the beginning, whether or not the incoming data is aligned.
[000166] Subsequently, if a save is requested, two cases may be handled. For example, if the save request follows the preamble, then the XLT 1610 has already received and stored the entire preamble (i.e. a full save). If a save is requested during the preamble, then a partial save results. In either case, the CDB 1602 may send a context word (e.g. an 8-byte context word) which contains any fragmentary data remaining (up to 47 bits), the length of the fragmentary data (as a bit count), the state of the inflate pipeline 1616, the type of data (dynamic, static, or non-compressed), and a bit to distinguish the full and partial save types.
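A context word carrying these fields might be packed as follows. The field set follows the text (up to 47 fragment bits, a bit count, a pipeline state, a data type, and a full/partial flag), but the bit positions and field widths chosen here are illustrative assumptions, not the actual hardware layout.

```python
def pack_context_word(fragment_bits: int, fragment_len: int,
                      pipe_state: int, data_type: int,
                      partial: bool) -> bytes:
    """Pack a hypothetical 64-bit inflate context word.
    Layout (assumed, not from the hardware):
      bits [46:0]  fragmentary data (up to 47 bits)
      bits [52:47] fragment length as a bit count (0-47)
      bits [56:53] pipeline state
      bits [58:57] data type (dynamic/static/non-compressed)
      bit  [59]    partial-preamble flag"""
    assert 0 <= fragment_len <= 47 and fragment_bits < (1 << 47)
    word = fragment_bits
    word |= fragment_len << 47
    word |= (pipe_state & 0xF) << 53
    word |= (data_type & 0x3) << 57
    word |= int(partial) << 59
    return word.to_bytes(8, "little")
```

On restore, the same fields would be unpacked to reload the input buffer and pipeline state before normal data resumes.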
[000167] With respect to the inflate process, the restore begins by sending restore data. The XLT 1610 may use a data type indication which may indicate normal, restore, or warm-up data. The restore data may be treated much like normal data except it will be known that only the LUTs are being programmed and compressed data should not be expected. If the save was partial, the restore may run out of restore data before the LUTs are fully programmed, and may have to continue programming them when the normal data is received.
[000168] The XLT 1610 may then send the warm-up data to re-initialize a 32-kbyte Byte Buffer 1634. It should be noted that, although the Byte Buffer 1634 is shown as part of the CDE 1612, in another embodiment the Byte Buffer 1634 may be separate from the CDE 1612. For example, in one embodiment the Byte Buffer 1634 may be included as part of the CDB 1602.
[000169] The warm-up data may be sent much like the normal inflate data, however with a different data type indication. The deflate pipeline 1614 will recognize this data type, and put the data in the Byte Buffer 1634 in the same way it would for normal deflate data but without doing any deflate operations. At the end of the warm up data, the XLT 1610 may make a restore request which causes the context word (e.g. the 8-byte word) to be read back and the input buffer 1626 and an inflate state to be reset to a state that was present before the save occurred.
[000170] Finally, the XLT 1610 will be ready to send normal data. The CDE 1612 will either continue programming the LUTs if the save was partial, or begin looking up codes if the save was full. At this point, the input buffer 1626 could be empty or there may still be data waiting to be processed. However full the input buffer 1626 may be, there should still be room for the maximum 47 bits of restore data to be added without it overflowing. In one embodiment, for each compression type, a maximum of 8 bytes may be stored in the input buffer. This is sufficient because in the worst case the buffer will have at most 17 bits of data (i.e. 17 bits of data representing a static distance code of 5 bits plus 13 bits of extra data, less one bit, which prevents the distance code from being looked up).
[000171] Whether a full or partial restore was performed, the CDB 1602 will continue from where it was halted. It should be noted that the CDB 1602 receives only data to be deflated or the bit stream data portion of packets to be inflated. In one embodiment, header information and cyclic redundancy check (CRC) data may be stripped (e.g. by software) from inflate packets before sending the raw inflate data to the CDB 1602.
[000172] Whether deflating or inflating, data may be sent to a CRC block within the CDB 1602 which computes both an Adler checksum and a CRC value. These checksums may then be written to the scratch space and may be available for software to either append to a deflate packet or compare with an inflate packet checksum.
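Both checksums accumulate incrementally over the stream, which is what makes them natural candidates for the scratch-space save/restore described above. A software model:

```python
import zlib

def stream_checksums(chunks):
    """Compute running Adler-32 and CRC-32 values over a sequence of
    data chunks, mirroring a CRC block that sees the stream piecewise.
    The running values are exactly the partial checksums that could be
    saved to scratch space and restored later."""
    adler, crc = 1, 0  # standard initial values for Adler-32 and CRC-32
    for chunk in chunks:
        adler = zlib.adler32(chunk, adler)
        crc = zlib.crc32(chunk, crc)
    return adler & 0xFFFFFFFF, crc & 0xFFFFFFFF
```

Because each update depends only on the previous running value and the new bytes, processing the data in arbitrary segments yields the same final checksums as processing it whole.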
[000173] Figure 18 shows a method 1800 for saving and restoring a compression/decompression state, in accordance with another embodiment. As an option, the present method 1800 may be implemented in the context of the functionality and architecture of Figures 15-17. Of course, however, the method 1800 may be implemented in any desired environment. Again, the aforementioned definitions may apply during the present description.
[000174] In operation, data is processed by compressing or decompressing the data. In this case, the processing may be facilitated by communication of one or more messages 1802, including various information, from a CPU to a compression/decompression block. Table 1 shows various information that may be included in the message 1802, in accordance with one embodiment.
Table 1
(Table 1 is reproduced as an image in the original publication.)
[000175] In this case, the SRC ADDR field may point to a list of descriptors in memory. Table 2 shows a list of descriptors in accordance with one embodiment. Table 3 shows definitions for the descriptors in Table 2.
Table 2
(Table 2 is reproduced as an image in the original publication.)
Table 3
EOF - End of File
{Type1, Type0} - Useful for deflate. For inflate, the incoming data has to be decoded to determine the block type.
00 - No Compress Block
01 - Static Huffman
10 - Dynamic Huffman - pass 1
11 - Dynamic Huffman - pass 2
SOD - Start of Data
SOB - Start of Block. Useful for deflate.
Save - Save Context. This may be set only on the last entry of the descriptor list.
Restore - Restore Context. This may be set only on the first entry of the descriptor list.
EOB - End of Block. Useful for deflate.
[000176] Once received, the message 1802 may then be decoded and a list or data structure 1804 including data pointers may be retrieved from memory using a DMA operation. As shown, a first pointer in the data structure 1804 points to a scratch page 1806. Additionally, the data structure 1804 and/or the scratch page 1806 may include a preamble of the data. In this case, the preamble may include information associated with the data. Further, the preamble may include a last-received preamble.
[000177] The scratch page 1806 may be utilized to store intermediate results of the compression/decompression. For example, in the case of decompression, the scratch page may store R0-RN, which are pointers to the decompressed data output from the CDE. In this way, a state of the processing may be saved in the scratch page 1806. The data structure 1804 may contain the pointer to this scratch page in the first entry of the list. In one embodiment, the state of the processing may be saved so that other data can be processed. In this case, the state of the processing may be restored once the other data is processed.
[000178] As an option, the state of the processing may be restored utilizing a direct memory access operation. As another option, the data structure 1804 may further include error correction information. In one embodiment, the error correction information may be utilized in conjunction with processing the data.
[000179] With reference to Table 2, if the "Restore" bit is set then the first pointer is associated with the scratch page 1806. In this case, the scratch page 1806 may hold data used by the compression/decompression block to store or retrieve intermediate results. In various embodiments, intermediate results may be a partial CRC, a partial Adler checksum, and/or dynamic Huffman codes.
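The usefulness of a partial CRC or partial Adler checksum as an intermediate result rests on both checksums being resumable: a saved running value can be fed back in as the starting state, and the final result is identical to checksumming the whole stream at once. A minimal software illustration using Python's zlib:

```python
import zlib

# A partial checksum saved at a cut point can be resumed later by
# passing it back in as the starting value -- this is what allows the
# scratch page to hold it between processing sessions.
stream = b"abcdefgh" * 100
half = len(stream) // 2

partial = zlib.adler32(stream[:half])           # intermediate result (saved)
resumed = zlib.adler32(stream[half:], partial)  # restored and continued
assert resumed == zlib.adler32(stream)

# The same property holds for CRC-32.
assert zlib.crc32(stream[half:], zlib.crc32(stream[:half])) == zlib.crc32(stream)
```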
[000180] As an option, the next set of descriptors following the scratch page descriptors may be associated with warm up data. This data may be used by the compression/decompression block for warming up a dictionary (e.g. a buffer, etc.) associated with the block. The SOD (Start of Data) bit may be set for the page where the first data starts.
[000181] For dynamic deflate, the descriptors may have to be repeated twice. In this case, all descriptors starting with the Start of Block (SOB) and ending with End of Block (EOB) of each dynamic deflate may be repeated in the same sequence. The last descriptor in the list may or may not have the SAVE bit set. If the SAVE bit is set, the intermediate results may be stored back into the SCRATCH PAGE ADDR.
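As a sketch of the descriptor-list rule above: for dynamic deflate, each SOB..EOB run is emitted twice (pass 1, then pass 2) in the same order, the scratch-page descriptor comes first when restoring, and SAVE may be set only on the final entry. The helper name, dict representation, and field names below are illustrative, not the hardware descriptor layout:

```python
def build_descriptor_list(scratch_page_addr, block_pages, save=True, restore=True):
    """Assemble a descriptor list as described for dynamic deflate:
    entry 0 points at the scratch page (when restoring), and the
    SOB..EOB descriptor run for each block appears twice -- once per
    pass -- in the same sequence. Illustrative sketch only."""
    descs = []
    if restore:
        descs.append({"addr": scratch_page_addr, "restore": True})
    for pages in block_pages:          # one run of data pages per block
        for _ in range(2):             # repeat the run: pass 1, then pass 2
            for i, addr in enumerate(pages):
                descs.append({
                    "addr": addr,
                    "sob": i == 0,                  # Start of Block
                    "eob": i == len(pages) - 1,     # End of Block
                })
    if save and descs:
        descs[-1]["save"] = True       # SAVE only on the last entry
    return descs

lst = build_descriptor_list(0x1000, [[0x2000, 0x3000]])
assert lst[0]["restore"] and lst[0]["addr"] == 0x1000
assert len(lst) == 1 + 4               # scratch page + 2 passes x 2 pages
assert lst[-1]["save"] and lst[-1]["eob"]
```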
[000182] Figures 19A-19C show a method 1900 for saving and restoring a compression/decompression state, in accordance with another embodiment. As an option, the present method 1900 may be implemented in the context of the functionality and architecture of Figures 15-18. Of course, however, the method 1900 may be carried out in any desired environment. Further, the aforementioned definitions may apply during the present description.
[000183] As shown, data is stored in memory. See operation 1902. Further, memory is allocated for a list (i.e. a data structure). See operation 1904. In one embodiment, the memory may be allocated by a CPU.
[000184] Additionally, the list is defined. See operation 1906. In one embodiment, defining the list may include setting a first entry in the list equal to a scratch page. In another embodiment, defining the list may include setting a plurality of entries equal to the data. For example, the list may include a first entry "0" pointing to the scratch page and a plurality of entries "1 through N" pointing to data entries.
[000185] Once the list is defined, the list is sent to a compression/decompression block (CDB). See operation 1908. In this case, the list or information pointing to the list may be included in a message that is sent. In one embodiment, the list may be sent utilizing the CPU.
[000186] Once the message is sent to the CDB, it is determined whether the message is received by the CDB. See operation 1910. In various embodiments, the message may be received utilizing a variety of devices. For example, in one embodiment, the message may be received using a bus interface unit and/or a translate block as illustrated in Figure 16. Furthermore, the message may take different formats in various embodiments.
[000187] If it is determined that a message is received, a DMA operation is performed. See operation 1912. Further, a scratch page associated with the message is read. See operation 1914. Once the scratch page is read, a context associated with the scratch page/list is sent. See operation 1916. In one embodiment, the context may be sent to a compression/decompression engine.

[000188] Once the context is sent, it is determined whether a preamble is present in the scratch page (e.g. see the scratch page 1806 of Figure 18). See operation 1918. If a preamble is present, the preamble is read and the preamble is sent to the compression/decompression engine. See operations 1920 and 1922. In this case, the preamble may be read using a DMA read.
[000189] Further, it is determined whether warm-up data is present. See operation 1924 of Figure 19B. In this case, the warm-up data may refer to the entries R0-RN shown in the scratch page 1806 of Figure 18. If warm-up data is present, the warm-up data is read and the warm-up data is sent to the compression/decompression engine. See operations 1926 and 1928. In this case, the warm-up data may be read using a DMA read. Additionally, the warm-up data may include any data that is pointed to by descriptors included in the list.
[000190] It should be noted that operations 1918 through 1928 occur when a restore process is being implemented. In the case that a restore is not being implemented, such operations may be omitted. In one embodiment, a bit may be included in the message indicating whether a restore process is to be implemented.
[000191] Once the warm-up data is read, new data is read and sent. See operations 1930 and 1932. In this case, the new data refers to data that has not yet been read. For example, the new data may be data associated with a new message. Once the new data is read, it is then determined whether the new data includes a new preamble. See operation 1934. If a new preamble is present, the old preamble is overwritten in the scratch page. See operation 1936.
[000192] Additionally, it is determined whether inflated data from the compression/decompression engine is present. See operation 1938. As shown, determining whether inflated data is present may occur in parallel with operations 1934 and 1936.

[000193] If inflated/deflated data is output from the CDE, a free page is popped out of a buffer (e.g. a FIFO) to store the list. See operation 1940. Further, a descriptor is stored to the list. See operation 1942. In this case, the descriptor may include an address of the last free page popped from the buffer.
[000194] In addition, if inflated/deflated data is output from the CDE, another free page is popped from the buffer. See operation 1944 of Figure 19C. The data is then written into the free page. See operation 1946. It is then determined whether all the data is written and/or whether the page is full. See operation 1948. If the page is full, another free page is popped and the data is written to this new page.
[000195] Further, the same descriptors stored to the list are stored in the scratch page. See operation 1950. Still yet, a first cache line of the scratch page is written. See operation 1952. In this case, data written to the first line of the scratch page may include information indicating whether the list is associated with a save or restore operation. Additionally, the data written to the first line of the scratch page may include CRC data and/or an Adler checksum. Once the first cache line of the scratch page is written, the message is returned, the message pointing to the list. See operation 1954.
[000196] In this way, a save and restore feature may be implemented, which allows the intermediate state information of a first file being processed to be stored and allows work on a second file to proceed. While various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.
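In software terms, the save/restore feature behaves like checkpointing a decompressor mid-stream so that a second stream can be serviced before the first is finished. Python's zlib exposes the same idea directly via decompressobj.copy(), used here purely as an analogy for the scratch-page context, not as a model of the hardware:

```python
import zlib

original = b"Intermediate state can be checkpointed mid-stream. " * 40
compressed = zlib.compress(original)

d = zlib.decompressobj()
first_half = d.decompress(compressed[:40])   # process part of file 1

saved = d.copy()                             # "save" the state (scratch page)

# Service a second, unrelated stream while file 1 is suspended.
other = zlib.decompress(zlib.compress(b"second file"))
assert other == b"second file"

# "Restore" the saved state and finish file 1 from where it left off.
second_half = saved.decompress(compressed[40:]) + saved.flush()
assert first_half + second_half == original
```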

Claims

WHAT IS CLAIMED IS:
1. A method for deflate processing within a compression engine, the method comprising: hashing a plurality of characters of an input data stream to provide a hash address into a dictionary; reading a plurality of distance values in parallel from the dictionary based on the hash address, wherein the distance values are stored in the dictionary; identifying a matching distance value from the plurality of distance values; and encoding the matching distance value with a length value as a portion of a LZ77 code stream.
2. The method according to claim 1, wherein identifying the matching distance value further comprises: comparing the input data stream in parallel with a plurality of byte streams from a byte buffer, the plurality of byte streams corresponding to the plurality of distance values from the dictionary; and identifying a longest matching byte stream of the plurality of byte streams.
3. The method according to claim 2, further comprising deriving the length value from the longest matching byte stream of the plurality of byte streams.
4. The method according to claim 2, wherein comparing the input data stream with the plurality of byte streams further comprises comparing a plurality of bytes per cycle from each byte stream with corresponding bytes of the input data stream.
5. The method of claim 4, wherein comparing the plurality of bytes per cycle from each byte stream further comprises comparing up to 16 bytes per cycle from each byte stream with the corresponding bytes of the input data stream.
6. The method according to claim 1, further comprising: reading at least one byte stream character from the dictionary, the at least one byte stream character corresponding to at least one of the plurality of characters used to compute the hash address; and comparing the at least one byte stream character with the plurality of characters of the input data stream.
7. The method according to claim 6, further comprising discarding the distance value corresponding to the at least one byte stream character in response to a determination that the at least one byte stream character is different from the plurality of characters of the input data stream.
8. The method according to claim 1, further comprising updating the dictionary at the end of a match operation to identify a longest matching byte stream from a byte buffer.
9. The method according to claim 1, further comprising updating the dictionary in response to an operation to output a literal, wherein the literal comprises another character of the input data stream.
10. The method according to claim 1, further comprising storing approximately 2K entries in the dictionary, wherein each entry comprises a plurality of possible match entries.
11. The method according to claim 10, wherein each entry comprises up to four possible match entries.
12. The method according to claim 10, wherein each possible match entry comprises: a possible distance value corresponding to a byte stream in a byte buffer; a valid bit to indicate a valid status of the byte stream in the byte buffer; and at least one initial character from the byte stream in the byte buffer.
13. The method according to claim 12, wherein each possible match entry comprises two initial characters from the byte stream in the byte buffer.
14. An apparatus to implement a deflate process in a compression engine, the apparatus comprising: a hash table to hash a plurality of characters of an input data stream to provide a hash address; a dictionary coupled to the hash table, the dictionary to provide a plurality of distance values in parallel based on the hash address, wherein the distance values are stored in the dictionary; comparison logic coupled to the dictionary, the comparison logic to identify a matching distance value from the plurality of distance values; and encoding logic coupled to the comparison logic, the encoding logic to encode the matching distance value with a length value as a portion of a LZ77 code stream.
15. The apparatus according to claim 14, further comprising a byte buffer coupled to the dictionary and the comparison logic, the byte buffer to store a plurality of previous bytes of the input data stream, wherein the comparison logic is further configured to compare the input data stream in parallel with a plurality of byte streams from the byte buffer, the plurality of byte streams from the byte buffer corresponding to the plurality of distance values from the dictionary.
16. The apparatus according to claim 15, wherein the comparison logic is further configured to identify a longest matching byte stream of the plurality of byte streams.
17. The apparatus according to claim 16, further comprising a counter coupled to the comparison logic, the counter to count a number of matching bytes to identify the longest matching byte stream.
18. The apparatus according to claim 16, wherein the comparison logic is further configured to derive the length value from the longest matching byte stream of the plurality of byte streams.
19. The apparatus according to claim 16, wherein the comparison logic is further configured to compare about 16 bytes per cycle from each byte stream with corresponding bytes of the input data stream.
20. The apparatus according to claim 14, further comprising distance logic coupled to the dictionary, the distance logic to read at least one byte stream character from the dictionary and to compare the at least one byte stream character with the plurality of characters of the input data stream, wherein the at least one byte stream character corresponds to at least one of the plurality of characters used to compute the hash address.
21. The apparatus according to claim 20, wherein the distance logic is further configured to discard the distance value corresponding to the at least one byte stream character in response to a determination that the at least one byte stream character is different from the plurality of characters of the input data stream.
22. The apparatus according to claim 14, wherein the dictionary is further configured to update at least one dictionary entry at the end of a match operation to identify a longest matching byte stream from a byte buffer.
23. The apparatus according to claim 14, wherein the dictionary is further configured to update at least one dictionary entry in response to an operation to output a literal, wherein the literal comprises another character of the input data stream.
24. The apparatus according to claim 14, wherein the dictionary comprises approximately 2K dictionary entries, wherein each dictionary entry comprises a plurality of possible match entries.
25. The apparatus according to claim 24, wherein each possible match entry comprises: a possible distance value corresponding to a byte stream in a byte buffer; a valid bit to indicate a valid status of the byte stream in the byte buffer; and a plurality of initial characters from the byte stream in the byte buffer.
26. The apparatus according to claim 14, further comprising an input buffer coupled to the hash table, the input buffer to store the plurality of characters of the input data stream.
27. The apparatus according to claim 26, further comprising a Huffman encoder coupled to the encoding logic, the Huffman encoder to compress the LZ77 code stream according to a Huffman coding algorithm.
28. The apparatus according to claim 27, further comprising an output buffer coupled to the Huffman encoder, the output buffer to store the LZ77 code stream compressed according to the Huffman coding algorithm.
29. The apparatus according to claim 14, wherein the dictionary comprises a single-ported dictionary random access memory (RAM).
30. A computer program product comprising a computer useable storage medium to store a computer readable program that, when executed on a computer, causes the computer to perform operations comprising: hash a plurality of characters of an input data stream to provide a hash address into a dictionary; read a plurality of distance values in parallel from the dictionary based on the hash address, wherein the distance values are stored in the dictionary; identify a matching distance value from the plurality of distance values; and encode the matching distance value with a length value as a portion of a LZ77 code stream.
31. The computer program product of claim 30, wherein the computer readable program, when executed on the computer, causes the computer to perform operations to: compare the input data stream in parallel with a plurality of byte streams from a byte buffer, the plurality of byte streams corresponding to the plurality of distance values from the dictionary; and identify a longest matching byte stream of the plurality of byte streams.
32. The computer program product of claim 31, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to derive the length value from the longest matching byte stream of the plurality of byte streams.
33. The computer program product of claim 31, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to compare a plurality of bytes per cycle from each byte stream with corresponding bytes of the input data stream.
34. The computer program product of claim 33, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to compare up to 16 bytes per cycle from each byte stream with the corresponding bytes of the input data stream.
35. The computer program product of claim 30, wherein the computer readable program, when executed on the computer, causes the computer to perform operations to: read at least one byte stream character from the dictionary, the at least one byte stream character corresponding to at least one of the plurality of characters used to compute the hash address; and compare the at least one byte stream character with the plurality of characters of the input data stream.
36. The computer program product of claim 35, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to discard the distance value corresponding to the at least one byte stream character in response to a determination that the at least one byte stream character is different from the plurality of characters of the input data stream.
37. The computer program product of claim 30, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to update the dictionary at the end of a match operation to identify a longest matching byte stream from a byte buffer.
38. The computer program product of claim 30, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to update the dictionary in response to an operation to output a literal, wherein the literal comprises another character of the input data stream.
39. The computer program product of claim 30, wherein the computer readable program, when executed on the computer, causes the computer to perform an operation to store approximately 2K entries in the dictionary, wherein each entry comprises a plurality of possible match entries.
40. An apparatus for deflate processing within a compression engine, the apparatus comprising: means for accessing a dictionary entry in a dictionary, wherein the dictionary entry comprises a plurality of possible match entries corresponding to a combination of characters of an input data stream; means for identifying a matching distance value from the plurality of possible match entries in the dictionary entry; and means for encoding the matching distance value with a length value as a portion of a LZ77 code stream.
41. A method for Huffman decoding within a compression engine, the method comprising: receiving a compressed data stream; comparing a portion of the compressed data stream with a plurality of predetermined values using a plurality of comparators; and outputting a LZ77 code value based on the portion of the compressed data stream and a comparison result from comparing the portion of the compressed data stream with the plurality of predetermined values.
42. The method according to claim 41, wherein the LZ77 code value comprises a LZ77 length code segment.
42. The method according to claim 41, wherein the LZ77 code value comprises a LZ77 length code segment.
43. The method according to claim 41, wherein the LZ77 code value comprises a LZ77 distance code segment.
44. The method according to claim 41, wherein the LZ77 code value comprises a literal value.
45. The method according to claim 41, further comprising: generating a bit selection value based on the comparison of the portion of the compressed data stream with the plurality of the predetermined values; and generating an index based on the portion of the compressed data stream and the bit selection value.
46. The method according to claim 45, further comprising adding a start value to the portion of the compressed data stream to generate the index.
47. The method according to claim 45, further comprising adding an index offset to the portion of the compressed data stream to generate the index.
48. The method according to claim 45, further comprising adding a pre-computed value to the portion of the compressed data stream to generate the index.
49. The method according to claim 48, wherein the pre-computed value comprises a start value and an index offset.
50. The method according to claim 45, further comprising looking up the LZ77 code value in a lookup table based on the index.
51. The method according to claim 50, further comprising: looking up a first LZ77 code value in a first lookup table based on the index; and looking up a second LZ77 code value in a second lookup table based on a subsequent value of the index.
52. An apparatus to implement Huffman decoding in an INFLATE process in a compression engine, the apparatus comprising: a bit buffer to store a portion of a compressed data stream; a set of comparators coupled to the bit buffer, the set of comparators to compare the portion of the compressed data stream with a plurality of predetermined values; and a lookup table coupled to the set of comparators, the lookup table to store a plurality of LZ77 code segments and to output one of the LZ77 code segments corresponding to an index at least partially derived from a comparison result from the set of comparators.
53. The apparatus according to claim 52, wherein the LZ77 code value comprises a LZ77 length code segment.
54. The apparatus according to claim 52, wherein the LZ77 code value comprises a LZ77 distance code segment.
55. The apparatus according to claim 52, wherein the LZ77 code value comprises a literal value.
56. The apparatus according to claim 52, further comprising a bit selector coupled to the set of comparators, the bit selector to receive the comparison result and to identify a value to be added to the portion of the compressed data stream.
57. The apparatus according to claim 56, wherein the value to be added to the portion of the compressed data stream comprises an index offset minus a start value.
58. The apparatus according to claim 56, wherein the value to be added to the portion of the compressed data stream comprises a pre-computed value comprising a start value and an index offset.
59. The apparatus according to claim 56, further comprising an index adder to add the value to the portion of the compressed data stream according to the bit selector.
60. The apparatus according to claim 52, wherein the lookup table comprises a single combined lookup table comprising both LZ77 length code segments and LZ77 distance code segments.
61. The apparatus according to claim 52, wherein the lookup table comprises a plurality of LZ77 length code segments, the apparatus further comprising a second lookup table coupled to the bit buffer, the second lookup table comprising a plurality of LZ77 distance code segments.
62. The apparatus according to claim 61, further comprising a demultiplexer coupled between an index adder and the lookup tables, the demultiplexer to direct the index value to one of the lookup tables.
63. The apparatus according to claim 52, wherein the lookup table comprises a random access memory (RAM) lookup table (LUT).
64. An apparatus to implement Huffman decoding in an INFLATE process in a compression engine, the apparatus comprising: means for receiving a compressed data stream; means for comparing a portion of the compressed data stream with a plurality of predetermined values using a plurality of comparators; and means for outputting a LZ77 code value based on the portion of the compressed data stream and a comparison result from comparing the portion of the compressed data stream with the plurality of predetermined values.
65. The apparatus according to claim 64, further comprising means for generating a bit selection value based on the comparison of the portion of the compressed data stream with the plurality of the predetermined values.
66. The apparatus according to claim 65, further comprising means for generating an index based on the portion of the compressed data stream and the bit selection value.
67. The apparatus according to claim 66, further comprising means for adding an index offset minus a start value to the portion of the compressed data stream to generate the index.
68. The apparatus according to claim 66, further comprising: means for pre-computing a sum value of at least a start value and an index offset; and means for adding the pre-computed value to the portion of the compressed data stream to generate the index.
69. The apparatus according to claim 64, further comprising means for looking up the LZ77 code value in a lookup table based on the index.
70. The apparatus according to claim 64, further comprising: means for looking up a first LZ77 code value in a first lookup table based on the index; and means for looking up a second LZ77 code value in a second lookup table based on a subsequent value of the index.
71. A method, comprising: processing data, the processing including compressing or decompressing the data; saving a state of the processing; and restoring the state of the processing.
72. The method of Claim 71, wherein the state of the processing is saved so that other data can be processed.
73. The method of Claim 71, wherein the state of the processing is stored in a data structure.
74. The method of Claim 73, wherein the data structure includes error correction information.
75. The method of Claim 73, wherein the data structure includes a preamble of the data.
76. The method of Claim 75, wherein the preamble includes a last-received preamble.
77. The method of Claim 73, wherein the data structure includes pointers to the data.
78. The method of Claim 71, wherein the state of the processing is restored utilizing a direct memory access operation.
79. The method of Claim 71, wherein at least a portion of the saving and at least a portion of the restoring are carried out simultaneously.
80. The method of Claim 71, wherein the processing includes compressing.
81. The method of Claim 71, wherein the processing includes decompressing.
82. A system, comprising: a processor to process data, the processing including compressing or decompressing the data; memory for saving a state of the processing; and logic for restoring the state of the processing.
83. The system of Claim 82, wherein the state of the processing is saved so that other data can be processed.
84. The system of Claim 82, wherein the state of the processing is stored in a data structure.
85. The system of Claim 84, wherein the data structure includes error correction information.
86. The system of Claim 84, wherein the data structure includes a preamble of the data.
87. The system of Claim 86, wherein the preamble includes a last-received preamble.
88. The system of Claim 84, wherein the data structure includes pointers to the data.
89. The system of Claim 82, wherein the state of the processing is restored utilizing a direct memory access operation.
90. The system of Claim 82, wherein at least a portion of the saving and at least a portion of the restoring are carried out simultaneously.
91. A computer program product embodied on a computer readable medium, comprising: code for processing data, the processing including compressing or decompressing the data; code for saving a state of the processing; and code for restoring the state of the processing.
PCT/US2008/008107 2007-06-29 2008-06-26 System and method for compression processing within a compression engine WO2009005758A2 (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US11/824,501 2007-06-29
US11/824,501 US7538695B2 (en) 2007-06-29 2007-06-29 System and method for deflate processing within a compression engine
US11/849,166 2007-08-31
US11/849,166 US7538696B2 (en) 2007-08-31 2007-08-31 System and method for Huffman decoding within a compression engine
US12/031,524 2008-02-14
US12/031,524 US9362948B2 (en) 2008-02-14 2008-02-14 System, method, and computer program product for saving and restoring a compression/decompression state

Publications (2)

Publication Number Publication Date
WO2009005758A2 true WO2009005758A2 (en) 2009-01-08
WO2009005758A3 WO2009005758A3 (en) 2009-04-02

Family

ID=40226723

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2008/008107 WO2009005758A2 (en) 2007-06-29 2008-06-26 System and method for compression processing within a compression engine

Country Status (1)

Country Link
WO (1) WO2009005758A2 (en)

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7990297B1 (en) 2010-02-26 2011-08-02 Research In Motion Limited Encoding and decoding methods and devices employing dual codesets
EP2362546A1 (en) 2010-02-26 2011-08-31 Research In Motion Limited Method and device for buffer-based interleaved encoding of an input sequence
EP2362547A1 (en) * 2010-02-26 2011-08-31 Research In Motion Limited Encoding and decoding methods and devices using a secondary codeword indicator
US8063801B2 (en) 2010-02-26 2011-11-22 Research In Motion Limited Encoding and decoding methods and devices using a secondary codeword indicator
US8077064B2 (en) 2010-02-26 2011-12-13 Research In Motion Limited Method and device for buffer-based interleaved encoding of an input sequence
US8378862B2 (en) 2010-07-28 2013-02-19 Research In Motion Limited Method and device for compression of binary sequences by grouping multiple symbols
US8761240B2 (en) 2010-07-13 2014-06-24 Blackberry Limited Methods and devices for data compression using context-based coding order
CN104113344A (en) * 2013-04-16 2014-10-22 晨星半导体股份有限公司 Decompression circuit, correlated compression method, and correlated decompression method
GB2513987A (en) * 2013-03-15 2014-11-12 Intel Corp Parallel apparatus for high-speed, highly compressed LZ77 tokenization and huffman encoding for deflate compression
US9059731B2 (en) 2013-10-21 2015-06-16 International Business Machines Corporation Boosting decompression in the presence of reoccurring Huffman trees
US9252807B2 (en) 2013-10-21 2016-02-02 Globalfoundries Inc. Efficient one-pass cache-aware compression
US9294123B2 (en) 2013-06-29 2016-03-22 Intel Corporation Apparatus and method to accelerate compression and decompression operations
CN106021356A (en) * 2015-05-11 2016-10-12 上海兆芯集成电路有限公司 Hardware data compressor using dynamic hash algorithm based on input block type
US9489199B2 (en) 2012-12-28 2016-11-08 Intel Corporation Vector compare instructions for sliding window encoding
CN106603677A (en) * 2016-12-21 2017-04-26 济南浪潮高新科技投资发展有限公司 Physical information system data compression transmission method using multi-core multi-thread parallelism
CZ306787B6 (en) * 2016-05-10 2017-07-07 CESNET, zájmové sdružení právnických osob A system for implementation of a dispersion table
CN107592117A (en) * 2017-08-15 2018-01-16 深圳前海信息技术有限公司 Compression data block output intent and device based on Deflate
CN108351833A (en) * 2015-12-18 2018-07-31 英特尔公司 For the compressed code metadata encrypted technology of device for storage
CN110620637A (en) * 2019-09-26 2019-12-27 上海仪电(集团)有限公司中央研究院 Data decompression device and method based on FPGA
CN113839679A (en) * 2021-08-31 2021-12-24 山东云海国创云计算装备产业创新中心有限公司 Huffman decoding system, method, equipment and computer readable storage medium
EP3804149A4 (en) * 2018-06-06 2022-03-30 Yingquan Wu Data compression
EP4138394A3 (en) * 2021-08-19 2023-03-01 Intel Corporation Interleaving of variable bitrate streams for gpu implementations
US11791838B2 (en) 2021-01-15 2023-10-17 Samsung Electronics Co., Ltd. Near-storage acceleration of dictionary decoding

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5970177A (en) * 1995-06-07 1999-10-19 America Online, Inc. Data compression using selective encoding
US20010054131A1 (en) * 1999-01-29 2001-12-20 Alvarez Manuel J. System and method for performing scalable embedded parallel data compression
US20020063641A1 (en) * 2000-08-15 2002-05-30 Seagate Technology, Llc Dual mode data compression for operating code


Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7990297B1 (en) 2010-02-26 2011-08-02 Research In Motion Limited Encoding and decoding methods and devices employing dual codesets
EP2362546A1 (en) 2010-02-26 2011-08-31 Research In Motion Limited Method and device for buffer-based interleaved encoding of an input sequence
EP2362547A1 (en) * 2010-02-26 2011-08-31 Research In Motion Limited Encoding and decoding methods and devices using a secondary codeword indicator
US8063801B2 (en) 2010-02-26 2011-11-22 Research In Motion Limited Encoding and decoding methods and devices using a secondary codeword indicator
US8077064B2 (en) 2010-02-26 2011-12-13 Research In Motion Limited Method and device for buffer-based interleaved encoding of an input sequence
US8633837B2 (en) 2010-02-26 2014-01-21 Blackberry Limited Encoding and decoding methods and devices employing dual codesets
US8638247B2 (en) 2010-02-26 2014-01-28 Blackberry Limited Method and device for buffer-based interleaved encoding of an input sequence
US8638246B2 (en) 2010-02-26 2014-01-28 Blackberry Limited Encoding and decoding methods and devices using a secondary codeword indicator
US8761240B2 (en) 2010-07-13 2014-06-24 Blackberry Limited Methods and devices for data compression using context-based coding order
US8378862B2 (en) 2010-07-28 2013-02-19 Research In Motion Limited Method and device for compression of binary sequences by grouping multiple symbols
US9489199B2 (en) 2012-12-28 2016-11-08 Intel Corporation Vector compare instructions for sliding window encoding
US10379853B2 (en) 2012-12-28 2019-08-13 Intel Corporation Sliding window encoding methods for executing vector compare instructions to write distance and match information to different sections of the same register
GB2513987B (en) * 2013-03-15 2016-01-06 Intel Corp Parallel apparatus for high-speed, highly compressed LZ77 tokenization and huffman encoding for deflate compression
GB2513987A (en) * 2013-03-15 2014-11-12 Intel Corp Parallel apparatus for high-speed, highly compressed LZ77 tokenization and huffman encoding for deflate compression
CN104113344B (en) * 2013-04-16 2017-04-12 晨星半导体股份有限公司 Decompression circuit, correlated compression method, and correlated decompression method
CN104113344A (en) * 2013-04-16 2014-10-22 晨星半导体股份有限公司 Decompression circuit, correlated compression method, and correlated decompression method
US9294123B2 (en) 2013-06-29 2016-03-22 Intel Corporation Apparatus and method to accelerate compression and decompression operations
RU2629440C2 (en) * 2013-06-29 2017-08-29 Интел Корпорейшн Device and method for acceleration of compression and decompression operations
US9225355B2 (en) 2013-10-21 2015-12-29 Globalfoundries Inc. Boosting decompression in the presence of reoccurring Huffman trees
US9059731B2 (en) 2013-10-21 2015-06-16 International Business Machines Corporation Boosting decompression in the presence of reoccurring Huffman trees
US9252807B2 (en) 2013-10-21 2016-02-02 Globalfoundries Inc. Efficient one-pass cache-aware compression
CN106021356A (en) * 2015-05-11 2016-10-12 上海兆芯集成电路有限公司 Hardware data compressor using dynamic hash algorithm based on input block type
CN106021356B (en) * 2015-05-11 2019-07-16 上海兆芯集成电路有限公司 Hardware data compressor using dynamic hash algorithm based on input block type
CN108351833A (en) * 2015-12-18 2018-07-31 英特尔公司 Technology for encrypting compressed-code metadata for storage devices
CN108351833B (en) * 2015-12-18 2023-07-28 英特尔公司 Techniques for compressing secret symbol data for memory encryption
CZ306787B6 (en) * 2016-05-10 2017-07-07 CESNET, zájmové sdružení právnických osob A system for implementation of a dispersion table
US10262702B2 (en) 2016-05-10 2019-04-16 Cesnet, Zajmove Sdruzeni Pravnickych Osob System for implementation of a hash table
CN106603677A (en) * 2016-12-21 2017-04-26 济南浪潮高新科技投资发展有限公司 Physical information system data compression transmission method using multi-core multi-thread parallelism
CN107592117A (en) * 2017-08-15 2018-01-16 深圳前海信息技术有限公司 Compression data block output intent and device based on Deflate
EP3804149A4 (en) * 2018-06-06 2022-03-30 Yingquan Wu Data compression
CN110620637B (en) * 2019-09-26 2023-02-03 上海仪电(集团)有限公司中央研究院 Data decompression device and method based on FPGA
CN110620637A (en) * 2019-09-26 2019-12-27 上海仪电(集团)有限公司中央研究院 Data decompression device and method based on FPGA
US11791838B2 (en) 2021-01-15 2023-10-17 Samsung Electronics Co., Ltd. Near-storage acceleration of dictionary decoding
EP4138394A3 (en) * 2021-08-19 2023-03-01 Intel Corporation Interleaving of variable bitrate streams for gpu implementations
CN113839679A (en) * 2021-08-31 2021-12-24 山东云海国创云计算装备产业创新中心有限公司 Huffman decoding system, method, equipment and computer readable storage medium
CN113839679B (en) * 2021-08-31 2023-09-15 山东云海国创云计算装备产业创新中心有限公司 Huffman decoding system, method, equipment and computer readable storage medium

Also Published As

Publication number Publication date
WO2009005758A3 (en) 2009-04-02

Similar Documents

Publication Publication Date Title
WO2009005758A2 (en) System and method for compression processing within a compression engine
US7538695B2 (en) System and method for deflate processing within a compression engine
US7538696B2 (en) System and method for Huffman decoding within a compression engine
US5406279A (en) General purpose, hash-based technique for single-pass lossless data compression
JP2610084B2 (en) Data expansion method and apparatus, and data compression / expansion method and apparatus
US9362948B2 (en) System, method, and computer program product for saving and restoring a compression/decompression state
US5293379A (en) Packet-based data compression method
US7814284B1 (en) Redundancy elimination by aggregation of multiple chunks
US6597812B1 (en) System and method for lossless data compression and decompression
EP0666651B1 (en) Apparatus and method for lempel ziv data compression with management of multiple dictionaries in content addressable memory
JP3342700B2 (en) Single clock cycle data compressor / decompressor with string reversal mechanism
JP7031828B2 (en) Methods, devices, and systems for data compression and decompression of semantic values
CN112514270B (en) Data compression
US10158376B2 (en) Techniques to accelerate lossless compression
US8106799B1 (en) Data compression and decompression using parallel processing
JPH0368219A (en) Data compressor and method of compressing data
JP2863065B2 (en) Data compression apparatus and method using matching string search and Huffman coding, and data decompression apparatus and method
CN108702160B (en) Method, apparatus and system for compressing and decompressing data
US11955995B2 (en) Apparatus and method for two-stage lossless data compression, and two-stage lossless data decompression
CN109075798B (en) Variable size symbol entropy-based data compression
JP2003521189A (en) Data compression with more effective compression
US20190379393A1 (en) Dynamic dictionary-based data symbol encoding
EP3295568B1 (en) Improved compressed caching in a virtual memory system
US10340945B2 (en) Memory compression method and apparatus
RU2450441C1 (en) Data compression method and apparatus

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 08779872

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 08779872

Country of ref document: EP

Kind code of ref document: A2