JP4261779B2

JP4261779B2 - Data compression apparatus and method

Info

Publication number: JP4261779B2
Application number: JP2001067975A
Authority: JP
Inventors: 宣子佐藤
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2000-03-31
Filing date: 2001-03-12
Publication date: 2009-04-30
Anticipated expiration: 2021-03-12
Also published as: JP2001345710A

Description

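[0001]
[Technical Field of the Invention]
The present invention relates to an apparatus and method that compress a data string by using a dictionary generated from the data string to be compressed. The invention is applicable not only to character codes but to many kinds of data. In the following, however, in keeping with information-theoretic usage, a data string is divided into word units, one word of data is called a character, and a data string of any number of words is called a character string.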
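[0002]
[Prior Art]
In recent years, as computers have come to handle many kinds of data such as character codes and image data, the volume of data handled has also grown. When handling such large volumes, compressing the data by removing its redundant portions reduces the required storage capacity and allows faster transmission to remote sites.
[0003]
Conventional data compression techniques include dictionary-based coding, which exploits the similarity of data sequences, and probabilistic (statistical) coding, which exploits the frequency of occurrence of data strings. LZ77 coding and LZ78 coding are known as representative dictionary-based methods (Tomohiko Uematsu, “Introduction to Document Data Compression Algorithms”, CQ Publishing, pp. 131-208, 1995). Of the two, LZ77 coding is the mainstream in practice because it achieves a sufficient compression ratio with simple processing.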
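[0004]
In LZ77 coding, as shown in FIG. 21, a slide buffer 1 of fixed size is provided, the character string in buffer 1 that has the longest match with the input character string is searched for, and the input character string is encoded using the position and length of that match. Because buffer 1 slides as encoding proceeds, this method is also called the slide dictionary method.
[0005]
In FIG. 21, when the input character string “abcdaaaq...” to the right of buffer 1 is encoded, the longest matching string in buffer 1 is “abcd”. The relative address “5 (bytes)” between the start of this longest matching string and the start of the input string is taken as the match position, the length “4 (bytes)” of the longest matching string is taken as the match length, and a code of the form (match position, match length) = (5, 4) is generated. The leading “abcd” of the input string is thereby replaced with (5, 4). In the same way, the next string “aaa” is replaced with the code (13, 3).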
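As a point of reference for the conventional search just described, the following is a minimal sketch of an exhaustive longest-match search over a sliding window. The function name `longest_match`, the `window` and `min_len` parameters, and the sample bytes are assumptions made for illustration; the buffer contents of FIG. 21 are not reproduced here.

```python
from typing import Optional, Tuple


def longest_match(data: bytes, pos: int, window: int = 4096,
                  min_len: int = 3) -> Optional[Tuple[int, int]]:
    """Return (relative position, length) of the longest earlier match
    for data[pos:], or None if no match reaches min_len."""
    best_dist, best_len = 0, 0
    # Try every start position inside the window (this is the slow part).
    for cand in range(max(0, pos - window), pos):
        length = 0
        while (pos + length < len(data)
               and data[cand + length] == data[pos + length]):
            length += 1
        if length > best_len:
            best_dist, best_len = pos - cand, length
    return (best_dist, best_len) if best_len >= min_len else None


# Hypothetical sample: "abcd" at offset 5 repeats the "abcd" at offset 0.
print(longest_match(b"abcdeabcdx", 5))  # (5, 4)
```

Every candidate start in the window is compared, which is why practical encoders index prefixes instead of scanning the whole buffer.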
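[0006]
In practice, however, the slide buffer is much longer, and searching the strings in the buffer one after another to find the longest match takes an enormous amount of time. For this reason, instead of collating against every string in the buffer, the positions at which string prefixes (about two to four characters) occur are registered in a table as they appear, and collation is performed only against the strings at the positions held in the table. Tables used for this kind of search include the look-up table (LUT) and the hash table.
[0007]
FIG. 22 shows a character string search using an LUT. LUT 2 in FIG. 22 uses the prefix of a string in buffer 1 as an address and stores the position (address or pointer) at which that string occurs in buffer 1. At search time, the area of LUT 2 is accessed using the prefix of the input string as the address, and the position of the corresponding string is obtained.
[0008]
When buffer 1 contains several strings with the same prefix, their positions are held in the form of a linked list 3. A single access to LUT 2 therefore yields the positions of all corresponding strings in buffer 1. Here a two-character prefix is used, and the area of LUT 2 corresponding to the prefix “ab” of the input string holds two positions by means of linked list 3.
[0009]
Because an LUT maps each search string one-to-one to a table area and obtains the needed information with a single table lookup, it allows very fast searching. When long strings are searched, however, the number of areas the table needs grows as a power of the number of possible characters and becomes enormous. For example, if the number of possible characters is 2⁸ = 256, an n-character prefix requires 256ⁿ areas.
[0010]
Yet once the search string becomes somewhat long, only a fraction of the prepared areas is actually used (registered), and the table becomes sparse. Memory efficiency therefore deteriorates when long strings are searched.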
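[0011]
A hash table therefore folds the search string so that several strings share one area. After the table lookup it is then necessary to check whether the string obtained is actually the one being searched for, but a longer string can be searched with the same table area than with an LUT.
[0012]
FIG. 23 shows a character string search using a hash table. The hash code generation unit 4 in FIG. 23 generates a hash code 5 from the prefix “abc” of the input string and uses it as an address to access hash table 6. Hash table 6 stores the position in buffer 1 that corresponds to hash code 5, and collating the string “abcde” at that position with the input string checks whether their prefixes match. If they match, it is determined that a string matching the input string exists in buffer 1.
[0013]
With a hash table, as with an LUT, multiple positions are held in linked-list form for multiple strings in buffer 1 that have the same prefix. In either case, the linked list is used to search for the longest matching string.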
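[0014]
[Problems to Be Solved by the Invention]
However, the conventional data compression techniques described above have the following problems.
When long strings are searched with an LUT, as noted above, only part of the table is used even though a table with an enormous number of areas is prepared, so the table is sparse. A hash table is smaller than an LUT, but it too becomes sparse when the input data is small. Memory is thus not necessarily used effectively.
[0015]
In addition, when searching for the longest matching string, the multiple positions held in the linked list must be followed one by one, so the search takes time when many strings share the same prefix.
[0016]
An object of the present invention is to provide a data compression apparatus and method that, in data compression based on dictionary-based coding, perform character string search with a reasonable amount of memory that depends on the input data and carry out the longest-match search efficiently.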
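[0017]
[Means for Solving the Problems]
FIG. 1 is a diagram showing the principle of the data compression apparatus of the present invention. The apparatus in FIG. 1 comprises data storage means 11, sort means 12, appearance position storage means 13, detection means 14, and encoding means 15.
[0018]
Data storage means 11 stores the character string data to be compressed, and sort means 12 rearranges the strings that start at each of a plurality of addresses in data storage means 11 according to the content of each string. Appearance position storage means 13 stores, in the order of the rearranged strings, address information representing the address of each string. Detection means 14 detects repeated strings based on the address information stored in appearance position storage means 13, and encoding means 15 encodes and outputs the detected repeated strings.
[0019]
First, each character of the string data to be compressed is stored at one of the addresses in data storage means 11. Next, sort means 12 rearranges the strings starting at those addresses into a predetermined order based on their content and stores the address information of each string, in that order, in appearance position storage means 13.
[0020]
Next, detection means 14 refers to the relationship between each item of address information stored in appearance position storage means 13 and the rank (storage position) of that item within appearance position storage means 13, and detects strings that occur repeatedly in data storage means 11. Encoding means 15 then encodes and outputs the second and subsequent occurrences of a repeated string.
[0021]
With such an apparatus, the strings occurring in data storage means 11 are rearranged in a regular order according to their content and stored in appearance position storage means 13. By referring to appearance position storage means 13, the positions at which the same string occurs can therefore be found easily, which makes string search more efficient. Arranging the strings so that identical strings are adjacent to one another makes the longest-match search more efficient still.
[0022]
Moreover, the number of items of address information in appearance position storage means 13 is about the same as the number of addresses in data storage means 11, which holds the data to be compressed, so string search can be performed with an amount of memory roughly proportional to the input data.
[0023]
For example, data storage means 11, sort means 12, appearance position storage means 13, and detection means 14 in FIG. 1 correspond respectively to input buffer 21, sort unit 25, appearance position holding unit 26, and match detection unit 22 in FIG. 5, described later, and encoding means 15 in FIG. 1 corresponds to code generation unit 23 and code output unit 24 in FIG. 5.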
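[0024]
[Embodiments of the Invention]
Embodiments of the present invention are described in detail below with reference to the drawings.
In this embodiment, an input buffer that holds the input string is provided, and the strings starting at each address in the buffer are rearranged according to their content to generate a rank list. String search is then performed using this rank list as a dictionary to obtain the match position and match length.
[0025]
For example, when an input buffer like that in FIG. 2 is provided, a three-character prefix is first extracted from the string starting at each address in the buffer, and a rank list like that in FIG. 3 is generated. The rank list in FIG. 3 corresponds to a table with about the same number of elements (records) as the input buffer in FIG. 2, and each record stores the address at which the corresponding prefix occurs in the input buffer of FIG. 2.
[0026]
Here, the input buffer holds the 34-byte input string “compression_decompress_compression”, and the prefixes “com”, “omp”, and “mpr” are extracted from addresses 1, 2, and 3, respectively. Three-character prefixes are extracted from the other addresses in the same way. The symbol “_” represents a space. The rank list stores the addresses “1” to “32”, which correspond to the positions of these prefixes.
[0027]
Next, the addresses held in this rank list are rearranged in the code order of the characters of the corresponding prefixes, generating a rank list like that in FIG. 4. The rank list in FIG. 4 has the same number of records as that in FIG. 3 and holds the address of each prefix in the rearranged order.
[0028]
Here, identical prefixes contained in the input buffer, such as “com” and “ssi”, are placed next to each other in order of occurrence. As a result, among the strings that share a prefix with a given string, the address of the most recent earlier occurrence is always stored in the record immediately before (one rank above) the record holding the address of the given string. A matching string can therefore be found easily by comparing the string to be encoded in the input buffer with the string at the address held in the immediately preceding record.
[0029]
Because the addresses of the other match candidates are also stored consecutively, a longest-match search only needs to compare the string to be encoded with the strings at several consecutively stored addresses, which speeds up the longest-match search. Furthermore, because the input buffer and the rank list are of about the same length, the information needed for the search can be stored in an amount of memory roughly proportional to the length of the input buffer.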
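The construction of the sorted rank list can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: the names `SAMPLE` and `build_rank_list` are assumptions, addresses are 1-based as in the description, and Python's stable `sorted` stands in for whichever sort the apparatus uses. The literal “_” is used in place of a space; it also sorts before lowercase letters, so the order is unchanged.

```python
SAMPLE = "compression_decompress_compression"  # 34 bytes, "_" stands for a space


def build_rank_list(text: str, n: int = 3) -> list:
    """Return the 1-based start addresses of all n-character prefixes,
    ordered by prefix content. The sort is stable, so identical prefixes
    stay in order of occurrence."""
    last_start = len(text) - n + 1            # 32 for the 34-byte sample
    return sorted(range(1, last_start + 1),
                  key=lambda a: text[a - 1:a - 1 + n])


odr2p = build_rank_list(SAMPLE)
print(len(odr2p))   # 32
print(odr2p[:5])    # [23, 12, 1, 15, 24]
```

The three “com” entries (addresses 1, 15, 24) end up adjacent and in order of occurrence, which is the property the search relies on.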
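[0030]
Next, compression using the rank list of FIG. 4 is described in more detail with reference to FIGS. 5 to 18.
FIG. 5 is a block diagram of the data compression apparatus of this embodiment. The apparatus in FIG. 5 is configured using, for example, a computer and comprises input buffer 21, match detection unit 22, code generation unit 23, code output unit 24, sort unit 25, and appearance position holding unit 26.
[0031]
Input buffer 21 holds the input string as the data string to be compressed. Sort unit 25 rearranges the strings starting at each address in input buffer 21 according to their content and generates rank list 27, which holds the string addresses in the rearranged order. Appearance position holding unit 26 holds the rank list as appearance position information.
[0032]
Match detection unit 22 detects repeated strings in input buffer 21 based on the information in rank list 27 and passes them, together with the other strings, to code generation unit 23. Code generation unit 23 generates codes for the strings received from match detection unit 22, and code output unit 24 outputs the generated codes as compressed data. Three methods of searching for repeated strings using rank list 27 can be considered:
(1) a method using a reverse lookup table (reverse lookup list)
(2) a method using a match position table (match position list)
(3) a method using a search table (hash table)
First, FIGS. 6 to 9 show data compression using the reverse lookup list.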
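[0033]
In this case, match detection unit 22 comprises reverse lookup list 31 and collation unit 32, as shown in FIG. 6. Reverse lookup list 31 stores information for obtaining, from the address of the string to be encoded in input buffer 21, the rank of that string in rank list 27. Match detection unit 22 then adopts as match candidates the strings that start at the addresses held at ranks above the rank obtained from reverse lookup list 31.
[0034]
Collation unit 32 collates the string to be encoded with a match candidate and obtains the length of the matching portion. Code generation unit 23 then encodes the string using the obtained length as the match length and the address of the matched string as the match position. When a longest-match search is performed, the candidate with the longest match length among the multiple candidates is encoded.
[0035]
For example, the reverse lookup list and rank list of FIG. 7 are generated from the input string of FIG. 2. Rank list Odr2P[] in FIG. 7 is the same as the rank list in FIG. 4. Reverse lookup list P2Odr[] is easily generated by storing the number representing the rank of each address held in Odr2P[] in the record corresponding to that address. For example, for the first address “23” in Odr2P[], rank number “1” is stored in the record at address “23” of P2Odr[].
[0036]
When detecting a repeated string, match detection unit 22 accesses P2Odr[] and Odr2P[] based on the address of the string to be encoded and obtains the match candidate strings.
[0037]
For example, if the string “compression” starting at address “24” of input buffer InBuf[] is to be encoded, rank number “5” held at address “24” of P2Odr[] is obtained and that rank of Odr2P[] is accessed. Next, addresses “1” and “15”, held at the higher ranks “3” and “4”, are obtained. The strings starting at those addresses, “compression_decom...” and “compress_com...”, become the match candidates.
[0038]
When no longest-match search is performed, only the string one rank above, “compress_com...”, is a match candidate; when a longest-match search is performed, both strings are match candidates.
[0039]
Providing a reverse lookup list thus makes the rank list easy to access and makes string search more efficient. Since the reverse lookup list has the same length as the rank list, the two together still store the necessary information in an amount of memory roughly proportional to the length of the input buffer.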
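The relationship between the two lists can be illustrated as follows. This is a sketch under assumptions: `build_lists` and `candidates` are hypothetical helper names, both lists carry an unused slot 0 so that ranks and addresses are 1-based as in the description, and the sample string is the one from FIG. 2.

```python
def build_lists(text: str, n: int = 3):
    """Build the rank list Odr2P[] and its inverse P2Odr[] (both 1-based)."""
    odr2p = [0] + sorted(range(1, len(text) - n + 2),
                         key=lambda a: text[a - 1:a - 1 + n])
    p2odr = [0] * len(odr2p)
    for rank in range(1, len(odr2p)):
        p2odr[odr2p[rank]] = rank        # address -> rank
    return odr2p, p2odr


def candidates(text: str, t: int, odr2p, p2odr, n: int = 3) -> list:
    """Addresses of earlier strings sharing the n-character prefix at t,
    most recent first (walking upward from the rank just above t)."""
    target = text[t - 1:t - 1 + n]
    found = []
    rank = p2odr[t] - 1
    while rank >= 1 and text[odr2p[rank] - 1:odr2p[rank] - 1 + n] == target:
        found.append(odr2p[rank])
        rank -= 1
    return found


text = "compression_decompress_compression"
odr2p, p2odr = build_lists(text)
print(p2odr[23], p2odr[24])                  # 1 5
print(candidates(text, 24, odr2p, p2odr))    # [15, 1]
```

Taking only the first element of the returned list corresponds to the search without longest match; comparing against all of them corresponds to the longest-match search.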
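[0040]
FIG. 8 is a flowchart of data compression using the match detection unit of FIG. 6. In this process no longest-match search is performed; only the most recent match candidate is searched for.
[0041]
The data compression apparatus first reads data of a predetermined size BUFSIZE into input buffer InBuf[] and sets variable t, which represents the encoding position, to 1 (step S1). It also arranges the three-character strings starting at each address of InBuf[] in alphabetical order to create rank list Odr2P[], and creates reverse lookup list P2Odr[] for Odr2P[].
[0042]
Next, it checks whether the string starting at address t has appeared before t. Here, variable odr, the rank of the most recent match candidate, is first set to P2Odr[t] - 1, and variable p, the address of the match position, is set to Odr2P[odr] (step S2). odr corresponds to the rank one above that of the string starting at the encoding position in Odr2P[].
[0043]
Next, the three-character string Ct = (InBuf[t], InBuf[t+1], InBuf[t+2]) starting at address t is compared with the three-character string Cp starting at address p (step S3).
[0044]
If Ct and Cp match, the string beginning with Cp is taken as the match candidate, and the match length is then obtained. Variable s, representing the match length, is first set to 3 (step S4), and InBuf[t+s] is compared with InBuf[p+s] (step S5). If they match, s = s + 1 is set (step S6) and step S5 is repeated.
[0045]
If InBuf[t+s] and InBuf[p+s] do not match in step S5, (p, s) is output as the code, t = t + s is set (step S7), and t is compared with BUFSIZE (step S8). If t < BUFSIZE, the processing from step S2 onward is repeated.
[0046]
If t ≥ BUFSIZE in step S8, it is next checked whether the data to be compressed has ended (step S9). If data remains, the processing from step S1 onward is repeated; if none remains, processing ends.
[0047]
If Ct and Cp do not match in step S3, there is no match candidate, so the first character InBuf[t] of Ct is output as is as the code, t = t + 1 is set (step S10), and the processing from step S8 onward is performed.
[0048]
For example, for the data to be compressed in FIG. 7, the process of FIG. 8 generates compressed data such as “compression_de(1,8)_(15,8)(9,3)”.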
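The flow of FIG. 8 can be approximated by the sketch below. It is a simplified reading of the flowchart rather than a definitive implementation: it handles a single buffer, adds explicit bounds checks that the flowchart leaves implicit, and uses the hypothetical names `compress_recent` and `render`. Literals are emitted as characters and matches as (position, length) pairs.

```python
def compress_recent(text: str, n: int = 3) -> list:
    """Encode text using only the most recent candidate (one rank above)."""
    size = len(text)
    odr2p = [0] + sorted(range(1, size - n + 2),
                         key=lambda a: text[a - 1:a - 1 + n])
    p2odr = {addr: rank for rank, addr in enumerate(odr2p) if rank}
    out, t = [], 1
    while t <= size:
        odr = p2odr.get(t, 1) - 1                 # step S2 (0 means no rank above)
        p = odr2p[odr] if odr >= 1 else 0
        if p and text[p - 1:p - 1 + n] == text[t - 1:t - 1 + n]:   # step S3
            s = n                                 # step S4
            while t + s <= size and text[p + s - 1] == text[t + s - 1]:
                s += 1                            # steps S5 and S6
            out.append((p, s))                    # step S7
            t += s
        else:
            out.append(text[t - 1])               # step S10: literal
            t += 1
    return out


def render(tokens: list) -> str:
    """Format tokens the way the description writes compressed data."""
    return "".join(tok if isinstance(tok, str) else f"({tok[0]},{tok[1]})"
                   for tok in tokens)


print(render(compress_recent("compression_decompress_compression")))
# compression_de(1,8)_(15,8)(9,3)
```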
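[0049]
FIG. 9 is a flowchart of data compression when a longest-match search is performed. Steps S11, S13 to S16, and S22 to S24 of FIG. 9 are the same as steps S1, S3 to S6, and S8 to S10 of FIG. 8.
[0050]
After creating Odr2P[] and P2Odr[] in step S11, the data compression apparatus sets odr = P2Odr[t] - 1 and p = Odr2P[odr] (step S12). At this time it also sets variable pre, the match position of the longest matching string, to p, and variable len, its match length, to 0. Then, through steps S13 to S16 (which correspond to steps S3 to S6 of FIG. 8), it obtains the match length s of the most recent match candidate.
[0051]
Next, s is compared with len (step S17), and if s > len, len = s and pre = p are set (step S18). Then, to look for a longer match candidate, odr = odr - 1 and p = Odr2P[odr] are set (step S19), and Ct and Cp are compared (step S20). If s ≤ len in step S17, the processing from step S19 onward is performed without updating len and pre.
[0052]
If Ct and Cp match, a new match candidate has been found, so the processing from step S14 onward is repeated, and len and pre are updated if the match length of that candidate is longer than len.
[0053]
When Ct and Cp no longer match in step S20, (pre, len) is output as the code, t = t + len is set (step S21), and the processing from step S22 onward is performed. In the end, the position and length of the string with the longest match are output as the code.
[0054]
For example, for the data to be compressed in FIG. 7, the process of FIG. 9 generates compressed data such as “compression_de(1,8)_(1,11)”.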
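The longest-match variant of FIG. 9 differs only in that it keeps walking up the rank list while the prefix still matches. The sketch below is one possible reading, with hypothetical names (`compress_longest`, `match_length`, `render`) and with bounds checks added; it is not taken from the patent itself.

```python
def match_length(text: str, p: int, t: int, start: int) -> int:
    """Extend a match between 1-based addresses p and t beyond `start` chars."""
    s = start
    while t + s <= len(text) and text[p + s - 1] == text[t + s - 1]:
        s += 1
    return s


def compress_longest(text: str, n: int = 3) -> list:
    """Encode text, choosing the longest match among same-prefix candidates."""
    size = len(text)
    odr2p = [0] + sorted(range(1, size - n + 2),
                         key=lambda a: text[a - 1:a - 1 + n])
    p2odr = {addr: rank for rank, addr in enumerate(odr2p) if rank}
    out, t = [], 1
    while t <= size:
        pre, best = 0, 0                          # longest match so far
        odr = p2odr.get(t, 1) - 1                 # step S12
        while odr >= 1:
            p = odr2p[odr]
            if text[p - 1:p - 1 + n] != text[t - 1:t - 1 + n]:
                break                             # step S20: prefixes differ
            s = match_length(text, p, t, n)       # steps S14 to S16
            if s > best:                          # steps S17 and S18
                pre, best = p, s
            odr -= 1                              # step S19
        if best:
            out.append((pre, best))               # step S21
            t += best
        else:
            out.append(text[t - 1])               # literal
            t += 1
    return out


def render(tokens: list) -> str:
    return "".join(tok if isinstance(tok, str) else f"({tok[0]},{tok[1]})"
                   for tok in tokens)


print(render(compress_longest("compression_decompress_compression")))
# compression_de(1,8)_(1,11)
```

Because identical prefixes are adjacent in the rank list, the inner loop stops at the first rank whose prefix differs, so only genuine candidates are ever compared.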
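Next, FIGS. 10 to 13 show data compression using a match position list. In this process the input data is first converted into a match position list and then compressed.
[0055]
In this case, match detection unit 22 comprises match position list 41, region detection unit 42, and collation unit 43, as shown in FIG. 10. Match position list 41 is generated from rank list 27 and stores information for obtaining, from the address of each string in input buffer 21, the position of the most recent occurrence of the same string (the match position).
[0056]
For example, rank list Odr2P[] of FIG. 7 is converted into match position list P2PreP[] as shown in FIG. 11. P2PreP[] consists of the same number of records as there are elements in the input buffer. The record for each address stores, as the match position, the address held in Odr2P[] one rank above the rank of the prefix starting at that address. If the prefix registered one rank above is different, however, the symbol “N” is stored, indicating that there is no match candidate.
[0057]
When generating P2PreP[], match detection unit 22 examines the addresses held in Odr2P[] in order from the top. If the prefix registered at the rank of interest is the same as the prefix registered one rank above, the address held one rank above is stored in the record corresponding to the address held at the rank of interest.
[0058]
If the prefix registered at the rank of interest differs from the prefix registered one rank above, the symbol “N” is stored in the record corresponding to the address held at the rank of interest. Repeating these operations easily generates P2PreP[].
[0059]
For example, for the first address “23” of Odr2P[], no record exists one rank above, so the symbol “N” is stored at address “23” of P2PreP[]. For the fifth address “24” of Odr2P[], the record of the fourth address “15” corresponds to the same prefix “com”, so the address “15” held in the fourth record is stored at address “24” of P2PreP[].
[0060]
Region detection unit 42 in FIG. 10 compares the values (addresses) of adjacent records of match position list 41 and detects regions in which the values indicating the match position increase consecutively by 1. Code generation unit 23 then encodes the string, taking the first value of the region as the match position and deriving the match length from the length of the run.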
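[0061]
For example, in match position list P2PreP[] of FIG. 11, the record values increase consecutively by 1 from “1” to “6” in the region of addresses “15” to “20”. The prefix length “3” registered in the rank list is added to the length “6” of this region and “1” is subtracted, giving a match length of “8 (= 6 + 3 - 1)”. With the value “1” of the first record as the match position, a code such as (1, 8) is generated.
[0062]
When a longest-match search is performed, region detection unit 42 detects, in match position list 41, portions where two or more consecutive regions are joined. Match detection unit 22 then analyzes the addresses held in the detected consecutive regions and obtains the positions of multiple match candidates.
[0063]
Next, collation unit 43 collates the string to be encoded with each match candidate string and obtains the length of the matching portion. Code generation unit 23 then encodes the string using the match position and match length of the candidate with the longest match.
[0064]
For example, in match position list P2PreP[] of FIG. 11, the record values increase consecutively from “15” to “20” in the region of addresses “24” to “29”, and from “7” to “9” in the region of addresses “30” to “32”. Because these two consecutive regions are joined, a longest-match search is performed with the string “compression” starting at address “24” as the encoding target.
[0065]
In this case, the values of the second consecutive region “30” to “32” show that, for address “24” of the first consecutive region, there is a candidate at address “1” that is longer than the candidate at address “15”. Its match length is obtained by adding the prefix length “3” to the combined length “9” of the two consecutive regions and subtracting “1”, giving “11 (= 9 + 3 - 1)”. A code such as (1, 11) is thus generated.
[0066]
When three or more consecutive regions are joined, a code representing the match position and match length of the longest matching string is generated in the same way. In general, when n consecutive regions are joined, at least n match candidates exist, and the longest matching string is among them.
[0067]
Converting the rank list into a match position list thus makes the match position and match length easy to obtain and makes string search more efficient. Since the match position list has the same length as the input buffer, the necessary information can be stored in an amount of memory proportional to the length of the input buffer. In FIG. 11 the addresses of the match positions themselves are stored in the match position list, but the relative address from each address to its match position may be stored instead.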
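[0068]
FIG. 12 is a flowchart of data compression using the match detection unit of FIG. 10. In this process no longest-match search is performed; only the most recent match candidate is searched for.
[0069]
The data compression apparatus first reads BUFSIZE worth of data into input buffer InBuf[] and sets variable t to 1 (step S31). It also creates rank list Odr2P[] from the data in InBuf[] and creates match position list P2PreP[] from Odr2P[].
[0070]
Next, P2PreP[t] is compared with “N” to check whether a match candidate exists for the string starting at address t (step S32). If the value is not “N”, a match candidate exists, so variable s, which represents “length of the consecutive region - 1”, is set to 0, and P2PreP[t+s] is compared with P2PreP[t+s+1] - 1 (step S34).
[0071]
If P2PreP[t+s] and P2PreP[t+s+1] - 1 match, P2PreP[t+s+1] is not “N” and represents a value exactly 1 greater than P2PreP[t+s]. So s = s + 1 is set (step S35) and step S34 is repeated.
[0072]
If P2PreP[t+s] and P2PreP[t+s+1] - 1 do not match in step S34, P2PreP[t] is taken as the match position and s + 3 as the match length, and the code (P2PreP[t], (s + 3)) is output (step S36). Then t = t + s + 3 is set and t is compared with BUFSIZE (step S37). If t < BUFSIZE, the processing from step S32 onward is repeated.
[0073]
If t ≥ BUFSIZE in step S37, it is next checked whether the data to be compressed has ended (step S38). If data remains, the processing from step S31 onward is repeated; if none remains, processing ends.
[0074]
If P2PreP[t] is “N” in step S32, there is no match candidate, so InBuf[t] is output as is as the code, t = t + 1 is set (step S39), and the processing from step S37 onward is performed.
[0075]
For example, for the data to be compressed in FIG. 7, the process of FIG. 12 generates compressed data such as “compression_de(1,8)_(15,8)(9,3)”.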
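The match position list and the run detection of FIG. 12 can be sketched as follows. The names `build_match_positions`, `compress_runs`, and `render` are hypothetical, `None` plays the role of the symbol “N”, and the list is padded with one extra slot so that the look-ahead at t+s+1 never leaves the array (an assumption; the flowchart does not say how the end of the buffer is guarded).

```python
def build_match_positions(text: str, n: int = 3) -> list:
    """P2PreP[]: for each 1-based address, the address one rank above in the
    sorted rank list if it has the same n-character prefix, else None."""
    size = len(text)
    odr2p = sorted(range(1, size - n + 2), key=lambda a: text[a - 1:a - 1 + n])
    p2prep = [None] * (size + 2)          # slot 0 unused, one slot of padding
    for above, cur in zip(odr2p, odr2p[1:]):
        if text[above - 1:above - 1 + n] == text[cur - 1:cur - 1 + n]:
            p2prep[cur] = above
    return p2prep


def compress_runs(text: str, n: int = 3) -> list:
    """Encode text by detecting runs of values that increase by 1."""
    size = len(text)
    p2prep = build_match_positions(text, n)
    out, t = [], 1
    while t <= size:
        if p2prep[t] is None:                          # step S32
            out.append(text[t - 1])                    # step S39: literal
            t += 1
            continue
        s = 0
        while (p2prep[t + s + 1] is not None
               and p2prep[t + s + 1] == p2prep[t + s] + 1):   # step S34
            s += 1                                     # step S35
        out.append((p2prep[t], s + n))                 # step S36
        t += s + n
    return out


def render(tokens: list) -> str:
    return "".join(tok if isinstance(tok, str) else f"({tok[0]},{tok[1]})"
                   for tok in tokens)


p2prep = build_match_positions("compression_decompress_compression")
print(p2prep[15:21])   # [1, 2, 3, 4, 5, 6]
print(render(compress_runs("compression_decompress_compression")))
# compression_de(1,8)_(15,8)(9,3)
```

No character comparison is needed during encoding in this variant: a run of s + 1 consecutive values already implies that s + 3 characters match.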
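[0076]
FIG. 13 is a flowchart of data compression when a longest-match search is performed. Steps S41 to S42, S44 to S45, and S50 to S52 of FIG. 13 are the same as steps S31 to S32, S34 to S35, and S37 to S39 of FIG. 12.
[0077]
If P2PreP[t] is not “N” in step S42, the data compression apparatus sets s = 0 and sets variable p, the match position of the longest matching string, to P2PreP[t] (step S43). The value of s is then updated through steps S44 to S45.
[0078]
If P2PreP[t+s] and P2PreP[t+s+1] - 1 do not match in step S44, P2PreP[t+s+1] is next compared with “N” to check whether there is a next consecutive region joined to the first one (step S46).
[0079]
For example, in the case of FIG. 11, when t = 24 and s = 5, P2PreP[24+5] = 20 and P2PreP[24+5+1] - 1 = 7 - 1 = 6. The two do not match, so P2PreP[30] = 7 is compared with “N”.
[0080]
If P2PreP[t+s+1] is not “N”, a next consecutive region exists. The address P2PreP[t+s+1] - (s+1), derived from the first value P2PreP[t+s+1] of that region, is taken as the position of a new match candidate, and the string there is compared with the string to be encoded.
[0081]
Here, the string of length s+1 starting at address t is written Str(t, s) = (InBuf[t], InBuf[t+1], ..., InBuf[t+s]), and Str(P2PreP[t+s+1] - (s+1), s) is compared with Str(t, s) (step S47).
[0082]
If these strings match, the new match candidate is regarded as the longest matching string, s = s + 1 and p = P2PreP[t+s+1] - (s+1) are set (step S48), and the processing from step S44 onward is repeated.
[0083]
When the two strings no longer match in step S47, (p, (s+3)) is output as the code, t = t + s + 3 is set (step S49), and the processing from step S50 onward is performed.
[0084]
If P2PreP[t+s+1] is “N” in step S46, no next consecutive region exists, so the processing from step S49 onward is performed directly. In the end, the position and length of the string with the longest match are output as the code.
[0085]
In the case of FIG. 11, Str(P2PreP[24+5+1] - (5+1), 5) = Str(1, 5) is compared with Str(24, 5) in step S47. Both strings represent “compre”, so s = 6 and p = 1 are set and the processing from step S44 onward is repeated.
[0086]
Then, when s = 8, P2PreP[24+8+1] = N in step S46, so a code such as (1, (8+3)) = (1, 11) is generated. In the end, compressed data such as “compression_de(1,8)_(1,11)” is generated.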
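The chaining of joined regions in FIG. 13 can be sketched as below. This is one interpretation of steps S44 to S49, not a definitive implementation: the candidate address is computed from the start of the next region before s is advanced, `None` stands for “N”, and the names `compress_chained`, `build_match_positions`, and `render` are hypothetical.

```python
def build_match_positions(text: str, n: int = 3) -> list:
    """P2PreP[] with None for "N" (1-based, one padding slot at the end)."""
    size = len(text)
    odr2p = sorted(range(1, size - n + 2), key=lambda a: text[a - 1:a - 1 + n])
    p2prep = [None] * (size + 2)
    for above, cur in zip(odr2p, odr2p[1:]):
        if text[above - 1:above - 1 + n] == text[cur - 1:cur - 1 + n]:
            p2prep[cur] = above
    return p2prep


def compress_chained(text: str, n: int = 3) -> list:
    """Encode text, following joined runs to find a longer earlier match."""
    size = len(text)
    p2prep = build_match_positions(text, n)
    out, t = [], 1
    while t <= size:
        if p2prep[t] is None:                          # step S42
            out.append(text[t - 1])
            t += 1
            continue
        s, p = 0, p2prep[t]                            # step S43
        while True:
            while (p2prep[t + s + 1] is not None
                   and p2prep[t + s + 1] == p2prep[t + s] + 1):
                s += 1                                 # steps S44 and S45
            nxt = p2prep[t + s + 1]                    # step S46
            if nxt is None:
                break
            cand = nxt - (s + 1)                       # start of the new candidate
            # Step S47: compare Str(cand, s) with Str(t, s), both s+1 chars long.
            if cand < 1 or text[cand - 1:cand + s] != text[t - 1:t + s]:
                break
            p, s = cand, s + 1                         # step S48
        out.append((p, s + n))                         # step S49
        t += s + n
    return out


def render(tokens: list) -> str:
    return "".join(tok if isinstance(tok, str) else f"({tok[0]},{tok[1]})"
                   for tok in tokens)


print(render(compress_chained("compression_decompress_compression")))
# compression_de(1,8)_(1,11)
```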
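[0087]
Next, FIGS. 14 to 16 show data compression using a hash table. In this process the rank list is accessed using a hash table instead of the reverse lookup list of FIG. 7.
[0088]
In this case, match detection unit 22 comprises hash table 51, collation unit 52, and update unit 53, as shown in FIG. 14. Hash table 51 stores information for obtaining, from the prefix of the string to be encoded in input buffer 21, the rank in rank list 27 of a string having the same prefix. Match detection unit 22 then adopts as match candidates the strings starting at the addresses held at the rank obtained from hash table 51 or at ranks above it.
[0089]
Collation unit 52 collates the string to be encoded with a match candidate and obtains the length of the matching portion. Code generation unit 23 then encodes the string using the obtained length as the match length and the address of the matched string as the match position. When a longest-match search is performed, the candidate with the longest match length among the multiple candidates is encoded. Update unit 53 changes the rank obtained from hash table 51 to the rank of the most recently occurring string that has the same prefix.
[0090]
FIG. 15 shows an example of accessing the rank list using such a hash table. Rank list Odr2P[] in FIG. 15 is the same as the rank list in FIG. 4. Hash table hash2Odr[] stores rank numbers in Odr2P[], with hash values as addresses. The hash value used to access this table is generated, for example, by a hash code generation unit 4 like the one shown in FIG. 23, using a hash function H. The size of this table is generally 2^M and is specified by the integer M.
[0091]
When multiple identical prefixes are registered in Odr2P[], hash2Odr[] in its initial state holds, for the hash value obtained from that prefix, the rank number one above the block of those prefixes. For example, the three-character prefix “com” is registered at the third, fourth, and fifth positions of Odr2P[], but at the start of compression rank number “2” is stored at the address corresponding to the hash value H(“com”).
[0092]
When detecting a repeated string, match detection unit 22 accesses hash2Odr[] and Odr2P[] based on the three-character prefix of the string to be encoded and obtains the match candidate strings.
[0093]
For example, if the string “compression_decom...” starting at address “1” of input buffer InBuf[] is to be encoded, the hash value H(“com”) is first generated from the three-character prefix “com”. Next, rank number “2”, held at the address of that hash value in hash2Odr[], is obtained, and that rank of Odr2P[] is accessed.
[0094]
In this case, the same prefix is not registered at that rank, so there is no match candidate. The first character “c” is therefore output as is, and 1 is added to rank number “2” held at address H(“com”) of hash2Odr[]. The rank obtained from the prefix “com” thereby changes from “2” to the next lower rank “3”.
[0095]
Later, when the string “compress_com...” starting at address “15” becomes the encoding target, the updated rank number “3” is obtained from hash2Odr[] based on the hash value of the prefix “com”, and that rank of Odr2P[] is accessed.
[0096]
Next, address “1” held at rank “3” is obtained, and the string “compression_decom...” starting at that address becomes the match candidate. The code for the match position and match length is output, and the value in hash2Odr[] is updated again. The rank obtained from the prefix “com” thereby changes from “3” to the next lower rank “4”.
[0097]
Later, when the string “compression” starting at address “24” becomes the encoding target, the updated rank number “4” is obtained from hash2Odr[], and that rank of Odr2P[] is accessed.
[0098]
Next, addresses “15” and “1”, held at that rank “4” and at rank “3” one above it, are obtained. The strings starting at those addresses, “compress_com...” and “compression_decom...”, become the match candidates. When no longest-match search is performed, only the string at rank “4”, “compress_com...”, is a match candidate; when a longest-match search is performed, both strings are match candidates.
[0099]
Providing a hash table thus makes the rank list easy to access and makes string search more efficient. Since the length of the hash table can be made no greater than the length of the rank list, the two together still store the necessary information within an amount of memory proportional to the length of the input buffer. In addition, shifting the rank pointed to by the hash table down by one each time encoding is performed keeps the rank of the most recent match candidate available, which makes the longest-match search more efficient.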
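[0100]
FIG. 16 is a flowchart of data compression using the match detection unit of FIG. 14. In this process no longest-match search is performed; only the most recent match candidate is searched for. Steps S63 to S67 and S69 to S71 of FIG. 16 are the same as steps S3 to S7 and S8 to S10 of FIG. 8.
[0101]
The data compression apparatus first reads BUFSIZE worth of data into input buffer InBuf[] and sets variable t to 1 (step S61). It also creates rank list Odr2P[] from the data in InBuf[] and creates hash table hash2Odr[] for Odr2P[].
[0102]
Next, the three-character string starting at address t is taken as Ct = (InBuf[t], InBuf[t+1], InBuf[t+2]), and variable hash, which represents the hash value, is set to H(Ct) (step S62). Variable odr, the rank of the most recent match candidate, is set to hash2Odr[hash], and variable p, the match position, is set to Odr2P[odr].
[0103]
Next, through steps S63 to S67, it is checked whether the string starting at address t has appeared before t, and if it has, the match position and match length are output as the code. Then 1 is added to hash2Odr[hash], shifting the rank corresponding to hash down by one (step S68), and the processing from step S69 onward is performed. After InBuf[t] is output as the code and t = t + 1 is set in step S71, the processing from step S68 onward is performed.
[0104]
The compression result of the process of FIG. 16 is the same as that of the process of FIG. 8. To perform a longest-match search, the same changes as in FIG. 9 are applied to the process of FIG. 16.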
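The per-prefix rank pointer of FIGS. 15 and 16 can be sketched as follows. Two assumptions are made and should be read as such: a Python `dict` keyed by the prefix itself stands in for hash function H and the 2^M-entry table, so hash collisions are not modeled, and bounds checks are added for the end of the buffer. The names `compress_hashed` and `render` are hypothetical.

```python
def compress_hashed(text: str, n: int = 3) -> list:
    """Encode text using a prefix -> rank pointer that moves down the
    rank list by one each time that prefix is encoded."""
    size = len(text)
    odr2p = [0] + sorted(range(1, size - n + 2),
                         key=lambda a: text[a - 1:a - 1 + n])
    # Initial state: each prefix points one rank above its block.
    hash2odr = {}
    for rank in range(1, len(odr2p)):
        prefix = text[odr2p[rank] - 1:odr2p[rank] - 1 + n]
        hash2odr.setdefault(prefix, rank - 1)
    out, t = [], 1
    while t <= size:
        ct = text[t - 1:t - 1 + n]                    # step S62
        odr = hash2odr.get(ct, 0)
        p = odr2p[odr] if odr >= 1 else 0
        if len(ct) == n and p and text[p - 1:p - 1 + n] == ct:
            s = n
            while t + s <= size and text[p + s - 1] == text[t + s - 1]:
                s += 1
            out.append((p, s))                        # match found
            t += s
        else:
            out.append(text[t - 1])                   # literal
            t += 1
        if ct in hash2odr:
            hash2odr[ct] += 1                         # step S68: shift down
    return out


def render(tokens: list) -> str:
    return "".join(tok if isinstance(tok, str) else f"({tok[0]},{tok[1]})"
                   for tok in tokens)


print(render(compress_hashed("compression_decompress_compression")))
# compression_de(1,8)_(15,8)(9,3)
```

With a real hash function, two different prefixes could share one table entry; the prefix comparison before accepting a candidate is what keeps the output correct in that case, at the cost of missed matches.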
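[0105]
The rank list described above is generated by sorting the prefixes of the strings starting at each address in the input buffer in character code order and rearranging the addresses at which the strings occur. Any sorting method can be used for this, such as radix sort, quicksort, or bubble sort.
[0106]
For example, radix sort performs the sort by repeating a bin sort on the k-th character (k = 1, ..., N) of the N-character (N-byte) prefix, starting from the N-th character. Quicksort performs the sort by repeatedly dividing the set of N-character prefixes in two relative to one prefix. Bubble sort performs the sort by repeatedly comparing two adjacent prefixes and swapping them according to the result.
[0107]
FIGS. 17 and 18 are flowcharts of rank list generation based on radix sort. Here, sort unit 25 of FIG. 5 performs a bin sort on each character of the three-character prefix. It is known experimentally that sorting with the prefix limited to three characters makes the longest-match search efficient.
[0108]
In a bin sort, the number of occurrences of each value (character code) from 0 to 255 is counted, and from each count the number of character codes smaller than that code is calculated. This determines the position in the array at which each occurring character code should finally be placed.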
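[0109]
When BUFSIZE worth of data has been input to input buffer InBuf[], sort unit 25 first initializes to 0 each element of array Counter[256], which represents the number of occurrences of character codes 0 to 255, and sets variable t to 1 (step S81).
[0110]
Next, 1 is added to Counter[InBuf[t]], incrementing the number of occurrences of the character code held in InBuf[t] (step S82). Then 1 is added to t, and t is compared with BUFSIZE (step S83). If t < BUFSIZE, step S82 is repeated; when t reaches BUFSIZE, t = 1 and Sum[0] = 0 are set (step S84).
[0111]
Next, Sum[t] = Counter[t-1] + Sum[t-1] is set, 1 is added to t (step S85), and t is compared with 256 (step S86). Here, Sum[t] represents the total number of occurrences of character codes 0 to t-1. If t ≤ 256, step S85 is repeated; when t exceeds 256, the process of FIG. 18 is performed next.
[0112]
In FIG. 18, sort unit 25 first performs a bin sort on the third character of the prefix. In this case, t = 1 is set and Sum[] is copied to StackP[] (step S87). Here, array A[] stores the addresses sorted by the third character of the strings starting at address t of InBuf[]. When the third character has the value x, StackP[x] stores the index of array A[] at which the sort result should be stored.
[0113]
Next, A[StackP[InBuf[t+2]]] = t is set, 1 is added to StackP[InBuf[t+2]], and 1 is added to t (step S88). Here, StackP[InBuf[t+2]] represents the index of A[] corresponding to the third character of the prefix starting at address t, and A[StackP[InBuf[t+2]]] represents the address of that prefix. Next, t is compared with BUFSIZE (step S89), and if t < BUFSIZE, step S88 is repeated.
[0114]
When t reaches BUFSIZE, the generated array A[] is next bin-sorted by the second character of the prefix. In this case, t = 1 is set and Sum[] is copied to StackP[] (step S90). Here, array StackP[] stores the indices of array B[], which stores the sort result.
[0115]
Next, B[StackP[InBuf[A[t]+1]]] = t is set, 1 is added to StackP[InBuf[A[t]+1]], and 1 is added to t (step S91). Here, StackP[InBuf[A[t]+1]] represents the index of B[] corresponding to the second character of the prefix stored at index t of array A[], and B[StackP[InBuf[A[t]+1]]] represents the address of that prefix. Next, t is compared with BUFSIZE (step S92), and if t < BUFSIZE, step S91 is repeated.
[0116]
When t reaches BUFSIZE, the generated array B[] is next bin-sorted by the first character of the prefix. In this case, t = 1 is set and Sum[] is copied to StackP[] (step S93). Here, array StackP[] stores the indices (rank numbers) of rank list Odr2P[], which stores the sort result.
[0117]
Next, Odr2P[StackP[InBuf[B[t]]]] = t is set, 1 is added to StackP[InBuf[B[t]]], and 1 is added to t (step S94). Here, StackP[InBuf[B[t]]] represents the rank of the first character of the prefix stored at index t of array B[], and Odr2P[StackP[InBuf[B[t]]]] represents the address of that prefix.
[0118]
Next, t is compared with BUFSIZE (step S95), and if t < BUFSIZE, step S94 is repeated. When t reaches BUFSIZE, processing ends. Rank list Odr2P[] is generated in this way.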
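The three bin-sort passes described above amount to a least-significant-digit radix sort over the prefix characters. The sketch below is a conventional version of that technique and differs from the flowchart in two labeled ways: it recounts the character codes for each pass instead of reusing one Sum[] array, and each pass stores the address carried over from the previous pass (the counterpart of A[t] or B[t]) rather than the loop index, which is what a stable radix sort needs to produce addresses. The name `radix_rank_list` is hypothetical.

```python
def radix_rank_list(data: bytes, n: int = 3) -> list:
    """Return 1-based prefix start addresses sorted by their n-byte prefix,
    using one stable counting (bin) sort per character, last character first."""
    order = list(range(1, len(data) - n + 2))     # addresses in order of occurrence
    for k in range(n - 1, -1, -1):                # third, second, first character
        counter = [0] * 256
        for addr in order:
            counter[data[addr - 1 + k]] += 1      # occurrences of each code
        start = [0] * 256                         # number of codes smaller than c
        for code in range(1, 256):
            start[code] = start[code - 1] + counter[code - 1]
        placed = [0] * len(order)
        for addr in order:                        # stable placement
            code = data[addr - 1 + k]
            placed[start[code]] = addr
            start[code] += 1
        order = placed
    return order


sample = b"compression_decompress_compression"
print(radix_rank_list(sample)[:5])   # [23, 12, 1, 15, 24]
```

Because every pass is stable, prefixes that are equal in all three characters keep their order of occurrence, matching the adjacency property the rank list depends on.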
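[0119]
In the embodiment described above, strings are sorted by comparing fixed-length (N-character) prefixes when the rank list is generated, but variable-length prefixes may be compared instead. The embodiment above describes string search in LZ77 coding, but the present invention is not limited to LZ77 coding and can be applied to string search in any coding.
[0120]
The data compression apparatus of FIG. 5 can be configured using, for example, an information processing apparatus (computer) as shown in FIG. 19. The information processing apparatus of FIG. 19 comprises CPU (central processing unit) 61, memory 62, input device 63, output device 64, external storage device 65, medium drive device 66, and network connection device 67, which are connected to one another by bus 68.
[0121]
Memory 62 includes, for example, ROM (read only memory) and RAM (random access memory) and stores the programs and data used for processing. CPU 61 performs the necessary processing by executing the programs using memory 62.
[0122]
For example, input buffer 21 and appearance position holding unit 26 of FIG. 5, reverse lookup list 31 of FIG. 6, match position list 41 of FIG. 10, and hash table 51 of FIG. 14 are provided in memory 62. Match detection unit 22, code generation unit 23, code output unit 24, and sort unit 25 of FIG. 5, collation unit 32 of FIG. 6, region detection unit 42 and collation unit 43 of FIG. 10, and collation unit 52 and update unit 53 of FIG. 14 are stored in memory 62 as software components described by programs.
[0123]
Input device 63 is, for example, a keyboard, pointing device, or touch panel and is used to input instructions and information from the user. Output device 64 is, for example, a display, printer, or speaker and is used to output inquiries to the user and processing results.
[0124]
External storage device 65 is, for example, a magnetic disk device, optical disk device, magneto-optical disk device, or tape device. The information processing apparatus stores the programs and data described above in external storage device 65 and loads them into memory 62 for use as needed.
[0125]
Medium drive device 66 drives portable recording medium 69 and accesses its recorded content. Any computer-readable recording medium can be used as portable recording medium 69, such as a memory card, floppy disk, CD-ROM (compact disk read only memory), optical disk, or magneto-optical disk. The user stores the programs and data described above on portable recording medium 69 and loads them into memory 62 for use as needed.
[0126]
Network connection device 67 is connected to any communication network such as a LAN (local area network) and performs the data conversion that accompanies communication. The information processing apparatus receives the programs and data described above from other devices via network connection device 67 and loads them into memory 62 for use as needed.
[0127]
FIG. 20 shows computer-readable recording media that can supply programs and data to the information processing apparatus of FIG. 19. Programs and data stored on portable recording medium 69 or in external database 70 are loaded into memory 62. CPU 61 then executes the program using the data and performs the necessary processing.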
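[0128]
(Appendix 1) A data compression apparatus comprising:
data storage means for storing character string data to be compressed;
sort means for rearranging the character strings that start at each of a plurality of addresses in the data storage means, based on the content of each character string;
appearance position storage means for storing, in the order of the rearranged character strings, address information representing the address of each character string;
detection means for detecting a repeated character string based on the address information stored in the appearance position storage means; and
encoding means for encoding and outputting the detected repeated character string.
(Appendix 2) The data compression apparatus according to appendix 1, wherein the sort means rearranges the character strings using a prefix of a predetermined number of characters contained in each character string.
(Appendix 3) The data compression apparatus according to appendix 2, wherein the sort means rearranges the character strings using a three-character prefix contained in each character string.
(Appendix 4) The data compression apparatus according to appendix 2, wherein the sort means rearranges the character strings so that identical prefixes are adjacent to one another.
(Appendix 5) The data compression apparatus according to appendix 2, wherein the sort means rearranges the character strings using radix sort.
(Appendix 6) The data compression apparatus according to appendix 2, wherein the sort means rearranges the character strings using quicksort.
(Appendix 7) The data compression apparatus according to appendix 1, further comprising reverse lookup means for storing information for obtaining, from the address of a character string to be encoded, the rank of that character string in the appearance position storage means, wherein the detection means takes as a match candidate the character string corresponding to address information stored at a rank above the rank obtained from the reverse lookup means and collates the character string to be encoded with the match candidate to obtain a match length, and the encoding means encodes the character string to be encoded using information indicating the position of the match candidate and the match length.
(Appendix 8) The data compression apparatus according to appendix 1, further comprising match position storage means for storing, in correspondence with the address of each character string, address information of the most recent occurrence of the same character string, wherein the detection means generates the address information to be stored in the match position storage means from the address information stored in the appearance position storage means, compares adjacent items of address information in the match position storage means, and detects a consecutive region in which the address information is consecutive, and the encoding means takes the character string corresponding to the position of the consecutive region as the character string to be encoded and encodes it using the address information stored in the consecutive region and the length of the consecutive region.
(Appendix 9) The data compression apparatus according to appendix 8, wherein the detection means focuses on one rank in the appearance position storage means and, when the prefix of the character string at the rank of interest is the same as the prefix of the character string one rank above, stores the address information held one rank above at the position in the match position storage means that corresponds to the address information stored at the rank of interest.
(Appendix 10) The data compression apparatus according to appendix 8, wherein the detection means detects a portion of the match position storage means in which two or more consecutive regions are joined and obtains a plurality of match candidate character strings based on the address information stored in the two or more consecutive regions, and the encoding means encodes the character string to be encoded using information indicating the position of the match candidate having the longest match length among the plurality of match candidates and that longest match length.
(Appendix 11) The data compression apparatus according to appendix 1, further comprising search means for storing information for obtaining, from a prefix of a predetermined number of characters contained in a character string to be encoded, the rank in the appearance position storage means of a character string containing the same prefix, wherein the detection means takes as a match candidate the character string corresponding to the address information stored at the rank obtained from the search means and collates the character string to be encoded with the match candidate to obtain a match length, and the encoding means encodes the character string to be encoded using information indicating the position of the match candidate and the match length.
(Appendix 12) The data compression apparatus according to appendix 11, wherein the detection means updates the information stored in the search means so that the rank obtained from the search means for the prefix of the predetermined number of characters becomes the rank of the most recently occurring character string containing the same prefix.
(Appendix 13) A computer-readable recording medium on which a program for a computer is recorded, the program causing the computer to execute processing of:
rearranging the character strings that start at each of a plurality of addresses of character string data to be compressed, based on the content of each character string;
recording, in the order of the rearranged character strings, address information representing the address of each character string;
detecting a repeated character string based on the recorded address information; and
encoding the detected repeated character string.
(Appendix 14) A data compression method comprising:
rearranging the character strings that start at each of a plurality of addresses of character string data to be compressed, based on the content of each character string;
recording, in the order of the rearranged character strings, address information representing the address of each character string;
detecting a repeated character string based on the recorded address information; and
encoding the detected repeated character string.
(Appendix 15) A program for causing a computer to execute processing of:
rearranging the character strings that start at each of a plurality of addresses of character string data to be compressed, based on the content of each character string;
recording, in the order of the rearranged character strings, address information representing the address of each character string;
detecting a repeated character string based on the recorded address information; and
encoding the detected repeated character string.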
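[0129]
[Effects of the Invention]
According to the present invention, when data is compressed, character string search can be performed with an amount of memory roughly proportional to the input data; in particular, when a small amount of data is compressed, less memory is needed than with existing methods. In addition, because the load of searching for the longest matching string is low, processing with a high compression ratio can be performed at high speed.
[Brief Description of the Drawings]
[FIG. 1] A diagram showing the principle of the data compression apparatus of the present invention.
[FIG. 2] A diagram showing the input buffer.
[FIG. 3] A diagram showing the first rank list.
[FIG. 4] A diagram showing the second rank list.
[FIG. 5] A block diagram of the data compression apparatus.
[FIG. 6] A block diagram of the first match detection unit.
[FIG. 7] A diagram showing the reverse lookup list and the rank list.
[FIG. 8] A flowchart of the first compression process.
[FIG. 9] A flowchart of the second compression process.
[FIG. 10] A block diagram of the second match detection unit.
[FIG. 11] A diagram showing the rank list and the match position list.
[FIG. 12] A flowchart of the third compression process.
[FIG. 13] A flowchart of the fourth compression process.
[FIG. 14] A block diagram of the third match detection unit.
[FIG. 15] A diagram showing the hash table and the rank list.
[FIG. 16] A flowchart of the fifth compression process.
[FIG. 17] A flowchart (part 1) of the rank list generation process.
[FIG. 18] A flowchart (part 2) of the rank list generation process.
[FIG. 19] A block diagram of the information processing apparatus.
[FIG. 20] A diagram showing recording media.
[FIG. 21] A diagram showing a conventional compression method.
[FIG. 22] A diagram showing a search using an LUT.
[FIG. 23] A diagram showing a search using a hash table.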
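[Description of Reference Signs]
1 Slide buffer
2 LUT
3 Linked list
4 Hash code generation unit
5 Hash value
6, 51 Hash table
11 Data storage means
12 Sort means
13 Appearance position storage means
14 Detection means
15 Encoding means
21 Input buffer
22 Match detection unit
23 Code generation unit
24 Code output unit
25 Sort unit
26 Appearance position holding unit
27 Rank list
31 Reverse lookup list
32, 43, 52 Collation unit
41 Match position list
42 Region detection unit
53 Update unit
61 CPU
62 Memory
63 Input device
64 Output device
65 External storage device
66 Medium drive device
67 Network connection device
68 Bus
69 Portable recording medium
70 Database
[0001]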
BACKGROUND OF THE INVENTION
The present invention relates to an apparatus and a method for compressing a data string using a dictionary generated from the data string to be compressed. The present invention can be applied not only to the compression of character codes but also to the compression of various other data. In the following, however, in keeping with information theory, a data string is divided into word units, one word of data is called a character, and a data string of an arbitrary number of words is called a character string.
[0002]
[Prior art]
In recent years, as various types of data such as character codes and image data are handled by computers, the amount of data handled is also increasing. When handling such a large amount of data, it is possible to reduce the necessary storage capacity or to transmit the data to a remote place at high speed by omitting redundant portions in the data and compressing the data.
[0003]
Conventional data compression techniques include dictionary-type encoding that uses the similarity of data sequences and probability statistical encoding that uses the appearance frequency of data strings. Among them, LZ77 coding and LZ78 coding are known as representative methods of the former dictionary-type coding (written by Tomohiko Uematsu, “Introduction to Document Data Compression Algorithm”, CQ Publishing, pp. 131-208, 1995). In LZ77 encoding and LZ78 encoding, LZ77 encoding is more mainstream in actual use because a sufficient compression rate can be obtained with simple processing.
[0004]
In LZ77 encoding, as shown in FIG. 21, a slide buffer 1 of fixed size is provided, the character string in buffer 1 that has the longest match with the input character string is searched for, and the input character string is encoded using the position and length of that match. Since buffer 1 is slid as encoding progresses, this encoding method is also called the slide dictionary method.
[0005]
In FIG. 21, when the input character string “abcdaaaq...” on the right side of buffer 1 is encoded, the longest matching character string in buffer 1 is “abcd”. Therefore, the relative address “5 (bytes)” between the head position of the longest matching character string and the head position of the input character string is set as the match position, the length “4 (bytes)” of the longest matching character string is set as the match length, and a code such as (match position, match length) = (5, 4) is generated. As a result, “abcd” at the beginning of the input character string is replaced with (5, 4). Similarly, the next character string “aaa” is replaced with the code (13, 3).
[0006]
However, the slide buffer actually used is much longer, and searching the character strings in the buffer one after another to find the longest matching character string takes an enormous amount of time. For this reason, in practice, rather than collating against all the character strings in the buffer, the positions at which character string prefixes (about 2 to 4 characters) appear are registered in a table as they occur, and collation is performed only against the character strings at the positions held in the table. Tables used for such a search include the look-up table (LUT) and the hash table.
[0007]
FIG. 22 shows a character string search using the LUT. The LUT 2 in FIG. 22 stores the appearance position (address or pointer) of the character string in the buffer 1 using the prefix of the character string in the buffer 1 as an address. At the time of retrieval, the prefix of the input character string is used as an address to access the LUT2 area, and the position of the corresponding character string is acquired.
[0008]
When there are a plurality of character strings having the same prefix in the buffer 1, a plurality of appearance positions are held in the linked list 3 format. Therefore, the positions of all the corresponding character strings in the buffer 1 can be acquired by accessing the LUT 2 only once. Here, a prefix for two characters is used, and an area of the LUT 2 corresponding to the prefix “ab” of the input character string uses the linked list 3 to hold two appearance positions.
[0009]
In this way, the LUT makes a very high-speed search possible because each character string to be searched corresponds one-to-one to a table area, and the necessary information can be acquired with a single table lookup. However, when searching for a long character string, the number of areas the table needs increases as a power of the number of characters that can appear, and the required area becomes enormous. For example, if the number of characters that can appear is 2⁸ = 256, an n-character prefix requires 256ⁿ areas.
[0010]
However, when the character string to be searched becomes a little longer, the part of the prepared area that is actually used (registered) is limited to a part, and the table is sparse. Therefore, when searching for a long character string, the memory use efficiency deteriorates.
[0011]
Therefore, in the hash table, the search character string is folded so that a plurality of character strings share one area. For this reason, after the table lookup it is necessary to check whether the character string obtained is the one actually being searched for, but a longer character string can be searched with an equivalent table area than with the LUT.
[0012]
FIG. 23 shows a character string search using a hash table. The hash code generation unit 4 in FIG. 23 generates a hash code 5 from the prefix “abc” of the input character string and accesses the hash table 6 using it as an address. The hash table 6 stores the position in buffer 1 corresponding to hash code 5, and by collating the character string “abcde” at that position with the input character string, it is checked whether the prefixes of the two match. If they match, it is determined that a character string matching the input character string exists in buffer 1.
[0013]
Also in the case of the hash table, as in the case of the LUT, a plurality of appearance positions are held in a linked list format for a plurality of character strings having the same prefix in the buffer 1. In either case, the linked list is used to search for the longest matching character string.
[0014]
[Problems to be solved by the invention]
However, the above-described conventional data compression technique has the following problems.
When searching for a long character string using the LUT, as described above, even if a table having an enormous area is prepared, only a part of the table is used, so the table is sparse. The hash table has a smaller table size than the LUT, but if the input data is small, the table is similarly sparse. Therefore, there is a problem that the memory is not always effectively used.
[0015]
Also, when searching for the longest matching character string, the multiple appearance positions held in the linked list must be traced one by one, so there is also the problem that the search takes time when many character strings have the same prefix.
[0016]
An object of the present invention is to provide a data compression apparatus and method that, in data compression based on dictionary-type encoding, realize character string search with a reasonable amount of memory according to the input data and perform the longest-match search efficiently.
[0017]
[Means for Solving the Problems]
FIG. 1 is a principle diagram of a data compression apparatus according to the present invention. The data compression apparatus in FIG. 1 includes a data storage unit 11, a sort unit 12, an appearance position storage unit 13, a detection unit 14, and an encoding unit 15.
[0018]
The data storage unit 11 stores character string data to be compressed, and the sorting unit 12 rearranges each character string starting from each of a plurality of addresses in the data storage unit 11 based on the contents of each character string. The The appearance position storage unit 13 stores address information indicating the address of each character string in the order of the rearranged character strings. The detecting unit 14 detects a repeated character string based on the address information stored in the appearance position storing unit 13, and the encoding unit 15 encodes and outputs the detected repeated character string.
[0019]
First, each character included in the character string data to be compressed is stored in each of the plurality of addresses in the data storage unit 11. Next, the sorting unit 12 rearranges a plurality of character strings starting from those addresses in a predetermined order based on the contents of each character string, and stores the address information of each character string in the appearance position storage means 13 in the rearranged order.
[0020]
Next, the detection means 14 refers to the relationship between each piece of address information stored in the appearance position storage means 13 and the rank (storage position) of that address information in the appearance position storage means 13, and detects a character string that appears repeatedly in the data storage means 11. Then, the encoding means 15 encodes and outputs the second and subsequent appearances of the repeated character string.
[0021]
According to such a data compression apparatus, a plurality of character strings appearing in the data storage unit 11 are regularly rearranged according to the contents and stored in the appearance position storage unit 13. For this reason, referring to the appearance position storage means 13, a plurality of positions where the same character string appears can be easily found, and the character string search is made efficient. At this time, the longest match search can be made more efficient by rearranging the character strings so that a plurality of the same character strings are adjacent to each other.
[0022]
Further, since the number of pieces of address information in the appearance position storage means 13 is almost the same as the number of addresses in the data storage means 11 that stores the data to be compressed, the character string search can be carried out with a memory amount substantially proportional to the input data.
[0023]
For example, the data storage unit 11, the sort unit 12, the appearance position storage unit 13, and the detection unit 14 in FIG. 1 correspond respectively to the input buffer 21, the sort unit 25, the appearance position holding unit 26, and the coincidence detection unit 22 in FIG. 5, and the encoding unit 15 in FIG. 1 corresponds to the code generation unit 23 and the code output unit 24 in FIG. 5.
[0024]
[Detailed Description of the Invention]
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
In this embodiment, an input buffer that holds an input character string is provided, and each character string starting from each address in the buffer is rearranged according to the contents of the character string to generate a ranking list. Then, a character string search is performed using this ranking list as a dictionary, and a matching position and a matching length are obtained.
[0025]
For example, when an input buffer as shown in FIG. 2 is provided, first, a three-character prefix is extracted from each character string starting from each address in the buffer, and a rank list as shown in FIG. 3 is generated. The rank list in FIG. 3 corresponds to a table having almost the same number of elements (records) as the input buffer in FIG. 2, and each record stores the address of the position where each prefix appears in the input buffer in FIG. 2.
[0026]
Here, a 34-byte input character string “compression_decompress_compression” is held in the input buffer, and the prefixes “com”, “omp”, and “mpr” are extracted from the addresses 1, 2, and 3, respectively. Three-character prefixes are extracted from the other addresses in the same way. Here, the symbol “_” represents a space. The rank list stores the addresses “1” to “32” corresponding to the appearance positions of these prefixes.
[0027]
Next, the addresses held in the rank list are rearranged in the order of the character codes of the corresponding prefixes to generate the rank list shown in FIG. 4. The rank list of FIG. 4 has the same number of records as the rank list of FIG. 3, and holds the addresses of the prefixes in the rearranged order.
[0028]
Here, a plurality of the same prefixes included in the input buffer, such as “com” and “ssi”, are arranged adjacent to each other in the order of appearance. For this reason, among the character strings having the same prefix as a certain character string, the address of the most recent occurrence is always stored in the record immediately before (one rank above) the record in which the address of that character string is stored. Therefore, if the character string to be encoded in the input buffer is compared with the character string at the address held in the immediately preceding record, the matching character string can be easily searched.
[0029]
In addition, since the addresses of other match candidates are also stored contiguously, in the longest match search the character string to be encoded can be collated in turn with the character strings at the contiguously stored addresses, which speeds up the longest match search. Furthermore, since the input buffer and the rank list are approximately the same length, the information necessary for the search can be stored with a memory amount substantially proportional to the length of the input buffer.
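The construction of the sorted rank list can be illustrated with a minimal Python sketch. It is not taken from the specification: the function name `build_rank_list` is hypothetical, the built-in stable sort stands in for whatever sort the apparatus uses, and the 34-byte example string is assumed to be “compression_decompress_compression”, which is what the addresses quoted in the text (15, 23, 24) imply.

```python
def build_rank_list(data: str, prefix_len: int = 3) -> list[int]:
    """Return 1-based start addresses ordered by their prefix.

    sorted() is stable, so addresses that share a prefix stay in
    order of appearance, as the description requires.
    """
    last_start = len(data) - prefix_len + 1  # last address with a full prefix
    addresses = range(1, last_start + 1)
    return sorted(addresses, key=lambda a: data[a - 1 : a - 1 + prefix_len])


IN_BUF = "compression_decompress_compression"  # "_" stands for a space
rank_list = build_rank_list(IN_BUF)
print(rank_list[:5])  # [23, 12, 1, 15, 24]
```

The three addresses of “com” (1, 15, 24) land on ranks 3 to 5 in order of appearance, so the record one rank above any “com” entry holds its most recent earlier occurrence.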
[0030]
Next, the compression process using the rank list shown in FIG. 4 will be described in more detail with reference to the drawings.
FIG. 5 is a configuration diagram of the data compression apparatus of the present embodiment. The data compression apparatus in FIG. 5 is configured using, for example, a computer, and includes an input buffer 21, a coincidence detection unit 22, a code generation unit 23, a code output unit 24, a sort unit 25, and an appearance position holding unit 26.
[0031]
The input buffer 21 holds the input character string as the data string to be compressed. The sorting unit 25 rearranges the character strings starting from each address in the input buffer 21 according to the contents of the character strings, and generates a rank list 27 that holds the character string addresses in the rearranged order. The appearance position holding unit 26 holds the rank list as appearance position information.
[0032]
The coincidence detection unit 22 detects a repeated character string in the input buffer 21 based on the information in the rank list 27 and passes it to the code generation unit 23 together with other character strings. The code generation unit 23 generates a code of the character string received from the match detection unit 22, and the code output unit 24 outputs the generated code as compressed data. The following three methods are conceivable as a method for searching for repeated character strings using the ranking list 27.
(1) Method using a reverse lookup table (reverse lookup list)
(2) Method using a matching position table (matching position list)
(3) Method using search table (hash table)
First, FIGS. 6 to 9 show data compression processing using a reverse lookup list.
[0033]
In this case, the coincidence detection unit 22 includes a reverse lookup list 31 and a collation unit 32 as shown in FIG. 6. The reverse lookup list 31 stores information for determining the rank of the character string in the rank list 27 from the address of the character string to be encoded in the input buffer 21. Then, the match detection unit 22 employs, as a match candidate, a character string starting from an address held at a rank higher than the rank obtained from the reverse lookup list 31.
[0034]
The collation unit 32 collates the character string to be encoded with the character string of the match candidate, and obtains the length of the matched character string. Then, the code generation unit 23 encodes the character string using the obtained length as the matching length and the address of the matched character string as the matching position. When performing the longest match search, the longest match length among the plurality of match candidates is encoded.
[0035]
For example, a reverse lookup list and a rank list as shown in FIG. 7 are generated from the input character string of FIG. 2. The rank list Odr2P [] in FIG. 7 is the same as the rank list in FIG. 4. The reverse lookup list P2Odr [] is easily generated by storing the number representing the rank of each address held in the rank list Odr2P [] in the record corresponding to that address. For example, for the first address “23” of the rank list Odr2P [], the rank number “1” is stored in the record of the address “23” of the reverse lookup list P2Odr [].
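The inversion that produces P2Odr [] from Odr2P [] can be sketched as follows. This is an illustrative sketch only: `build_reverse_lookup` is a hypothetical name, the value 0 is an assumed marker for addresses that have no rank, and the example string “compression_decompress_compression” is inferred from the addresses quoted in the text.

```python
def build_reverse_lookup(odr2p: list[int], buf_size: int) -> list[int]:
    """Invert the rank list: p2odr[address] is the 1-based rank of that address.

    Index 0 is unused so that addresses can be used directly as indices;
    addresses without a full prefix keep the value 0 (assumed "no rank").
    """
    p2odr = [0] * (buf_size + 1)
    for rank, address in enumerate(odr2p, start=1):
        p2odr[address] = rank
    return p2odr


data = "compression_decompress_compression"
# Rank list: start addresses sorted by 3-character prefix (stable sort).
odr2p = sorted(range(1, len(data) - 1), key=lambda a: data[a - 1 : a + 2])
p2odr = build_reverse_lookup(odr2p, len(data))
print(p2odr[23], p2odr[24])  # 1 5
```

Both lists have roughly one entry per buffer address, which is the basis of the claim that memory use stays proportional to the input buffer length.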
[0036]
When detecting a repeated character string, the match detection unit 22 accesses the reverse lookup list P2Odr [] and the rank list Odr2P [] based on the address of the character string to be encoded, and obtains a match candidate character string.
[0037]
For example, if the character string “compression” starting from the address “24” of the input buffer InBuf [] is to be encoded, the rank number “5” held at the address “24” of the reverse lookup list P2Odr [] is obtained, and that rank in the rank list Odr2P [] is accessed. Next, the addresses “1” and “15” held in the higher ranks “3” and “4” are acquired. Then, the character strings “compression_decom...” and “compress_com...” starting from those addresses become match candidates.
[0038]
When the longest match search is not performed, only the character string “compress_com...” at the next higher rank is a match candidate, and when the longest match search is performed, both character strings are match candidates.
[0039]
Thus, by providing the reverse lookup list, the ranking list can be easily accessed, and the character string search is made efficient. Further, since the reverse lookup list has the same length as the ranking list, even if they are combined, necessary information can be stored with a memory amount substantially proportional to the length of the input buffer.
[0040]
FIG. 8 is a flowchart of data compression processing using the coincidence detection unit of FIG. 6. In this process, the longest match search is not performed, and only the most recently appearing match candidate is searched.
[0041]
First, the data compression apparatus inputs data for a predetermined size BUFSIZE to the input buffer InBuf [], and sets a variable t representing the encoding position to 1 (step S1). Further, a character string of three characters starting from each address of InBuf [] is arranged in alphabetical order to create a ranking list Odr2P [] and a reverse lookup list P2Odr [] for Odr2P [].
[0042]
Next, it is checked whether or not a character string starting from the address t appears before t. Here, first, a variable odr representing the rank of a match candidate that has recently appeared is set to P2Odr [t] -1, and a variable p representing the address of the match position is set to Odr2P [odr] (step S2). odr corresponds to a rank one higher than the rank of the character string starting from the encoding position in Odr2P [].
[0043]
Next, the three-character string Ct = (InBuf [t], InBuf [t + 1], InBuf [t + 2]) starting from the address t is compared with the three-character string Cp starting from the address p (step S3).
[0044]
If Ct and Cp match, the character string starting with Cp is taken as a match candidate, and then the match length is obtained. Here, first, the variable s representing the matching length is set to 3 (step S4), and InBuf [t + s] is compared with InBuf [p + s] (step S5). If they match, s = s + 1 is set (step S6), and the process of step S5 is repeated.
[0045]
If InBuf [t + s] and InBuf [p + s] do not match in step S5, (p, s) is output as a code, t = t + s is set (step S7), and t is compared with BUFSIZE (step S8). If t < BUFSIZE, the processes in and after step S2 are repeated.
[0046]
If t ≧ BUFSIZE in step S8, it is next checked whether or not the data to be compressed has ended (step S9). If the data to be compressed remains, the process from step S1 is repeated, and if there is no data to be compressed, the process ends.
[0047]
In step S3, if Ct and Cp do not match, there is no match candidate, so the first character InBuf [t] of Ct is output as a code as it is, t = t + 1 is set (step S10), and the processing from step S8 is performed.
[0048]
For example, for the data to be compressed in FIG. 7, the processing in FIG. 8 generates compressed data such as “compression_de (1,8) _ (15,8) (9,3)”.
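The flow of steps S1 to S10 can be sketched in Python as follows. This is an illustrative reading of the flowchart, not the patented implementation: `compress_recent` is a hypothetical name, a single buffer is processed, the loop runs to the last address inclusive, and bounds checks that the text does not spell out are added as assumptions.

```python
def compress_recent(data: str, n: int = 3) -> list:
    """Encode using only the most recent earlier occurrence of each prefix."""
    size = len(data)
    buf = " " + data  # pad so that buf[1] is address 1
    # S1: rank list Odr2P[] (index 0 unused) and reverse lookup list P2Odr[]
    odr2p = [0] + sorted(range(1, size - n + 2), key=lambda a: buf[a : a + n])
    p2odr = [0] * (size + 1)
    for rank in range(1, len(odr2p)):
        p2odr[odr2p[rank]] = rank

    out = []
    t = 1
    while t <= size:
        odr = p2odr[t] - 1                      # S2: one rank above t
        p = odr2p[odr] if odr >= 1 else 0       # assumed guard: no higher rank
        if p and buf[t : t + n] == buf[p : p + n]:   # S3: same prefix?
            s = n                                    # S4
            while t + s <= size and buf[t + s] == buf[p + s]:  # S5, S6
                s += 1
            out.append((p, s))                       # S7: (position, length)
            t += s
        else:
            out.append(buf[t])                       # S10: literal character
            t += 1
    return out


print(compress_recent("compression_decompress_compression"))
```

For the example string this yields the fourteen literals of “compression_de” followed by (1, 8), the literal “_”, (15, 8), and (9, 3), in agreement with the compressed data quoted above.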
[0049]
FIG. 9 is a flowchart of data compression processing when performing the longest match search. The processes in steps S11, S13 to S16, and S22 to S24 in FIG. 9 are the same as the processes in steps S1, S3 to S6, and S8 to S10 in FIG.
[0050]
When Odr2P [] and P2Odr [] have been created in step S11, the data compression apparatus sets odr = P2Odr [t] -1 and p = Odr2P [odr] (step S12). At this time, the variable pre representing the matching position of the longest matching character string is set to p, and the variable len representing the matching length is set to 0. Then, the matching length s of the most recently appearing match candidate is obtained by the processing of steps S13 to S16.
[0051]
Next, s and len are compared (step S17), and if s> len, len = s and pre = p are set (step S18). Then, in order to obtain a longer match candidate, set odr = odr−1, p = Odr2P [odr] (step S19), and compare Ct and Cp (step S20). In step S17, if s ≦ len, the processes after step S19 are performed without updating len and pre.
[0052]
If Ct and Cp match, a new match candidate has been found, so the processing from step S14 is repeated, and if the match length of the candidate is longer than len, len and pre are updated.
[0053]
In step S20, if Ct and Cp do not match, (pre, len) is output as a code, t = t + len is set (step S21), and the processing from step S22 is performed. Thus, finally, the position and length of the character string having the longest matching length is output as a code.
[0054]
For example, for the data to be compressed in FIG. 7, the processing in FIG. 9 generates compressed data such as “compression_de (1,8) _ (1,11)”.
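The longest match variant of steps S11 to S24 can be sketched in the same way. Again this is only an illustrative reading: `compress_longest` is a hypothetical name, and the guards for the top of the rank list and the end of the buffer are assumptions added so that the sketch runs on its own.

```python
def compress_longest(data: str, n: int = 3) -> list:
    """Encode with the longest match among all earlier same-prefix strings."""
    size = len(data)
    buf = " " + data  # pad so that buf[1] is address 1
    odr2p = [0] + sorted(range(1, size - n + 2), key=lambda a: buf[a : a + n])
    p2odr = [0] * (size + 1)
    for rank in range(1, len(odr2p)):
        p2odr[odr2p[rank]] = rank

    out = []
    t = 1
    while t <= size:
        ct = buf[t : t + n]
        odr = p2odr[t] - 1            # S12: start one rank above t
        pre, length = 0, 0            # best position and length so far
        # Walk upward through the ranks while the prefix still matches (S13, S20).
        while odr >= 1 and buf[odr2p[odr] : odr2p[odr] + n] == ct:
            p = odr2p[odr]
            s = n                     # S14
            while t + s <= size and buf[t + s] == buf[p + s]:  # S15, S16
                s += 1
            if s > length:            # S17, S18: keep the longer candidate
                pre, length = p, s
            odr -= 1                  # S19: next higher rank
        if length:
            out.append((pre, length)) # S21
            t += length
        else:
            out.append(buf[t])        # literal when no candidate exists
            t += 1
    return out


print(compress_longest("compression_decompress_compression"))
```

At address 24 the walk visits rank 4 (address 15, length 8) and then rank 3 (address 1, length 11), so the final code is (1, 11) rather than (15, 8).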
Next, FIGS. 10 to 13 show data compression processing using the matching position list. In this process, the input data is once converted into a matching position list and then compressed.
[0055]
In this case, the coincidence detection unit 22 includes a coincidence position list 41, an area detection unit 42, and a collation unit 43, as shown in FIG. 10. The matching position list 41 is generated from the rank list 27 and stores information for obtaining, from the address of each character string in the input buffer 21, the position (matching position) of the most recent earlier appearance of the same character string.
[0056]
For example, the rank list Odr2P [] in FIG. 7 is converted into a matching position list P2PreP [] as shown in FIG. 11. The matching position list P2PreP [] consists of the same number of records as the input buffer has elements. The record for each address stores, as the matching position, the address held in the rank list Odr2P [] one rank above the rank of the prefix starting from that address. However, if the prefix registered in the next higher rank is different, the symbol “N” indicating that there is no match candidate is stored.
[0057]
When the match position list P2PreP [] is generated, the match detection unit 22 looks at the addresses held in the rank list Odr2P [] in order from the top. If the prefix registered in the rank of interest is the same as the prefix registered in the next higher rank, the address held in the latter rank is stored in the record corresponding to the address held in the former rank.
[0058]
If the prefix registered in the order of interest is different from the prefix registered in the next higher order, the symbol “N” is stored in the record corresponding to the address held in the former order. By repeating such an operation, the matching position list P2PreP [] is easily generated.
[0059]
For example, for the first address “23” in the rank list Odr2P [], there is no record with a rank higher than that. Therefore, the symbol “N” is stored in the address “23” of the matching position list P2PreP []. For the fifth address “24” in the order list Odr2P [], the record at the fourth address “15” corresponds to the same prefix “com”. Therefore, the address “15” held in the fourth record is stored in the address “24” of the matching position list P2PreP [].
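The conversion from the rank list to the matching position list can be sketched as follows. The sketch is illustrative: `build_match_positions` is a hypothetical name, Python's `None` plays the role of the symbol “N”, one spare slot is added past the end of the buffer for convenience, and the example string “compression_decompress_compression” is inferred from the addresses in the text.

```python
def build_match_positions(data: str, n: int = 3) -> list:
    """Return P2PreP as a list indexed by 1-based address (None means "N")."""
    size = len(data)
    buf = " " + data  # pad so that buf[1] is address 1
    odr2p = sorted(range(1, size - n + 2), key=lambda a: buf[a : a + n])
    p2prep = [None] * (size + 2)  # index 0 unused, one spare slot at the end
    # Scan neighbouring ranks from the top: (higher rank, rank of interest).
    for upper, current in zip(odr2p, odr2p[1:]):
        if buf[upper : upper + n] == buf[current : current + n]:
            p2prep[current] = upper  # most recent earlier occurrence
    return p2prep


p2prep = build_match_positions("compression_decompress_compression")
print(p2prep[23], p2prep[24])  # None 15
print(p2prep[15:21])           # [1, 2, 3, 4, 5, 6]
```

The run 1 to 6 at addresses 15 to 20 is exactly the kind of region with values increasing by one that the area detection unit looks for.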
[0060]
The area detection unit 42 in FIG. 10 compares the values (addresses) of adjacent records in such a match position list 41 and detects a region in which the value indicating the match position continuously increases by one. Then, the code generation unit 23 uses the value at the beginning of the area as the coincidence position, obtains the coincidence length from the length of continuous values, and encodes the character string.
[0061]
For example, in the matching position list P2PreP [] of FIG. 11, the value of the record increases successively by 1 from “1” to “6” in the area of addresses “15” to “20”. Therefore, the length “3” of the prefix registered in the rank list is added to the length “6” of this area and “1” is subtracted, and the result “8 (= 6 + 3-1)” is taken as the matching length. Then, a code such as (1, 8) is generated with the value “1” of the first record as the matching position.
[0062]
When performing the longest match search, the region detection unit 42 detects a portion in the match position list 41 where two or more regions having continuous values are connected. Then, the coincidence detection unit 22 analyzes the addresses held in the plurality of detected continuous areas and obtains the positions of a plurality of coincidence candidates.
[0063]
Next, the collation unit 43 collates the character string to be encoded with the character string of each match candidate, and obtains the length of the matched character string. Then, the code generation unit 23 encodes the character string using the match position and the match length of the longest match among the plurality of match candidates.
[0064]
For example, in the matching position list P2PreP [] in FIG. 11, the value of the record increases continuously from “15” to “20” in the area of addresses “24” to “29”, and increases continuously from “7” to “9” in the area of addresses “30” to “32”. Since these two continuous areas are connected, the longest match search is performed using the character string “compression” starting from the address “24” as the encoding target.
[0065]
In this case, the values of the second continuous area “30” to “32” show that, for the address “24” at the head of the first continuous area, there is a match candidate at the address “1” that is longer than the match candidate at the address “15”. The matching length is obtained by adding the prefix length “3” to the combined length “9” of the two continuous areas and subtracting “1”, which gives “11 (= 9 + 3-1)”. Thus, a code such as (1, 11) is generated.
[0066]
Similarly, when three or more continuous regions are connected, a code representing the matching position and the matching length of the longest matching character string is generated. In general, when n continuous regions are connected, there are at least n matching candidates, and the longest matching character string is included in them.
[0067]
Thus, by converting the ranking list to the matching position list, the matching position and the matching length can be easily obtained, and the character string search is made efficient. Further, since the matching position list has the same length as the input buffer, necessary information can be stored with a memory amount proportional to the length of the input buffer. In FIG. 11, the address of the matching position itself is stored in the matching position list, but a relative address from each address to the matching position may be stored.
[0068]
FIG. 12 is a flowchart of data compression processing using the coincidence detection unit of FIG. 10. In this process, the longest match search is not performed, and only the most recently appearing match candidate is searched.
[0069]
First, the data compression apparatus inputs BUFSIZE data to the input buffer InBuf [], and sets a variable t to 1 (step S31). Also, a ranking list Odr2P [] is created from the data of InBuf [], and a matching position list P2PreP [] is created from Odr2P [].
[0070]
Next, P2PreP [t] is compared with “N” to check whether there is a match candidate for the character string starting from the address t (step S32). If the value is not “N”, there is a match candidate. Therefore, the variable s representing “continuous region length−1” is set to 0, and P2PreP [t + s] is compared with P2PreP [t + s + 1] −1 (step S34).
[0071]
If P2PreP [t + s] and P2PreP [t + s + 1] −1 match, P2PreP [t + s + 1] is not “N” and represents a value larger by 1 than P2PreP [t + s]. Therefore, s = s + 1 is set (step S35), and the process of step S34 is repeated.
[0072]
In step S34, if P2PreP [t + s] and P2PreP [t + s + 1] -1 do not match, a code (P2PreP [t], (s + 3)) is output with P2PreP [t] as the matching position and s + 3 as the matching length (step S36). Then, t = t + s + 3 is set, and t is compared with BUFSIZE (step S37). If t < BUFSIZE, the processes in and after step S32 are repeated.
[0073]
If t ≧ BUFSIZE in step S37, it is next checked whether or not the data to be compressed has ended (step S38). If the data to be compressed remains, the process from step S31 is repeated, and if there is no data to be compressed, the process ends.
[0074]
If P2PreP [t] is “N” in step S32, there is no match candidate, so InBuf [t] is output as a code as it is, t = t + 1 is set (step S39), and the processing from step S37 is performed.
[0075]
For example, for the data to be compressed in FIG. 7, the processing in FIG. 12 generates compressed data such as “compression_de (1,8) _ (15,8) (9,3)”.
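Steps S31 to S39 can be sketched as a scan over the matching position list alone, with no character comparison during encoding. The sketch below is illustrative: `compress_by_runs` is a hypothetical name, `None` stands for “N”, a single buffer is processed, and the end-of-buffer check is an added assumption.

```python
def compress_by_runs(data: str, n: int = 3) -> list:
    """Encode by detecting runs of values that increase by one in P2PreP[]."""
    size = len(data)
    buf = " " + data  # pad so that buf[1] is address 1
    odr2p = sorted(range(1, size - n + 2), key=lambda a: buf[a : a + n])
    p2prep = [None] * (size + 2)  # None plays the role of "N"
    for upper, current in zip(odr2p, odr2p[1:]):
        if buf[upper : upper + n] == buf[current : current + n]:
            p2prep[current] = upper

    out = []
    t = 1
    while t <= size:
        if p2prep[t] is None:           # S32: no match candidate
            out.append(buf[t])          # S39: literal
            t += 1
            continue
        s = 0                           # "continuous region length - 1"
        while (t + s + 1 <= size
               and p2prep[t + s + 1] is not None
               and p2prep[t + s + 1] - 1 == p2prep[t + s]):   # S34
            s += 1                      # S35
        out.append((p2prep[t], s + n))  # S36: (position, length)
        t += s + n
    return out


print(compress_by_runs("compression_decompress_compression"))
```

The design point is that the run length s + 1 of consecutive values already proves a match of s + 3 characters, because each record certifies that three characters agree with the recorded position.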
[0076]
FIG. 13 is a flowchart of data compression processing when performing the longest match search. The processes in steps S41 to S42, S44 to S45, and S50 to S52 in FIG. 13 are the same as the processes in steps S31 to S32, S34 to S35, and S37 to S39 in FIG.
[0077]
If P2PreP [t] is not “N” in step S42, the data compression apparatus then sets s = 0 and sets the variable p representing the matching position of the longest matching character string to P2PreP [t] (step S43). Then, the value of s is updated by the processing of steps S44 to S45.
[0078]
In step S44, if P2PreP [t + s] and P2PreP [t + s + 1] -1 do not match, P2PreP [t + s + 1] is then compared with “N” to check whether or not there is a next continuous area connected to the first continuous area (step S46).
[0079]
For example, in the case of FIG. 11, when t = 24 and s = 5, P2PreP [24 + 5] = 20 and P2PreP [24 + 5 + 1] -1 = 7-1 = 6, and the two do not match, so P2PreP [30] = 7 is compared with “N”.
[0080]
If P2PreP [t + s + 1] is not “N”, it can be seen that the next continuous area exists. Therefore, the address P2PreP [t + s + 1] − (s + 1), obtained from the leading value P2PreP [t + s + 1] of that area, is taken as the position of a new match candidate, and the character string at that position is compared with the character string to be encoded.
[0081]
Here, first, a character string of length s + 1 starting from the address t is set as Str (t, s) = (InBuf [t], InBuf [t + 1],..., InBuf [t + s]), and Str ( P2PreP [t + s + 1] − (s + 1), s) is compared with Str (t, s) (step S47).
[0082]
If these character strings match, the new match candidate is regarded as the longest matching character string, p = P2PreP [t + s + 1] − (s + 1) and s = s + 1 are set (step S48), and the processing from step S44 is repeated.
[0083]
In step S47, if the two character strings do not match, (p, (s + 3)) is output as a code, t = t + s + 3 is set (step S49), and the processing after step S50 is performed.
[0084]
In step S46, if P2PreP [t + s + 1] is “N”, the next continuous area does not exist, so the processing from step S49 is performed as it is. Thus, finally, the position and length of the character string having the longest matching length is output as a code.
[0085]
In the case of FIG. 11, in step S47, Str (P2PreP [24 + 5 + 1] − (5 + 1), 5) = Str (1, 5) and Str (24, 5) are compared. Since both of these character strings represent “compre”, the processing after step S44 is repeated with s = 6 and p = 1.
[0086]
When s = 8, since P2PreP [24 + 8 + 1] = N in step S46, a code such as (1, (8 + 3)) = (1, 11) is generated. Therefore, finally, compressed data such as “compression_de (1,8) _ (1,11)” is generated.
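Steps S41 to S52 can be sketched as follows. This is an illustrative reading of the flowchart, not a definitive implementation: `compress_connected_runs` is a hypothetical name, `None` stands for “N”, the check that the new candidate address is at least 1 is an added assumption, and p is updated before s is incremented so that the candidate address is computed from the old value of s.

```python
def compress_connected_runs(data: str, n: int = 3) -> list:
    """Follow connected runs in P2PreP[] to find a longer earlier match."""
    size = len(data)
    buf = " " + data  # pad so that buf[1] is address 1
    odr2p = sorted(range(1, size - n + 2), key=lambda a: buf[a : a + n])
    p2prep = [None] * (size + 2)  # None plays the role of "N"
    for upper, current in zip(odr2p, odr2p[1:]):
        if buf[upper : upper + n] == buf[current : current + n]:
            p2prep[current] = upper

    out = []
    t = 1
    while t <= size:
        if p2prep[t] is None:                 # S42: no candidate, emit literal
            out.append(buf[t])
            t += 1
            continue
        s, p = 0, p2prep[t]                   # S43
        while True:
            nxt = p2prep[t + s + 1] if t + s + 1 <= size else None
            if nxt is not None and nxt - 1 == p2prep[t + s]:   # S44: run continues
                s += 1                        # S45
                continue
            if nxt is None:                   # S46: no connected run
                break
            q = nxt - (s + 1)                 # start of the new candidate
            # S47: compare Str(q, s) with Str(t, s), both of length s + 1
            if q >= 1 and buf[q : q + s + 1] == buf[t : t + s + 1]:
                p = q                         # S48: adopt the longer candidate
                s += 1
                continue
            break
        out.append((p, s + n))                # S49: (position, length)
        t += s + n
    return out


print(compress_connected_runs("compression_decompress_compression"))
```

For t = 24 the first run ends at s = 5, the connected run starting with the value 7 gives the candidate address 7 − 6 = 1, both strings read “compre”, and the scan then continues to s = 8, producing (1, 11).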
[0087]
Next, FIGS. 14 to 16 show data compression processing using a hash table. In this process, the rank list is accessed using a hash table instead of the reverse lookup list of FIG. 7.
[0088]
In this case, the match detection unit 22 includes a hash table 51, a collation unit 52, and an update unit 53, as shown in FIG. 14. The hash table 51 stores information for obtaining, from the prefix of the character string to be encoded in the input buffer 21, the rank of character strings having the same prefix in the rank list 27. Then, the match detection unit 22 employs, as a match candidate, a character string starting from the address held at the rank obtained from the hash table 51 or at a higher rank.
[0089]
The collation unit 52 collates the character string to be encoded with the character string of the match candidate, and obtains the length of the matched character string. Then, the code generation unit 23 encodes the character string using the obtained length as the matching length and the address of the matched character string as the matching position. When performing the longest match search, the longest match length among the plurality of match candidates is encoded. Also, the update unit 53 changes the rank obtained from the hash table 51 to the rank of character strings that have recently appeared and have the same prefix.
[0090]
FIG. 15 shows an example of processing for accessing the rank list using such a hash table. The rank list Odr2P [] in FIG. 15 is the same as the rank list in FIG. 7. The hash table hash2Odr [] stores rank numbers in the rank list Odr2P [], with the hash value as the address. A hash value for accessing this table is generated using a hash function H, for example, by a hash code generation unit 4. The size of this table is generally 2^M and is specified by the integer M.
[0091]
When a plurality of the same prefixes are registered in the rank list Odr2P [], the hash table hash2Odr [] in the initial state holds, at the address corresponding to the hash value obtained from the prefix, the rank number one higher than the block of those prefixes. For example, the three-character prefix “com” is registered at the third, fourth, and fifth ranks of the rank list Odr2P [], but at the start of the compression process, the rank number “2” is stored at the address corresponding to the hash value H (“com”) of “com”.
[0092]
When detecting the repeated character string, the match detection unit 22 accesses the hash table hash2Odr [] and the rank list Odr2P [] based on the three-character prefix of the character string to be encoded, and obtains the match candidate character string.
[0093]
For example, if the character string “compression_decom...” starting from the address “1” of the input buffer InBuf [] is the encoding target, first, the hash value H (“com”) is generated from the three-character prefix “com”. Next, in the hash table hash2Odr [], the rank number “2” held at the address of the hash value is acquired, and that rank in the rank list Odr2P [] is accessed.
[0094]
In this case, since the same prefix is not registered in the ranking, there is no matching candidate. Therefore, the first character “c” is output as it is, and 1 is added to the rank number “2” held in the address H (“com”) of the hash table hash2Odr []. As a result, the rank “2” obtained from the prefix “com” is changed to the rank “3” one level lower.
[0095]
After that, when the character string “compress_com...” starting from the address “15” becomes the encoding target, the updated rank number “3” is obtained from the hash table hash2Odr [] based on the hash value of the prefix “com”. Then, that rank in the rank list Odr2P [] is accessed.
[0096]
Next, the address “1” held in the ranking “3” is acquired, and the character string “compression_decom...” Starting from the address is set as a match candidate. Then, the code of the coincidence position and the coincidence length is output, and the value of the hash table hash2Odr [] is updated again. As a result, the rank “3” obtained from the prefix “com” is changed to the rank “4” one level lower.
[0097]
After that, when the character string “compression” starting from the address “24” is to be encoded, the updated rank number “4” is obtained from the hash table hash2Odr [], and that rank in the rank list Odr2P [] is accessed.
[0098]
Next, the addresses “15” and “1” held in the rank “4” and in the rank “3” one level higher are acquired. Then, the character strings “compress_com...” and “compression_decom...” starting from those addresses become match candidates. Here, when the longest match search is not performed, only the character string “compress_com...” of rank “4” is a match candidate, and when the longest match search is performed, both character strings are match candidates.
[0099]
Thus, by providing the hash table, the ranking list can be easily accessed, and the character string search is made efficient. Further, since the length of the hash table can be made shorter than the length of the rank list, necessary information can be stored within the memory amount proportional to the length of the input buffer even if these are combined. In addition, each time encoding is performed, the rank indicated by the hash table is shifted down by one, whereby the rank of the match candidate that appears recently can be held, and the longest match search is made efficient.
[0100]
FIG. 16 is a flowchart of data compression processing using the coincidence detection unit of FIG. 14. In this process, the longest match search is not performed, and only the most recently appearing match candidate is searched. The processes in steps S63 to S67 and S69 to S71 in FIG. 16 are the same as the processes in steps S3 to S7 and S8 to S10 in FIG. 8.
[0101]
First, the data compression apparatus inputs data for BUFSIZE into the input buffer InBuf [], and sets a variable t to 1 (step S61). Also, a ranking list Odr2P [] is created from the data of InBuf [], and a hash table hash2Odr [] for Odr2P [] is created.
[0102]
Next, the three-character string starting from the address t is set as Ct = (InBuf [t], InBuf [t + 1], InBuf [t + 2]), and a variable hash representing the hash value is set to H (Ct) (step S62). In addition, a variable odr representing the rank of the most recently appearing match candidate is set to hash2Odr [hash], and a variable p indicating the matching position is set to Odr2P [odr].
[0103]
Next, by the processing in steps S63 to S67, it is checked whether or not the character string starting from the address t has appeared before t, and if it has, the matching position and the matching length are output as a code. Then, 1 is added to hash2Odr [hash] to shift the rank corresponding to hash down by one (step S68), and the processing from step S69 is performed. In step S71, InBuf [t] is output as a code and t = t + 1 is set, after which the processing from step S68 is performed.
[0104]
The compression result of the process of FIG. 16 is the same as the result of the process of FIG. 8. When the longest match search is performed, the same change as in FIG. 9 may be applied to the process of FIG. 16.
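The hash table variant of steps S61 to S71 can be sketched as follows. Several points are assumptions made for illustration: `compress_with_hash` is a hypothetical name, a Python dict keyed by the prefix itself stands in for the 2^M-entry table and the hash function H (so collisions do not arise), the value 0 marks "no higher rank", and a single buffer is processed.

```python
def compress_with_hash(data: str, n: int = 3) -> list:
    """Encode using a prefix-to-rank table that is advanced after each step."""
    size = len(data)
    buf = " " + data  # pad so that buf[1] is address 1
    odr2p = [0] + sorted(range(1, size - n + 2), key=lambda a: buf[a : a + n])

    # Initial state: each prefix maps to the rank just above its block.
    hash2odr = {}
    for rank in range(1, len(odr2p)):
        prefix = buf[odr2p[rank] : odr2p[rank] + n]
        hash2odr.setdefault(prefix, rank - 1)

    out = []
    t = 1
    while t <= size:
        ct = buf[t : t + n]                     # S62
        odr = hash2odr.get(ct, 0)
        p = odr2p[odr] if odr >= 1 else 0
        if p and buf[p : p + n] == ct:          # S63: same prefix?
            s = n
            while t + s <= size and buf[t + s] == buf[p + s]:
                s += 1
            out.append((p, s))                  # S67: (position, length)
            step = s
        else:
            out.append(buf[t])                  # S71: literal
            step = 1
        if ct in hash2odr:
            hash2odr[ct] += 1                   # S68: shift the rank down by one
        t += step
    return out


print(compress_with_hash("compression_decompress_compression"))
```

For “com” the stored rank moves from 2 to 3 after address 1 is processed and from 3 to 4 after address 15, so at address 24 the table points at rank 4 and the candidate is address 15, as in the walkthrough above.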
[0105]
The above-described rank list is generated by sorting the prefixes of the character strings starting from each address in the input buffer in the order of the code of each character, and rearranging the addresses of the appearance positions of each character string accordingly. At this time, any sort method such as radix sort, quick sort, or bubble sort can be used.
[0106]
For example, in the radix sort, a bin sort that focuses on the k-th character (k = 1, ..., N) of an N-character (N-byte) prefix is repeated one character position at a time; in the flowcharts described below, the passes run from the last character of the prefix to the first. In the quick sort, sorting is performed by repeatedly dividing a set of N-character prefixes into two using one prefix as a reference. In the bubble sort, sorting is performed by repeatedly comparing two adjacent prefixes and exchanging them according to the comparison result.
[0107]
FIGS. 17 and 18 are flowcharts of the ranking list generation process based on the radix sort. Here, the sorting unit 25 in FIG. 5 performs a bin sort on each character of a three-character prefix. It has been found experimentally that limiting the sort to a three-character prefix makes the longest match search more efficient.
[0108]
In bin sorting, the number of appearances of each value (character code) from 0 to 255 is counted, and for each character code the total number of appearances of smaller character codes is calculated from the counts. This determines the position in the array at which each character code that appears should finally be placed.
[0109]
When BUFSIZE worth of data has been input to the input buffer InBuf[], the sorting unit 25 first initializes to 0 each element of the array Counter[256], which represents the number of appearances of the character codes 0 to 255, and sets the variable t to 1 (step S81).
[0110]
Next, 1 is added to Counter[InBuf[t]], so that the number of appearances of the character code held in InBuf[t] is incremented (step S82). Then, 1 is added to t, and t is compared with BUFSIZE (step S83). If t < BUFSIZE, the process of step S82 is repeated. When t reaches BUFSIZE, t = 1 and Sum[0] = 0 are set (step S84).
[0111]
Next, Sum [t] = Counter [t−1] + Sum [t−1] is set, 1 is added to t (step S85), and t is compared with 256 (step S86). Here, Sum [t] represents the total number of appearances of character codes from 0 to t-1. If t ≦ 256, the process of step S85 is repeated. If t exceeds 256, the process of FIG.
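The counting and accumulation of steps S81 to S86 can be sketched in Python roughly as follows. This is an illustrative sketch under assumptions: the function name count_and_sum is hypothetical, the arrays are 0-based Python lists, and every byte of the buffer is counted.

```python
def count_and_sum(buf: bytes) -> tuple[list[int], list[int]]:
    """Count each character code and accumulate the counts."""
    # counter[c] is the number of appearances of character code c.
    counter = [0] * 256
    for code in buf:
        counter[code] += 1

    # sums[t] is the total number of appearances of codes 0 .. t-1,
    # i.e. the first output position for character code t.
    sums = [0] * 257
    for t in range(1, 257):
        sums[t] = sums[t - 1] + counter[t - 1]
    return counter, sums


counter, sums = count_and_sum(b"banana")
# counter[ord("a")] == 3, sums[ord("b")] == 3, sums[ord("n")] == 4
```

In this example the three "a" characters occupy output positions 0 to 2, "b" starts at position 3, and "n" starts at position 4, which is how the cumulative sums fix the final placement of each code.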
[0112]
In FIG. 18, the sorting unit 25 first performs a bin sort on the third character of the prefix. To do so, it sets t = 1 and copies Sum[] to StackP[] (step S87). Here, the array A[] stores the addresses t of InBuf[], sorted by the third character of the character string starting at each address. When the third character has the value x, StackP[x] holds the subscript of A[] at which the sort result is to be stored.
[0113]
Next, A[StackP[InBuf[t+2]]] = t is set, 1 is added to StackP[InBuf[t+2]], and 1 is added to t (step S88). Here, StackP[InBuf[t+2]] represents the subscript of A[] corresponding to the third character of the prefix starting at address t, and A[StackP[InBuf[t+2]]] holds the address of that prefix. Next, t is compared with BUFSIZE (step S89), and if t < BUFSIZE, the process of step S88 is repeated.
[0114]
When t reaches BUFSIZE, the generated array A[] is next bin-sorted by the second character of the prefix. To do so, t = 1 is set and Sum[] is copied to StackP[] (step S90). Here, StackP[] holds the subscripts of the array B[] in which the sort result is stored.
[0115]
Next, B[StackP[InBuf[A[t]+1]]] = t is set, 1 is added to StackP[InBuf[A[t]+1]], and 1 is added to t (step S91). Here, StackP[InBuf[A[t]+1]] represents the subscript of B[] corresponding to the second character of the prefix stored at subscript t of the array A[], and B[StackP[InBuf[A[t]+1]]] represents the address of that prefix. Next, t is compared with BUFSIZE (step S92), and if t < BUFSIZE, the process of step S91 is repeated.
[0116]
When t reaches BUFSIZE, the generated array B[] is next bin-sorted by the first character of the prefix. To do so, t = 1 is set and Sum[] is copied to StackP[] (step S93). Here, StackP[] holds the subscripts (ranks) of the ranking list Odr2P[] in which the sort result is stored.
[0117]
Next, Odr2P[StackP[InBuf[B[t]]]] = t is set, 1 is added to StackP[InBuf[B[t]]], and 1 is added to t (step S94). Here, StackP[InBuf[B[t]]] represents the rank of the first character of the prefix stored at subscript t of the array B[], and Odr2P[StackP[InBuf[B[t]]]] represents the address of that prefix.
[0118]
Next, t is compared with BUFSIZE (step S95), and if t < BUFSIZE, the process of step S94 is repeated. When t reaches BUFSIZE, the process is terminated. In this way, the ranking list Odr2P[] is generated.
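The three bin-sort passes of FIGS. 17 and 18 can be illustrated by the following Python sketch of a least-significant-character-first radix sort. It is a sketch under assumptions rather than a transcription of the flowcharts: the function name radix_ranking_list is hypothetical, addresses are 0-based, only addresses with a complete prefix are ranked, the character counts are recomputed for each pass over exactly those addresses, and each pass stores the address carried over from the previous pass (the role played by A[t] and B[t]) so that the final list contains addresses.

```python
def radix_ranking_list(buf: bytes, n: int = 3) -> list[int]:
    """Order start addresses by n-character prefix with n bin-sort passes."""
    order = list(range(len(buf) - n + 1))  # addresses with a full prefix

    for k in reversed(range(n)):  # last prefix character first
        # Count the appearances of each character code at offset k.
        counter = [0] * 256
        for a in order:
            counter[buf[a + k]] += 1

        # stack_p[c] is the next free output slot for character code c.
        stack_p = [0] * 256
        for code in range(1, 256):
            stack_p[code] = stack_p[code - 1] + counter[code - 1]

        # Stable scatter: keep the order from the previous pass
        # among addresses whose k-th characters are equal.
        out = [0] * len(order)
        for a in order:
            code = buf[a + k]
            out[stack_p[code]] = a
            stack_p[code] += 1
        order = out

    return order
```

Because every pass is stable, the result after the pass on the first character is ordered by the whole prefix, and addresses with identical prefixes remain in order of appearance. For b"abcabcab" the function returns [0, 3, 1, 4, 2, 5].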
[0119]
In the above-described embodiment, when the ranking list is generated, the character strings are sorted by comparing fixed-length (N-character) prefixes of the respective character strings. Instead, variable-length prefixes may be compared. Also, although the above-described embodiment concerns the character string search in LZ77 encoding, the present invention is not limited to LZ77 encoding and can be applied to the character string search in any encoding.
[0120]
The data compression apparatus in FIG. 5 can be configured using, for example, an information processing apparatus (computer) as shown in FIG. 19. The information processing apparatus of FIG. 19 includes a CPU (central processing unit) 61, a memory 62, an input device 63, an output device 64, an external storage device 65, a medium drive device 66, and a network connection device 67, which are connected to each other via a bus 68.
[0121]
The memory 62 includes, for example, a read only memory (ROM), a random access memory (RAM), and the like, and stores programs and data used for processing. The CPU 61 performs necessary processing by executing a program using the memory 62.
[0122]
For example, the input buffer 21 in FIG. 5, the appearance position holding unit 26, the reverse lookup list 31 in FIG. 6, the matching position list 41 in FIG. 10, and the hash table 51 in FIG. 14 are provided in the memory 62. Also, the coincidence detection unit 22, the code generation unit 23, the code output unit 24, the sorting unit 25, the collation unit 32 in FIG. 6, the region detection unit 42 in FIG. 10, the collation unit 43, the collation unit 52 in FIG. The update unit 53 is stored in the memory 62 as a software component described by a program.
[0123]
The input device 63 is, for example, a keyboard, a pointing device, a touch panel, etc., and is used for inputting instructions and information from the user. The output device 64 is, for example, a display, a printer, a speaker, etc., and is used for outputting an inquiry to a user and a processing result.
[0124]
The external storage device 65 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, a tape device, or the like. The information processing apparatus stores the above-described program and data in the external storage device 65, and loads them into the memory 62 for use as necessary.
[0125]
The medium driving device 66 drives a portable recording medium 69 and accesses the recorded contents. As the portable recording medium 69, any computer-readable recording medium such as a memory card, a floppy disk, a CD-ROM (compact disk read only memory), an optical disk, a magneto-optical disk, or the like is used. The user stores the above-described program and data in the portable recording medium 69, and loads them into the memory 62 for use as necessary.
[0126]
The network connection device 67 is connected to an arbitrary communication network such as a LAN (Local Area Network) and performs the data conversion that accompanies communication. The information processing apparatus can also receive the above-described program and data from another apparatus via the network connection device 67 and load them into the memory 62 for use as necessary.
[0127]
FIG. 20 shows a computer-readable recording medium capable of supplying a program and data to the information processing apparatus of FIG. Programs and data stored in the portable recording medium 69 and the external database 70 are loaded into the memory 62. Then, the CPU 61 executes the program using the data and performs necessary processing.
[0128]
(Supplementary note 1) A data compression apparatus comprising:
data storage means for storing character string data to be compressed;
sorting means for rearranging the character strings starting from each of a plurality of addresses in the data storage means based on the contents of each character string;
appearance position storage means for storing address information representing the addresses of the respective character strings in the order of the rearranged character strings;
detection means for detecting a repeated character string based on the address information stored in the appearance position storage means; and
encoding means for encoding and outputting the detected repeated character string.
(Supplementary note 2) The data compression apparatus according to supplementary note 1, wherein the sorting means rearranges the character strings using a prefix of a predetermined number of characters included in each character string.
(Supplementary note 3) The data compression apparatus according to supplementary note 2, wherein the sorting means rearranges the character strings using a three-character prefix included in each character string.
(Supplementary note 4) The data compression apparatus according to supplementary note 2, wherein the sorting means rearranges the character strings so that identical prefixes are adjacent to each other.
(Supplementary note 5) The data compression apparatus according to supplementary note 2, wherein the sorting means rearranges the character strings using a radix sort.
(Supplementary note 6) The data compression apparatus according to supplementary note 2, wherein the sorting means rearranges the character strings using a quick sort.
(Supplementary note 7) The data compression apparatus according to supplementary note 1, further comprising reverse lookup means for storing information for obtaining, from the address of an encoding target character string, the rank of the encoding target character string in the appearance position storage means, wherein the detection means takes as a match candidate a character string corresponding to address information stored at a rank higher than the rank obtained from the reverse lookup means and collates the encoding target character string with the match candidate to obtain a match length, and the encoding means encodes the encoding target character string using information indicating the position of the match candidate and the match length.
(Supplementary note 8) The data compression apparatus according to supplementary note 1, further comprising matching position storage means for storing, in correspondence with the address of each character string, address information of the same character string that appeared most recently, wherein the detection means generates the address information to be stored in the matching position storage means from the address information stored in the appearance position storage means, compares adjacent pieces of address information in the matching position storage means, and detects a continuous area in which the address information is consecutive, and the encoding means takes the character string corresponding to the position of the continuous area as an encoding target character string and encodes the encoding target character string using the address information stored in the continuous area and the length of the continuous area.
(Supplementary note 9) The data compression apparatus according to supplementary note 8, wherein the detection means focuses on one rank of the appearance position storage means and, when the prefix of the character string at the rank of interest is the same as the prefix of the character string at the next higher rank, stores the address information stored at the next higher rank in the matching position storage means at the position corresponding to the address information stored at the rank of interest.
(Supplementary note 10) The data compression apparatus according to supplementary note 8, wherein the detection means detects a portion of the matching position storage means in which two or more continuous areas are connected and obtains a plurality of match candidate character strings based on the address information stored in the two or more continuous areas, and the encoding means encodes the encoding target character string using information indicating the position of the match candidate having the longest match length among the plurality of match candidates and the longest match length.
(Supplementary note 11) The data compression apparatus according to supplementary note 1, further comprising search means for storing information for obtaining, from a prefix of a predetermined number of characters included in an encoding target character string, the rank in the appearance position storage means of a character string including the same prefix, wherein the detection means takes as a match candidate the character string corresponding to the address information stored at the rank obtained from the search means and collates the encoding target character string with the match candidate to obtain a match length, and the encoding means encodes the encoding target character string using information indicating the position of the match candidate and the match length.
(Supplementary note 12) The data compression apparatus according to supplementary note 11, wherein the detection means updates the information stored in the search means so that the rank obtained from the search means for the prefix of the predetermined number of characters becomes the rank of the most recently appearing character string that includes the same prefix.
(Supplementary note 13) A computer-readable recording medium storing a program for a computer, the program causing the computer to execute processing comprising:
rearranging the character strings starting from each of a plurality of addresses of character string data to be compressed based on the contents of each character string;
recording address information representing the address of each character string in the order of the rearranged character strings;
detecting a repeated character string based on the recorded address information; and
encoding the detected repeated character string.
(Supplementary note 14) A data compression method comprising:
rearranging the character strings starting from each of a plurality of addresses of character string data to be compressed based on the contents of each character string;
recording address information representing the address of each character string in the order of the rearranged character strings;
detecting a repeated character string based on the recorded address information; and
encoding the detected repeated character string.
(Supplementary note 15) A program causing a computer to execute processing comprising:
rearranging the character strings starting from each of a plurality of addresses of character string data to be compressed based on the contents of each character string;
recording address information representing the address of each character string in the order of the rearranged character strings;
detecting a repeated character string based on the recorded address information; and
encoding the detected repeated character string.
[0129]
[Effects of the Invention]
According to the present invention, when data is compressed, a character string search can be performed with an amount of memory substantially proportional to the size of the input data. In particular, when a small amount of data is compressed, less memory is required than with the existing methods. In addition, since the search load for the longest matching character string is low, processing with a high compression ratio can be performed at high speed.
[Brief description of the drawings]
FIG. 1 is a principle diagram of a data compression apparatus according to the present invention.
FIG. 2 is a diagram illustrating an input buffer.
FIG. 3 is a diagram showing a first ranking list.
FIG. 4 is a diagram showing a second ranking list.
FIG. 5 is a configuration diagram of a data compression apparatus.
FIG. 6 is a configuration diagram of a first coincidence detection unit.
FIG. 7 is a diagram illustrating a reverse lookup list and a ranking list.
FIG. 8 is a flowchart of first compression processing.
FIG. 9 is a flowchart of second compression processing.
FIG. 10 is a configuration diagram of a second coincidence detection unit.
FIG. 11 is a diagram illustrating a ranking list and a matching position list.
FIG. 12 is a flowchart of third compression processing.
FIG. 13 is a flowchart of fourth compression processing.
FIG. 14 is a configuration diagram of a third coincidence detection unit.
FIG. 15 is a diagram showing a hash table and a ranking list.
FIG. 16 is a flowchart of fifth compression processing.
FIG. 17 is a flowchart (part 1) of the ranking list generation process.
FIG. 18 is a flowchart (part 2) of the ranking list generation process.
FIG. 19 is a configuration diagram of an information processing apparatus.
FIG. 20 is a diagram illustrating a recording medium.
FIG. 21 is a diagram illustrating a conventional compression method.
FIG. 22 is a diagram showing search by LUT.
FIG. 23 is a diagram illustrating a search using a hash table.
[Explanation of symbols]
1 Slide buffer
2 LUT
3 Linked list
4 Hash code generator
5 Hash value
6, 51 Hash table
11 Data storage means
12 Sorting means
13 Appearance position storage means
14 Detection means
15 Encoding means
21 Input buffer
22 Coincidence detection unit
23 Code generation unit
24 Code output unit
25 Sorting unit
26 Appearance position holding unit
27 Ranking list
31 Reverse lookup list
32, 43, 52 Collation unit
41 Matching position list
42 Area detection unit
53 Update unit
61 CPU
62 Memory
63 Input device
64 Output device
65 External storage device
66 Medium drive device
67 Network connection device
68 Bus
69 Portable recording medium
70 Database

Claims

A data compression apparatus comprising:
data storage means for storing character string data to be compressed;
sorting means for rearranging the character strings starting from each of a plurality of addresses in the data storage means, using a prefix of a predetermined number of characters included in each character string, so that identical prefixes are adjacent to each other;
appearance position storage means for storing, in the order of the rearranged character strings, address information representing the address of each character string in the data storage means;
detection means for detecting a repeated character string included in the character string data stored in the data storage means, based on the address information stored in the appearance position storage means; and
encoding means for encoding and outputting the detected repeated character string.

The data compression apparatus according to claim 1, wherein the sorting means rearranges the character strings using a radix sort.

The data compression apparatus according to claim 1 or 2, further comprising reverse lookup means for storing information for obtaining, from the address of an encoding target character string, the rank of the encoding target character string in the appearance position storage means, wherein the detection means takes as a match candidate a character string corresponding to address information stored at a rank higher than the rank obtained from the reverse lookup means and collates the encoding target character string with the match candidate to obtain a match length, and the encoding means encodes the encoding target character string using information indicating the position of the match candidate and the match length.

The data compression apparatus according to claim 1 or 2, further comprising matching position storage means for storing, in correspondence with the address of each character string, address information of the same character string that appeared most recently, wherein the detection means generates the address information to be stored in the matching position storage means from the address information stored in the appearance position storage means, compares adjacent pieces of address information in the matching position storage means, and detects a continuous area in which the address information is consecutive, and the encoding means takes the character string corresponding to the position of the continuous area as an encoding target character string and encodes the encoding target character string using the address information stored in the continuous area and the length of the continuous area.

The data compression apparatus according to claim 4, wherein the detection means focuses on one rank of the appearance position storage means and, when the prefix of the character string at the rank of interest is the same as the prefix of the character string at the next higher rank, stores the address information stored at the next higher rank in the matching position storage means at the position corresponding to the address information stored at the rank of interest.

The data compression apparatus according to claim 4, wherein the detection means detects a portion of the matching position storage means in which two or more continuous areas are connected and obtains a plurality of match candidate character strings based on the address information stored in the two or more continuous areas, and the encoding means encodes the encoding target character string using information indicating the position of the match candidate having the longest match length among the plurality of match candidates and the longest match length.

The data compression apparatus according to claim 1 or 2, further comprising search means for storing information for obtaining, from a prefix of a predetermined number of characters included in an encoding target character string, the rank in the appearance position storage means of a character string including the same prefix, wherein the detection means takes as a match candidate the character string corresponding to the address information stored at the rank obtained from the search means and collates the encoding target character string with the match candidate to obtain a match length, and the encoding means encodes the encoding target character string using information indicating the position of the match candidate and the match length.

The data compression apparatus according to claim 7, wherein the detection means updates the information stored in the search means so that the rank obtained from the search means for the prefix of the predetermined number of characters becomes the rank of the most recently appearing character string that includes the same prefix.

A computer-readable recording medium on which a program for a computer is recorded, the program causing the computer to execute processing comprising:
rearranging the character strings starting from each of a plurality of addresses of character string data to be compressed, using a prefix of a predetermined number of characters included in each character string, so that identical prefixes are adjacent to each other;
recording, in the order of the rearranged character strings, address information representing the address of each character string included in the character string data to be compressed;
detecting a repeated character string included in the character string data to be compressed, based on the recorded address information; and
encoding the detected repeated character string.

A data compression method comprising:
rearranging the character strings starting from each of a plurality of addresses of character string data to be compressed, using a prefix of a predetermined number of characters included in each character string, so that identical prefixes are adjacent to each other;
recording, in the order of the rearranged character strings, address information representing the address of each character string included in the character string data to be compressed;
detecting a repeated character string included in the character string data to be compressed, based on the recorded address information; and
encoding the detected repeated character string.

A program for causing a computer to execute processing comprising:
rearranging the character strings starting from each of a plurality of addresses of character string data to be compressed, using a prefix of a predetermined number of characters included in each character string, so that identical prefixes are adjacent to each other;
recording, in the order of the rearranged character strings, address information representing the address of each character string included in the character string data to be compressed;
detecting a repeated character string included in the character string data to be compressed, based on the recorded address information; and
encoding the detected repeated character string.