JP4347086B2

JP4347086B2 - Pattern matching apparatus and method, and program

Info

Publication number: JP4347086B2
Application number: JP2004051654A
Authority: JP
Inventors: 亮介榑林; 勝片山; 直明山中; 公平塩本
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2004-02-26
Filing date: 2004-02-26
Publication date: 2009-10-21
Anticipated expiration: 2024-02-26
Also published as: JP2005242668A

Description

The present invention relates to a method and an apparatus for pattern matching, that is, for searching a text (a given input character string) for patterns (an arbitrary set of character strings of several kinds).

Pattern matching, which searches a given text for an arbitrary set of patterns, is applied in many fields, such as word processing software and database search. However, with the progress of the information society and the growing capacity and falling cost of storage devices such as hard disks, the volume of information to be searched has become enormous. Attempts have therefore been made to make pattern matching more efficient, for example by implementing it in hardware.

Pattern matching is defined as follows. Assume a text T of length n and a pattern P of length m. The text and the pattern are each formed by arranging characters from a common character set in order from left to right: n characters for the text and m characters for the pattern. Writing the i-th character of T as t_i, the text can be written T = t_1 … t_n; likewise, the pattern can be written P = p_1 … p_m. The purpose of pattern matching is to determine whether the pattern occurs in the text, that is, to detect a partial text such that t_{s+1} … t_{s+m} = p_1 … p_m.
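The definition above can be written directly as a brute-force check. The following is only a minimal sketch of the definition, not a method described in this document; the function name find_occurrences is hypothetical, and the shift s is 0-based so that text[s:s+m] corresponds to t_{s+1} … t_{s+m}.

```python
def find_occurrences(text: str, pattern: str) -> list[int]:
    """Return every shift s such that text[s:s+m] equals the pattern."""
    n, m = len(text), len(pattern)
    # Try each possible shift and compare the m-character partial text.
    return [s for s in range(n - m + 1) if text[s:s + m] == pattern]


shifts = find_occurrences("abcabcab", "cab")
```

Every shift is compared independently here, which is exactly the cost that the methods discussed below try to reduce.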

Many pattern matching methods have already been proposed. They fall broadly into the following two classes.
1) Methods in which the character comparison position between the text and the pattern changes as a function of the input.

In methods of this class, characters of the text and the pattern are compared, and when they do not match, the position at which the pattern overlaps the text (hereinafter, the window) is moved to the right along the text and the comparison of the text in the window with the pattern is resumed. To reduce the number of character comparisons, it is important to move the window as far as possible on a mismatch. Typical methods of this class include BM (Boyer-Moore), Reverse Factor, and KMP (Knuth-Morris-Pratt) (see, for example, Non-Patent Document 1).
2) Methods in which the character comparison position between the text and the pattern advances at a constant rate.

In methods of this class, the number of character comparisons between the text and the pattern depends only on the text length, whatever text and pattern are used. Examples are automaton-based methods and the Shift-OR algorithm.
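As an illustration of the second class, the following is a minimal sketch of the Shift-OR idea for a single pattern. It is a textbook-style formulation written for this explanation, not code from the cited references, and the function name is hypothetical. Each text character costs exactly one shift, one OR, and one test, so the work depends only on the text length.

```python
def shift_or_search(text: str, pattern: str) -> list[int]:
    """Return 0-based start positions of the pattern in the text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i cleared exactly where pattern[i] == c.
    masks: dict[str, int] = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, ~0) & ~(1 << i)
    state = ~0  # all ones: no prefix of the pattern is matched yet
    hits = []
    for pos, ch in enumerate(text):
        # Bit j of state is 0 iff pattern[0..j] matches the text ending at pos.
        state = (state << 1) | masks.get(ch, ~0)
        if (state & (1 << (m - 1))) == 0:
            hits.append(pos - m + 1)
    return hits


hits = shift_or_search("abcabcab", "cab")
```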

The former class generally has a better expected number of character comparisons than the latter, and is therefore widely used, particularly in software implementations. However, the former class cannot search for multiple patterns simultaneously. In addition, the following problems arise when it is implemented in hardware.

The distance by which the window can be moved to the right is determined by the comparison results for the preceding input characters. For this reason, the input characters of the text cannot be processed in a pipeline.

Because the character comparison position between the text and the pattern changes as a function of the input, the number of character comparisons varies with the text and the pattern, and its worst case exceeds that of the latter class. To absorb the resulting variation in pattern matching throughput, a buffer mechanism for the input text is required. A larger buffer lowers the probability of overflow, but it is difficult to guarantee that the buffer never overflows. Moreover, when the text in the buffer is exhausted, problems such as pipeline stalls occur.
Non-Patent Document 1: Christian Charras and Thierry Lecroq, "Handbook of Exact String Matching Algorithms", [online], [retrieved February 10, 2004], Internet <URL: http://homepage.stts.edu/~aikawa/string.pdf>
Non-Patent Document 2: A. V. Aho and M. J. Corasick, "Efficient string matching: an aid to bibliographic search", Comm. of the ACM, 18(6):333-340, June 1975.

Conventional hardware implementations of pattern matching mainly use automaton-based methods. The first feature of an automaton is that 1) it can search for multiple patterns simultaneously. Its features that suit hardware are that 2-1) the character comparison position between the text and the pattern advances at a constant rate, so no mechanism for buffering the text is needed, and 2-2) the delay between the input of one text character and the input of the next is small compared with other methods. On the other hand, automaton-based methods have several problems with respect to memory size, logic circuit scale, and throughput.

In the automaton-based method, an automaton is first constructed whose state is the longest prefix of the pattern that is also a suffix of the text input so far. Here, a prefix of an arbitrary character string t_i t_{i+1} … t_{i+j} is defined as t_i t_{i+1} … t_{i+l} (0 ≤ l ≤ j), and a suffix of the same string is defined as t_l t_{l+1} … t_{i+j} (i ≤ l ≤ i+j). When there are multiple patterns, the automata for the individual patterns can be combined into a single automaton. This method is called Aho-Corasick (see, for example, Non-Patent Document 2).
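The state definition above can be made concrete with a small sketch. The function below (a hypothetical name, introduced only for illustration) computes, by brute force, the length of the longest pattern prefix that is a suffix of the text seen so far. A real automaton precomputes these values as state transitions rather than recomputing them for each character.

```python
def longest_prefix_state(seen: str, pattern: str) -> int:
    """Length of the longest prefix of `pattern` that is a suffix of `seen`."""
    for length in range(min(len(seen), len(pattern)), -1, -1):
        if seen.endswith(pattern[:length]):
            return length
    return 0  # not reached: the empty prefix always matches


state_after_xxcab = longest_prefix_state("xxcab", "cabc")
```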

FIG. 11 illustrates an Aho-Corasick automaton, using as an example the automaton generated for the two patterns cabc and cbaca. In every state, input of a character not shown on an arc returns the automaton to state 0. The characters of the input text are fed to the automaton one at a time, and the automaton repeatedly moves to the next state determined by the input character and its current state. When the automaton moves to a state that represents an entire pattern (hereinafter, a final state), a character string in the text has matched that pattern. In the example of FIG. 11, states 4 and 8 are the final states.
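The behaviour described for FIG. 11 can be sketched in software as follows. This is an illustrative reconstruction of a standard Aho-Corasick construction, not the circuit of FIG. 11. The function names are hypothetical, and the state numbering (states numbered in the order the patterns are inserted) is an assumption that happens to give final states 4 and 8 for cabc and cbaca; the numbering of the intermediate states in the figure may differ.

```python
from collections import deque


def build_aho_corasick(patterns):
    """Return (delta, final): delta[state][char] is the next state,
    final[state] lists the patterns recognised on entering that state."""
    goto = [{}]  # trie edges; state 0 is the initial state
    final = {}
    for pat in patterns:
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({})
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        final.setdefault(s, []).append(pat)
    alphabet = sorted({ch for pat in patterns for ch in pat})
    fail = [0] * len(goto)
    delta = [{} for _ in goto]
    queue = deque()
    for ch in alphabet:
        delta[0][ch] = goto[0].get(ch, 0)
        if delta[0][ch]:
            queue.append(delta[0][ch])
    while queue:  # breadth-first, so shallower states are already complete
        s = queue.popleft()
        if fail[s] in final:  # a shorter pattern also ends at this state
            final.setdefault(s, []).extend(final[fail[s]])
        for ch in alphabet:
            if ch in goto[s]:
                t = goto[s][ch]
                fail[t] = delta[fail[s]][ch]
                delta[s][ch] = t
                queue.append(t)
            else:
                delta[s][ch] = delta[fail[s]][ch]
    return delta, final


def scan(delta, final, text):
    """Yield (end_index, pattern) each time a final state is entered."""
    s = 0
    for pos, ch in enumerate(text):
        s = delta[s].get(ch, 0)  # characters in no pattern reset to state 0
        for pat in final.get(s, ()):
            yield pos, pat


delta, final = build_aho_corasick(["cabc", "cbaca"])
matches = list(scan(delta, final, "cabcbaca"))
```

Each text character causes exactly one table lookup, which is the property that makes this class of method attractive for hardware.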

FIG. 12 shows a conventional implementation of an automaton using memory, and FIG. 13 shows a conventional example of an automaton expanded into a circuit. Hardware implementations of an automaton divide into methods that represent the automaton in memory, as in FIG. 12, and methods that turn the automaton directly into a circuit, as in FIG. 13. To represent the automaton in memory, a table is used whose inputs are the newly input text character and the current state of the automaton, and whose output is the next state. However, when the automaton is implemented as a table, the number of table entries grows exponentially with the bit width of one text character.

The number of entries can also be reduced by representing the automaton with a binary tree or a similar structure. With this method, however, the delay from the input of a text character until the next state is determined is larger than with a table. During this delay the next text character cannot be matched, so throughput is severely limited. In the method of FIG. 13, in which the automaton is implemented directly as hard-wired logic, a comparator is required for every state of the automaton, and throughput is limited by the wiring delay between states.

The present invention has been made against this background, and its object is to provide a pattern matching apparatus and method that reduce memory and circuit scale and improve throughput when pattern matching is implemented in hardware.

The present invention is characterized by adding the following to conventional automaton-based pattern matching. First, an automaton is constructed whose state is the longest prefix of the character strings obtained by applying a hash function to the patterns, taken with respect to the suffix of the text input so far. The character string obtained by applying the same hash function to the text is input to this automaton, which distinguishes input character strings that may match a pattern from input character strings that cannot.

Using a hash function that generates one hash value from multiple characters reduces the number of states of the automaton. Using a hash function that generates a hash value whose bit width is smaller than the bit width of one pattern character reduces the number of distinct symbols input to the automaton.

Next, automaton-based pattern matching is pipelined and parallelized over the multiple independent character strings generated by applying the hash function (hereinafter, pipeline parallel processing). Finally, pattern matching is completed by exactly comparing each character string in the text that the hashed automaton identifies as a possible match with the pattern it may match.

Thus, in the present invention, the hash function is applied to both the text and the patterns, and pattern matching is performed in the space compressed by the hash function. This reduces both the number of states of the automaton and the number of distinct symbols. If the hash function generates one hash value from k characters, the number of characters in a hashed pattern falls to 1/k of the original.

The number of memory entries when the automaton is represented in memory, and the circuit scale when it is represented as a circuit, are proportional to the number of states of the automaton. In addition, making the bit width of the hash value smaller than the bit width of one pattern character before hashing reduces the number of distinct symbols input to the automaton.

That is, if the bit width of one pattern character before hashing is ω and the bit width of a hash value is hω, the reduction ratio is 2^{hω} / 2^{ω}. When the automaton is represented as a table in memory, the number of memory entries grows exponentially with the bit width of a character. Applying the hash function therefore achieves the reduction in memory and circuit scale that is an object of the present invention.
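A small worked example may help to show the size of this effect. All numbers below are assumptions chosen for illustration and do not come from this document: ω = 8 bits per character, hω = 4 bits per hash value, a dense table with one entry per (state, symbol) pair, the 9 states of the FIG. 11 example before hashing, and 5 states after hashing with k = 2 (assuming the two hashed patterns share no prefix).

```python
def table_entries(num_states: int, symbol_bits: int) -> int:
    """Entries in a dense next-state table: one per (state, input symbol)."""
    return num_states * (1 << symbol_bits)


omega = 8    # assumed bit width of one original character
h_omega = 4  # assumed bit width of one hash value

# Ratio of symbol counts, 2^{h_omega} / 2^{omega}.
symbol_ratio = (1 << h_omega) / (1 << omega)

entries_before = table_entries(9, omega)   # states 0..8 for cabc and cbaca
entries_after = table_entries(5, h_omega)  # assumed state count after hashing
```

Under these assumptions the table shrinks from 2304 entries to 80, with the symbol count alone contributing a factor of 1/16.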

Furthermore, applying to the text a function that generates one hash value from k characters yields k independent character strings from a single text. Pattern matching can be performed on each of these character strings without depending on the others. Processing these character strings by pipeline parallel processing therefore achieves the improvement in pattern matching throughput that is an object of the present invention.
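The way one text yields k independent character strings can be sketched as follows. The function names and the XOR-based hash are hypothetical choices made for this illustration; the document does not specify a particular hash function. Sequence r starts at offset r and hashes consecutive, non-overlapping blocks of k characters.

```python
from typing import Callable, Hashable


def hashed_streams(text: str, k: int, H: Callable[[str], Hashable]) -> list[list]:
    """Return k sequences; sequence r hashes text[r:r+k], text[r+k:r+2k], ..."""
    n = len(text)
    return [[H(text[i:i + k]) for i in range(r, n - k + 1, k)] for r in range(k)]


def xor_hash(block: str, bits: int = 4) -> int:
    """Hypothetical hash: XOR of the code points, truncated to `bits` bits."""
    value = 0
    for ch in block:
        value ^= ord(ch)
    return value & ((1 << bits) - 1)


# With the identity function in place of a hash, the blocks are visible directly.
blocks = hashed_streams("abcdefg", 2, lambda block: block)
```

Because a pattern may begin at any text position, all k offsets are needed; each sequence can then be fed to its own copy of the automaton.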

That is, a first aspect of the present invention is a pattern matching apparatus that searches a text, which is an input character string, for patterns, which are arbitrary character strings, the apparatus comprising: k automaton units, each of which holds the character strings obtained by applying to the patterns a hash function that generates one hash value from k characters (k being an integer of 2 or more), and each of which has as its state the longest prefix of the held character strings with respect to the suffix of the character string input so far; a hash calculation unit that inputs to the k automaton units, respectively, the k character strings obtained by applying the same hash function to the text while shifting the starting position one character at a time, up to k-1 characters; and a final state determination unit that distinguishes partial texts that may match a pattern from partial texts that cannot, by determining whether any of the k automaton units has reached a final state.

In this case, it is desirable to use a hash function that generates a hash value whose bit width is smaller than the bit width of one pattern character.

Furthermore, the k automaton units may be implemented as a pipeline, and means may be provided for pipeline processing, in the pipelined k automaton units, the k character strings generated by applying the hash function.

Furthermore, means may be provided for determining whether a partial text that may match actually matches a pattern exactly.

A second aspect of the present invention is a pattern matching method in which a pattern matching apparatus searches a text, which is an input character string, for patterns, which are arbitrary character strings, the method using k automaton units, each of which holds the character strings obtained by applying to the patterns a hash function that generates one hash value from k characters (k being an integer of 2 or more), and each of which has as its state the longest prefix of the held character strings with respect to the suffix of the character string input so far, and the method comprising: a step in which a hash calculation unit inputs to the k automaton units, respectively, the k character strings obtained by applying the same hash function to the text while shifting the starting position one character at a time, up to k-1 characters; and a step in which a final state determination unit determines whether any of the k automaton units has reached a final state.

In this case, it is desirable to use a hash function that generates a hash value whose bit width is smaller than the bit width of one pattern character.

Furthermore, the k automaton units may be implemented as a pipeline, and the k character strings generated by applying the hash function may be pipeline processed in the pipelined k automaton units.

Furthermore, a step may be executed of determining whether a partial text that may match actually matches a pattern exactly.

A third aspect of the present invention is a program that, when installed on an information processing apparatus, causes that apparatus to realize functions corresponding to a pattern matching apparatus that searches a text, which is an input character string, for patterns, which are arbitrary character strings, the functions being: functions corresponding to k automaton units, each of which holds the character strings obtained by applying to the patterns a hash function that generates one hash value from k characters (k being an integer of 2 or more), and each of which has as its state the longest prefix of the held character strings with respect to the suffix of the character string input so far; a hash calculation function that inputs to the k automaton units, respectively, the k character strings obtained by applying the same hash function to the text while shifting the starting position one character at a time, up to k-1 characters; and a final state determination function that distinguishes partial texts that may match a pattern from partial texts that cannot, by determining whether any of the k automaton units has reached a final state.

In this case, it is desirable to use a hash function that generates a hash value whose bit width is smaller than the bit width of one pattern character.

Furthermore, the k automaton units may be implemented as a pipeline, and a function may be realized of pipeline processing, in the pipelined k automaton units, the k character strings generated by applying the hash function.

Furthermore, a function may be realized of determining whether a partial text that may match actually matches a pattern exactly.

A fourth aspect of the present invention is a recording medium, readable by the information processing apparatus, on which the program of the present invention is recorded. Because the program is recorded on this recording medium, the information processing apparatus can install the program from it. Alternatively, the program can be installed on the information processing apparatus directly over a network from a server that holds the program.

This makes it possible to realize, with a general-purpose information processing apparatus, a pattern matching apparatus capable of improved throughput.

According to the present invention, when pattern matching is implemented in hardware, memory and circuit scale can be reduced and throughput can be improved. Throughput can also be improved when pattern matching is implemented with a general-purpose information processing apparatus and a program.

A pattern matching apparatus according to an embodiment of the present invention will be described with reference to FIG. 1. FIG. 1 is a block diagram of the pattern matching apparatus of this embodiment.

The pattern matching apparatus of this embodiment searches a text, which is an input character string, for patterns, which are arbitrary character strings of several kinds. As shown in FIG. 1, it comprises: a hash calculation unit 1 and an automaton unit 2, which form an automaton whose state is the longest prefix of the character strings obtained by applying a hash function to the patterns, taken with respect to the suffix of the text input so far; and a final state determination unit 3, which distinguishes partial texts that may match a pattern from partial texts that cannot, by inputting to this automaton the character string obtained by applying the same hash function to the text.

The hash calculation unit 1 uses a hash function that generates one hash value from multiple characters. The hash calculation unit 1 also uses a hash function that generates a hash value whose bit width is smaller than the bit width of one pattern character.

FIG. 4 shows an implementation example of the hash calculation unit 1 of this embodiment. As shown in FIG. 4, the hash calculation unit 1 includes means for processing in parallel the multiple independent character strings generated by applying the hash function.

FIG. 5 shows pipelining by hashing when the automaton of this embodiment is represented in memory, and FIG. 6 shows pipelining by hashing when it is expanded into a circuit. As shown in FIGS. 5 and 6, the hash calculation unit 1 may instead include means for pipeline processing the multiple independent character strings generated by applying the hash function.

FIG. 7 shows an implementation example of an automaton that performs the pipeline parallel processing of this embodiment. As shown in FIG. 7, means may also be provided that processes the multiple independent character strings generated by applying the hash function both in parallel and in a pipeline.

Furthermore, as shown in FIG. 1, the apparatus includes an exact match verification unit 4 that determines whether a partial text that may match actually matches a pattern exactly.

The pattern matching method of this embodiment searches a text, which is an input character string, for patterns, which are arbitrary character strings of several kinds. The hash calculation unit 1 and the automaton unit 2 shown in FIG. 1 execute a step of forming an automaton whose state is the longest prefix of the character strings obtained by applying a hash function to the patterns, taken with respect to the suffix of the text input so far. The final state determination unit 3 then executes a step of distinguishing partial texts that may match a pattern from partial texts that cannot, by inputting to this automaton the character string obtained by applying the same hash function to the text.

The hash calculation unit 1 uses a hash function that generates one hash value from multiple characters. The hash calculation unit 1 also uses a hash function that generates a hash value whose bit width is smaller than the bit width of one pattern character.

With the implementation example of the hash calculation unit 1 shown in FIG. 4, a step is executed of processing in parallel the multiple independent character strings generated by applying the hash function.

With the pipelining by hashing for the memory representation shown in FIG. 5, or for the circuit expansion shown in FIG. 6, a step may instead be executed of pipeline processing the multiple independent character strings generated by applying the hash function.

Furthermore, with the implementation example of the pipeline parallel automaton shown in FIG. 7, a step may be executed of processing the multiple independent character strings generated by applying the hash function both in parallel and in a pipeline.

Furthermore, the exact match verification unit 4 shown in FIG. 1 executes a step of determining whether a partial text that may match actually matches a pattern exactly.

The present invention can also be realized as a program that, when installed on a general-purpose information processing apparatus, causes that apparatus to realize functions corresponding to the pattern matching apparatus of the present invention. The program is installed on the information processing apparatus from a recording medium on which it is recorded, or via a communication line, and thereby causes the apparatus to realize functions corresponding to the hash calculation unit 1, the automaton unit 2, the final state determination unit 3, and the exact match verification unit 4.

The pattern matching apparatus and method of this embodiment are described in more detail below.

FIG. 1 shows the overall configuration of the present invention. An automaton is constructed in advance from patterns to which a hash function has been applied. The hash calculation unit 1 then applies the same hash to the input text, and the automaton unit 2 searches the hashed text for the hashed patterns. By making multiple copies of the automaton, the independent character sequences produced by applying the hash can be processed in parallel.

The final state determination unit 3 monitors whether the automaton has transitioned to a final state. A transition to a final state indicates that a hashed pattern has been detected in the hashed text. The exact match verification unit 4 then determines whether the text and the pattern before hashing match exactly.

(Example of hash calculation unit and automaton unit)
FIG. 2 shows the procedure for generating a single hashed automaton from a plurality of patterns. First, the hash function is applied to each pattern. Let $H(x_1, x_2, \ldots, x_k)$ be a hash function that generates one hash value from $k$ characters. Let $m_i$ be the number of characters in pattern $P_i$, and write the pattern as $p_{i,1}\, p_{i,2} \ldots p_{i,m_i}$. The characters of the hashed pattern, $hp_{i,j}$ ($j = 1, \ldots, m_i/k$), can then be expressed as $H(p_{i,(j-1)k+1}, p_{i,(j-1)k+2}, \ldots, p_{i,jk})$. After all patterns have been converted with the hash, they are combined into a single automaton according to the Aho-Corasick algorithm.

That is, as shown in FIG. 2, a pattern is acquired (S1), k characters are taken from the pattern (S2), and a hash value is generated from those k characters (S3). The k characters are replaced with the generated hash value (S4), and steps S1 to S4 are repeated until no unprocessed characters or unprocessed patterns remain (S5, S6). The combined automaton is then generated by the Aho-Corasick algorithm (S7).
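The steps S1 to S7 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the choice of k = 4, the XOR-and-fold hash, and the requirement that every pattern length be a multiple of k are all assumptions made for the example.

```python
from collections import deque

K = 4  # characters per hash value (assumed for this sketch)


def xor_hash(block: bytes) -> int:
    """Hypothetical H: XOR all characters, then fold the result to 4 bits."""
    x = 0
    for b in block:
        x ^= b
    return (x >> 4) ^ (x & 0x0F)


def hash_pattern(pattern: bytes, k: int = K) -> tuple:
    """S2 to S4: replace each block of k characters with its hash value."""
    if len(pattern) % k != 0:
        raise ValueError("this sketch assumes len(pattern) is a multiple of k")
    return tuple(xor_hash(pattern[j:j + k]) for j in range(0, len(pattern), k))


def build_automaton(hashed_patterns):
    """S7: build an Aho-Corasick automaton over the hashed symbols."""
    goto = [{}]    # goto[state][symbol] -> next state
    fail = [0]     # failure link of each state
    out = [set()]  # indices of patterns recognised at each state
    for idx, hp in enumerate(hashed_patterns):
        s = 0
        for sym in hp:
            if sym not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[s][sym] = len(goto) - 1
            s = goto[s][sym]
        out[s].add(idx)
    # Breadth-first pass: compute failure links and merge outputs.
    queue = deque(goto[0].values())
    while queue:
        s = queue.popleft()
        for sym, t in goto[s].items():
            f = fail[s]
            while f and sym not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(sym, 0)
            out[t] |= out[fail[t]]
            queue.append(t)
    return goto, fail, out
```

Because the automaton is built over hash values rather than characters, each transition consumes k characters of the original pattern, which is what shortens the automaton.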

FIG. 3 shows the input and output of the hash calculation unit 1, which applies the hash to the text. In FIG. 3, the input text of $n$ characters is written $t_1 t_2 \ldots t_n$, and the hash value obtained by applying the hash to the $k$ characters $t_{(j-1)k+i}\, t_{(j-1)k+i+1} \ldots t_{jk+i-1}$ is written $ht_{i,j}$. The hashed pattern $hp_1 \ldots hp_{m/k}$ then only needs to be compared against the hashed text $ht_{i,j} \ldots ht_{i,j+m/k-1}$.

That is, if $i \neq i'$, the hashed texts $ht_{i,0}\, ht_{i,1} \ldots$ and $ht_{i',0}\, ht_{i',1} \ldots$ have no dependency on each other and can be regarded as independent character strings. In other words, a hash function that generates one hash value from $k$ characters produces $k$ independent hashed character strings, and these strings can be processed in a pipelined, parallel manner.
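The way one text yields k independent hashed strings can be illustrated with a short Python sketch. The names `xor_hash` and `hashed_streams` and the particular hash are assumptions for the example, and indices are 0-based here, unlike the 1-based notation of the description.

```python
def xor_hash(block: bytes) -> int:
    """Hypothetical hash: XOR all characters, then fold to 4 bits."""
    x = 0
    for b in block:
        x ^= b
    return (x >> 4) ^ (x & 0x0F)


def hashed_streams(text: bytes, k: int = 4):
    """Return k lists; stream i hashes text[i:i+k], text[i+k:i+2k], ...

    Each stream starts at a different offset, so no stream reuses a hash
    value computed for another stream.
    """
    streams = []
    for i in range(k):
        starts = range(i, len(text) - k + 1, k)
        streams.append([xor_hash(text[p:p + k]) for p in starts])
    return streams
```

A pattern whose length is a multiple of k and that begins at text position s appears, block-aligned, only in stream s mod k, which is why all k offsets have to be searched.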

(Implementation example of hash calculation)
As a specific example of the hash function, FIG. 4 shows an implementation of the hash calculation unit 1 that generates a 4-bit hash value from four 8-bit characters by taking the exclusive OR between the characters. Other possible hash functions include CRC.
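A software analogue of such an XOR hash might look like the sketch below. The text states only that four 8-bit characters are XORed into a 4-bit value. The exact bit wiring of FIG. 4 is not reproduced here, so the nibble-folding step and the name `xor_hash4` are assumptions.

```python
def xor_hash4(block: bytes) -> int:
    """Map four 8-bit characters to a 4-bit value using only XOR."""
    if len(block) != 4:
        raise ValueError("expects exactly four 8-bit characters")
    x = block[0] ^ block[1] ^ block[2] ^ block[3]  # 8-bit XOR of the characters
    return (x >> 4) ^ (x & 0x0F)                   # fold both nibbles into 4 bits
```

Since XOR is commutative, blocks such as `b"abcd"` and `b"dcba"` hash to the same value. Collisions of this kind are the reason a hit in hash space is only a candidate and must still be confirmed against the original characters.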

A hash function whose hash value is no narrower than the bit width of one pattern character can be used in the same way. In particular, with a hash function that merely arranges the input characters in some order without altering them, the exact match verification after the automaton reaches a final state can be omitted.
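One hash of this non-altering kind simply packs the k characters side by side into a wider symbol. The sketch below assumes 8-bit characters, and `concat_hash` is a hypothetical name.

```python
def concat_hash(block: bytes) -> int:
    """Pack the characters of a block into one wide symbol, unchanged."""
    return int.from_bytes(block, "big")
```

This mapping is one-to-one for blocks of a fixed length, so two different blocks never share a symbol and a match in hash space is already an exact match. The cost is a much larger symbol alphabet (32 bits per symbol for four 8-bit characters).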

(For pipeline processing)
FIG. 5 shows a configuration that pipelines the hashed text when the automaton is represented in memory. A memory-based automaton consists of an address calculation unit 11, which computes the state-memory address from the input character and the current state, and a memory access unit 12, which accesses the memory. The hashed text can be processed by pipelining these two units and, further, by pipelining the inside of each unit.
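The split between address calculation and memory access can be sketched as follows. The linear layout `state * ALPHABET + symbol`, the 16-symbol alphabet, and all function names are assumptions for illustration. Missing transitions fall back to state 0 here, whereas a real table would hold the precomputed failure transitions.

```python
ALPHABET = 16  # number of hashed symbols; assumes 4-bit hash values


def build_state_memory(transitions, n_states):
    """Flatten {(state, symbol): next_state} into one linear state memory."""
    mem = [0] * (n_states * ALPHABET)
    for (state, symbol), nxt in transitions.items():
        mem[state * ALPHABET + symbol] = nxt
    return mem


def address_calc(state: int, symbol: int) -> int:
    """Role of the address calculation unit 11."""
    return state * ALPHABET + symbol


def memory_access(mem, address: int) -> int:
    """Role of the memory access unit 12: read the next state."""
    return mem[address]


def step(mem, state: int, symbol: int) -> int:
    """One automaton transition as the two stages executed back to back."""
    return memory_access(mem, address_calc(state, symbol))
```

Within one stream, the next address depends on the state just read, so the two stages cannot overlap. Interleaving independent hashed streams removes that dependency.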

FIG. 6 shows a configuration that pipelines the hashed character strings when the automaton is expanded into a circuit. In FIG. 6, registers R are placed on the wiring between states and the hashed texts are pipelined through them, which hides the wiring delay.

(For pipeline parallel processing)
FIG. 7 shows, as an example of pipeline parallel processing, a circuit that processes the four hashed texts generated in FIG. 4, $ht_{1,0}\, ht_{1,1} \ldots$, $ht_{2,0}\, ht_{2,1} \ldots$, $ht_{3,0}\, ht_{3,1} \ldots$, and $ht_{4,0}\, ht_{4,1} \ldots$, both in parallel and in a pipeline. In FIG. 7, a memory-based automaton divided into two pipeline stages is instantiated twice in parallel, so that the four hashed texts are processed. The same pipeline parallel processing is possible with circuit expansion, by placing in parallel two automata in which each state is subdivided into two stages.
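A cycle-level software model of one such two-stage engine is sketched below. It is a simulation under assumptions (a linear state memory addressed as `state * ALPHABET + symbol`, and two streams that strictly alternate cycles), not a description of the actual circuit in FIG. 7. Two instances of `run_engine` would cover the four hashed texts.

```python
ALPHABET = 16  # number of hashed symbols; assumes 4-bit hash values


def run_engine(mem, stream_a, stream_b):
    """Simulate one two-stage automaton engine shared by two hashed streams.

    Stage 1 computes a state-memory address and stage 2 reads the next
    state. The streams own alternate cycles, so a stream's next address is
    computed only after its previous memory read has completed.
    """
    streams = (stream_a, stream_b)
    state = [0, 0]     # current automaton state of each stream
    pos = [0, 0]       # index of the next symbol of each stream
    trace = ([], [])   # states visited by each stream
    addr_reg = None    # pipeline register: (stream id, address)
    cycle = 0
    while True:
        # Stage 2: memory access for the address latched in the last cycle.
        if addr_reg is not None:
            sid, addr = addr_reg
            state[sid] = mem[addr]
            trace[sid].append(state[sid])
        # Stage 1: address calculation for the stream that owns this cycle.
        sid = cycle % 2
        if pos[sid] < len(streams[sid]):
            addr_reg = (sid, state[sid] * ALPHABET + streams[sid][pos[sid]])
            pos[sid] += 1
        else:
            addr_reg = None
        cycle += 1
        if addr_reg is None and all(pos[i] >= len(streams[i]) for i in (0, 1)):
            return trace
```

Each stream sees exactly the state sequence it would see if processed alone, while both pipeline stages stay busy on every cycle in which both streams still have input.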

(Example of final state determination unit)
The final state determination unit 3 determines whether the current state of the automaton is a final state. FIG. 8 shows an implementation of the final state determination unit 3 for a memory-based automaton. It takes the automaton state as input and outputs a signal indicating whether a match with a pattern is possible, that is, whether the automaton has transitioned to a final state corresponding to one of the patterns, together with the corresponding pattern number.
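For the memory-based case, the determination can be modelled as a lookup table indexed by state. The table layout and the names below are assumptions for the sketch, and one pattern number per final state is assumed for simplicity.

```python
def build_final_table(n_states: int, final_states: dict):
    """Build a per-state table of (may_match, pattern_number).

    final_states maps each final state to the number of its pattern.
    """
    table = [(False, None)] * n_states
    for state, pattern_number in final_states.items():
        table[state] = (True, pattern_number)
    return table


def final_state_check(table, state: int):
    """Return the match signal and pattern number for the current state."""
    return table[state]
```

A `True` signal marks the current partial text only as a candidate. Whether it truly matches is decided later against the original characters.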

FIG. 9, on the other hand, shows an embodiment of the final state determination unit 3 for an automaton represented as a circuit. Taking the logical product of the register values of the states that represent each final state yields a signal indicating whether the automaton has transitioned to the final state of each pattern. A decoder that outputs pattern numbers is used to output the number of the pattern that may match.

(Example of exact match verification unit)
Finally, the exact match verification unit 4 determines whether a pattern and a partial text that matched in hash space also match exactly as the pattern and text before hashing. Because patterns and substrings that did not match in hash space never need to be compared, the throughput required of this processing is very small. It can therefore be implemented in software rather than hardware.

FIG. 10 shows an example of the verification procedure of the exact match verification unit 4. First, following the output of the final state determination unit 3, processing branches on whether the hashed partial text under consideration matches a hashed pattern. Only when a match is possible is the text corresponding to the hashed partial text compared with the pattern that may match. This is repeated until no unprocessed text remains.

That is, as shown in FIG. 10, it is determined whether the hashed text matches a hashed pattern (S11). If it does, the partial text that may match is acquired (S12), and the pattern that may match is acquired (S13). The acquired text and pattern are compared (S14). If the comparison shows that the text matches the pattern (S15), the pattern detection result is output (S16). If no undetermined hashed text remains (S17), processing ends; if undetermined hashed text remains, processing returns to step S11.

If, in step S11, the hashed text does not match a hashed pattern and no undetermined hashed text remains, processing ends. Likewise, if in step S15 the text does not match the pattern and no undetermined hashed text remains, processing ends.
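The verification loop of steps S11 to S17 can be sketched in software as below. The candidate format, the helper `candidate_start`, and the 0-based indexing are assumptions for the example. The candidates stand in for the output of the final state determination unit.

```python
def candidate_start(stream: int, index: int, hashed_len: int, k: int) -> int:
    """Text position where a candidate begins (all indices 0-based).

    stream     : offset of the hashed stream that reached a final state
    index      : position in that stream of the last hashed symbol consumed
    hashed_len : length of the hashed pattern, i.e. len(pattern) // k
    """
    return stream + (index - hashed_len + 1) * k


def verify_candidates(text: bytes, patterns, candidates):
    """Confirm hash-space hits against the original characters.

    candidates: iterable of (start, pattern_index) pairs flagged in hash
    space; positions that were not flagged never reach this function (S11).
    """
    matches = []
    for start, pid in candidates:
        pattern = patterns[pid]                      # S13
        window = text[start:start + len(pattern)]    # S12
        if window == pattern:                        # S14, S15
            matches.append((start, pid))             # S16
    return matches                                   # S17: no candidates left
```

Only flagged positions are compared, so the work done here scales with the number of hash-space hits rather than with the length of the text.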

When pattern matching is implemented in hardware, the present invention reduces the required amount of memory and logic circuitry, hides various delays, and improves throughput by processing multiple characters of the text in parallel. It can therefore cope with the growth in the volume of information to be searched that results from larger and cheaper storage devices. Throughput can also be improved when pattern matching is realized with a general-purpose information processing apparatus and a program, which likewise makes it possible to cope with the growth in information to be searched.

FIG. 1 is a block diagram of the pattern matching apparatus of the present embodiment.
FIG. 2 is a flowchart showing the automaton generation procedure of the present embodiment.
FIG. 3 is a diagram showing the input and output of the hash calculation unit of the present embodiment.
FIG. 4 is a diagram showing an implementation example of the hash calculation unit of the present embodiment.
FIG. 5 is a diagram showing hash-based pipelining for the memory representation of the present embodiment.
FIG. 6 is a diagram showing hash-based pipelining for the circuit expansion of the present embodiment.
FIG. 7 is a diagram showing an implementation example of the automaton that performs pipeline parallel processing in the present embodiment.
FIG. 8 is a diagram showing the final state determination unit for the memory representation of the present embodiment.
FIG. 9 is a diagram showing the final state determination unit for the circuit expansion of the present embodiment.
FIG. 10 is a flowchart showing the verification procedure of the exact match verification unit of the present embodiment.
FIG. 11 is a diagram for explaining a conventional Aho-Corasick automaton.
FIG. 12 is a diagram showing an implementation example of a conventional automaton using memory.
FIG. 13 is a diagram showing an example of circuit expansion of a conventional automaton.

Description of symbols
1 Hash calculation unit
2 Automaton unit
3 Final state determination unit
4 Exact match verification unit
11 Address calculation unit
12 Memory access unit
R Register

Claims

A pattern matching apparatus for searching text, which is an input character string, for patterns, which are arbitrary character strings, comprising:
k automaton units, each holding the character strings obtained by applying to the patterns a hash function that generates one hash value from k characters (k being an integer of 2 or more), and each having as its state the longest prefix of the held character strings that is a suffix of the character string input so far;
a hash calculation unit that inputs to the k automaton units, respectively, the k character strings obtained by applying the same hash function to the text while shifting the starting position one character at a time, up to k-1 characters; and
a final state determination unit that distinguishes partial text that may match a pattern from partial text that cannot match a pattern by determining whether any of the k automaton units has reached a final state.

The pattern matching apparatus according to claim 1, wherein the hash function generates a hash value having a bit width smaller than the bit width per character of the pattern.

The pattern matching apparatus according to claim 1 or 2, wherein the k automaton units are implemented in pipelined form, and the apparatus comprises means for pipeline-processing, in the pipelined k automaton units, the k character strings generated by applying the hash function.

The pattern matching apparatus according to any one of claims 1 to 3, further comprising an exact match verification unit that determines whether the partial text that may match actually matches the pattern exactly.

A pattern matching method in which a pattern matching apparatus searches text, which is an input character string, for patterns, which are arbitrary character strings, the method using k automaton units, each holding the character strings obtained by applying to the patterns a hash function that generates one hash value from k characters (k being an integer of 2 or more), and each having as its state the longest prefix of the held character strings that is a suffix of the character string input so far, the method comprising:
a step in which a hash calculation unit inputs to the k automaton units, respectively, the k character strings obtained by applying the same hash function to the text while shifting the starting position one character at a time, up to k-1 characters; and
a step in which a final state determination unit distinguishes partial text that may match a pattern from partial text that cannot match a pattern by determining whether any of the k automaton units has reached a final state.

The pattern matching method according to claim 5, wherein the hash function generates a hash value having a bit width smaller than the bit width per character of the pattern.

The pattern matching method according to claim 5 or 6, wherein the k automaton units are implemented in pipelined form, and the k character strings generated by applying the hash function are pipeline-processed in the pipelined k automaton units.

The pattern matching method according to claim 5 or 7, wherein an exact match verification unit executes a step of determining whether the partial text that may match actually matches the pattern exactly.

A program that, when installed in an information processing apparatus, causes the information processing apparatus to realize functions corresponding to a pattern matching apparatus that searches text, which is an input character string, for patterns, which are arbitrary character strings, the program causing the information processing apparatus to realize:
a function corresponding to k automaton units, each holding the character strings obtained by applying to the patterns a hash function that generates one hash value from k characters (k being an integer of 2 or more), and each having as its state the longest prefix of the held character strings that is a suffix of the character string input so far;
a hash calculation function of inputting to the k automaton units, respectively, the k character strings obtained by applying the same hash function to the text while shifting the starting position one character at a time, up to k-1 characters; and
a final state determination function of distinguishing partial text that may match a pattern from partial text that cannot match a pattern by determining whether any of the k automaton units has reached a final state.

The program according to claim 9, wherein the hash function generates a hash value having a bit width smaller than the bit width per character of the pattern.

The program according to claim 9 or 10, wherein the k automaton units are realized in pipelined form, and the program causes the information processing apparatus to realize a function of pipeline-processing, in the pipelined k automaton units, the k character strings generated by applying the hash function.

The program according to claim 9 or 10, wherein the program causes the information processing apparatus to realize an exact match verification function that determines whether the partial text that may match actually matches the pattern exactly.

A recording medium readable by the information processing apparatus, on which the program according to any one of claims 9 to 12 is recorded.