JP6767825B2 - Data management device, data management method, and data management program

Info

Publication number: JP6767825B2
Application number: JP2016182586A
Authority: JP
Inventors: 岩崎 雅二郎; 宮崎 大輔
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2016-09-20
Filing date: 2016-09-20
Publication date: 2020-10-14
Anticipated expiration: 2036-09-20
Also published as: JP2018049315A

Description

The present invention relates to a data management device, a data management method, and a data management program.

Conventionally, techniques for retrieving desired data from a huge amount of data have been used in web browsers, e-mail software, and the like (see, for example, Patent Document 1). In such techniques, a table manages, for each item of search target data, information such as whether the item has already been searched. For example, one known method prepares a bit-array table that corresponds one-to-one to the search target data and uses it to evaluate whether a given item has already been searched. Another known method registers the identifiers of searched data in an existing hash table, implemented in C or a similar language, and uses it to make the same evaluation.

Japanese Patent Application Laid-Open No. 2009-211263

In the conventional techniques, when a bit-array table is used, for example, the time required to initialize the table at the start of a search grows as the number of search target data items increases.

On the other hand, an existing hash table uses a list-structured data structure that manages data identifiers by linking them with pointers, so retrieving data takes time.

The present invention has been made in view of the above problems, and its object is to speed up searches over a huge amount of data.

One aspect of the present invention is a data management device that determines whether predetermined processing is required for target data, the device comprising: a primary determination table having areas for storing identification information of the target data; a secondary determination table that stores identification information of the target data in a list structure; and a data management unit. The data management unit determines whether identification information is registered at the address of the primary determination table corresponding to a hash value obtained by hashing the identification information of the target data with a first hash function. When it determines that no identification information is registered, it decides that the predetermined processing is to be performed and registers the identification information of the target data at that address of the primary determination table. When it determines that the identification information of the target data is registered, it determines that the predetermined processing for the target data is unnecessary. When the registration determination unit determines that identification information is registered in the secondary determination table, the data management unit refers to the secondary determination table using, as a key, a hash value obtained by hashing the identification information of the target data with a second hash function, and determines whether the predetermined processing is required.

According to one aspect of the present invention, searches over a huge amount of data can be sped up.

FIG. 1 is a configuration diagram centered on the data management device 100 according to the first embodiment.
FIG. 2 is a diagram showing an example of the contents of the existing data object 132.
FIG. 3 is a diagram showing an example of the data structure of the graph index 134.
FIG. 4 is a diagram schematically showing the search process performed by the index search unit 114.
FIG. 5 is a diagram showing an example of the data structure of the primary determination table 122.
FIG. 6 is a diagram showing an example of the data structure of the secondary determination table 124.
FIG. 7 is a flowchart showing an example of the flow of the process, executed by the index search unit 114, of determining whether the distance to a new object has been calculated.
FIG. 8 is a flowchart showing an example of the flow of the process, executed by the index search unit 114, of determining whether the distance to a new object has been calculated.
FIG. 9 is a diagram showing another example of the primary determination table 122.

Hereinafter, a data management device, a data management method, and a data management program to which the present invention is applied will be described with reference to the drawings.

[Overview]
The data management device is realized by one or more processors. The data management device is, for example, a device that performs data management for distinguishing data that has already been explored from data that has not, when searching for data desired by a user in an Internet search.

(First Embodiment)
[Overall configuration]
FIG. 1 is a configuration diagram centered on the data management device 100 according to the first embodiment. The data management device 100 in the first embodiment is connected to one or more client terminals 10 via a network NW. The client terminal 10 is a personal computer, a mobile phone such as a smartphone, a tablet terminal, or the like. The network NW includes a wireless base station, a public line, a dedicated line, a provider terminal, the Internet, and the like.

When the data management device 100 receives query data from the client terminal 10, it searches for data similar to the query data and returns the resulting data to the client terminal 10. The query data is, for example, a keyword entered in the search box of a web browser running on the client terminal 10, or data describing search conditions. The data returned by the data management device 100 may be the data itself (for example, image data with an extension such as jpg) or an identifier for referencing the data (for example, a URL (Uniform Resource Locator)). The query data and the search target data may be any kind of data, such as image data, audio data, or text data.

The data management device 100 includes, for example, a network interface 102, an input/output device 104, a control unit 110, a storage unit 120, and a data server 130. The network interface 102 is, for example, a NIC (Network Interface Card).

The input/output device 104 accepts input operations by the administrator of the data management device 100. For example, the input/output device 104 includes input devices such as a mouse, a keyboard, and a touch panel, and output devices such as an LCD (Liquid Crystal Display), an organic EL (Electroluminescence) display device, and a speaker.

[Control unit]
The control unit 110 includes, for example, a data object generation unit 112 and an index search unit 114 (an example of a data management unit). These components of the control unit 110 are realized, for example, by a processor such as a CPU (Central Processing Unit) executing a program. Some or all of these components may instead be realized by hardware such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or by software and hardware working together. The control unit 110 need not be realized by a single processor and may distribute processing by function. For example, the data object generation unit 112 and the index search unit 114 may each be realized by a separate processor.

The data object generation unit 112 generates an object for input data, for example on the basis of input data received from the client terminal 10 or the like. When the input data is an image, the object is, for example, a local feature, a BoF (Bag of Features), a color histogram, or a combination of these. When the input data is text data, the object is, for example, vector data of word occurrence frequencies, a semantic vector extracted using a neural network, or a vector based on co-occurrence frequency with other words or on other relative relationships. The object may be any data for which a distance between objects can be defined.

The index search unit 114 searches the data server 130 and extracts, from the existing data objects 132, objects similar to the object generated by the data object generation unit 112 (hereinafter referred to as the new object). The data server 130 stores, for example, the existing data objects 132 and the graph index 134. The graph index 134 is an index for searching objects at high speed and is graph-structured data composed of edges and objects (nodes).

The data server 130 is realized by a storage device such as an HDD or a flash memory. The data server 130 may also be realized by an external storage device, such as a NAS (Network Attached Storage) device, that the data management device 100 can access via the network NW.

The data server 130 stores the existing data objects 132 and the graph index 134. The existing data objects 132 and the graph index 134 are loaded from the data server 130 into memory that can be accessed at high speed, such as RAM (Random Access Memory), before use. An existing data object 132 is an object generated from data to be searched. Each existing data object 132 is associated with the data from which it was generated, or with an identifier that refers to that data. The data management device 100 transmits to the client terminal 10 the data associated with the existing objects obtained by using the new object as a query, that is, with objects similar to the new object, or an identifier that refers to that data.

FIG. 2 is a diagram showing an example of the contents of the existing data object 132. As illustrated, the existing data object 132 is data in which an object (vector data in the figure), together with the data from which that object was generated or an identifier that refers to that data, is associated with the object's identification information (object ID in the figure).

The graph index 134 is information about the edges that connect the existing data objects 132 (vector data), and is graph-structured data formed by a plurality of edges, each connecting any two of the existing data objects 132. The edges included in the graph index 134 are, for example, bidirectional edges (undirected edges). FIG. 3 is a diagram showing an example of the contents of the graph index 134. As illustrated, the graph index 134 is data in which the object IDs of the existing objects at both ends of each edge are associated with an edge ID, which is the identification information of that edge. The objects may be connected by unidirectional edges (directed edges) instead of bidirectional edges.
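As a rough illustration of the edge table in FIG. 3, the following Python sketch turns rows of (edge ID, object ID, object ID) into neighbor sets, which is the form a graph traversal would consume. The function name `build_adjacency`, the `directed` switch, and the sample rows are assumptions made for illustration; the text does not specify an in-memory layout.

```python
from collections import defaultdict


def build_adjacency(edges, directed=False):
    """Build neighbor sets from rows of (edge_id, object_id_a, object_id_b).

    With directed=False each edge is treated as bidirectional (undirected);
    with directed=True only the a -> b direction is recorded.
    """
    neighbors = defaultdict(set)
    for _edge_id, a, b in edges:
        neighbors[a].add(b)
        if not directed:
            neighbors[b].add(a)
    return neighbors


# Hypothetical edge table: edge ID, then the object IDs at both ends.
edge_table = [(1, 10, 11), (2, 10, 12), (3, 11, 12)]
adjacency = build_adjacency(edge_table)
directed_adjacency = build_adjacency(edge_table, directed=True)
```

In the undirected case object 12 can be reached from both 10 and 11 and can reach both in return; in the directed case it has no outgoing edges.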

The processing of the index search unit 114 is described in more detail below. Among the objects that can be reached by following the edges defined by the graph index 134, the index search unit 114 extracts a predetermined number of objects, in ascending order of distance to the new object, as objects similar to the new object. When the graph index 134 is composed of directed edges, the index search unit 114 extracts, from among the objects that can be reached by following the direction from a referencing object to a referenced object, a predetermined number of objects in ascending order of distance to the new object as objects similar to the new object. When the objects are vectors, for example, the distance between objects is defined as the Lp norm (p = 1, 2, ...) of the element-wise differences between the vectors.
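The Lp-norm distance mentioned above can be sketched in a few lines of Python. The function name `lp_distance` and the use of plain lists as vectors are assumptions for illustration only.

```python
def lp_distance(u, v, p=2):
    """Lp norm of the element-wise difference between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
manhattan = lp_distance([0.0, 0.0], [3.0, 4.0], p=1)
euclidean = lp_distance([0.0, 0.0], [3.0, 4.0], p=2)
```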

Alternatively, among the objects that can be reached by following the edges defined by the graph index 134, the index search unit 114 may extract the objects within a predetermined distance of the new object as objects similar to the new object. The index search unit 114 then transmits the data associated with the extracted objects in the existing data objects 132, or the identifiers of that data, to the client terminal 10 via the network interface 102 and the network NW.

FIG. 4 is a diagram schematically showing the search process performed by the index search unit 114. First, the index search unit 114 receives as input the graph structure (graph index) G, the new object q, and the search range (search radius) r (S100). Next, the index search unit 114 randomly generates a set S of objects from the graph structure G (S102).

Next, the index search unit 114 clears the set R, which will hold the output data (S104). The index search unit 114 then selects, from the objects in the set S, the object closest to the new object q, designates it as the object s (S106), and removes the object s from the set S (S108).

Next, the index search unit 114 determines whether the distance between the object s and the new object q exceeds the search radius r (S110). If the distance does not exceed the search radius r, the index search unit 114 selects one object from the neighbor object set N(G, s) of the object s and designates it as the object o (S112). The index search unit 114 then determines whether the object o is included in the set C (S114). The set C is maintained to avoid selecting the same object twice, and its data is deleted at the start or end of the processing of this flowchart.

If the object o is not included in the set C, the index search unit 114 adds the object o to the set C (S116) and determines whether the distance between the object o and the new object q is equal to or less than the search radius r (S118). If the distance is equal to or less than the search radius r, the object o is added to both the set S and the set R (S120). If the distance exceeds the search radius r, the processing of S120 is skipped. If it is determined in S114 that the object o is included in the set C, the processing of S116 to S120 is skipped.

Next, the index search unit 114 determines whether all objects in the neighbor object set N(G, s) of the object s have been selected (S122). If not all of them have been selected, the process returns to S112.

If all objects in the neighbor object set N(G, s) of the object s have been selected, the index search unit 114 determines whether the set S is empty (S124). If the set S is not empty, the process returns to S106.

If it is determined in S124 that the set S is empty, or if it is determined in S110 that the distance between the objects s and q exceeds the search radius r, the index search unit 114 outputs the set R (S126). The objects included in this set R are the objects similar to the new object. The search process described here is only an example, and the search may be performed by other methods.
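The steps S100 to S126 can be sketched in Python roughly as follows. This is a minimal illustration under several assumptions: the names `range_search`, `seeds`, and `num_seeds` are hypothetical, the set C is an ordinary Python set standing in for whatever structure records already-examined objects, and the step in which an object is added to C (S116) is taken from the description of S114 that follows this passage.

```python
import random


def euclidean(u, v):
    """L2 distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def range_search(neighbors, vectors, q, r, distance, seeds=None, num_seeds=1):
    """Return IDs of objects within radius r of q, found by following graph edges."""
    if seeds is None:                                   # S102: random starting set
        seeds = random.sample(sorted(vectors), num_seeds)
    S = set(seeds)
    R = set()                                           # S104: clear the result set
    C = set()                                           # objects already examined
    while S:                                            # S124: stop when S is empty
        s = min(S, key=lambda i: distance(vectors[i], q))   # S106
        S.remove(s)                                     # S108
        if distance(vectors[s], q) > r:                 # S110
            break
        for o in neighbors.get(s, ()):                  # S112 / S122
            if o in C:                                  # S114: already examined
                continue
            C.add(o)                                    # S116
            if distance(vectors[o], q) <= r:            # S118
                S.add(o)                                # S120
                R.add(o)
    return R                                            # S126


# Hypothetical one-dimensional data arranged as a chain 1 - 2 - 3 - 4.
vectors = {1: [0.0], 2: [1.0], 3: [2.0], 4: [10.0]}
neighbors = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = range_search(neighbors, vectors, [0.5], 2.0, euclidean, seeds=[1])
```

With the seed fixed to object 1, the traversal reaches objects 2, 1, and 3 within the radius and rejects object 4, which is examined once and then never revisited because it is in C.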

Here, in S114, the index search unit 114 uses the primary determination table 122 and the secondary determination table 124 stored in the storage unit 120 to determine whether the object has already been placed in the set C in S116 and had its distance calculated in S118, that is, whether it is an "object whose distance has already been calculated". The storage unit 120 is realized by, for example, readable and writable memory such as RAM or flash memory.

FIG. 5 is a diagram showing an example of the data structure of the primary determination table 122. The primary determination table 122 is a table with a one-dimensional array structure and is stored at fixed addresses in the memory area. Each of the areas into which the primary determination table 122 is divided is associated with a hash value obtained by hashing an object ID with the first hash function. In the figure, the start address is shown as the information representing each area. The area corresponding to each hash value stores the object ID of the object from which the hash value was derived and a flag indicating whether data is stored in the secondary determination table 124. The first hash function is, for example, a function that returns the remainder of dividing the object ID by 2^n. To realize this, the index search unit 114 can, for example, use the lower n bits of the binary representation of the input data as the hash value, which is faster than actually performing the division. When the first hash function returns the remainder of dividing the object ID by 2^n, N in the figure equals 2^n. For example, N is preferably set to about 1/10 of the total number of existing data objects 132.
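A short Python sketch of the first hash function, assuming integer object IDs: masking with 2^n - 1 keeps the lower n bits, which equals the remainder of division by 2^n for non-negative integers. The helper `choose_n_bits` is a hypothetical sizing rule (the smallest power of two exceeding one tenth of the object count); the text only says "about 1/10".

```python
def first_hash(object_id: int, n_bits: int) -> int:
    """Lower n bits of the object ID, equal to object_id % 2**n_bits for object_id >= 0."""
    return object_id & ((1 << n_bits) - 1)


def choose_n_bits(total_objects: int) -> int:
    """Hypothetical rule: smallest n such that 2**n exceeds total_objects // 10."""
    return max(1, (total_objects // 10).bit_length())


n_bits = choose_n_bits(1_000_000)   # about one tenth of a million objects
table_size = 1 << n_bits            # N = 2**n slots in the primary table
slot = first_hash(37, 4)            # 37 = 0b100101, lower 4 bits are 0b0101
```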

When determining, for an object under consideration (hereinafter, the object of interest), whether the distance to the new object has already been calculated, the index search unit 114 first determines whether the object ID of the object of interest (hereinafter, the ID of interest) is stored (registered) at the address corresponding to the hash value obtained by hashing the ID of interest. If the ID of interest is stored there, the index search unit 114 determines that the distance between the object of interest and the new object has been calculated.

If the ID of interest is not stored there, the index search unit 114 determines whether the object ID of another object is stored at the address corresponding to the hash value of the object of interest. If no other object's ID is stored (that is, if nothing is stored), the index search unit 114 determines that the distance between the object of interest and the new object has not been calculated. In this case, the index search unit 114 calculates the distance between the object of interest and the new object and writes the ID of interest at the address corresponding to the hash value of the object of interest.

On the other hand, if the object ID of another object is stored at the address corresponding to the hash value of the object of interest, the index search unit 114 refers to the flag in the primary determination table 122 and determines whether the secondary determination table 124 stores one or more object IDs for which the first hash function yields the same hash value (that is, IDs that collide). For example, a flag value of 1 indicates that one or more object IDs are stored in the secondary determination table 124 (a hash collision has occurred), and a flag value of 0 indicates that no object ID is stored there (no hash collision has occurred).

Accordingly, if the flag value is 0, the index search unit 114 determines that the distance between the object of interest and the new object has not been calculated. In this case, the index search unit 114 calculates the distance between the object of interest and the new object and writes the ID of interest into the area of the secondary determination table 124 that corresponds to the hash value obtained by hashing the ID of interest with the second hash function (described later). If the flag value is 1, the index search unit 114 refers to the secondary determination table 124 to determine whether the distance between the object of interest and the new object has been calculated.

FIG. 6 is a diagram showing an example of the data structure of the secondary determination table 124. The secondary determination table 124 has a data storage area with a one-dimensional array structure (the array portion in the figure), with one entry for each value that the hash value derived by the second hash function can take, and a data storage area with a list structure (the list portion in the figure). The array portion stores the address (link destination) of the head of the list portion. The second hash function is, for example, a function that derives one of M hash values from the input data, where M is sufficiently smaller than N. The second hash function is, for example, a function that uses the lower m bits of the binary representation of the object ID as the hash value (n > m).

Object IDs are registered one after another in the list portion of the secondary determination table 124. When an object ID is registered in the list portion, for example, the tail portion that follows the object ID in the array portion stores the address of the list portion to be referenced next. This relationship is described by saying that a pointer indicates the address to be referenced next. In the figure, the arrows represent pointers.

The index search unit 114 refers to the secondary determination table 124 and follows the pointers while checking whether the ID of interest is stored. If the ID of interest is found at any point, the index search unit 114 determines that the distance between the object of interest and the new object has been calculated.

On the other hand, if the index search unit 114 follows the pointers and reaches the last area, it determines that the distance between the object of interest and the new object has not been calculated. The index search unit 114 determines that the last area has been reached when it encounters data that does not store the address of a list portion to be referenced next. In this case, the index search unit 114 stores the object ID of the object of interest at an arbitrary address and adds, to the immediately preceding area, a pointer indicating the address at which that object ID was stored.
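The array portion and list portion of FIG. 6, together with the pointer traversal just described, might look like the following in Python. This is a sketch under assumptions: the class names `Node` and `SecondaryTable` and the method `check_and_register` are hypothetical, object references stand in for memory addresses, and the second hash is taken to be the lower m bits of an integer object ID.

```python
class Node:
    """One entry of the list portion: an object ID plus a pointer to the next entry."""
    __slots__ = ("object_id", "next")

    def __init__(self, object_id):
        self.object_id = object_id
        self.next = None  # None means no further address is stored


class SecondaryTable:
    def __init__(self, m_bits):
        self.mask = (1 << m_bits) - 1
        self.heads = [None] * (1 << m_bits)  # array portion: one head per hash value

    def second_hash(self, object_id):
        return object_id & self.mask  # lower m bits

    def check_and_register(self, object_id):
        """Return True if the ID is already stored; otherwise append it and return False."""
        slot = self.second_hash(object_id)
        node = self.heads[slot]
        if node is None:                      # empty chain: start it
            self.heads[slot] = Node(object_id)
            return False
        while True:
            if node.object_id == object_id:   # found while following pointers
                return True
            if node.next is None:             # last area reached: link a new entry
                node.next = Node(object_id)
                return False
            node = node.next


table = SecondaryTable(m_bits=2)
first_5 = table.check_and_register(5)    # new ID, slot 1
first_9 = table.check_and_register(9)    # new ID, same slot 1, appended to the chain
second_5 = table.check_and_register(5)   # found at the head of the chain
second_9 = table.check_and_register(9)   # found after following one pointer
```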

FIGS. 7 and 8 are flowcharts showing an example of the flow of the process, executed by the index search unit 114, of determining whether the distance to the new object has been calculated.

First, the index search unit 114 erases (clears) the data in the primary determination table 122 and the secondary determination table 124 (S200).

Next, the index search unit 114 hashes the ID of interest with the first hash function (S202). The index search unit 114 then refers to the primary determination table 122 (S204) and determines whether data is registered at the address corresponding to the hash value obtained in S202 (S206). If no data is registered at that address, the index search unit 114 registers the ID of interest in the primary determination table 122 (S208), determines that the distance between the object of interest and the new object has not been calculated (S210), and ends the processing of this flowchart.

Ｓ２０６において、Ｓ２０２で求めたハッシュ値に対応するアドレスにデータが格納されていると判定した場合、インデックス検索部１１４は、Ｓ２０２で求めたハッシュ値に対応するアドレスに着目ＩＤが登録されているか否かを判定する（Ｓ２１２）。Ｓ２０２で求めたハッシュ値に対応するアドレスに着目ＩＤが登録されている場合、インデックス検索部１１４は、着目オブジェクトについて「新規オブジェクトとの距離を計算した」と判定し（Ｓ２１４）、本フローチャートの処理を終了する。 When it is determined in S206 that data is stored at the address corresponding to the hash value obtained in S202, the index search unit 114 determines whether the ID of interest is registered at that address (S212). If the ID of interest is registered there, the index search unit 114 determines that the distance to the new object has already been calculated for the object of interest (S214) and ends the processing of this flowchart.

Ｓ２０２で求めたハッシュ値に対応するアドレスに着目ＩＤが登録されていない場合、（以下、図８）インデックス検索部１１４は、一次判定テーブル１２２における、Ｓ２０２で求めた第１のハッシュ関数によるハッシュ値に対応する領域に付与されたフラグが１であるか０であるかを判定する（Ｓ２１６）。フラグが０である場合、インデックス検索部１１４は、着目ＩＤを第２のハッシュ関数でハッシュ化し（Ｓ２１８）、二次判定テーブル１２４に着目ＩＤを登録する（Ｓ２２４）。この際に、インデックス検索部１１４は、二次判定テーブル１２４の配列部分における、Ｓ２１８で得られたハッシュ値に対応する領域に、着目ＩＤを登録したアドレスを示すポインタを付与しておく。そして、インデックス検索部１１４は、（以下、図７）着目オブジェクトについて「新規オブジェクトとの距離を計算していない」と判定して（Ｓ２１０）、本フローチャートの処理を終了する。 If the ID of interest is not registered at the address corresponding to the hash value obtained in S202 (from here, FIG. 8), the index search unit 114 determines whether the flag attached to the area of the primary determination table 122 corresponding to the hash value obtained with the first hash function in S202 is 1 or 0 (S216). If the flag is 0, the index search unit 114 hashes the ID of interest with the second hash function (S218) and registers the ID of interest in the secondary determination table 124 (S224). At this time, the index search unit 114 places, in the area of the array part of the secondary determination table 124 corresponding to the hash value obtained in S218, a pointer indicating the address at which the ID of interest was registered. The index search unit 114 then (returning to FIG. 7) determines that the distance to the new object has not been calculated for the object of interest (S210) and ends the processing of this flowchart.

一方、フラグが１である場合、インデックス検索部１１４は、着目ＩＤを第２のハッシュ関数でハッシュ化し（Ｓ２２０）、Ｓ２２０で求めたハッシュ値に対応するアドレスからポインタを順次辿った先に着目ＩＤが登録されているか否かを判定する（Ｓ２２２）。ポインタを順次辿った先に着目ＩＤが登録されていない場合、インデックス検索部１１４は、Ｓ２２４以下の処理を行う。ポインタを順次辿った先に着目ＩＤが登録されている場合、インデックス検索部１１４は、（以下、図７）着目オブジェクトについて「新規オブジェクトとの距離を計算した」と判定し（Ｓ２１４）、本フローチャートの処理を終了する。 On the other hand, if the flag is 1, the index search unit 114 hashes the ID of interest with the second hash function (S220) and determines whether the ID of interest is registered anywhere along the chain of pointers followed in order from the address corresponding to the hash value obtained in S220 (S222). If the ID of interest is not found along the chain, the index search unit 114 performs the processing from S224 onward. If the ID of interest is found along the chain, the index search unit 114 (returning to FIG. 7) determines that the distance to the new object has already been calculated for the object of interest (S214) and ends the processing of this flowchart.
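
The flow of S200 to S224 can be sketched in Python as follows. This is an illustrative sketch, not the implementation described in the specification. The class name `VisitedSet`, its parameters, the use of integer IDs, and the modulo-based second hash are assumptions made for the example. The first hash takes the lower n bits of the ID, as in claim 6. The sketch also assumes that the flag is set to 1 when an ID is registered in S224, and it prepends new entries to a chain instead of appending them at the tail, which simplifies the code without changing the result of the determination.

```python
class VisitedSet:
    """Two-level "already processed?" check loosely following S200-S224."""

    def __init__(self, primary_bits=10, secondary_size=64):
        self.primary_mask = (1 << primary_bits) - 1   # first hash keeps the low n bits
        self.secondary_size = secondary_size          # assumed size of the array part
        self.clear()

    def clear(self):
        # S200: erase both tables at the start of a search.
        size = self.primary_mask + 1
        self.primary_ids = [None] * size      # primary determination table entries
        self.primary_flags = [False] * size   # flag: something spilled from this slot
        self.secondary = [None] * self.secondary_size  # chain heads (id, next) or None

    def check_and_register(self, obj_id):
        """Return True if obj_id was seen before; otherwise register it and return False."""
        slot = obj_id & self.primary_mask             # S202
        stored = self.primary_ids[slot]               # S204
        if stored is None:                            # S206: nothing registered
            self.primary_ids[slot] = obj_id           # S208
            return False                              # S210
        if stored == obj_id:                          # S212
            return True                               # S214
        bucket = obj_id % self.secondary_size         # S218 / S220 (assumed second hash)
        if self.primary_flags[slot]:                  # S216: flag is 1
            node = self.secondary[bucket]
            while node is not None:                   # S222: follow the pointers
                if node[0] == obj_id:
                    return True                       # S214
                node = node[1]
        # S224: register in the secondary table (prepended here for brevity).
        self.secondary[bucket] = (obj_id, self.secondary[bucket])
        self.primary_flags[slot] = True
        return False                                  # S210


visited = VisitedSet(primary_bits=2, secondary_size=2)
first_time = visited.check_and_register(5)    # False: distance not yet calculated
second_time = visited.check_and_register(5)   # True: distance already calculated
```

When the flag is 0, no ID hashing to that primary slot has been moved to the secondary table, so the chain walk can be skipped entirely; this is the saving the flag provides.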

以上説明した第１の実施形態によれば、一次元の配列構造を有する一次判定テーブル１２２と、一次元の配列構造およびリスト構造を有する二次判定テーブル１２４とが格納される記憶部１２０と、着目オブジェクト（対象データ）について新規オブジェクトとの距離を計算したか否かを判定する際に、一次判定テーブル１２２における、着目ＩＤを第１のハッシュ関数でハッシュ化したハッシュ値に対応するアドレスに、着目ＩＤまたは他のオブジェクトのオブジェクトＩＤが登録されているか否かを判定し、それらのいずれかが登録されていない場合、そのアドレスに着目ＩＤを登録し、着目ＩＤが登録されている場合、着目オブジェクトについて新規オブジェクトとの距離を計算したと判定し、他のオブジェクトのオブジェクトＩＤが登録されている場合、着目オブジェクトのオブジェクトＩＤを第２のハッシュ関数でハッシュ化したハッシュ値をキーとして二次判定テーブル１２４を参照するインデックス検索部１１４と、を備えることにより、高速処理を実現することができる。 According to the first embodiment described above, high-speed processing can be realized by providing the storage unit 120, which stores the primary determination table 122 having a one-dimensional array structure and the secondary determination table 124 having a one-dimensional array structure and a list structure, and the index search unit 114. When determining whether the distance to the new object has been calculated for the object of interest (target data), the index search unit 114 determines whether the ID of interest or the object ID of another object is registered at the address of the primary determination table 122 corresponding to the hash value obtained by hashing the ID of interest with the first hash function. If neither is registered, it registers the ID of interest at that address. If the ID of interest is registered, it determines that the distance to the new object has been calculated for the object of interest. If the object ID of another object is registered, it refers to the secondary determination table 124 using, as a key, the hash value obtained by hashing the object ID of the object of interest with the second hash function.

ここで、配列構造のみ、またはリスト構造のみで上記と同様の判定を行うものとの比較について説明する。 Here, a comparison is given with approaches that make the same determination using only an array structure or only a list structure.

仮に、一次判定テーブル１２２の配列数を、例えば既存データオブジェクト１３２に含まれるオブジェクトの数と同じだけ用意した場合、リスト構造においてポインタを辿る処理が不要となる。このため、新規オブジェクトを既に選択したか否かを判定する処理自体は高速に行うことができるが、図７のＳ２００で示した「データを消去する処理」に要する時間が長くなってしまい、検索処理全体に要する時間が長くなってしまう。 Suppose that the primary determination table 122 were given as many array elements as, for example, the number of objects included in the existing data objects 132. Following pointers in the list structure would then be unnecessary. The determination of whether a new object has already been selected could itself be performed quickly, but the "process of erasing data" shown in S200 of FIG. 7 would take longer, and the time required for the search process as a whole would increase.

一方、リスト構造のみで同様の判定を行う場合、ハッシュ衝突が起きていないオブジェクトについてもポインタを辿る処理が発生するため、一つ一つの判定処理に要する時間が長くなってしまう。また、ハッシュ衝突が起きる度にポインタが追加されるので、判定処理に要する時間が更に長くなってしまう。 On the other hand, when the same determination is made only with the list structure, the process of tracing the pointer occurs even for the object in which the hash collision does not occur, so that the time required for each determination process becomes long. Further, since the pointer is added every time a hash collision occurs, the time required for the determination process becomes longer.

通常、検索処理の対象となる範囲（探索範囲）は、グラフインデックス１３４全体に対して限られた範囲内である。従って、探索範囲に相当する配列数の配列構造を設定しておけば、判定処理に要する時間は十分に短縮することができる。この部分が一次判定テーブル１２２に相当する。二次判定テーブル１２４を使用することで処理時間は長くなってしまうが、一次判定テーブル１２２の配列数に余裕を持たせておけば、ハッシュ衝突の起きる可能性を小さくすることができ、全体として検索処理を高速化することができる。 Usually, the range covered by the search process (search range) is a limited part of the entire graph index 134. Therefore, if an array structure with a number of elements corresponding to the search range is provided, the time required for the determination process can be shortened sufficiently. This part corresponds to the primary determination table 122. Using the secondary determination table 124 lengthens the processing time, but if the number of elements in the primary determination table 122 is set with some margin, the likelihood of hash collisions can be reduced, and the search process as a whole can be sped up.
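
The sizing trade-off described above can be illustrated with a rough estimate. The sketch below is not part of the specification. It assumes that the first hash function spreads the visited IDs uniformly over the primary table, which a lower-n-bits hash only approximates, and the function name `expected_spill` and the example figures are hypothetical. Under that assumption, with m primary slots and k distinct visited IDs, the expected number of occupied slots is m(1 - (1 - 1/m)^k), and every other visited ID must go to the secondary determination table.

```python
def expected_spill(primary_slots, visited_ids):
    """Expected number of visited IDs that overflow into the secondary table.

    Assumes uniform hashing into `primary_slots` single-entry slots.
    """
    m, k = primary_slots, visited_ids
    expected_occupied = m * (1.0 - (1.0 - 1.0 / m) ** k)
    return k - expected_occupied


# Hypothetical figures: 1,000,000 nodes in the graph index, a primary table of
# about 1/10 of that (the ratio mentioned in claim 9), and 10,000 nodes
# actually visited during one search.
spill = expected_spill(100_000, 10_000)
share_in_secondary = spill / 10_000   # a few percent of visited IDs
```

The cost of the clearing step in S200 grows with the number of primary slots, while the share of lookups that must walk a list shrinks as the table grows, so the table size balances the two.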

なお、一次判定テーブル１２２は、一段の配列構造を有するものとして説明したが、多段の配列構造を有するようにしてもよい。図９は、一次判定テーブル１２２の他の例を示す図である。この場合、オブジェクトＩＤは、一段目が空いていれば一段目に、一段目が空いておらず二段目が空いていれば二段目に格納され、二段目も空いていない場合に二次判定テーブル１２４が参照される。 Although the primary determination table 122 has been described as having a single-stage array structure, it may have a multi-stage array structure. FIG. 9 is a diagram showing another example of the primary determination table 122. In this case, an object ID is stored in the first stage if the first stage is free, and in the second stage if the first stage is occupied and the second stage is free. If the second stage is also occupied, the secondary determination table 124 is referenced.
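
The multi-stage variant can be sketched as follows. This is an assumption-laden illustration, since FIG. 9 is not reproduced here: it assumes that every stage is indexed by the same first hash value, and it uses a plain Python `set` as a stand-in for the secondary determination table 124. The function name `check_multistage` and its parameters are hypothetical.

```python
def check_multistage(stages, mask, spill, obj_id):
    """Return True if obj_id was seen before; otherwise register it and return False.

    stages: list of equally sized lists (first stage, second stage, ...)
    mask:   bit mask implementing the first hash (lower n bits of the ID)
    spill:  stand-in for the secondary determination table
    """
    slot = obj_id & mask
    for stage in stages:
        if stage[slot] is None:       # this stage is free: store the ID here
            stage[slot] = obj_id
            return False
        if stage[slot] == obj_id:     # already registered in this stage
            return True
    # Every stage holds a different ID: fall back to the secondary table.
    if obj_id in spill:
        return True
    spill.add(obj_id)
    return False


stages = [[None] * 4, [None] * 4]     # two stages of four slots each
spill = set()
results = [check_multistage(stages, 0b11, spill, i) for i in (1, 5, 9, 5, 9)]
```

Because entries are never removed during a search, the stages fill in order, so checking them in order finds an ID wherever it was stored.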

また、「一次判定テーブル１２２のそれぞれのエントリに、二次判定テーブル１２４にデータが格納されているか否かを示すフラグを付与する」ものとしたが、このフラグの付与を省略し、インデックス検索部１１４は、一次判定テーブル１２２に着目ＩＤ以外のオブジェクトのＩＤが格納されている場合、二次判定テーブル１２４の最初のデータを参照して、二次判定テーブル１２４にデータが格納されているか否かを確認してもよい。 In addition, although each entry of the primary determination table 122 has been described as being given a flag indicating whether data is stored in the secondary determination table 124, this flag may be omitted. In that case, when the ID of an object other than the ID of interest is stored in the primary determination table 122, the index search unit 114 may refer to the first data of the secondary determination table 124 to check whether data is stored in the secondary determination table 124.

また、データ管理装置１００が既に実行したか否かを判定する「所定の処理」は、グラフ構造のデータを用いたデータ検索処理において、対象データを検索対象のノードとして、クエリに対応するノード（新規オブジェクト）との距離を計算する処理に限らず、他の種類の処理であってもよい。 Further, the "predetermined process" whose prior execution the data management device 100 determines is not limited to the process of calculating, in a data search process using graph-structured data, the distance between the target data, treated as a search target node, and the node corresponding to the query (the new object). It may be another type of process.

以上、本発明を実施するための形態について実施形態を用いて説明したが、本発明はこうした実施形態に何ら限定されるものではなく、本発明の要旨を逸脱しない範囲内において種々の変形及び置換を加えることができる。 Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

１０…クライアント端末、１００…データ管理装置、１０２…ネットワークインターフェース、１０４…入出力装置、１１０…制御部、１１２…データオブジェクト生成部、１１４…インデックス検索部、１２０…記憶部、１２２…一次判定テーブル、１２４…二次判定テーブル、１３０…データサーバ、１３２…既存ベクトルデータ、１３４…グラフインデックス 10 ... Client terminal, 100 ... Data management device, 102 ... Network interface, 104 ... Input / output device, 110 ... Control unit, 112 ... Data object generation unit, 114 ... Index search unit, 120 ... Storage unit, 122 ... Primary judgment table , 124 ... Secondary judgment table, 130 ... Data server, 132 ... Existing vector data, 134 ... Graph index

Claims

対象データに対する所定の処理の要否を判定するデータ管理装置であって、
前記対象データの識別情報を記憶する領域を有する一次判定テーブルと、
前記対象データの識別情報をリスト構造で記憶する二次判定テーブルと、
前記対象データの識別情報を第１のハッシュ関数でハッシュ化したハッシュ値に対応する前記一次判定テーブルのアドレスに、識別情報の登録の有無を判定し、
識別情報が登録されていないと判定された場合に、前記所定の処理を行うこととして、前記一次判定テーブルのアドレスに前記対象データの識別情報を登録し、
前記対象データの識別情報が登録されていると判定された場合に、前記対象データに関する前記所定の処理を不要と判定し、
前記二次判定テーブルに識別情報が登録されていると判定された場合、前記対象データの識別情報を第２のハッシュ関数でハッシュ化したハッシュ値をキーとして前記二次判定テーブルを参照し、前記所定の処理の要否を判定するデータ管理部と、
を備えるデータ管理装置。 A data management device that determines whether predetermined processing is necessary for target data, comprising:
a primary determination table having an area for storing identification information of the target data;
a secondary determination table that stores the identification information of the target data in a list structure; and
a data management unit that determines whether identification information is registered at the address of the primary determination table corresponding to a hash value obtained by hashing the identification information of the target data with a first hash function,
registers the identification information of the target data at the address of the primary determination table, treating the predetermined processing as to be performed, when it is determined that no identification information is registered,
determines that the predetermined processing for the target data is unnecessary when it is determined that the identification information of the target data is registered, and,
when it is determined that identification information is registered in the secondary determination table, refers to the secondary determination table using, as a key, a hash value obtained by hashing the identification information of the target data with a second hash function, and determines whether the predetermined processing is necessary.

前記データ管理部は、前記一次判定テーブルに格納された、前記二次判定テーブルに識別情報が格納されているか否かを示すフラグを参照し、前記二次判定テーブルに識別情報が登録されているか否かを判定する、
請求項１記載のデータ管理装置。 The data management unit refers to a flag, stored in the primary determination table, that indicates whether identification information is stored in the secondary determination table, and thereby determines whether identification information is registered in the secondary determination table.
The data management device according to claim 1.

前記データ管理部は、前記対象データの識別情報を前記第２のハッシュ関数でハッシュ化したハッシュ値に対応する前記二次判定テーブルのアドレスに格納されたポインタの示すアドレスを順に辿ることにより、前記二次判定テーブルに前記対象データの識別情報が登録されているか否かを判定する、
請求項１または２記載のデータ管理装置。 The data management unit determines whether the identification information of the target data is registered in the secondary determination table by following, in order, the addresses indicated by the pointers, starting from the pointer stored at the address of the secondary determination table corresponding to the hash value obtained by hashing the identification information of the target data with the second hash function.
The data management device according to claim 1 or 2.

前記データ管理部は、前記二次判定テーブルに前記対象データの識別情報が登録されている場合に、前記所定の処理が必要であると判定する、
請求項３記載のデータ管理装置。 When the identification information of the target data is registered in the secondary determination table, the data management unit determines that the predetermined processing is necessary.
The data management device according to claim 3.

前記第１のハッシュ関数によって得られるハッシュ値のとり得る値の数は、前記第２のハッシュ関数によって得られるハッシュ値のとり得る値の数よりも多い、
請求項１から４のうちいずれか１項記載のデータ管理装置。 The number of possible values of the hash value obtained by the first hash function is larger than the number of possible values of the hash value obtained by the second hash function.
The data management device according to any one of claims 1 to 4.

前記第１のハッシュ関数は、入力データのバイナリ列の下位ｎビットをハッシュ値とする関数である、
請求項１から５のうちいずれか１項記載のデータ管理装置。 The first hash function is a function whose hash value is the lower n bits of the binary string of the input data.
The data management device according to any one of claims 1 to 5.

前記一次判定テーブルは、一次元の配列構造を複数有する、
請求項１から６のうちいずれか１項記載のデータ管理装置。 The primary determination table has a plurality of one-dimensional array structures.
The data management device according to any one of claims 1 to 6.

前記所定の処理は、グラフ構造のデータを用いたデータ検索処理において、前記対象データを検索対象のノードとして、クエリに対応するノードとの距離を計算する処理である、
請求項１から７のうちいずれか１項に記載のデータ管理装置。 The predetermined process is a process of calculating the distance to a node corresponding to a query by using the target data as a search target node in a data search process using graph-structured data.
The data management device according to any one of claims 1 to 7.

前記一次判定テーブルの配列数は、前記検索対象のノードの数の１／１０程度の値に設定される、
請求項８記載のデータ管理装置。 The number of arrays in the primary determination table is set to a value of about 1/10 of the number of nodes to be searched.
The data management device according to claim 8.

対象データの識別情報を記憶する領域を有する一次判定テーブルと、前記対象データの識別情報をリスト構造で記憶する二次判定テーブルと、を有し、前記対象データに対する所定の処理の要否を判定するデータ管理装置が、
前記対象データの識別情報を第１のハッシュ関数でハッシュ化したハッシュ値に対応する前記一次判定テーブルのアドレスに、識別情報の登録の有無を判定し、
識別情報が登録されていないと判定された場合に、前記所定の処理を行うこととして、前記アドレスに前記対象データの識別情報を登録し、
前記対象データの識別情報が登録されていると判定された場合に、前記対象データに関する前記所定の処理を不要と判定し、
前記二次判定テーブルに識別情報が登録されていると判定された場合、前記対象データの識別情報を第２のハッシュ関数でハッシュ化したハッシュ値をキーとして前記二次判定テーブルを参照し、前記所定の処理の要否を判定する、
データ管理方法。 A data management method in which a data management device, which has a primary determination table having an area for storing identification information of target data and a secondary determination table that stores the identification information of the target data in a list structure, and which determines whether predetermined processing is necessary for the target data:
determines whether identification information is registered at the address of the primary determination table corresponding to a hash value obtained by hashing the identification information of the target data with a first hash function;
registers the identification information of the target data at the address, treating the predetermined processing as to be performed, when it is determined that no identification information is registered;
determines that the predetermined processing for the target data is unnecessary when it is determined that the identification information of the target data is registered; and,
when it is determined that identification information is registered in the secondary determination table, refers to the secondary determination table using, as a key, a hash value obtained by hashing the identification information of the target data with a second hash function, and determines whether the predetermined processing is necessary.

対象データの識別情報を記憶する領域を有する一次判定テーブルと、前記対象データの識別情報をリスト構造で記憶する二次判定テーブルと、を有し、前記対象データに対する所定の処理の要否を判定するデータ管理装置に、
前記対象データの識別情報を第１のハッシュ関数でハッシュ化したハッシュ値に対応する前記一次判定テーブルのアドレスに、識別情報の登録の有無を判定させ、
識別情報が登録されていないと判定された場合に、前記所定の処理を行うこととして、前記アドレスに前記対象データの識別情報を登録させ、
前記対象データの識別情報が登録されていると判定された場合に、前記対象データに関する前記所定の処理を不要と判定させ、
前記二次判定テーブルに識別情報が登録されていると判定された場合、前記対象データの識別情報を第２のハッシュ関数でハッシュ化したハッシュ値をキーとして前記二次判定テーブルを参照し、前記所定の処理の要否を判定させる、
データ管理プログラム。 A data management program that causes a data management device, which has a primary determination table having an area for storing identification information of target data and a secondary determination table that stores the identification information of the target data in a list structure, and which determines whether predetermined processing is necessary for the target data, to:
determine whether identification information is registered at the address of the primary determination table corresponding to a hash value obtained by hashing the identification information of the target data with a first hash function;
register the identification information of the target data at the address, treating the predetermined processing as to be performed, when it is determined that no identification information is registered;
determine that the predetermined processing for the target data is unnecessary when it is determined that the identification information of the target data is registered; and,
when it is determined that identification information is registered in the secondary determination table, refer to the secondary determination table using, as a key, a hash value obtained by hashing the identification information of the target data with a second hash function, and determine whether the predetermined processing is necessary.