JP5265610B2

JP5265610B2 - Related word extractor

Info

Publication number: JP5265610B2
Application number: JP2010091854A
Authority: JP
Inventors: ゾランステイチ
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2010-04-13
Filing date: 2010-04-13
Publication date: 2013-08-14
Anticipated expiration: 2030-04-13
Also published as: JP2011221877A

Description

The present invention relates to a technique for retrieving data related to a query.

In query-based search, the various kinds of content on the web are collected in advance by a technique called crawling and stored in the search engine's database. When stored, each item of data is indexed and ranked.

The content to be searched spans many media, such as text (character strings), still images, moving images, and audio, and a search engine dedicated to each kind of content performs data collection and related processing.

The user enters a query (keywords) into the search engine corresponding to the content to be searched and obtains search results for content related to that query. If the desired results are not obtained, the user searches again after adding or changing keywords.

Presenting related keywords is a known function for supporting a user's query input. Specifically, an information processing apparatus is known that stores a thesaurus dictionary in which words are classified by their relationships to one another and extracts related words for an input query from that dictionary (see, for example, Patent Document 1).

Patent Document 1: JP 2008-192110 A

However, a thesaurus dictionary such as that of Patent Document 1 requires manual maintenance, in which words must be associated with one another in advance. Moreover, the keywords entered in web searches are highly varied and include new words and abbreviations, so continually maintaining a dictionary of related words for such keywords is very laborious.

The present invention has been made in view of the above problem, and its object is to extract related words while reducing the burden of manual maintenance.

To solve the above problem, a first aspect of the present invention comprises:
search content acquisition means for outputting different queries to each of a plurality of search engines that each search a different medium, and for acquiring, for each query, the content data retrieved by the search engine;
similarity calculation means for calculating, between the queries, the similarity of the content data acquired from each search engine in response to the output of the different queries;
related word specifying means for specifying the queries as related words of one another based on the calculated similarity between the different queries output to the search engines; and
related word storage means for storing each of the plurality of queries specified as related words in association with the type of the search engine to which that query was output.

According to the first aspect, different queries are specified as related words based on the similarity of the content data retrieved for each query by the various search engines. Related words can therefore be extracted without a person judging the semantic similarity between queries, which reduces the burden of manual maintenance. In addition, because each query specified as a related word is stored in association with the type of search engine on which it was searched, the type of search engine can also be presented to the user when the related word is presented.

In a second aspect of the present invention, the apparatus further comprises storage means for accumulating and storing, for each search engine, the history of queries searched on each of the plurality of search engines, and the search content acquisition means extracts one query at a time from the stored query history of each search engine and outputs it to the corresponding search engine.

According to the second aspect, related words are specified using the query history of each search engine, so queries that users entered in the past can be used to specify related words.

In a third aspect of the present invention, the media include at least one of text, still images, moving images, and audio, and the similarity calculation means calculates the similarity by comparing, between items of content data, the feature quantities of the text, still images, moving images, or audio contained in the content data acquired from each search engine for the different queries.

According to the third aspect, the similarity is calculated from the feature quantities of the text, still images, moving images, and audio contained in the retrieved content data, so related words can be specified in a way that takes into account the content each query represents.

In a fourth aspect of the present invention, the content data retrieved by the search engines is ranked, and the similarity calculation means weights the similarity according to the rank assigned to the content data.

According to the fourth aspect, the similarity is weighted using the rank assigned to the content data retrieved for each query. Related word extraction therefore gives more importance to the similarity between content that is highly relevant to the queries, which improves extraction accuracy.

In a fifth aspect of the present invention, the apparatus further comprises page generation means for generating a page on which each of the plurality of queries stored in the related word storage means as related words can be searched on the search engine to which that query was output.

According to the fifth aspect, the type of search engine can also be presented to the user when a related word is presented.

According to the present invention, related words can be extracted while reducing the burden of manual maintenance.

FIG. 1 is a block diagram showing an example of the functional configuration of the search server.
FIG. 2 is a diagram showing examples of the data structures of the index, the query log, and the related word DB.
FIG. 3 is a flowchart showing the specific steps of the related word extraction processing.
FIG. 4 is a conceptual diagram illustrating the calculation of the multimedia similarity.
FIG. 5 is a diagram showing the query combination patterns used when calculating the multimedia similarity.
FIG. 6 is a display example of the search screen.

[Apparatus configuration of this embodiment]
An embodiment in which the related word extraction apparatus of the present invention is applied to the search server shown in FIG. 1 is described below with reference to the drawings. The apparatus of this embodiment performs web searches using queries.

FIG. 1 is a block diagram showing an example of the functional configuration of the search server 1 of this embodiment. The search server 1 is connected to a user terminal T through a communication network such as the Internet so that the two can communicate with each other.

The user terminal T is a terminal with an input function, through which the user enters a search query, and an output function, which displays the search results for that query. It is implemented by a personal computer, mobile terminal, or similar device having a CPU, an input device, a display device, and so on.

The search server 1 performs a search based on the query entered at the user terminal T and returns the search results to the user terminal T. The user terminal T is configured so that the user can specify the content to be searched. For example, on a search screen W such as that shown in FIG. 6, selecting a content tab TB specifies the content to search: web, still images (hereinafter simply "images"), videos, or product information.

The search server 1 includes search engines E that search the various kinds of content. For a query entered at the user terminal T, it generates the search results for each kind of content and returns them to the user terminal T.

As shown in FIG. 1, the search server 1 comprises a query receiving unit 10, a search result acquisition unit 20, a search result output unit 30, the search engines E (a web search engine E1, an image search engine E3, a video search engine E5, and a product search engine E7), a related word extraction unit 40, and a related word DB 50.

The query receiving unit 10 receives a search query from the user terminal T. A query consists of a single keyword or a combination of keywords. The query receiving unit 10 also receives the information on the type of content specified as the search target, described above (content designation information).

The search result acquisition unit 20 selects the search engine E corresponding to the content designation information received by the query receiving unit 10 and outputs the query to that search engine E, thereby acquiring the search results for the query.

Specifically, when the content designation information specifies the web, the query is output to the web search engine E1. Likewise, the query is output to the image search engine E3 when images are specified, to the video search engine E5 when videos are specified, and to the product search engine E7 when product information is specified.
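The routing performed by the search result acquisition unit 20 can be pictured as a lookup from content designation to engine. The following is a minimal sketch only: the registry keys ("web", "image", "video", "product"), the stub engines, and the function names are hypothetical and do not come from the specification.

```python
from typing import Callable, Dict, List


def make_engine(name: str) -> Callable[[str], List[str]]:
    """Return a stub engine that labels each result with its own name."""
    def search(query: str) -> List[str]:
        return [f"{name}:{query}"]
    return search


# Hypothetical registry: content designation -> search engine E.
ENGINES: Dict[str, Callable[[str], List[str]]] = {
    "web": make_engine("E1"),
    "image": make_engine("E3"),
    "video": make_engine("E5"),
    "product": make_engine("E7"),
}


def acquire_results(query: str, content_type: str) -> List[str]:
    """Route the query to the engine selected by the content designation."""
    try:
        engine = ENGINES[content_type]
    except KeyError:
        raise ValueError(f"unknown content type: {content_type}") from None
    return engine(query)
```

A table-driven lookup keeps the selection logic in one place, so adding another engine would only require another registry entry.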

The search result output unit 30 outputs the search results obtained by the search result acquisition unit 20 to the user terminal T. Specifically, it generates display data (for example, web page data in HTML) through which the image data of the search results can be accessed, and transmits it to the user terminal T.

Each search engine E obtains the search results for a query by comparing the keywords in the query with the keywords stored in its index D. The web search engine E1 searches web content including text, images, videos, and the like. The image search engine E3 searches still images, and the video search engine E5 searches videos. The product search engine E7 searches product information posted on sales sites.

Each search engine E collects content on the Internet by crawling, as in so-called robot-type search, and indexes that content. For example, the web search engine E1 collects web pages by following the URLs posted on web pages and indexes each URL by the keywords contained in its web page.

The image search engine E3 collects image data from image posting sites and indexes the URLs of those sites by the tag information (keywords) attached to the image data. The tag information may be set by splitting the text near the image on the web page where the image is posted into words, for example by morphological analysis. The video search engine E5 and the product search engine E7 likewise crawl and index to create their indexes D.

As shown in FIG. 1, each search engine E has its own index D (D1, D3, D5, D7) and query log L (L1, L3, L5, L7).

The index D is a so-called inverted index. As shown in FIG. 2(a), it is a database that stores, in association with one another, a keyword serving as the index key, the URL of the web page (content data) on which the content is posted, a weight value indicating the relevance between the keyword and the content data, and the content data itself. During the crawling described above, each search engine E indexes the URLs and content data it collects by the keywords (including tag information) extracted from that content data, and stores them.

It also calculates a weight value indicating the degree of relevance between the keyword and the content data, for example by TF/IDF, and stores it in association with them.
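As an illustration of an inverted index with per-keyword weights, the sketch below maps each keyword to a list of (URL, weight) pairs. It assumes one common TF/IDF variant (term frequency times the logarithm of the inverse document frequency); the specification does not fix the exact weighting, and the function name and input layout are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


def build_index(docs: Dict[str, List[str]]) -> Dict[str, List[Tuple[str, float]]]:
    """Build keyword -> [(url, weight)] from url -> list of extracted keywords."""
    n_docs = len(docs)

    # Document frequency: in how many documents each keyword appears.
    doc_freq: Counter = Counter()
    for words in docs.values():
        doc_freq.update(set(words))

    index: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for url, words in docs.items():
        counts = Counter(words)
        for word, count in counts.items():
            tf = count / len(words)                  # term frequency in this document
            idf = math.log(n_docs / doc_freq[word])  # rarer keywords weigh more
            index[word].append((url, tf * idf))
    return dict(index)
```

With this variant, a keyword that occurs in every document receives a weight of zero, because it does not help distinguish one document from another.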

When a query is input from the search result acquisition unit 20, each search engine E searches its index D based on the query and generates a list of URLs as the search results. When generating this list, it ranks the URLs of the search results in descending order of the weight value, which represents the relevance of the web page to the query.
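The lookup and ranking step can be sketched as follows. Summing the weights of all matching keywords per URL is an assumption made for illustration; the specification only states that results are ordered by descending weight. The names `Index` and `ranked_urls` are hypothetical.

```python
from typing import Dict, List, Tuple

# keyword -> [(url, weight)], the same layout as an inverted index with weights
Index = Dict[str, List[Tuple[str, float]]]


def ranked_urls(index: Index, keywords: List[str]) -> List[str]:
    """Return matching URLs ordered by descending total weight."""
    scores: Dict[str, float] = {}
    for keyword in keywords:
        for url, weight in index.get(keyword, []):
            scores[url] = scores.get(url, 0.0) + weight
    # Ties are broken by URL so that the ordering is deterministic.
    return sorted(scores, key=lambda url: (-scores[url], url))
```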

The ranking algorithm of a search engine E may use the importance of keywords within a web page, as with TF/IDF above; it may use the importance of web pages obtained by analyzing the link relationships between them; or it may combine these.

The query log L is a database that accumulates the queries searched on each search engine E. As shown in FIG. 2(b), it stores each query in association with its search count. When a query is input from the search result acquisition unit 20, each search engine E stores the query in its own query log L and updates the search count (adds 1).
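A query log of this kind behaves like a counter keyed by query string. The sketch below is illustrative only; the class and method names are hypothetical, and an in-memory counter stands in for the database.

```python
from collections import Counter
from typing import List


class QueryLog:
    """Per-engine query log: query string -> search count."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def record(self, query: str) -> None:
        """Store the query and add 1 to its search count."""
        self.counts[query] += 1

    def top(self, k: int) -> List[str]:
        """Return the k most frequently searched queries."""
        return [query for query, _ in self.counts.most_common(k)]
```

The `top` method corresponds to restricting later processing to the queries with the highest search counts.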

Note that the data structures in FIG. 2 are only examples and can be changed as appropriate for the implementation. For example, although the illustrated query log associates each query with an already aggregated search count, a separate database that accumulates each query together with the date and time of its search may be provided, and the query log L updated by periodically aggregating that database. The index D may also store only the keyword, a document ID pointing to the content data, and the weight value in association with one another, with the URL and content data stored separately in another database in association with the document ID.

The related word extraction unit 40 is a functional unit that extracts related words based on the query logs L of the search engines E. As shown in FIG. 1, it comprises search content acquisition means 42, similarity calculation means 44, and related word specifying means 46. The functions of these means are described later.

The related word DB 50 is a database that stores, in association with each other, the keywords extracted as related words by the related word extraction unit 40. As shown in FIG. 2(c), it stores combinations of a keyword and its target engine in association with one another. The target engine is data indicating the search engine E on which the keyword was actually searched.

Storing the target engine for each keyword makes it possible, for example when presenting related words to the user, to show the search engine E on which the query is frequently searched, as with the query candidates Q in FIG. 6.

[Detailed description of the related word extraction unit]
Next, the specific operation of the related word extraction unit 40 is described with reference to FIGS. 3 to 6. The related word extraction processing performed by the related word extraction unit 40 operates on the query logs L of two search engines E at a time and is carried out for every combination of search engines E (for example, web search and image search, web search and video search, ..., image search and video search, ...). It is preferably performed for all combinations of the queries stored in the query logs L, but may be limited to the queries with the highest search counts.

First, the search content acquisition means 42 of the related word extraction unit 40 extracts one query from the query log L of each of the two search engines E, the two queries being different from each other (step S11). It then outputs each extracted query to its search engine E and acquires a predetermined number (for example, the top N items, where N is any natural number) of items of content data corresponding to the search result group (list of search results) retrieved by each search engine E (step S12).

For example, in FIG. 4, "Tokyo" is extracted from the query log L1 of the web search engine E1, and search result group A is acquired by searching the web search engine E1 for "Tokyo".

Likewise, "Rainbow Bridge" is extracted from the query log L3 of the image search engine E3, and search result group B is acquired by searching the image search engine E3 for "Rainbow Bridge".

Next, the similarity calculation means 44 of the related word extraction unit 40 calculates the similarity between the search result groups acquired from the two search engines E (step S13). The similarity calculated here is called the "multimedia similarity".

The multimedia similarity is an index that takes comprehensive account of the similarity within each medium, such as text, images, and audio, contained in the content data. Specifically, it is calculated by the following formula.

[Formula 1]
Multimedia similarity = (text similarity + image similarity + video similarity) / (number of media types contained in the content)

The number of media types contained in the content is set according to what each search engine E searches. For example, it is 3 when the content contains text, images, and videos, as in web search, and 1 for image search. For a given combination of search engines E whose multimedia similarity is being calculated, this number may be set to the smaller of the two search targets' media-type counts, or to the number of media types the two search targets have in common.
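Formula 1 and the two ways of choosing its divisor can be written down directly. This is a sketch under stated assumptions: the function names and the media labels ("text", "image", "video") are hypothetical, and a similarity of 0.0 is assumed for any medium that is absent.

```python
from typing import Iterable


def multimedia_similarity(text_sim: float, image_sim: float,
                          video_sim: float, n_media_types: int) -> float:
    """Formula 1: sum of the per-medium similarities over the media-type count."""
    if n_media_types <= 0:
        raise ValueError("n_media_types must be positive")
    return (text_sim + image_sim + video_sim) / n_media_types


def smaller_media_count(media_a: Iterable[str], media_b: Iterable[str]) -> int:
    """Option 1: the smaller of the two search targets' media-type counts."""
    return min(len(set(media_a)), len(set(media_b)))


def shared_media_count(media_a: Iterable[str], media_b: Iterable[str]) -> int:
    """Option 2: the number of media types common to both search targets."""
    return len(set(media_a) & set(media_b))
```

For a web search target (text, images, videos) paired with an image search target (images only), both options give a divisor of 1.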

The text similarity is the similarity between the texts contained in the content. It is obtained, for example, by expressing the feature quantities of each text as a multidimensional vector, based on values such as the number of occurrences of keywords in the text, and taking the cosine distance between the vectors.
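A minimal sketch of this text comparison, assuming that each text has already been split into keywords and that raw occurrence counts are used as the vector components. The cosine of the angle between the two count vectors is returned, so that larger values mean more similar texts; the function name is hypothetical.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(words_a: List[str], words_b: List[str]) -> float:
    """Cosine of the angle between two keyword-count vectors (0.0 to 1.0)."""
    a, b = Counter(words_a), Counter(words_b)
    # Only keywords present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)
```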

For search result group A of the web search engine E1 and search result group B of the image search engine E3, assume that each group contains N items of content, denoted A1, A2, A3, ..., AN and B1, B2, B3, ..., BN respectively.

The text similarity between search result group A and search result group B is obtained as follows.

[Formula 2]
Text similarity = [MAX{similarity(text A1, text B1), similarity(text A1, text B2), ..., similarity(text A1, text BN)} + MAX{similarity(text A2, text B1), similarity(text A2, text B2), ..., similarity(text A2, text BN)} + ... + MAX{similarity(text AN, text B1), similarity(text AN, text B2), ..., similarity(text AN, text BN)}] / N

Here MAX{} denotes a function that selects the maximum value, and similarity() denotes a function that calculates a similarity. That is, for each text in one search result group, the similarity to its most similar text in the other group is taken, and the text similarity is the average of these maximum similarities.
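The average-of-maxima pattern of Formula 2 does not depend on the medium, so it can be sketched once with the pairwise similarity passed in as a function. The name `group_similarity` and the handling of empty groups are assumptions made for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def group_similarity(group_a: Sequence[T], group_b: Sequence[T],
                     similarity: Callable[[T, T], float]) -> float:
    """For each item of A, take its best match in B, then average over A."""
    if not group_a or not group_b:
        return 0.0  # assumption: an empty result group contributes no similarity
    best_matches = [max(similarity(a, b) for b in group_b) for a in group_a]
    return sum(best_matches) / len(group_a)
```

As written, the measure is not symmetric: swapping the two groups can change the result, because the maximum is taken over the second group and the average over the first.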

The image similarity is the similarity between the images contained in the content. It is obtained, for example, by expressing feature quantities extracted from each image, such as color, shape, and pattern, as a multidimensional vector and taking the Euclidean distance between the vectors. For the search result groups of the web search engine E1 and the image search engine E3, the image similarity is given by the following formula.

[Formula 3]
Image similarity = [MAX{similarity(image A1, image B1), similarity(image A1, image B2), ..., similarity(image A1, image BN)} + MAX{similarity(image A2, image B1), similarity(image A2, image B2), ..., similarity(image A2, image BN)} + ... + MAX{similarity(image AN, image B1), similarity(image AN, image B2), ..., similarity(image AN, image BN)}] / N

The video similarity is the similarity between the videos contained in the content. It is obtained, for example, by expressing image feature quantities extracted from each video, such as color, shape, and pattern, together with feature quantities such as object motion and audio signals, as a multidimensional vector and taking the Euclidean distance between the vectors. For the search result groups of the web search engine E1 and the image search engine E3, the video similarity is given by the following formula.

[Formula 4]
Video similarity = [MAX{similarity(video A1, video B1), similarity(video A1, video B2), ..., similarity(video A1, video BN)} + MAX{similarity(video A2, video B1), similarity(video A2, video B2), ..., similarity(video A2, video BN)} + ... + MAX{similarity(video AN, video B1), similarity(video AN, video B2), ..., similarity(video AN, video BN)}] / N

As described above, the text, image, and video similarities between the search result groups are obtained, and their sum is divided by the number of media types contained in the content to give their average, the multimedia similarity. The multimedia similarity makes it possible to calculate the similarity between the queries that produced the search results, based on the content of the search result groups. Moreover, because MAX{} selects the maximum value in calculating each similarity, the multimedia similarity is calculated from the items of content data that are most similar between the two search result groups.

When a single item of content contains several media of the same kind, as with search results A1 and A2 in FIG. 4, the maximum similarity may be extracted for each of them as described above and the average of those values used. Each similarity value is also preferably normalized by the maximum similarity for its medium (for example, per type such as text or image).

The search content acquisition means 42 of the related word extraction unit 40 determines whether queries have been extracted in every combination between the query logs L of the two search engines E (step S14). For example, as shown in FIG. 5, when the web search query log L1 contains M queries in total and the image search query log L3 contains L queries in total, queries are extracted in L × M combinations. When there are three or more search engines E, the multimedia similarity between queries is calculated for each combination of search engines E.
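The enumeration of engine pairs and query pairs can be sketched with the standard library. The data layout (engine name mapped to a list of queries) and the function name are hypothetical, and skipping identical query strings reflects the requirement that the two extracted queries differ.

```python
from itertools import combinations, product
from typing import Dict, Iterator, List, Tuple


def query_pairs(query_logs: Dict[str, List[str]]
                ) -> Iterator[Tuple[str, str, str, str]]:
    """Yield (engine_a, query_a, engine_b, query_b) for every engine pair."""
    # Every unordered pair of engines, for example web/image and web/video.
    for (engine_a, queries_a), (engine_b, queries_b) in combinations(
            query_logs.items(), 2):
        # Every cross-log pair of queries: up to L x M per engine pair.
        for query_a, query_b in product(queries_a, queries_b):
            if query_a != query_b:  # only different queries are compared
                yield engine_a, query_a, engine_b, query_b
```

Because the number of pairs grows with the product of the log sizes, restricting each log to its most frequently searched queries keeps the work bounded.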

When the similarity calculation means 44 calculates the multimedia similarity between queries extracted from the query logs L of the search engines E, that multimedia similarity is stored in memory in association with the query combination, in the correspondence shown in FIG. 5.

The related word specifying means 46 extracts related words based on the multimedia similarities calculated in step S13 and stores them in the related word DB 50 (step S15). Specifically, from the multimedia similarities calculated for a given query Q1, it selects a predetermined number of the highest values (for example, the top 10), specifies the queries for which those multimedia similarities were calculated as related words of query Q1, and stores them in the related word DB 50 in association with one another. When storing them, it also stores the identification information of the search engine E on which each query was searched, as the target engine.

For example, as shown in FIG. 5, among queries 1 to L in the image search query log L3, the ten queries with the highest multimedia similarity to the web search query "Tokyo" are identified as related words.

Then, for example, if the query "Rainbow Bridge" of the image search engine E3 has been identified as a related word for the query "Tokyo" of the web search engine E1, "Tokyo" and "Rainbow Bridge" are stored in association with their respective target engines, "web search engine E1" and "image search engine E3".
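The top-N selection of step S15 and the storage of the target engines might look like the sketch below. The record layout, the function name `top_related`, and the similarity values are illustrative assumptions; the source only states that the highest-scoring queries are stored together with the identification of the engines involved.

```python
import heapq
from typing import Dict, List, Tuple


def top_related(
    query: str,
    similarities: Dict[Tuple[str, str], float],
    query_engine: str,
    related_engine: str,
    top_n: int = 10,
) -> List[dict]:
    """Pick the top_n most similar queries for `query` and attach engine IDs."""
    # Keep only the similarities that were calculated for this query.
    candidates = [
        (score, other)
        for (q, other), score in similarities.items()
        if q == query
    ]
    best = heapq.nlargest(top_n, candidates, key=lambda c: c[0])
    return [
        {
            "query": query,
            "query_engine": query_engine,
            "related_word": other,
            "related_engine": related_engine,
            "similarity": score,
        }
        for score, other in best
    ]


# Made-up similarity values, for illustration only.
sims = {
    ("Tokyo", "Rainbow Bridge"): 0.82,
    ("Tokyo", "Mount Fuji"): 0.41,
    ("Tokyo", "Tokyo Tower"): 0.77,
    ("Kyoto", "Kinkaku-ji"): 0.90,
}
records = top_related("Tokyo", sims, "E1", "E3", top_n=2)
for rec in records:
    print(rec["related_word"], rec["related_engine"])
```

Each returned record corresponds to one row that would be written to the related word DB 50.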

When creating the list page of the search results obtained by the search result acquisition unit 20, the search result output unit 30 extracts the related words associated with the query from the related word DB 50 and embeds them in the page as query candidates. Each displayed query candidate is preferably set as a link carrying a search instruction for the search engine E. The search screen W shown in FIG. 6 is a display example of the query candidates Q. By selecting a query candidate Q, the user can instruct a new search with that query.

When query candidates are displayed, the target engine stored in association with each related word may also be used to indicate the search engine E on which a search with that query candidate is preferable. In this case, a search instruction for the search engine E associated with the related word is set as a link on the related word serving as the query candidate. For example, by selecting a query candidate Q on the search screen W of FIG. 6, the user can enter a search instruction for the search engine E corresponding to that query.
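As a rough sketch of how a query candidate could be rendered as a link that targets its associated search engine, consider the following. The endpoint URLs, the `q` parameter name, and the function name are hypothetical; the source does not specify how the search instruction is encoded.

```python
from html import escape
from urllib.parse import urlencode

# Hypothetical endpoints keyed by engine ID; not taken from the source.
ENGINE_ENDPOINTS = {
    "E1": "https://search.example/web",
    "E3": "https://search.example/image",
}


def candidate_link(related_word: str, engine_id: str) -> str:
    """Build an HTML link that re-runs the search on the stored target engine."""
    url = f"{ENGINE_ENDPOINTS[engine_id]}?{urlencode({'q': related_word})}"
    return f'<a href="{escape(url)}">{escape(related_word)}</a>'


print(candidate_link("Rainbow Bridge", "E3"))
# <a href="https://search.example/image?q=Rainbow+Bridge">Rainbow Bridge</a>
```

Escaping both the URL and the label keeps the generated page well-formed even when a related word contains markup characters.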

As described above, according to the present embodiment, the similarity (multimedia similarity) between the groups of search results returned by the search engines E for the queries stored in the query logs L is calculated, so that related words can be extracted in consideration of the similarity of the media contained in the search results, such as text, still images, and moving images. Related words can therefore be extracted with reduced manual cost.

In general, a simple method of extracting related words is to extract queries (for example, "Tokyo Station") that contain the query entered by the user (for example, "Tokyo") as a substring. According to the present embodiment, however, related words can be extracted by considering not only the character-string similarity of queries but also the meaning (content) of each query as reflected in the content data.

[Modification]
The embodiment described above is one example of applying the present invention, and its applicable range is not limited to that example.

For example, by using an audio search engine that searches audio data such as music and recorded speech, and adding an audio similarity to the multimedia similarity, related words may also be extracted from the query log of the audio search engine. Search engines for other media, such as music, news, blogs, and maps, can also be used.

In addition, when calculating the multimedia similarity, the ranking assigned to the search results may be used as a weight. For example, taking positions 1 to N of the search results as ranks, the text similarity described above is obtained by the following equation.

\[
\text{text similarity}
= \frac{1}{N}\sum_{i=1}^{N}\max_{1\le j\le N}\left\{\operatorname{sim}(A_i,B_j)\times\frac{1}{i\cdot j}\right\}
= \frac{1}{N}\sum_{i=1}^{N}\frac{1}{i}\,\max_{1\le j\le N}\left\{\operatorname{sim}(A_i,B_j)\times\frac{1}{j}\right\}
\]

Here, \(\operatorname{sim}(A_i, B_j)\) is the similarity between text \(A_i\), ranked \(i\)-th in the search results for one query, and text \(B_j\), ranked \(j\)-th in the search results for the other query.

Weighting the similarity calculation by the ranking of the search results in this way yields a multimedia similarity that emphasizes the content of the search results most relevant to each query, so that more closely related words can be extracted.
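The rank-weighted text similarity can be sketched in a few lines. This is an illustration under stated assumptions: `word_overlap` is a hypothetical stand-in for the text similarity measure (the source does not define it here), and the two result lists are assumed to be ordered by rank, with position 1 first.

```python
from typing import Callable, Sequence


def rank_weighted_similarity(
    results_a: Sequence[str],
    results_b: Sequence[str],
    similarity: Callable[[str, str], float],
) -> float:
    """Average over i of max over j of sim(A_i, B_j) / (i * j), ranks from 1."""
    n = min(len(results_a), len(results_b))
    if n == 0:
        return 0.0
    total = 0.0
    for i in range(1, n + 1):
        # Lower-ranked results (larger i or j) contribute less to the score.
        total += max(
            similarity(results_a[i - 1], results_b[j - 1]) / (i * j)
            for j in range(1, n + 1)
        )
    return total / n


def word_overlap(a: str, b: str) -> float:
    """Hypothetical text similarity: Jaccard overlap of whitespace-split words."""
    words_a, words_b = set(a.split()), set(b.split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


score = rank_weighted_similarity(["x y", "z"], ["x y", "z"], word_overlap)
print(score)  # 0.625: (1/1 + 1/4) / 2 for two identical result lists
```

Note that even identical result lists score below 1 once N exceeds 1, because the second-ranked match is discounted by 1/(2·2); scores are therefore comparable only between calculations that use the same N.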

In the above description, related words are extracted using the queries stored in the query log L of each search engine E. Alternatively, for example, a database storing a plurality of keywords (for example, a Japanese-language dictionary DB) may be provided, and different keywords stored in this database may be extracted and output to the search engines E to extract related words in the manner described above.

The multimedia similarity has been described as the average of the text similarity, the image similarity, and the moving image similarity, as in Equation 1, but the maximum or minimum of these values may be used instead.

Specifically, the largest of the calculated text similarity, image similarity, and moving image similarity may be set as the multimedia similarity. This allows related words to be extracted using a multimedia similarity that emphasizes the medium with the highest similarity.

Alternatively, the smallest of the text similarity, image similarity, and moving image similarity may be set as the multimedia similarity. This sets a stricter criterion for extracting related words and can thereby improve extraction accuracy.

Alternatively, the product of the text similarity, image similarity, and moving image similarity may be set as the multimedia similarity. In this case the multimedia similarity is high only when all of the individual similarities are high, so the related words extracted are those whose media are similar overall.
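The four ways of combining the per-media similarities (average, maximum, minimum, product) differ only in the reducing function, as the following sketch shows. The function name, the `mode` parameter, and the sample values are illustrative assumptions.

```python
from math import prod
from statistics import mean

# Each mode maps the three per-media similarities to one multimedia similarity.
COMBINERS = {
    "mean": mean,     # average, as in the base embodiment
    "max": max,       # emphasizes the most similar medium
    "min": min,       # strictest criterion
    "product": prod,  # high only if every medium is similar
}


def multimedia_similarity(
    text_sim: float, image_sim: float, video_sim: float, mode: str = "mean"
) -> float:
    """Combine text, image, and moving image similarities into one score."""
    return COMBINERS[mode]([text_sim, image_sim, video_sim])


for mode in COMBINERS:
    print(mode, multimedia_similarity(0.9, 0.6, 0.3, mode))
```

With one weak medium (0.3 here), the minimum and the product both drop sharply, which is the behavior described for the stricter variants.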

In calculating the text similarity, image similarity, and moving image similarity, MAX{} has been described as selecting the maximum similarity between the content data of the search results. However, MIN{} may be used to select the minimum value and thereby tighten the similarity criterion, or AVE{} may be used so that all of the similarities are taken into account.
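The choice between MAX{}, MIN{}, and AVE{} can be sketched as a pluggable reducer applied to each result's similarities against the other result set. This assumes the unweighted per-medium similarity averages the reduced values over the results of the first query; the function name and the scalar toy features below are hypothetical and stand in for real text, image, or video features.

```python
from statistics import mean
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")

REDUCERS = {"MAX": max, "MIN": min, "AVE": mean}


def media_similarity(
    items_a: Sequence[T],
    items_b: Sequence[T],
    similarity: Callable[[T, T], float],
    reducer: str = "MAX",
) -> float:
    """For each item of A, reduce its similarities to all items of B, then average."""
    if not items_a or not items_b:
        return 0.0
    reduce_fn = REDUCERS[reducer]
    per_item = [
        reduce_fn([similarity(a, b) for b in items_b]) for a in items_a
    ]
    return sum(per_item) / len(per_item)


def closeness(x: float, y: float) -> float:
    """Toy similarity on scalar features in [0, 1]; illustration only."""
    return 1.0 - abs(x - y)


features_a = [0.0, 1.0]
features_b = [0.0, 0.5]
for name in REDUCERS:
    print(name, media_similarity(features_a, features_b, closeness, name))
```

Because MIN{} requires every result pair to be similar, it always yields a score no higher than AVE{}, which in turn never exceeds MAX{}.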

The operation of the embodiment can be implemented by installing appropriate computer software on a computer. The present invention is not limited to the embodiment described above; various changes may be made to the specific configuration within the scope of the claims.

For example, each component need only exist as a functional block and need not exist as independent hardware. It may be implemented in either hardware or computer software. Furthermore, one functional element of the present invention may be realized by a set of several functional elements, and several functional elements of the present invention may be realized by a single functional element.

The functional elements may also be located at physically separate positions, in which case they may be connected to one another by a network. The functions may also be realized, or the functional elements configured, by grid computing.

D Index
E Search engine
E1 Web search engine
E3 Image search engine
E5 Video search engine
E7 Product search engine
L Query log
Q Query candidate
1 Search server
10 Query reception unit
20 Search result acquisition unit
30 Search result output unit
40 Related word extraction unit
50 Related word DB

Claims

A related word extraction device comprising:
search content acquisition means for outputting a different query to each of a plurality of search engines, each of which searches a different medium, and acquiring, for each query, the content data retrieved by the search engine;
similarity calculation means for calculating, between the queries, the similarity of the content data acquired from the respective search engines in response to the different queries;
related word specifying means for specifying the queries as related words of one another on the basis of the calculated similarity between the different queries output to the search engines; and
related word storage means for storing each of the plurality of queries specified as related words in association with the type of the search engine to which that query was output.

The related word extraction device according to claim 1, further comprising storage means for accumulating and storing, for each search engine, a history of the queries searched on each of the plurality of search engines,
wherein the search content acquisition means extracts queries one at a time from the stored query history of each search engine and outputs each query to the corresponding search engine.

The related word extraction device according to claim 1 or 2, wherein the media include at least one of text, still images, moving images, and audio, and
the similarity calculation means calculates the similarity by comparing, between the content data, feature amounts of the text, still images, moving images, or audio contained in the content data acquired from the respective search engines for the different queries.

The related word extraction device according to any one of claims 1 to 3, wherein the content data retrieved by the search engines is ranked, and
the similarity calculation means weights the similarity according to the rank assigned to the content data.

The related word extraction device according to any one of claims 1 to 4, further comprising page generation means for generating a page on which each of the plurality of queries specified as related words and stored in the related word storage means can be searched on the search engine to which it was output.

A related word extraction method in which a computer performs:
a search content acquisition step of outputting a different query to each of a plurality of search engines, each of which searches a different medium, and acquiring, for each query, the content data retrieved by the search engine;
a similarity calculation step of calculating, between the queries, the similarity of the content data acquired from the respective search engines in response to the different queries;
a related word specifying step of specifying the queries as related words of one another on the basis of the calculated similarity between the different queries output to the search engines; and
a related word storage step of storing, in related word storage means, each of the plurality of queries specified as related words in association with the type of the search engine to which that query was output.

A program for causing a computer to execute the related word extraction method according to claim 6.