JP2013073474A

JP2013073474A - Information processor and information processing method

Info

Publication number: JP2013073474A
Application number: JP2011212922A
Authority: JP
Inventors: Keiichi Ochiai; 桂一落合
Original assignee: NTT Docomo Inc
Current assignee: NTT Docomo Inc
Priority date: 2011-09-28
Filing date: 2011-09-28
Publication date: 2013-04-22

Abstract

PROBLEM TO BE SOLVED: To store information that serves as an index of the value of content by using text data included in the content. SOLUTION: A control unit 110 of a text search device extracts text data including a transmitted search query (step S52). The control unit 110 refers to the flag associated with the information source ID assigned to the extracted text data, and calculates a ranking score by using the value ("1" or "0") of the flag (step S53). The control unit 110 ranks the text data extracted in step S52 in descending order of the ranking score calculated in step S53 (step S54), and transmits the text data to a user terminal (step S55). A control unit of the user terminal displays the transmitted text data on a display unit, arranged in an order corresponding to the ranking, for example, in descending order (step S56).

Description

The present invention relates to an information processing apparatus and an information processing method.

A Web server device connected to the Internet stores content including text, audio, images, or video, and distributes it in response to requests from users. One example is so-called word-of-mouth content, such as comments about a product or impressions of a facility. To make it easier for users to find what they want among such content, there are techniques that narrow the content down by search and evaluate its value to the user. Patent Document 1 describes a technique that evaluates the value of each piece of content by ranking content obtained by searching text on the Internet according to the results of searching multimedia data such as image data and video data.

Patent Document 1: JP 2010-186214 A

In recent years, the means of transmitting such content have diversified to include diaries, bulletin boards, blogs, microblogs, and SNS (Social Network Service), and the value of the content varies accordingly. For example, among content posted on a microblog, word-of-mouth content that includes comments from actual users, or content that indicates how to access other content, may be more valuable than content that does not. Conversely, content that carries little information, such as a single word, or content that merely repeats other content (a so-called Retweet), may be less valuable. If the content narrowed down by a search includes low-value items, high-value content may be buried among them and become difficult to find. The technique of Patent Document 1 evaluates the value of content by ranking it based on the content of multimedia data, but the kind of value described above is difficult to evaluate from the content of multimedia data.
Therefore, an object of the present invention is to accumulate information that serves as an index of the value of content by using text data included in the content.

To achieve the above object, the present invention provides an information processing apparatus comprising: collecting means for collecting text data included in content; decomposing means for decomposing the text data collected by the collecting means into morphemes; counting means for counting the number of morphemes produced by the decomposing means; first determination means for determining whether the number of morphemes counted by the counting means is greater than or equal to a threshold; second determination means for determining whether the text data collected by the collecting means includes an address for accessing content; third determination means for determining whether a predetermined specific character string is included at a predetermined position in the text data collected by the collecting means; and identifier storage means for storing, in association with the text data, identifiers respectively representing the determination result of the first determination means, the determination result of the second determination means, and the determination result of the third determination means for that text data.

The apparatus may further comprise: acquisition means for acquiring a search query from a communication device operated by a user; extraction means for extracting, from the text data collected by the collecting means, text data including the search query acquired by the acquisition means; calculation means for calculating an evaluation value used when evaluating the text data as a search target, using values corresponding to the identifiers stored in the identifier storage means in association with the text data extracted by the extraction means; and transmission means for transmitting the text data extracted by the extraction means to the communication device together with the evaluation value calculated for that text data by the calculation means, or the rank of that evaluation value.

Furthermore, the calculation means may calculate the evaluation value based on values obtained by multiplying each of the values corresponding to the identifiers associated with the text data by a predetermined coefficient.

The specific character string may indicate that the text data was sent to a specific recipient, or that the text data quotes other text data different from itself.

The present invention also provides an information processing method executed in an information processing terminal, the method comprising: a collecting step of collecting text data included in content; a decomposing step of decomposing the text data collected in the collecting step into morphemes; a counting step of counting the number of morphemes produced in the decomposing step; a first determination step of determining whether the number of morphemes counted in the counting step is greater than or equal to a threshold; a second determination step of determining whether the text data collected in the collecting step includes an address for accessing content; a third determination step of determining whether a predetermined specific character string is included at a predetermined position in the text data collected in the collecting step; and an identifier storage step of storing, in association with the text data, identifiers respectively representing the determination result of the first determination step, the determination result of the second determination step, and the determination result of the third determination step for that text data.

According to the present invention, information that serves as an index of the value of content can be accumulated by using text data included in the content.

A block diagram showing the overall configuration of the text search system according to the embodiment.
A diagram showing the hardware configuration of the user terminal.
A diagram showing the hardware configuration of the content transmission server device.
A diagram showing the hardware configuration of the text search device.
A block diagram showing the functions realized by the control unit of the text search device.
A table showing an example of the contents of the URL dictionary.
A flowchart showing the procedure of the flag storage process.
A flowchart showing the procedure of the condition A determination process.
A flowchart showing the procedure of the condition B determination process.
A flowchart showing the procedure of the condition C determination process.
A table showing an example of the search index.
A sequence chart of the search process.

[Embodiment]
Embodiments of the present invention will be described below with reference to the drawings.
In the following, text is something composed of characters, and is a concept that includes sentences, phrases, and character strings. Text data is text expressed in character codes.
FIG. 1 is a block diagram showing the overall configuration of a text search system 1 according to an embodiment of the present invention. The text search system 1 provides users with a text search service that finds the content a user wants by text search and reports the result. The text search system 1 includes a text search device 10, a plurality of content transmission server devices 20, a plurality of user terminals 30, and a network 40. The network 40 includes a mobile communication network, the Internet, or the like. The text search device 10, each content transmission server device 20, and each user terminal 30 are connected to one another via the network 40.

The plurality of content transmission server devices 20 are Web server devices that provide users of the user terminals 30 with services such as diaries, bulletin boards, blogs, microblogs, or SNS. Each content transmission server device 20 publishes content posted from one user's terminal on a Web page, so that users of other user terminals who run a browsing program such as a browser and access the URL (Uniform Resource Locator) of that Web page can view the content. In this way, each content transmission server device 20 transmits content posted by a user to other users. Some of the content posted to the content transmission server devices 20 includes audio, images, video, and the like in addition to text.

The text search device 10 searches the content posted to the content transmission server devices 20 for content that includes specific text. The specific text used for this search is called a search query and is sent from a user terminal 30. The search query is text that the user has chosen in order to find the desired content, for example a phrase that the user expects to appear in the text of that content. The text search device 10 acquires and stores the content that users have posted to each content transmission server device 20, searches the stored content for content that includes the search query sent from the user terminal 30, and notifies the user terminal 30 of the result. The text search device 10 is managed by the provider that offers the text search service described above to users.

The plurality of user terminals 30 are used by users when posting content to a content transmission server device 20, or when sending a search query to the text search device 10 and obtaining the search result. Each user terminal 30 is a communication device such as a mobile phone, a smartphone, a tablet terminal, or a personal computer, and communicates with the network 40 wirelessly or by wire. FIG. 1 shows user terminals 30 that communicate wirelessly.

FIG. 2 is a diagram showing the hardware configuration of the user terminal 30. The user terminal 30 is configured as a computer including a control unit 310, a communication unit 320, an operation unit 330, a display unit 340, and a storage unit 350. The control unit 310 includes an arithmetic device such as a CPU (Central Processing Unit) and storage devices such as a ROM (Read Only Memory) and a RAM (Random Access Memory). The CPU controls the operation of each unit of the user terminal 30 by executing programs stored in the ROM or the storage unit 350, using the RAM as a work area. The communication unit 320 includes a communication circuit that exchanges signals with the network 40, and communicates with the text search device 10 and the content transmission server devices 20 via the network 40. The operation unit 330 includes operation elements such as a plurality of keys and a touch sensor, and supplies operation signals corresponding to user operations to the control unit 310. The control unit 310 performs processing according to these operation signals. The display unit 340 is display means having a liquid crystal panel and a liquid crystal drive circuit, and displays images on the display surface of the liquid crystal panel in accordance with instructions from the control unit 310. The storage unit 350 is storage means such as a flash memory or a hard disk, and stores data and programs that the control unit 310 uses for control.

FIG. 3 is a diagram showing the hardware configuration of the content transmission server device 20. The content transmission server device 20 is configured as a computer including a control unit 210, a communication unit 220, and a storage unit 230. The control unit 210 includes an arithmetic device such as a CPU and storage devices such as a ROM and a RAM. The CPU controls the operation of each unit of the content transmission server device 20 by executing programs stored in the ROM or the storage unit 230, using the RAM as a work area. The communication unit 220 is connected to the network 40 and exchanges data with the text search device 10 and each user terminal 30. The storage unit 230 is storage means such as a hard disk, and stores data and programs that the control unit 210 uses for control, including the content posted by users as described above.

FIG. 4 is a diagram showing the hardware configuration of the text search device 10. The text search device 10 includes a control unit 110, a communication unit 140, and a storage unit 130. As hardware, these units are the same as the corresponding units of the content transmission server device 20. The data and programs stored in the storage unit 130 differ from those stored in the storage unit 230. The functions that the control unit 110 realizes by executing the programs stored in the storage unit 130, and the data stored in the storage unit 130, are described below with reference to FIG. 5.

FIG. 5 is a block diagram showing the functions realized by the control unit 110 of the text search device 10 and the data stored in the storage unit 130. The control unit 110 realizes the functions of a content collection unit 111, a morpheme analysis unit 112, a morpheme number counting unit 113, a morpheme number determination unit 114, a URL extraction unit 115, a URL collation unit 116, a text analysis unit 117, an acquisition unit 118, an extraction unit 119, a calculation unit 120, and a transmission unit 121. The storage unit 130 stores a text data group 131, a URL dictionary 132, and a search index 133.

The content collection unit 111 is collecting means that collects text data representing the text included in content transmitted from each content transmission server device 20. This text is the text included in content published on the Web pages provided by each content transmission server device 20. The content collection unit 111 collects text data by using various APIs (Application Programming Interfaces) or by performing a process called crawling, which collects content while following the URLs of Web pages. The content collection unit 111 assigns a unique information source ID to each piece of collected text data, and supplies the text data, in association with its information source ID, to the morpheme analysis unit 112, the URL extraction unit 115, and the text analysis unit 117. The information source ID is an identifier for identifying each piece of text data. The content collection unit 111 also stores the collected text data in the storage unit 130 in association with the information source ID. The text data stored in this way by the content collection unit 111 constitutes the text data group 131.

Next, the functions that determine whether the text represented by the text data supplied from the content collection unit 111 (hereinafter "collected text") satisfies three predetermined conditions (referred to as conditions A, B, and C) are described. These conditions are defined so that text satisfying them is likely to be of high value to the user.
First, the determination of condition A is described. Condition A is satisfied when the number of morphemes included in the collected text is greater than or equal to a threshold. Condition A is determined by the morpheme analysis unit 112, the morpheme number counting unit 113, and the morpheme number determination unit 114 working together. The morpheme analysis unit 112 is decomposing means that performs morphological analysis on the collected text and decomposes it into morphemes. The morpheme analysis unit 112 supplies the resulting morphemes to the morpheme number counting unit 113 in association with the information source ID. The morpheme number counting unit 113 is counting means that counts, for each information source ID, the number of morphemes (morpheme count) supplied from the morpheme analysis unit 112. The morpheme number counting unit 113 supplies the resulting morpheme count to the morpheme number determination unit 114 in association with the information source ID.

The morpheme number determination unit 114 is first determination means that determines whether the morpheme count produced by the morpheme number counting unit 113 is greater than or equal to the threshold. The morpheme number determination unit 114 stores a flag (referred to as flag A), which is "1" when the morpheme count is greater than or equal to the threshold and "0" when it is less than the threshold, in the search index 133 in association with the information source ID of the collected text. The search index 133 is a storage area that stores information source IDs, flag A, and the like. A stored flag A of "1" indicates that the text data assigned the corresponding information source ID (that is, the text it represents) satisfies condition A, and a value of "0" indicates that the text data does not satisfy condition A.
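The condition A determination can be sketched as follows. This is only an illustration under stated assumptions: the threshold value and the function names are hypothetical, and the regular-expression tokenizer is a stand-in for a real morphological analyzer, which the embodiment would need for Japanese text.

```python
import re

# Hypothetical threshold; the source does not state a concrete value.
MORPHEME_THRESHOLD = 5


def split_into_morphemes(text):
    """Stand-in for morphological analysis (assumption, not a real analyzer).

    Splits text into runs of word characters and single punctuation marks.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def flag_a(text, threshold=MORPHEME_THRESHOLD):
    """Return 1 if the morpheme count is at or above the threshold, else 0."""
    morpheme_count = len(split_into_morphemes(text))
    return 1 if morpheme_count >= threshold else 0
```

With this stand-in tokenizer, a one-word post such as "Nice!" yields flag A of 0, while a full sentence yields 1.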

Next, the determination of condition B is described. Condition B is satisfied when the collected text includes a URL registered in the URL dictionary. Condition B is determined by the URL extraction unit 115 and the URL collation unit 116 working together. First, the URL dictionary 132 used in determining condition B is described. The URL dictionary 132 is a dictionary in which the URLs of Web pages and the types of content included in those Web pages (content types) are registered.
FIG. 6 is a table showing an example of the contents of the URL dictionary 132. In this example, the URLs "http://xxxx.xx", "http://yyyy.yy", and "http://zzzz.zz" are associated with the content types "photo", "video", and "news", respectively. The URL dictionary 132 is prepared in advance by the provider of the text search service described above, and contains the URLs of Web pages, such as diaries, bulletin boards, blogs, microblogs, or SNS, provided by each content transmission server device 20.

The URL extraction unit 115 extracts URLs from the collected text described above. If the text includes a URL that has been shortened by the microblog provider (a shortened URL), the URL extraction unit 115 expands the shortened URL using an existing technique and extracts the result as the URL of the content. The URL extraction unit 115 supplies the extracted URL to the URL collation unit 116. If no URL can be extracted because the collected text contains none, the URL extraction unit 115 notifies the URL collation unit 116 to that effect.

When a URL is supplied from the URL extraction unit 115, the URL collation unit 116 collates that URL (referred to as the extracted URL) with the URLs included in the URL dictionary 132 (referred to as dictionary URLs). The URL collation unit 116 stores a flag (referred to as flag B), which is "1" when a dictionary URL matches the extracted URL and "0" when none does, in the search index 133 in association with the information source ID of the collected text. When the URL extraction unit 115 notifies it that the collected text contains no URL, the URL collation unit 116 sets flag B to "0" and stores it in the search index 133. A stored flag B of "1" indicates that the text data assigned the corresponding information source ID (that is, the text it represents) satisfies condition B, and a value of "0" indicates that the text data does not satisfy condition B. As described above, the URL extraction unit 115 and the URL collation unit 116 together function as second determination means that determines whether the collected text includes a URL for accessing content.
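The condition B determination might be sketched as below. The dictionary entries are the placeholder URLs from FIG. 6; everything else is an assumption made for illustration: the URL pattern, the function names, the `short_url_table` mapping that stands in for the unspecified "existing technique" for expanding shortened URLs, and the rule that an extracted URL matches a dictionary URL when it equals it or lies beneath it (the source only says the URLs "match").

```python
import re

# Example entries from FIG. 6 (URL -> content type).
URL_DICTIONARY = {
    "http://xxxx.xx": "photo",
    "http://yyyy.yy": "video",
    "http://zzzz.zz": "news",
}

# Simplified URL pattern (assumption): a scheme followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_urls(text, short_url_table=None):
    """Extract URLs, expanding shortened ones via a hypothetical lookup table."""
    short_url_table = short_url_table or {}
    return [short_url_table.get(url, url) for url in URL_PATTERN.findall(text)]


def matches_dictionary(url, dictionary=URL_DICTIONARY):
    """Assumed matching rule: equal to, or a path under, a dictionary URL."""
    return any(url == d or url.startswith(d + "/") for d in dictionary)


def flag_b(text, short_url_table=None):
    """Return 1 if any extracted URL matches a dictionary URL, else 0."""
    urls = extract_urls(text, short_url_table)
    return 1 if any(matches_dictionary(u) for u in urls) else 0
```

Text without any URL yields an empty extraction result, so flag B is 0, which corresponds to the case in which the URL extraction unit reports that no URL was found.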

Next, the determination of condition C is described. Condition C is satisfied when the collected text does not begin with a predetermined specific character string. The specific character string is one used at the beginning of a post that quotes the same content in a microblog (a so-called Retweet), or at the beginning of a message addressed to a specific individual, for example "RT" or "@". Condition C is determined by the text analysis unit 117. The text analysis unit 117 analyzes whether the collected text begins with the specific character string, and stores a flag (referred to as flag C), which is "1" when it does not and "0" when it does, in the search index 133 in association with the information source ID of the collected text. A stored flag C of "1" indicates that the text data assigned the corresponding information source ID (that is, the text it represents) satisfies condition C, and a value of "0" indicates that the text data does not satisfy condition C. As described above, the text analysis unit 117 functions as third determination means that determines whether the collected text begins with the predetermined specific character string.
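A minimal sketch of the condition C determination follows. The prefixes "RT" and "@" come from the description above; the function name and the decision to ignore leading whitespace are assumptions.

```python
# Specific character strings named in the description.
SPECIFIC_PREFIXES = ("RT", "@")


def flag_c(text, prefixes=SPECIFIC_PREFIXES):
    """Return 0 if the text begins with a specific string, else 1.

    Leading whitespace is ignored (assumption, not stated in the source).
    """
    return 0 if text.lstrip().startswith(prefixes) else 1
```

Note that a plain prefix test also treats any text that merely happens to start with the letters "RT" as a quotation; a stricter check would be a refinement beyond what the description specifies.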

The search index 133 in the storage unit 130 stores, in association with each piece of text data, identifiers (flags) respectively representing the determination results of the first, second, and third determination means for that text data. That is, the storage unit 130 is identifier storage means that stores identifiers. Specifically, the search index 133 stores the flags A, B, and C stored by the morpheme number determination unit 114, the URL collation unit 116, and the text analysis unit 117, together with the corresponding information source IDs. By referring to the search index 133, the control unit 110 can therefore tell which conditions the text data assigned each information source ID does or does not satisfy.
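One possible in-memory shape for the search index is sketched below, purely for illustration. The record type, the dictionary keyed by information source ID, and the helper names are all assumptions; the source does not specify how the storage area is organized.

```python
from dataclasses import dataclass


@dataclass
class FlagRecord:
    """Flags A, B, C (each 0 or 1) for one piece of text data."""
    flag_a: int
    flag_b: int
    flag_c: int


def store_flags(index, source_id, a, b, c):
    """Store the three flags under the information source ID."""
    index[source_id] = FlagRecord(a, b, c)


def satisfied_conditions(index, source_id):
    """List the names of the conditions that the text data satisfies."""
    record = index[source_id]
    pairs = (("A", record.flag_a), ("B", record.flag_b), ("C", record.flag_c))
    return [name for name, value in pairs if value == 1]
```

Looking up an information source ID then tells the caller which of the conditions A, B, and C hold for the associated text data.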

The acquisition unit 118 is acquisition means that acquires the search query described above from a user terminal 30 operated by a user. The acquisition unit 118 supplies the acquired search query to the extraction unit 119. The extraction unit 119 is extraction means that extracts specific text data from the text data group 131, which is the text data collected by the content collection unit 111. The specific text data is text data that includes the search query acquired by the acquisition unit 118. The extraction unit 119 supplies the extracted text data to the calculation unit 120. The calculation unit 120 is calculation means that calculates a ranking score. Here, the ranking score is a value (evaluation value) representing the value of text data, used when evaluating the text data as a search target. The calculation unit 120 calculates the ranking score using the values ("1" or "0") of the flags stored in the storage unit 130 in association with the text data extracted by the extraction unit 119. The calculation unit 120 supplies the calculated evaluation value to the transmission unit 121. The transmission unit 121 is transmission means that transmits data to the user terminal 30; it transmits the text data extracted by the extraction unit 119 to the user terminal 30 together with the rank of the evaluation value calculated for that text data by the calculation unit 120.

The text search system 1 is configured as described above. With this configuration, the text search system 1 provides a text search service to the user. The processing that the text search device 10 performs in doing so is described below with reference to FIGS. 7 to 10.

FIG. 7 is a flowchart showing the procedure by which the control unit 110 of the text search device 10 stores flags in the search index 133, that is, the flag storage process. The flag storage process is performed at predetermined time intervals, for example every hour. First, the control unit 110 (content collection unit 111) collects text data as described above (step S11). The Web pages from which the control unit 110 collects text data are determined in advance by the provider of the text search service. The control unit 110 (content collection unit 111) then assigns an information source ID to each piece of collected text data (step S12).

Next, the control unit 110 executes the condition A determination process, the condition B determination process, and the condition C determination process, which determine conditions A, B, and C respectively (steps S20, S30, and S40). By executing these processes, the control unit 110 determines whether the collected text satisfies each of conditions A, B, and C. The control unit 110 then stores flags A, B, and C, which indicate the results, in the search index 133 and ends the flag storage process.

FIG. 8 is a flowchart showing the procedure performed by the control unit 110 in the condition A determination process. First, the control unit 110 (morphological analysis unit 112) performs the morphological analysis described above on the text represented by the text data to which an information source ID was assigned in step S12, dividing the text into morphemes (step S21). Next, the control unit 110 (morpheme counting unit 113) counts the resulting morphemes (step S22). The control unit 110 (morpheme count determination unit 114) then determines whether the counted number of morphemes is equal to or greater than a threshold (step S23). If the number of morphemes is equal to or greater than the threshold (step S23: YES), the control unit 110 sets flag A to “1” and stores flag A in the search index 133 in association with the information source ID assigned in step S12 (step S24). If the number of morphemes is less than the threshold (step S23: NO), the control unit 110 sets flag A to “0” and stores flag A in the search index 133 in association with the information source ID (step S25). The threshold is stored in the storage unit 130 in advance. After performing step S24 or S25, the control unit 110 ends the condition A determination process.
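
The condition A determination (steps S21 to S25) can be sketched in Python as follows. This is only an illustration under stated assumptions: `split_into_morphemes` is a hypothetical stand-in for the morphological analysis unit 112 (a real system would use a morphological analyzer suited to Japanese), and the threshold value and the sample text are made up.

```python
import re

def split_into_morphemes(text):
    # Hypothetical stand-in for morphological analysis (step S21):
    # each run of word characters is treated as one morpheme.
    return re.findall(r"\w+", text)

def flag_a(text, threshold):
    """Return 1 if the morpheme count is at or above the threshold, else 0."""
    morphemes = split_into_morphemes(text)   # step S21
    count = len(morphemes)                   # step S22
    return 1 if count >= threshold else 0    # steps S23 to S25

# The search index maps an information source ID to its flags.
search_index = {}
search_index.setdefault("ID001", {})["A"] = flag_a(
    "the new station cafe opens early on weekdays", threshold=5
)
```

The comparison uses `>=` because the source specifies "equal to or greater than" the threshold.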

FIG. 9 is a flowchart showing the procedure performed by the control unit 110 in the condition B determination process. First, the control unit 110 (URL extraction unit 115) extracts URLs from the text represented by the text data to which an information source ID was assigned in step S12 (step S31). Next, the control unit 110 (URL collation unit 116) collates the extracted URLs with the URL dictionary (step S32). The control unit 110 (URL collation unit 116) then determines, from the result of the collation, whether an extracted URL is included in the URL dictionary (step S33). If it determines in step S33 that the URL is included (YES), the control unit 110 sets flag B to “1” and stores flag B in the search index 133 in association with the information source ID assigned in step S12 (step S34). If it determines in step S33 that the URL is not included (NO), the control unit 110 sets flag B to “0” and stores flag B in the search index 133 in association with the information source ID (step S35). After performing step S34 or S35, the control unit 110 ends the condition B determination process.
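
A minimal Python sketch of the condition B determination (steps S31 to S35) is shown below. The dictionary entries, the URL regular expression, and the choice to treat dictionary entries as site prefixes are all assumptions made for illustration; the source only states that the extracted URL is collated with the URL dictionary.

```python
import re

# Hypothetical entries for the URL dictionary 132 (the source lists none).
# Assumption: entries are site prefixes, so any page under a listed site matches.
# A tuple is used because str.startswith accepts a tuple of prefixes.
URL_DICTIONARY = ("http://photos.example.com/", "http://news.example.com/")

# Assumed pattern for pulling URLs out of free text.
URL_PATTERN = re.compile(r"https?://\S+")

def flag_b(text, url_dictionary=URL_DICTIONARY):
    """Return 1 if any URL in the text matches the URL dictionary, else 0."""
    urls = URL_PATTERN.findall(text)                               # step S31
    matched = any(url.startswith(url_dictionary) for url in urls)  # steps S32, S33
    return 1 if matched else 0                                     # steps S34, S35
```

If the dictionary is meant to hold complete URLs rather than prefixes, the `startswith` test would be replaced by a plain membership test.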

FIG. 10 is a flowchart showing the procedure performed by the control unit 110 in the condition C determination process. First, the control unit 110 (text analysis unit 117) analyzes the text represented by the text data to which an information source ID was assigned in step S12 and determines whether the text begins with a specific character string (“RT” or “@”) (step S41). If it determines in step S41 that the text does not begin with a specific character string (NO), the control unit 110 sets flag C to “1” and stores flag C in the search index 133 in association with the information source ID assigned in step S12 (step S42). If it determines in step S41 that the text begins with a specific character string (YES), the control unit 110 sets flag C to “0” and stores flag C in the search index 133 in association with the information source ID (step S43). After performing step S42 or S43, the control unit 110 ends the condition C determination process.
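
The condition C determination (steps S41 to S43) reduces to a prefix test, sketched here in Python. The function and constant names are illustrative, not taken from the source.

```python
# Specific character strings named in the source for step S41.
SPECIFIC_PREFIXES = ("RT", "@")

def flag_c(text):
    """Return 0 if the text begins with "RT" or "@", else 1."""
    # Note the inversion: condition C is satisfied (flag 1) when the
    # text does NOT start with one of the specific character strings.
    return 0 if text.startswith(SPECIFIC_PREFIXES) else 1
```

For example, `flag_c("RT @user: hello")` gives 0, while `flag_c("Great ramen near the station")` gives 1.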

FIG. 11 is a table showing an example of the search index 133 after the processes of FIGS. 7 to 10 have been performed. In this example, the “information source ID” column lists “ID001”, “ID002”, and “ID003” from top to bottom. The values of “flag A”, “flag B”, and “flag C” are “1”, “1”, “0” in the “ID001” row, “1”, “0”, “1” in the “ID002” row, and “0”, “0”, “1” in the “ID003” row.

The storage unit 130 of the text search device 10 also stores a word index that associates each piece of text data with the words it contains. The control unit 110 may generate the word index using a known technique and store it in the storage unit 130. Using the search index 133 obtained as described above and the word index, the control unit 110 performs a process of searching for the content the user wants (hereinafter, the search process).

FIG. 12 is a sequence chart showing the procedure performed by the control unit 110 of the text search device 10 and the control unit 310 of the user terminal 30 in the search process. The following description treats the text search device 10 and the user terminal 30 as performing the processing, but the processing is actually carried out by their respective control units 110 and 310. The process starts when the user operates the operation unit 330 of the user terminal 30 to create text for searching for the desired content, that is, a search query. First, the user terminal 30 transmits the search query, generated according to the operation received by the operation unit 330, to the text search device 10 (step S51). The text search device 10 acquires the search query transmitted in step S51. Next, the text search device 10 extracts text data containing the acquired search query from the text data group 131 (step S52). Specifically, from the text data making up the text data group 131 shown in FIG. 5, the text search device 10 extracts those whose text contains the search query.
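
Step S52 can be illustrated with a simple substring filter in Python. The sample contents of the text data group 131 are invented, and a linear scan is used only for brevity; the source indicates that a word index is also available for searching.

```python
# Hypothetical contents of the text data group 131, keyed by information source ID.
text_data_group = {
    "ID001": "Tried the ramen shop by the station http://photos.example.com/1",
    "ID002": "The ramen festival this weekend was worth the long queue",
    "ID003": "Nice weather today",
}

def extract_matching(group, query):
    """Step S52: keep only the text data whose text contains the search query."""
    return {source_id: text for source_id, text in group.items() if query in text}

hits = extract_matching(text_data_group, "ramen")
```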

Next, the text search device 10 refers to the flags stored in the storage unit 130 in association with the information source ID assigned to each piece of extracted text data, and calculates a ranking score using the values (“1” or “0”) of these flags (step S53). The control unit 110 calculates the ranking score by the following equation (1).
Ranking score = α × flag A + β × flag B + γ × flag C ... (1)

As equation (1) shows, points are added to the ranking score for each flag that is “1”, that is, for each flag whose corresponding condition is satisfied. The coefficients α, β, and γ in equation (1) weight conditions A, B, and C. They are determined in advance by the provider of the text search service, and a larger value is given to a condition that the provider judges to correlate more strongly with the content the user wants. In this embodiment, α = 0.2, β = 0.3, and γ = 0.4. For example, if the text data with the information source IDs “ID001”, “ID002”, and “ID003” shown in FIG. 11 are extracted, their flag values (“1”, “1”, “0”; “1”, “0”, “1”; and “0”, “0”, “1”) give ranking scores of 0.5 (0.2 + 0.3), 0.6 (0.2 + 0.4), and 0.4 respectively.
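
The worked example for equation (1) can be reproduced with a few lines of Python. The coefficient values and the flag values come from the embodiment and FIG. 11; the function name, the tuple layout, and the rounding to two decimals (to hide floating-point noise) are choices made for this sketch.

```python
ALPHA, BETA, GAMMA = 0.2, 0.3, 0.4  # coefficients stated in the embodiment

def ranking_score(flags, alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Equation (1): weighted sum of flag A, flag B, and flag C."""
    flag_a, flag_b, flag_c = flags
    return alpha * flag_a + beta * flag_b + gamma * flag_c

# Search index contents from FIG. 11: (flag A, flag B, flag C) per ID.
search_index = {
    "ID001": (1, 1, 0),
    "ID002": (1, 0, 1),
    "ID003": (0, 0, 1),
}

scores = {sid: round(ranking_score(flags), 2) for sid, flags in search_index.items()}
```

After this runs, `scores` holds 0.5, 0.6, and 0.4 for “ID001”, “ID002”, and “ID003”, matching the values given in the text.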

As described above, the ranking score increases as more conditions are satisfied. Because each condition is defined so that text data satisfying it is likely to be of high value to the user, a higher ranking score indicates that the text data is more likely to be valuable to the user. If the text data is valuable, the content containing it is also valuable to the user. The user can therefore gauge the value of the text data and the content from the magnitude of the ranking score.

The text search device 10 ranks the text data extracted in step S52 in descending order of the ranking score calculated in step S53 (step S54). In the example above, the text search device 10 ranks “ID002” first, “ID001” second, and “ID003” third. The text search device 10 then transmits the extracted text data, together with the ranks assigned in step S54, to the user terminal 30 that sent the search query (step S55). The user terminal 30 displays the text data transmitted in step S55 on the display unit 340 in the order of their ranks, for example from the top down (step S56).
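
The ranking in step S54 is a descending sort on the ranking score, sketched below in Python with the scores from the example. How ties are broken is not specified in the source; this sketch simply keeps the original order for equal scores, which is what Python's stable sort does.

```python
# Ranking scores from the worked example (step S53).
scores = {"ID001": 0.5, "ID002": 0.6, "ID003": 0.4}

def rank_by_score(scores):
    """Step S54: return (rank, information source ID) pairs, highest score first."""
    ordered = sorted(scores, key=lambda sid: scores[sid], reverse=True)
    return [(rank, sid) for rank, sid in enumerate(ordered, start=1)]

ranking = rank_by_score(scores)
```

Here `ranking` places “ID002” first, “ID001” second, and “ID003” third, as in the text.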

With the text search device 10 and the user terminal 30 operating as described above, it becomes easier to distinguish, among the text data containing the search query, those that satisfy many of the defined conditions from those that do not. Adding points to the ranking score when condition A is satisfied, that is, when the number of morphemes is equal to or greater than the threshold, lowers the rank of short text data whose morpheme count falls below the threshold. Text with many morphemes is more likely to contain valuable content than text with few morphemes. In particular, text data with a limited number of characters, as on microblogs and SNSs, sometimes carries very little information, for example only a single remark. Lowering the rank of such text data makes text data that may be more valuable easier for the user to find.

Adding points to the ranking score when condition B is satisfied, that is, when the text contains a URL registered in the URL dictionary, raises the rank of content such as microblog posts that link to photos, videos, news, or similar content. Text data containing such a link can provide the linked content to the user, so it is likely to be more valuable to the user than text data whose text contains no URL. In particular, for text data with a limited number of characters as described above, such a link adds more value than it does for text data whose length is not limited. In this way, this embodiment makes text data that may be more valuable easier for the user to find.

Adding points to the ranking score when condition C is satisfied, that is, when the text does not begin with “RT” or “@”, lowers the rank of Retweets, which are text data on a microblog that quote the text of other text data, and of text data representing messages addressed to a specific person. Such text data is likely to repeat what is written in other text data or to be irrelevant to the user who is searching. If such text data ranks highly, the content the user really wants may be buried among it and become hard to find. The risk is particularly great when the collected text data includes many items in this specific format, in which a leading “RT” or “@” has the meaning described above. In such cases, this embodiment keeps text data that may be more valuable to the user from being obscured by text data that is not.

As described above, the information stored in the search index 133 of the storage unit 130, that is, the flags, serves as an index by which the user can evaluate the value of text data. By having the control unit 110 store the flags in the storage unit 130, information that indicates the value of text data can be accumulated. This information is accumulated using text data, so it can be accumulated without using multimedia data such as image data or video data.

[Modifications]
The embodiment described above is merely one example of implementing the present invention and may be modified as follows. The embodiment described above and the modifications below may also be combined as necessary.

(Modification 1)
In the embodiment described above, each flag has a value of “1” or “0”, but a flag may instead be another numerical value, or a symbol rather than a number. In short, a flag may take any form that lets the control unit 110 determine whether the corresponding condition is satisfied. When the flags are symbols, the control unit 110 calculates the ranking score using values that correspond to those symbols. For example, if each flag is “甲” when its condition is satisfied and “乙” when it is not, the control unit 110 calculates the ranking score treating “甲” as “1” and “乙” as “0”.
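
One way to picture Modification 1 is a lookup table from symbol to numeric value, applied before equation (1). The sketch below is illustrative only; the mapping table and function names are assumptions, and the coefficients are those of the embodiment.

```python
# Mapping from symbolic flags to the values used in equation (1).
SYMBOL_VALUES = {"甲": 1, "乙": 0}

def ranking_score_from_symbols(symbols, alpha=0.2, beta=0.3, gamma=0.4):
    """Convert symbolic flags A, B, C to 1/0 and apply equation (1)."""
    flag_a, flag_b, flag_c = (SYMBOL_VALUES[s] for s in symbols)
    return alpha * flag_a + beta * flag_b + gamma * flag_c
```

With the symbols (“甲”, “甲”, “乙”), this gives the same score as the numeric flags (1, 1, 0).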

(Modification 2)
In the embodiment described above, the control unit 110 calculates ranking scores, ranks the text data, and transmits the result to the user terminal 30. Instead, it may transmit the ranking scores as they are without ranking, or it may transmit the flags stored in the search index 133 without calculating ranking scores at all. In either case, the control unit 110 transmits the ranking score or the flags together with the text data to which the corresponding information source ID is assigned. The control unit 310 of the user terminal 30 then causes the display unit 340 to display each transmitted information source ID with its ranking score or flags. In the former case, the user can gauge the value of the content with a given information source ID by looking at its ranking score. In the latter case, a user who understands what condition each flag represents can likewise gauge the value of the content with that information source ID.

(Modification 3)
In the embodiment described above, the control unit 110 uses a URL to determine condition B, but it may instead use the IP address corresponding to that URL. In short, the control unit 110 need only treat condition B as satisfied, and store flag B as “1” in the search index 133, when the text data contains an address for accessing content. The address here is, for example, a URL or an IP address: a character string specified as the access destination when a browsing program such as a browser is run to access a Web page on which content is posted.

(Modification 4)
In the embodiment described above, the control unit 110 calculates the ranking score by adding together the value of each flag multiplied by its coefficient, but it may use another operation such as subtraction, multiplication, or division instead of addition. For example, the control unit 110 may multiply by the value of flag C. In this case, if flag C is “1”, the ranking score obtained from the values of flags A and B becomes the ranking score of the text data as it is. If flag C is “0”, the ranking score becomes “0” whatever the values of flags A and B. In short, the control unit 110 need only calculate the ranking score based on values obtained by multiplying the value of each flag associated with the text data by a predetermined coefficient.
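
The multiplicative use of flag C described in Modification 4 can be sketched as follows. The exact formula is an interpretation of the text (the weighted sum for flags A and B, multiplied by flag C), and the function name is hypothetical.

```python
def ranking_score_gated(flags, alpha=0.2, beta=0.3):
    """Score from flags A and B, multiplied by flag C as a gate."""
    flag_a, flag_b, flag_c = flags
    # When flag C is 0 the whole score collapses to 0;
    # when flag C is 1 the A/B score passes through unchanged.
    return (alpha * flag_a + beta * flag_b) * flag_c
```

Under this reading, a Retweet or addressed message (flag C of 0) always scores 0, however long it is and whatever links it contains.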

(Modification 5)
In the embodiment described above, the control unit 110 determines whether the collected text begins with a predetermined specific character string, but it may instead determine whether the specific character string appears at another position. For example, suppose some content marks a Retweet or a message addressed to a specific individual by placing a predetermined specific character string at a predetermined position other than the beginning of the text (for example, the end). In that case, the control unit 110 determines whether the specific character string appears at that position (the end). Even when the collected text data includes many items in such a format, this keeps text data that may be more valuable to the user from being obscured by text data that is not.
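
Modification 5 can be pictured as making the checked position a parameter. In the Python sketch below, the `position` parameter and the “(via)” marker used in the example are hypothetical; the source only says the position and the character string are predetermined.

```python
def has_marker(text, markers, position="start"):
    """Return True if any marker string appears at the given position of the text."""
    if position == "start":
        return text.startswith(tuple(markers))
    if position == "end":
        return text.endswith(tuple(markers))
    raise ValueError("position must be 'start' or 'end'")

def flag_c(text, markers=("RT", "@"), position="start"):
    """Return 0 if a marker is found at the position, else 1."""
    return 0 if has_marker(text, markers, position) else 1
```

For instance, `flag_c("nice photo (via)", markers=("(via)",), position="end")` gives 0 under the assumed end-of-text marker.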

(Modification 6)
In the embodiment described above, the control unit 110 determines whether the collected text contains a URL included in the URL dictionary 132. Instead of checking against predetermined URLs in this way, it may analyze the text data to determine whether the text data contains any URL. For example, the control unit 110 determines that the text data contains a URL if it contains the character string “http:”. In other words, the control unit 110 determines whether the text data contains an address (not a predetermined one) for accessing content. This makes it possible to raise the ranking score of text data containing a URL before that URL is registered in the URL dictionary 132, and also to raise the ranking score of text data containing a URL that is never registered in the URL dictionary 132.
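
The dictionary-free check in Modification 6 amounts to a substring test, sketched below. The function name and the `markers` parameter are assumptions; the only marker named in the source is “http:”.

```python
def flag_b_any_url(text, markers=("http:",)):
    """Return 1 if the text contains any of the URL marker strings, else 0."""
    return 1 if any(marker in text for marker in markers) else 0
```

Note that the literal string “http:” does not occur inside “https:”, so a deployment that wanted to count such links would need to pass an additional marker; that extension is not part of the source.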

(Modification 7)
The present invention can be understood as an information processing apparatus such as the text search device 10, as a control device such as the control unit 110 of the text search device 10, or as an information processing system such as the text search system 1 that includes them. It can also be understood as an information processing method for realizing them, or as a program for causing a computer to realize the functions of the control unit 110. Such a program may be provided in the form of a recording medium, such as an optical disc, on which it is stored, or may be provided by being downloaded to a computer via a network such as the Internet and installed for use.

(Modification 8)
The control unit 110 may store in the search index 133 a flag that is set to “1” when a condition other than conditions A, B, and C described above is satisfied. Examples of such conditions include one that is satisfied when the text contains a place name such as Tokyo or Osaka (or a specific place name), and one that is satisfied when the text contains a facility name such as a department store or station (or a specific facility name). In these cases, in addition to the URL dictionary described above, a place name dictionary in which place names are registered and a facility name dictionary in which facility names are registered are stored in the storage unit 130. The processing performed by the URL extraction unit 115 and the URL collation unit 116 of the control unit 110 is then carried out with place names or facility names in place of URLs. This raises the rank of text data containing place names or facility names.
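
A dictionary-based flag for place names or facility names might look like the following Python sketch. The dictionary contents reuse the examples named in the text; the substring matching and the function name are assumptions, since the source only says the URL processing is repeated with names in place of URLs.

```python
# Hypothetical dictionaries, seeded with the examples given in the text.
PLACE_NAMES = {"Tokyo", "Osaka"}
FACILITY_NAMES = {"department store", "station"}

def dictionary_flag(text, dictionary):
    """Return 1 if the text contains any entry of the dictionary, else 0."""
    return 1 if any(entry in text for entry in dictionary) else 0
```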

Another condition may be one that is satisfied when the number of characters in the profile of the user who posted the text data is equal to or greater than a threshold. In this case, when collecting text data, the control unit 110 also collects data representing the profile of the user who posted the text represented by that text data. When analyzing the collected text data, the control unit 110 counts the number of characters in the profile represented by the data collected with it, and stores in the search index 133 a flag that is set to “1” when the counted number of characters is equal to or greater than the threshold. The control unit 110 may count the characters using a known technique. This raises the rank of text data posted by users who have written a long profile (one whose character count is equal to or greater than the threshold).
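
The profile-length condition can be sketched in the same style as the other flags. The threshold value below is purely hypothetical; the source does not give one.

```python
PROFILE_LENGTH_THRESHOLD = 20  # hypothetical value, not from the source

def profile_flag(profile_text, threshold=PROFILE_LENGTH_THRESHOLD):
    """Return 1 if the poster's profile has at least `threshold` characters."""
    return 1 if len(profile_text) >= threshold else 0
```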

In short, the text search device 10 may use any condition, as long as content that satisfies the condition is more likely to be what the user wants than content that does not. Using such a condition raises the rank of text data representing content that is more likely to be valuable to the user, that is, content closer to what the user wants, compared with not using the condition.

(Modification 9)
In the embodiment described above, the control unit 110 calculates the ranking score using the predetermined weighting coefficients α, β, and γ, but it may instead use variable coefficients. For example, in place of the coefficient α, the control unit 110 may use a coefficient α₂ that increases with the number of morphemes used in determining condition A. In this case, in step S53 of FIG. 12, the control unit 110 calculates the ranking score according to equation (1) with, for example, α₂ = 0.2 when the number of morphemes is 1 to 10, α₂ = 0.3 when it is 11 to 20, α₂ = 0.4 when it is 21 to 30, and α₂ = 0.5 when it is 31 or more. As a result, text data whose text contains more morphemes receives a higher ranking score and tends to rank higher.
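
The stepwise coefficient α₂ can be written as a small lookup function, sketched here in Python. The band boundaries and values follow the example in the text; the return value for a count of zero is an assumption, since the source does not cover that case.

```python
def alpha2(morpheme_count):
    """Variable coefficient for condition A, growing with the morpheme count."""
    if morpheme_count >= 31:
        return 0.5
    if morpheme_count >= 21:
        return 0.4
    if morpheme_count >= 11:
        return 0.3
    if morpheme_count >= 1:
        return 0.2
    return 0.0  # assumption: the source gives no value for a count of 0

def ranking_score(flags, morpheme_count, beta=0.3, gamma=0.4):
    """Equation (1) with the fixed coefficient alpha replaced by alpha2."""
    flag_a, flag_b, flag_c = flags
    return alpha2(morpheme_count) * flag_a + beta * flag_b + gamma * flag_c
```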

Similarly, in place of the coefficient β, the control unit 110 may use a coefficient β₂ that increases with the number of extracted URLs used in determining condition B. In this case, for example, when collating the URLs with the URL dictionary 132 in step S32 of FIG. 9, the control unit 110 counts the extracted URLs that are included in the URL dictionary 132 and stores the result in the storage unit 130. Then, in step S53 of FIG. 12, it calculates the ranking score according to equation (1) with, for example, β₂ = 0.3 when the number of URLs is 1, β₂ = 0.4 when it is 2 to 3, and β₂ = 0.5 when it is 4 or more. As a result, text data whose text contains more URLs receives a higher ranking score and tends to rank higher.
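
The coefficient β₂ follows the same pattern, keyed on the number of matching URLs. As before, the bands come from the example in the text, and the value for a count of zero is an assumption (flag B would be “0” in that case, so the term vanishes either way).

```python
def beta2(url_count):
    """Variable coefficient for condition B, growing with the matched URL count."""
    if url_count >= 4:
        return 0.5
    if url_count >= 2:
        return 0.4
    if url_count == 1:
        return 0.3
    return 0.0  # assumption: no value is given for a count of 0
```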

DESCRIPTION OF SYMBOLS
1 ... text search system, 10 ... text search device, 20 ... content transmission server device, 30 ... user terminal, 40 ... network, 110, 210, 310 ... control unit, 130, 230, 350 ... storage unit, 140, 220, 320 ... communication unit, 330 ... operation unit, 340 ... display unit, 111 ... content collection unit, 112 ... morphological analysis unit, 113 ... morpheme counting unit, 114 ... morpheme count determination unit, 115 ... URL extraction unit, 116 ... URL collation unit, 117 ... text analysis unit, 118 ... acquisition unit, 119 ... extraction unit, 120 ... calculation unit, 121 ... transmission unit, 131 ... text data group, 132 ... URL dictionary, 133 ... search index

Claims

An information processing apparatus comprising:
collection means for collecting text data included in content;
decomposition means for decomposing the text data collected by the collection means into morphemes;
counting means for counting the number of morphemes obtained by the decomposition means;
first determination means for determining whether the number of morphemes counted by the counting means is greater than or equal to a threshold;
second determination means for determining whether the text data collected by the collection means includes an address for accessing content;
third determination means for determining whether a predetermined specific character string is included at a predetermined position in the text data collected by the collection means; and
identifier storage means for storing, in association with the text data, identifiers respectively representing the determination result of the first determination means, the determination result of the second determination means, and the determination result of the third determination means for the text data.
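As an illustration only, the three determinations and the storage of their results could be sketched in Python as follows. Everything concrete here is an assumption: whitespace splitting stands in for a real morphological analyzer, the threshold of 5 is arbitrary, a regular expression stands in for address detection, and the prefixes `"@"` and `"RT "` at the start of the text are hypothetical examples of a specific character string at a predetermined position.

```python
import re
from dataclasses import dataclass
from typing import Dict, List

URL_PATTERN = re.compile(r"https?://\S+")  # stand-in for address detection
MORPHEME_THRESHOLD = 5                     # hypothetical threshold
MARKER_PREFIXES = ("@", "RT ")             # hypothetical marker strings


@dataclass(frozen=True)
class Flags:
    enough_morphemes: int  # result of the first determination (1 or 0)
    has_address: int       # result of the second determination (1 or 0)
    has_marker: int        # result of the third determination (1 or 0)


def split_into_morphemes(text: str) -> List[str]:
    # Stand-in only: a real system would use a morphological analyzer.
    return text.split()


def determine_flags(text: str) -> Flags:
    """Run the three determinations on one piece of text data."""
    morphemes = split_into_morphemes(text)
    return Flags(
        enough_morphemes=int(len(morphemes) >= MORPHEME_THRESHOLD),
        has_address=int(URL_PATTERN.search(text) is not None),
        has_marker=int(text.startswith(MARKER_PREFIXES)),
    )


# Identifier storage: flags kept in association with the text data.
flag_store: Dict[str, Flags] = {}


def store_flags(text: str) -> Flags:
    flags = determine_flags(text)
    flag_store[text] = flags
    return flags
```

Keying the store by the text itself keeps the sketch short; a practical system would more likely key it by a document identifier.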

The information processing apparatus according to claim 1, further comprising:
acquisition means for acquiring a search query from a communication device operated by a user;
extraction means for extracting, from the text data collected by the collection means, text data that includes the search query acquired by the acquisition means;
calculation means for calculating an evaluation value for evaluating the text data as a search target, using values corresponding to the identifiers stored in the identifier storage means in association with the text data extracted by the extraction means; and
transmission means for transmitting the text data extracted by the extraction means to the communication device together with the evaluation value calculated by the calculation means for that text data or the rank of that evaluation value.
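The extraction, calculation, and ranking flow can be sketched in Python as below. This is a hedged illustration rather than the claimed apparatus: the function names, the substring match used for extraction, the coefficient values, and the weighted-sum evaluation are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

FlagTriple = Tuple[int, int, int]

# Hypothetical coefficients for the three stored flags.
COEFFICIENTS = (0.4, 0.4, 0.2)


def evaluation_value(flags: FlagTriple) -> float:
    """Assumed evaluation: each flag value multiplied by its coefficient, then summed."""
    return sum(c * f for c, f in zip(COEFFICIENTS, flags))


def search(query: str,
           flag_store: Dict[str, FlagTriple]) -> List[Tuple[int, float, str]]:
    """Extract text data containing the query and return (rank, value, text)
    tuples in descending order of evaluation value."""
    hits = [(evaluation_value(flags), text)
            for text, flags in flag_store.items()
            if query in text]
    # Stable sort: texts with equal values keep their stored order.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return [(rank, value, text)
            for rank, (value, text) in enumerate(hits, start=1)]
```

Returning both the rank and the evaluation value lets the receiving side display the results in either form.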

The information processing apparatus according to claim 2, wherein the calculation means calculates the evaluation value based on values obtained by multiplying each of the values corresponding to the identifiers associated with the text data by a predetermined coefficient.

The information processing apparatus according to any one of claims 1 to 3, wherein the specific character string indicates that the text data was sent to a specific recipient, or that the text data quotes text data different from itself.

An information processing method executed in an information processing terminal, the method comprising:
a collection step of collecting text data included in content;
a decomposition step of decomposing the text data collected in the collection step into morphemes;
a counting step of counting the number of morphemes obtained in the decomposition step;
a first determination step of determining whether the number of morphemes counted in the counting step is greater than or equal to a threshold;
a second determination step of determining whether the text data collected in the collection step includes an address for accessing content;
a third determination step of determining whether a predetermined specific character string is included at a predetermined position in the text data collected in the collection step; and
an identifier storage step of storing, in association with the text data, identifiers respectively representing the determination result of the first determination step, the determination result of the second determination step, and the determination result of the third determination step for the text data.