JP7043373B2

JP7043373B2 - Information processing equipment, information processing methods, and programs

Info

Publication number: JP7043373B2
Application number: JP2018173193A
Authority: JP
Inventors: ニレシュデワンガン; 陸富樫; 一雄山下; 尚方四熊
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-09-18
Filing date: 2018-09-18
Publication date: 2022-03-29
Anticipated expiration: 2038-09-18
Also published as: JP2020046792A

Description

The present invention relates to an information processing device, an information processing method, and a program.

Conventionally, a technique is known for extracting, from a document, anchor text with which the URL (Uniform Resource Locator) of a Web page is associated as a hyperlink (see, for example, Patent Document 1).

Japanese Unexamined Patent Publication No. 2004-78446

However, because the conventional technique extracts from a document only character strings that satisfy a predetermined condition, treating them as character strings such as anchor text, it may fail to extract new words, coined words, buzzwords, and the like that do not satisfy the current condition. As a result, even when users are interested in character strings such as new words, coined words, or buzzwords, related content may not be associated with those character strings as a hyperlink.

The present invention has been made in view of the above problem, and its object is to provide an information processing device, an information processing method, and a program capable of accurately extracting, from a document, a character string that can be associated with related content.

One aspect of the present invention is an information processing device including: an acquisition unit that acquires a document containing a plurality of characters; and an extraction unit that extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting characters included in the document to the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string.

According to one aspect of the present invention, a character string that can be associated with related content can be accurately extracted from a document.

A diagram showing an example of the information processing system 1 including the information processing device 100 in the first embodiment.
A diagram showing an example of a web page provided by the service providing device 20.
A diagram for explaining a method of tagging characters.
A diagram showing an example of the configuration of the information processing device 100 in the first embodiment.
A flowchart showing the flow of a series of processes during operation by the information processing device 100 in the first embodiment.
A diagram showing an example of the classifier MDL in the first embodiment.
A diagram showing an example of a conditional random field model.
A diagram showing another example of a web page provided by the service providing device 20.
A flowchart showing the flow of a series of processes during learning by the information processing device 100 in the first embodiment.
A diagram showing an example of the language model MDL2.
A diagram showing an example of the configuration of the information processing device 100A in the third embodiment.
A diagram showing an example of the hardware configuration of the information processing devices 100 and 100A of the embodiments.

Hereinafter, an information processing device, an information processing method, and a program to which the present invention is applied will be described with reference to the drawings.

[Overview]
The information processing device is realized by one or more processors. The information processing device acquires a document of interest containing a plurality of characters (hereinafter referred to as the document) and inputs the characters included in the acquired document to a classifier.

The classifier has been trained to classify an input character as either a first character that is included in a character string satisfying a predetermined condition or a second character that is not included in that character string. A character string satisfying a predetermined condition is, for example, anchor text (an anchor character string) with which the URL (Uniform Resource Locator) of another web page or the like is associated as a hyperlink in a web page written in HTML (HyperText Markup Language); specifically, it is a character string enclosed by symbols such as <a> and </a>. The character string satisfying a predetermined condition is not limited to anchor text and may be, for example, a character string representing a named entity extracted from a document by a technique called named entity extraction. Named entities are, for example, organization names, person names, place names, date expressions, time expressions, monetary expressions, percentage expressions, and artifact names.

The information processing device inputs the characters included in the document to the classifier and, based on the resulting output of the classifier, extracts from the document a character string that is highly likely to become anchor text or the like. By using a classifier trained in advance to classify whether or not an input character is included in a tagged character string, a character string that can be associated with related content can be accurately extracted from the document.

<First Embodiment>
[Overall configuration]
FIG. 1 is a diagram showing an example of an information processing system 1 including an information processing device 100 according to the first embodiment. The information processing system 1 in the first embodiment includes, for example, one or more terminal devices 10, a service providing device 20, and the information processing device 100. These devices are connected via a network NW. Some of these devices may be included in another device as virtual devices; for example, some or all of the functions of the service providing device 20 may be a virtual machine realized by the functions of the information processing device 100.

Each device shown in FIG. 1 transmits and receives various kinds of information via the network NW. The network NW includes, for example, the Internet, a WAN (Wide Area Network), a LAN (Local Area Network), a provider terminal, a wireless communication network, a wireless base station, and a dedicated line. Not every combination of the devices shown in FIG. 1 needs to be able to communicate with each other, and the network NW may partly include a local network.

The terminal device 10 is a terminal device that includes an input device, a display device, a communication device, a storage device, and an arithmetic device, for example, a mobile phone such as a smartphone, a tablet terminal, or any of various personal computers. The communication device includes a network card such as a NIC (Network Interface Card), a wireless communication module, and the like. On the terminal device 10, a UA (User Agent) such as a web browser or an application program is started and transmits a request corresponding to the user's input to the service providing device 20. The terminal device 10 on which the UA has been started displays various images on the display device based on information acquired from the service providing device 20.

The service providing device 20 may be, for example, a web server that provides a web page to the terminal device 10 in response to a request (for example, an HTTP (Hypertext Transfer Protocol) request) from a web browser started as the UA. The web page may be, for example, a web page on the Internet that delivers content containing a document (text), such as a news article. The service providing device 20 may also be an application server that provides a service similar to the above web page by providing content to the terminal device 10 in response to a request from an application started as the UA.

FIG. 2 is a diagram showing an example of a web page provided by the service providing device 20. The illustrated example schematically shows a web page on which a news article is posted. The text of a news article often contains characters with which the URL of another web page is associated as a hyperlink. In the illustrated example, the character string "Typhoon No. 10" is shown as a hyperlink. By selecting a character string associated with such a hyperlink, for example by clicking it, the user can access a web page containing content related to that character string.

There are several methods for extracting, from a document, character strings (anchor text) that are candidates for hyperlinks, and one of them is the named entity extraction described above. Named entity extraction extracts predetermined named entities from a document by attaching one of several kinds of tags to each character in the document according to the "BIO" scheme or the "BILOU" scheme. The "BIO" scheme assigns to each character one of the tags "B" (the initial of "Begin"), "I" (the initial of "Inside"), and "O" (the initial of "Outside"). The "B" tag is a label that identifies the first character of a named entity, the "I" tag is a label that identifies a character inside a named entity, and the "O" tag is a label that identifies a character outside a named entity (a character that is not part of a named entity). The "BILOU" scheme assigns to each character one of a plurality of tags that includes, in addition to the "B", "I", and "O" tags above, the tag "L" (the initial of "Last") and the tag "U" (the initial of "Unit"). The "L" tag is a label that identifies the last character of a named entity. The "U" tag is a label that identifies a character string whose length equals the unit length of a named entity; for example, the "U" tag is assigned when the named entity is a single character (when none of the "B", "I", and "L" tags applies). The "BILOU" scheme is also called the "BIOES" scheme. The "BIOES" scheme assigns to each character one of a plurality of tags that includes, in addition to the "B", "I", and "O" tags above, the tag "E" (the initial of "End") and the tag "S" (the initial of "Single"). The "E" tag is the same label as the "L" tag, and the "S" tag is the same label as the "U" tag.

FIG. 3 is a diagram for explaining a method of tagging characters. As in the illustrated example, assume that the document contains the single sentence "検索太郎は○○〇株式会社の社長です。" ("Kensaku Taro is the president of ○○○ Co., Ltd."). In this case, in named entity extraction, the character string "検索太郎" (Kensaku Taro) is classified into the person name class of named entities, the character string "○○〇株式会社" (○○○ Co., Ltd.) is classified into the organization name class, and the character string "社長" (president) is classified into the artifact name class. The remaining characters, such as the particles, auxiliary verbs, and punctuation marks "は", "の", "です", and "。", are not classified into any named entity class. In such a case, as in the illustrated example, among the characters in the character string "検索太郎", the first character "検" is given the "B" tag, the characters "索" and "太" that follow "検" are given the "I" tag, and the last character "郎" is given the "L" tag. Characters such as "は" and "の", on the other hand, are given the "O" tag because they are not part of a named entity.
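The per-character tagging of FIG. 3 can be illustrated with a minimal Python sketch. The function name `bilou_tags` and the span offsets are hypothetical and introduced only for illustration; the sketch assumes the positions of the character strings to be tagged are already known, which is how labeled training data might be prepared rather than how the classifier itself works.

```python
from typing import List, Tuple


def bilou_tags(text: str, spans: List[Tuple[int, int]]) -> List[str]:
    """Assign one BILOU tag to every character of text.

    spans holds (start, end) character offsets, end exclusive, of the
    character strings to be tagged. The offsets are assumed to be given.
    """
    tags = ["O"] * len(text)              # every character starts as outside
    for start, end in spans:
        if end - start == 1:
            tags[start] = "U"             # single-character string
            continue
        tags[start] = "B"                 # first character
        for i in range(start + 1, end - 1):
            tags[i] = "I"                 # inner characters
        tags[end - 1] = "L"               # last character
    return tags


sentence = "検索太郎は○○〇株式会社の社長です。"
# Hypothetical offsets for "検索太郎", "○○〇株式会社", and "社長".
tags = bilou_tags(sentence, [(0, 4), (5, 12), (13, 15)])
for ch, tag in zip(sentence, tags):
    print(ch, tag)
```

Replacing "L" with "E" and "U" with "S" in the output would give the equivalent BIOES tags.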

A hyperlink is generally associated with a character string extracted as a named entity. That is, a hyperlink can be associated with a character string that starts with a character given the "B" tag and ends with a character given the "L" tag. A hyperlink may also be associated with a character string that matches a predetermined proper noun.
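As a rough illustration of how a character string running from a "B" tag to an "L" tag could be recovered from per-character tags, the following Python sketch may help. The function name `extract_strings` is hypothetical, and the handling of malformed tag sequences (dropping an unfinished string when any tag other than "I" interrupts it) is an assumption, not something the text specifies.

```python
from typing import List, Optional


def extract_strings(chars: str, tags: List[str]) -> List[str]:
    """Collect substrings that start at a "B" tag and end at an "L" tag."""
    found: List[str] = []
    start: Optional[int] = None           # index of the open "B", if any
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                     # open a new candidate string
        elif tag == "L" and start is not None:
            found.append(chars[start:i + 1])
            start = None                  # close the candidate
        elif tag != "I":
            start = None                  # assumption: "O" or "U" cancels it
    return found


candidates = extract_strings(
    "検索太郎は社長", ["B", "I", "I", "L", "O", "B", "L"]
)
print(candidates)
```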

A character string with which a hyperlink is associated is an example of a "character string satisfying a predetermined condition", and a character string that matches a predetermined proper noun is another example of a "character string satisfying a predetermined condition". A character given the "B" tag is an example of a "first character", and a character given the "L" tag or the "I" tag is another example of a "first character". A character given the "O" tag or the "U" tag is an example of a "second character". Further, a character given the "B" tag is an example of a "third character", a character given the "L" tag is an example of a "fourth character", and a character given the "I" tag is an example of a "fifth character".

The information processing device 100 receives a document (for example, a news article) provided as content by the service providing device 20 via a web page or the like, and extracts from the received document a character string with which a hyperlink is likely to be associated and which is highly likely to become anchor text or the like. After extracting a character string from the document, the information processing device 100 transmits the extracted character string, or a document containing the extracted character string, to the service providing device 20. The service providing device 20 then associates a URL or the like as a hyperlink with the character string extracted by the information processing device 100 and provides the terminal device 10 with a web page or the like whose content includes the document containing that character string. The characters of the document received from the service providing device 20 do not need to be given tags identifying named entities in advance, although they may be.

[Configuration of the information processing device]
FIG. 4 is a diagram showing an example of the configuration of the information processing device 100 according to the first embodiment. As illustrated, the information processing device 100 includes, for example, a communication unit 102, a control unit 110, and a storage unit 130.

The communication unit 102 includes, for example, a communication interface such as a NIC. The communication unit 102 communicates with the service providing device 20 and other devices via the network NW. For example, the communication unit 102 communicates with the service providing device 20 and receives a document provided as content.

The control unit 110 includes, for example, an acquisition unit 112, a character string extraction unit 114, a communication control unit 116, and a learning unit 118. The components of the control unit 110 are realized, for example, by a processor such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit) executing a program stored in the storage unit 130. Some or all of the components of the control unit 110 may be realized by hardware (circuitry) such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), or an FPGA (Field-Programmable Gate Array), or by cooperation between software and hardware. The program referred to by the processor may be stored in the storage unit 130 in advance, or may be stored in a removable storage medium such as a DVD or CD-ROM and installed from the storage medium into the storage unit 130 when the storage medium is mounted in a drive device of the information processing device 100.

The storage unit 130 is realized by a storage device such as an HDD (Hard Disc Drive), a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), a ROM (Read Only Memory), or a RAM (Random Access Memory). In addition to various programs such as firmware and application programs, the storage unit 130 stores classifier information 132, conditional random field model information 134, teacher data 136, and the like. These are described in detail later.

[Processing flow during operation (runtime)]
The flow of a series of processes performed by the information processing device 100 during operation in the first embodiment is described below with reference to a flowchart. "During operation" refers to the state in which an already trained classifier MDL is used. FIG. 5 is a flowchart showing the flow of a series of processes during operation by the information processing device 100 according to the first embodiment. The processing of this flowchart may be repeated, for example, at a predetermined cycle.

First, the acquisition unit 112 causes the communication unit 102 to communicate with the service providing device 20 and acquires, from the service providing device 20, a document provided as content via a web page or the like (S100). The document acquired from the service providing device 20 is assumed to contain at least one sentence delimited by punctuation such as a full stop, comma, question mark, or exclamation mark, and may contain paragraphs or larger units of text. The document acquired from the service providing device 20 does not need to be tokenized (split into words), and may contain Japanese full stops and commas, full stops, commas, periods, question marks, exclamation marks, ellipses, brackets, symbols, spaces, line breaks, and the like. The language of the document acquired from the service providing device 20 may be Japanese or another language such as English, Chinese, or German.

Next, the character string extraction unit 114 refers to the classifier information 132 to construct (generate) the classifier MDL, and inputs a fixed number of characters, taken from the plurality of characters C included in the document acquired by the acquisition unit 112, to the constructed classifier MDL (S102). For example, when the character string to be extracted is five characters long, the number of characters input to the classifier MDL at one time may be five or more.
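The text does not say how the fixed number of characters is selected from the document. One possible reading, shown here purely as an assumption, is a sliding window over the characters; the function name `char_windows` and the `step` parameter are hypothetical.

```python
from typing import List


def char_windows(document: str, size: int, step: int = 1) -> List[str]:
    """Cut a document into runs of `size` consecutive characters.

    Sliding the window by `step` characters is an assumption; the source
    only says that a fixed number of characters is input at one time.
    A document shorter than `size` yields a single, shorter window.
    """
    last_start = max(len(document) - size, 0)
    return [document[i:i + size] for i in range(0, last_start + 1, step)]


windows = char_windows("abcdefg", 5)
print(windows)
```

Each window would then be passed to the classifier MDL in place of the whole document.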

The classifier information 132 is information (a program or a data structure) that defines the classifier MDL. The classifier MDL is a learning model for solving the problem known as sequence labeling, in which tags are assigned to characters one after another, and is realized by any of various neural networks such as a recurrent neural network (RNN).

The classifier information 132 includes, for example, connection information describing how the neurons (units) in the input layer, the one or more hidden layers (intermediate layers), and the output layer of each neural network are connected to one another, and weight information describing the coupling coefficients applied to the data passed between connected neurons. The connection information includes, for example, the number of neurons in each layer, information specifying the type of neuron to which each neuron is connected, the activation function that realizes each neuron, and the gates provided between neurons in the hidden layers. The activation function that realizes a neuron may be, for example, a function that switches its behavior according to the sign of the input (a ReLU function or an ELU function), a sigmoid function, a step function, a hyperbolic tangent function, or an identity function. A gate selectively passes or weights the data transmitted between neurons according to, for example, the value returned by the activation function (for example, 1 or 0). The coupling coefficient is a parameter of the activation function and includes, for example, the weight applied to output data when data is output from a neuron in one layer to a neuron in a deeper layer within the hidden layers of the neural network. The coupling coefficient may also include a bias component specific to each layer.
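For readers unfamiliar with the activation functions listed above, the following Python sketch gives their standard textbook definitions together with a single neuron that applies coupling coefficients (weights) and a bias. The function names, the `alpha` parameter of the ELU, and the convention that the step function returns 0 at an input of 0 are assumptions for illustration and are not taken from the source.

```python
import math
from typing import Callable, Sequence


def relu(x: float) -> float:
    return x if x > 0 else 0.0            # passes positive input, blocks the rest


def elu(x: float, alpha: float = 1.0) -> float:
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def step(x: float) -> float:
    return 1.0 if x > 0 else 0.0          # assumption: 0 at x == 0


def identity(x: float) -> float:
    return x


def neuron(inputs: Sequence[float], weights: Sequence[float],
           bias: float, activation: Callable[[float], float]) -> float:
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    total = sum(w * v for w, v in zip(weights, inputs)) + bias
    return activation(total)


print(neuron([1.0, 2.0], [0.5, -1.0], 0.25, relu))        # weighted sum is -1.25
print(neuron([1.0, 2.0], [0.5, -1.0], 0.25, math.tanh))   # hyperbolic tangent
```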

FIG. 6 is a diagram showing an example of the classifier MDL in the first embodiment. The classifier MDL in the first embodiment is realized by, for example, an embedding layer 200, a feature extraction layer 300, and a CRF (conditional random field) layer 400. The combination of the CRF layer 400 and the character string extraction unit 114 is an example of the "extraction unit".

The embedding layer 200 generates a vector of a certain number of dimensions (hereinafter referred to as a character vector x) from each input character C. For example, the embedding layer 200 generates a character vector x_k from an input character C_k by a technique called one-hot representation, in which the element corresponding to the target character is set to 1 and the elements corresponding to all other characters are set to 0. The embedding layer 200 may instead generate the character vector x_k from the input character C_k by a technique called distributed representation, which vectorizes a character based on its co-occurrence with the characters that appear before and after it.
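The one-hot representation described above can be sketched in a few lines of Python. The helper names `build_vocab` and `one_hot`, and the choice to build the vocabulary from a sample string, are hypothetical; the source does not describe how the set of known characters is obtained.

```python
from typing import Dict, List


def build_vocab(text: str) -> Dict[str, int]:
    """Map every distinct character to an index (assumed vocabulary source)."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot(ch: str, vocab: Dict[str, int]) -> List[int]:
    """Return a vector with 1 at the character's index and 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab[ch]] = 1
    return vec


vocab = build_vocab("検索太郎は社長です")
x = [one_hot(ch, vocab) for ch in "検索太郎"]   # character vectors x_1 to x_4
for ch, vec in zip("検索太郎", x):
    print(ch, vec)
```

A distributed representation would replace these sparse vectors with dense learned vectors of lower dimension.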

In the following description, as an example, four characters arranged in the order C_1, C_2, C_3, C_4, with C_1 first, are input to the embedding layer 200. When the four characters C_1 to C_4 are input to the embedding layer 200, four character vectors x_1 to x_4 are generated. The character vectors x generated by the embedding layer 200 are output to the feature extraction layer 300.

The feature extraction layer 300 extracts features from the character vectors x output by the embedding layer 200. For example, the feature extraction layer 300 is realized by a BiLSTM (Bidirectional Long Short-Term Memory) layer 310 and an attention mechanism 330.

The BiLSTM layer 310 is an RNN that includes a bidirectional LSTM, which predicts the characters that appear after an input character and also predicts the characters that appear before it. In other words, the BiLSTM layer 310 predicts future outputs from the inputs up to the present and predicts past outputs from future inputs. The BiLSTM layer 310 includes LSTMs 320(1) to 320(8). LSTMs 320(1) to 320(8) are not distinct LSTMs 320 but the same LSTM 320 at different processing cycles t. The numbers in parentheses correspond to the processing cycle t.

Each LSTM 320 outputs, based on Formula (1), a feature vector h representing the feature values extracted from each character vector x. Formula (1) represents an example of the calculation performed in each LSTM 320.

t denotes the processing cycle (processing time) of the recursive processing performed repeatedly by the BiLSTM layer 310, and x_t denotes the character vector input from the embedding layer 200 in processing cycle t. For example, in the first cycle t_1, the character vector x_1 corresponding to the first character C_1 is input to LSTM 320(1) and LSTM 320(8). In the next cycle t_2, the character vector x_2 corresponding to the character C_2 that follows C_1 is input to LSTM 320(2) and LSTM 320(7). In the next cycle t_3, the character vector x_3 corresponding to the character C_3 that follows C_2 is input to LSTM 320(3) and LSTM 320(6). In the next cycle t_4, the character vector x_4 corresponding to the character C_4 that follows C_3 is input to LSTM 320(4) and LSTM 320(5).

h_t denotes the feature vector output by the LSTM 320 in processing cycle t. Recursive processing means using the feature vector obtained in a past processing cycle to derive the feature vector in the current processing cycle. The feature vector h_t is derived as the sum of two terms: the convolution value of the weight (1 - z_t), which is based on the vector z_t output by the input gate, with the feature vector h_{t-1}; and the convolution value of the vector z_t with the feature vector h_t(~).

The feature vector h_1 of LSTM 320(1) is output to LSTM 320(2), the feature vector h_2 of LSTM 320(2) is output to LSTM 320(3), and the feature vector h_3 of LSTM 320(3) is output to LSTM 320(4). Likewise, the feature vector h_5 of LSTM 320(5) is output to LSTM 320(6), the feature vector h_6 of LSTM 320(6) is output to LSTM 320(7), and the feature vector h_7 of LSTM 320(7) is output to LSTM 320(8).

z_t denotes the vector output by the input gate included in the LSTM, σ indicates that the activation function of the gate is a sigmoid function, and W_z denotes the weight for linearly transforming the feature vector h_{t-1} of the preceding LSTM 320 and the character vector x_t.

r_t denotes the vector output by the forget gate included in the LSTM, and W_f, like the weight W_i, denotes the weight for linearly transforming the feature vector h_{t-1} of the preceding LSTM 320 and the character vector x_t.

h_t(~) denotes a feature vector derived temporarily for internal computation. The feature vector h_t(~) is derived by evaluating the hyperbolic tangent function tanh, whose argument is the product of a weight W with the convolution value of the vector r_t output by the forget gate and the feature vector h_{t-1}, taken together with the character vector x_t.
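The image of Formula (1) is not reproduced in this text. Based solely on the term-by-term description above, it can plausibly be reconstructed as follows. This is a sketch under stated assumptions, not the published formula: the bracket [a, b] is assumed to denote joining two vectors before the linear transformation, the operator * stands for the operation the text calls a convolution, and \tilde{h}_t corresponds to h_t(~). The forget-gate weight is written W_f as in the text; the weight the text calls W_i is assumed to be the input-gate weight written W_z.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z \cdot [h_{t-1},\, x_t]\right) \\
r_t &= \sigma\left(W_f \cdot [h_{t-1},\, x_t]\right) \\
\tilde{h}_t &= \tanh\left(W \cdot [r_t * h_{t-1},\, x_t]\right) \\
h_t &= (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t
\end{aligned}
```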

The BiLSTM layer 310 combines the two feature vectors h output by the two LSTMs 320 that received the same character vector x, and generates and outputs a single feature vector h. The elements of this feature vector h represent, as numerical values (scores), the likelihood of each tag that may be attached to each character C. Hereinafter, the feature vector h generated by combining the two feature vectors h is referred to as a "tag score vector h". The tag score vector h is an example of a "feature value".

For example, the BiLSTM layer 310 combines the feature vectors h_t output by the front-stage LSTM 320(1) and the rear-stage LSTM 320(8), both of which received the character vector x_1, and outputs the tag score vector h_1.

Similarly, the BiLSTM layer 310 combines the feature vectors h_t output by the front-stage LSTM 320(2) and the rear-stage LSTM 320(7), both of which received the character vector x_2, and outputs the tag score vector h_2.

The BiLSTM layer 310 also combines the feature vectors h_t output by the front-stage LSTM 320(3) and the rear-stage LSTM 320(6), both of which received the character vector x_3, and outputs the tag score vector h_3.

Finally, the BiLSTM layer 310 combines the feature vectors h_t output by the front-stage LSTM 320(4) and the rear-stage LSTM 320(5), both of which received the character vector x_4, and outputs the tag score vector h_4.
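The pairing of front-stage and rear-stage outputs can be illustrated with a toy sketch. For brevity it assumes scalar inputs and scalar states with fixed, made-up weights, and it represents the "combined" feature as a (forward, backward) pair; the function names `cell` and `bidirectional` are hypothetical. The recurrent step follows the prose description of Formula (1) and is not a trained model.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def cell(h_prev, x_t, w_z=(0.5, 0.5), w_f=(0.5, 0.5), w=(0.5, 0.5)):
    """One recurrent step on scalars, following the prose around Formula (1)."""
    z_t = sigmoid(w_z[0] * h_prev + w_z[1] * x_t)            # input gate output
    r_t = sigmoid(w_f[0] * h_prev + w_f[1] * x_t)            # forget gate output
    h_tilde = math.tanh(w[0] * (r_t * h_prev) + w[1] * x_t)  # temporary feature h_t(~)
    return (1.0 - z_t) * h_prev + z_t * h_tilde


def bidirectional(xs):
    """Run the same cell front-to-back and back-to-front, then pair per position."""
    forward, h = [], 0.0
    for x in xs:                 # cycles (1)..(4): x_1, x_2, x_3, x_4
        h = cell(h, x)
        forward.append(h)
    backward, h = [], 0.0
    for x in reversed(xs):       # cycles (5)..(8): x_4, x_3, x_2, x_1
        h = cell(h, x)
        backward.append(h)
    backward.reverse()           # realign so index k refers to character C_(k+1)
    return list(zip(forward, backward))


pairs = bidirectional([1.0, 0.0, -1.0, 0.5])
for k, (f, b) in enumerate(pairs, start=1):
    print(f"h_{k}: forward={f:.3f}, backward={b:.3f}")
```

With four inputs, position 1 pairs cycle (1) with cycle (8) and position 4 pairs cycle (4) with cycle (5), matching the pairing described above.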

The attention mechanism 330 is a neural network that changes how much attention is paid to each character in a character string according to the position at which that character appears in the string. For example, the attention mechanism 330 multiplies each of the tag score vectors h_1 to h_4 output by the BiLSTM layer 310 by a different weight α, thereby varying the degree of attention for each character C. Formula (2) represents an example of the calculation performed in the attention mechanism 330.

M is the vector derived by evaluating the hyperbolic tangent function tanh with the tag score vector h_t as its argument; it is a nonlinear transformation of the tag score vector h_t. The weight α is derived by a softmax function. w^T is the transpose of the weight vector by which the vector M is multiplied.

For example, if h_k is the tag score vector output by the BiLSTM layer 310 when a character C_k is input, the weight α_k corresponding to that tag score vector h_k is derived by the softmax function shown in Formula (3).

When the total number of tag score vectors h output by the BiLSTM layer 310 is n, the weight α_k is, as shown in Formula (3), the exponential output exp(w^T M_k) for the target tag score vector h_k divided by the sum of the exponential outputs from exp(w^T M_1) for the tag score vector h_1 through exp(w^T M_n) for the tag score vector h_n (that is, a ratio). The weight α_k therefore takes a value in the range from 0 to 1. The numerator of Formula (3) is an example of "each feature value output for each character", and the denominator is an example of "the feature values of all characters".
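The images of Formulas (2) and (3) are not reproduced in this text. From the description above, the nonlinear transformation and the softmax weight can be written as follows. This is a reconstruction from the prose, using the index i for the summation as an assumed notation.

```latex
\begin{aligned}
M_k &= \tanh(h_k) \\
\alpha_k &= \frac{\exp\left(w^{T} M_k\right)}{\sum_{i=1}^{n} \exp\left(w^{T} M_i\right)}
\end{aligned}
```

Because every exponential is positive and the denominator includes the numerator, each α_k lies between 0 and 1 and the weights α_1 to α_n sum to 1.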

The attention mechanism 330 multiplies the vector M_k, obtained by nonlinearly transforming the tag score vector h_k of the target character C_k, by the transpose of the weight α_k corresponding to that tag score vector h_k. This generates a tag score vector H_k in which the tag score vector h_k of the target character C_k is weighted according to its degree of attention relative to the other characters. Hereinafter, the tag score vector H_k obtained by multiplying the tag score vector h_k by the weight α_k is referred to as a "weighted tag score vector H_k". Because the weight α_k is a vector, the weighted tag score vector H_k is the Hadamard product of the vector M_k and the weight α_k.

As described above, the present embodiment takes the four characters C_1 to C_4 as the processing target by way of example. The attention mechanism 330 therefore multiplies the vector M_1, obtained by nonlinearly transforming the tag score vector h_1 of the character C_1, by the weight α_1 corresponding to the tag score vector h_1, and generates the weighted tag score vector H_1 corresponding to the character C_1.

Similarly, the attention mechanism 330 multiplies the vector M_2, obtained by nonlinearly transforming the tag score vector h_2 of the character C_2, by the weight α_2 corresponding to the tag score vector h_2, and generates the weighted tag score vector H_2 corresponding to the character C_2.

The attention mechanism 330 also multiplies the vector M_3, obtained by nonlinearly transforming the tag score vector h_3 of the character C_3, by the weight α_3 corresponding to the tag score vector h_3, and generates the weighted tag score vector H_3 corresponding to the character C_3.

Finally, the attention mechanism 330 multiplies the vector M_4, obtained by nonlinearly transforming the tag score vector h_4 of the character C_4, by the weight α_4 corresponding to the tag score vector h_4, and generates the weighted tag score vector H_4 corresponding to the character C_4.
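The weighting step can be sketched in a few lines of Python. This sketch makes one simplifying assumption that should be read carefully: it treats each α_k as the scalar produced by the softmax over w^T M_k, whereas the description refers to α_k as a vector combined with M_k by a Hadamard product. The function name `attention`, the weight vector `w`, and the input values are all hypothetical.

```python
import math


def attention(tag_score_vectors, w):
    """Return (alphas, H) for tag score vectors h_1..h_n and weight vector w."""
    # M_k = tanh(h_k), applied element by element.
    M = [[math.tanh(v) for v in h] for h in tag_score_vectors]
    # Scalar score w^T M_k for each character.
    scores = [sum(wi * mi for wi, mi in zip(w, m)) for m in M]
    # Softmax over the scores (shifted by the maximum for numerical stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted tag score vector H_k = alpha_k * M_k (alpha_k assumed scalar here).
    H = [[a * mi for mi in m] for a, m in zip(alphas, M)]
    return alphas, H


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # made-up h_1..h_4
alphas, H = attention(h, w=[1.0, 1.0])
print([round(a, 3) for a in alphas])
```

Subtracting the maximum score before exponentiating does not change the softmax result; it only avoids overflow for large scores.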

Because these weighted tag score vectors H are generated by multiplying the vector M by the weight α obtained from the softmax function, the value of each element of a weighted tag score vector H behaves like a probability. Each element of the weighted tag score vector H represents the likelihood of a tag attached to each character C as a value in the range from 0 to 1. For example, when the weighted tag score vector H contains four elements e1 to e4, element e1 may represent the likelihood of the "B" tag, element e2 the likelihood of the "I" tag, element e3 the likelihood of the "L" tag, and element e4 the likelihood of the "O" tag. The number of elements in the weighted tag score vector H and the tag type corresponding to each element may be chosen freely as hyperparameters. Elements e1 to e3 are examples of a "first index value", and element e4 is an example of a "second index value".

The weighted tag score vectors H generated by the attention mechanism 330 are output to the CRF layer 400.

Based on the conditional random field model (a Markov random field) indicated by the conditional random field model information 134 and on the plurality of weighted tag score vectors H output by the attention mechanism 330, the CRF layer 400 determines the tagging combination with the highest likelihood among the combinations of tags that can be attached to the plurality of input characters C. The conditional random field model indicated by the conditional random field model information 134 is a conditional random field that conditions the probability that characters given any of the various tags such as "B", "I", "L", and "O" are adjacent to each other in a document (character string).

FIG. 7 is a diagram showing an example of the conditional random field model. When each of the characters in a character string is given one of the various tags such as "B", "I", "L", and "O", certain fixed patterns can exist in the order of the tags. For example, a character string tagged in the order "B", "I", "L" can exist, whereas a character string tagged in the order "O", "L", "O" is unlikely to exist outside of typographical errors or unusual writing styles. In view of this, the conditional random field model treats each tag as a state and defines the probability of moving from one tag to another (the probability that two given tags are adjacent) as a state transition probability.

In a typical document, for example, a character given the "B" tag is frequently followed by a character given the "I" tag, so the state transition probability from the "B" tag to the "I" tag is set high. A character given the "L" tag is also frequently followed by a character given the "O" tag, so the state transition probability from the "L" tag to the "O" tag is likewise set high. A character given the "I" tag is frequently followed by a character given the "L" tag, so the state transition probability from the "I" tag to the "L" tag is set high. A character given the "I" tag is also frequently followed by another character given the "I" tag, so the state transition probability from the "I" tag to the "I" tag is set high. Conversely, in a typical document, a character given the "L" tag is rarely followed by a character given the "B" tag, so the state transition probability from the "L" tag to the "B" tag is set low. A character given the "O" tag is rarely followed by a character given the "L" tag, so the state transition probability from the "O" tag to the "L" tag is set low. A character given the "B" tag is rarely followed by another character given the "B" tag, so the state transition probability from the "B" tag to the "B" tag is also set low. These state transition probabilities are assumed to be predetermined as hyperparameters.

In accordance with the conditional random field model, in which the state transition probabilities between all tags are defined along the lines described above, the CRF layer 400 derives the likelihood of every tag combination.

For example, consider one combination in which the first character C_1 has the "B" tag, the next character C_2 has the "I" tag, the next character C_3 has the "L" tag, and the last character C_4 has the "O" tag. The CRF layer 400 first takes the element e1, which represents the likelihood of the "B" tag among the elements of the weighted tag score vector H_1 corresponding to the character C_1, and multiplies it by the state transition probability 0.5 of the transition from the "B" tag (= element e1) to the "I" tag (= element e2). Next, the CRF layer 400 takes the element e2, which represents the likelihood of the "I" tag among the elements of the weighted tag score vector H_2 corresponding to the character C_2, and multiplies it by the state transition probability 0.4 of the transition from the "I" tag (= element e2) to the "L" tag (= element e3). Next, the CRF layer 400 takes the element e3, which represents the likelihood of the "L" tag among the elements of the weighted tag score vector H_3 corresponding to the character C_3, and multiplies it by the state transition probability 0.3 of the transition from the "L" tag (= element e3) to the "O" tag (= element e4). The CRF layer 400 then derives the linear sum of these products (0.5 × e1 + 0.4 × e2 + 0.3 × e3) as the likelihood of the combination "B", "I", "L", "O". The likelihood of each tag combination is divided by the likelihood of all combinations, so that it takes a value in the range from 0 to 1.
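The linear sum for a single tag combination can be checked with a short Python sketch. Only the three transition probabilities 0.5, 0.4, and 0.3 come from the description; the weighted tag score vectors in `H`, the names `TAG_INDEX`, `TRANSITIONS`, and `combination_score`, and the element ordering (B, I, L, O) are illustrative assumptions.

```python
# Position of each tag's likelihood inside a weighted tag score vector (e1..e4).
TAG_INDEX = {"B": 0, "I": 1, "L": 2, "O": 3}

# Transition probabilities named in the worked example of the description.
TRANSITIONS = {("B", "I"): 0.5, ("I", "L"): 0.4, ("L", "O"): 0.3}


def combination_score(tags, weighted_vectors, transitions=TRANSITIONS):
    """Sum, over adjacent tag pairs, of transition probability x tag likelihood."""
    score = 0.0
    for k in range(len(tags) - 1):
        element = weighted_vectors[k][TAG_INDEX[tags[k]]]  # e.g. e1 of H_1 for "B"
        score += transitions[(tags[k], tags[k + 1])] * element
    return score


# Made-up weighted tag score vectors H_1..H_4, ordered as (B, I, L, O).
H = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

score = combination_score(["B", "I", "L", "O"], H)
print(round(score, 2))  # 0.5*0.7 + 0.4*0.6 + 0.3*0.6, approximately 0.77
```

As in the description, the vector of the last character contributes no term of its own, because no transition follows it.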

Having derived the likelihood of every tag combination, the CRF layer 400 selects the tag combination with the highest likelihood and outputs that combination as a vector (hereinafter, the maximum likelihood tag vector). For example, when there are four tag types to classify, the maximum likelihood tag vector output by the CRF layer 400 contains an element e1 representing the likelihood of the tag attached to the first character C_1, an element e2 representing the likelihood of the tag attached to the next character C_2, an element e3 representing the likelihood of the tag attached to the next character C_3, and an element e4 representing the likelihood of the tag attached to the last character C_4.
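A brute-force version of this selection can be sketched as follows. It enumerates every tag combination, scores each with the linear sum from the worked example, normalizes by the total over all combinations, and returns the best one. The transition table is hypothetical: only B→I = 0.5, I→L = 0.4, and L→O = 0.3 appear in the description, the remaining entries are made up to follow the stated high/low tendencies, and the rows are deliberately not normalized. The names `TAGS`, `TRANSITIONS`, `score`, and `most_likely_tags` are also assumptions.

```python
from itertools import product

TAGS = ["B", "I", "L", "O"]

# Hypothetical state transition probabilities: TRANSITIONS[from_tag][to_tag].
TRANSITIONS = {
    "B": {"B": 0.05, "I": 0.5, "L": 0.4, "O": 0.05},
    "I": {"B": 0.05, "I": 0.4, "L": 0.4, "O": 0.15},
    "L": {"B": 0.1, "I": 0.05, "L": 0.05, "O": 0.3},
    "O": {"B": 0.4, "I": 0.05, "L": 0.05, "O": 0.5},
}


def score(tags, vectors):
    """Linear sum of (transition probability x likelihood of the current tag)."""
    return sum(
        TRANSITIONS[a][b] * vectors[k][TAGS.index(a)]
        for k, (a, b) in enumerate(zip(tags, tags[1:]))
    )


def most_likely_tags(vectors):
    """Return the best tag combination and its normalized likelihood."""
    combos = list(product(TAGS, repeat=len(vectors)))
    scores = [score(c, vectors) for c in combos]
    total = sum(scores)  # likelihood of all combinations, used for normalization
    best = max(range(len(combos)), key=scores.__getitem__)
    return combos[best], scores[best] / total


# Made-up weighted tag score vectors H_1..H_4, ordered as (B, I, L, O).
H = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]

best_tags, likelihood = most_likely_tags(H)
print(best_tags)  # ('B', 'I', 'L', 'O')
```

Exhaustive enumeration grows exponentially with the number of characters; practical CRF decoders usually use dynamic programming (the Viterbi algorithm) instead, which the description does not discuss.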

Returning to the description of FIG. 5, the character string extraction unit 114 next acquires the output result from the CRF layer 400 of the classifier MDL, that is, the maximum likelihood tag vector indicating the most likely tag combination (S104).

Next, the character string extraction unit 114 determines whether all characters in the document acquired in step S100 have been input to the classifier MDL (S106). If not all characters have been input yet, the process returns to S102, and another character string that does not overlap the character string previously input to the classifier MDL is input to the classifier MDL.
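One simple way to obtain character strings that never overlap is to cut the document into consecutive fixed-size pieces, as in the sketch below. The function name `non_overlapping_chunks` and the chunk size are hypothetical; this passage does not state how the next character string is chosen beyond requiring that it not overlap the previous one.

```python
def non_overlapping_chunks(document, size):
    """Split `document` into consecutive pieces of at most `size` characters."""
    return [document[i:i + size] for i in range(0, len(document), size)]


chunks = non_overlapping_chunks("abcdefghij", 4)
print(chunks)  # ['abcd', 'efgh', 'ij']
```

Looping over the returned pieces corresponds to repeating S102 to S106 until every character has been input.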

On the other hand, when all characters have been input to the classifier MDL, the character string extraction unit 114 tags each character in the document based on the maximum likelihood tag vectors acquired from the classifier MDL and extracts from the document a character string that can be associated with a hyperlink, like anchor text (S108). For example, when the characters C_1 to C_4 are input to the classifier MDL and the maximum likelihood tag vector acquired from the classifier MDL contains elements representing the likelihoods of the tags "B", "I", "L", and "O", the character string extraction unit 114 assigns the "B" tag to the character C_1, the "I" tag to the character C_2, the "L" tag to the character C_3, and the "O" tag to the character C_4. The character string extraction unit 114 then extracts from the document, as a character string that can be associated with a hyperlink, the character string that begins with a character C given the "B" tag and ends with a character C given the "L" tag. Characters C given the "I" tag may be present between the character C given the "B" tag and the character C given the "L" tag.
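The extraction rule in S108 (start at "B", end at "L", allow "I" in between) can be sketched as a single pass over the tagged characters. The function name `extract_strings` and the sample sentence are hypothetical, and the single-character "U" tag mentioned elsewhere in the description is not handled in this sketch.

```python
def extract_strings(chars, tags):
    """Collect every span that starts at a "B" tag and ends at an "L" tag."""
    results, start = [], None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open a new span
        elif tag == "L" and start is not None:
            results.append("".join(chars[start:i + 1]))
            start = None                   # close the span
        elif tag == "O":
            start = None                   # an "O" inside a span invalidates it
        # An "I" tag leaves an open span unchanged.
    return results


chars = list("台風１０号が接近")
tags = ["B", "I", "I", "I", "L", "O", "O", "O"]
print(extract_strings(chars, tags))  # ['台風１０号']
```

A sequence such as "O", "L", "O" yields nothing, because an "L" tag without a preceding "B" never closes a span.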

Next, the communication control unit 116 controls the communication unit 102 to transmit information including the character string extracted by the character string extraction unit 114 (hereinafter, character string information) to the service providing device 20 (S110). This completes the processing of this flowchart.

Having received the character string information, the service providing device 20 may, as illustrated in FIG. 2, provide the user's terminal device 10 with content in which a hyperlink is associated with the character string extracted by the character string extraction unit 114, using a web page as the medium. Instead of, or in addition to, associating the extracted character string with a hyperlink, the service providing device 20 may suggest the character string extracted by the character string extraction unit 114 as a query for information retrieval.

FIG. 8 is a diagram showing another example of a web page provided by the service providing device 20. As in the illustrated example, when the character string extracted by the character string extraction unit 114 is "台風１０号" ("Typhoon No. 10") and the user moves the mouse pointer over that character string (mouseover), the service providing device 20 displays a search button B1 that prompts the user to search for that character string.

[Processing flow during learning (training)]
The flow of a series of processes performed by the information processing apparatus 100 of the first embodiment during learning is described below with reference to a flowchart. "During learning" refers to the state in which the classifier MDL used during operation is trained. FIG. 9 is a flowchart showing the flow of a series of processes performed during learning by the information processing apparatus 100 in the first embodiment.

First, the learning unit 118 selects a fixed number of character strings from the teacher data 136 stored in the storage unit 130 (S200). The teacher data 136 is document data containing character strings to which various tags such as the "B" tag and the "O" tag have been associated in advance as teacher labels (character strings whose tags are known). The document represented by the teacher data 136 need not be tokenized (split into words) and is assumed to contain Japanese full stops (kuten), Japanese commas (touten), full stops, commas, periods, question marks, exclamation marks, ellipses, parentheses, symbols, spaces, line breaks, and the like. The language of the document may be Japanese or another language such as English, Chinese, or German.

Next, having selected a character string from the teacher data 136, the learning unit 118 inputs that character string to the classifier MDL (S202).

Next, the learning unit 118 acquires the output result, that is, the maximum likelihood tag vector, from the classifier MDL to which the character string was input (S204).

Next, the learning unit 118 compares the tag combination indicated by the maximum likelihood tag vector with the tags associated as teacher labels with the characters of the character string input to the classifier MDL, and determines whether the two tag combinations match (S206).

When the learning unit 118 determines that the tag combinations do not match, it learns the parameters of the classifier MDL based on a gradient method such as error backpropagation (S208). For example, the learning unit 118 takes the likelihood of the tag combination associated as teacher labels to be the maximum value, and learns the parameters of the classifier MDL so that the difference obtained by subtracting the likelihood of the tag combination indicated by the maximum likelihood tag vector from that maximum value becomes small. The parameters of the classifier MDL to be learned include at least the weights and bias components of the feature extraction layer 300. This completes the processing of this flowchart.

According to the first embodiment described above, a document containing a plurality of characters C is acquired, and the characters C in the acquired document are input to the classifier MDL. The classifier MDL has been trained to classify each input character into one of the "B", "I", and "L" tags, which identify a character string (named entity) to which a hyperlink can be associated, or the "O" tag, which identifies any other character. A character string to which a hyperlink can be associated is then extracted from the document based on the resulting maximum likelihood tag vector. Consequently, a character string that can be associated with related content via a hyperlink can be extracted from the document with high accuracy.

In general, when the characters in a document are classified into tags such as "B", "I", "L", "O", and "U", each character is tagged strictly according to conditional expressions (a dictionary) defined in advance by a designer. In such cases, even when a new word that the designer did not originally anticipate comes into common use, it may not be possible to tag it unless the word satisfies the existing conditional expressions.

In contrast, in the present embodiment, characters tagged on the basis of, for example, conditional expressions defined in advance by a designer are used as the teacher data 136 to train the neural-network-based classifier MDL, so the predetermined conditions can be generalized. As a result, even characters that do not satisfy the predetermined conditions can be classified into tags such as "B", "I", "L", "O", and "U".

Further, according to the first embodiment described above, the teacher data 136 consists of documents that are not tokenized and that contain Japanese full stops, Japanese commas, periods, commas, question marks, exclamation marks, omission marks, parentheses, symbols, spaces, line breaks, and the like. Consequently, even when a character string that is not a word newly becomes popular as a new or coined word, this character string can be extracted from the document as a character string with which a hyperlink can be associated. Examples include a character string containing a full stop that normally marks the end of a sentence (such as "○○ idol.") and a slogan or catchphrase such as "Let's do our best, ○○."
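
As a rough illustration of character-level teacher data built from untokenized text, the Python sketch below assigns one tag to every character, punctuation and spaces included. The function name `tag_characters`, the half-open `(start, end)` span format, and the sample slogan are assumptions made for this example only.

```python
def tag_characters(text, spans):
    """Return one BILOU tag per character of an untokenized text.

    spans: hypothetical list of (start, end) half-open character offsets
    marking strings with which a hyperlink can be associated.
    """
    tags = ["O"] * len(text)
    for start, end in spans:
        if end - start == 1:
            tags[start] = "U"
        else:
            tags[start] = "B"
            for i in range(start + 1, end - 1):
                tags[i] = "I"
            tags[end - 1] = "L"
    return tags


# The whole slogan, including its comma, space, and final period, is one span.
slogan = "Go, team."
print(list(zip(slogan, tag_characters(slogan, [(0, len(slogan))]))))
```

Because no tokenizer runs first, the closing period receives the "L" tag like any other character, which is what allows a string ending in a full stop to be learned as a single unit.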

Further, according to the first embodiment described above, since the classifier MDL includes the BiLSTM layer 310, the combination of tags with the highest likelihood can be determined while taking the order of the input characters into account.

Further, according to the first embodiment described above, since the classifier MDL includes the attention mechanism 330, the feature vector (tag score vector) h output by the preceding BiLSTM layer 310 can be weighted according to the position (order) of each character in the input character string. As a result, the value of each element of the feature vector can be converted into a tag likelihood.
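
The claims recite a weight equal to the target character's feature divided by the features of all characters. One possible reading of that weighting is sketched below in Python. Applying the ratio separately to each tag dimension, restricting the features to non-negative values, and the name `attention_weight` are all assumptions of this sketch rather than details confirmed by the embodiment.

```python
def attention_weight(features):
    """Weight per-character tag scores by their share of the sequence total.

    features: list over characters; each item is a list of non-negative
    tag scores (an assumption; real BiLSTM outputs may be negative).
    """
    n_tags = len(features[0])
    # Sum of each tag score over all characters in the input string.
    totals = [sum(f[t] for f in features) for t in range(n_tags)]
    weighted = []
    for f in features:
        row = []
        for t in range(n_tags):
            weight = f[t] / totals[t] if totals[t] else 0.0
            row.append(f[t] * weight)
        weighted.append(row)
    return weighted


# Two characters, two tag scores each.
print(attention_weight([[1.0, 3.0], [3.0, 1.0]]))
```

Under this reading, a score that dominates its tag dimension across the string is amplified relative to the others, which is one way position-dependent weighting could sharpen the tag scores before decoding.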

Further, according to the first embodiment described above, since the classifier MDL includes the CRF layer 400, even when a combination of tags has a high likelihood, the likelihood of that combination is lowered if it is a combination that is unlikely to occur, such as "O", "L", "O". As a result, character strings that can be associated with related content can be extracted from the document even more accurately.
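
The effect of penalizing improbable tag sequences can be sketched with a small Viterbi decoder in Python. A trained CRF would learn its transition scores from data; here a fixed hypothetical penalty (`PENALTY`) and a hand-written table of allowed transitions (`ALLOWED`) stand in for them, and the per-character scores are invented for illustration.

```python
TAGS = ["B", "I", "L", "O", "U"]

# Assumed BILOU transition rules: which tags may follow each tag.
ALLOWED = {
    "B": {"I", "L"},
    "I": {"I", "L"},
    "L": {"B", "O", "U"},
    "O": {"B", "O", "U"},
    "U": {"B", "O", "U"},
}
PENALTY = -10.0  # hypothetical log-score penalty for an unlikely transition


def transition(prev, cur):
    return 0.0 if cur in ALLOWED[prev] else PENALTY


def viterbi(emissions):
    """emissions: one dict per character mapping each tag to a log-score."""
    # best[tag] = (score of the best path ending in tag, that path)
    best = {t: (emissions[0][t], [t]) for t in TAGS}
    for scores in emissions[1:]:
        new_best = {}
        for cur in TAGS:
            score, path = max(
                ((best[p][0] + transition(p, cur) + scores[cur], best[p][1])
                 for p in TAGS),
                key=lambda item: item[0],
            )
            new_best[cur] = (score, path + [cur])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


def scores(**known):
    """Helper: fill unspecified tags with a low default score."""
    return {t: known.get(t, -5.0) for t in TAGS}


emissions = [scores(O=0.0, B=-1.0), scores(L=0.0, U=-0.5, O=-1.0), scores(O=0.0)]
greedy = [max(s, key=s.get) for s in emissions]
print(greedy, viterbi(emissions))
```

Choosing the top tag for each character independently yields "O", "L", "O", whereas decoding with the transition penalty avoids an "L" that has no preceding "B".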

<Second Embodiment>
The second embodiment is described below. The second embodiment differs from the first embodiment described above in that the embedding layer 200 included in the classifier MDL is pre-trained using a language model MDL2. The following description focuses on the differences from the first embodiment, and description of the points in common with the first embodiment is omitted. In the description of the second embodiment, parts that are the same as in the first embodiment are denoted by the same reference numerals.

FIG. 10 is a diagram showing an example of the language model MDL2. The language model MDL2 includes, for example, an encoder 500 and a decoder 600. The encoder 500 is a neural network that converts input characters into character vectors, and the decoder 600 is a neural network that converts the character vectors produced by the encoder 500 back into the original characters. The encoder 500 includes, for example, the embedding layer 200, a first LSTM 510, and a second LSTM 520.

The language model MDL2 is trained so that, when a character C_k is input, it predicts the character C_{k+1} that follows the input character C_k. For example, as illustrated, when a character string consisting of characters C_1 to C_6 is input to the language model MDL2, the learning unit 118 inputs the character C_1 and has the language model MDL2 predict the next character C_X adjacent to the character C_1. If the character C_X predicted by the language model MDL2 is not the character C_2 that follows the character C_1, the learning unit 118 redetermines parameters such as the weights and bias components of at least the encoder 500. Next, the learning unit 118 inputs the character C_3 and has the language model MDL2 predict the next character C_X adjacent to the character C_3. If the character C_X predicted by the language model MDL2 is not the character C_4 that follows the character C_3, the learning unit 118 redetermines parameters such as the weights and bias components of at least the encoder 500. In this way, the learning unit 118 performs unsupervised learning, in which the parameters of the language model MDL2 are learned from the input data alone.
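
The reason no labels are needed can be seen in how the training targets are formed: each character's target is simply the character that follows it. The Python sketch below derives such pairs from raw text and, purely as a stand-in for the LSTM-based language model MDL2, fits a count-based next-character table. The function names and the sample string are hypothetical.

```python
from collections import Counter, defaultdict


def next_char_pairs(text):
    """Pair every character C_k with its successor C_{k+1}."""
    return [(text[k], text[k + 1]) for k in range(len(text) - 1)]


def train_bigram(text):
    """Count successors per character (a simple stand-in for MDL2)."""
    counts = defaultdict(Counter)
    for cur, nxt in next_char_pairs(text):
        counts[cur][nxt] += 1
    return counts


def predict_next(counts, ch):
    """Return the most frequent successor of ch, or None if ch is unseen."""
    if ch not in counts:
        return None
    return counts[ch].most_common(1)[0][0]


model = train_bigram("abab")
print(next_char_pairs("abcd"), predict_next(model, "a"))
```

In the embodiment the comparison between predicted and actual next character drives a parameter update of the encoder; the counting table above only mirrors the input and target structure, not that update rule.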

When, as a result of training, the language model MDL2 has become able to predict the character C_{k+1} that follows an input character C_k with a certain level of accuracy, the embedding layer 200 included in the encoder 500 of the language model MDL2 is taken out and used as the embedding layer 200 of the classifier MDL.

According to the second embodiment described above, pre-training the embedding layer 200 included in the classifier MDL using the language model MDL2 allows the classifier MDL to generate more appropriate character vectors.

<Third Embodiment>
The third embodiment is described below. The third embodiment differs from the first and second embodiments described above in that specific character strings are removed from the one or more character strings extracted by the character string extraction unit 114. The following description focuses on the differences from the first and second embodiments, and description of the points in common with them is omitted. In the description of the third embodiment, parts that are the same as in the first or second embodiment are denoted by the same reference numerals.

FIG. 11 is a diagram showing an example of the configuration of an information processing apparatus 100A according to the third embodiment. As illustrated, a control unit 110A of the information processing apparatus 100A according to the third embodiment includes a specific character string removal unit 120 in addition to the acquisition unit 112, the character string extraction unit 114, the communication control unit 116, and the learning unit 118 described above.

The specific character string removal unit 120 removes specific character strings from the one or more character strings extracted by the character string extraction unit 114. The specific character strings include, for example, inappropriate character strings that offend public order and morals, character strings with negative meanings, and character strings corresponding to terms prohibited in broadcasting.
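
A minimal Python sketch of such a removal step follows. The embodiment does not state how specific character strings are identified, so exact, case-insensitive matching against a prepared blocklist is an assumption here, as are the function name `remove_specific_strings` and the placeholder entries.

```python
def remove_specific_strings(extracted, blocklist):
    """Drop extracted strings that match a blocklist entry.

    Assumption: matching is exact and case-insensitive; the embodiment
    does not specify the matching rule.
    """
    blocked = {s.casefold() for s in blocklist}
    return [s for s in extracted if s.casefold() not in blocked]


extracted = ["Tokyo Tower", "badword"]  # placeholder extraction results
print(remove_specific_strings(extracted, ["BadWord"]))
```

Only the strings that survive this filter would then be passed on for transmission, so the order of the remaining strings is preserved.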

The communication control unit 116 transmits, to the service providing device 20, character string information that includes the character strings extracted by the character string extraction unit 114 and subsequently left unremoved by the specific character string removal unit 120.

According to the third embodiment described above, specific character strings are removed from among the one or more character strings that start with a character C given the "B" tag and end with a character C given the "L" tag, so hyperlinks to more appropriate content can be associated with the character strings.

<Hardware Configuration>
The information processing apparatus 100 of the embodiments described above is realized by, for example, a hardware configuration such as that shown in FIG. 12. FIG. 12 is a diagram showing an example of the hardware configuration of the information processing apparatuses 100 and 100A of the embodiments.

In the information processing apparatuses 100 and 100A, a NIC 100-1, a CPU 100-2, a RAM 100-3, a ROM 100-4, a secondary storage device 100-5 such as a flash memory or an HDD, and a drive device 100-6 are connected to one another by an internal bus or a dedicated communication line. A portable storage medium such as an optical disc is loaded into the drive device 100-6. A program stored in the secondary storage device 100-5, or in the portable storage medium loaded into the drive device 100-6, is loaded into the RAM 100-3 by a DMA controller (not shown) or the like and executed by the CPU 100-2, whereby the control units 110 and 110A are realized. The program referred to by the control units 110 and 110A may be downloaded from another device via the network NW.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1 ... information processing system, 10 ... terminal device, 20 ... service providing device, 100, 100A ... information processing apparatus, 102 ... communication unit, 110, 110A ... control unit, 112 ... acquisition unit, 114 ... character string extraction unit, 116 ... communication control unit, 118 ... learning unit, 120 ... specific character string removal unit, 130 ... storage unit

Claims

An information processing apparatus comprising:
an acquisition unit that acquires a document containing a plurality of characters; and
an extraction unit that extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the classifier outputs a first index value representing the likelihood that the input character is the first character and a second index value representing the likelihood that the input character is the second character, and
the extraction unit extracts the character string from the document based on the first index value and the second index value output by the classifier.

An information processing apparatus comprising:
an acquisition unit that acquires a document containing a plurality of characters; and
an extraction unit that extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the classifier includes a bidirectional recurrent neural network that predicts a character appearing after an input character and also predicts a character appearing before the input character.

An information processing apparatus comprising:
an acquisition unit that acquires a document containing a plurality of characters; and
an extraction unit that extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the extraction unit extracts the character string from the document based on the output result of the classifier and a conditional random field that conditions at least the probability that the first character and the second character are adjacent to each other in a document.

The information processing apparatus according to claim 1,
wherein the first character includes a third character that appears first in the character string, a fourth character that appears last in the character string, and a fifth character that appears between the third character and the fourth character,
the classifier outputs the first index value for each of the third character, the fourth character, and the fifth character, and
the extraction unit extracts the character string from the document based on the first index value of each of the third character, the fourth character, and the fifth character output by the classifier and on the second index value of the second character.

The information processing apparatus according to claim 2,
wherein the recurrent neural network outputs a feature for each of the input characters, and
the classifier further includes an attention mechanism that multiplies each feature output for each character by the recurrent neural network by a weight.

The information processing apparatus according to claim 5,
wherein the attention mechanism multiplies the feature of a target character by, as the weight, the ratio obtained by dividing the feature of the target character by the features of all the characters.

An information processing method in which a computer:
acquires a document containing a plurality of characters; and
extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the classifier outputs a first index value representing the likelihood that the input character is the first character and a second index value representing the likelihood that the input character is the second character, and
the computer extracts the character string from the document based on the first index value and the second index value output by the classifier.

An information processing method in which a computer:
acquires a document containing a plurality of characters; and
extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the classifier includes a bidirectional recurrent neural network that predicts a character appearing after an input character and also predicts a character appearing before the input character.

An information processing method in which a computer:
acquires a document containing a plurality of characters;
extracts a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string; and
extracts the character string from the document based on the output result of the classifier and a conditional random field that conditions at least the probability that the first character and the second character are adjacent to each other in a document.

A program for causing a computer to execute:
a process of acquiring a document containing a plurality of characters; and
a process of extracting a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the classifier outputs a first index value representing the likelihood that the input character is the first character and a second index value representing the likelihood that the input character is the second character, and
the program further causes the computer to execute a process of extracting the character string from the document based on the first index value and the second index value output by the classifier.

A program for causing a computer to execute:
a process of acquiring a document containing a plurality of characters; and
a process of extracting a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string,
wherein the classifier includes a bidirectional recurrent neural network that predicts a character appearing after an input character and also predicts a character appearing before the input character.

A program for causing a computer to execute:
a process of acquiring a document containing a plurality of characters;
a process of extracting a character string satisfying a predetermined condition from the document based on an output result of a classifier, the output result being obtained by inputting the characters contained in the document into the classifier, the classifier having been trained to classify an input character as either a first character that is included in the character string or a second character that is not included in the character string; and
a process of extracting the character string from the document based on the output result of the classifier and a conditional random field that conditions at least the probability that the first character and the second character are adjacent to each other in a document.