JP2011008312A

JP2011008312A - Language analysis device and program

Info

Publication number: JP2011008312A
Application number: JP2009148259A
Authority: JP
Inventors: Yasuhide Miura; 康秀三浦; Tomoko Okuma; 智子大熊; Hiroshi Masuichi; 博増市
Original assignee: Fuji Xerox Co Ltd
Current assignee: Fujifilm Business Innovation Corp
Priority date: 2009-06-23
Filing date: 2009-06-23
Publication date: 2011-01-13
Anticipated expiration: 2029-06-23
Also published as: JP5387161B2

Abstract

PROBLEM TO BE SOLVED: To classify whether parenthetical expressions included in a sentence should be separated from the sentence in syntax analysis. SOLUTION: A language analysis device 10 obtains sentence information including parenthetical expressions from a document group storage part 12 which stores sentence information, provisionally classifies the parenthetical expressions included in the obtained sentence information, according to a predetermined rule, into a first type to be separated from the sentence information and a second type not to be separated from it, and classifies the parenthetical expressions included in a given sentence into the first type or the second type based on a rule learnt by a machine learning part 20, which learns a rule for classifying parenthetical expressions into the first type and the second type based on the provisional classification result.

Description

The present invention relates to a language analysis device and a program.

Syntax analysis (parsing) is one of the processes performed in natural language processing. In parsing, parentheses included in a sentence can act as redundant expressions and degrade parsing accuracy. Patent Document 1 below therefore parses a sentence containing parentheses by separating the parentheses from the sentence, parsing, and then returning the separated parentheses to the original sentence.

Patent Document 1: JP 2002-82944 A

When there is a syntactic dependency between a sentence and a parenthetical expression included in it (character information comprising the parentheses and the character string they enclose), separating the parenthetical expression from the sentence before parsing may not give correct results.

One object of the present invention is to provide a language analysis device and a program that can classify, based on actual examples, whether a parenthetical expression included in a sentence should be separated from the sentence at the time of parsing.

To achieve the above object, the language analysis device according to claim 1 comprises: acquisition means for acquiring sentence information including a parenthetical expression from storage means storing sentence information; provisional classification means for provisionally classifying, according to a predetermined rule, the parenthetical expression included in the sentence information acquired by the acquisition means into a first type that is separated from the sentence information or a second type that is not separated; learning means for learning, based on the provisional classification results produced by the provisional classification means, a rule for classifying parenthetical expressions into the first type and the second type; and classification means for classifying a parenthetical expression included in a given sentence into the first type or the second type based on the rule learned by the learning means.

The invention according to claim 2 is the language analysis device according to claim 1, further comprising setting means for setting the parsing target by separating a parenthetical expression classified into the first type by the classification means from the sentence information that includes it.

The invention according to claim 3 is the language analysis device according to claim 1 or 2, wherein the learning means learns a rule for classifying feature information of parenthetical expressions into the first type and the second type, using the provisional classification results produced by the provisional classification means as teacher information.

The invention according to claim 4 is the language analysis device according to claim 3, wherein the feature information of a parenthetical expression is generated based on morpheme information of the character strings surrounding the parenthetical expression.

The invention according to claim 5 is the language analysis device according to any one of claims 1 to 3, wherein, for each piece of sentence information acquired by the acquisition means, the provisional classification means provisionally classifies the parenthetical expression into the first type or the second type based on (i) the result of comparing the syntax information of the sentence information with the syntax information of the character string obtained by removing the parenthetical expression from the sentence information, and (ii) the result of determining whether the syntax information of the parenthetical expression satisfies a predetermined condition.

The invention according to claim 6 is a program for causing a computer to function as: acquisition means for acquiring sentence information including a parenthetical expression from storage means storing sentence information; provisional classification means for provisionally classifying, according to a predetermined rule, the parenthetical expression included in the sentence information acquired by the acquisition means into a first type that is separated from the sentence information or a second type that is not separated; learning means for learning, based on the provisional classification results produced by the provisional classification means, a rule for classifying parenthetical expressions into the first type and the second type; and classification means for classifying a parenthetical expression included in a given sentence into the first type or the second type based on the rule learned by the learning means.

According to the inventions of claims 1 and 6, whether a parenthetical expression included in a sentence should be separated from the sentence at the time of parsing can be classified based on actual examples.

According to the invention of claim 2, a parenthetical expression that should be separated from the sentence can be separated when the parsing target is set.

According to the invention of claim 3, the types of parenthetical expressions can be learned without teacher information being provided separately.

According to the invention of claim 4, feature information of a parenthetical expression can be generated using the provisional classification result.

According to the invention of claim 5, the accuracy of determining whether a parenthetical expression should be separated from the sentence can be improved compared with a configuration lacking this feature.

FIG. 1 is a functional block diagram of the language analysis device according to the present embodiment.
FIG. 2 is a diagram showing an example of document information.
FIG. 3 is a diagram showing an example of sample text.
FIG. 4A is a diagram showing an example in which a parenthetical expression is classified as "independent" or "dependent".
FIG. 4B is a diagram showing an example in which a parenthetical expression is classified as "independent" or "dependent".
FIG. 4C is a diagram showing an example in which a parenthetical expression is classified as "independent" or "dependent".
FIG. 5 is a diagram showing an example of learning data.
FIG. 6 is a diagram showing an example of input data.
FIG. 7 is a flowchart of the learning process.
FIG. 8 is a flowchart of the process of setting the parsing target.

Preferred embodiments for carrying out the invention (hereinafter, embodiments) are described below with reference to the drawings.

FIG. 1 shows a functional block diagram of a language analysis device 10 according to the present embodiment. As shown in FIG. 1, the language analysis device 10 includes a document group storage unit 12, a learning sample extraction unit 14, a syntax analysis unit 16, a learning data generation unit 18, a machine learning unit 20, a parsing target sentence acquisition unit 22, a parenthesis classification unit 24, and a parsing target setting unit 26. The functions of these units may be realized by a computer, equipped with control means such as a CPU (Central Processing Unit), storage means such as a memory, and input/output means for exchanging data with external devices, reading and executing a program stored on a computer-readable information storage medium. The program may be supplied to the language analysis device 10, which is a computer, on an information storage medium or via data communication means such as the Internet.

The document group storage unit 12 includes a storage device such as a semiconductor memory or a magnetic disk device, and stores document groups (corpora) for one or more fields. A document group (corpus) is a collection of texts and may include, for example, a collection of web text, a collection of encyclopedia articles, a collection of newspaper articles, or a collection of documents on a specific technical or conceptual field.

FIG. 2 shows an example of the document information stored in the document group storage unit 12. As shown in FIG. 2, the document group storage unit 12 stores text data representing the content of each document in association with a document ID that identifies the document.

The learning sample extraction unit 14 extracts, from the document information stored in the document group storage unit 12, sentences to be used as learning samples. In this embodiment, the learning sample extraction unit 14 extracts sentences containing one pair of parentheses from text, among the stored document information, in the field corresponding to the text to be analyzed. When a document stored in the document group storage unit 12 contains multiple sentences, the learning sample extraction unit 14 splits the document into sentences and extracts those that contain one pair of parentheses. The parentheses in this embodiment may include many kinds of brackets, such as round brackets, corner brackets (「」), double corner brackets (『』), square brackets, curly braces, tortoise-shell brackets (〔〕), angle brackets (〈〉), double angle brackets (《》), and lenticular brackets (【】); alternatively, only designated kinds among these may be treated as parentheses.

FIG. 3 shows an example of the sample text that the learning sample extraction unit 14 extracts from the document group storage unit 12. As shown in FIG. 3, in the text with ID 001 in the document information of FIG. 2, the first sentence, "明日は雨が降る（だろう）。" ("It will rain tomorrow (probably)."), contains one pair of parentheses and is therefore extracted as sample text. Likewise, in the text with ID 002, the string inside the corner brackets, "学問的にはケンブリッジ大学も（エジンバラ大学も）得る物は何もなかった" ("Academically, there was nothing to gain at the University of Cambridge (or at the University of Edinburgh)"), contains one pair of parentheses and is also extracted as sample text. In this way, the learning sample extraction unit 14 may extract a character string as sample text even if it lacks a full stop, provided that it has the form of a sentence. The learning sample extraction unit 14 may, for example, keep extracting sample text from the document group storage unit 12 until a predetermined number of samples is reached.
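The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: it assumes that only round brackets are treated as parentheses, that sentences end at the full stop "。", and that the corpus is a plain mapping from document ID to text. The names `split_sentences` and `extract_samples` are hypothetical.

```python
import re

# Assumption: only full-width and ASCII round brackets count as parentheses.
PAREN_RE = re.compile(r"[（(][^（()）]*[）)]")

def split_sentences(text):
    """Naively split after the Japanese full stop; a real splitter would do more."""
    return [s for s in re.split(r"(?<=。)", text) if s.strip()]

def extract_samples(documents, limit=None):
    """Collect (doc_id, sentence) pairs whose sentence has one parenthetical."""
    samples = []
    for doc_id, text in documents.items():
        for sentence in split_sentences(text):
            if len(PAREN_RE.findall(sentence)) == 1:
                samples.append((doc_id, sentence))
                # Stop once the predetermined number of samples is reached.
                if limit is not None and len(samples) >= limit:
                    return samples
    return samples
```

The `limit` argument corresponds to extracting until a predetermined number of samples is reached. Quoted strings without a full stop, such as the ID 002 example, would need a more careful splitter than this sketch provides.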

The syntax analysis unit 16 analyzes the morphemes of a given character string and the relationships among those morphemes, thereby parsing the syntax of the character string. In this embodiment, the syntax analysis unit 16 parses the following three character strings derived from each sample text extracted by the learning sample extraction unit 14.

First, the syntax analysis unit 16 parses the character string enclosed in the parentheses in the sample text (hereinafter, the first parsing result). Next, the syntax analysis unit 16 parses the entire sample text (hereinafter, the second parsing result), and also parses the character string obtained by removing from the sample text the parenthetical expression, that is, the parentheses together with the character string they enclose (hereinafter, the third parsing result). The first to third parsing results obtained for the sample text are each output to the learning data generation unit 18 described later.
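As a rough illustration of the three strings that are parsed, the sketch below derives them from one sample. It assumes round brackets only and a sample containing a single parenthetical; `parse_inputs` is a hypothetical helper, and the parser itself is out of scope here.

```python
import re

# Assumption: only full-width and ASCII round brackets count as parentheses.
PAREN_RE = re.compile(r"[（(]([^（()）]*)[）)]")

def parse_inputs(sample):
    """Return the strings for the first, second and third parsing results."""
    match = PAREN_RE.search(sample)
    if match is None:
        raise ValueError("sample contains no parenthetical expression")
    inner = match.group(1)  # first: the text enclosed in the parentheses
    whole = sample          # second: the entire sample text
    # third: the sample with the whole parenthetical expression removed
    without = sample[:match.start()] + sample[match.end():]
    return inner, whole, without
```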

Based on the first parsing result and on a comparison of the second and third parsing results, the learning data generation unit 18 classifies (provisionally classifies) whether the parenthetical expression included in the sample text is dependent on the sample text, and generates learning data for the parenthetical expression based on that classification. The learning data generation process of this embodiment is described in detail below.

The learning data generation unit 18 first refers to the first parsing result. When the topmost node of that result is a sentence (S) or a noun phrase, the unit judges the parenthetical expression to be "independent" (to be separated from the sample text) if the topmost nodes of the second and third parsing results match, and "dependent" otherwise. The reasoning is that when a parenthetical expression has the form of a sentence or noun phrase, and removing it from the sentence does not change the syntax, it can be regarded as an "independent" parenthetical expression that may be separated from the sentence at the time of parsing.

When the topmost node of the first parsing result is not a sentence (S) or a noun phrase, the learning data generation unit 18 judges the parenthetical expression to be "dependent" (not to be separated from the sample text) if the topmost nodes of the second and third parsing results match, and "independent" otherwise. The reasoning is that when a parenthetical expression does not have the form of a sentence or noun phrase, and removing it from the sentence changes the syntax, parsing the sentence with the parenthetical expression included would make the whole sentence ungrammatical, so a correct syntactic structure might not be obtained.

FIGS. 4A, 4B, and 4C show examples in which specific parenthetical expressions included in sample texts are classified as "independent" or "dependent". As shown in FIG. 4A, for the sample text "雨が降る（だろう）" ("it will rain (probably)"), the topmost node of the first parsing result, obtained by parsing the enclosed string "だろう", is neither a sentence (S) nor a noun phrase, and the topmost nodes of the second and third parsing results match. The parenthetical expression in "雨が降る（だろう）" is therefore judged "dependent", meaning it is not separated at the time of parsing.

FIG. 4B shows the classification result for the parenthetical expression in the sample text "ワイン（赤・白）を扱う" ("handle wine (red/white)"). In this sample text, the topmost node of the first parsing result is a noun phrase, and the topmost nodes of the second and third parsing results match, so the parenthetical expression is judged "independent", meaning it is separated at the time of parsing.

FIG. 4C shows the classification result for the parenthetical expression in the sample text "（であれば問題なので）現場へ行こう" ("(since that would be a problem) let's go to the site"). As shown in FIG. 4C, in this sample text the topmost node of the first parsing result is neither a sentence (S) nor a noun phrase, and the topmost nodes of the second and third parsing results do not match, so the parenthetical expression is judged "independent", meaning it is separated at the time of parsing.
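The provisional classification rule illustrated by FIGS. 4A to 4C reduces to checking whether two yes/no conditions agree. The sketch below assumes the parser reports the topmost node as a label string; the labels "S" and "NP", and the other labels used in the usage note, are illustrative placeholders rather than the tag set of any particular parser.

```python
# Assumed labels for "sentence" and "noun phrase" in the parser's output.
PHRASAL_LABELS = {"S", "NP"}

def provisional_label(inner_top, whole_top, without_top):
    """Provisionally classify one parenthetical from three topmost-node labels.

    inner_top:   topmost node of the first parsing result (enclosed string)
    whole_top:   topmost node of the second parsing result (entire sample)
    without_top: topmost node of the third parsing result (parenthetical removed)
    """
    is_phrase = inner_top in PHRASAL_LABELS
    tops_match = whole_top == without_top
    # "independent" when both conditions hold or both fail, else "dependent".
    return "independent" if is_phrase == tops_match else "dependent"
```

With hypothetical labels, `provisional_label("AUX", "S", "S")` mirrors FIG. 4A and yields "dependent", while `provisional_label("NP", "S", "S")` mirrors FIG. 4B and yields "independent".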

FIG. 5 shows an example of the learning data that the learning data generation unit 18 generates for the parenthetical expression included in each sample text. As shown in FIG. 5, in this embodiment the learning data for a parenthetical expression comprises a positive/negative label, the parenthesis type, the syntax within the parentheses, and surrounding-morpheme information. The positive/negative label is teacher information indicating whether the parenthetical expression is "independent" (positive example) or "dependent" (negative example). The parenthesis type indicates which kind of bracket is used, such as round brackets, corner brackets, or double corner brackets. The syntax within the parentheses is the topmost node of the parsing result for the enclosed character string. The surrounding-morpheme information consists of the surface forms and part-of-speech information of a predetermined number of morphemes around the parenthetical expression. For example, when the surrounding-morpheme information is built from two surrounding morphemes, the surface forms and parts of speech of the morphemes up to two positions before and two positions after the parenthetical expression are concatenated. The number of surrounding morphemes used is not limited to this example. The learning data may be generated as multidimensional vector data in which each item, such as the positive/negative label and the parenthesis type, is converted into numerical values. The learning data generation unit 18 generates learning data for the parenthetical expression of each sample text and outputs it to the machine learning unit 20.
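One way to picture a learning-data record is as a small dictionary of named features, as sketched below. The field names, the "surface/pos" string encoding, and the part-of-speech tags in the usage note are assumptions made for illustration; the patent only specifies which kinds of information are included.

```python
def build_record(bracket_type, inner_top, before, after, window=2, label=None):
    """Build a feature record for one parenthetical expression.

    before / after: lists of (surface, pos) morphemes preceding / following
    the parenthetical, in sentence order. label: "independent", "dependent",
    or None when no teacher information is available.
    """
    record = {"bracket": bracket_type, "inner_top": inner_top}
    nearest_before = before[max(0, len(before) - window):]
    # prev1 is the morpheme immediately before the parenthetical, and so on.
    for k, (surface, pos) in enumerate(reversed(nearest_before), start=1):
        record[f"prev{k}"] = f"{surface}/{pos}"
    for k, (surface, pos) in enumerate(after[:window], start=1):
        record[f"next{k}"] = f"{surface}/{pos}"
    if label is not None:
        # Positive example = independent, negative example = dependent.
        record["label"] = 1 if label == "independent" else -1
    return record
```

For "ワイン（赤・白）を扱う", a call such as `build_record("round", "NP", [("ワイン", "noun")], [("を", "particle"), ("扱う", "verb")], label="independent")` would produce one positive record. Omitting `label` gives a record without teacher information.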

The machine learning unit 20 learns a classification rule for parenthetical expressions based on the learning data generated by the learning data generation unit 18. The classification rule determines whether a parenthetical expression included in a sentence is of the "independent" type, which is separated at the time of parsing, or the "dependent" type, which is not. The machine learning unit 20 may learn the classification rule using a machine learning algorithm such as an SVM (Support Vector Machine) or a CRF (Conditional Random Field). The classification rule may be represented, for example, as a decision surface in the feature vector space of parenthetical expressions, or as the input/output weights of nodes in a neural network; it may take a variety of data representations.
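To make the learning step concrete without external dependencies, the sketch below trains a simple perceptron over one-hot encoded features. The perceptron is a stand-in chosen for brevity, not the SVM or CRF named in the text, and the function names and feature encoding are assumptions.

```python
from collections import defaultdict

def one_hot(features):
    """Turn a {name: value} feature dict into a set of 'name=value' keys."""
    return {f"{name}={value}" for name, value in features.items()}

def train_perceptron(examples, epochs=10):
    """examples: list of (features, label), label +1 independent, -1 dependent."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in examples:
            keys = one_hot(features) | {"bias"}
            score = sum(weights[k] for k in keys)
            if label * score <= 0:  # misclassified or on the boundary
                for k in keys:
                    weights[k] += label
    return weights

def classify(weights, features):
    """Apply the learned weights to one unlabeled feature dict."""
    keys = one_hot(features) | {"bias"}
    score = sum(weights.get(k, 0.0) for k in keys)
    return "independent" if score > 0 else "dependent"
```

The learned weight table plays the role of the classification rule. A production system would more likely use an established SVM implementation with the same kind of feature vectors.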

Next, the functions of the language analysis device 10 of this embodiment that perform parsing using the learned classification rule for parenthetical expressions are described.

The parsing target sentence acquisition unit 22 acquires the text data to be parsed. The text data to be parsed may be a single sentence or may contain multiple sentences.

The parenthesis classification unit 24 classifies each parenthetical expression included in a sentence acquired by the parsing target sentence acquisition unit 22 as either the "independent" type, which is separated from the sentence for parsing, or the "dependent" type, which is parsed without being separated from the sentence. In this embodiment, in the same way as for the learning data of the machine learning unit 20, the parenthesis classification unit 24 generates input data representing each parenthetical expression included in the sentence being processed, and classifies the parenthetical expression by feeding that input data to a classifier, such as an SVM, that has learned the classification rule in the machine learning unit 20.

FIG. 6 shows an example of the input data generated for a parenthetical expression included in the sentence being processed. As shown in FIG. 6, the input data comprises the parenthesis type, the syntax within the parentheses, and surrounding-morpheme information. It differs from the learning data only in that it carries no positive/negative label; the other items are the same as in the learning data. By feeding the generated input data to the classifier that has learned the classification rule, the parenthesis classification unit 24 determines whether the input data corresponds to "independent" (positive) or "dependent" (negative). The parenthesis classification unit 24 performs this classification for each parenthetical expression included in the sentence being processed.

The parsing target setting unit 26 sets the parsing targets for the sentence being processed based on the classification result produced by the parenthesis classification unit 24 for each parenthetical expression. In this embodiment, when the parenthesis classification unit 24 classifies a parenthetical expression as "independent", the parsing target setting unit 26 separates it from the sentence being processed; when a parenthetical expression is classified as "dependent", the unit leaves it in the sentence being processed when setting the parsing target.

The processing of the parenthesis classification unit 24 and the parsing target setting unit 26 is now described using a concrete example. Suppose the parsing target sentence acquisition unit 22 acquires the sentence "変更を伴う由来という語を使っている（evolutionの原義については下の項目を参照のこと）。" ("The term 'descent with modification' is used (see the item below for the original meaning of evolution).") for processing. The parenthesis classification unit 24 generates input data (a feature vector) for the parenthetical expression （evolutionの原義については下の項目を参照のこと）, feeds it to the classifier, and obtains a classification of the parenthetical expression as "independent" or "dependent". If the parenthetical expression is classified as "independent", the parsing target setting unit 26 sets each of the following as a parsing target: (1) 「変更を伴う由来」という語を使っている, and (2) evolutionの原義については下の項目を参照のこと. The syntax analysis unit 16 then parses each character string set as a parsing target by the parsing target setting unit 26 and integrates the parsing results to obtain the final parsing result.

If the sentence being processed is "「変更を伴う由来」という語を使っている。" and the same processing classifies the bracketed expression 「変更を伴う由来」 as "dependent", the parsing target setting unit 26 sets the whole sentence "「変更を伴う由来」という語を使っている。" as the parsing target. When the sentence being processed contains multiple parenthetical expressions, input data (a feature vector) may likewise be generated for each of them and fed to the classifier, and the parsing targets may be set based on the classification results obtained.
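The target-setting behavior in these examples can be sketched as below. The sketch assumes round brackets only (so the corner-bracket case would need an extended pattern) and takes the classifier as a caller-supplied function. `set_parse_targets` and the `classify(sentence, match)` signature are hypothetical.

```python
import re

# Assumption: only full-width and ASCII round brackets count as parentheses.
PAREN_RE = re.compile(r"[（(]([^（()）]*)[）)]")

def set_parse_targets(sentence, classify):
    """Return the main string followed by each separated enclosed string.

    classify(sentence, match) must return "independent" or "dependent".
    """
    main_parts, separated, cursor = [], [], 0
    for match in PAREN_RE.finditer(sentence):
        if classify(sentence, match) == "independent":
            # Cut the parenthetical out of the main string and keep its
            # enclosed text as a separate parsing target.
            main_parts.append(sentence[cursor:match.start()])
            separated.append(match.group(1))
            cursor = match.end()
        # "dependent" parentheticals stay in place in the main string.
    main_parts.append(sentence[cursor:])
    return ["".join(main_parts)] + separated
```

Each returned string would then be parsed separately and the results integrated, as the text describes.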

Next, the flow of processing performed by the language analysis device 10 of this embodiment is described with reference to FIGS. 7 and 8.

FIG. 7 shows a flowchart of the learning process for learning the classification rule for parenthetical expressions. As shown in FIG. 7, the language analysis device 10 acquires a sentence containing one pair of parentheses from the document group (corpus) storing text in the field to be analyzed (S101), and extracts the parenthetical expression from the acquired sentence (S102). Next, the language analysis device 10 determines whether the extracted parenthetical expression is a sentence or a noun phrase (S103), and whether the topmost node of the parsing result is the same with and without the parenthetical expression in the sentence (S104). If the results are Y at S103 and Y at S104, or N at S103 and N at S104, the language analysis device 10 provisionally classifies the extracted parenthetical expression as "independent" (positive example) (S105); otherwise it provisionally classifies it as "dependent" (negative example) (S106).

The language analysis apparatus 10 generates learning data for the parenthesis expression based on the provisional classification result and the feature information of the parenthesis expression (S107). If generation of learning data is to continue (S108: Y), the apparatus returns to S101 and repeats the subsequent processing. If generation is to end (S108: N), the apparatus learns the classification rule for parenthesis expressions from the learning data generated so far (S109) and ends the learning process.
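As a rough sketch of S107 to S109, the learning data can be viewed as pairs of a feature set and a provisional label, from which a classifier is trained. This passage does not name a learning algorithm, so a simple perceptron is used here purely as a stand-in; the function names and the feature strings in the usage note are hypothetical.

```python
from collections import defaultdict

def train_perceptron(learning_data, epochs=10):
    """Learn feature weights from provisionally labelled examples.

    learning_data: list of (features, label) pairs, where features is a set
    of strings and label is +1 ("independent") or -1 ("subordinate").
    The perceptron is an assumed stand-in, not the learner of the source.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in learning_data:
            score = sum(weights[f] for f in features)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Move the weights of the active features toward the label.
                for f in features:
                    weights[f] += label
    return dict(weights)

def classify(weights, features):
    """Classify one parenthesis expression from its feature set."""
    score = sum(weights.get(f, 0.0) for f in features)
    return "independent" if score > 0 else "subordinate"
```

A call such as `train_perceptron([({"prev=punct"}, 1), ({"prev=noun"}, -1)])` returns a weight dictionary that `classify` can then apply to the feature set of a new parenthesis expression.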

FIG. 8 is a flowchart of the process for setting parsing targets. As shown in FIG. 8, the language analysis apparatus 10 acquires a sentence to be processed (S201) and extracts a parenthesis expression contained in it (S202). The language analysis apparatus 10 generates input data from the feature information of the extracted parenthesis expression (S203), and classifies the parenthesis expression by feeding the input data to the classifier that learned the classification rule through the learning process of FIG. 7 (S204).

When the classification result is "independent" (S205: Y), the language analysis apparatus 10 sets the parsing targets with the parenthesis expression separated from the sentence being processed (S206). When the result is "subordinate" (S205: N), it sets the parsing target with the parenthesis expression left in the sentence (S207). The language analysis apparatus 10 then determines whether any unprocessed parenthesis expression remains in the sentence (S208). If one remains (S208: Y), it returns to S202 and repeats the subsequent processing. If none remains (S208: N), it executes syntax analysis on each character string set as a parsing target (S209) and ends the processing.
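The loop over S202 to S208 might be sketched as follows. The bracket pattern, the function name `set_parse_targets`, and the choice to strip the brackets from a separated expression are all assumptions made for illustration; the classifier is passed in as a callable so that any learned rule can be plugged in.

```python
import re

# Assumed bracket pairs: full-width and ASCII parentheses, and corner brackets.
BRACKET = re.compile(r"（[^（）]*）|\([^()]*\)|「[^「」]*」")

def set_parse_targets(sentence, classify):
    """Return the strings to be parsed for one sentence.

    classify(sentence, expression) must return "independent" or "subordinate".
    The first element is the sentence with every independent expression
    removed; the remaining elements are the separated expressions.
    """
    kept = []       # parts of the sentence that stay in the main target
    separated = []  # inner text of expressions classified as independent
    pos = 0
    for match in BRACKET.finditer(sentence):
        if classify(sentence, match.group(0)) == "independent":
            kept.append(sentence[pos:match.start()])
            separated.append(match.group(0)[1:-1])  # drop the bracket characters
            pos = match.end()
        # A subordinate expression is simply left inside the sentence.
    kept.append(sentence[pos:])
    return ["".join(kept)] + separated
```

With a classifier that always answers "subordinate", the function returns the sentence unchanged as its only parsing target, which corresponds to the S207 branch.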

In the language analysis apparatus according to the present embodiment described above, sample text contained in a document collection such as a corpus is classified according to a predetermined classification rule, and the classification of the feature vectors of the parenthesis expressions in the sample text is machine-learned with that classification result as teacher information. This improves classification accuracy compared with classifying parenthesis expressions solely by whether they match the predetermined classification rule.

The present invention is not limited to the above embodiment. For example, a document group (corpus) may be selected based on attributes of the sentence to be analyzed, such as the field to which it belongs or its format, and the parenthesis expressions contained in the sentence may be classified using a classification rule learned from the selected document group (corpus).

Description of Symbols: 10 language analysis apparatus, 12 document group storage unit, 14 learning sample extraction unit, 16 syntax analysis unit, 18 learning data generation unit, 20 machine learning unit, 22 parsing target sentence acquisition unit, 24 parenthesis classification unit, 26 parsing target setting unit.

Claims

A language analysis apparatus comprising:
acquisition means for acquiring sentence information that includes a parenthesis expression from storage means storing sentence information;
provisional classification means for provisionally classifying, according to a predetermined rule, the parenthesis expression included in the sentence information acquired by the acquisition means into a first type that is separated from the sentence information or a second type that is not separated;
learning means for learning, based on the provisional classification result of the parenthesis expression by the provisional classification means, a rule for classifying parenthesis expressions into the first type and the second type; and
classification means for classifying, based on the rule learned by the learning means, a parenthesis expression included in a given sentence into the first type or the second type.

The language analysis apparatus according to claim 1, further comprising setting means for setting a target of syntax analysis by separating a parenthesis expression classified into the first type by the classification means from the sentence information that includes the parenthesis expression.

The language analysis apparatus according to claim 1 or 2, wherein the learning means learns a rule for classifying feature information of the parenthesis expression into the first type and the second type, using the provisional classification result of the parenthesis expression by the provisional classification means as teacher information.

The language analysis apparatus according to claim 3, wherein the feature information of the parenthesis expression is generated based on morpheme information of character strings surrounding the parenthesis expression.

The language analysis apparatus according to any one of claims 1 to 3, wherein, for each piece of sentence information acquired by the acquisition means, the provisional classification means provisionally classifies the parenthesis expression into the first type or the second type based on a result of comparing the syntax information of the sentence information with the syntax information of the character string information obtained by removing the parenthesis expression from the sentence information, and on a result of determining whether the syntax information of the parenthesis expression satisfies a predetermined condition.

A program for causing a computer to function as:
acquisition means for acquiring sentence information that includes a parenthesis expression from storage means storing sentence information;
provisional classification means for provisionally classifying, according to a predetermined rule, the parenthesis expression included in the sentence information acquired by the acquisition means into a first type that is separated from the sentence information or a second type that is not separated;
learning means for learning, based on the provisional classification result of the parenthesis expression by the provisional classification means, a rule for classifying parenthesis expressions into the first type and the second type; and
classification means for classifying, based on the rule learned by the learning means, a parenthesis expression included in a given sentence into the first type or the second type.