JPWO2019021804A1 - Information processing apparatus, information processing method, and program

Info

Publication number: JPWO2019021804A1
Application number: JP2019532489A
Authority: JP
Inventors: 政明星野; 由紀子荒川; 澁谷　崇; 崇澁谷; 亮介三谷
Original assignee: Sony Corp
Current assignee: Sony Corp
Priority date: 2017-07-24
Filing date: 2018-07-10
Publication date: 2020-05-28
Also published as: WO2019021804A1

Abstract

The present disclosure relates to an information processing apparatus, an information processing method, and a program that reduce the burden of creating the corpus needed to develop a semantic analyzer and thereby lower the development cost. A manually created IND (In Domain) corpus is subjected to predicate-argument structure analysis, replacement locations are set, a case frame dictionary is searched for words similar to the word at each replacement location, and the word at the replacement location is replaced with the retrieved words to generate a corpus. The present disclosure generates a corpus used for training a semantic analyzer and can be applied to a corpus generation device.

Description

The present disclosure relates to an information processing apparatus, an information processing method, and a program, and in particular to an information processing apparatus, an information processing method, and a program that make it possible to reduce the development cost of a corpus that contributes to improving the accuracy of a semantic analyzer.

A spoken dialogue system converts the content of an utterance into text data, performs semantic analysis on the text data, and recognizes the content of the utterance.

To recognize the utterance content, a semantic analyzer is used that analyzes and recognizes the utterance content through machine learning using a corpus (a collection of example sentences).

The semantic analyzer analyzes and recognizes utterance content through machine learning using a corpus of the utterances that each application program should handle.

Among spoken dialogue systems, multi-domain spoken dialogue systems are widely used so that a single system can handle multiple topics, tasks, and application programs, such as weather inquiries, schedule confirmation, and music playback.

A multi-domain spoken dialogue system must make it easy to add a semantic analysis function for a new domain. For this reason, methods (architectures) that build a semantic analysis system by combining the semantic analyzers of individual domains have been widely proposed. Such an architecture consists of semantic analysis systems for multiple domains and a domain selector (frame estimator) that integrates them (Non-Patent Document 1).

A multi-domain spoken dialogue system requires a technique for realizing semantic analyzers that can recognize the utterance content of diverse application programs by learning from the corpus required for each domain.

It is also necessary to prevent semantic analysis processing from transitioning to the wrong domain when an unexpected utterance (Out of Domain utterance; hereinafter also referred to as an OOD utterance) is received. Ideally, an OOD corpus would be prepared and the frame estimator retrained, but because developing an OOD corpus takes many man-hours, various approaches have been discussed, such as estimation that also makes use of the dialogue history (see Non-Patent Document 2).

Non-Patent Document 1: Bor-shen Lin, Hsin-min Wang, et al., 1998, "A DISTRIBUTED ARCHITECTURE FOR COOPERATIVE SPOKEN DIALOGUE AGENTS WITH COHERENT DIALOGUE STATE AND HISTORY"
Non-Patent Document 2: Ian R. Lane et al., 2004, "Out-of-Domain Utterance Detection using Classification Confidences of Multiple Topics"

However, the corpora according to Non-Patent Documents 1 and 2 are created manually, and recognizing the utterance content of diverse application programs requires ever more corpora, so the workload of creating corpora has become a major burden on the development cost of semantic analyzers.

The present disclosure has been made in view of such circumstances and, in particular, reduces the development cost of a semantic analyzer by making it possible to efficiently develop the corpus required for learning.

An information processing apparatus according to one aspect of the present disclosure includes a structure analysis unit that analyzes the structure of an input sentence, a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit, and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.

The input sentence can be an IND (In Domain) judgment sentence, which is utterance content that a predetermined application program should handle.

The structure analysis unit can analyze the predicate-argument structure of the input sentence, and the replacement location setting unit can set the replacement location in the input sentence based on the predicate-argument structure that is the analysis result of the structure analysis unit.

A dictionary query unit may further be included that queries a dictionary to search for candidates to replace the word at the replacement location in the input sentence, and the corpus generation unit may replace the word at the replacement location in the input sentence with a word retrieved by the dictionary query unit.

The dictionary can be a case frame dictionary.

The replacement location setting unit can set, based on the predicate-argument structure that is the analysis result of the structure analysis unit, both the replacement location in the input sentence and the replacement method for that location, and the corpus generation unit can generate a corpus by replacing the word at the replacement location in the input sentence using that replacement method.

The replacement methods can include a first method that fixes the predicate of the input sentence and replaces the noun serving as the predicate argument containing the target case, and a second method that fixes the predicate argument containing the target case of the input sentence and replaces the predicate.
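The two replacement methods can be illustrated with a minimal Python sketch. The `ParsedSentence` representation (one predicate plus one target-case argument), the `SIMILAR_NOUNS` and `SIMILAR_PREDICATES` tables standing in for a case frame dictionary lookup, the example words, and all function names are hypothetical; none of them is taken from the disclosure.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ParsedSentence:
    predicate: str  # the predicate (the "Action"), e.g. "play"
    target: str     # the target-case argument (the "Category"), e.g. "music"


# Toy stand-ins for a case frame dictionary lookup (assumed data).
SIMILAR_NOUNS = {"music": ["a movie", "the radio"]}
SIMILAR_PREDICATES = {"play": ["stop", "record"]}


def replace_argument(sentence):
    """First method: keep the predicate, replace the target-case noun."""
    return [replace(sentence, target=noun)
            for noun in SIMILAR_NOUNS.get(sentence.target, [])]


def replace_predicate(sentence):
    """Second method: keep the target-case argument, replace the predicate."""
    return [replace(sentence, predicate=verb)
            for verb in SIMILAR_PREDICATES.get(sentence.predicate, [])]


def render(sentence):
    """Turn the structured form back into a surface string."""
    return f"{sentence.predicate} {sentence.target}"


seed = ParsedSentence("play", "music")
generated = [render(s) for s in replace_argument(seed) + replace_predicate(seed)]
```

In this sketch each seed sentence yields one new sentence per dictionary hit, so the size of the generated corpus grows with the coverage of the dictionary rather than with manual effort.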

A classification unit may further be included that classifies the corpus generated by the corpus generation unit into IND (In Domain) judgment sentences, which are utterance content that a predetermined application program should handle, or OOD (Out of Domain) judgment sentences, which are unexpected utterance content that the predetermined application program should not handle.

A COOD judgment sentence extraction unit may further be included that extracts, from the corpus classified as IND judgment sentences, corpus entries that are OOD judgment sentences and lie near the boundary with the IND judgment sentences in the feature space expressed by their respective features, as COOD (Close OOD) judgment sentences.

In a domain containing the corpus classified as IND judgment sentences, the COOD judgment sentence extraction unit can extract from the domain, as COOD judgment sentences, those corpus entries in which the number of words not contained in its own or other corpora exceeds a predetermined number.

The COOD judgment sentence extraction unit can extract from the domain, as COOD judgment sentences, those corpus entries whose non-occurrence, expressed as the ratio of the number of words not contained in its own or other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.
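The ratio-based non-occurrence test can be sketched as follows. This is one possible reading, under the assumption that the ratio is taken per sentence (unseen words divided by the words in that sentence) and that "known" words are supplied as a plain set; the function names, the whitespace tokenization, and the example data are hypothetical.

```python
def non_occurrence(sentence_words, known_vocabulary):
    """Fraction of the sentence's words that are absent from the known vocabulary."""
    if not sentence_words:
        return 0.0
    unseen = sum(1 for word in sentence_words if word not in known_vocabulary)
    return unseen / len(sentence_words)


def extract_cood(candidates, known_vocabulary, threshold):
    """Keep candidates whose non-occurrence is strictly above the threshold."""
    return [c for c in candidates
            if non_occurrence(c.split(), known_vocabulary) > threshold]


vocabulary = {"what", "is", "the", "weather", "in", "tokyo", "tomorrow"}
candidates = ["what is the weather in tokyo", "what is the menu in osaka"]
cood = extract_cood(candidates, vocabulary, threshold=0.3)
```

Here the second candidate contains two unseen words out of six, so it exceeds the assumed threshold of 0.3 and is extracted, while the first candidate is made up entirely of known words and is left in place.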

The COOD judgment sentence extraction unit can extract from the domain, as COOD judgment sentences, those corpus entries of the domain whose word non-occurrence, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is higher than a predetermined value.
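For reference, a TF/IDF score can be computed as in the sketch below. This passage does not give the exact formula, so the textbook definition is assumed here: TF is the relative frequency of a word within one sentence and IDF is log(N / df), where N is the number of sentences and df the number of sentences containing the word. The function name and the whitespace tokenization are hypothetical.

```python
import math
from collections import Counter


def tf_idf_scores(documents):
    """Return, for each document, a dict mapping word -> TF * IDF."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in tf.items()
        })
    return scores


scores = tf_idf_scores(["play music", "play radio"])
```

Under this definition a word that appears in every sentence scores zero, while a word confined to a single sentence scores high, which is why a high score can serve as a sign that a sentence uses vocabulary unusual for the domain.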

The COOD judgment sentence extraction unit can calculate a Perplexity value for the corpus of the domain and discard, as non-sentences, entries whose Perplexity value is higher than a predetermined value.
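The perplexity filter can be sketched with a small bigram language model. The choice of a bigram model, the add-one smoothing, the `<s>` and `</s>` boundary markers, the function names, and the threshold are all assumptions made for illustration; this passage only states that entries with a Perplexity value above a predetermined value are discarded as non-sentences.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Count bigram contexts and bigrams over whitespace-tokenized sentences."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def perplexity(sentence, model):
    """Per-token perplexity under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size)
        log_prob += math.log(prob)
    return math.exp(-log_prob / (len(tokens) - 1))


def drop_non_sentences(candidates, model, max_perplexity):
    """Discard candidates whose perplexity exceeds the threshold."""
    return [c for c in candidates if perplexity(c, model) <= max_perplexity]


model = train_bigram(["play the music", "stop the music"])
kept = drop_non_sentences(["play the music", "music the play"], model, 5.0)
```

With this toy model the scrambled candidate "music the play" has a clearly higher perplexity than the well-formed one, so only "play the music" survives the assumed threshold of 5.0.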

A CIND judgment sentence extraction unit may further be included that extracts, from the corpus classified as OOD judgment sentences, corpus entries that are IND judgment sentences and lie near the boundary with the OOD judgment sentences in the feature space expressed by their respective features, as CIND (Close IND) judgment sentences.

In a domain containing the corpus classified as OOD judgment sentences, the CIND judgment sentence extraction unit can extract, from all corpus entries classified as OOD judgment sentences, those in which the number of words contained in the IND corpus exceeds a predetermined number, as CIND judgment candidate sentences.

The CIND judgment sentence extraction unit can extract, as CIND judgment candidate sentences, those corpus entries whose non-occurrence, expressed as the ratio of the number of words not contained in the IND corpus to the number of words contained in the corpus of the domain, is lower than a predetermined number.

The CIND judgment sentence extraction unit can extract, as CIND judgment sentences, those corpus entries of the domain whose word non-occurrence, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.

The CIND judgment sentence extraction unit can calculate a Perplexity value for the corpus of the domain and discard, as non-sentences, entries whose Perplexity value is higher than a predetermined value.

An information processing method according to one aspect of the present disclosure includes the steps of analyzing the structure of an input sentence, setting a replacement location in the input sentence based on the analysis result of the structure, and generating a corpus by replacing the word at the replacement location in the input sentence.

A program according to one aspect of the present disclosure causes a computer to function as a structure analysis unit that analyzes the structure of an input sentence, a replacement location setting unit that sets a replacement location in the input sentence based on the analysis result of the structure analysis unit, and a corpus generation unit that generates a corpus by replacing the word at the replacement location in the input sentence.

In one aspect of the present disclosure, the structure of an input sentence is analyzed, a replacement location in the input sentence is set based on the analysis result of the structure, and the word at the replacement location in the input sentence is replaced to generate a corpus.

According to one aspect of the present disclosure, it is possible in particular to reduce the development cost of a semantic analyzer.

FIG. 1 is a diagram explaining the operation of the semantic analysis unit.
FIG. 2 is a diagram explaining COOD judgment sentences and CIND judgment sentences.
FIG. 3 is a diagram explaining a configuration example of the corpus generation device of the present disclosure.
FIG. 4 is a diagram explaining a configuration example of the filtering processing unit of FIG. 3.
FIG. 5 is a diagram explaining the process by which a corpus is generated and the types of corpus generated.
FIG. 6 is a diagram explaining similar sentences and dissimilar sentences.
FIG. 7 is a flowchart explaining corpus generation processing.
FIG. 8 is a diagram explaining an analysis result of a predicate-argument structure.
FIG. 9 is a diagram explaining a setting example of the replacement method.
FIG. 10 is a diagram explaining a setting example of the replacement location.
FIG. 11 is a diagram explaining a setting example of the replacement location.
FIG. 12 is a diagram explaining a storage example of the analysis result of a Japanese predicate-argument structure.
FIG. 13 is a diagram explaining an example of the word group that replaces the target case, using the predicate-argument structure analysis result of FIG. 12, when the replacement method in Japanese is the Action-fixed Category replacement method.
FIG. 14 is a diagram explaining an example of the word group that replaces the predicate, using the predicate-argument structure analysis result of FIG. 12, when the replacement method in Japanese is the Category-fixed Action replacement method.
FIG. 15 is a diagram explaining an example of replaced OOD candidate sentences in Japanese.
FIG. 16 is a diagram explaining a storage example of the analysis result of a predicate-argument structure in English, corresponding to the predicate-argument structure analysis result of FIG. 12.
FIG. 17 is a diagram explaining an example of the word group that replaces the target case, using the predicate-argument structure analysis result of FIG. 16, when the replacement method in English is the Action-fixed Category replacement method.
FIG. 18 is a diagram explaining an example of the word group that replaces the predicate, using the predicate-argument structure analysis result of FIG. 16, when the replacement method in English is the Category-fixed Action replacement method.
FIG. 19 is a diagram explaining an example of replaced OOD candidate sentences in English.
FIG. 20 is a diagram explaining the analysis result of deep case analysis and the analysis result of surface case analysis.
FIG. 21 is a flowchart explaining filtering processing.
FIG. 22 is a flowchart explaining COOD corpus extraction processing.
FIG. 23 is a diagram explaining the Perplexity value and the n-gram probability value.
FIG. 24 is a diagram explaining non-occurrence.
FIG. 25 is a diagram explaining the TF value and the TF/IDF value.
FIG. 26 is a flowchart explaining CIND corpus extraction processing.
FIG. 27 is a diagram explaining a configuration example of a general-purpose personal computer.

Preferred embodiments of the present disclosure are described in detail below with reference to the accompanying drawings. In this specification and the drawings, constituent elements having substantially the same functional configuration are given the same reference numerals, and duplicate description is omitted.

<About the semantic analysis system>
Before describing the corpus generation device to which the technique of the present disclosure is applied, a semantic analysis system that uses the corpus generated by the corpus generation device is described.

The semantic analysis system recognizes a user's utterance and causes the corresponding application program to run.

As shown in FIG. 1, for example, the semantic analysis system comprises an input reception unit 11, a speech recognition unit 12, a frame estimation unit 13, semantic analysis units 14-1 to 14-3 for the respective domains, and application programs 18 to 20.

The input reception unit 11 accepts the user's utterance as a speech signal (speech) and outputs it to the speech recognition unit 12.

The speech recognition unit 12 recognizes the speech signal, converts it into a text string (text), and outputs it to the frame estimation unit 13.

Based on the text string, the frame estimation unit 13 transitions processing to whichever of the downstream semantic analysis units 14-1 to 14-3 belongs to the most suitable application (domain). The frame estimation unit 13 also rejects text strings that do not belong to any domain.

Based on the text string, the semantic analysis units 14-1 to 14-3 analyze attributes (attribute) and their corresponding values (value), and supply the corresponding analysis results 15 to 17 to the application program that is the target of the action (action): the weather guidance application program 18, the schedule confirmation application program 19, or the music playback application program 20, respectively. When there is no particular need to distinguish them, the semantic analysis units 14-1 to 14-3 are referred to simply as the semantic analysis unit 14.

More specifically, when an utterance input V1 such as "What is the weather in Tokyo tomorrow?" is accepted as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "what is the weather in Tokyo tomorrow" and supplies it to the frame estimation unit 13.

When the frame estimation unit 13 determines, for example, that the utterance input is addressed to the weather guidance application program 18, it transitions processing to the semantic analysis unit 14-1 of the weather guidance domain.

Based on this text string, the semantic analysis unit 14-1 analyzes the weather-related utterance and supplies to the application program 18 an analysis result 15 indicating that the information for where (where) is "Tokyo" and the information for when (when) is "tomorrow".

Based on the analysis result 15, the weather guidance application program 18 displays tomorrow's weather guidance for Tokyo.

Likewise, when an utterance input V2 such as "What is the schedule from 15:00 today?" is accepted as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "what is the schedule from 15:00 today" and supplies it to the frame estimation unit 13.

When the frame estimation unit 13 determines, for example, that the utterance input is addressed to the schedule confirmation application program 19, it transitions processing to the semantic analysis unit 14-2 of the schedule confirmation domain.

Based on this text string, the semantic analysis unit 14-2 analyzes the schedule-related utterance and supplies to the application program 19 an analysis result 16 indicating, for example, that the utterance input is addressed to the schedule confirmation application program 19, that the information for the date (date) is "today", and that the information for the time (time) is "15:00".

Based on the analysis result 16, the schedule confirmation application program 19 displays the schedule for 15:00 today.

Further, when an utterance input V3 such as "Play Naka Higashino's new song!" is accepted as a speech signal by the input reception unit 11, the speech recognition unit 12 recognizes it as the text string "play Naka Higashino's new song" and supplies it to the frame estimation unit 13.

When the frame estimation unit 13 determines, for example, that the utterance input is addressed to the music playback application program 20, it transitions processing to the semantic analysis unit 14-3 of the music playback domain.

Based on this text string, the semantic analysis unit 14-3 analyzes the music-related utterance and supplies to the application program 20 an analysis result 17 indicating, for example, that the utterance input is addressed to the music playback application program 20, that the information for the artist (artist) is "Naka Higashino", and that the information for the song (music) is "new song".

Based on the analysis result 17, the music playback application program 20 plays Naka Higashino's new song.
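The flow from frame estimation to domain-specific semantic analysis can be sketched as follows. The keyword cues, the regular expressions, and every function name here are hypothetical stand-ins: in the system described, both the frame estimation unit and the semantic analysis units are learned from corpora rather than hand-written. Only the attribute names (where, when, date, time) follow the examples above.

```python
import re

# Hypothetical keyword cues per domain; a real frame estimator would be learned.
DOMAIN_CUES = {"weather": {"weather"}, "schedule": {"schedule"}}


def estimate_frame(text):
    """Return the first domain whose cue occurs in the text, or None to reject."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    for domain, cues in DOMAIN_CUES.items():
        if words & cues:
            return domain
    return None


def _first_group(pattern, text):
    match = re.search(pattern, text)
    return match.group(1) if match else None


def analyze_weather(text):
    """Extract the 'where' and 'when' attributes of a weather utterance."""
    return {"where": _first_group(r"\bin (\w+)", text),
            "when": _first_group(r"\b(today|tomorrow)\b", text)}


def analyze_schedule(text):
    """Extract the 'date' and 'time' attributes of a schedule utterance."""
    return {"date": _first_group(r"\b(today|tomorrow)\b", text),
            "time": _first_group(r"\b(\d{1,2}:\d{2})\b", text)}


ANALYZERS = {"weather": analyze_weather, "schedule": analyze_schedule}


def handle(text):
    """Route the text to its domain analyzer; out-of-domain text is rejected."""
    domain = estimate_frame(text)
    if domain is None:
        return None
    return domain, ANALYZERS[domain](text)
```

For instance, `handle("What is the weather in Tokyo tomorrow?")` is routed to the weather analyzer, whereas an utterance matching no cue returns `None`, which corresponds to the rejection of text that belongs to no domain.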

Here, in order to analyze information consisting of text strings, the semantic analysis unit 14 performs machine learning using a corpus, which is a collection of example sentences.

Corpora are mainly divided into a corpus (also referred to as an IND corpus) consisting of utterance content that the application program should handle (In Domain utterances; hereinafter also referred to as IND), and a corpus (also referred to as an OOD corpus) consisting of utterance content that the application program cannot handle (Out of Domain utterances; hereinafter also referred to as OOD utterances).

By learning the IND corpus, the semantic analysis unit 14 becomes able to analyze and recognize the utterance content that the application program should handle. Similarly, by learning the OOD corpus together with the IND corpus, the frame estimation unit 13 becomes able to transition processing to the semantic analysis unit 14 of the correct domain and to reject unexpected utterances. That is, by learning the IND corpus and the OOD corpus, the frame estimation unit 13 can distinguish utterances to be handled from utterances to be rejected and recognize utterance content appropriately.

Training both the frame estimation unit 13 and the semantic analysis unit 14 requires a large corpus, but corpora are generally created manually.

For example, even for one specific service such as a weather guidance application program, many different phrasings are conceivable; although it depends on the algorithm, 1000 or more example utterances are generally required.

Furthermore, as the number of service types increases, a corpus must be created for each of the application programs corresponding to the added service types.

However, creating a corpus manually incurs enormous cost and is a major burden on development. In particular, because OOD utterances must be prepared for each domain for the frame estimation unit 13, roughly twice the total number of corpus entries must be prepared, for example.

A widely used alternative is to have a software program replace some of the words and phrases in the sentences that make up a corpus. However, specifying the parts to be swapped and coding rule statements for the replaced words pattern by pattern is laborious, and the resulting sentences must then be judged and sorted by hand again, which adds further work cost and a heavy burden.

A method of creating a corpus by collecting similar sentences from the vast amount of text on the Internet has also been proposed, but most text on the Internet is written language, and examples of spoken utterances are few.

<About COOD judgment sentences>
Because recognition accuracy is limited, a corpus determined to be an IND corpus by the frame estimation unit 13 or the semantic analysis unit 14 may partly contain entries that are regarded as OOD corpus. Such OOD judgment sentences were IND judgment sentences that came to be treated as OOD judgment sentences through misjudgment, and can therefore be considered corpus entries close to IND judgment sentences.

For example, consider expressing the semantic characteristics of corpus entries, using features 1 and 2 of the words they contain, as a distribution in a feature space such as the one shown in FIG. 2. In the example of FIG. 2, the black circles show the distribution of corpus entries regarded as IND judgment sentences, and the crosses show the distribution of corpus entries regarded as OOD judgment sentences.

Among the corpus entries regarded as OOD judgment sentences, shown by crosses in FIG. 2, those located near the entries regarded as IND judgment sentences can be considered similar to IND judgment sentences. Among the entries regarded as OOD judgment sentences, those located near the distribution of entries regarded as IND judgment sentences are called COOD (Close Out of Domain) judgment sentences.

A COOD judgment sentence is an OOD judgment sentence, yet it resembles an IND judgment sentence; in other words, it can be thought of as a corpus entry that looks like an IND judgment sentence but is not one. Put another way, a COOD judgment sentence is an expression that is very easy to confuse with an IND judgment sentence and is therefore prone to misjudgment.
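The idea of an OOD entry lying close to the IND distribution can be made concrete with a small sketch. The disclosure describes COOD entries only as lying near the IND entries in the feature space; the use of Euclidean distance to the nearest IND point, the `radius` parameter, and the function name are assumptions made for illustration, and `ind_points` is assumed to be non-empty.

```python
import math


def close_ood(ind_points, ood_points, radius):
    """Return the OOD points whose nearest IND point lies within `radius`."""
    return [
        point for point in ood_points
        if min(math.dist(point, ind) for ind in ind_points) <= radius
    ]


# Two-dimensional feature vectors (feature 1, feature 2), as in FIG. 2.
ind_points = [(0.0, 0.0), (1.0, 0.0)]
ood_points = [(1.5, 0.0), (5.0, 5.0)]
cood_points = close_ood(ind_points, ood_points, radius=1.0)
```

In this toy data the OOD point at (1.5, 0.0) is only 0.5 away from an IND point and is therefore picked out as a COOD candidate, while the point at (5.0, 5.0) is far from every IND point and remains ordinary OOD.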

To improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14, it is important to train them on a necessary and sufficient amount of corpus consisting of such COOD determination sentences. The corpus generation device of the present disclosure makes it possible to generate, easily and in large quantities, corpora that include the COOD corpus, which is particularly effective for improving recognition accuracy, and thereby reduces the burden of corpus development.

<Configuration Example of the Corpus Generation Device of the Present Disclosure>
The corpus generation device of the present disclosure reduces the development cost of the frame estimation unit 13 and the semantic analysis unit 14 by making it possible to efficiently generate the corpus needed to train components corresponding to the frame estimation unit 13 and the semantic analysis unit 14 of FIG. 1.

FIG. 3 shows a configuration example of an embodiment of the corpus generation device of the present disclosure.

The corpus generation device 51 accepts, as input sentences, a corpus of IND sentences created manually or by some other method, generates a corpus of replacement-generated sentences by replacing words on the basis of language analysis and the like, and classifies the generated corpus by filtering processing into COOD determination sentences, OOD determination sentences, IND determination sentences, and CIND determination sentences. Here, a CIND (Close IND) determination sentence is a corpus that was classified as an OOD determination sentence but is similar to an IND determination sentence. Machine learning on the corpus of CIND determination sentences can likewise improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14.

The corpus generation device 51 includes an IND sentence reception unit 101, a language analysis unit 102, a replacement location setting unit 103, a dictionary inquiry unit 104, a replacement execution unit 105, a compound sentence elimination unit 106, a case frame dictionary 107, a generation condition setting data storage unit 108, a replacement-generated sentence storage unit 109, and a filtering processing unit 110.

The IND sentence reception unit 101 accepts input of IND sentences created manually or by other methods and outputs them to the language analysis unit 102.

The language analysis unit 102 analyzes the morphemes, phrases, and predicate-argument structure of each IND sentence output from the IND sentence reception unit 101 and outputs the analysis results to the replacement location setting unit 103.

The replacement location setting unit 103 sets replacement conditions on the basis of the predicate-argument structure analysis results for the IND sentences and outputs them to the dictionary inquiry unit 104.

The dictionary inquiry unit 104 queries the case frame dictionary 107, searches, using the set replacement method, for words for the replacement position that was set in the IND determination sentence on the basis of the replacement conditions, and outputs the search results to the replacement execution unit 105.

The replacement execution unit 105 generates a new corpus by replacing the word at the replacement position set on the basis of the replacement conditions with a word retrieved using the set replacement method. At this time, it adjusts the sentence ending and the like of the newly generated corpus on the basis of the generation condition setting data stored in the generation condition setting data storage unit 108. The sentences generated as a result of this processing are output to the compound sentence elimination unit 106.

The compound sentence elimination unit 106 determines whether each corpus output from the replacement execution unit 105 is a compound sentence; if it is, the unit regards it as a discard determination sentence and discards it. If the corpus output from the replacement execution unit 105 is not a compound sentence, the compound sentence elimination unit 106 stores it in the replacement-generated sentence storage unit 109 as a replacement-generated sentence.

The case frame dictionary 107 is a dictionary in which predicates are classified by word sense (case frame) and, for each sense, the words that attach to the predicate in a particular case are collected. The dictionary inquiry unit 104 searches it for words that match the sense of the case frame set for the replacement location.

The generation condition setting data storage unit 108 stores generation condition setting data, that is, data on the conditions under which the replacement execution unit 105 adjusts the corpus it generates. The replacement execution unit 105 adjusts the replaced corpus on the basis of the generation condition setting data in the generation condition setting data storage unit 108.

The filtering processing unit 110 classifies the corpus stored in the replacement-generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. A configuration example of the filtering processing unit 110 is described in detail later with reference to FIG. 4.

<Filtering Processing Unit>
Next, a configuration example of the filtering processing unit 110 of FIG. 3 is described with reference to FIG. 4.

The filtering processing unit 110 includes a semantic analyzer 131, an IND determination sentence storage unit 132, a COOD corpus extraction unit 133, a COOD determination sentence storage unit 134, a confirmed IND determination sentence storage unit 135, an OOD determination sentence storage unit 136, a CIND corpus extraction unit 137, a CIND determination sentence storage unit 138, and a confirmed OOD determination sentence storage unit 139.

The semantic analyzer 131 is a semantic analyzer (corresponding to the semantic analysis unit 14 of FIG. 1) trained on an old version of the corpus (the corpus generated so far). For each corpus of replacement-generated sentences stored in the replacement-generated sentence storage unit 109, it determines whether the sentence is utterance content that a predetermined application program should handle (hereinafter also called an IND determination sentence) or utterance content that the predetermined application program cannot handle and should reject (hereinafter also called an OOD determination sentence). The semantic analyzer 131 stores IND determination sentences in the IND determination sentence storage unit 132 and OOD determination sentences in the OOD determination sentence storage unit 136.

The COOD (Close OOD) corpus extraction unit 133 extracts, from the IND determination sentences stored in the IND determination sentence storage unit 132, the corpora classified as COOD determination sentences and stores them in the COOD determination sentence storage unit 134, and stores the remaining IND determination sentences in the confirmed IND determination sentence storage unit 135 as confirmed IND determination sentences. The COOD corpus extraction unit 133 also discards, as discard determination sentences, those corpora among the IND determination sentences that are non-sentences and are classified neither as COOD determination sentences nor as confirmed IND determination sentences. Here, a COOD determination sentence is an OOD determination sentence that lies near the boundary in the determination against IND determination sentences. COOD determination sentences are described in detail later with reference to FIG. 5.

The COOD corpus extraction unit 133 includes a non-sentence determination unit 133a and a non-occurrence determination unit 133b. It controls the non-sentence determination unit 133a to determine whether each corpus is a non-sentence, regards non-sentences as discard determination sentences, and discards them. It also controls the non-occurrence determination unit 133b to determine, on the basis of the non-occurrence of each corpus not regarded as a non-sentence, whether the IND determination sentence is a COOD determination sentence or a confirmed IND determination sentence. The COOD determination sentences are extracted and stored in the COOD determination sentence storage unit 134, and the confirmed IND determination sentences are stored in the confirmed IND determination sentence storage unit 135.

The non-sentence determination unit 133a calculates, for each corpus of IND determination sentences, a Perplexity value, which is an index of how much the text resembles a meaningful sentence. It determines whether the corpus is a non-sentence by comparing the Perplexity value with a predetermined threshold, and discards corpora regarded as non-sentences as discard determination sentences.
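The non-sentence check described above can be sketched as follows. The text does not specify the language model, the threshold, or on which side of the threshold a non-sentence falls, so this is only an illustration under stated assumptions: a toy bigram model with add-one smoothing over pre-tokenized sentences, and the common convention that a higher perplexity means a less sentence-like text. All function names are hypothetical.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over tokenized sentences (toy model, an assumption)."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def perplexity(tokens, unigrams, bigrams):
    """Perplexity = exp of the mean negative log-probability, with add-one smoothing."""
    vocab = len(unigrams) + 1  # +1 reserves probability mass for unseen words
    padded = ["<s>"] + tokens + ["</s>"]
    log_prob = 0.0
    for prev, word in zip(padded, padded[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(padded) - 1))

def is_non_sentence(tokens, unigrams, bigrams, threshold):
    """Assumed rule: treat the text as a non-sentence when perplexity exceeds the threshold."""
    return perplexity(tokens, unigrams, bigrams) > threshold

train = [["set", "an", "alarm", "at", "six"], ["set", "a", "timer", "at", "eight"]]
uni, bi = train_bigram(train)
fluent = perplexity(["set", "an", "alarm", "at", "eight"], uni, bi)
scrambled = perplexity(["at", "alarm", "eight", "an", "set"], uni, bi)
```

With this toy model the scrambled word order receives a higher perplexity than the fluent one, so a threshold placed between the two values would discard only the scrambled text.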

For each corpus regarded as not being a non-sentence, the non-occurrence determination unit 133b calculates a parameter indicating non-occurrence, that is, whether the corpus contains words that rarely appear in the corpus group of the domain to which it was assigned as an IND determination sentence. By comparing this parameter with a predetermined threshold, the non-occurrence determination unit 133b regards corpora containing such rarely appearing words as COOD determination sentences and extracts them. It then stores the extracted COOD determination sentences in the COOD determination sentence storage unit 134, regards the remaining IND determination sentences as confirmed IND determination sentences, and stores them in the confirmed IND determination sentence storage unit 135.
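A minimal sketch of this split is shown below. The text says only that a parameter indicating non-occurrence is compared with a threshold; it does not define the parameter. Here it is assumed, purely for illustration, to be the share of tokens in a sentence that occur fewer than `min_count` times in the domain's corpus group. The function names, `min_count`, and the threshold value are all hypothetical.

```python
from collections import Counter

def non_occurrence_score(tokens, domain_counts, min_count=1):
    """Assumed parameter: share of tokens seen fewer than min_count times in the domain corpus."""
    if not tokens:
        return 0.0
    rare = sum(1 for t in tokens if domain_counts[t] < min_count)
    return rare / len(tokens)

def split_ind_sentences(sentences, domain_counts, threshold):
    """Route IND determination sentences to COOD or confirmed IND (hypothetical rule)."""
    cood, confirmed_ind = [], []
    for tokens in sentences:
        if non_occurrence_score(tokens, domain_counts) > threshold:
            cood.append(tokens)           # contains words that rarely appear in the domain
        else:
            confirmed_ind.append(tokens)  # everything else stays IND
    return cood, confirmed_ind

domain_corpus = [["set", "an", "alarm", "at", "six"], ["set", "a", "timer", "at", "eight"]]
domain_counts = Counter(t for s in domain_corpus for t in s)
candidates = [["set", "a", "timer", "at", "six"], ["set", "a", "shutdown", "at", "eight"]]
cood, confirmed_ind = split_ind_sentences(candidates, domain_counts, threshold=0.1)
```

In this toy run the sentence containing "shutdown", a word absent from the domain corpus, is routed to the COOD list, and the other sentence is kept as confirmed IND.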

The CIND (Close IND) corpus extraction unit 137 extracts, from the OOD determination sentences stored in the OOD determination sentence storage unit 136, the corpora classified as CIND determination sentences and stores them in the CIND determination sentence storage unit 138, and stores the remaining OOD determination sentences in the confirmed OOD determination sentence storage unit 139 as confirmed OOD determination sentences. The CIND corpus extraction unit 137 also discards, as discard determination sentences, those corpora among the OOD determination sentences that are classified neither as CIND determination sentences nor as confirmed OOD determination sentences. Here, a CIND determination sentence is an IND determination sentence that lies near the boundary in the determination against OOD determination sentences. Which domain each CIND determination sentence belongs to is ultimately decided manually or by similar means. CIND determination sentences are described in detail later with reference to FIG. 5.

The CIND corpus extraction unit 137 includes a non-sentence determination unit 137a and a non-occurrence determination unit 137b. It controls the non-sentence determination unit 137a to determine whether each corpus is a non-sentence, regards non-sentences as discard determination sentences, and discards them. It also controls the non-occurrence determination unit 137b to determine whether each corpus not regarded as a non-sentence is a CIND determination sentence or a confirmed OOD determination sentence. The CIND determination sentences are extracted and stored in the CIND determination sentence storage unit 138, and the confirmed OOD determination sentences are stored in the confirmed OOD determination sentence storage unit 139.

The non-sentence determination unit 137a calculates, for each corpus of OOD determination sentences, a Perplexity value, which is an index of how much the text resembles a meaningful sentence. It determines whether the corpus is a non-sentence by comparing the Perplexity value with a predetermined threshold, and discards corpora regarded as non-sentences as discard determination sentences.

For each corpus regarded as not being a non-sentence, the non-occurrence determination unit 137b determines whether it contains words that rarely appear in the corpus group of the domain for which it was determined to be an OOD determination sentence, and regards corpora containing such words as confirmed OOD determination sentences and extracts them. It stores the extracted confirmed OOD determination sentences in the confirmed OOD determination sentence storage unit 139, regards the remaining corpora as CIND determination sentences, extracts them, and stores them in the CIND determination sentence storage unit 138.
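The routing performed by the filtering processing unit 110 as a whole can be summarized in a short sketch. The three decision functions are passed in as callables because the text does not fix how they are computed; they are stand-ins (assumptions) for the semantic analyzer 131, the non-sentence determination units 133a and 137a, and the non-occurrence determination units 133b and 137b. Note that rarely appearing words lead to COOD on the IND side but to confirmed OOD on the OOD side.

```python
def filter_sentences(sentences, is_ind, is_non_sentence, has_rare_words):
    """Route replacement-generated sentences into the stores described in the text.

    is_ind, is_non_sentence, and has_rare_words are caller-supplied stand-ins
    (assumptions), not the actual determination logic of the device.
    """
    stores = {"COOD": [], "confirmed_IND": [], "CIND": [], "confirmed_OOD": [], "discarded": []}
    for s in sentences:
        ind = is_ind(s)                      # semantic analyzer 131: IND or OOD
        if is_non_sentence(s):               # 133a / 137a: non-sentences are discarded
            stores["discarded"].append(s)
        elif ind:                            # 133b: rare words on the IND side mean COOD
            stores["COOD" if has_rare_words(s) else "confirmed_IND"].append(s)
        else:                                # 137b: rare words on the OOD side mean confirmed OOD
            stores["confirmed_OOD" if has_rare_words(s) else "CIND"].append(s)
    return stores

# Toy stand-ins, for illustration only.
known = {"set", "an", "alarm", "a", "timer", "at", "six", "eight"}
demo = filter_sentences(
    ["set an alarm at six", "set a shutdown at eight", "play a timer at six",
     "timer at six", "at at at"],
    is_ind=lambda s: s.startswith("set"),
    is_non_sentence=lambda s: len(set(s.split())) == 1,
    has_rare_words=lambda s: any(w not in known for w in s.split()),
)
```

Keeping the decision functions as parameters makes it easy to swap in a real semantic analyzer or a different non-occurrence measure without changing the routing itself.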

<COOD Determination Sentences and CIND Determination Sentences>
The COOD determination sentence and the CIND determination sentence are now described.

As shown in FIG. 5, the semantic analyzer 131 of FIG. 4 classifies the corpus group of replacement-generated sentences, which were generated from IND sentences and are stored in the replacement-generated sentence storage unit 109, into a corpus group of IND determination sentences and a corpus group of OOD determination sentences.

Further, from the corpus group of IND determination sentences, the COOD corpus extraction unit 133 discards as discard determination sentences the corpora regarded as non-sentences by a determination based on the Perplexity value described later, and extracts the corpora regarded as COOD determination sentences by a determination based on the non-occurrence of words. The COOD corpus extraction unit 133 then regards the remaining IND determination sentences as confirmed IND determination sentences.

That is, because the input to the COOD corpus extraction unit 133 is a corpus of IND determination sentences generated by replacing words in various ways starting from IND sentences, a comparatively large share of the corpora are regarded as confirmed IND determination sentences.

However, the corpus determined to be IND determination sentences by the semantic analyzer 131 is limited by recognition accuracy and, moreover, has had words replaced, so it partly includes corpora that should be regarded as OOD determination sentences. Such an OOD determination sentence started out as an IND determination sentence and became an OOD determination sentence through word replacement, so it is a corpus close to IND determination sentences and can be considered a COOD determination sentence as described with reference to FIG. 2.

Here, two or more corpora (sentences) being "similar" to each other means, for example, that the sentences have predicates with mutually similar word senses and that the structure of the predicate and the arguments attached to it (the predicate-argument structure) is similar. Sentences whose arguments are also similar in the meaning and role of their words can be said to be even more similar.

For example, the word "設定する" ("to set") has multiple word senses. As shown in example Ex1 of FIG. 6, two of its senses are listed as "設定する4" and "設定する8".

In example Ex1, the arguments attached to the predicate "設定する" in the sense "設定する4" are listed as follows: for the agent case, ("system", 83), ("company", 42), ("school", 33), and ("boss", 18); for the target case, ("meeting", 41), ("participant", 27), and ("travel time", 10); and for the instrument case, ("PC (personal computer)", 95), ("scheduler", 72), and ("smartphone", 33). In the figure, the number written after each word (83, 42, 33, and 18 for the agent-case words "system", "company", "school", and "boss") indicates the number of times (frequency) the word was found linked to the predicate when a predetermined number of sentences were searched, and the words are listed in descending order of frequency. This number may instead be another index such as a weight, may be normalized by the population before use, or may be combined with other indices, for example by multiplication.

Likewise, in example Ex1, the arguments listed for "設定する8" are: for the agent case, ("wife", 40), ("daughter", 33), ("son", 28), and ("mother", 13); for the target case, ("alarm", 52), ("alarm clock", 48), and ("timer", 42); and for the instrument case, ("alarm clock", 94), ("clock", 42), ("mobile phone", 35), and ("smartphone", 19).

Further, "セットする1" (also "to set") is given as a predicate with a word sense similar to "設定する". Its arguments are: for the agent case, ("she", 56), ("father", 52), and ("wife", 49); for the target case, ("timer", 67), ("sleep timer", 42), and ("alarm", 41); and for the instrument case, ("rice cooker", 52), ("air conditioner", 45), ("radio", 32), and ("mobile phone", 12).

In other words, for each of "設定する4", "設定する8", and "セットする1" in example Ex1, sentences that use the predicate in that sense and differ only in that the agent case, target case, or instrument case has been replaced within the range of that sense are sentences with similar predicate structure. Moreover, "設定する8" and "セットする1" can be considered likely to have similar word senses, because in both the target case takes words of a semantic class related to timers.
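The case frame entries of example Ex1 can be held in a small nested mapping, which makes both the normalization of frequencies and the comparison of senses easy to try out. In the sketch below, the English glosses used as keys, the case labels, and the function names are illustrative choices. The overlap measure (Jaccard overlap of target-case words) is an assumption introduced here to illustrate the argument; it is not a method stated in the text.

```python
# Frequencies from example Ex1; keys are English glosses of the Japanese words.
CASE_FRAMES = {
    "設定する4": {
        "agent": {"system": 83, "company": 42, "school": 33, "boss": 18},
        "target": {"meeting": 41, "participant": 27, "travel time": 10},
        "instrument": {"PC": 95, "scheduler": 72, "smartphone": 33},
    },
    "設定する8": {
        "agent": {"wife": 40, "daughter": 33, "son": 28, "mother": 13},
        "target": {"alarm": 52, "alarm clock": 48, "timer": 42},
        "instrument": {"alarm clock": 94, "clock": 42, "mobile phone": 35, "smartphone": 19},
    },
    "セットする1": {
        "agent": {"she": 56, "father": 52, "wife": 49},
        "target": {"timer": 67, "sleep timer": 42, "alarm": 41},
        "instrument": {"rice cooker": 52, "air conditioner": 45, "radio": 32, "mobile phone": 12},
    },
}

def normalized(sense, case):
    """Turn raw frequencies for one case of one sense into relative frequencies."""
    counts = CASE_FRAMES[sense][case]
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def target_overlap(sense_a, sense_b):
    """Jaccard overlap of target-case words (an illustrative measure, not the patent's)."""
    a, b = set(CASE_FRAMES[sense_a]["target"]), set(CASE_FRAMES[sense_b]["target"])
    return len(a & b) / len(a | b)

timer_like = target_overlap("設定する8", "セットする1")   # shares "alarm" and "timer"
unrelated = target_overlap("設定する4", "セットする1")    # no shared target-case words
```

Under this measure "設定する8" and "セットする1" share two of four distinct target-case words, while "設定する4" shares none with "セットする1", which matches the observation above that the first two senses are likely to be similar.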

Conversely, two or more sentences being "different" from each other means, for example, that they have similar predicate-argument structures but contain predicates with different surface forms or noun phrases of different semantic classes.

For example, as shown in the upper part of example Ex2 of FIG. 6, sentences with similar predicate-argument structures but predicates with different surface forms include "Set the alarm for 6 o'clock.", "Destroy the alarm at 6 o'clock.", and "Release the alarm at 6 o'clock."

As shown in the lower part of example Ex2 of FIG. 6, sentences with similar predicate-argument structures but noun phrases of different semantic classes include "Set a timer for 8 o'clock.", "Set a sales meeting for 8 o'clock.", and "Set a shutdown for 8 o'clock." The first, "timer", is a noun phrase relating to clock functions; the second, "sales meeting", is a noun phrase relating to work events; and "shutdown" is a noun phrase relating to computer control. Semantically, they fall into different classes.

As a further example of semantic classes of noun phrases, as shown at the bottom of example Ex2 of FIG. 6, the word "山形" (Yamagata) may belong to a class denoting a person or family name ("Yamagata"), a class denoting a place name or prefecture name ("Yamagata Prefecture"), a class denoting a railway line name ("Yamagata Shinkansen"), and so on.

In contrast, among corpora regarded as OOD determination sentences, those that are far, in the feature space, from the corpora regarded as IND determination sentences can be considered unlikely to be misrecognized as IND determination sentences.

Therefore, by learning many COOD determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 become able to reliably reject corpora that resemble IND determination sentences but are OOD determination sentences, and recognition accuracy improves as a result. Recognition accuracy can thus be improved by generating more corpora that are COOD determination sentences and training on them.

A CIND determination sentence is the counterpart of a COOD determination sentence. In the feature space of FIG. 2, it is a corpus regarded as an IND determination sentence that lies near the boundary with the distribution of corpora regarded as OOD determination sentences.

A CIND determination sentence is an IND determination sentence, yet it is a corpus similar to an OOD determination sentence; in other words, it resembles an OOD determination sentence without being one. Put differently, a CIND determination sentence is an expression that is very hard to tell apart from an OOD determination sentence and is therefore a corpus prone to erroneous determination.

In contrast, among corpora regarded as IND determination sentences, those that are far, in the feature space, from the corpora regarded as OOD determination sentences can be considered unlikely to be erroneously determined to be OOD determination sentences.

Therefore, by learning many CIND determination sentences, the frame estimation unit 13 and the semantic analysis unit 14 become able to reliably recognize corpora that resemble OOD determination sentences but are IND determination sentences, and recognition accuracy improves as a result. Recognition accuracy can thus be improved by generating more corpora that are CIND determination sentences and training on them.

For these reasons, learning many COOD determination sentences and many CIND determination sentences makes it possible to improve the recognition accuracy of the frame estimation unit 13 and the semantic analysis unit 14.

<Corpus Generation Processing>
Next, the corpus generation processing performed by the corpus generation device 51 of FIG. 3 is described with reference to the flowchart of FIG. 7.

In step S11, the IND sentence reception unit 101 selects an unprocessed IND sentence, from among the IND sentences created manually or otherwise, as the IND sentence to be processed, accepts it as input, and outputs it to the language analysis unit 102.

In step S12, the language analysis unit 102 analyzes the morphemes, phrases, and predicate-argument structure of the IND sentence to be processed.

In step S13, the language analysis unit 102 saves the predicate structure analysis result. More specifically, it saves the result unless an error occurs in the predicate structure analysis processing. If an error occurs, the language analysis unit 102, for example, discards the IND sentence being processed.

In step S14, the language analysis unit 102 determines whether any unprocessed IND sentence remains; if so, the processing returns to step S11. That is, steps S11 to S14 are repeated until no unprocessed IND sentence remains. When it is determined in step S14 that all IND sentences have been processed and none remains unprocessed, the processing proceeds to step S15.
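Steps S11 to S14 amount to a loop that analyzes each IND sentence once and keeps only the results that were obtained without error. A minimal sketch follows, assuming the analyzer signals failure by raising `ValueError`; the analyzer itself, `toy_analyze`, is a hypothetical stand-in and does no real predicate-argument analysis.

```python
def analyze_all(ind_sentences, analyze):
    """Loop over IND sentences (S11, S14), analyze each (S12), and save results (S13)."""
    saved = []
    for sentence in ind_sentences:          # S11: pick the next unprocessed IND sentence
        try:
            result = analyze(sentence)      # S12: predicate-argument structure analysis
        except ValueError:
            continue                        # on error, discard this IND sentence
        saved.append((sentence, result))    # S13: save the analysis result
    return saved                            # S14: no unprocessed sentence remains

def toy_analyze(sentence):
    """Hypothetical analyzer: fails when the sentence does not end in the predicate form して."""
    if not sentence.endswith("して"):
        raise ValueError("no predicate found")
    return {"predicate_found": True}

saved = analyze_all(["７時にアラームを設定して", "近場で美味しいレストランは？"], toy_analyze)
```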

Here, an example of the saved predicate-argument structure analysis result data is described with reference to FIG. 8.

To illustrate what analyzing the predicate-argument structure means: when the input sentence is "銀座でおいしい寿司屋を教えて" ("Tell me a good sushi restaurant in Ginza") and it is analyzed by deep case analysis, for example, it is analyzed into the structure shown in example Ex11 of FIG. 8. That is, "銀座で" ("in Ginza") is analyzed as the locative case, "おいしい" ("good") as an adnominal modifier clause, "寿司屋を" ("sushi restaurant") as the target case, and "教えて" ("tell me") as the predicate clause.

When the same input sentence is analyzed by surface case analysis, it is analyzed into the structure shown in example Ex12 of FIG. 8. That is, "銀座で" is analyzed as the de-case (デ格), "おいしい" as an adjective, "寿司屋を" as the wo-case (ヲ格), and "教えて" as a verb.

For predicate-argument structure analysis, the device may be configured so that deep case analysis, surface case analysis, and the like can be switched, allowing the user to select one of them.

このように、解析結果で動詞句の位置や名詞句で対象格になる部分など（英語ならば目的語など）が決定される。 In this way, the position of the verb phrase and the target part of the noun phrase (in English, the object, etc.) are determined by the analysis result.
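As a rough illustration of how the two analysis views of the same sentence could be stored and switched, the following sketch keeps both labels on each unit. The `Chunk` record, the `labeled` helper, and the `replaceable_noun` helper are hypothetical names introduced here; only the labels themselves come from the description of examples Ex11 and Ex12.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """One analyzed unit of the input sentence (hypothetical record layout)."""
    text: str     # surface string of the unit
    deep: str     # label assigned by deep case analysis (Ex11)
    surface: str  # label assigned by surface case analysis (Ex12)


# Labels transcribed from the description of examples Ex11 and Ex12.
EXAMPLE = [
    Chunk("銀座で", "place case", "de case"),
    Chunk("おいしい", "adnominal modifier clause", "adjective"),
    Chunk("寿司屋を", "target case", "wo case"),
    Chunk("教えて", "predicate clause", "verb"),
]


def labeled(chunks, mode="deep"):
    """Return (text, label) pairs for the analysis mode the user selected."""
    if mode not in ("deep", "surface"):
        raise ValueError(f"unknown analysis mode: {mode}")
    return [(c.text, getattr(c, mode)) for c in chunks]


def replaceable_noun(chunks):
    """Pick the unit analyzed as the target case (the wo case on the surface)."""
    return next(c.text for c in chunks if c.deep == "target case")
```

Calling `labeled(EXAMPLE, "surface")` yields the Ex12 view, while `labeled(EXAMPLE)` yields the Ex11 view of the same units.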

Note that when the IND sentence to be processed is, for example, "近場で美味しいレストランは？" ("Any good restaurants nearby?"), the sentence has no predicate, and the predicate-argument structure analysis may fail. In such a case, the omitted predicate may be supplied based on a predetermined rule so that the analysis result of the predicate-argument structure can be completed.

In step S15, the replacement location setting unit 103 takes as input the predicate-argument structure analysis results stored in step S13, sets the replacement conditions designated in advance, and supplies the setting result to the dictionary inquiry unit 104.

<Replacement conditions>
Here, the replacement conditions are described. The replacement conditions consist of a replacement method and a replacement location.

First, there are broadly the following two replacement methods, and, for example, one of them is set.

More specifically, the first replacement method is the Action-fixed (predicate fixed), Category-replacement (target case replacement) method, and the second is the Category-fixed (target case fixed), Action-replacement (predicate replacement) method.

In the Action-fixed (predicate fixed), Category-replacement (target case replacement) method, when the input sentence is, for example, "7時にアラームを設定して" ("Set an alarm for 7 o'clock"), the predicate (Action) "設定して" ("set") is fixed and the target case (Category) "アラーム" ("alarm") is replaced, as shown in 1) in the upper part of example Ex21 of FIG. 9. In 1) of example Ex21, "アラーム" is replaced with "物性" ("physical property").

In the Category-fixed (target case fixed), Action-replacement (predicate replacement) method, when the input sentence is, for example, "7時にアラームを設定して", the target case (Category) "アラーム" is fixed and the predicate (Action) "設定して" is replaced, as shown in 2) in the lower part of example Ex21 of FIG. 9. In 2) of example Ex21, "設定して" is replaced with "解放して" ("release").

Note that settings for sentences with two predicates, for designating a case other than the target case (time case, instrument case, and so on), for phrase boundaries, and the like may be specified freely by the user and switched according to what is specified.
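The two replacement methods can be sketched as follows. This is a minimal illustration under stated assumptions: the `Parsed` record, its field names, and the method names `"action_fixed"` and `"category_fixed"` are hypothetical, and the sentence is assumed to have already been split into the parts shown. The example words are those given for example Ex21.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Parsed:
    """Hypothetical decomposition of an analyzed sentence."""
    prefix: str    # material before the target case, e.g. "7時に"
    category: str  # target-case noun (Category)
    particle: str  # case particle following the noun
    action: str    # predicate stem (Action)
    ending: str    # predicate ending

    def render(self):
        return f"{self.prefix}{self.category}{self.particle}{self.action}{self.ending}"


def substitute(parsed, method, word):
    """Replace exactly one slot, leaving the other one fixed."""
    if method == "action_fixed":      # Action fixed, Category replaced
        return replace(parsed, category=word)
    if method == "category_fixed":    # Category fixed, Action replaced
        return replace(parsed, action=word)
    raise ValueError(f"unknown replacement method: {method}")


alarm = Parsed("7時に", "アラーム", "を", "設定", "して")
```

With this sketch, `substitute(alarm, "action_fixed", "物性").render()` gives the sentence of 1) in example Ex21, and `substitute(alarm, "category_fixed", "解放").render()` gives that of 2).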

<Replacement location when the replacement method is the Action-fixed, Category-replacement method>
When the replacement method is the Action-fixed, Category-replacement method, the replacement location is set, for example, as shown in example Ex22 of FIG. 10.

When the IND sentence to be processed is, for example, "今週末駅の近くでおすすめのスポット教えて" ("Tell me recommended spots near the station this weekend") and the wo case is designated as the replacement location, the replacement location setting unit 103 divides the sentence into the word-segmentation units "今週末", "駅の", "近く", "で", "おすすめ", "の", "スポット", and "教えて", as shown in the top row of example Ex22.

The replacement location setting unit 103 may also be able to change, using rules or a word dictionary, the setting that merges the segmentation units of particular words or phrases into a single phrase as needed. For example, as shown in the second row from the top of example Ex22, the units are adjusted to "今週末", "駅の近く", "で", "おすすめ", "の", "スポット", and "教えて".
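The merging of segmentation units into one phrase can be pictured with a small rule-driven pass like the one below. It is only a sketch: the source does not specify how the rules or the word dictionary are represented, so the `merge_units` function and the representation of a rule as a tuple of adjacent units are assumptions made for illustration.

```python
def merge_units(units, merge_rules):
    """Merge adjacent segmentation units that match a rule into one phrase.

    merge_rules is a set of tuples of adjacent units (an assumed format).
    Longer rules are tried first at each position.
    """
    max_len = max((len(rule) for rule in merge_rules), default=1)
    merged, i = [], 0
    while i < len(units):
        for n in range(min(max_len, len(units) - i), 1, -1):
            if tuple(units[i:i + n]) in merge_rules:
                merged.append("".join(units[i:i + n]))
                i += n
                break
        else:
            # No rule matched at this position; keep the unit unchanged.
            merged.append(units[i])
            i += 1
    return merged


units = ["今週末", "駅の", "近く", "で", "おすすめ", "の", "スポット", "教えて"]
rules = {("駅の", "近く")}
adjusted = merge_units(units, rules)
```

Here `adjusted` corresponds to the second row of example Ex22, in which "駅の" and "近く" form the single phrase "駅の近く".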

Furthermore, from this adjusted segmentation structure, the replacement location setting unit 103 replaces, for example, "おすすめ" with a group of words that attach in the wo case, as shown in the third row from the top of example Ex22.

Replacing a word at only one location in this way is effective for creating COOD sentences, but setting variations can also be added, such as additionally making the de-case phrase "駅の近く" a replacement target, or replacing only the de-case phrase "駅の近く" without replacing the wo-case word "スポット".

<Replacement location when the replacement method is the Category-fixed, Action-replacement method>
When the replacement method is the Category-fixed (target case fixed), Action-replacement (predicate replacement) method, the replacement location is set, for example, as shown in example Ex23 of FIG. 11.

When the IND sentence to be processed is, for example, "今週末駅の近くでおすすめのスポット教えて" and the predicate is designated as the replacement location, the replacement location setting unit 103 divides the sentence into the word-segmentation units "今週末", "駅の", "近く", "で", "おすすめ", "の", "スポット", and "教えて", as shown in the top row of example Ex23.

Furthermore, from this adjusted segmentation structure, the replacement location setting unit 103 replaces, for example, the predicate "教えて", with which the wo-case "おすすめ" is in a dependency relation, with a predicate of similar meaning that takes the same "おすすめ" in its wo case, as shown in the third row from the top of example Ex23. Alternatively, the setting may be to replace it with a predicate of dissimilar meaning that does not take the same "おすすめ" in its wo case.

The criterion for selecting the replacing predicate is not limited to which words it takes in the wo case; the similarity or dissimilarity of the words in other arguments, such as the de case or the ni case, can also be used.

In step S16, the dictionary inquiry unit 104 reads one unprocessed IND sentence from the predicate-argument structure analysis results stored by the language analysis unit 102 and accepts it as the IND sentence to be processed.

In step S17, based on the setting result, the dictionary inquiry unit 104 identifies the replacement location according to the designated replacement method, and searches the case frame dictionary 107 for noun phrases corresponding to the argument of the word at the replacement location, or for predicates corresponding to its word sense.

In step S18, the dictionary inquiry unit 104 stores the IND sentence being processed, the replacement settings, and the search results in association with one another.

In step S19, the dictionary inquiry unit 104 determines whether any unprocessed IND sentence remains in the stored predicate-argument structure analysis results; if one does, the process returns to step S16. That is, steps S16 to S19 are repeated until the search for replacement candidates has been completed for all IND sentences in the stored analysis results. When it is determined in step S19 that the search has been completed for all IND sentences and no unprocessed IND sentence remains, the process proceeds to step S20.
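The loop of steps S16 to S19 can be sketched as below. This is a simplified illustration, not the method of the case frame dictionary 107 itself: `CASE_FRAMES` is a hypothetical toy dictionary that maps a predicate to the nouns observed in each of its cases, and the lookup is a plain membership test rather than the similarity search described in the text. The example words are taken from the rows for sentence ID 1.

```python
# Hypothetical toy stand-in for the case frame dictionary 107:
# predicate -> case particle -> nouns observed in that case.
CASE_FRAMES = {
    "設定": {"を": ["アラーム", "会議", "参加者", "移動時間"]},
    "改善": {"を": ["アラーム"]},
    "装備": {"を": ["アラーム"]},
}


def search(record, method, case="を"):
    """Step S17: look up replacement candidates for one analyzed sentence."""
    noun = record["args"][case]
    if method == "action_fixed":
        # Keep the predicate; collect other nouns it takes in the same case.
        nouns = CASE_FRAMES.get(record["predicate"], {}).get(case, [])
        return [n for n in nouns if n != noun]
    if method == "category_fixed":
        # Keep the noun; collect other predicates that take it in that case.
        return [p for p, frame in CASE_FRAMES.items()
                if p != record["predicate"] and noun in frame.get(case, [])]
    raise ValueError(f"unknown replacement method: {method}")


def run_search(records, method):
    """Steps S16 to S19: process every stored record and keep the association."""
    results = []
    for record in records:                       # S16 (next record), S19 (loop)
        candidates = search(record, method)      # S17
        results.append({"record": record,        # S18: store sentence, settings,
                        "method": method,        #      and results together
                        "candidates": candidates})
    return results


records = [{"id": 1, "sentence": "7時にアラームを設定して",
            "predicate": "設定", "args": {"を": "アラーム"}}]
```

For instance, `run_search(records, "action_fixed")` returns one entry whose candidates are the other wo-case nouns of "設定" in the toy dictionary.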

When the Japanese predicate-argument structure analysis result is as shown in FIG. 12 and the replacement method is the Action-fixed, Category-replacement method, a group of words to replace the predicate argument (wo case) is retrieved, as shown in FIG. 13. When the analysis result is as shown in FIG. 12 and the replacement method is the Category-fixed, Action-replacement method, a group of words to replace the predicate is retrieved, as shown in FIG. 14. Through such replacement, Japanese OOD candidate sentences such as those shown in FIG. 15 are generated.

Likewise, when the English predicate-argument structure analysis result corresponding to that of FIG. 12 is as shown in FIG. 16 and the replacement method is the Action-fixed, Category-replacement method, a group of words to replace the predicate argument is retrieved, as shown in FIG. 17. When the analysis result is as shown in FIG. 16 and the replacement method is the Category-fixed, Action-replacement method, a group of words to replace the predicate is retrieved, as shown in FIG. 18. Through such replacement, English OOD candidate sentences such as those shown in FIG. 19 are generated.

Below, an example of Japanese predicate-argument structure analysis results, the search results under the Action-fixed, Category-replacement method, the search results under the Category-fixed, Action-replacement method, and examples of OOD candidate sentences are described with reference to FIGS. 12 to 15. The corresponding English examples are then described with reference to FIGS. 16 to 19.

<Example of Japanese predicate-argument structure analysis results>
FIG. 12 shows an example of predicate-argument structure analysis results, with columns, from the left, for sentence ID, sentence, predicate, predicate ending, predicate arguments, and original domain. The predicate arguments are shown, from the left, as place case or de case, adnominal modifier clause or no case, ..., and target case or wo case.

More specifically, for the sentence with sentence ID 1, "7時にアラームを設定して" ("Set an alarm for 7 o'clock"), the predicate is "設定", the predicate ending is "して", the target case or wo case is "アラーム", and the original domain is ALARM-SETUP.

For the sentence with sentence ID 1001, "銀座でおいしい寿司屋を教えて" ("Tell me a good sushi restaurant in Ginza"), the predicate is "教え", the predicate ending is "て", the place case or de case is "銀座", the target case or wo case is "寿司屋", and the original domain is RESTAURANT-SEARCH.

For the sentence with sentence ID 1002, "イタリアンのレストランを教えて" ("Tell me an Italian restaurant"), the predicate is "教え", the predicate ending is "て", the adnominal modifier clause or no case is "イタリアン", the target case or wo case is "レストラン", and the original domain is RESTAURANT-SEARCH.

For the sentence with sentence ID 1003, "イタリアンのコースの食べれるレストランをさがしてください" ("Please find a restaurant where I can eat an Italian course meal"), the predicate is "さが", the predicate ending is "してください", the adnominal modifier clause or no case is "イタリアン" and "コース", the target case or wo case is "レストラン", and the original domain is RESTAURANT-SEARCH.

<Example of word groups that replace the target case when the replacement method for Japanese is the Action-fixed, Category-replacement method>
FIG. 13 shows an example of the word groups used to replace the target case or wo case, among the predicate arguments replaced under the Action-fixed, Category-replacement method, for the sentences of the analysis results shown in FIG. 12.

FIG. 13 shows, from the left, sentence ID, sentence, predicate, predicate ending, replacement words for the arguments, and original domain. The replacement words for the arguments are shown, from the left, for the de case (place case or de case), the no case (adnominal modifier clause or no case), ..., and the wo case (target case or wo case). Of these, FIG. 13 shows an example of the word groups for replacing the wo case (target case or wo case). Items in FIG. 13 that correspond to those in FIG. 12 carry the same entries, so their description is omitted as appropriate.

That is, for sentence ID 1, "会議" (meeting), "参加者" (participant), and "移動時間" (travel time) are shown as examples of replacement words for the wo-case word "アラーム".

For sentence ID 1001, "ニュース" (news), "仕組み" (mechanism), "人生" (life), "流れ" (flow), and "芸" (art) are shown as examples of replacement words for the wo-case word "寿司屋".

For sentence ID 1002, "ニュース", "仕組み", "人生", "流れ", and "芸" are shown as examples of replacement words for the wo-case word "レストラン".

For sentence ID 1003, "外科" (surgery), "オフィス" (office), "親父" (father), "自動車" (automobile), "講座" (course), and "一戸建て" (detached house) are shown as examples of replacement words for the wo-case word "レストラン".

<Example of word groups that replace the predicate when the replacement method for Japanese is the Category-fixed, Action-replacement method>
FIG. 14 shows an example of the word groups used to replace the predicate under the Category-fixed, Action-replacement method for the sentences shown in FIG. 12.

FIG. 14 shows, from the left, sentence ID, sentence, predicate, predicate ending, replacement words for the predicate, and original domain. Items in FIG. 14 that correspond to those in FIG. 12 carry the same entries, so their description is omitted as appropriate.

That is, for sentence ID 1, "改善する" (improve), "装備する" (equip), and "装着する" (mount) are shown as examples of replacement words for the predicate "設定".

For sentence IDs 1001 and 1002, "抜け出す" (slip out of), "食べ歩く" (eat one's way around), "開業する" (open a business), "買い取る" (buy out), "手伝う" (help), "切り盛りする" (manage), and "格付ける" (rate) are shown as examples of replacement words for the predicate "教え".

For sentence ID 1003, "営む" (run), "開く" (open), "手伝う" (help), "利用する" (use), "特集する" (feature), "下見する" (inspect in advance), and "探し当てる" (track down) are shown as examples of replacement words for the predicate "さが".

<Examples of Japanese OOD candidate sentences generated by replacement>
(Examples of OOD candidate sentences under the Action-fixed, Category-replacement method)
First, with reference to the left part of FIG. 15, examples of OOD candidate sentences under the Action-fixed, Category-replacement method are described from among the Japanese OOD candidate sentences generated by the replacement processing described above.

That is, as shown in the left part of FIG. 15, "7時に会議を設定して" ("Set a meeting for 7 o'clock") and "7時に参加者を設定して" ("Set participants for 7 o'clock") are shown as examples of OOD candidate sentences for the sentence with sentence ID 1 in FIG. 12, "7時にアラームを設定して". In other words, "アラーム" is replaced with "会議" and "参加者", respectively.

For the sentence with sentence ID 1001 in FIG. 12, "銀座でおいしい寿司屋を教えて", the examples shown are "銀座でおいしいニュースを教えて" ("Tell me delicious news in Ginza") and "銀座でおいしい仕組みを教えて" ("Tell me a delicious mechanism in Ginza"). In other words, "寿司屋" is replaced with "ニュース" and "仕組み", respectively.

For the sentence with sentence ID 1002 in FIG. 12, "イタリアンのレストランを教えて", the examples shown are "イタリアンのニュースを教えて" ("Tell me Italian news") and "イタリアンの仕組みを教えて" ("Tell me the Italian mechanism"). In other words, "レストラン" is replaced with "ニュース" and "仕組み", respectively.

For the sentence with sentence ID 1003 in FIG. 12, "イタリアンのコースの食べれるレストランをさがしてください", the examples shown are "イタリアンのコースを食べれる外科をさがして" ("Find a surgery department where I can eat an Italian course meal") and "イタリアンのコースを食べれるオフィスをさがして" ("Find an office where I can eat an Italian course meal"). In other words, "レストラン" is replaced with "外科" and "オフィス", respectively.

(Examples of OOD candidate sentences under the Category-fixed, Action-replacement method)
Next, with reference to the right part of FIG. 15, examples of OOD candidate sentences under the Category-fixed, Action-replacement method are described from among the Japanese OOD candidate sentences generated by the replacement processing described above.

That is, as shown in the right part of FIG. 15, "7時にアラームを改善して" ("Improve the alarm at 7 o'clock") and "7時にアラームを装備して" ("Equip the alarm at 7 o'clock") are shown as examples of OOD candidate sentences for the sentence with sentence ID 1 in FIG. 12, "7時にアラームを設定して". In other words, "設定" is replaced with "改善" and "装備", respectively.

For the sentence with sentence ID 1001 in FIG. 12, "銀座でおいしい寿司屋を教えて", the examples shown are "銀座でおいしい寿司屋を抜け出して" ("Slip out of a good sushi restaurant in Ginza") and "銀座でおいしい寿司屋を食べ歩いて" ("Eat your way around good sushi restaurants in Ginza"). In other words, "教えて" is replaced with "抜け出して" and "食べ歩いて", respectively.

For the sentence with sentence ID 1002 in FIG. 12, "イタリアンのレストランを教えて", the examples shown are "イタリアンのレストランを抜け出して" ("Slip out of an Italian restaurant") and "イタリアンのレストランを食べ歩いて" ("Eat your way around Italian restaurants"). In other words, "教えて" is replaced with "抜け出して" and "食べ歩いて", respectively.

For the sentence with sentence ID 1003 in FIG. 12, "イタリアンのコースの食べれるレストランをさがしてください", the examples shown are "イタリアンのコースの食べれるレストランを営んで" ("Run a restaurant where one can eat an Italian course meal") and "イタリアンのコースの食べれるレストランを開いて" ("Open a restaurant where one can eat an Italian course meal"). In other words, "さがして" is replaced with "営んで" and "開いて", respectively.

<Example of English predicate-argument structure analysis results>
FIG. 16 shows an example of English predicate-argument structure analysis results, with columns, from the left, for sentence ID, sentence (Sentence), predicate (verb: Action), predicate arguments (Argument), and original domain (Original Domain). The predicate arguments are shown, from the left, as the argument attached to the predicate by "in" (prep_in), ..., and the target case (dobj).

The English predicate-argument structure analysis uses the output of the Stanford Parser as an example (for details, see Marie-Catherine de Marneffe and Christopher D. Manning, "Stanford typed dependencies manual", 2008, revised for the Stanford Parser v. 3.3 in December 2013). dobj indicates that the semantic role (case) of the argument attached to the predicate is the target case.

More specifically, for the sentence with sentence ID 1, "find a Chinese buffet nearby", the predicate is "find", the target case (dobj) is "Chinese buffet", and the original domain is AREA_INFO-SEARCH_EVENT.

For the sentence with sentence ID 2, "find Chinese food in Austin", the predicate is "find", the argument attached to the predicate by "in" (prep_in) is "Austin", the target case (dobj) is "Chinese food", and the original domain is AREA_INFO-SEARCH_EVENT.

For the sentence with sentence ID 1537, "turn on some tunes please", the predicate is "turn on", the target case (dobj) is "tunes", and the original domain is MUSIC_PLAY.

For the sentence with sentence ID 1538, "I'd like to hear some Beatles", the predicate is "hear", the target case (dobj) is "Beatles", and the original domain is MUSIC_PLAY.

<Example of word groups that replace the target case when the replacement method for English is the Action-fixed, Category-replacement method>
FIG. 17 shows an example of the word groups used to replace the target case (dobj), among the predicate arguments replaced under the Action-fixed, Category-replacement method, for the sentences of the English analysis results shown in FIG. 16.

FIG. 17 shows, from the left, sentence ID, sentence (Sentence), predicate (verb), replacement words for the arguments (Argument), and original domain (Original Domain). The replacement words for the arguments are shown, from the left, for the argument attached to the predicate by "in" (prep_in), ..., and the target case (dobj). Of these, FIG. 17 shows an example of the word groups for replacing the target case (dobj). Items in FIG. 17 that correspond to those in FIG. 16 carry the same entries, so their description is omitted as appropriate.

That is, for the sentence with sentence ID 1, "find a Chinese buffet nearby", "victim", "bomb", "cache", and "remains" are shown as examples of replacement words for the target case "Chinese buffet".

For the sentence with sentence ID 2, "find Chinese food in Austin", "victim", "bomb", "cache", and "remains" are shown as examples of replacement words for the target case "Chinese food".

For the sentence with sentence ID 1537, "turn on some tunes please", "light", "power", and "you" are shown as examples of replacement words for the target case "tunes".

For the sentence with sentence ID 1538, "I'd like to hear some Beatles", "team-mate", "boss", and "neighbor" are shown as examples of replacement words for the target case "Beatles".

<Example of word groups that replace the predicate when the replacement method for English is the Category-fixed, Action-replacement method>
FIG. 18 shows an example of the word groups used to replace the predicate under the Category-fixed, Action-replacement method for the sentences shown in FIG. 16.

FIG. 18 shows, from the left, sentence ID, sentence, predicate, predicate ending, replacement words for the predicate, and original domain. Items in FIG. 18 that correspond to those in FIG. 16 carry the same entries, so their description is omitted as appropriate.

That is, for the sentence with sentence ID 1, "find a Chinese buffet nearby", "include", "open", "run", and "operate" are shown as examples of replacement words for the predicate (verb: Action) "find".

For the sentence with sentence ID 2, "find Chinese food in Austin", "include", "open", "run", and "operate" are shown as examples of replacement words for the predicate "find".

For the sentence with sentence ID 1537, "turn on some tunes please", "download", "record", and "compose" are shown as examples of replacement words for the predicate "turn on".

For the sentence with sentence ID 1538, "I'd like to hear some Beatles", "work with", "copy", and "remove" are shown as examples of replacement words for the predicate "hear".

<Examples of English OOD candidate sentences generated by replacement>
(Examples of OOD candidate sentences under the Action-fixed, Category-replacement method)
First, with reference to the left part of FIG. 19, examples of OOD candidate sentences under the Action-fixed, Category-replacement method are described from among the English OOD candidate sentences generated by the replacement processing described above.

That is, as shown in the left part of FIG. 19, "find a bomb nearby" and "find a victim nearby" are shown as examples of OOD candidate sentences for the sentence with sentence ID 1 in FIG. 16, "find a Chinese buffet nearby". In other words, "Chinese buffet" is replaced with "bomb" and "victim", respectively.

For the sentence with sentence ID 2 in FIG. 16, "find Chinese food in Austin", the examples shown are "find cache in Austin" and "find remains in Austin". In other words, "Chinese food" is replaced with "cache" and "remains", respectively.

For the sentence with sentence ID 1537 in FIG. 16, "turn on some tunes please", the examples shown are "turn on light please" and "turn on power please". In other words, "some tunes" is replaced with "light" and "power", respectively.

For the sentence with sentence ID 1538 in FIG. 16, "I'd like to hear some Beatles", the examples shown are "I'd like to hear team-mate" and "I'd like to hear neighbor". In other words, "Beatles" is replaced with "team-mate" and "neighbor", respectively.

(Examples of OOD candidate sentences in the Category-fixed Action replacement method)
Next, with reference to the right part of FIG. 19, examples of OOD candidate sentences in the Category-fixed Action replacement method, among the English OOD candidate sentences generated by the replacement processing described above, will be described.

That is, as shown in the right part of FIG. 19, “Open Chinese buffet nearby” and “Operate Chinese buffet nearby” are shown as examples of OOD candidate sentences for the sentence “find a Chinese buffet nearby” with sentence ID 1 in FIG. 16. In other words, “find” is replaced with “Open” and “Operate”, respectively.

Likewise, “Open Chinese food in Austin” and “Operate Chinese food in Austin” are shown as examples of OOD candidate sentences for the sentence “find Chinese food in Austin” with sentence ID 2 in FIG. 16. In other words, “find” is replaced with “Open” and “Operate”, respectively.

Further, “Record some tunes please” and “Compose some tunes please” are shown as examples of OOD candidate sentences for the sentence “turn on some tunes please” with sentence ID 1537 in FIG. 16. In other words, “turn on” is replaced with “Record” and “Compose”, respectively.

Finally, “I'd like to copy some Beatles” and “I'd like to remove some Beatles” are shown as examples of OOD candidate sentences for the sentence “I'd like to hear some Beatles” with sentence ID 1538 in FIG. 16. In other words, “hear” is replaced with “copy” and “remove”, respectively.
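The two replacement methods illustrated with FIG. 19 can be sketched as a single slot substitution. This is a minimal illustration only: the function name `replace_slot` is hypothetical, and the sketch assumes the replacement span and the replacement words have already been determined by the predicate-argument analysis and the dictionary search.

```python
def replace_slot(sentence: str, target: str, candidates: list[str]) -> list[str]:
    """Return one OOD candidate sentence per replacement word.

    `target` is the span to replace (assumed to be found by an earlier
    analysis step); only its first occurrence is substituted.
    """
    if target not in sentence:
        raise ValueError(f"{target!r} not found in {sentence!r}")
    return [sentence.replace(target, word, 1) for word in candidates]


# Action-fixed Category replacement: the predicate stays, the category phrase changes.
category_swapped = replace_slot(
    "turn on some tunes please", "some tunes", ["light", "power"]
)

# Category-fixed Action replacement: the category stays, the predicate changes.
action_swapped = replace_slot(
    "I'd like to hear some Beatles", "hear", ["copy", "remove"]
)

print(category_swapped)
print(action_swapped)
```

A real implementation would also need the conjugation and ending adjustment that the replacement execution unit performs, which this sketch omits.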

<Method of searching the case frame dictionary for words likely to yield COOD sentences>
Next, a method of searching the case frame dictionary 107 for words that are likely to yield COOD sentences will be described.

FIG. 20 shows a simplified view of the case frame dictionary.

For example, in deep case analysis, for the predicate settei-suru (設定する, “set”), two case frames, <settei-suru 4> and <settei-suru 8>, are given as examples, as shown in example Ex31 of FIG. 20. The trailing number identifies a distinct word sense of the same predicate settei-suru.

In example Ex31 of FIG. 20, <settei-suru 4> shows which words attach to the predicate as arguments in each role. Each pair in parentheses consists of a word and a number, where the number is the count (frequency) with which that word co-occurred with the predicate. For example, (“meeting”, 41) means that, in the large corpus from which the case frame dictionary was built, the word “meeting” attached 41 times as the object case to the sense <settei-suru 4>. This number may instead be another index such as a weight, may be normalized by the population, or may be combined with other indices, for example by multiplication.

In example Ex31 of FIG. 20, <settei-suru 4> lists (“system”, 83), (“company”, 42), (“school”, 33), and (“boss”, 18) as the agent case; (“meeting”, 41), (“participant”, 27), and (“travel time”, 10) as the object case; and (“PC (personal computer)”, 95), (“scheduler”, 72), and (“smartphone”, 33) as the instrumental case.

In example Ex31, <settei-suru 8> lists (“wife”, 40), (“daughter”, 33), (“son”, 28), and (“mother”, 13) as the agent case; (“alarm”, 52), (“wake-up alarm”, 48), and (“timer”, 42) as the object case; and (“wake-up alarm”, 94), (“clock”, 48), (“mobile phone”, 35), and (“smartphone”, 19) as the instrumental case.

Further, example Ex31 gives <setto-suru 1> (セットする, also “set”) as a case frame whose surface form differs from settei-suru but whose sense is similar. It lists (“she”, 56), (“father”, 52), and (“wife”, 49) as the agent case; (“timer”, 67), (“sleep timer”, 42), and (“alarm”, 41) as the object case; and (“rice cooker”, 52), (“air conditioner”, 45), (“radio”, 32), and (“mobile phone”, 12) as the instrumental case.

Example Ex31 also gives <kaizen-suru 15> (改善する, “improve”) as a case frame whose surface form differs from settei-suru and whose sense is not similar. It lists (“technique”, 102), (“quality”, 73), and (“process”, 67) as the agent case; (“operation”, 81), (“performance”, 75), and (“alarm”, 2) as the object case; and (“replacement”, 58), (“ingenuity”, 49), and (“method”, 41) as the instrumental case.
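One way to hold the case frames of example Ex31 in memory is sketched below. The class name `CaseFrame`, the slot keys (`"agent"`, `"object"`, `"instrument"`), the romanized predicate name, and the English glosses of the words are all assumptions made for illustration; the text does not specify a storage format. The per-slot relative frequency is one possible reading of “normalized by the population”.

```python
from dataclasses import dataclass, field


@dataclass
class CaseFrame:
    predicate: str   # romanized surface form, e.g. "settei-suru"
    sense_id: int    # trailing number that distinguishes word senses
    slots: dict[str, dict[str, int]] = field(default_factory=dict)

    def frequency(self, case: str, word: str) -> int:
        """Raw co-occurrence count of `word` in the given case slot (0 if absent)."""
        return self.slots.get(case, {}).get(word, 0)

    def relative_frequency(self, case: str, word: str) -> float:
        """Count normalized by the slot total (an assumed normalization)."""
        total = sum(self.slots.get(case, {}).values())
        return self.frequency(case, word) / total if total else 0.0


# <settei-suru 4> from example Ex31, with words given as English glosses.
settei_suru_4 = CaseFrame(
    predicate="settei-suru",
    sense_id=4,
    slots={
        "agent": {"system": 83, "company": 42, "school": 33, "boss": 18},
        "object": {"meeting": 41, "participant": 27, "travel time": 10},
        "instrument": {"PC": 95, "scheduler": 72, "smartphone": 33},
    },
)

print(settei_suru_4.frequency("object", "meeting"))
print(settei_suru_4.frequency("object", "alarm"))
print(settei_suru_4.relative_frequency("object", "meeting"))
```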

The sentence generation process by replacement using deep case analysis will be described, taking the original IND sentence “Set an alarm at 7 o'clock” (「７時にアラームを設定して」) as an example.

In the Action (predicate) fixed, Category (object case) replacement mode, when the predicate settei-suru (“set”) is fixed, the case frame <settei-suru 4>, which does not have “alarm” in its object case, is selected. Instead of “alarm”, the object case of <settei-suru 4> contains words of different semantic classes, such as (“meeting”, 41), (“participant”, 27), and (“travel time”, 10). Generating new corpus sentences by substituting these words yields COOD sentence candidates in which the sense of settei-suru differs subtly.
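The selection rule of the Action-fixed Category replacement mode can be sketched as follows. The table `CASE_FRAMES`, the function name `category_replacements`, the romanized predicate names, the English word glosses, and the English sentence template are assumptions for illustration; only the counts come from example Ex31.

```python
# Object-case slots of three case frames from example Ex31 (English glosses).
CASE_FRAMES = {
    ("settei-suru", 4): {"object": {"meeting": 41, "participant": 27, "travel time": 10}},
    ("settei-suru", 8): {"object": {"alarm": 52, "wake-up alarm": 48, "timer": 42}},
    ("setto-suru", 1): {"object": {"timer": 67, "sleep timer": 42, "alarm": 41}},
}


def category_replacements(predicate: str, case: str, fixed_word: str) -> list[str]:
    """Collect replacement words from frames of the SAME predicate whose
    `case` slot does not contain the original word, most frequent first."""
    words: list[str] = []
    for (pred, _sense), slots in CASE_FRAMES.items():
        slot = slots.get(case, {})
        if pred == predicate and fixed_word not in slot:
            words.extend(sorted(slot, key=slot.get, reverse=True))
    return words


candidates = category_replacements("settei-suru", "object", "alarm")
# English rendering of the original sentence, used here as a template.
sentences = [f"Set {word} at 7 o'clock" for word in candidates]

print(candidates)
print(sentences)
```

Only <settei-suru 4> qualifies here: <settei-suru 8> contains “alarm” in its object case, and <setto-suru 1> belongs to a different predicate.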

In the Category (object case) fixed, Action (predicate) replacement mode, when the object-case word “alarm” is fixed, the case frames <setto-suru 1> and <kaizen-suru 15>, which have a different surface form but also take “alarm” in the object case, are selected. Both contain the same word “alarm” in the object case, but its frequency is 41 in <setto-suru 1> and only 2 in <kaizen-suru 15>. A predicate such as the latter is likely to differ subtly in sense, and its case frame contains words of semantic classes that have little to do with the timer function. In this way, a predicate is selected that has the fixed word in the same argument slot and for which the value n, representing the frequency of the word or the strength of its association, is less than a threshold α.

Replacing the original predicate with kaizen-suru (“improve”) to generate a new corpus sentence yields the COOD sentence candidate “Improve the alarm at 7 o'clock”, whose sense differs subtly.
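The threshold condition n < α of the Category-fixed Action replacement mode can be sketched as below. The table and function names, the romanized predicates, the English glosses, and the concrete values of α are assumptions for illustration; the text gives no value for α.

```python
# Object-case slots of three case frames from example Ex31 (English glosses).
CASE_FRAMES = {
    ("settei-suru", 8): {"object": {"alarm": 52, "wake-up alarm": 48, "timer": 42}},
    ("setto-suru", 1): {"object": {"timer": 67, "sleep timer": 42, "alarm": 41}},
    ("kaizen-suru", 15): {"object": {"operation": 81, "performance": 75, "alarm": 2}},
}


def action_replacements(
    original_predicate: str, case: str, fixed_word: str, alpha: int
) -> list[str]:
    """Select predicates with a different surface form that take `fixed_word`
    in the same slot with a count n satisfying 0 < n < alpha."""
    selected: list[str] = []
    for (pred, _sense), slots in CASE_FRAMES.items():
        n = slots.get(case, {}).get(fixed_word, 0)
        if pred != original_predicate and 0 < n < alpha:
            selected.append(pred)
    return selected


# alpha = 10 is an assumed threshold.
weakly_associated = action_replacements("settei-suru", "object", "alarm", alpha=10)
# A looser threshold also admits the similar-sense frame <setto-suru 1>.
loosely_filtered = action_replacements("settei-suru", "object", "alarm", alpha=50)

print(weakly_associated)
print(loosely_filtered)
```

With the tighter threshold, only the weakly associated predicate kaizen-suru (“improve”) remains, which corresponds to the candidate “Improve the alarm at 7 o'clock”.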

In surface case analysis, for the predicate settei-suru (“set”), two usages, <settei-suru 4> and <settei-suru 8>, are given as examples, as shown in example Ex32 of FIG. 20.

In example Ex32, <settei-suru 4> lists (“system”, 83), (“company”, 42), (“school”, 33), and (“boss”, 18) as the ga-case; (“meeting”, 41), (“participant”, 27), and (“travel time”, 10) as the wo-case; and (“PC (personal computer)”, 95), (“scheduler”, 72), and (“smartphone”, 33) as the de-case.

In example Ex32, <settei-suru 8> lists (“wife”, 40), (“daughter”, 33), (“son”, 28), and (“mother”, 13) as the ga-case; (“alarm”, 52), (“wake-up alarm”, 48), and (“timer”, 42) as the wo-case; and (“wake-up alarm”, 94), (“clock”, 42), (“mobile phone”, 35), and (“smartphone”, 19) as the de-case.

Further, example Ex32 gives <setto-suru 1> as a frame with a sense similar to settei-suru. It lists (“she”, 56), (“father”, 52), and (“wife”, 49) as the ga-case; (“timer”, 67), (“sleep timer”, 42), and (“alarm”, 41) as the wo-case; and (“rice cooker”, 52), (“air conditioner”, 45), (“radio”, 32), and (“mobile phone”, 12) as the de-case.

Example Ex32 also gives <kaizen-suru 15> (“improve”) as a case frame whose surface form differs from settei-suru and whose sense is not similar. It lists (“technique”, 102), (“quality”, 73), and (“process”, 67) as the ga-case; (“operation”, 81), (“performance”, 75), and (“alarm”, 2) as the wo-case; and (“replacement”, 58), (“ingenuity”, 49), and (“method”, 41) as the de-case.

The sentence generation process by replacement using surface case analysis will be described, again taking the original IND sentence “Set an alarm at 7 o'clock” as an example.

In the Action (predicate) fixed, Category (wo-case) replacement mode, when the predicate settei-suru is fixed, the case frame <settei-suru 4>, which does not have “alarm” in its wo-case, is selected. Instead of “alarm”, the wo-case of <settei-suru 4> contains words of different semantic classes, such as (“meeting”, 41), (“participant”, 27), and (“travel time”, 10). Generating new corpus sentences by substituting these words yields COOD sentence candidates in which the sense of settei-suru differs subtly, such as “Set participants at 7 o'clock”.

In the Category (wo-case) fixed, Action (predicate) replacement mode, when the wo-case word “alarm” is fixed, the case frames <setto-suru 1> and <kaizen-suru 15>, which have a different surface form but also take “alarm” in the wo-case, are selected. Both contain the same word “alarm” in the wo-case, but its frequency is 41 in <setto-suru 1> and only 2 in <kaizen-suru 15>. A predicate such as the latter is likely to differ subtly in sense, and its case frame contains many words of semantic classes that have little to do with the timer function. In this way, a predicate is selected that has the fixed word in the same argument slot and for which the value n, representing the frequency of the word or the strength of its association, is less than a threshold α.

Replacing the original predicate with kaizen-suru (“improve”) to generate a new corpus sentence yields the COOD sentence candidate “Improve the alarm at 7 o'clock”, whose sense differs subtly.

An existing dictionary may be used as the case frame dictionary 107. Because an existing general-purpose case frame dictionary may contain few of the words used for the service of the target domain, a user-defined case frame dictionary compiled from the words needed for the service may also be added.

For existing case frame dictionaries, see, for example, Daisuke Kawahara and Sadao Kurohashi, “A Fully-Lexicalized Probabilistic Model for Japanese Syntactic and Case Structure Analysis,” In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL2006), pp. 176-183, 2006, or “Case Frame Compilation from the Web using High-Performance Computing,” In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC2006), 2006. These dictionaries carry frequency information for the words in each argument of a predicate, which can be used as the strength of association between a predicate and an argument word described above.

The description now returns to the flowchart of FIG. 7.

In step S20, the replacement execution unit 105 reads out, from the stored search results, an unprocessed IND sentence as the IND sentence to be processed, and also reads out and accepts the replacement setting information and the search results stored in association with it.

In step S21, based on the IND determination sentence to be processed, the replacement setting information, and the search results, the replacement execution unit 105 generates a corpus by replacing the replacement target word of the IND determination sentence with a word from an unprocessed search result, and adjusts conjugation and word endings.

In step S22, the replacement execution unit 105 stores the generated corpus as a primary replacement-generated sentence.

In step S23, the replacement execution unit 105 determines whether any unprocessed IND sentence remains in the stored search results. If one remains, the process returns to step S20. That is, steps S20 to S23 are repeated until a corpus has been generated by replacement, based on the search results, for every IND sentence. If it is determined in step S23 that no unprocessed IND sentence remains, the process proceeds to step S24.

In step S24, the duplicate sentence elimination unit 106 reads out an unprocessed corpus from the corpora stored in step S22 and accepts it as the corpus to be processed.

In step S25, the duplicate sentence elimination unit 106 determines whether the corpus to be processed duplicates any corpus generated so far and stored in step S22. More specifically, the duplicate sentence elimination unit 106 searches the group of corpora stored as newly generated corpora for the corpus to be processed, and determines whether it is a duplicate sentence according to whether a match exists. If it is determined in step S25 that the corpus is a duplicate sentence, the process proceeds to step S26.

In step S26, the duplicate sentence elimination unit 106 regards the generated corpus as a duplicate sentence, and therefore as a discard determination sentence, and discards it.

On the other hand, if it is determined in step S25 that the generated corpus is not a duplicate sentence, the process proceeds to step S27.

In step S27, the duplicate sentence elimination unit 106 stores the replacement-generated corpus being processed in the replacement-generated sentence storage unit 109.

In step S28, the replacement execution unit 105 determines whether any unprocessed search result remains. If one remains, the process returns to step S24.

If it is determined in step S28 that no unprocessed search result remains, the process proceeds to step S29.

In step S29, the replacement execution unit 105 stores the corpora that have not been excluded as duplicate sentences in the replacement-generated sentence storage unit 109 as the final replacement-generated sentences.
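The duplicate check of steps S24 to S27 amounts to an order-preserving deduplication. The sketch below is an assumption about one straightforward way to do it (exact string match against the sentences kept so far); the function name `remove_duplicates` is hypothetical.

```python
def remove_duplicates(generated: list[str]) -> tuple[list[str], list[str]]:
    """Split generated sentences into those kept and those discarded as duplicates.

    A sentence is a duplicate if an identical sentence was already kept.
    The first occurrence is retained and the original order is preserved.
    """
    seen: set[str] = set()
    kept: list[str] = []
    discarded: list[str] = []
    for sentence in generated:
        if sentence in seen:
            discarded.append(sentence)   # corresponds to the discard in step S26
        else:
            seen.add(sentence)
            kept.append(sentence)        # corresponds to the storing in step S27
    return kept, discarded


kept, discarded = remove_duplicates(
    ["find a bomb nearby", "find a victim nearby", "find a bomb nearby"]
)
print(kept)
print(discarded)
```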

In step S30, the filtering processing unit 110 executes filtering processing and classifies the corpora consisting of the newly generated replacement-generated sentences stored in the replacement-generated sentence storage unit 109 into corpora of OOD determination sentences, COOD determination sentences, IND determination sentences, and CIND determination sentences. The filtering processing will be described in detail later with reference to the flowchart of FIG. 21.

With the above processing, a new corpus can be generated as a replacement-generated sentence by word replacement based on an IND sentence.

The above processing also generates Japanese OOD candidate sentences such as those shown in FIG. 15 and English OOD candidate sentences such as those shown in FIG. 19.

<Filtering processing>
Next, the filtering processing performed by the filtering processing unit 110 will be described with reference to the flowchart of FIG. 21.

In step S31, the semantic analyzer 131 accepts, as the corpus to be processed, an unprocessed corpus from among the corpora of replacement-generated sentences stored in the replacement-generated sentence storage unit 109.

In step S32, the semantic analyzer 131 determines whether the corpus to be processed is an IND determination sentence. If it is regarded as an IND determination sentence in step S32, the process proceeds to step S33.

In step S33, the semantic analyzer 131 stores the corpus to be processed in the IND determination sentence storage unit 132.

On the other hand, if the corpus is not regarded as an IND determination sentence in step S32, that is, if it is regarded as an OOD determination sentence, the process proceeds to step S34.

In step S34, the semantic analyzer 131 regards the replacement-generated sentence to be processed as an OOD determination sentence and stores it in the OOD determination sentence storage unit 136.

In step S35, the semantic analyzer 131 determines whether any unprocessed replacement-generated sentence remains in the replacement-generated sentence storage unit 109. If one remains, the process returns to step S31 and the subsequent processing is repeated. That is, until no unprocessed sentence remains, every replacement-generated sentence is checked to determine whether it is an IND determination sentence; IND determination sentences are stored in the IND determination sentence storage unit 132, and all others, as OOD determination sentences, are stored in the OOD determination sentence storage unit 136.

If it is determined in step S35 that no unprocessed replacement-generated sentence remains, the process proceeds to step S36. At this point, the group of replacement-generated sentences stored in the replacement-generated sentence storage unit 109 has been classified into IND determination sentences and OOD determination sentences by the semantic analyzer 131, which was generated by training on an older version of the corpus, and the two groups are stored in the IND determination sentence storage unit 132 and the OOD determination sentence storage unit 136, respectively.
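The loop of steps S31 to S35 is a two-way partition driven by the semantic analyzer. In the sketch below, the analyzer is passed in as a callable; `toy_analyzer` is a keyword-based stand-in invented purely for illustration and is not how the semantic analyzer 131 works.

```python
from typing import Callable, Iterable


def split_by_domain(
    sentences: Iterable[str], is_in_domain: Callable[[str], bool]
) -> tuple[list[str], list[str]]:
    """Partition sentences into (IND-judged, OOD-judged) lists."""
    ind_judged: list[str] = []
    ood_judged: list[str] = []
    for sentence in sentences:
        if is_in_domain(sentence):
            ind_judged.append(sentence)   # corresponds to step S33
        else:
            ood_judged.append(sentence)   # corresponds to step S34
    return ind_judged, ood_judged


def toy_analyzer(sentence: str) -> bool:
    """Hypothetical stand-in: treats any sentence mentioning an alarm as in-domain."""
    return "alarm" in sentence.lower()


ind_judged, ood_judged = split_by_domain(
    [
        "Set an alarm at 7 o'clock",
        "Set participants at 7 o'clock",
        "Improve the alarm at 7 o'clock",
    ],
    toy_analyzer,
)
print(ind_judged)
print(ood_judged)
```

Note that the stand-in judges “Improve the alarm at 7 o'clock” as in-domain. Sentences of this kind are what the subsequent COOD corpus extraction is meant to pick out of the IND-judged group.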

In step S36, the COOD corpus extraction unit 133 executes COOD corpus extraction processing, extracts COOD determination sentence candidates from the corpora regarded as IND determination sentences, and stores them in the COOD determination sentence storage unit 134. The remaining IND determination sentences are stored in the definite IND determination sentence storage unit 135 as definite IND determination sentences. At this time, any corpus regarded as an IND determination sentence that is judged to be a non-sentence is treated as a discard determination sentence and discarded.

The COOD corpus extraction processing will be described in detail later with reference to the flowchart of FIG. 22.

In step S37, the CIND corpus extraction unit 137 executes CIND corpus extraction processing, extracts CIND determination sentences from the corpora regarded as OOD determination sentences, and stores them in the CIND determination sentence storage unit 138. The remaining OOD determination sentences are stored in the definite OOD determination sentence storage unit 139 as definite OOD determination sentences. At this time, any corpus regarded as an OOD determination sentence that is judged to be a non-sentence is treated as a discard determination sentence and discarded.

The CIND corpus extraction processing will be described in detail later with reference to the flowchart of FIG. 26.

With the above processing, corpora regarded as COOD determination sentences, definite IND determination sentences, CIND determination sentences, and definite OOD determination sentences can be generated efficiently and in large quantities.

After this processing, the corpora regarded as COOD determination sentences, definite IND determination sentences, CIND determination sentences, and definite OOD determination sentences still require manual confirmation. However, because each has already been classified into one of these four groups, the confirmation workload can be reduced, and as a result the corpus development cost can be reduced. In addition, the frame estimation unit 13 and the semantic analysis unit 14 can improve recognition accuracy by training on a corpus consisting of the generated COOD determination sentences and CIND determination sentences.

<COOD corpus extraction processing>
Next, the COOD corpus extraction processing will be described with reference to the flowchart of FIG. 22. Extracting COOD determination sentences from the replacement-generated sentences is preferably done manually, but to improve work efficiency, the COOD candidate sentences can be narrowed down further by the filtering-based COOD corpus extraction processing described below.

In step S51, the COOD corpus extraction unit 133 accepts, as the corpus to be processed, any unprocessed corpus from among the IND determination sentences stored in the IND determination sentence storage unit 132.

In step S52, the COOD corpus extraction unit 133 controls the non-sentence determination unit 133a to calculate the Perplexity value of the corpus to be processed.

Here, the Perplexity value represents the average number of branches (candidates) for the word that follows a given word, expressed through the reciprocal of the n-gram probability. Compared with a sentence generated by randomly combining words, a meaningful sentence has higher joint probabilities between its words and fewer branches for adjacent words, so its Perplexity value is low. Conversely, a meaningless sentence has lower joint probabilities between its words and more branches for adjacent words, so its Perplexity value is high.

That is, some sentences generated by word replacement do not make sense. For example, “Tell me a restaurant with a good reputation near here” makes sense, but “Tell me a responsibility with a good reputation near here”, in which “restaurant” is replaced with “responsibility”, is hard to interpret naturally and feels odd. Put another way, the word “responsibility” rarely appears in sequence with words such as “reputation” and “good”.

The same holds for English. The sentence “break phone number again”, created by replacing “repeat” with “break” in “repeat phone number again”, likewise has no natural meaning; the probability that “phone number” follows “break” is extremely low. If the connection probabilities between words (n-grams) are learned in advance from a large amount of text (training data), the probabilistic plausibility of a generated sentence can be judged.

In other words, the Perplexity value can be regarded as an index for judging the probabilistic plausibility of a generated sentence.
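As a rough illustration of why an implausible word lowers the score of a sentence, the sketch below computes perplexity in its commonly used form, the inverse geometric mean of the per-word conditional probabilities. This is the textbook definition and is assumed here; the expression used in this disclosure may be written differently. The probability values are invented for illustration and are not taken from any trained model.

```python
import math


def perplexity(word_probabilities: list[float]) -> float:
    """Perplexity of a sentence given the conditional probability of each word.

    Computed as exp(-(1/N) * sum(log p_i)), i.e. the inverse geometric mean.
    Returns infinity if any word has zero probability.
    """
    if not word_probabilities:
        raise ValueError("at least one word probability is required")
    if any(p <= 0.0 for p in word_probabilities):
        return math.inf
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / len(word_probabilities))


# Invented probabilities for "repeat phone number again" ...
natural = [0.2, 0.5, 0.4, 0.3]
# ... and for "break phone number again", where the first step is very unlikely.
unnatural = [0.001, 0.5, 0.4, 0.3]

print(perplexity(natural))
print(perplexity(unnatural))
```

A single improbable transition is enough to raise the value noticeably, which is the property the non-sentence determination relies on.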

生成された文のperplexity値の具体的な計算方法は、例えば、以下の通りである。尚、Perplexity値の算出方法の詳細については、Daniel Jurafsky著の2016“Language Modeling with N-grams”Chapter4,https://web.stanford.edu/~jurafsky/slp3/4.pdf 2016を参照されたい。 A specific method of calculating the perplexity value of the generated sentence is as follows, for example. Please refer to Daniel Jurafsky's 2016 “Language Modeling with N-grams” Chapter 4, https://web.stanford.edu/~jurafsky/slp3/4.pdf 2016 for details on the method of calculating the Perplexity value. ..

A probabilistic language model is based on the idea that word strings are generated stochastically, and models the joint probability P(w) of a word string (sentence). There are various ways to model the joint probability P(w); modeling with n-grams is shown here as an example.

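$$P(w_1 w_2 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-1}) \quad \cdots (1)$$

$$P(w_1 w_2 \cdots w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-2}, w_{i-1}) \quad \cdots (2)$$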

Here, Equation (1) is the n-gram probability model for the bi-gram case (n=2), and Equation (2) is the n-gram probability model for the tri-gram case (n=3).
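The bi-gram case can be illustrated with a minimal sketch. It estimates P(w_i | w_{i-1}) by relative frequency from a toy tokenized corpus and multiplies the terms to obtain the joint probability of a sentence. The function names, the `<s>`/`</s>` boundary markers, and the three-sentence corpus are assumptions made for illustration and do not come from the disclosure.

```python
from collections import Counter


def train_bigram(sentences):
    """Count contexts and bigrams over tokenized sentences (hypothetical helper)."""
    contexts, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        contexts.update(padded[:-1])                  # every token that has a successor
        bigrams.update(zip(padded[:-1], padded[1:]))  # (previous word, word) pairs
    return contexts, bigrams


def bigram_prob(prev, word, contexts, bigrams):
    """Relative-frequency estimate of P(word | prev); 0.0 if prev was never seen."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]


def sentence_prob(tokens, contexts, bigrams):
    """Joint probability of a sentence under the bi-gram approximation."""
    padded = ["<s>"] + tokens + ["</s>"]
    prob = 1.0
    for prev, word in zip(padded[:-1], padded[1:]):
        prob *= bigram_prob(prev, word, contexts, bigrams)
    return prob


# Toy training data (assumed, for illustration only).
corpus = [
    ["repeat", "phone", "number", "again"],
    ["repeat", "phone", "number"],
    ["break", "the", "glass"],
]
contexts, bigrams = train_bigram(corpus)
natural = sentence_prob(["repeat", "phone", "number", "again"], contexts, bigrams)
replaced = sentence_prob(["break", "phone", "number", "again"], contexts, bigrams)
```

With unsmoothed estimates, the replaced sentence receives probability zero because "phone" never follows "break" in the toy data, which is the sparseness issue that smoothing is meant to address.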

In training the language model, the above n-gram parameters, that is, the n-gram probabilities of words, are learned from a large amount of training text (training data), such as text from Internet sites and news articles.

In such training, many word strings never appear in the corpus, so most n-gram probabilities would be zero (the sparseness problem). For this reason, smoothing (language model smoothing), which interpolates with non-zero values, and back-off processing are performed.

For smoothing and back-off processing, see Zhai & Lafferty, "A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval", 2001.
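As one hedged illustration of how smoothing removes zero probabilities, the sketch below linearly interpolates a bi-gram estimate with an add-k smoothed uni-gram estimate. This particular scheme, the weights `lam` and `k`, the `<unk>` bucket, and all function names are assumptions chosen for brevity; the disclosure does not state which smoothing or back-off method is used.

```python
from collections import Counter


def train_counts(sentences):
    """Collect unigram, context, and bigram counts (hypothetical helper)."""
    unigrams, contexts, bigrams = Counter(), Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded[1:])                   # words that can be predicted
        contexts.update(padded[:-1])                  # words that act as history
        bigrams.update(zip(padded[:-1], padded[1:]))
    return unigrams, contexts, bigrams


def interpolated_prob(prev, word, unigrams, contexts, bigrams, lam=0.7, k=1.0):
    """lam * P_bigram(word | prev) + (1 - lam) * P_unigram(word), never zero."""
    vocab_size = len(unigrams) + 1                    # +1 for one unknown-word bucket
    total = sum(unigrams.values())
    p_uni = (unigrams[word] + k) / (total + k * vocab_size)
    p_bi = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
    return lam * p_bi + (1.0 - lam) * p_uni


# Toy training data (assumed, for illustration only).
corpus = [
    ["repeat", "phone", "number", "again"],
    ["repeat", "phone", "number"],
    ["break", "the", "glass"],
]
unigrams, contexts, bigrams = train_counts(corpus)
p_seen = interpolated_prob("repeat", "phone", unigrams, contexts, bigrams)
p_unseen = interpolated_prob("break", "phone", unigrams, contexts, bigrams)
```

The unseen pair now has a small but non-zero probability, so a sentence containing it is scored as unlikely rather than impossible.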

Using the n-gram model trained in this way, the non-sentence determination unit 133a calculates the Perplexity value given by the following Equation (3) for each generated corpus.

$$\mathrm{PLL} = P(w_1 w_2 \cdots w_m)^{-\frac{1}{m}} \quad \cdots (3)$$

For example, as shown in example sentence 1) of example Ex41 in FIG. 23, the sentence 「この近くにある評判のいい責任を教えて」 ("Tell me a well-reviewed responsibility near here") is unnatural and does not make sense. In this case, the n-gram probability of 「責任」 (responsibility) following 「いい」 (good) is low, p(責任|いい)=1.90365e-05, so the Perplexity value PLL is 80.4152.

Likewise, as shown in example sentence 2) of example Ex41 in FIG. 23, the sentence 「この近くにある評判のいいサーフィンを教えて」 ("Tell me a well-reviewed surfing near here") is also unnatural and does not make sense. Here too, the n-gram probability of 「サーフィン」 (surfing) following 「いい」 is low, p(サーフィン|いい)=2.13532e-05, so the Perplexity value PLL is 70.6759.

By contrast, as shown in example sentence 3) of example Ex41 in FIG. 23, the sentence 「この近くにある評判のいい店舗を教えて」 ("Tell me a well-reviewed store near here") makes reasonably good sense. In this case, the n-gram probability of 「店舗」 (store) following 「いい」 is relatively high, p(店舗|いい)=0.000105223, so the Perplexity value PLL is 57.4806.

Similarly, as shown in example sentence 4) of example Ex41 in FIG. 23, the sentence 「この近くにある評判のいいマッサージを教えて」 ("Tell me a well-reviewed massage near here") makes sense. In this case, the n-gram probability of 「マッサージ」 (massage) following 「いい」 is relatively high, p(マッサージ|いい)=0.000378552, so the Perplexity value PLL is 57.0273.

Thus, for a corpus consisting of sentences that make sense, the n-gram probabilities are higher and the Perplexity value is smaller.
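The relationship between n-gram probabilities and the Perplexity value can be sketched as follows, assuming the common definition of perplexity as the inverse joint probability normalized by the number of predicted tokens. The probability table, the floor value for unseen pairs, and the function names are hypothetical and only serve to make the example runnable.

```python
import math


def perplexity(tokens, cond_prob):
    """Perplexity of a sentence given a function cond_prob(prev, word) > 0."""
    padded = ["<s>"] + tokens + ["</s>"]
    log_sum = 0.0
    for prev, word in zip(padded[:-1], padded[1:]):
        log_sum += math.log(cond_prob(prev, word))    # sum logs to avoid underflow
    predicted = len(padded) - 1                       # words plus the end marker
    return math.exp(-log_sum / predicted)


# Hypothetical bi-gram probabilities; pairs not listed get a small floor value.
TABLE = {
    ("<s>", "repeat"): 0.5,
    ("<s>", "break"): 0.5,
    ("repeat", "phone"): 0.5,
    ("phone", "number"): 0.5,
    ("number", "</s>"): 0.5,
}
FLOOR = 1e-4


def toy_prob(prev, word):
    return TABLE.get((prev, word), FLOOR)


ppl_natural = perplexity(["repeat", "phone", "number"], toy_prob)
ppl_replaced = perplexity(["break", "phone", "number"], toy_prob)
```

A single improbable word pair is enough to raise the perplexity of the replaced sentence well above that of the natural one, which mirrors the trend reported for FIG. 23.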

In step S53, the non-sentence determination unit 133a determines whether the processing target corpus is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold α.

If the Perplexity value PLL is larger than the predetermined threshold α in step S53, the process proceeds to step S55.

In step S55, the non-sentence determination unit 133a regards the processing target corpus as a non-sentence and discards it as a discard determination sentence, and the process proceeds to step S56.

On the other hand, if the Perplexity value PLL is not larger than the predetermined threshold α in step S53, the non-sentence determination unit 133a regards the processing target corpus as a sentence that makes sense, and the process proceeds to step S54.

In step S54, the non-sentence determination unit 133a stores the processing target corpus.

In step S56, the COOD corpus extraction unit 133 determines whether any unprocessed corpus remains among the IND determination sentences stored in the IND determination sentence storage unit 132. If one remains, the process returns to step S51. That is, steps S51 to S56 are repeated until the Perplexity value PLL has been calculated for every IND determination sentence and compared with the predetermined threshold α to determine whether the corpus is a sentence that makes sense rather than a non-sentence. When this determination has been made for all IND determination sentences and no unprocessed corpus is found in step S56, the process proceeds to step S57.
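The loop of steps S51 to S56 can be sketched as a simple threshold filter. In this sketch the Perplexity values are the four values quoted for FIG. 23 and are looked up from a table rather than computed; the threshold value 60.0 and the function name are assumptions made for illustration, since the disclosure does not give a concrete value of α.

```python
def filter_non_sentences(sentences, perplexity_of, alpha):
    """Split sentences into (kept, discarded) by comparing perplexity with alpha."""
    kept, discarded = [], []
    for sentence in sentences:            # S51: take the next unprocessed sentence
        ppl = perplexity_of(sentence)     # S52: obtain its Perplexity value
        if ppl > alpha:                   # S53: larger than the threshold?
            discarded.append(sentence)    # S55: treat as a non-sentence
        else:
            kept.append(sentence)         # S54: store as a sentence that makes sense
    return kept, discarded                # S56: no unprocessed sentence remains


# Perplexity values quoted for the example sentences of FIG. 23.
FIG23_PLL = {
    "この近くにある評判のいい責任を教えて": 80.4152,
    "この近くにある評判のいいサーフィンを教えて": 70.6759,
    "この近くにある評判のいい店舗を教えて": 57.4806,
    "この近くにある評判のいいマッサージを教えて": 57.0273,
}

# alpha = 60.0 is an assumed value, not one given in the disclosure.
kept, discarded = filter_non_sentences(list(FIG23_PLL), FIG23_PLL.get, alpha=60.0)
```

With this assumed threshold, the "store" and "massage" sentences are kept and the "responsibility" and "surfing" sentences are discarded.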

In step S57, the COOD corpus extraction unit 133 accepts, as the processing target corpus, any unprocessed corpus among the IND determination sentences that were stored after being regarded, through the processing of the non-sentence determination unit 133a in steps S52 and S53, as sentences that make sense rather than non-sentences.

In step S58, the COOD corpus extraction unit 133 controls the non-occurrence determination unit 133b to calculate the non-occurrence, in the target domain, of the words included in the processing target corpus.

Non-occurrence is an index of how many words a generated corpus contains that do not appear in the corpus group of the domain for which the semantic analyzer judged it to be an IND determination sentence.

For example, each of the sentences (corpora) shown in FIG. 24 contains a word with a low appearance frequency in the ALARM-CHANGE domain; that is, its non-occurrence is high and it falls under Close OOD, which also includes discard determination sentences. Here, however, non-sentences have already been removed in the processing that precedes the non-occurrence calculation, so in practice COOD determination sentences are extracted. In the following, words written in 『』 are words with high non-occurrence.

That is, in FIG. 24, the following are each regarded as COOD determination sentences because the word in 『』 has high non-occurrence: 『レジストリ』 (registry) in 「アラームの『レジストリ』をかえて」 ("Change the alarm's registry"); 『脚本』 (script) in 「『脚本』を8時にセットしなおして」 ("Reset the script to 8 o'clock"); 『食事』 (meal) in 「アラームの『食事』直してくれるかな」 ("Could you fix the alarm's meal"); 『文言』 (wording) in 「朝7時の『文言』を朝8時に変更してください」 ("Please change the 7 a.m. wording to 8 a.m."); 『ログファイル』 (log file) in 「朝6時半に鳴る『ログファイル』を7時頃に変更してください」 ("Please change the log file that rings at 6:30 a.m. to around 7 o'clock"); 『制度』 (system) in 「6時の『制度』を7時に変更して」 ("Change the 6 o'clock system to 7 o'clock"); 『考え』 (idea) in 「アラームの『考え』を7時に変えてください」 ("Please change the alarm's idea to 7 o'clock"); 『工期』 (construction period) in 「夕方5時の『工期』を5時半に変更して」 ("Change the 5 p.m. construction period to 5:30"); 『設計』 (design) in 「午前6時半の『設計』を午前8時半に変更してください」 ("Please change the 6:30 a.m. design to 8:30 a.m."); 『価格』 (price) in 「朝7時の『価格』を8時に変更して」 ("Change the 7 a.m. price to 8 o'clock"); and 『献立』 (menu) in 「7時の『献立』を7:30に変えて」 ("Change the 7 o'clock menu to 7:30").

Such non-occurrence can be quantified, for example, by using the number of words in the processing target corpus that do not appear in the target domain.

For example, let n be the total number of words in a processing target corpus consisting of an IND determination sentence, and let no be the number of those words that do not appear in the domain of the IND determination sentence (that is, words not contained in any corpus belonging to that domain other than the processing target corpus itself). The non-occurrence determination unit 133b then calculates no/n as the parameter representing non-occurrence.

In step S59, the non-occurrence determination unit 133b determines whether the parameter no/n, which represents the non-occurrence of the words in the processing target corpus in the domain made up of IND determination sentences, is larger than a predetermined threshold β.

If the parameter no/n is larger than the predetermined threshold β in step S59, that is, if the words in the processing target corpus have high non-occurrence in the domain made up of IND determination sentences, the process proceeds to step S60.

In step S60, the non-occurrence determination unit 133b regards the processing target corpus as a COOD determination sentence, extracts it, and stores it in the COOD determination sentence storage unit 134.

On the other hand, if the parameter no/n is not larger than the predetermined threshold β in step S59, that is, if the words in the processing target corpus have low non-occurrence in the domain made up of IND determination sentences, the process proceeds to step S61.

In step S61, the non-occurrence determination unit 133b regards the processing target corpus as a confirmed IND determination sentence and stores it in the confirmed IND determination sentence storage unit 135.

In step S62, the COOD corpus extraction unit 133 determines whether an unprocessed IND determination sentence exists. If one exists, the process returns to step S57. That is, steps S57 to S62, including the processing of the non-occurrence determination unit 133b in steps S58 and S59, are repeated until no unprocessed IND determination sentence remains.

When it is determined in step S62 that no unprocessed IND determination sentence exists, that is, that all IND determination sentences have been processed, the process ends.

Through the above processing, among the corpora of IND determination sentences that make up the domain, a corpus that is not a non-sentence (its Perplexity value does not exceed the threshold α) and whose words have high non-occurrence is regarded as a COOD determination sentence and stored in the COOD determination sentence storage unit 134, while a corpus that is not a non-sentence and whose words have low non-occurrence is regarded as a confirmed IND determination sentence and stored in the confirmed IND determination sentence storage unit 135.
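The no/n parameter and the branch of steps S59 to S61 can be sketched as follows. The sketch assumes sentences are already tokenized, uses English toy sentences in place of the ALARM-CHANGE corpus, and uses an assumed threshold β = 0.1; the function names are hypothetical.

```python
def non_occurrence_ratio(target_index, domain_sentences):
    """no/n for one sentence, using all other sentences of the domain as reference."""
    target = domain_sentences[target_index]
    vocab = set()
    for i, tokens in enumerate(domain_sentences):
        if i != target_index:             # exclude the processing target itself
            vocab.update(tokens)
    n = len(target)                       # total number of words in the target
    if n == 0:
        return 0.0
    no = sum(1 for word in target if word not in vocab)
    return no / n


def classify_ind_sentences(domain_sentences, beta):
    """Return (COOD sentences, confirmed IND sentences) for an assumed threshold beta."""
    cood, confirmed_ind = [], []
    for i, tokens in enumerate(domain_sentences):
        if non_occurrence_ratio(i, domain_sentences) > beta:   # S59
            cood.append(tokens)                                # S60
        else:
            confirmed_ind.append(tokens)                       # S61
    return cood, confirmed_ind


# Toy IND determination sentences (assumed, tokenized for illustration).
domain = [
    ["change", "the", "alarm", "to", "8"],
    ["change", "the", "alarm", "to", "7"],
    ["set", "the", "alarm", "to", "8"],
    ["set", "the", "alarm", "to", "7"],
    ["change", "the", "registry", "to", "8"],
]
cood, confirmed_ind = classify_ind_sentences(domain, beta=0.1)
```

Only the sentence containing "registry" has a word that no other sentence of the domain uses, so it is the only one routed to the COOD side.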

Although the above describes an example in which non-occurrence is expressed by the parameter no/n, any other value that can express non-occurrence may be used; for example, a TF (Term Frequency)/IDF (Inverse Document Frequency) value may be used.

Here, the TF value is an index for analyzing the words that characterize each document (here, each domain) when there are multiple documents (here, multiple domains), and is expressed by the following Equation (4).

...(4)

The IDF value is an index of whether each word is used in common across documents, and is expressed by the following Equation (5).

...(5)

Further, when the TF/IDF value is used, let nlw be the number of words whose appearance frequency in the important-word list, that is, the list of words that appear frequently and predominantly in the target domain of the IND determination sentences, is smaller than the threshold β (0≦β≦1), or the number of words that are not in the important-word list. In step S59, the non-occurrence determination unit 133b then calculates nlw/n as the parameter indicating non-occurrence.

In step S59, the non-occurrence determination unit 133b determines whether the parameter nlw/n, which represents the non-occurrence in the predetermined domain of the words in the processing target corpus, is larger than a predetermined threshold γ.

If the parameter nlw/n is larger than the predetermined threshold γ in step S59, the process proceeds to step S60, and the non-occurrence determination unit 133b regards the processing target corpus as a COOD determination sentence, extracts it, and stores it in the COOD determination sentence storage unit 134.

On the other hand, if the parameter nlw/n is not larger than the predetermined threshold γ in step S59, the process proceeds to step S61, and the non-occurrence determination unit 133b regards the processing target corpus as a confirmed IND determination sentence and stores it in the confirmed IND determination sentence storage unit 135.

<TF value and TF/IDF value>
For example, the TF values and TF/IDF values obtained from the example corpus of IND determination sentences for the ALARM-CHANGE domain, shown in example Ex51 of FIG. 25, are shown in examples Ex52 and Ex53 of FIG. 25, respectively.

That is, the corpus group of example Ex51 consists of, in order from the top: 「アラームを8時に変更して」 ("Change the alarm to 8 o'clock"), 「アラーム編集してくれる」 ("Can you edit the alarm"), 「アラーム設定しなおして」 ("Set the alarm again"), 「7時から8時にアラームを変更して」 ("Change the alarm from 7 o'clock to 8 o'clock"), 「7時のアラームを8時に変えてもらえる」 ("Can you change the 7 o'clock alarm to 8 o'clock"), 「アラーム8時に変更して」 ("Change alarm to 8 o'clock"), 「目覚まし8時に変えて目覚ましの時間変更して」 ("Change the wake-up alarm to 8 o'clock, change the wake-up alarm time"), 「目覚ましの時間変更してもらえる」 ("Can you change the wake-up alarm time"), 「目覚まし変更して」 ("Change the wake-up alarm"), 「10時のアラーム7時にして」 ("Make the 10 o'clock alarm 7 o'clock"), 「10時のアラームを11時に変えて」 ("Change the 10 o'clock alarm to 11 o'clock"), ..., 「６時に設定したアラームを５時３０分に変えたい」 ("I want to change the alarm set for 6 o'clock to 5:30"), 「毎日７時に鳴らすのを６時３０分に設定を変えて」 ("Change the setting of the one that rings every day at 7 o'clock to 6:30"), 「毎回７時に鳴らすのを６時３０分に設定を変更して」 ("Change the setting of the one that rings every time at 7 o'clock to 6:30"), and 「土日朝のアラームを編集」 ("Edit the Saturday and Sunday morning alarm").

As shown in example Ex52 in the figure, the TF values of the words in the corpus group of example Ex51 are, in descending order of TF value from the top: 「変更」 (change) 351, 「７」 334, 「８」 260, 「変え」 (change, verb stem) 258, 「時間」 (time) 220, 「設定」 (setting) 159, 「６」 152, 「し」 (do) 148, 「目覚まし」 (wake-up alarm) 110, 「午前」 (a.m.) 64, 「セット」 (set) 56, 「願い」 (request) 55, ..., 「目覚し」 (wake-up alarm, variant spelling) 6, and 「目覚まし時計」 (alarm clock) 5.

As shown in example Ex53 in the figure, the TF/IDF values of the words in the corpus group of example Ex51 are, in descending order of TF/IDF value from the top: 「アラーム」 (alarm) 0.00504379225545, 「セット」 (set) 0.00328857409316, 「起こ」 (wake, verb stem) 0.00110030484831, 「明日」 (tomorrow) 0.000795915410064, 「目覚まし」 (wake-up alarm) 0.000763298323913, 「起き」 (get up, verb stem) 0.000622699996573, 「鳴ら」 (make ring, verb stem) 0.00060708690425, 「朝」 (morning) 0.000521901615019, 「設定」 (setting) 0.000466290476509, 「かけ」 (set, verb stem) 0.000336399933349, 「鳴」 (ring, verb stem) 0.000297198910881, 「めざまし」 (wake-up alarm, hiragana spelling) 0.000223592438318, 「おこ」 (wake, verb stem, hiragana spelling) 0.000196205208903, 「目覚まし時計」 (alarm clock) 0.000185042017918, 「起床」 (getting up) 0.000175552029018, 「願い」 (request) 0.000124433590006, 「時間」 (time) 0.000107951491631, 「午前」 (a.m.) 0.000102767940411, and so on.

That is, a word with a high TF value or TF/IDF value can be regarded as a word with a high appearance frequency (low non-occurrence) and high importance. Accordingly, among the corpora of IND determination sentences, a corpus containing many words whose TF value or TF/IDF value is at or below a certain threshold is likely to be a COOD determination sentence. COOD determination sentences are therefore obtained as those corpora, among the IND determination sentences, that contain many words not included in the group of words with high TF/IDF values.
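A minimal sketch of this idea follows. Because Equations (4) and (5) are not reproduced in this text, the sketch assumes the commonly used definitions (TF as the relative frequency of a word within a domain, IDF as the logarithm of the number of domains divided by the number of domains containing the word). It then builds an important-word list from words with a positive TF/IDF score and computes nlw/n as the share of words in a sentence that are absent from that list. The domain names other than ALARM-CHANGE, the toy tokens, the cut-off of zero, and the function names are all assumptions.

```python
import math
from collections import Counter


def tf_idf(domains):
    """domains: dict of domain name -> token list. Returns name -> {word: score}."""
    num_domains = len(domains)
    doc_freq = Counter()
    for tokens in domains.values():
        doc_freq.update(set(tokens))              # count each word once per domain
    scores = {}
    for name, tokens in domains.items():
        counts = Counter(tokens)
        total = len(tokens)
        scores[name] = {
            word: (count / total) * math.log(num_domains / doc_freq[word])
            for word, count in counts.items()
        }
    return scores


def nlw_ratio(tokens, important_words):
    """nlw/n: share of words in the sentence that are not in the important-word list."""
    if not tokens:
        return 0.0
    nlw = sum(1 for word in tokens if word not in important_words)
    return nlw / len(tokens)


# Toy domains (assumed, for illustration only).
domains = {
    "ALARM-CHANGE": ["alarm", "change", "alarm", "time"],
    "WEATHER": ["weather", "time", "tomorrow"],
}
scores = tf_idf(domains)
important = {w for w, s in scores["ALARM-CHANGE"].items() if s > 0.0}
ratio = nlw_ratio(["change", "the", "registry"], important)
```

In the toy data, "time" appears in both domains and therefore scores zero, while "alarm" and "change" are specific to ALARM-CHANGE and form the important-word list.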

<CIND corpus extraction processing>
Next, the CIND corpus extraction processing will be described with reference to the flowchart of FIG. 26.

In step S101, the CIND corpus extraction unit 137 accepts, as the processing target corpus, an unprocessed corpus among the corpora of OOD determination sentences stored in the OOD determination sentence storage unit 136.

In step S102, the CIND corpus extraction unit 137 controls the non-sentence determination unit 137a to calculate the Perplexity value of the processing target corpus.

In step S103, the non-sentence determination unit 137a determines whether the processing target corpus is a non-sentence based on whether its calculated Perplexity value is larger than a predetermined threshold α.

If the Perplexity value PLL is larger than the predetermined threshold α in step S103, the process proceeds to step S105.

In step S105, the non-sentence determination unit 137a regards the processing target corpus as a non-sentence and discards it as a discard determination sentence, and the process proceeds to step S106.

On the other hand, if the Perplexity value PLL is not larger than the predetermined threshold α in step S103, the non-sentence determination unit 137a regards the processing target corpus as a sentence that makes sense, and the process proceeds to step S104.

In step S104, the non-sentence determination unit 137a stores the processing target corpus.

In step S106, the CIND corpus extraction unit 137 determines whether any unprocessed corpus remains among the OOD determination sentences stored in the OOD determination sentence storage unit 136. If one remains, the process returns to step S101. That is, steps S101 to S106 are repeated until the Perplexity value PLL has been calculated for every OOD determination sentence and compared with the predetermined threshold α to determine whether the corpus is a sentence that makes sense rather than a non-sentence. When this determination has been made for all OOD determination sentences and no unprocessed corpus is found in step S106, the process proceeds to step S107.

In step S107, the CIND corpus extraction unit 137 accepts, as the processing target corpus, any unprocessed corpus among the OOD determination sentences that were stored after being regarded, through the processing of the non-sentence determination unit 137a in steps S102 and S103, as sentences that make sense rather than non-sentences.

In step S108, the CIND corpus extraction unit 137 controls the non-occurrence determination unit 137b to calculate the parameter no/n, which represents the non-occurrence, in the target domain, of the words included in the processing target corpus.

In step S109, the non-occurrence determination unit 137b determines whether the parameter no/n, which represents the non-occurrence in the predetermined domain of the words in the processing target corpus, is larger than a predetermined threshold β.

If the parameter no/n is larger than the predetermined threshold β in step S109, the process proceeds to step S110.

In step S110, the non-occurrence determination unit 137b regards the processing target corpus as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.

On the other hand, if the parameter no/n is not larger than the predetermined threshold β in step S109, the process proceeds to step S111.

In step S111, the non-occurrence determination unit 137b regards the processing target corpus as a CIND determination sentence and stores it in the CIND determination sentence storage unit 138.

In step S112, the CIND corpus extraction unit 137 determines whether an unprocessed OOD determination sentence exists. If one exists, the process returns to step S107. That is, steps S107 to S112 are repeated until no unprocessed OOD determination sentence remains.

When it is determined in step S112 that no unprocessed OOD determination sentence exists, that is, that all OOD determination sentences have been processed, the process ends.

Through the above processing, among the corpora of OOD determination sentences, a corpus that is not a non-sentence (its Perplexity value does not exceed the threshold α) and whose words have low non-occurrence is regarded as a CIND determination sentence and stored in the CIND determination sentence storage unit 138, while a corpus that is not a non-sentence and whose words have high non-occurrence is regarded as a confirmed OOD determination sentence and stored in the confirmed OOD determination sentence storage unit 139. Note that the domain to which a CIND determination sentence belongs cannot be determined automatically, so that judgment must ultimately be made manually. However, because the narrowing-down of candidates can be automated, the number of work steps can be reduced.
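The two stages of the CIND corpus extraction processing can be sketched together as follows: a Perplexity filter corresponding to steps S101 to S106, followed by the no/n test corresponding to steps S107 to S112. The stand-in Perplexity function, the thresholds α = 60.0 and β = 0.5, the toy sentences, and the function names are assumptions for illustration only.

```python
def non_occurrence_ratio(tokens, domain_sentences):
    """no/n of a sentence relative to the vocabulary of the target domain."""
    vocab = {word for sentence in domain_sentences for word in sentence}
    if not tokens:
        return 0.0
    return sum(1 for word in tokens if word not in vocab) / len(tokens)


def extract_cind(ood_sentences, domain_sentences, perplexity_of, alpha, beta):
    """Sort OOD determination sentences into discarded, CIND, and confirmed OOD."""
    result = {"discarded": [], "CIND": [], "confirmed OOD": []}
    plausible = []
    for tokens in ood_sentences:                      # S101 to S106
        if perplexity_of(tokens) > alpha:
            result["discarded"].append(tokens)        # non-sentence
        else:
            plausible.append(tokens)                  # sentence that makes sense
    for tokens in plausible:                          # S107 to S112
        if non_occurrence_ratio(tokens, domain_sentences) > beta:
            result["confirmed OOD"].append(tokens)    # S110
        else:
            result["CIND"].append(tokens)             # S111
    return result


def toy_perplexity(tokens):
    """Stand-in scorer (assumed): penalize a repeated 'the' as implausible."""
    return 100.0 if tokens.count("the") > 1 else 50.0


domain = [["change", "the", "alarm", "to", "8"], ["set", "the", "alarm", "again"]]
ood = [
    ["move", "the", "alarm", "to", "8"],
    ["play", "some", "jazz", "music"],
    ["alarm", "the", "the", "to"],
]
result = extract_cind(ood, domain, toy_perplexity, alpha=60.0, beta=0.5)
```

The sentence that shares most of its words with the target domain ends up as a CIND candidate, which would then be checked manually as described above.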

Although the above describes an example in which the parameter representing non-occurrence is expressed by no/n, any other value that can express non-occurrence may be used; for example, a TF (Term Frequency)/IDF (Inverse Document Frequency) value may be used.

Further, when the TF/IDF value is used, let nlw be the number of words whose appearance frequency in the important-word list, that is, the list of words that appear frequently and predominantly in the target domain of the IND determination sentences, is smaller than the threshold β (0≦β≦1), or the number of words that are not in the important-word list. In step S108, the non-occurrence determination unit 137b then calculates nlw/n as the parameter indicating non-occurrence.

In step S108, the non-occurrence determination unit 137b determines whether the parameter nlw/n, which represents the non-occurrence in the predetermined domain of the words in the processing target corpus, is larger than a predetermined threshold γ.

If the parameter nlw/n is larger than the predetermined threshold γ in step S108, the process proceeds to step S110, and the CIND corpus extraction unit 137 regards the processing target corpus as a confirmed OOD determination sentence and stores it in the confirmed OOD determination sentence storage unit 139.

On the other hand, if the parameter nlw/n is not larger than the predetermined threshold γ in step S108, the process proceeds to step S111, and the non-occurrence determination unit 137b regards the processing target corpus as a CIND determination sentence and stores it in the CIND determination sentence storage unit 138.

Note that the domain to which a CIND determination sentence belongs cannot be determined automatically, so that judgment must ultimately be made manually. However, because the narrowing-down of candidates can be automated, the number of work steps can be reduced.

Through the above processing, a large number of corpora can be generated, without manual work, from IND sentences produced by a predetermined means. This reduces the workload of corpus development and, in turn, the development cost.

The corpora are also generated already classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences).

As a result, training the frame estimation unit 13 and the semantic analysis unit 14 with a corpus containing more COOD determination sentences and CIND determination sentences makes it possible to learn from corpora distributed near the boundary between IND determination sentences and OOD determination sentences. Confusing expressions that lie near that boundary can therefore be recognized appropriately, which improves the accuracy of the semantic analysis unit.

In practice, the generated corpora are expected to require manual checking. However, because they are already classified into IND determination sentences (confirmed IND determination sentences), COOD determination sentences, CIND determination sentences, and OOD determination sentences (confirmed OOD determination sentences), and because non-sentences and compound sentences have been discarded, the checking workload is reduced, and the corpus development cost is reduced as a result.

Note that the processing order described above may be rearranged. For example, the COOD corpus extraction process of step S36 and the CIND corpus extraction process of step S37 may be performed in the opposite order. Likewise, within the COOD corpus extraction process and the CIND corpus extraction process, the non-sentence determination using the Perplexity value and the extraction of COOD determination sentences and CIND determination sentences using the parameter representing non-appearance may be performed in the opposite order.
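The reason the two checks can be swapped is that each depends only on the corpus itself, so applying them in either order keeps the same set. The following is a simplified sketch of the CIND branch only: the `Candidate` type, its precomputed `perplexity` and `non_appearance` fields, and the threshold values are all hypothetical, and the routing of rejected corpora to the confirmed OOD store is omitted.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Candidate:
    text: str
    perplexity: float      # assumed to be precomputed by some language model
    non_appearance: float  # assumed to be the precomputed ratio nlw / n


def extract_cind(candidates: List[Candidate],
                 max_perplexity: float,
                 gamma: float,
                 perplexity_first: bool = True) -> List[Candidate]:
    """Keep candidates that pass both checks, applying them in either order."""
    checks: List[Callable[[Candidate], bool]] = [
        # Non-sentence check: a Perplexity value above the limit is discarded.
        lambda c: c.perplexity <= max_perplexity,
        # Non-appearance check: not larger than gamma is kept as CIND.
        lambda c: c.non_appearance <= gamma,
    ]
    if not perplexity_first:
        checks.reverse()
    kept = list(candidates)
    for check in checks:
        kept = [c for c in kept if check(c)]
    return kept


pool = [
    Candidate("sentence A", perplexity=50.0, non_appearance=0.2),
    Candidate("sentence B", perplexity=500.0, non_appearance=0.2),
    Candidate("sentence C", perplexity=40.0, non_appearance=0.9),
]
order_1 = extract_cind(pool, max_perplexity=100.0, gamma=0.5, perplexity_first=True)
order_2 = extract_cind(pool, max_perplexity=100.0, gamma=0.5, perplexity_first=False)
```

Both orders return the same candidates; the order mainly affects how much work each check has to do, which may matter if one of the two scores is expensive to compute.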

<Example of execution by software>
The series of processes described above can be executed by hardware, but can also be executed by software. When the series of processes is executed by software, the program constituting the software is installed from a recording medium into a computer built into dedicated hardware or into, for example, a general-purpose personal computer that can execute various functions when various programs are installed.

FIG. 27 shows a configuration example of a general-purpose personal computer. This personal computer has a built-in CPU (Central Processing Unit) 1001. An input/output interface 1005 is connected to the CPU 1001 via a bus 1004. A ROM (Read Only Memory) 1002 and a RAM (Random Access Memory) 1003 are connected to the bus 1004.

Connected to the input/output interface 1005 are an input unit 1006 composed of input devices such as a keyboard and a mouse with which the user enters operation commands, an output unit 1007 that outputs processing operation screens and images of processing results to a display device, a storage unit 1008 composed of a hard disk drive or the like that stores programs and various data, and a communication unit 1009 composed of a LAN (Local Area Network) adapter or the like that executes communication processing via a network typified by the Internet. Also connected is a drive 1010 that reads and writes data from and to a removable medium 1011 such as a magnetic disk (including a flexible disk), an optical disc (including a CD-ROM (Compact Disc-Read Only Memory) and a DVD (Digital Versatile Disc)), a magneto-optical disk (including an MD (Mini Disc)), or a semiconductor memory.

The CPU 1001 executes various processes in accordance with a program stored in the ROM 1002, or a program that has been read from the removable medium 1011 such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, installed in the storage unit 1008, and loaded from the storage unit 1008 into the RAM 1003. The RAM 1003 also stores, as appropriate, data that the CPU 1001 needs in order to execute the various processes.

In the computer configured as described above, the series of processes described above is performed when the CPU 1001, for example, loads the program stored in the storage unit 1008 into the RAM 1003 via the input/output interface 1005 and the bus 1004 and executes it.

The program executed by the computer (CPU 1001) can be provided by being recorded on the removable medium 1011 as, for example, a package medium. The program can also be provided via a wired or wireless transmission medium such as a local area network, the Internet, or digital satellite broadcasting.

In the computer, the program can be installed in the storage unit 1008 via the input/output interface 1005 by mounting the removable medium 1011 in the drive 1010. The program can also be received by the communication unit 1009 via a wired or wireless transmission medium and installed in the storage unit 1008. Alternatively, the program can be installed in advance in the ROM 1002 or the storage unit 1008.

Note that the program executed by the computer may be a program in which the processes are performed in time series in the order described in this specification, or a program in which the processes are performed in parallel or at the necessary timing, such as when a call is made.

Note that the CPU 1001 in FIG. 27 implements the functions of the semantic analyzer 131, the COOD corpus extraction unit 133, and the CIND corpus extraction unit 137. The storage unit 1008 implements the IND determination sentence storage unit 132, the OOD determination sentence storage unit 136, the COOD determination sentence storage unit 134, the confirmed IND determination sentence storage unit 135, the CIND determination sentence storage unit 138, and the confirmed OOD determination sentence storage unit 139.

In this specification, a system means a set of a plurality of constituent elements (devices, modules (parts), and the like), regardless of whether all the constituent elements are in the same housing. Accordingly, a plurality of devices housed in separate housings and connected via a network, and a single device in which a plurality of modules are housed in one housing, are both systems.

Note that the embodiments of the present disclosure are not limited to the embodiments described above, and various modifications can be made without departing from the gist of the present disclosure.

For example, the present disclosure can take a cloud computing configuration in which one function is shared and jointly processed by a plurality of devices via a network.

Each step described in the above flowcharts can be executed by one device or shared among and executed by a plurality of devices.

Furthermore, when one step includes a plurality of processes, the plurality of processes included in that one step can be executed by one device or shared among and executed by a plurality of devices.

Note that the present disclosure can also take the following configurations.
<1> An information processing apparatus including: a structure analysis unit that analyzes the structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence based on an analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing a word at the replacement location in the input sentence.
<2> The information processing apparatus according to <1>, wherein the input sentence is an IND (In Domain) determination sentence, which is utterance content that should be handled by a predetermined application program.
<3> The information processing apparatus according to <1> or <2>, wherein the structure analysis unit analyzes the predicate-argument structure of the input sentence, and the replacement location setting unit sets the replacement location in the input sentence based on the predicate-argument structure that is the analysis result of the structure analysis unit.
<4> The information processing apparatus according to <3>, further including a dictionary inquiry unit that queries a dictionary to search for candidates to replace the word at the replacement location in the input sentence, wherein the corpus generation unit replaces the word at the replacement location in the input sentence with a word found by the dictionary inquiry unit.
<5> The information processing apparatus according to <4>, wherein the dictionary is a case frame dictionary.
<6> The information processing apparatus according to <4>, wherein the replacement location setting unit sets, based on the predicate-argument structure that is the analysis result of the structure analysis unit, the replacement location in the input sentence and a replacement method for the replacement location, and the corpus generation unit generates a corpus by replacing the word at the replacement location in the input sentence using the replacement method.
<7> The information processing apparatus according to <6>, wherein the replacement method includes a first method that fixes the predicate of the input sentence and replaces a noun serving as a predicate argument including the target case, and a second method that fixes the predicate argument including the target case of the input sentence and replaces the predicate.
<8> The information processing apparatus according to any one of <1> to <7>, further including a classification unit that classifies the corpus generated by the corpus generation unit as an IND (In Domain) determination sentence, which is utterance content that should be handled by a predetermined application program, or as an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.
<9> The information processing apparatus according to <8>, further including a COOD determination sentence extraction unit that extracts, from the corpora classified as IND determination sentences, as COOD (Close OOD) determination sentences, those corpora that are OOD determination sentences and lie near the boundary with the IND determination sentences in the feature space expressed by their respective features.
<10> The information processing apparatus according to <9>, wherein, in a domain including the corpora classified as IND determination sentences, the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus in which the number of words not contained in itself and the other corpora is greater than a predetermined number.
<11> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in itself and the other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.
<12> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus of the domain that contains many words whose non-appearance, expressed by the TF/IDF of the word consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.
<13> The information processing apparatus according to <10>, wherein the COOD determination sentence extraction unit calculates a Perplexity value for the corpora of the domain and discards, as non-sentences, those whose Perplexity value is higher than a predetermined value.
<14> The information processing apparatus according to <8>, further including a CIND determination sentence extraction unit that extracts, from the corpora classified as OOD determination sentences, as CIND (Close IND) determination sentences, those corpora that are IND determination sentences and lie near the boundary with the OOD determination sentences in the feature space expressed by their respective features.
<15> The information processing apparatus according to <14>, wherein, in a domain including the corpora classified as OOD determination sentences, the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from all the corpora classified as OOD determination sentences, a corpus in which the number of words contained in the corpora classified as OOD determination sentences other than itself is greater than a predetermined number.
<16> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in the corpora other than itself to the number of words contained in the corpus of the domain, is lower than a predetermined number.
<17> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus of the domain for which the non-appearance of words, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.
<18> The information processing apparatus according to <15>, wherein the CIND determination sentence extraction unit calculates a Perplexity value for the corpora of the domain and discards, as non-sentences, those whose Perplexity value is higher than a predetermined value.
<19> An information processing method including the steps of: analyzing the structure of an input sentence; setting a replacement location in the input sentence based on an analysis result of the structure; and generating a corpus by replacing a word at the replacement location in the input sentence.
<20> A program that causes a computer to function as: a structure analysis unit that analyzes the structure of an input sentence; a replacement location setting unit that sets a replacement location in the input sentence based on an analysis result of the structure analysis unit; and a corpus generation unit that generates a corpus by replacing a word at the replacement location in the input sentence.
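As a rough illustration of how the structure analysis, replacement location setting, dictionary inquiry, and corpus generation in the configurations above fit together, the following toy sketch generates variants of an input sentence with the two replacement methods. Everything concrete in it is an assumption made for illustration: the two-part English sentence format, the `analyze` function, the method names, and the two small dictionaries standing in for a case frame dictionary. A real system would use a predicate-argument structure analyzer and an actual case frame dictionary.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PredicateArgument:
    predicate: str  # the predicate of the sentence
    argument: str   # the noun filling the target case


# Hypothetical stand-ins for lookups in a case frame dictionary.
SIMILAR_NOUNS: Dict[str, List[str]] = {"music": ["jazz", "radio"]}
SIMILAR_PREDICATES: Dict[str, List[str]] = {"play": ["start"]}


def analyze(sentence: str) -> PredicateArgument:
    """Toy structure analysis: assumes the form '<predicate> <argument>'."""
    predicate, argument = sentence.split(maxsplit=1)
    return PredicateArgument(predicate=predicate, argument=argument)


def generate(sentence: str, method: str) -> List[str]:
    """Generate new sentences by replacing one part of the input sentence.

    method="replace_argument":  fix the predicate, replace the noun.
    method="replace_predicate": fix the noun, replace the predicate.
    """
    pas = analyze(sentence)
    if method == "replace_argument":
        return [f"{pas.predicate} {noun}"
                for noun in SIMILAR_NOUNS.get(pas.argument, [])]
    if method == "replace_predicate":
        return [f"{verb} {pas.argument}"
                for verb in SIMILAR_PREDICATES.get(pas.predicate, [])]
    raise ValueError(f"unknown method: {method}")


by_argument = generate("play music", "replace_argument")
by_predicate = generate("play music", "replace_predicate")
```

The generated sentences would then be passed to the classification step, since a replacement can move a sentence out of the target domain.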

51 corpus generation device, 101 IND sentence reception unit, 102 language analysis unit, 103 replacement location setting unit, 104 dictionary inquiry unit, 105 replacement execution unit, 106 compound sentence elimination unit, 107 case frame dictionary, 108 generation condition setting data storage unit, 109 replacement-generated sentence storage unit, 110 filtering processing unit, 131 semantic analyzer, 132 IND determination sentence storage unit, 133 COOD corpus extraction unit, 133a non-sentence determination unit, 133b non-appearance determination unit, 134 COOD determination sentence storage unit, 135 confirmed IND determination sentence storage unit, 136 OOD determination sentence storage unit, 137 CIND corpus extraction unit, 137a non-sentence determination unit, 137b non-appearance determination unit, 138 CIND determination sentence storage unit, 139 confirmed OOD determination sentence storage unit

Claims

An information processing apparatus comprising:
a structure analysis unit that analyzes the structure of an input sentence;
a replacement location setting unit that sets a replacement location in the input sentence based on an analysis result of the structure analysis unit; and
a corpus generation unit that generates a corpus by replacing a word at the replacement location in the input sentence.

The information processing apparatus according to claim 1, wherein the input sentence is an IND (In Domain) determination sentence, which is utterance content that should be handled by a predetermined application program.

The information processing apparatus according to claim 1, wherein
the structure analysis unit analyzes the predicate-argument structure of the input sentence, and
the replacement location setting unit sets the replacement location in the input sentence based on the predicate-argument structure that is the analysis result of the structure analysis unit.

The information processing apparatus according to claim 3, further comprising a dictionary inquiry unit that queries a dictionary to search for candidates to replace the word at the replacement location in the input sentence, wherein
the corpus generation unit replaces the word at the replacement location in the input sentence with a word found by the dictionary inquiry unit.

The information processing apparatus according to claim 4, wherein the dictionary is a case frame dictionary.

The information processing apparatus according to claim 4, wherein
the replacement location setting unit sets, based on the predicate-argument structure that is the analysis result of the structure analysis unit, the replacement location in the input sentence and a replacement method for the replacement location, and
the corpus generation unit generates a corpus by replacing the word at the replacement location in the input sentence using the replacement method.

The information processing apparatus according to claim 6, wherein the replacement method includes a first method that fixes the predicate of the input sentence and replaces a noun serving as a predicate argument including the target case, and a second method that fixes the predicate argument including the target case of the input sentence and replaces the predicate.

The information processing apparatus according to claim 1, further comprising a classification unit that classifies the corpus generated by the corpus generation unit as an IND (In Domain) determination sentence, which is utterance content that should be handled by a predetermined application program, or as an OOD (Out of Domain) determination sentence, which is unexpected utterance content that should not be handled by the predetermined application program.

The information processing apparatus according to claim 8, further comprising a COOD determination sentence extraction unit that extracts, from the corpora classified as IND determination sentences, as COOD (Close OOD) determination sentences, those corpora that are OOD determination sentences and lie near the boundary with the IND determination sentences in the feature space expressed by their respective features.

The information processing apparatus according to claim 9, wherein, in a domain including the corpora classified as IND determination sentences, the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus in which the number of words not contained in itself and the other corpora is greater than a predetermined number.

The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in itself and the other corpora to the number of words contained in the corpus of the domain, is higher than a predetermined value.

The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit extracts from the domain, as the COOD determination sentence, a corpus of the domain that contains many words whose non-appearance, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined value.

The information processing apparatus according to claim 10, wherein the COOD determination sentence extraction unit calculates a Perplexity value for the corpora of the domain and discards, as non-sentences, those whose Perplexity value is higher than a predetermined value.

The information processing apparatus according to claim 8, further comprising a CIND determination sentence extraction unit that extracts, from the corpora classified as OOD determination sentences, as CIND (Close IND) determination sentences, those corpora that are IND determination sentences and lie near the boundary with the OOD determination sentences in the feature space expressed by their respective features.

The information processing apparatus according to claim 14, wherein, in a domain including the corpora classified as OOD determination sentences, the CIND determination sentence extraction unit extracts, as the CIND determination sentence, from all the corpora classified as OOD determination sentences, a corpus in which the number of words contained in the corpora classified as OOD determination sentences other than itself is greater than a predetermined number.

The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus whose non-appearance, expressed as the ratio of the number of words not contained in the corpora other than itself to the number of words contained in the corpus of the domain, is lower than a predetermined number.

The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit extracts, as the CIND determination sentence, a corpus of the domain for which the non-appearance of words, expressed by TF/IDF consisting of a TF (Term Frequency) value and an IDF (Inverse Document Frequency) value, is lower than a predetermined number.

The information processing apparatus according to claim 15, wherein the CIND determination sentence extraction unit calculates a Perplexity value for the corpora of the domain and discards, as non-sentences, those whose Perplexity value is higher than a predetermined value.

An information processing method comprising the steps of:
analyzing the structure of an input sentence;
setting a replacement location in the input sentence based on an analysis result of the structure; and
generating a corpus by replacing a word at the replacement location in the input sentence.

A program that causes a computer to function as:
a structure analysis unit that analyzes the structure of an input sentence;
a replacement location setting unit that sets a replacement location in the input sentence based on an analysis result of the structure analysis unit; and
a corpus generation unit that generates a corpus by replacing a word at the replacement location in the input sentence.