JP5244661B2

JP5244661B2 - Missing detection device and missing detection program for ending punctuation

Info

Publication number: JP5244661B2
Application number: JP2009065308A
Authority: JP
Inventors: 亮永田
Original assignee: 株式会社教育測定研究所
Priority date: 2009-03-17
Filing date: 2009-03-17
Publication date: 2013-07-24
Anticipated expiration: 2029-03-17
Also published as: JP2010218318A

Description

The present invention relates to a missing detection device and a missing detection program for sentence-ending punctuation. More specifically, it relates to a device and a program that, when some sentence-ending punctuation marks such as periods are missing from the constituent sentences of a document, can automatically detect the constituent sentences from which those marks are missing.

Conventionally, a document creation support device that detects errors in English spelling, grammar, and the like has been known (see Patent Document 1). This support device includes a sentence segmenting unit that divides a document written in English by a learner, which contains multiple constituent sentences, into individual sentences, and it detects the errors using each constituent sentence produced by the sentence segmenting unit. The sentence segmenting unit divides the document into individual sentences by detecting sentence-ending punctuation marks such as periods, commas, question marks, and exclamation marks in the document input by the learner.

A machine translation device that translates an input document while partially dividing it is also known (see Patent Document 2). This machine translation device includes a sentence extraction unit that cuts out each constituent sentence of the input source document one sentence at a time. When a constituent sentence cut out by the sentence extraction unit is long, the device divides it at predetermined positions according to division rules stored in advance and then translates each divided part. This sentence extraction unit likewise divides the document into individual sentences based on sentence-ending punctuation marks such as periods, commas, question marks, and exclamation marks in the document.

Patent Document 1: JP-A-8-30598
Patent Document 2: JP-A-8-235180

However, the support device and the machine translation device both assume that sentence-ending punctuation has been applied correctly, so they may fail to process a document from which some sentence-ending punctuation marks have been mistakenly omitted. In the support device of Patent Document 1, when such a document is input, a constituent sentence whose sentence-ending punctuation is missing cannot be separated into a single sentence, and the subsequent error detection processing therefore does not operate properly. In the machine translation device of Patent Document 2, inputting such a document may prevent accurate translation for the same reason. Accordingly, when some sentence-ending punctuation is missing from a document input to these devices, it is necessary to detect the constituent sentences from which it is missing.

The present invention has been devised in view of this problem. Its object is to provide a missing detection device and a missing detection program for sentence-ending punctuation that can automatically detect the constituent sentences lacking sentence-ending punctuation when some of the marks that should end the constituent sentences of a document have been mistakenly omitted.

To achieve this object, the present invention is a missing detection device that, for a target document consisting of multiple constituent sentences some of which lack sentence-ending punctuation, detects the constituent sentences lacking sentence-ending punctuation. The device comprises: comparison document creation means for creating a comparison document whose constituent sentences are each formed by joining two sentences that are adjacent across a sentence-ending punctuation mark in the target document; vector creation means for creating, for each constituent sentence of the target document and of the comparison document, a feature vector whose elements are multiple feature quantities derived from the number and types of words in that sentence; and missing identification means for detecting, based on the feature vectors created by the vector creation means, feature vectors of the target document that approximate the feature vectors of the comparison document, and identifying the constituent sentences of the detected feature vectors as constituent sentences lacking sentence-ending punctuation.

The device may further include a dictionary database in which the part of speech and inflected form of each word are stored. In that configuration, the vector creation means has: a cutout unit that cuts out the constituent sentences one at a time from the target document and the comparison document; a morphological analysis unit that, for each constituent sentence cut out by the cutout unit, identifies the part of speech and inflected form of each word in the sentence from the data in the dictionary database; a counting unit that counts the number of constituent sentences and words based on rules stored in advance; and a feature quantity calculation unit that calculates the feature quantities from the counts produced by the counting unit.

Furthermore, the feature quantities may consist of at least some of the following: a first feature quantity that is the probability of the sentence length per sentence; a second feature quantity that is the total number of words beginning with a capital letter; a third feature quantity that is the total number of words that are verbs; a fourth feature quantity that is the per-word count of verbs that can take a that-clause; a fifth feature quantity that is the total number of words that are conjunctions; a sixth feature quantity that is the per-word count of conjunctions; a seventh feature quantity that is the total number of words that are wh-form pronouns and adverbs; an eighth feature quantity that is the per-word count of wh-form pronouns and adverbs; a ninth feature quantity that is the total number of words that are first-person personal pronouns; a tenth feature quantity that is the per-word count of first-person personal pronouns; an eleventh feature quantity that is the total number of words that are prepositions; and a twelfth feature quantity that is the per-word count of prepositions.

The missing identification means may also use a pattern recognition method stored in advance to classify the feature vectors into a first data group containing the feature vectors of the target document and a second data group containing the feature vectors of the comparison document, detect the feature vectors of the target document that are included as noise in the second data group, and identify the constituent sentences of the detected feature vectors as constituent sentences lacking sentence-ending punctuation.

Furthermore, it is preferable that the missing identification means extracts a subset of the feature vectors obtained for the constituent sentences of the comparison document and identifies the constituent sentences lacking sentence-ending punctuation using the extracted feature vectors.

The present invention is also a program for causing a computer to execute processing that, for a target document consisting of multiple constituent sentences some of which lack sentence-ending punctuation, detects the constituent sentences lacking sentence-ending punctuation. The program causes the computer to function as: comparison document creation means for creating a comparison document whose constituent sentences are each formed by joining two sentences that are adjacent across a sentence-ending punctuation mark in the target document; vector creation means for creating, for each constituent sentence of the target document and of the comparison document, a feature vector whose elements are multiple feature quantities derived from the number and types of words in that sentence; and missing identification means for detecting, based on the feature vectors created by the vector creation means, feature vectors of the target document that approximate the feature vectors of the comparison document, and identifying the constituent sentences of the detected feature vectors as constituent sentences lacking sentence-ending punctuation.

According to the present invention, from a target document in which the sentence-ending punctuation of some constituent sentences is missing while the other constituent sentences are punctuated correctly, a comparison document is created that consists of constituent sentences from which sentence-ending punctuation has been deliberately removed, and the feature vectors obtained from the comparison document are used as the reference for missing sentence-ending punctuation. By detecting the feature vectors of constituent sentences of the target document that approximate these reference feature vectors, constituent sentences lacking sentence-ending punctuation can be detected automatically.

FIG. 1 is a block diagram showing the configuration of the missing detection system for sentence-ending punctuation according to the present embodiment.
FIG. 2(A) is a schematic diagram showing example text data of a target document, and FIG. 2(B) is a schematic diagram showing example text data of a comparison document, both used to explain the processing in the missing detection device.

Embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a block diagram showing the configuration of the missing detection system for sentence-ending punctuation according to the present embodiment. In this figure, the missing detection system 10 comprises an input device 11, into which a user inputs a target document consisting of multiple constituent sentences written in English, and a missing detection device 13 for sentence-ending punctuation, which, when a target document lacking some sentence-ending punctuation is input to the input device 11, detects the constituent sentences lacking sentence-ending punctuation in that document. Here, sentence-ending punctuation means a symbol such as a comma, exclamation mark, or question mark that is placed at the end of a sentence and separates adjacent constituent sentences.

The input device 11 consists of equipment such as a keyboard (not shown), but it is not limited to this. It may be a scanner device that reads a target document recorded on paper as image data and converts the image data into text data, or a device that can read text data stored on a storage medium. The input device 11 may also be a terminal connected to the missing detection device 13 over a network such as the Internet, in which case the data of the target document input at the terminal is transmitted to the missing detection device 13.

The missing detection device 13 is a computer comprising an arithmetic processing unit such as a CPU and storage devices such as memory and a hard disk. A program that causes the computer to function as each of the means described below is installed on it.

The missing detection device 13 comprises: target document storage means 15, which temporarily stores the target document input through the input device 11; comparison document creation means 17, which creates from the target document a comparison document used to detect missing sentence-ending punctuation; a dictionary database 18, in which the part of speech and inflected form of each word are stored; vector creation means 20, which creates, for each constituent sentence of the target document and of the comparison document, a feature vector whose elements are the feature quantities described below; and missing identification means 22, which identifies, based on the feature vectors created by the vector creation means 20, the constituent sentences of the target document that lack sentence-ending punctuation.

The comparison document creation means 17 detects the sentence-ending punctuation marks in the target document stored in the target document storage means 15, joins the two sentences adjacent across each mark, and removes the mark between them to form a single sentence. If the resulting sentence does not end with a period, a period is appended. The comparison document is created in this way, so every one of its constituent sentences is a sentence with sentence-ending punctuation missing partway through.
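The joining step described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the regular-expression splitter, and the restriction to `.`, `!`, and `?` are assumptions, and the sketch appends a period only when the joined sentence ends with no terminal mark at all, which is one reading of the description.

```python
import re

# Marks treated as sentence boundaries in this sketch (an assumption;
# the description also names the comma).
TERMINAL = ".!?"


def split_sentences(text):
    """Split text into sentences after a terminal mark followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_comparison_document(text):
    """Join each pair of adjacent sentences, dropping the mark between them."""
    sentences = split_sentences(text)
    merged = []
    for first, second in zip(sentences, sentences[1:]):
        # Remove the terminal mark that separated the two sentences.
        joined = first.rstrip(TERMINAL).rstrip() + " " + second
        # Append a period when the joined sentence has no terminal mark.
        if joined[-1] not in TERMINAL:
            joined += "."
        merged.append(joined)
    return merged


if __name__ == "__main__":
    doc = "I like tennis. I play every day. It is fun."
    for sentence in build_comparison_document(doc):
        print(sentence)
```

Pairing each sentence with its successor yields one fewer comparison sentence than there are target sentences, which matches the worked example in which four target sentences produce three comparison sentences.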

In the vector creation means 20, multiple feature quantities based on the number and types of words in a sentence are defined in advance. By calculating each feature quantity for each constituent sentence, a feature vector whose elements are those feature quantities is obtained for every constituent sentence of the target document and of the comparison document. Specifically, the vector creation means 20 comprises: a cutout unit 24, which cuts out the constituent sentences one at a time from the target document and the comparison document; a morphological analysis unit 25, which, for each constituent sentence cut out by the cutout unit 24, identifies the part of speech and inflected form of each word in the sentence from the data in the dictionary database 18; a counting unit 27, which counts the number of constituent sentences and words based on rules stored in advance; and a feature quantity calculation unit 28, which calculates each feature quantity from the counts produced by the counting unit 27.

The cutout unit 24 detects the sentence-ending punctuation marks in the target document and in the comparison document and splits the text before and after each mark, so that each document is cut into constituent sentences one sentence at a time.

For each constituent sentence cut out by the cutout unit 24, that is, each constituent sentence of the target document and of the comparison document, the morphological analysis unit 25 identifies the words of the sentence from the spaces in it, and identifies the part of speech and inflected form of each word based on the neighboring words and the data in the dictionary database 18.

The counting unit 27 counts the number of constituent sentences in the target document and, for each constituent sentence of the target document and of the comparison document, counts the following word-related quantities: the total number of words in the sentence; the total number of words beginning with a capital letter; the total number of words that are verbs; the per-word count of verbs that can take a that-clause; the total number of words that are conjunctions; the per-word count of conjunctions; the total number of words that are wh-form pronouns and adverbs; the per-word count of wh-form pronouns and adverbs; the total number of words that are first-person personal pronouns; the per-word count of first-person personal pronouns; the total number of words that are prepositions; and the per-word count of prepositions.

For the per-word counts of verbs that can take a that-clause, conjunctions, wh-form pronouns and adverbs, first-person personal pronouns, and prepositions, the relevant words are stored in advance as lists, and the number of occurrences of each listed word is counted for each constituent sentence. For example, if "in", "on", "of", "at", and so on are stored as the listed prepositions, then the per-word preposition count for each constituent sentence consists of the number of occurrences of "in", of "on", of "of", of "at", and so on.
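A minimal sketch of these list-based counts follows. The word list is only the four prepositions named in the example, and the tokenization (split on spaces, strip trailing punctuation, lowercase) and function names are assumptions made for illustration. The capitalized-word count is included because it needs no part-of-speech information.

```python
from collections import Counter

# Illustrative subset only; the description says the full lists are stored
# in advance but does not enumerate them.
PREPOSITIONS = ["in", "on", "of", "at"]


def tokenize(sentence):
    """Split on spaces and strip surrounding punctuation (a simplification)."""
    tokens = [t.strip(".,!?") for t in sentence.split()]
    return [t for t in tokens if t]


def per_word_counts(sentence, word_list):
    """Return one count per listed word, in list order."""
    counts = Counter(t.lower() for t in tokenize(sentence))
    return [counts[w] for w in word_list]


def count_capitalized(sentence):
    """Count words that begin with a capital letter."""
    return sum(1 for t in tokenize(sentence) if t[0].isupper())


if __name__ == "__main__":
    s = "I went to school in the morning and sat on a chair in class."
    print(per_word_counts(s, PREPOSITIONS))
    print(count_capitalized(s))
```

Each listed word contributes one vector element, so a list of k words yields a k-dimensional block of the feature vector.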

Based on the counts produced by the counting unit 27, the feature quantity calculation unit 28 obtains the following first to twelfth feature quantities for each constituent sentence of the target document and of the comparison document.

The first feature quantity is the probability p(l) of the sentence length per sentence, obtained by equation (1).

[Equation (1): not reproduced in this text]

Here, l is the total number of words in each constituent sentence counted by the counting unit 27, μ is the mean number of words per sentence in the target document, obtained by equation (2), and σ² is the unbiased variance of the number of words per sentence in the target document, obtained by equation (3).

$$\mu = \frac{1}{n}\sum_{i=1}^{n} l_i \qquad (2)$$

$$\sigma^2 = \frac{1}{n-1}\sum_{i=1}^{n} (l_i - \mu)^2 \qquad (3)$$

In equations (2) and (3), n is the number of constituent sentences in the target document counted by the counting unit 27, and l_i is the number of words in the i-th constituent sentence of the target document.

From equation (1), the probability p(l), which is the first feature quantity, is obtained for each constituent sentence of the target document and of the comparison document. The mean μ and variance σ² given by equations (2) and (3) are always computed from the number of constituent sentences n in the target document and the word counts l_i of its constituent sentences, even when p(l) is being obtained for a constituent sentence of the comparison document, so they are determined by the target document. One value of p(l) is obtained for each constituent sentence, so the first feature quantity is a one-dimensional vector element.
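The following sketch illustrates the calculation under a stated assumption. Equation (1) itself is not available in this text, so the sketch assumes that p(l) is the normal density with mean μ and variance σ²; that choice is a guess consistent with the use of a mean and an unbiased variance, not a confirmed reading of the patent. The function names are hypothetical.

```python
import math
from statistics import mean, variance


def length_stats(target_sentences):
    """Mean and unbiased variance of words per sentence in the target document."""
    lengths = [len(s.split()) for s in target_sentences]
    # statistics.variance uses the n - 1 denominator (unbiased estimate).
    return mean(lengths), variance(lengths)


def length_probability(l, mu, sigma2):
    """Assumed form of p(l): normal density with mean mu and variance sigma2."""
    return math.exp(-((l - mu) ** 2) / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)


if __name__ == "__main__":
    target = ["a b c", "a b c d e", "a b c d"]
    mu, sigma2 = length_stats(target)
    # The same mu and sigma2 are reused for comparison-document sentences.
    print(length_probability(4, mu, sigma2))
    print(length_probability(8, mu, sigma2))
```

Under this assumption a joined sentence, which is roughly twice the typical length, receives a much lower p(l) than an ordinary sentence. The sketch needs at least two target sentences and fails with a division by zero when all of them have the same length.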

The second to twelfth feature quantities are word-related counts for each constituent sentence of the target document and of the comparison document, and the numbers counted by the counting unit 27 are used as they are.

The second feature quantity is the total number of words beginning with a capital letter. One value is obtained for each constituent sentence, giving a one-dimensional vector element.

The third feature quantity is the total number of words that are verbs. One value is obtained for each constituent sentence, giving a one-dimensional vector element.

The fourth feature quantity is the per-word count of the preset verbs that can take a that-clause. For each constituent sentence one value is obtained per listed word, giving vector elements whose dimension equals the number of words in the list.

The fifth feature quantity is the total number of words that are conjunctions. One value is obtained for each constituent sentence, giving a one-dimensional vector element.

The sixth feature quantity is the per-word count of the preset conjunctions. For each constituent sentence one value is obtained per listed word, giving vector elements whose dimension equals the number of words in the list.

The seventh feature quantity is the total number of words that are wh-form pronouns and adverbs. One value is obtained for each constituent sentence, giving a one-dimensional vector element.

The eighth feature quantity is the per-word count of the preset wh-form pronouns and adverbs. For each constituent sentence one value is obtained per listed word, giving vector elements whose dimension equals the number of words in the list.

The ninth feature quantity is the total number of words that are first-person personal pronouns. One value is obtained for each constituent sentence, giving a one-dimensional vector element.

The tenth feature quantity is the per-word count of the preset first-person personal pronouns. For each constituent sentence one value is obtained per listed word, giving vector elements whose dimension equals the number of words in the list.

The eleventh feature quantity is the total number of words that are prepositions. One value is obtained for each constituent sentence, giving a one-dimensional vector element.

The twelfth feature quantity is the per-word count of the preset prepositions. For each constituent sentence one value is obtained per listed word, giving vector elements whose dimension equals the number of words in the list.

As described above, a feature vector whose elements are the first to twelfth feature quantities has on the order of several hundred dimensions, and such a feature vector is obtained for every constituent sentence of the target document and of the comparison document.

Any of the first to twelfth feature quantities may be omitted from the elements of the feature vector.
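How the twelve feature quantities combine into one vector can be sketched as a simple flattening step. The function name and the sample values are hypothetical; the sketch only shows that scalar features contribute one element each while per-word features contribute one element per listed word.

```python
def build_feature_vector(features):
    """Flatten feature quantities, given in order, into one flat vector.

    Each entry is either a single number (a total or a probability) or a
    list of per-word counts.
    """
    vector = []
    for value in features:
        if isinstance(value, (list, tuple)):
            vector.extend(value)   # per-word counts: one element per listed word
        else:
            vector.append(value)   # scalar feature: one element
    return vector


if __name__ == "__main__":
    # Made-up values for one sentence, in the order first to twelfth.
    features = [0.12, 2, 1, [0, 1], 1, [1, 0, 0], 0, [0, 0], 1, [1, 0], 2, [1, 1, 0, 0]]
    print(len(build_feature_vector(features)))
```

With seven scalar features, the dimension is seven plus the combined length of the five word lists, which is how lists of a few dozen to a few hundred words lead to vectors of several hundred dimensions. Omitting a feature quantity simply drops its entry from the input.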

The missing identification means 22 detects the feature vectors of constituent sentences of the target document that approximate the feature vectors of the constituent sentences of the comparison document, and identifies the constituent sentences of the detected feature vectors as constituent sentences lacking sentence-ending punctuation. Specifically, using a pattern recognition method stored in advance, it classifies the feature vectors created by the vector creation means 20 into a first data group containing the feature vectors of the target document and a second data group containing the feature vectors of the comparison document, detects the feature vectors of the target document that are included as noise in the second data group, and identifies the constituent sentences of those feature vectors as constituent sentences lacking sentence-ending punctuation. In other words, the feature vectors of the comparison document, whose sentence-ending punctuation has been deliberately removed, define a range of feature vectors that are likely to correspond to missing sentence-ending punctuation, and whether each constituent sentence of the target document lacks sentence-ending punctuation is judged by whether its feature vector falls within that range.

Examples of the pattern recognition method include known methods such as the support vector machine (SVM), naive Bayes, the cosine measure, and the k-nearest neighbor method. These methods are not the essence of the present invention, so their description is omitted.
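As one illustration, the k-nearest neighbor method named above could be applied as in the following sketch. The leave-one-out voting scheme, the Euclidean distance, the default `k=3`, and the two-element toy vectors are all assumptions made for this example; the description does not specify how any of the listed methods is configured.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flag_missing_punctuation(target_vecs, comparison_vecs, k=3):
    """Return indices of target vectors whose nearest neighbours are mostly
    comparison vectors (label 1) rather than other target vectors (label 0)."""
    flagged = []
    for i, vec in enumerate(target_vecs):
        # Leave the vector itself out of its own neighbourhood.
        neighbours = [(euclidean(vec, other), 0)
                      for j, other in enumerate(target_vecs) if j != i]
        neighbours += [(euclidean(vec, other), 1) for other in comparison_vecs]
        neighbours.sort(key=lambda pair: pair[0])
        nearest = neighbours[:k]
        votes = sum(label for _, label in nearest)
        if nearest and votes * 2 > len(nearest):
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Toy vectors: [word count, verb count] per sentence (made-up values).
    target = [[5, 1], [6, 1], [11, 2], [5, 2]]
    comparison = [[11, 2], [11, 3], [12, 3]]
    print(flag_missing_punctuation(target, comparison))
```

A target sentence is flagged when it sits among the comparison sentences in feature space, which corresponds to a target feature vector appearing as noise in the second data group.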

Next, the flow for identifying the constituent sentences lacking sentence-ending punctuation in a target document is described using a specific example.

First, when the user inputs the document data shown in FIG. 2(A) into the input device 11, the data is stored in the target document storage means 15.

Next, the comparison document creation means 17 detects the periods (sentence-ending punctuation) in the target document and joins the two sentences on either side of each period into one sentence, creating the comparison document shown in FIG. 2(B).

The vector creation means 20 then obtains the feature vector for each constituent sentence of the target document and of the comparison document. Specifically, the cutout unit 24 first cuts out the constituent sentences of the target document and of the comparison document based on the periods, as indicated by the broken lines in FIG. 2. In this example, the target document is cut into four constituent sentences and the comparison document into three. The morphological analysis unit 25 then performs morphological analysis on each word of the four constituent sentences of the target document and the three constituent sentences of the comparison document. The counting unit 27 counts, for each constituent sentence of the target document and of the comparison document, the quantities needed for the first to twelfth feature quantities, and the feature quantity calculation unit 28 calculates the first to twelfth feature quantities for each constituent sentence, yielding a feature vector for each.

Next, the missing specifying means 22 judges whether each of the four feature vectors obtained for the constituent sentences of the target document approximates the three feature vectors obtained for the constituent sentences of the comparison document. This approximation judgment uses a pre-stored pattern recognition technique such as a support vector machine (SVM). In the example above, the feature vector of the third constituent sentence from the beginning of the target document in FIG. 2(A), "I went to ... very hard.", is judged to approximate the feature vectors of the constituent sentences of the comparison document, and that constituent sentence is thereby identified as highly likely to be missing a period within it.
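
The approximation judgment can be illustrated with a much simpler classifier than an SVM. The sketch below swaps in a nearest-centroid rule purely for illustration: a target vector is flagged when it lies closer to the mean of the comparison vectors than to the mean of the target vectors. The function names `centroid` and `flag_missing` and the sample vectors are assumptions made for this example.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def flag_missing(target_vectors, comparison_vectors):
    """Return indices of target vectors that resemble the comparison group.

    Nearest-centroid stand-in for the pattern recognition step; it is not
    a support vector machine.
    """
    c_target = centroid(target_vectors)
    c_comparison = centroid(comparison_vectors)
    flagged = []
    for index, vector in enumerate(target_vectors):
        if math.dist(vector, c_comparison) < math.dist(vector, c_target):
            flagged.append(index)
    return flagged


if __name__ == "__main__":
    target = [[1, 0], [1, 1], [4, 2], [1, 0]]   # four target sentences
    comparison = [[3, 2], [5, 3], [5, 2]]       # three merged sentences
    print(flag_missing(target, comparison))     # index 2 is the third sentence
```

With these made-up vectors only the third target sentence is flagged, mirroring the example in which the third constituent sentence is identified as likely to be missing a period.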

A constituent sentence judged in this way to be highly likely to be missing a period is output to another device (not shown), which enables processing such as displaying it on a screen or announcing it by voice.

As described above, a comparison document is created from constituent sentences obtained by removing sentence-ending punctuation from the constituent sentences of the target document, the feature vectors of the comparison document are treated as feature vectors of constituent sentences that are missing a period, and a missing period in a constituent sentence of the target document is judged against those feature vectors. Consequently, constituent sentences with a missing period can be identified from the target document alone, without building a document database for judging missing periods.

Although the vector creation means 20 obtains a feature vector for every constituent sentence of the comparison document, it is also possible to randomly extract only a preset number of those feature vectors and to judge the approximation of the feature vectors of the target document's constituent sentences on the basis of the extracted feature vectors. Doing so reduces the noise associated with the comparison document's feature vectors when the number of its constituent sentences becomes very large, which helps prevent errors in the approximation judgment and can improve the accuracy of detecting missing sentence-ending punctuation.
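
The random extraction described here could look like the sketch below. The function name `sample_comparison_vectors` and its `seed` parameter are hypothetical; the text specifies only that a preset number of feature vectors is drawn at random.

```python
import random


def sample_comparison_vectors(vectors, max_count, seed=None):
    """Return at most max_count vectors drawn at random without replacement."""
    if len(vectors) <= max_count:
        return list(vectors)
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(vectors, max_count)


if __name__ == "__main__":
    comparison = [[i, i % 3] for i in range(1000)]
    subset = sample_comparison_vectors(comparison, 50, seed=0)
    print(len(subset))
```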

In the embodiment above, the document to be examined by the missing detection device 13 is described as an English document, but the present invention is not limited to this; documents in other languages can also be examined using the same logic.

In addition, the configurations of the present invention are not limited to the examples described above, and various modifications are possible as long as they provide substantially the same effect.

The present invention can be used together with a document evaluation apparatus that automatically scores documents such as compositions and theses written by learners, or with an automatic translation apparatus that translates a document entered by a user into another language, and contributes to improving the processing accuracy of such document evaluation and automatic translation apparatuses.

DESCRIPTION OF SYMBOLS
10 Missing detection system
13 Missing detection device
17 Comparison document creation means
18 Dictionary database
20 Vector creation means
22 Missing specifying means
24 Cutout unit
25 Morphological analysis unit
27 Counting unit
28 Feature amount calculation unit

Claims

A missing detection device that detects, in a target document composed of a plurality of constituent sentences from some of which sentence-ending punctuation is missing, the constituent sentences in which the sentence-ending punctuation is missing, the device comprising:
comparison document creation means for creating a comparison document composed of constituent sentences each formed by combining into one sentence two sentences that are adjacent to each other across a sentence-ending punctuation mark present in the target document; vector creation means for creating, for the constituent sentences of each of the target document and the comparison document, feature vectors whose vector elements are a plurality of feature amounts obtained by focusing on the number of words and the types of words in each constituent sentence; and missing specifying means for detecting, on the basis of the feature vectors created by the vector creation means, a feature vector of the target document that approximates the feature vectors of the comparison document, and identifying the constituent sentence of the detected feature vector as a constituent sentence in which the sentence-ending punctuation is missing.

The device for detecting missing sentence-ending punctuation according to claim 1, further comprising a dictionary database in which parts of speech and inflected forms are stored for each word,
wherein the vector creation means includes: a cutout unit that cuts out the constituent sentences one by one from the target document and the comparison document; a morphological analysis unit that, for each constituent sentence cut out by the cutout unit, identifies the part of speech and inflected form of each word in the constituent sentence from the data in the dictionary database; a counting unit that counts the number of sentences and the words of the constituent sentences on the basis of pre-stored rules; and a feature amount calculation unit that calculates the feature amounts from the counting results of the counting unit.

The device for detecting missing sentence-ending punctuation according to claim 2, wherein the feature amounts comprise at least some of: a first feature amount that is the probability of the sentence length per sentence; a second feature amount that is the total number of words beginning with a capital letter; a third feature amount that is the total number of words that are verbs; a fourth feature amount that is the count, for each word, of verbs that can take a that-clause; a fifth feature amount that is the total number of words that are conjunctions; a sixth feature amount that is the count of conjunctions for each word; a seventh feature amount that is the total number of words that are wh-form pronouns and adverbs; an eighth feature amount that is the count of wh-form pronouns and adverbs for each word; a ninth feature amount that is the total number of words that are first-person personal pronouns; a tenth feature amount that is the count of first-person personal pronouns for each word; an eleventh feature amount that is the total number of words that are prepositions; and a twelfth feature amount that is the count of prepositions for each word.

The device for detecting missing sentence-ending punctuation according to claim 1, 2, or 3, wherein the missing specifying means uses a pre-stored pattern recognition technique to classify the feature vectors into a first data group containing the feature vectors of the target document and a second data group containing the feature vectors of the comparison document, detects feature vectors of the target document that are included as noise in the second data group, and identifies the constituent sentences of the detected feature vectors as constituent sentences in which the sentence-ending punctuation is missing.

The device for detecting missing sentence-ending punctuation according to any one of claims 1 to 4, wherein the missing specifying means extracts a subset of the feature vectors obtained for the constituent sentences of the comparison document and uses the extracted feature vectors to identify the constituent sentences in which the sentence-ending punctuation is missing.

A program for detecting missing sentence-ending punctuation, the program causing a computer to execute processing for detecting, in a target document composed of a plurality of constituent sentences from some of which sentence-ending punctuation is missing, the constituent sentences in which the sentence-ending punctuation is missing,
the program causing the computer to function as: comparison document creation means for creating a comparison document composed of constituent sentences each formed by combining into one sentence two sentences that are adjacent to each other across a sentence-ending punctuation mark present in the target document; vector creation means for creating, for the constituent sentences of each of the target document and the comparison document, feature vectors whose vector elements are a plurality of feature amounts obtained by focusing on the number of words and the types of words in each constituent sentence; and missing specifying means for detecting, on the basis of the feature vectors created by the vector creation means, a feature vector of the target document that approximates the feature vectors of the comparison document, and identifying the constituent sentence of the detected feature vector as a constituent sentence in which the sentence-ending punctuation is missing.