JPH04211868A - Method for creating keyword for retrieval of CD-ROM data

Info

Publication number: JPH04211868A
Application number: JP2202974A
Authority: JP
Inventors: Masa Saito (斎藤 雅); Hiroshi Teranishi (寺西 浩); Takahiro Nakajima (中島 孝浩)
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 1990-07-31
Filing date: 1990-07-31
Publication date: 1992-08-03

Abstract

PURPOSE: To automatically create keywords for retrieval of CD-ROM data by using natural language processing technology, a field of AI (artificial intelligence). CONSTITUTION: Input processing 101 copies the natural language processing system input data magnetic tape to a disk file as input data 102, checks kanji codes and the like, and then converts the data into records for Japanese processing. Output processing 120 copies the processing result file on disk to the natural language processing output magnetic tape as processing result data 121. A driver 103 classifies and analyzes the input data 102, controls a Japanese processing system 110, obtains the results of word segmentation, kana assignment (giving the readings of kanji), and keyword extraction, and edits and outputs the processing results. The Japanese processing system 110 performs morphological analysis through a basic dictionary access routine 112, extracts the reading of every word recognized by the language processing, and outputs the result as a kana-annotated sentence.

Description

[Detailed Description of the Invention] Purpose of the Invention; (Industrial Field of Application) This invention relates to a method for creating search keywords for CD-ROM data that uses a natural language processing system: the data from which keywords are to be extracted is segmented into words and annotated with kana readings, and nouns, adjectives, and verbs are extracted on the basis of part-of-speech information to form the keywords.

(Prior Art) Recently, document data accumulated for printed matter is increasingly being put to secondary use to create CD-ROMs and databases. The work of extracting keywords for database searches has conventionally been done manually by experts.

(Problem to Be Solved by the Invention) Conventionally, to extract keywords for database searches, an expert selects important words from a document and then adds readings to them. Keyword extraction for a database therefore required a great deal of labor, and the work itself was inefficient.

This invention was made in view of the circumstances described above, and its object is to provide a method for automatically creating search keywords for CD-ROM data using natural language processing technology, a field of AI (artificial intelligence).

Structure of the Invention; (Means for Solving the Problem) This invention relates to a method for creating search keywords for CD-ROM data. The above object is achieved by preprocessing a database stored on a magnetic storage medium, creating a natural language processing output file by natural language processing with reference to a basic dictionary, and creating keyword data by post-processing.

(Operation) In this invention, natural language processing, a kind of AI, is used to create search keywords for CD-ROM data; with reference to a basic dictionary, word segmentation (part-of-speech decomposition) and kana assignment are performed automatically on the input source text data.

Using a dictionary built into the computer and AI techniques, the text is decomposed into elements such as nouns, particles, and verbs; readings are added to the kanji of the segmented text, and keywords are extracted. Because the machine performs work that used to be done by hand, all that remains is the same checking as before. The keywords created are processed and used as indexes for CD-ROMs and online databases, and the kana assignment function can also be used to typeset a book with full ruby.

(Embodiment) First, the natural language processing system used in this invention will be described.

FIG. 6 shows an example of the hardware configuration of the natural language processing system. The host machine 10 contains a CPU 11 and installed memory 12, and a magnetic disk device 14 and a cassette magnetic tape device 15 are connected to it via a bus line 13. A magnetic tape device 20, a laser printer 21, and a console terminal 23 are also connected to the host machine 10, and a confirmation/correction terminal 22 is connected via an RS-232C interface 16.

FIG. 7 shows the software configuration of the natural language processing system. Input data from magnetic tape is taken in by input processing 101, and the information processed by the host machine 10 passes through output processing 120 to become output data on magnetic tape. Specifically, input processing 101 copies the natural language processing system input data magnetic tape to a disk file as input data 102, checks kanji codes and the like, and then converts the data into records for Japanese processing. Output processing 120 copies the processing result file on disk to the natural language processing output magnetic tape as processing result data 121. The driver 103 classifies and analyzes the input data 102, controls the Japanese processing system 110, obtains the results of word segmentation, kana assignment, and keyword extraction, and edits and outputs the processing results in the natural language processing system output data format.

The Japanese processing system 110 performs morphological analysis via the basic dictionary access routine 112, extracts the reading of every word recognized by the language processing, and outputs the result as a kana-annotated output sentence. In noun string extraction, a word recognized by the language processing is extracted as a noun when its part of speech falls under (a) or (b) below.

(a) Common nouns, sa-hen nouns, adjectival nouns, converted nouns, temporal nouns, numerals, proper nouns, pronouns, and formal nouns.

(b) An affix is extracted as a noun when the part of speech after it (for a prefix) or before it (for a suffix) is one of the following.

■ For a prefix, the following part of speech: common noun, sa-hen noun, adjectival noun, converted noun, temporal noun, numeral, proper noun, pronoun, or formal noun.
■ For a suffix, the preceding part of speech: common noun, sa-hen noun, adjectival noun, converted noun, temporal noun, numeral, proper noun, pronoun, or formal noun.

In addition, the Japanese text and the keyword analysis table obtained as above are input, and the input Japanese text is analyzed in cooperation with the access file routine 111 using techniques such as statistical analysis, syntactic analysis, and knowledge processing, to perform keyword extraction, narrowing down, and importance evaluation.
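The noun extraction rules (a) and (b) can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the part-of-speech tag names, the `(surface, pos)` token shape, and the function name `extract_nouns` are all assumptions introduced here, and the morphological analysis that would produce the tokens is taken as given.

```python
# Hypothetical tag names for the parts of speech listed under rule (a).
NOUN_POS = {
    "common_noun", "sahen_noun", "adjectival_noun", "converted_noun",
    "temporal_noun", "numeral", "proper_noun", "pronoun", "formal_noun",
}


def extract_nouns(tokens):
    """Return the surfaces extracted as nouns from (surface, pos) tokens."""
    nouns = []
    for i, (surface, pos) in enumerate(tokens):
        if pos in NOUN_POS:
            # Rule (a): the word itself has a noun-type part of speech.
            nouns.append(surface)
        elif pos == "prefix":
            # Rule (b), prefix: the following word must be noun-type.
            if i + 1 < len(tokens) and tokens[i + 1][1] in NOUN_POS:
                nouns.append(surface)
        elif pos == "suffix":
            # Rule (b), suffix: the preceding word must be noun-type.
            if i > 0 and tokens[i - 1][1] in NOUN_POS:
                nouns.append(surface)
    return nouns
```

With this sketch, a prefix followed by a verb, or a suffix that follows a particle, is simply skipped.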

Terminal communication processing 123 communicates with the confirmation/correction terminal 22 and converts data for terminal output. It also converts correction data from the terminal into the output file format and writes it. List output processing 122 edits the processing result data whose output was requested from the terminal into printer output data and outputs that data to the laser printer 21.

The host machine 10 can handle four natural language processing functions:

A. Processing type 1: word segmentation
B. Processing type 2: kana assignment I (kana assignment per segmented word)
C. Processing type 3: kana assignment II (kana assignment per kanji string; full ruby)
D. Processing type 4: keyword extraction and kana assignment to the keywords

These functions can be switched for each record of the input file.
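Switching the function per input record could look something like the sketch below. The enum names, the `(processing_type, text)` record shape, and the idea of passing handlers in as a table are assumptions made for illustration; the text only states that the four processing types can be selected record by record.

```python
from enum import IntEnum


class ProcessingType(IntEnum):
    """The four processing types; the member names are illustrative."""
    SEGMENTATION = 1        # word segmentation
    KANA_PER_WORD = 2       # kana assignment I
    KANA_PER_KANJI = 3      # kana assignment II (full ruby)
    KEYWORD_EXTRACTION = 4  # keyword extraction with kana


def process_records(records, handlers):
    """Apply to each (processing_type, text) record the handler for its type."""
    results = []
    for ptype, text in records:
        handler = handlers[ProcessingType(ptype)]
        results.append(handler(text))
    return results
```

The handlers themselves (segmentation, kana assignment, and so on) are deliberately left to the caller, since their behavior depends on the dictionaries described in the text.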

Next, each function (processing types 1 to 4) will be described.

A. Word segmentation (processing type 1): Japanese text (mixed kanji and kana) is input and segmented, and part-of-speech information is added for nouns, verbs, and adjectives. The output consists of the segmentation, marked with slashes "/", and the part-of-speech information (noun, verb, adjective, unknown word). The output format of processing type 1 is shown in FIG. 8.

B. Kana assignment I (processing type 2: kana assignment per segmented word): Japanese text (mixed kanji and kana) is input and segmented, and a reading is assigned to each segmented word. Readings are given in katakana, and part-of-speech information is added for nouns, verbs, and adjectives. The output consists of the slash-delimited segmentation, the part-of-speech information (noun, verb, adjective, unknown word), and the kana assigned to each segmented word. The output format of processing type 2 is shown in FIG. 9.

C. Kana assignment II (processing type 3): Processing type 3 provides kana assignment using the field-specific dictionary 106 and full ruby assignment (kana assignment per kanji string). Kana assignment using the field-specific dictionary 106 is applied to item data such as personal names, place names, and technical terms, using a dictionary dedicated to the item. The method searches the field-specific dictionary 106 with the item data as the key and, on a match, assigns the kana registered in the field-specific dictionary 106. If no kana is obtained this way, the Japanese processing system is called and kana is assigned using the basic dictionary 115.

The input format is "item data" for single-item data; when multiple items are processed in one record, the items are separated by slashes, as in "item data 1"/"item data 2"/.../"item data N". The output consists of the reading (katakana) for each input item and an identifier of the source dictionary of the kana data (which dictionary the kana was taken from). The output format of processing type 3 is shown in FIG. 10. Different identification codes (for example AA, AB, AC) are given for three cases: (1) the reading was obtained from the field-specific dictionary 106, (2) the reading was obtained from the basic dictionary 115, and (3) the reading is registered in neither the field-specific dictionary 106 nor the basic dictionary 115.
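The lookup order and source identification described for processing type 3 can be sketched as below. Several simplifications are assumed: both dictionaries are modeled as plain Python mappings from item to katakana reading, the basic dictionary step (which in the text goes through the Japanese processing system) is reduced to a direct lookup, the quotation marks around items are treated as notation only, and the codes AA, AB, AC are assigned to the three cases in the order listed, which the text gives only as an example.

```python
def assign_kana(record, field_dict, basic_dict):
    """Return (item, reading, source_code) for each slash-separated item."""
    results = []
    for item in record.split("/"):
        if item in field_dict:
            # Reading found in the field-specific dictionary.
            results.append((item, field_dict[item], "AA"))
        elif item in basic_dict:
            # Fallback: reading found in the basic dictionary.
            results.append((item, basic_dict[item], "AB"))
        else:
            # Reading registered in neither dictionary.
            results.append((item, None, "AC"))
    return results
```

Keeping the source code next to each reading lets a proofreader see at a glance which readings came from the fallback path and which items still need a dictionary entry.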

The data processed by kana assignment using the field-specific dictionary 106 is item data such as personal names, place names, and technical terms (mainly proper nouns), whereas the data processed by full ruby assignment is Japanese mixed kanji-kana text. The full ruby function (kana assignment per kanji string) takes Japanese text (mixed kanji and kana) as input and assigns kana to every kanji. Kana (ruby) is assigned to each kanji string in the input source text (characters other than JIS non-kanji), and the ruby is applied in "group ruby" form. The output format is shown in FIG. 11.

D. Keyword extraction and kana assignment to keywords (processing type 4): Free keywords are extracted from the input Japanese text by the language processing function of the Japanese processing system, and readings are added to the extracted keywords.

The output consists of the extracted keywords, their readings (katakana), and the keyword analysis results; the output format is shown in FIG. 12. The analysis information field is an area that holds the analysis information obtained during keyword recognition by the Japanese processing system.

The confirmation/correction terminal 22 is intended to make it easy to check and correct processing results: it receives the input source text data and the processing result data 121 in the processing result file from the host machine 10 via terminal communication processing 123, displays them on the terminal's display, and outputs them to the laser printer 21 of the host machine 10. By keyboard operation at the terminal 22, the operator specifies the job name of the processing result file to be checked or corrected; the input source text data and the processing result data 121 are then displayed on the terminal's display one record at a time for checking and correction.

The display format depends on the processing type, as in (A) to (D) below.

(A) For processing type 1 (word segmentation), the input source text and its segmentation result are displayed on the screen.

(B) For processing type 2 (kana assignment per segmented word), the input source text and the per-word kana assignment result are displayed on the screen.

(C) For processing type 3 (full ruby assignment), the kana assigned to every kanji in the input source text is displayed on the screen in a different color.

(D) For processing type 4 (keyword extraction), the input source text, the keywords extracted from it, and their kana readings are displayed on the screen.

Next, the processing result data is corrected by keyboard operation; the basic correction functions are described below.

Correction is possible only for processing types 3 and 4. For processing type 3 (full ruby assignment), the kana assignment result can be corrected; for processing type 4 (keyword extraction), the kana assignment result can be corrected and keywords can be inserted, deleted, or reordered.

When the processing result data 121 has been corrected at the terminal 22, the corrected data is sent to the host machine 10 by keyboard operation. The host machine 10 updates the records of the processing result file on the basis of the corrected data.

Keyboard operation at the terminal 22 also prints a specified processing result file or record on the laser printer 21 of the host machine 10. When the operator presses the P key (print key) to request printing of a processing result file or of individual processing result records, the records retrieved from the host machine 10 are printed in the format for each processing type.

The above is an overview of the natural language processing system. This invention uses that system to automatically create keywords for a CD-ROM database. FIG. 1 shows the processing flow of this invention: the database stored on a magnetic storage medium is first preprocessed (step S10).

The preprocessing is detailed in FIG. 2. First, data is extracted (step S11), and the extracted data is code-converted (step S12). A natural language processing input file is then created from the code-converted data (step S13), and these operations are repeated for all data.

Data extraction pulls from the database the source data from which keywords are to be created in this process. The data to be code-converted is in many cases written in JIS code or CTS (Computerized Type Setting) code. Because a natural language processing system generally uses its own system-specific code, the data must be code-converted.

Creation of the natural language processing input file means creating one natural language processing input file record for each extracted data item.
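The preprocessing loop (steps S11 to S13) might be sketched as follows. Everything concrete here is an assumption for illustration: the database is modeled as a list of dicts holding raw bytes in a `body` field, Python's `iso2022_jp` codec stands in for "JIS code", a Unicode `str` stands in for the system-specific code, and the output record layout is a plain dict rather than the actual file format.

```python
def preprocess(database, text_field="body", source_encoding="iso2022_jp",
               processing_type=4):
    """Build one NLP input record per database entry (steps S11 to S13)."""
    records = []
    for data_no, row in enumerate(database, start=1):
        raw = row[text_field]               # S11: extract the source data
        text = raw.decode(source_encoding)  # S12: convert the character code
        records.append({                    # S13: create an input file record
            "data_no": data_no,
            "processing_type": processing_type,
            "data": text,
        })
    return records
```

Decoding at the boundary keeps every later step working on a single internal representation, which is the role the system-specific code plays in the text.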

The data preprocessed as described above undergoes natural language processing in the next step S1, which is described in detail later.

In natural language processing, a natural language processing input file is created, and, with reference to the basic dictionary 115 (system dictionary 131 + user dictionary 132), word segmentation (part-of-speech decomposition) and kana assignment are performed on input source text data such as that shown in FIG. 3, yielding the result shown in FIG. 4. A part-of-speech ID is attached immediately before each segmented item so that the word's part of speech can be determined.

Next, the natural language processing output file is post-processed (step S20). The post-processing is detailed in FIG. 5. First, parts of speech are extracted (step S21); that is, nouns, adjectives, and verbs are extracted from the segmented, kana-annotated data. Compound words are then created (step S22): when nouns occur consecutively, compounds are formed from them. For example, if the natural language processing result is "自然／言語／処理" (natural / language / processing), the compounds are "自然, 自然言語, 自然言語処理, 言語, 言語処理, 処理". At the same time, the endings of adjectives and verbs are converted to the terminal (dictionary) form (step S23). Because the natural language processing system outputs its results in system-specific code, they are converted to CTS code (step S24), and the database is then created (step S25); that is, the words extracted by part of speech and processed are registered in the database as candidate keywords for CD-ROM searching.

Next, the contents of the database are output as a list (step S2), and the keyword data is proofread by marking corrections in red on the list. The proofread keyword data becomes the CD-ROM search keywords. For data whose segmentation or kana assignment was incorrect, the basic dictionary 115 (in practice, the user dictionary 132) is corrected to improve the accuracy of the next natural language processing run.
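The compound-word step (S22) and the terminal-form step (S23) can be illustrated with a short sketch. It reproduces the example from the text, where three consecutive nouns yield every contiguous combination. The `(surface, pos, base_form)` token shape and the tag names are assumptions made here; in particular, the terminal form of verbs and adjectives is assumed to be supplied by the morphological analysis rather than computed.

```python
from itertools import groupby


def contiguous_compounds(nouns):
    """All contiguous runs of a noun sequence, joined into compounds."""
    compounds = []
    for start in range(len(nouns)):
        for end in range(start + 1, len(nouns) + 1):
            compounds.append("".join(nouns[start:end]))
    return compounds


def keyword_candidates(tokens):
    """Keyword candidates from (surface, pos, base_form) tokens."""
    candidates = []
    for is_noun, group in groupby(tokens, key=lambda t: t[1] == "noun"):
        group = list(group)
        if is_noun:
            # S22: consecutive nouns produce compound words.
            candidates.extend(contiguous_compounds([t[0] for t in group]))
        else:
            # S23: verbs and adjectives are kept in their terminal form.
            candidates.extend(
                t[2] for t in group if t[1] in ("verb", "adjective")
            )
    return candidates
```

For a run of n consecutive nouns this produces n(n+1)/2 candidates, so long noun sequences grow the candidate list quickly; the proofreading step described above is where unwanted candidates would be removed.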

The basic dictionary 115 is the most fundamental dictionary for natural language processing (word segmentation / kana assignment) and consists of the system dictionary 131 and the user dictionary 132. Correcting the user dictionary 132 improves the accuracy of natural language processing.

In this invention, a general-purpose file (hereinafter, NL file) is used as the general-purpose input/output file for natural language processing in the CTS. As shown in FIG. 13, there are three kinds of NL file, the NL-in file, the NL-out file, and the NL information file, all with the same format. The overall format consists of a header record and data records. The header record contains the record identifier, sequence number, file identifier, job name, manuscript name, CTS system name, and so on; each data record contains the record identifier, sequence number, data number, processing type, data, and so on.
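As a rough model of the NL file layout, the header and data records could be represented like this. The field names follow the list in the text, but the Python types, the class names, and the helper `data_by_number` are assumptions; the actual record widths and encodings are not given in the source. The helper joins records that share a data number, which is the processing unit the input and output routines in this description work on.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeaderRecord:
    record_id: str
    sequence_no: int
    file_id: str
    job_name: str
    manuscript_name: str
    cts_system_name: str


@dataclass
class DataRecord:
    record_id: str
    sequence_no: int
    data_no: int
    processing_type: int
    data: str


@dataclass
class NLFile:
    header: HeaderRecord
    records: List[DataRecord] = field(default_factory=list)

    def data_by_number(self) -> Dict[int, str]:
        """Join, in sequence order, the data of records sharing a data number."""
        grouped: Dict[int, str] = {}
        for rec in sorted(self.records, key=lambda r: r.sequence_no):
            grouped[rec.data_no] = grouped.get(rec.data_no, "") + rec.data
        return grouped
```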

As shown in FIG. 14, the input routine S100 reads the NL-in file together with parameters and creates the natural language processing input file and the NL information file; its details are shown in FIG. 15.

The NL-in file is read, functions are deleted and codes are converted (external code to system-specific code) as specified by the parameters, and the natural language processing input file is created. The position information of the deleted functions and the code conversion information are stored in the information file, and the job name and so on are output as a list after processing ends. The parameter check (step S101) analyzes whether function deletion is to be executed and the instructions for code conversion.

Header record creation (step S102) creates the header records of the natural language processing input file and the NL information file from the contents of the header record of the NL-in file. Reading the data with the same data number (step S103) treats all valid data of the records having the same data number as one processing unit; valid data is therefore extracted from the data records in the NL-in file that share a data number. Data processing (step S104) deletes functions from, and code-converts, the data extracted from the NL-in file. Information on the deleted functions and the code conversion information go to the NL information file, and the processed data goes to the natural language processing input file. Data record creation (step S105) outputs the processed data (after function deletion and code conversion) for the same data number to the natural language processing input file and outputs the processing information to the NL information file.

The output routine S200 of FIG. 13, as shown in FIG. 16, is the post-processing of natural language processing: it reads the natural language processing output file and the NL information file together with parameters and creates the NL-out file; its details are shown in FIG. 17. That is, the natural language processing output file and the NL information file are read, functions are restored and codes are converted (system-specific code to external code) as specified by the parameters, and the NL-out file is created. After processing ends, the job name and so on are output as a list. The parameter check (step S201) analyzes whether function restoration is to be executed and the instructions for code conversion. Header record creation (step S203) creates the header record of the NL-out file from the header records of the NL information file and the natural language processing output file. Reading the data with the same data number (step S204) treats all valid data of the records having the same data number as one processing unit. The data records of the natural language processing output file contain both input source text data and processing result data, but only the processing result data is treated as valid; processing result data is therefore extracted from the data records in the natural language processing output file that share a data number. Data processing (step S205) restores functions in, and code-converts, the data extracted from the natural language processing output file. The processed data is output to the NL-out file.
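The pairing of function deletion in the input routine with function restoration in the output routine can be sketched as a round trip. This is only an illustration under strong assumptions: "functions" are taken to be embedded control codes, written here in a made-up `<...>` syntax; the position information is modeled as offsets into the cleaned text; and restoration assumes the text between the codes is unchanged, whereas real processing results (with slashes and readings added) would need a mapping from original to processed positions.

```python
import re

# Hypothetical syntax for embedded function codes; not from the source.
FUNCTION_PATTERN = re.compile(r"<[^<>]*>")


def strip_functions(text):
    """Remove function codes; return (plain_text, [(offset, code), ...])."""
    plain_parts = []
    info = []
    last = 0
    plain_len = 0
    for match in FUNCTION_PATTERN.finditer(text):
        segment = text[last:match.start()]
        plain_parts.append(segment)
        plain_len += len(segment)
        # Record where in the cleaned text the code used to sit.
        info.append((plain_len, match.group()))
        last = match.end()
    plain_parts.append(text[last:])
    return "".join(plain_parts), info


def restore_functions(plain, info):
    """Re-insert the recorded function codes at their saved offsets."""
    out = []
    last = 0
    for offset, code in info:
        out.append(plain[last:offset])
        out.append(code)
        last = offset
    out.append(plain[last:])
    return "".join(out)
```

Storing the removed codes separately, as the NL information file does, keeps the language processing free of markup while still allowing the original layout instructions to be put back afterwards.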

This invention can be used to support the construction of databases such as CD-ROMs, extracting search keywords and adding readings to the extracted keywords. It can also be used in printing work: for producing printed matter with full ruby using the kana assignment function, for automatically assigning kana to items such as addresses and names in directories, and as a support system for index creation.

Effects of the Invention; As described above, the search keyword creation method of this invention makes it possible to create search keywords for CD-ROM data automatically, without requiring specialized knowledge or skill.

[Brief Description of the Drawings]

FIG. 1: flowchart showing an example of the operation of this invention
FIG. 2: flowchart showing an example of the preprocessing
FIG. 3: example of source text to be processed by natural language processing
FIG. 4: example of segmented, kana-annotated text
FIG. 5: flowchart showing an example of the post-processing
FIG. 6: block diagram showing an example of the hardware configuration of the natural language processing system
FIG. 7: example of its software configuration
FIG. 8: output format of word segmentation
FIG. 10: output format of kana assignment using the field-specific dictionary
FIG. 11: output format of full ruby assignment
FIG. 12: output format of keyword extraction and kana assignment to keywords
FIG. 13: flowchart showing an example of the configuration of the general-purpose file used in this invention
FIG. 14: inputs and outputs of the input routine
FIG. 15: flowchart showing the details of the input routine
FIG. 16: inputs and outputs of the output routine
FIG. 17: flowchart showing the details of the output routine

10: host machine; 11: CPU; 12: memory; 14: magnetic disk device; 15: cassette magnetic tape device; 20: magnetic tape device; 21: laser printer; 22: confirmation/correction terminal; 23: console terminal.

Applicant's agent: Yuzo Agata (安形雄三)

Claims

[Claims]

[Claim 1] A method for creating search keywords for text data on CD-ROMs, magneto-optical disks, and the like, characterized by preprocessing a database stored on a magnetic storage medium, creating a natural language processing output file by natural language processing with reference to a basic dictionary, and creating keyword data by post-processing.

[Claim 2] The method for creating search keywords for CD-ROM data according to claim 1, wherein the basic dictionary is corrected when the keyword data is proofread.

[Claim 3] The method for creating search keywords for CD-ROM data according to claim 1, wherein the preprocessing is a repetition of data extraction, code conversion, and creation of a natural language processing input file.

[Claim 4] The method for creating search keywords for CD-ROM data according to claim 1, wherein the post-processing extracts parts of speech from the natural language processing output file, then creates compound words or converts endings to the terminal form, then performs code conversion and creates a database-format file, and repeats these operations.