JPH09237277A

JPH09237277A - Method for analyzing compound noun

Info

Publication number: JPH09237277A
Application number: JP8042460A
Authority: JP
Inventors: Toru Hisamitsu; 徹久光
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1996-02-29
Filing date: 1996-02-29
Publication date: 1997-09-09

Abstract

PROBLEM TO BE SOLVED: To provide a method for estimating the structure of a compound noun containing unknown words by collecting and combining clues that actually appear elsewhere in the document. SOLUTION: The words obtained by morphological analysis of the compound noun are used as keys, and words that co-occur with each key are retrieved with pattern matchers. The co-occurrence examples are stored in a co-occurrence data storage area. When unknown words are detected, they are stored in an unknown-word storage area (1006 and 1007). Co-occurrence patterns are then retrieved for the detected unknown words and the results are added to the co-occurrence data storage area (1008 and 1009). The data in a modification (dependency) likelihood storage area is built from the co-occurrence database (1010). When an unknown word has been detected, the morphological analysis result is corrected (1011), and the modification relations among the corrected words are estimated using the data in the modification likelihood storage area (1012).

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a method for analyzing the structure of compound nouns appearing in a document, and is used, for example, to improve translation accuracy in machine translation.

[0002]

[Prior Art] Techniques of this kind are described, for example, in Masahiro Miyazaki et al., a paper on bilingual dictionary lookup based on the structuring of compound words (複合語の構造化に基づく対訳辞書の単語結合難辞書引き), 情処論 (Transactions of the Information Processing Society of Japan), Vol. 34, No. 4, pp. 743-754 (1993), and in Kobayashi, Y. et al., "Analysis of Japanese Compound Noun using Collocational Information", in Proc. of COLING, pp. 865-869 (1994).

[0003] Compound noun analysis is usually performed as part of some natural language processing program. FIG. 1 shows the case where it is a constituent module of a machine translation system.

[0004] The target document 101 is first decomposed by ordinary morphological analysis 107 into a sequence of constituent words such as 102. Such a word sequence generally contains many compound nouns like the one shown at 104, which have been a problem in machine translation and similar applications. Therefore, in order to clarify the relations among the words that make up a compound, each compound noun is passed to the compound noun structure analysis module 103 together with its position of occurrence. Module 103 generally uses linguistic knowledge 109, such as compound noun composition rules 110, a compound noun database 111, and a thesaurus 112, to output a compound noun structure analysis result such as 105, which is handed to the subsequent processing group shown at 106 together with the ordinary morphological analysis result 102. The unit of notation for the compound noun structure analysis result 105 is a tree structure such as the one shown at 201 in FIG. 2. Here α and β denote words and, where it can be determined, (δ) denotes the estimated relation between α and β. When the root of the tree is β, β is said to be the head (main part) of the relation between the two words. This corresponds to β being the semantic head of the compound αβ. For convenience of notation, this relation is hereafter written simply as [[α (δ) β] β], as shown at 202. For example, when α = "構造" (structure) and β = "解析" (analysis), the head of αβ = "構造解析" (structure analysis) is "解析", and the relation δ between α and β is "α is the object of β". In the following, when only the head is of interest, this may be written simply as [[α β] β]. Furthermore, since in a Japanese two-word noun sequence the latter noun is the head almost without exception, the dependency relations among three or more nouns are, unless otherwise noted, written simply with [α β], as in [[α β] γ] and [α [β γ]].

[0005] The conventional compound noun analysis method is explained using "複合名詞解析" (compound noun analysis) as an example. Ordinary morphological analysis first segments "複合名詞解析" as "複合 SN / 名詞 N / 解析 SN". The modifier/modified relation among these three words is, for example, "to analyze compounded nouns". That is, the first two words combine to form "複合名詞" (compound noun), whose head is "名詞" (noun), and this in turn is the object of "解析" (analysis). To obtain this analysis, the following kind of method has conventionally been used. Following a context-free grammar such as the one shown at 301 in FIG. 3, the parse trees shown at 31011 and 31021 in FIG. 4 are obtained for the above segmentation. The corresponding interpretations of the compound noun can be written as 31012 and 31022, respectively. In interpretation 31012, "複合" modifies "名詞", whereas in 31022 it modifies "解析".
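
The two candidate parse trees for a three-word compound can be enumerated mechanically. The following is a minimal sketch, assuming that the grammar of FIG. 3 (not reproduced here) licenses every binary bracketing of the word sequence; the function names `bracketings` and `show` are illustrative and do not come from the patent.

```python
from typing import List, Union

# A tree is either a leaf word or a (left, right) pair of subtrees.
Tree = Union[str, tuple]


def bracketings(words: List[str]) -> List[Tree]:
    """Return every binary bracketing of a word sequence (assumed grammar)."""
    if len(words) == 1:
        return [words[0]]
    trees = []
    for split in range(1, len(words)):
        for left in bracketings(words[:split]):
            for right in bracketings(words[split:]):
                trees.append((left, right))
    return trees


def show(tree: Tree) -> str:
    """Render a tree in the [α β] bracket notation used in the text."""
    if isinstance(tree, str):
        return tree
    return "[" + show(tree[0]) + " " + show(tree[1]) + "]"


candidates = [show(t) for t in bracketings(["複合", "名詞", "解析"])]
print(candidates)
```

For three words this yields exactly the two interpretations discussed above; the number of candidates grows quickly with the length of the compound, which is why a likelihood-based choice among them is needed.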

[0006] In the course of parsing, each node of each parse tree is given an attribute called "head". For a rule such as NP → N N, the value of the "head" attribute of the NP on the left-hand side is taken to be that of the second N on the right-hand side, not the first. For a rule N → w (where w is a word), the value of the "head" attribute of the N on the left-hand side is w itself. This is because in Japanese, when two nouns combine to form a noun, the latter is generally the head of the whole.
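
The head rule can be sketched as a small tree type. This is an illustrative sketch only: the class `Node` and the helpers `leaf` and `join` are hypothetical names, and only the two rules mentioned in the text (N → w and NP → N N) are modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    word: Optional[str] = None       # set for leaves (rule N -> w)
    left: Optional["Node"] = None    # first child (rule NP -> N N)
    right: Optional["Node"] = None   # second child (rule NP -> N N)

    @property
    def head(self) -> str:
        if self.word is not None:
            return self.word         # N -> w: the head is the word itself
        return self.right.head       # NP -> N N: the head comes from the second N


def leaf(word: str) -> Node:
    return Node(word=word)


def join(left: Node, right: Node) -> Node:
    return Node(left=left, right=right)


tree_a = join(join(leaf("複合"), leaf("名詞")), leaf("解析"))  # [[複合 名詞] 解析]
tree_b = join(leaf("複合"), join(leaf("名詞"), leaf("解析")))  # [複合 [名詞 解析]]
print(tree_a.left.head, tree_a.head, tree_b.right.head)
```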

[0007] By using the "head" attribute, it is possible to determine which of 31011 and 31021 is, overall, the more plausible interpretation. This is shown below with a concrete example. 401 in FIG. 5 is a pre-recorded database of compound noun examples. 401 may actually contain an example in which "複合" and "名詞" form a compound noun, as at 4011, but because the data available in advance is limited, a given combination may not be contained at all. To cope with this situation, known as the "sparseness problem", one usually uses 401 together with the thesaurus shown in FIG. 7: instead of the actual word, each word is replaced by the word at a specific level of the thesaurus that represents the group of words with similar meanings, a dependency (or co-occurrence) between these representative words is deemed to have been observed, and the dependency plausibility between representative words is computed, for example as a conditional probability. The results are recorded as a co-occurrence likelihood database, as shown at 402 in FIG. 6.
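
The thesaurus-based smoothing can be sketched as follows. Everything concrete here is an assumption made for illustration: the thesaurus entries other than those named in the text, the example pairs, and the choice of P(head class | modifier class) as the conditional probability are all hypothetical, since the contents of FIGS. 5 to 7 are not reproduced here.

```python
from collections import Counter

# Hypothetical thesaurus: word -> representative word at a fixed level.
THESAURUS = {"複合": "集合", "結合": "集合", "名詞": "文法", "動詞": "文法", "解析": "学問"}

# Hypothetical example database of observed (modifier, head) pairs.
EXAMPLES = [("結合", "動詞"), ("複合", "動詞"), ("動詞", "解析"), ("結合", "解析")]


def class_likelihood(examples, thesaurus):
    """Estimate P(head class | modifier class) from word-level examples."""
    pair_counts = Counter()
    modifier_counts = Counter()
    for modifier, head in examples:
        m, h = thesaurus[modifier], thesaurus[head]
        pair_counts[(m, h)] += 1
        modifier_counts[m] += 1
    return {pair: n / modifier_counts[pair[0]] for pair, n in pair_counts.items()}


table = class_likelihood(EXAMPLES, THESAURUS)
# ("複合", "名詞") never occurs verbatim in EXAMPLES, yet its class pair has a score.
print(table[(THESAURUS["複合"], THESAURUS["名詞"])])
```

The point of the sketch is the last line: a word pair that was never observed still receives a likelihood through its representative words.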

[0008] Using this, for the subtree of 31011 consisting of 310111, 310112, and 310113, the dependency plausibility between "複合" and "名詞", which are the head attributes of 310111 and 310112 respectively, is evaluated as 0.02. Here it is assumed that "集合" (set) is taken as the representative word for "複合" and "文法" (grammar) as the representative word at the same level for "名詞". Further, for the subtree consisting of 310113, 310114, and 310115, the head attribute of 310113 is "名詞", so this time, from 4022, the plausibility of this subtree is evaluated as 0.11 using the dependency plausibility between "文法" and "学問" (scholarship). Here it is assumed that "文法" is taken as the representative word for "名詞" and "学問" as the representative word at the same level for "解析".

[0009] From these, the plausibility of 31011 as a whole is the product of the plausibilities of the subtrees it contains: 0.02 × 0.11 = 0.0022.

[0010] In exactly the same way, the plausibility of 31021 is 0.01 × 0.11 = 0.0011, so it is concluded that parse tree 31011, that is, interpretation 31012, is the more plausible. The accuracy of compound noun analysis by this technique is reported to be about 60% for six-character kanji compound nouns [2].
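
The arithmetic of the conventional method can be reproduced with a short sketch. The values 0.02 and 0.11 and the representative words are taken from the text; attaching 0.01 to the pair ("集合", "学問") is an inference from the product given for 31021, and the function names are illustrative.

```python
import math

# Representative words as assumed in the text.
REPRESENTATIVE = {"複合": "集合", "名詞": "文法", "解析": "学問"}

# Dependency likelihoods between representative words (modifier, head).
# The entry for ("集合", "学問") is inferred from the product 0.01 x 0.11.
LIKELIHOOD = {("集合", "文法"): 0.02, ("文法", "学問"): 0.11, ("集合", "学問"): 0.01}


def head(tree):
    """The head of a leaf is the word; the head of a pair is that of its right child."""
    return tree if isinstance(tree, str) else head(tree[1])


def score(tree):
    """Product of the dependency likelihoods of all subtrees."""
    if isinstance(tree, str):
        return 1.0
    left, right = tree
    pair = (REPRESENTATIVE[head(left)], REPRESENTATIVE[head(right)])
    return LIKELIHOOD.get(pair, 0.0) * score(left) * score(right)


tree_31011 = (("複合", "名詞"), "解析")   # [[複合 名詞] 解析]
tree_31021 = ("複合", ("名詞", "解析"))   # [複合 [名詞 解析]]
best = max([tree_31011, tree_31021], key=score)
print(score(tree_31011), score(tree_31021))
```

Returning 0.0 for an unseen pair is a simplification of this sketch; it is exactly the situation the sparseness problem describes.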

[0011]

[Problems to Be Solved by the Invention] The problem with the conventional technique arises when analyzing a compound noun such as "改正大店法施行". This means "enforcing the revised 大店法 (大規模小売店舗法, the Large-Scale Retail Store Law)", but "大店法" is an abbreviation of "大規模小売店舗法" and is normally not listed in dictionaries. "改正大店法施行" is therefore usually segmented as "改正 SN / 大 N / 店 N / 法 N / 施行 SN". Building parse trees for this segmentation with the rules of 301 yields no meaningful analysis. Even if "大店法" is presumed to be a noun under a policy such as "treat a run of short kanji segments as a single noun", it is not known which word in the thesaurus represents "大店法", so analysis of "大規模小売店舗法" is still impossible.

[0012] Even if all the words in a compound noun are registered words, analysis accuracy may drop because of the sparseness problem when the knowledge source used was accumulated independently of the document to be analyzed.

[0013] For this reason, the conventional technique could not be put to practical use for analyzing newspaper articles and the like.

[0014] An object of the present invention is to estimate the structure of compound nouns that appear in newspaper articles and the like, including those containing unknown words such as abbreviations, by directly extracting related information from other parts of the document itself.

[0015]

[Means for Solving the Problems] The present invention enables analysis of compound nouns in newspaper articles, including those containing abbreviations and the like. It is natural for a compound noun to be given together with a large text that actually contains it, so this is assumed. Since it is an important requirement that the large text need not be analyzed in its entirety, each of the words obtained by morphological analysis of the given compound noun is used as a key; a pattern matcher finds the places in the text where a key occurs while satisfying specified conditions; and from that information a co-occurrence database is built by examining which other words or character strings each key co-occurs with.

[0016] To capture unregistered words such as "大店法", which get segmented into a sequence of short words, the following is done: when a sequence of words or character strings that contains a key word and has a total length of 3 characters or fewer (the value 3 was determined experimentally) occurs elsewhere in the document while satisfying the specified conditions, the sequence itself is regarded as a new word. A word newly captured in this way is added to the keys, and the co-occurrence search continues. Whether a newly discovered word is an abbreviation, and if so what its full form is, can be determined, for example, by the method described in Tsuyoshi Kitani et al., "An Accurate Morphological Analysis and Proper Name Identification for Japanese Text Processing", in Trans. of IPSJ, Vol. 35, No. 3, pp. 404-413 (1994). For a word in the document that appears to be a contracted form, this method compares the set of characters composing it with the set of characters composing each word occurring elsewhere, and judges which word it is a contraction of from the order of occurrence and the proportion of characters covered.
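
The character-level comparison can be illustrated with a rough sketch. This is not the method of the cited paper, whose details are not given here; it only assumes that the characters of the short form must appear in the same order in the full form and that the covered proportion must reach a threshold. The threshold `min_coverage` and both function names are hypothetical.

```python
def is_ordered_subsequence(short: str, full: str) -> bool:
    """True if the characters of `short` occur in `full` in the same order."""
    remaining = iter(full)
    return all(ch in remaining for ch in short)


def abbreviation_candidates(short, vocabulary, min_coverage=0.3):
    """Rank longer words of which `short` could be a contraction."""
    results = []
    for word in vocabulary:
        if len(word) <= len(short):
            continue  # a full form must be longer than its contraction
        coverage = len(short) / len(word)
        if is_ordered_subsequence(short, word) and coverage >= min_coverage:
            results.append((word, coverage))
    return sorted(results, key=lambda item: -item[1])


found = abbreviation_candidates("大店法", ["大規模小売店舗法", "大学", "商法"])
print(found)
```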

[0017]

[Embodiments of the Invention] The pattern matchers and the specified conditions are described in detail below, taking the analysis of "大規模小売店舗法" as an example.

[0018] In FIG. 8, A is a given word, B is a word or character string, and D is one of whitespace, a symbol, a hiragana character other than "の", and so on. When A is a word of length 1, B is a word or character string of length 2 or less; when A is a word of length 2 or more, B is a word, or a character string of length 3 or less.

[0019] 601 is basically a pattern for acquiring examples in which two nouns co-occur as constituents of a single compound word. In what follows, except for special cases, both orders of occurrence of A and B are covered: A first and B first.

[0020] 602 is basically a pattern for acquiring examples in which two nouns co-occur across the particle "の". "の" is excluded from D because extracting "AのB" and "BのC" from "AのBのC" often yields incorrect dependencies between words. From 602 onward, B is a word or character string of three characters or fewer.

[0021] 603 is a pattern for acquiring examples of adjectives and adjectival nouns (形容動詞) together with the words they modify.

[0022] 604 is a pattern for acquiring examples of sahen verbs (サ変動詞) together with the nouns that serve as their ga-case, ni-case, or wo-case arguments.

[0023] 605 is a pattern for acquiring examples of nouns directly modified by a sahen verb.

[0024] 606 is a pattern for acquiring examples of nouns in a coordinate (parallel) relation.

[0025] 607 is a pattern for acquiring examples of two nouns in the relation "〜についての〜" ("... concerning ...").

[0026] These are the principal relations, but other patterns can be added as needed.
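
Patterns 601 and 602 can be approximated with regular expressions. The sketch below is a rough stand-in, not the matcher of FIG. 8: it treats any run of kanji as a candidate B, and it approximates the delimiter D as "any character that is neither kanji nor の" (or the start or end of the text). The function names are hypothetical; the relation labels "in-word-rel" and "no-rel" follow the names that appear elsewhere in this description.

```python
import re

KANJI = "[\u4e00-\u9fff]"               # rough stand-in for noun characters
KANJI_OR_NO = "[\u4e00-\u9fffの]"        # characters not allowed as delimiter D


def match_601(key, text, max_len=2):
    """Pattern 601 (approximation): key and B adjacent inside one kanji run."""
    k = re.escape(key)
    b = "(" + KANJI + "{1," + str(max_len) + "})"
    before, after = "(?<!" + KANJI + ")", "(?!" + KANJI + ")"
    hits = [(key, "in-word-rel", m.group(1))
            for m in re.finditer(before + k + b + after, text)]
    hits += [(m.group(1), "in-word-rel", key)
             for m in re.finditer(before + b + k + after, text)]
    return hits


def match_602(key, text, max_len=3):
    """Pattern 602 (approximation): 'key の B' or 'B の key', rejecting AのBのC."""
    k = re.escape(key)
    b = "(" + KANJI + "{1," + str(max_len) + "})"
    before, after = "(?<!" + KANJI_OR_NO + ")", "(?!" + KANJI_OR_NO + ")"
    hits = [(key, "no-rel", m.group(1))
            for m in re.finditer(before + k + "の" + b + after, text)]
    hits += [(m.group(1), "no-rel", key)
             for m in re.finditer(before + b + "の" + k + after, text)]
    return hits


print(match_601("大", "大店法が成立した。"))
print(match_602("改正", "政府は法律の改正を進めた。"))
print(match_602("改正", "大店法の改正の影響"))
```

The third call returns no hits because "改正" sits inside an "AのBのC" chain, which is the case the exclusion of "の" from D is meant to filter out.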

[0027] In analyzing "改正大店法施行", the initial morphological analysis yields the five words "改正 SN / 大 N / 店 N / 法 N / 施行 SN".

[0028] For the first word "改正" (revision), as shown for example at 7011 in FIG. 9, "改正中" (under revision) is acquired by pattern matcher 601, "改正された法規" (revised regulations) by 605, "法律の改正" (revision of the law) by 602, and "法律を改正する" (to revise the law) by 604.

[0029] For the second word "大" (large), as at 7012 for example, "大きな変化" (a big change) is acquired by pattern matcher 603, and "大学", "大型", "大店法", and so on are acquired by 601. The following conventions apply to patterns found by 601.

[0030] Conventions. For AB and BA found by pattern matcher 601: (a) when A has length 1, B has length 1, and the concatenated string AB is not listed in the dictionary, AB is regarded as an independent word, and the occurrence of AB is recorded together with its count.

[0031] (b) When A has length 1, B has length 2, the concatenated string AB is not listed in the dictionary, and the string AB = "C1C2C3" cannot be split into a concatenation of two dictionary words "C1" + "C2C3", AB is regarded as an independent word, and the occurrence of AB is recorded together with its count.

[0032] (c) When AB satisfies (a) or (b) and the string B matches the beginning of the string that follows A in the compound to be analyzed, a search with AB as the key is performed after the search for the given key has finished.

[0033] (d) When A has length 2 or more, an in-compound co-occurrence of A and B (in-word-rel) is recorded as observed, except when the string AB = "C1C2C3...Cn" can be split into two dictionary words A' = "C1...Cm", which shares its beginning with A but is not equal to A, and B' = "Cm+1...Cn", which shares its end with B but is not equal to B.
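
Conventions (a) to (c) can be sketched as two predicates. This is a minimal sketch under stated assumptions: the dictionary is modeled as a plain set of strings, only the A-first order is handled, and the position of A in the compound is taken to be its first occurrence. The function names and the toy dictionary are hypothetical.

```python
def is_new_word(a: str, b: str, dictionary: set) -> bool:
    """Conventions (a) and (b) for a length-1 key A followed by B."""
    ab = a + b
    if len(a) != 1 or ab in dictionary:
        return False                  # AB is already a registered word
    if len(b) == 1:                   # convention (a)
        return True
    if len(b) == 2:                   # convention (b): "C1" + "C2C3" must not both be words
        splits_into_two_words = ab[0] in dictionary and ab[1:] in dictionary
        return not splits_into_two_words
    return False


def should_search_as_key(a: str, b: str, compound: str, dictionary: set) -> bool:
    """Convention (c): B must match the text that follows A in the compound."""
    if not is_new_word(a, b, dictionary):
        return False
    position = compound.find(a)       # simplification: first occurrence of A
    return position >= 0 and compound.startswith(b, position + len(a))


# Toy dictionary for illustration only.
DICTIONARY = {"大", "店", "法", "大学", "大型", "改正", "施行"}
COMPOUND = "改正大店法施行"
print(is_new_word("大", "学", DICTIONARY), is_new_word("大", "店法", DICTIONARY))
print(should_search_as_key("大", "店法", COMPOUND, DICTIONARY))
```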

[0034] "大きな変化" is recorded as a relation between the kanji stem "大" and "変化" (change); "大学" (university) and "大型" (large-size) are each already words and are therefore ignored under convention (a); "大店法" is registered as a new word under convention (b), and under convention (c) it is noted that a search with "大店法" as the key is to be performed. As a result, 7022 and 7026 are obtained.

[0035] For the third word "店" (store) and the fourth word "法" (law), 7013 and 7014 are obtained in the same way as for the second word "大", yielding 7023 and 7024. 7026 has already been obtained.

[0036] Similarly, for the fifth word "施行" (enforcement), 7025 is acquired from 7011.

[0037] Finally, a search is carried out for the newly discovered word "大店法", giving 7016, and ultimately 7027 is acquired.

[0038] Under the above conventions, it is extremely likely that pattern matching keyed on the "大" of "大 N / 店 N / 法 N" will reveal that "大店法" is in fact a single new word, precisely because the given document has "大店法" as one of its topics. By adding a newly discovered word as a new key, not only is the unregistered word itself found, but its co-occurrence relations with other words are found as well. Thus, even for compound nouns containing unregistered words, the co-occurrence database can be built while the unregistered words are identified, and structural analysis can be carried out.
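
The way newly discovered words feed back into the search can be sketched as a simple work queue. The callback `search` is a hypothetical stand-in for the pattern matchers and conventions described above; the canned results in `FAKE_RESULTS` are made up for illustration, with relation labels borrowed from names used elsewhere in this description.

```python
from collections import deque


def collect_cooccurrences(initial_keys, search):
    """Run `search` on every key, including keys discovered along the way.

    `search(key)` is assumed to return (examples, new_words).
    """
    queue = deque(initial_keys)
    seen = set(initial_keys)
    database = []
    while queue:
        key = queue.popleft()
        examples, new_words = search(key)
        database.extend(examples)
        for word in new_words:
            if word not in seen:      # each new word becomes a key exactly once
                seen.add(word)
                queue.append(word)
    return database


# Canned, made-up search results for illustration only.
FAKE_RESULTS = {
    "改正": ([("法律", "no-rel", "改正")], []),
    "大": ([], ["大店法"]),
    "大店法": ([("大店法", "no-rel", "改正"), ("大店法", "wo-rel", "施行")], []),
}


def fake_search(key):
    return FAKE_RESULTS.get(key, ([], []))


db = collect_cooccurrences(["改正", "大", "店", "法", "施行"], fake_search)
print(db)
```

Appending new keys to the end of the queue matches the order described in the text, where the search for "大店法" is carried out after the searches for the original five words.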

[0039] Using the examples acquired in this way, a conditional probability table for judging dependency likelihood can be generated as in the conventional method. For a newly acquired word it is also possible, for example, to estimate its position in the thesaurus from its final word and replace it with a representative word; but when the document is large, the co-occurrence database can be expected to contain sufficient information, so the calculation can also be done using each word directly. FIG. 10 is an example computed using the words directly.

[0040] Using this, "改正大店法施行" is analyzed. Because "大店法" was recognized as a single word in the course of the word search, the morphological analysis result is corrected before dependency analysis, and the compound noun to be analyzed becomes "改正 SN / 大店法 N / 施行 SN". Here, every newly discovered word B is presumed to be a common noun unless a pattern such as "BなC" or "BするC" is found. Of course, nothing prevents a more detailed part of speech from being estimated by some other means.
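
Correcting the morphological analysis result amounts to merging adjacent tokens that together spell a newly discovered word. The following is a minimal sketch, assuming tokens are (surface, part-of-speech) pairs and that merged words receive the default tag "N" as described above; the function name is hypothetical.

```python
def merge_new_words(tokens, new_words, default_pos="N"):
    """Replace runs of adjacent tokens that spell a new word with one token."""
    merged = []
    ordered = sorted(new_words, key=len, reverse=True)  # prefer longer words
    i = 0
    while i < len(tokens):
        for word in ordered:
            j, surface = i, ""
            # Concatenate following tokens until the length of `word` is reached.
            while j < len(tokens) and len(surface) < len(word):
                surface += tokens[j][0]
                j += 1
            if surface == word and j - i > 1:
                merged.append((word, default_pos))
                i = j
                break
        else:
            merged.append(tokens[i])  # no new word starts here
            i += 1
    return merged


tokens = [("改正", "SN"), ("大", "N"), ("店", "N"), ("法", "N"), ("施行", "SN")]
corrected = merge_new_words(tokens, {"大店法"})
print(corrected)
```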

[0041] By comparing [[改正 大店法] 施行] with [改正 [大店法 施行]] in the same way as in the conventional technique, the former is selected as having the higher likelihood. Its detailed structure is shown at 91 in FIG. 11. In addition to the "head" attribute, a node that has two children can be given a "mod-rel" attribute, which records in what patterns the two words that are the "head" values of its two child nodes actually co-occurred in the document.

[0042] For example, node 913 indicates that "rv-no-rel" and "sareta-rel" were found between "改正" and "大店法"; these indicate the existence of "大店法の改正" (revision of the 大店法) and "改正された大店法" (the revised 大店法), respectively. "rv-no-rel" indicates that the order of occurrence is the reverse of that of the instance with "no-rel" in 7027.

[0043] Likewise, node 915 indicates that "wo-rel" was found between "大店法" and "施行", which indicates the existence of the example "大店法を施行" (enforce the 大店法). It is evident that 92 can be obtained by combining the information in 91. It is also possible to combine this with the unabbreviated form and convert it into ordinary wording such as "改正された大規模小売店舗法を（の）施行" (enforcement of the revised Large-Scale Retail Store Law).

[0044] FIG. 12 illustrates the procedure described above.

[0045] When the likelihoods of analyses are compared and two analyses have equal likelihood under the above method, priority may be given to the interpretation that includes co-occurrence examples found closer in the document to the compound noun being analyzed; this modification is straightforward.
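
The proximity-based tie-break can be sketched as follows. This is only one possible reading: positions are assumed to be character offsets in the document, "closer" is taken to mean the smallest distance of any supporting example, and the names `choose` and `tolerance` are hypothetical.

```python
def choose(candidates, target_position, tolerance=1e-12):
    """Pick the most likely analysis; break ties by nearest supporting example.

    `candidates` is a list of (name, likelihood, evidence_positions).
    """
    best = max(likelihood for _, likelihood, _ in candidates)
    tied = [c for c in candidates if abs(c[1] - best) <= tolerance]

    def nearest(candidate):
        distances = (abs(p - target_position) for p in candidate[2])
        return min(distances, default=float("inf"))

    return min(tied, key=nearest)[0]


analyses = [
    ("A", 0.5, [10, 400]),   # same likelihood as B, evidence farther away
    ("B", 0.5, [120]),
    ("C", 0.4, [101]),       # closest evidence, but lower likelihood
]
print(choose(analyses, target_position=100))
```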

[0046]

[Effects of the Invention] By extracting actual examples from the actual document with pattern matchers, it becomes possible to analyze compound nouns containing unregistered words, which the conventional method could not handle.

[Brief Description of the Drawings]

FIG. 1 is a diagram showing where compound noun analysis fits in machine translation.

FIG. 2 is a diagram showing the structural notation for compound nouns.

FIG. 3 is a diagram showing the grammar rules used for compound noun analysis.

FIG. 4 shows an example of compound noun analysis.

FIG. 5 shows an example of a compound noun example database.

FIG. 6 shows an example of a dependency likelihood database.

FIG. 7 shows an example of a thesaurus.

FIG. 8 shows the pattern matchers.

FIG. 9 shows acquired examples and an example of an example database.

FIG. 10 shows an example of a dependency likelihood database.

FIG. 11 shows an example of compound noun structure analysis.

FIG. 12 shows the compound noun structure analysis procedure.

Claims

[Claims]

1. A compound noun analysis method for a compound noun appearing in a document, the compound noun being a sequence of elements, hereinafter simply called words, such as nouns of length 2 or less, noun prefixes, and noun suffixes, the method estimating the dependency relations among the words constituting the compound noun and, where a dependency relation exists, its type, the method being characterized by: searching for the places where each word constituting the compound noun occurs in a set of one or more documents including said document; determining whether a specified relation holds between the word and a word in the vicinity of each occurrence of the word; when the specified relation is satisfied, recording the word and the neighboring word, together with the relation between them, in a storage area called an example database; and, when a plurality of possible dependency relations exist in the compound noun, determining the most plausible relation using the recorded information.

2. The compound noun structure analysis method according to claim 1, characterized in that a thesaurus is used in building the example database, and the example database is built by replacing words appearing in the compound noun with words that have a similar meaning or express a superordinate concept.

3. The compound noun structure analysis method according to claim 1, characterized in that, when the constituent words of the compound noun include an abbreviation of a longer compound word, the example database is built using that compound word.

4. The compound noun structure analysis method according to claim 1, characterized in that, when a plurality of possible dependency relations exist between two words appearing in the compound noun, or when a word may modify any of several other words, the dependency relation is determined by preferentially using information on neighboring words in the same document as the analysis target or in similar documents.

5. The compound noun structure analysis method according to claim 1, characterized in that, when it is discovered during the document search that a plurality of words appearing consecutively in the compound noun form a single unregistered word, those words are newly regarded as one word, and neighboring words of that word are searched for and added to the example database.