JP2011118872A - Method and device for determining category of unregistered word

Info

Publication number: JP2011118872A
Application number: JP2010210648A
Authority: JP
Inventors: Changjian Hu; Zhao Kai; Likun Qiu
Original assignee: NEC China Co Ltd
Current assignee: NEC China Co Ltd
Priority date: 2009-11-30
Filing date: 2010-09-21
Publication date: 2011-06-16
Anticipated expiration: 2030-09-21
Also published as: CN102081602A; KR20110060806A; JP5216063B2; CN102081602B; KR101195341B1

Abstract

PROBLEM TO BE SOLVED: To provide a method and device for determining a category of an unregistered word.

SOLUTION: The method includes the steps of: selecting synonyms of an unregistered word from a dictionary on the basis of a word-formation rule; generating a context of the unregistered word from a corpus; and determining the category to which the unregistered word belongs on the basis of the context and the synonyms of the unregistered word. With this method and device, the category of an unregistered word can be determined more efficiently and more accurately.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates generally to the field of information processing, and more particularly to a method and apparatus for determining the category of an unregistered word.

With the spread of the Internet and the growth of information in society, the volume of text information keeps increasing, and so does the demand for text information processing. People want to communicate with computers in natural language and want large amounts of text to be processed automatically. Processing text information more effectively requires accumulating large language data resources such as dictionaries.
However, in major document-processing systems, dictionaries are often compiled by hand, which is time-consuming and inefficient.
Furthermore, in word segmentation, segmentation errors caused by unregistered words strongly affect the recall of segmentation as a whole and in turn degrade the accuracy of subsequent syntactic and semantic analysis, which hinders information processing to some degree.
In other information processing technologies such as information extraction, when the attributes of an unregistered word are very unclear, the incompleteness of the unregistered word and its information makes the extraction results not only ambiguous but even erroneous.
Determining the category of unregistered words is therefore a pressing problem.

Patent Document 1 (Chinese Patent Publication No. CN1717679) discloses a part-of-speech tagging method.
This method mainly uses a pre-recorded keyword-to-word-class repository to tag a sequence of words collectively. If a word sequence contains a specific keyword, the sequence is tagged with the part of speech corresponding to that keyword.

Patent Document 2 (US Patent Publication No. US20060100856A1) discloses a word-sense estimation method. The basic idea is to extract usage examples of each new word through a web search and to extract candidate semantic classes from those examples according to an existing example dictionary. If there are multiple candidates, the semantic class with the highest co-occurrence frequency with the new word in a specific corpus is selected.

Patent Document 3 (Chinese Patent Publication No. CN1369877) discloses a method for estimating the category of a new word.
This method first determines a separation probability for each character in the new word and then combines the per-character probabilities on a word-category basis to form a total separation probability for each category.
Based on a comparison between a threshold and the total probability, each word category whose probability exceeds the threshold is added as a possible category for the multi-character word.

Non-Patent Document 1 (Xiaofei Lu, "Hybrid Models for Semantic Classification of Chinese Unknown Words", NAACL HLT 2007, pp. 188-195) proposes a hybrid part-of-speech estimation method built on manually written rules, statistical methods, and context. In it, the rules and the statistical methods supply candidate semantic classes to the context-based method.

Non-Patent Document 2 (Chen, H.-H. and C.-C. Lin, "Sense-tagging Chinese Corpus", in Proceedings of the 2nd Chinese Language Processing Workshop, 2000, pp. 7-14) proposes a semantic class tagging method based on cross-translation through a Chinese-English dictionary. The method consists of the following four basic steps.
1) Given a new word, look up all possible English translations of the word in a given Chinese-English dictionary.
2) Retrieve from WordNet the semantic items corresponding to all of the translations.
3) Look up a mapping table and match the semantic items obtained in step 2 against the semantic tags of Cilin.
4) Select one of the semantic tags obtained in step 3 as the final result by means of word-sense disambiguation methods.

Patent Document 1: Chinese Patent Publication No. CN1717679
Patent Document 2: US Patent Publication No. US20060100856A1
Patent Document 3: Chinese Patent Publication No. CN1369877

Non-Patent Document 1: Xiaofei Lu, "Hybrid Models for Semantic Classification of Chinese Unknown Words", NAACL HLT 2007, pp. 188-195
Non-Patent Document 2: Chen, H.-H. and C.-C. Lin, "Sense-tagging Chinese Corpus", in Proceedings of the 2nd Chinese Language Processing Workshop, 2000, pp. 7-14

However, none of the current techniques solves the problem of effectively determining the category of an unregistered word so that tagging can be automated.
Existing techniques generally rely on a dictionary compiled in advance to perform part-of-speech analysis on new words. The validity of the tagging result therefore depends on the structure of the corresponding dictionary or knowledge base, and performance is relatively low.

There is therefore a need for a technical solution that determines the category of an unregistered word effectively and efficiently.

To solve the above problems in the related art, an object of the present invention is to provide a method and apparatus for determining the category of an unregistered word.

The method for determining the category of an unregistered word according to the present invention includes the steps of: selecting synonyms of the unregistered word from a dictionary based on a word-formation rule; generating a context of the unregistered word from a corpus; and determining the category of the unregistered word according to the context and the synonyms of the unregistered word.

The apparatus for determining the category of an unregistered word according to the present invention includes: a synonym selection unit that selects synonyms of the unregistered word from a dictionary based on a word-formation rule; a context generation unit that generates a context of the unregistered word from a corpus; and a category determination unit that determines the category of the unregistered word according to the context and the synonyms of the unregistered word.

Other features and advantages of the present invention will become apparent from the following description of preferred embodiments with reference to the accompanying drawings.

According to the present invention, the category of an unregistered word can be determined more efficiently and more accurately.

Other objects and effects of the present invention will become clearer and easier to understand from the following description read with reference to the accompanying drawings.
FIG. 1 is a block diagram of an apparatus for determining the category of an unregistered word according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method for determining the category of an unregistered word according to an embodiment of the present invention.
FIG. 3 is a flowchart of a method for determining the category of an unregistered word according to another embodiment of the present invention.
FIG. 4 is a flowchart of a method for determining the category of an unregistered word according to another embodiment of the present invention.
FIG. 5 is a flowchart of a method for determining the category of an unregistered word according to still another embodiment of the present invention.
In all of the accompanying drawings, the same reference signs denote the same, similar, or corresponding features or functions.

Embodiments of the present invention are described in detail below with reference to the drawings. It should be understood that the drawings and embodiments are for illustration only and do not limit the scope of protection of the present invention.

For clarity, the technical terms used in the present invention are explained first.

1. Dictionary

A dictionary here refers to a dictionary that records the core vocabulary of a natural language and generally has 50,000 or more entries, such as [name not reproduced in the source] ("word wood"), HowNet, or WordNet. A dictionary contains one or more words, and each word is tagged with information such as its part of speech, category, meaning, and example sentences. Table 1 shows an example of the data structure of a dictionary. It lists three words, "北京" (Beijing), "保健品" (health care products), and "愉快" (happiness), each with its own part of speech and category.

2. Corpus

A corpus is a collection of free text. Free text consists of sentences, paragraphs, articles, and the like, or any combination of them.

3. Characters, direct elements, and words

A character is the smallest unit of text. For example, in Chinese, "天", "我", and "好" are each a character.

Direct element: a smaller unit that forms part of a larger unit is called an element of the larger unit. Accordingly, a smaller unit that directly constitutes a larger unit is called a direct element of the larger unit.
A direct element of a word is a phoneme or a word smaller than that word. For example, the direct elements of [word not reproduced in the source] are "科学", [element not reproduced in the source], and "部", and the direct elements of "冰晶" are "冰" and "晶".

A word is a character string formed of one or more characters and has a certain meaning. For example, [word not reproduced in the source] is a two-character word, and [word not reproduced in the source] is a three-character word.

4. Unregistered word

An unregistered word is a word that is not contained in the current dictionary.

5. Category

Categories include semantic classes and super classes (supersenses), which have a wider scope than semantic classes.

Examples of semantic classes are "city" and "emotion". A semantic class contains multiple words. For example, the words "北京" (Beijing) and "上海" (Shanghai) belong to the semantic class "city". A word may have more than one semantic class. For example, the word "arm" has the two semantic classes "body part" and "human".

A super class is a broader category than a semantic class, for example "location" or "substance". "Location" has a wider scope than the semantic class "city".

The present invention relates to a method for determining the category of an unregistered word. The method includes the steps of selecting synonyms of the unregistered word from a dictionary based on a word-formation rule, generating a context of the unregistered word from a corpus, and determining the category to which the unregistered word belongs based on the context and the synonyms of the unregistered word.

According to an embodiment of the present invention, the synonyms of the unregistered word are selected from the dictionary based on the word-formation rule by selecting, as synonyms, the dictionary words that share one or more word-formation constituents with the unregistered word. According to still another embodiment, this selection is realized by the following steps: determining the part of speech of the unregistered word; selecting from the dictionary the words that share one or more word-formation constituents with the unregistered word; and selecting, from all of the selected words, those having the same part of speech as the unregistered word as its synonyms.

According to an embodiment of the present invention, the context of the unregistered word is generated from the corpus by the following steps: searching the corpus for the unregistered word; extracting the characters adjacent to the unregistered word by applying a window method; performing word segmentation on the extracted adjacent characters; and determining a weight for each word obtained by the segmentation, the segmented words and their weights being used as the context of the unregistered word. According to another embodiment, the context is generated by the following steps: searching the corpus for the unregistered word; and analyzing the dependency relations of the unregistered word with a dependency tree method so that the dependency relations serve as the context of the unregistered word.

According to an embodiment of the present invention, determining the category to which the unregistered word belongs based on its context and synonyms includes the steps of: tallying the categories to which the synonyms belong; generating from the corpus the contexts of all words contained in each category, as the context of that category; calculating the similarity between the context of the unregistered word and the context of each category; and determining the category with the highest similarity as the category to which the unregistered word belongs.
According to another embodiment of the present invention, determining the category to which the unregistered word belongs based on its context and synonyms includes the steps of: generating the contexts of the synonyms from the corpus; calculating the similarity between the context of the unregistered word and the context of each synonym; extracting a set from the synonyms based on the calculated similarities; summing the similarities of the synonyms in the extracted set that belong to the same category; and determining the category to which the unregistered word belongs based on the summed similarities.
According to another embodiment of the present invention, determining the category to which the unregistered word belongs based on its context and synonyms includes the steps of: generating the contexts of the synonyms from the corpus; calculating the similarity between the context of the unregistered word and the context of each synonym; tallying the categories to which the synonyms belong; inputting predetermined weighting factors associated with the synonyms; weighting the similarity of each synonym by its predetermined weighting factor; extracting a set from the synonyms based on the weighted similarities; summing the weighted similarities of the synonyms in the extracted set that belong to the same category; and determining the category to which the unregistered word belongs based on the summed similarities.
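The first of these embodiments (one context per category, highest similarity wins) can be sketched as follows. This is a minimal illustration only: the text does not name a similarity measure, so cosine similarity is an assumption, as is merging word contexts by summing weights. The names `cosine`, `category_context`, and `best_category` and all toy data are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse contexts (dict: term -> weight)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def category_context(words, word_contexts):
    """Assumed merge rule: sum the context weights of all words in the category."""
    merged = Counter()
    for word in words:
        merged.update(word_contexts.get(word, {}))
    return dict(merged)

def best_category(target_context, categories, word_contexts):
    """Return the category whose context is most similar to the target context."""
    scores = {
        name: cosine(target_context, category_context(words, word_contexts))
        for name, words in categories.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical toy data: candidate categories and the contexts of their words.
categories = {"city": ["北京", "上海"], "emotion": ["愉快"]}
word_contexts = {
    "北京": {"visit": 1.0, "capital": 1.0},
    "上海": {"visit": 1.0, "port": 1.0},
    "愉快": {"feel": 1.0, "very": 1.0},
}
target_context = {"visit": 1.0, "port": 0.5}  # context of the unregistered word

best, scores = best_category(target_context, categories, word_contexts)
print(best)
```

In this toy setup the unregistered word shares context terms only with the "city" words, so "city" is selected.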

Each embodiment of the present invention is described in detail below.

FIG. 1 shows a block diagram of an apparatus 100 for determining the category of an unregistered word according to an embodiment of the present invention.

The apparatus 100 for determining the category of an unregistered word according to the present invention includes a synonym selection unit 110, a context generation unit 120, and a category determination unit 130. The synonym selection unit 110 selects synonyms of the unregistered word from a dictionary based on a word-formation rule. The context generation unit 120 generates the context of the unregistered word from a corpus. The category determination unit 130 determines the category to which the unregistered word belongs based on the context and the synonyms of the unregistered word.

According to an embodiment of the present invention, the synonym selection unit 110 may include means for selecting from the dictionary, as synonyms of the unregistered word, the words that share one or more word-formation constituents with the unregistered word.
According to an embodiment of the present invention, the synonym selection unit 110 may instead include means for determining the part of speech of the unregistered word, means for selecting from the dictionary the words that share one or more word-formation constituents with the unregistered word, and means for selecting, from all of the selected words, those having the same part of speech as the unregistered word as its synonyms.

The context generation unit 120 according to an embodiment of the present invention may include means for searching the corpus for the unregistered word, means for extracting the characters adjacent to the unregistered word by applying a window method, means for performing word segmentation on the extracted adjacent characters, and means for determining a weight for each word obtained by the segmentation and using the segmented words and their weights as the context of the unregistered word.

The context generation unit 120 according to an embodiment of the present invention may instead include means for searching the corpus for the unregistered word and means for analyzing the dependency relations of the unregistered word with a dependency tree method so that the dependency relations serve as the context of the unregistered word.

The context generation unit 120 according to an embodiment of the present invention may further include means for generating the contexts of the synonyms from the corpus.

According to one embodiment of the present invention, the category determination unit 130 may include means for tallying the categories to which the synonyms belong, means for generating from the corpus the contexts of all words contained in each category as the context of that category, means for calculating the similarity between the context of the unregistered word and the context of each category, and means for determining the category with the highest similarity as the category to which the unregistered word belongs.

According to an embodiment of the present invention, the category determination unit 130 may include means for calculating the similarity between the context of the unregistered word and the context of each synonym, means for extracting a set from the synonyms based on the similarities, means for summing the similarities of the synonyms in the extracted set that belong to the same category, and means for determining the category to which the unregistered word belongs based on the summed similarities.
In this embodiment, the means included in the category determination unit 130 for determining the category to which the unregistered word belongs based on the summed similarities executes a K-nearest-neighbor algorithm.
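The K-nearest-neighbor style decision described above (keep the most similar synonyms, sum similarities per category, pick the largest total) might look like the following sketch. It assumes each synonym belongs to exactly one category, which is a simplification since a word may have several semantic classes. The function name `knn_category` and the toy values are hypothetical.

```python
from collections import defaultdict

def knn_category(similarities, synonym_category, k):
    """Pick a category from the k synonyms most similar to the unregistered word.

    similarities: dict synonym -> similarity to the unregistered word's context
    synonym_category: dict synonym -> category (assumed single category per word)
    """
    # Keep the k synonyms with the highest similarity.
    top = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Sum the similarities of the kept synonyms per category.
    totals = defaultdict(float)
    for synonym, sim in top:
        totals[synonym_category[synonym]] += sim
    # The category with the largest summed similarity wins.
    return max(totals, key=totals.get), dict(totals)

# Hypothetical similarities and category labels for four synonyms.
similarities = {"s1": 0.6, "s2": 0.5, "s3": 0.45, "s4": 0.1}
synonym_category = {"s1": "A", "s2": "B", "s3": "B", "s4": "A"}

category, totals = knn_category(similarities, synonym_category, k=3)
print(category, totals)
```

Here the single best synonym belongs to category A, but two of the top three belong to B and their summed similarity is larger, so B is chosen.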

According to an embodiment of the present invention, the category determination unit 130 may instead include means for calculating the similarity between the context of the unregistered word and the context of each synonym, means for tallying the categories to which the synonyms belong, means for inputting predetermined weighting factors associated with the synonyms, means for weighting the similarity of each synonym by its predetermined weighting factor, means for extracting a set from the synonyms based on the similarities, means for summing the weighted similarities of the synonyms in the extracted set that belong to the same category, and means for determining the category to which the unregistered word belongs based on the summed similarities.
In this embodiment, the predetermined weighting factors are set according to the following policy.
If the unregistered word and a word in the category share both the last character and the second-to-last character, the predetermined weighting factor associated with the category is set to λ1.
Otherwise, if the unregistered word and a word in the category share both the last character and the first character, the predetermined weighting factor associated with the category is set to λ2.
Otherwise, if the unregistered word and a word in the category share the first character or the last character, the predetermined weighting factor associated with the category is set to λ3.
In all other cases, the predetermined weighting factor associated with the category is set to λ4.
Here, λ1 ≥ λ2 ≥ λ3 ≥ λ4.
In this embodiment, the means included in the category determination unit 130 for extracting a set from the synonyms based on the similarities may include means for sorting the similarities in descending order and means for extracting into the set the synonyms corresponding to a predetermined number of the highest similarities.
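The weighting policy can be expressed as a small function. This sketch reads "share the last character" as the two words having the same character in the same position, which is an interpretation rather than something the text states, and the numeric values chosen for λ1 to λ4 are hypothetical; the text only requires λ1 ≥ λ2 ≥ λ3 ≥ λ4.

```python
def weight_factor(unregistered, candidate, lambdas=(1.0, 0.8, 0.5, 0.2)):
    """Return the weighting factor for a candidate word under the stated policy.

    lambdas = (λ1, λ2, λ3, λ4); the default values are illustrative only.
    """
    l1, l2, l3, l4 = lambdas
    same_first = unregistered[0] == candidate[0]
    same_last = unregistered[-1] == candidate[-1]
    same_second_last = (
        len(unregistered) >= 2
        and len(candidate) >= 2
        and unregistered[-2] == candidate[-2]
    )
    if same_last and same_second_last:   # last two characters match
        return l1
    if same_last and same_first:         # first and last characters match
        return l2
    if same_first or same_last:          # only one end matches
        return l3
    return l4                            # no positional match

print(weight_factor("基民", "人民"))  # shares only the last character
print(weight_factor("基民", "民主"))  # shares a character, but not at either end position
```

A weighted similarity is then simply the similarity multiplied by this factor before the top synonyms are selected and summed per category.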

FIG. 2 shows a flowchart of a method for determining the category of an unregistered word according to an embodiment of the present invention.

In step 201, synonyms of the unregistered word are selected from the dictionary based on a word-formation rule.

The word-formation rule according to an embodiment of the present invention covers word-formation constituents, constituent attributes, and constituent relations. Word-formation constituents include the characters and/or direct elements that form a word. Constituent attributes include a word's annotation, length, and part of speech. Constituent relations include the relations between the constituents of a word (for example, coordination, modification, or restriction).

In one example, words that share one or more characters and/or direct elements with the unregistered word are selected from the dictionary as synonyms of the unregistered word.
For example, assume that the unregistered word is the two-character word "基民", consisting of "基" and "民". If the dictionary words containing the character "基" are [word not reproduced in the source], "基本", "奠基者", and "地基", and the words containing "民" are "人民" (people) and "民主", then all of these words are regarded as synonyms of the unregistered word "基民". The synonym set is then {[word not reproduced in the source], "基本", "奠基者", "地基", "人民", "民主"}. The example shown in FIG. 3 describes this embodiment.
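The character-sharing selection in this example can be sketched in a few lines. The sketch handles shared characters only, not direct elements, and uses a toy dictionary made of the five legible words from the example plus one unrelated word; `select_synonym_candidates` is a hypothetical name.

```python
def select_synonym_candidates(unregistered_word, dictionary_words):
    """Return the dictionary words that share at least one character with the word."""
    chars = set(unregistered_word)
    return [word for word in dictionary_words if chars & set(word)]

# Toy dictionary: five words named in the example plus one unrelated word.
dictionary_words = ["基本", "奠基者", "地基", "人民", "民主", "科学"]

candidates = select_synonym_candidates("基民", dictionary_words)
print(candidates)  # ['基本', '奠基者', '地基', '人民', '民主']
```

"科学" is dropped because it contains neither "基" nor "民".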

In another example, the part of speech of the unregistered word (noun, adjective, verb, and so on) is determined first, and the words that share one or more characters and/or direct elements with it are then selected from the dictionary. From these, the words having the same part of speech as the unregistered word are chosen as its synonyms. The examples shown in FIG. 4 and FIG. 5 describe this embodiment.
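Adding the part-of-speech filter to the character-sharing selection might look like this. How the part of speech of the unregistered word is determined is not covered here, so it is passed in as a given; the `Entry` type, the function name, and the part-of-speech labels on the toy entries are all hypothetical.

```python
from typing import NamedTuple

class Entry(NamedTuple):
    word: str
    pos: str  # part of speech recorded in the dictionary

def select_synonyms_by_pos(word, word_pos, entries):
    """Keep entries that share a character with `word` and have the same part of speech."""
    chars = set(word)
    return [e.word for e in entries if e.pos == word_pos and chars & set(e.word)]

# Toy dictionary entries; the part-of-speech labels are illustrative assumptions.
entries = [
    Entry("基本", "adjective"),
    Entry("奠基者", "noun"),
    Entry("地基", "noun"),
    Entry("人民", "noun"),
    Entry("民主", "adjective"),
    Entry("科学", "noun"),
]

synonyms = select_synonyms_by_pos("基民", "noun", entries)
print(synonyms)  # ['奠基者', '地基', '人民']
```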

In step 202, the context of the unregistered word is generated from the corpus.

According to an embodiment of the present invention, the context of a word is generated by applying a window method, a dependency tree method, or another method known to those skilled in the art.

The following describes how the context of a given word is obtained from a document by applying the window method.
Assume that the given word is [word not reproduced in the source] and that the corpus contains multiple sentences, one of which is [sentence not reproduced in the source]. The window size is set to 6.

まず、この語を文集内で検索する。
この例では、語

が文

に含まれることが検索される。 First, search for this word in the collection.
In this example, the search finds that the word […] is contained in the sentence […].

次に、語

に隣接する文字がウィンドウ方式を適用することによって取り出される。
サイズ６のウィンドウは、文集中この語を含む文あるいは段落において、この語をカバーする方法で定義される。
この語をカバーする方式は、例えば、隣接する直前の３文字”好把握”および隣接する後続の３文字”毎个人”が、語

を中心として取り出されること、あるいは、この語から始まって次に続く隣接する６文字”毎个人的人生”が取り出されること、あるいは、この語を最後として隣接する直前の６文字”一定好好把握”が取り出されること、あるいは、隣接する直前の１又は２文字と隣接する次の５又は４文字が取り出されること等を意味している。 Next, the characters adjacent to the word […] are extracted by applying the window method.
A window of size 6 is defined so that it covers this word within a sentence or paragraph of the collection that contains the word.
Covering the word means, for example, that the three immediately preceding characters “好把握” and the three immediately following characters “毎个人” are extracted with the word […] at the center; or that the six characters “毎个人的人生” immediately following the word are extracted; or that the six characters “一定好好把握” immediately preceding the word are extracted; or that the one or two immediately preceding characters and the five or four immediately following characters are extracted.

ウィンドウのサイズと等しい数の文字を切り出した後、該当語に隣接する切り出し済みの語について語分割を実行する。
例えば、該当語（すなわち、

）を中心として近接する直前の３文字”好把握”および近接する後続の３文字”毎个人”が切り出された場合、文字”好把握”および”毎个人”の２つのグループが取得され、その後、２つのグループの文字について語分割が実行され、例えば、以下の語分割結果が取得される。”好”、”把握”、”毎个”、”人” After the number of characters equal to the size of the window is cut out, word division is performed on the cut out words adjacent to the corresponding word.
For example, when the three characters “好把握” immediately preceding the word in question (i.e., […]) and the three characters “毎个人” immediately following it are cut out, two character groups, “好把握” and “毎个人”, are obtained. Word segmentation is then performed on the two groups, yielding, for example, the following result: “好”, “把握”, “毎个”, “人”.

次に、語分割後に取得されたそれぞれの語の重みが決定される。
語分割後に取得された結果は、対応するベクトル<v₁, v₂, ..., v_n>を有している。ここで、ｎはこの語の語分割結果の個数を示す。
上記の例においては、合計で４語の語分割結果がある。従って、ｎ＝４であり、v_iが対応する語の重み（ｉ＝１，・・・，ｎ）を示している。
例えば、ＴＦＩＤＦ−word frequency × inverse document frequency（語頻度×逆文書頻度）、ＢＯＯＬ（存在するかどうか）、ＩＤＦ−inverse document frequency（逆文書頻度）、ＰＭＩ-pointwise mutual information（自己相互情報量）等の複数の重み計算方法が存在する。
通常、語のコンテキストの出現回数はその語の意義の決定にほとんど寄与しない、一方それが出現するかどうかは決定的な重要性を有する。したがって、本発明の実施の形態においては、ＩＤＦ−inverse document frequency（逆文書頻度）が重みを計算するのに使用される。 Next, the weight of each word acquired after word division is determined.
The result obtained after word segmentation has a corresponding vector <v₁, v₂, ..., v_n>, where n is the number of words in the segmentation result.
In the above example, the segmentation yields four words in total, so n = 4, and v_i (i = 1, ..., n) is the weight of the i-th word.
Several weight calculation methods exist, for example TFIDF (word frequency × inverse document frequency), BOOL (whether the word is present), IDF (inverse document frequency), and PMI (pointwise mutual information).
In general, how many times a context word occurs contributes little to determining the meaning of a word, whereas whether it occurs at all is decisive. Therefore, in an embodiment of the present invention, IDF (inverse document frequency) is used to calculate the weights.

上記の処理を通して、語分割後に取得された各語とそれらの重みが取得され、また、取得されたそれらの語および重みが対象となる語のコンテキストとして使用される。 Through the above processing, each word obtained after word division and its weight are obtained, and the obtained word and weight are used as the context of the target word.
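The window-based context generation described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the names `window_groups`, `idf`, and `context_of` are hypothetical, the target word “〇〇” is a stand-in for the word that is not reproduced in the source, the segmenter is a toy lookup table, and the IDF formula log(N / df) is one common definition that the text itself does not spell out.

```python
import math
from typing import Callable, Dict, List, Tuple


def window_groups(sentence: str, word: str,
                  before: int = 3, after: int = 3) -> Tuple[str, str]:
    """Return the characters just before and just after the first occurrence of `word`."""
    start = sentence.find(word)
    if start < 0:
        raise ValueError("word not found in sentence")
    end = start + len(word)
    return sentence[max(0, start - before):start], sentence[end:end + after]


def idf(term: str, documents: List[List[str]]) -> float:
    """Inverse document frequency log(N / df); 0.0 if the term never occurs (assumed convention)."""
    df = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / df) if df else 0.0


def context_of(word: str, sentence: str,
               segment: Callable[[str], List[str]],
               documents: List[List[str]]) -> Dict[str, float]:
    """Context = segmented window words mapped to their IDF weights."""
    left, right = window_groups(sentence, word)
    return {term: idf(term, documents) for term in segment(left) + segment(right)}


# Toy data: "〇〇" is a hypothetical placeholder for the target word.
sentence = "一定好好把握〇〇毎个人的人生"
toy_segmentation = {"好把握": ["好", "把握"], "毎个人": ["毎个", "人"]}
documents = [["好", "把握"], ["人"], ["人", "毎个"], ["好"]]  # pre-segmented toy corpus

context = context_of("〇〇", sentence, lambda s: toy_segmentation[s], documents)
print(context)
```

A real system would replace the lookup table with an actual word segmenter and compute document frequencies over the whole collection.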

その他に、文集中の未登録語の検索および依存木方式における未登録語の分析によって、分析によって取得された依存性が対象となる語のコンテキストとして使用される。 Alternatively, the unregistered word is searched for in the collection and analyzed using the dependency tree method, and the dependencies obtained by the analysis are used as the context of the target word.

上記のコンテキスト生成方法を通して、未登録語のコンテキストが取得される。 Through the above context generation method, the context of the unregistered word is acquired.

ステップ２０３で、未登録語のコンテキストおよび同義語に基づいて未登録語が属するカテゴリを決定する。 In step 203, the category to which the unregistered word belongs is determined based on the context of the unregistered word and the synonym.

未登録語のコンテキストおよび同義語に基づいて未登録語が属するカテゴリを決定する処理は、複数の方法で実現することが可能である。図３から図５の以下の詳細な説明において、未登録語のコンテキストと同義語に基づいて未登録語が属するカテゴリを決定する複数の実施例を示す。 The process of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonym can be realized by a plurality of methods. In the following detailed description of FIGS. 3-5, a plurality of embodiments are shown for determining the category to which an unregistered word belongs based on the context and synonyms of the unregistered word.

図３において示す実施例において、まず、これらの同義語がそれぞれどのカテゴリに属するかを判定するために、未登録語の同義語について統計を取る。その後、それぞれのカテゴリのコンテキストを生成する。ここで、それぞれのカテゴリのコンテキストは、文集から生成されたそれぞれのカテゴリに包含されるすべての語のコンテキストに基づいて取得される。次に、関連技術における既知又は一般的な類似度計算方法を使用して、未登録語のコンテキストと各カテゴリのコンテキストの間の類似度を計算する。最後に、最大の類似度に対応するカテゴリを、未登録語が属するカテゴリとして決定する。 In the embodiment shown in FIG. 3, first, statistics are taken on synonyms of unregistered words in order to determine which category each of these synonyms belongs to. Then, the context of each category is generated. Here, the context of each category is acquired based on the context of all words included in each category generated from the collection of sentences. Next, the similarity between the unregistered word context and the context of each category is calculated using a known or general similarity calculation method in the related art. Finally, the category corresponding to the maximum similarity is determined as the category to which the unregistered word belongs.

図４において示す実施例において、まず、同義語のコンテキストを文集から生成する。ここでは、ステップ２０２における未登録語のコンテキストを生成する場合と同じ実現方法と同じ用いる。その後、未登録語のコンテキストと同義語のコンテキストの間の類似度を計算する。計算した類似度に基づいて未登録語の同義語から集合（この集合は予め定めた数の同義語を含む）を抽出する。次に、抽出した集合中同じカテゴリに属する同義語に対応する類似度を合計する。最後に、合計した類似度に基づいて未登録語が属するカテゴリを決定する。
図４に示す実施例においては、例えば、Ｋ最近接（ＫＮＮ）アルゴリズム（K-nearest Neighbor algorithm）あるいは当業者にとって周知の他の方法を利用することができる。 In the embodiment shown in FIG. 4, first, a synonym context is generated from a collection of sentences. Here, the same implementation method as that used when generating the context of the unregistered word in step 202 is used. Then, the similarity between the unregistered word context and the synonym context is calculated. A set (this set includes a predetermined number of synonyms) is extracted from synonyms of unregistered words based on the calculated similarity. Next, the similarities corresponding to synonyms belonging to the same category in the extracted set are summed. Finally, the category to which the unregistered word belongs is determined based on the total similarity.
In the embodiment shown in FIG. 4, for example, the K-nearest Neighbor algorithm or other methods known to those skilled in the art can be utilized.

図５において示す実施例においては、まず、文集から同義語のコンテキストを生成し、未登録語のコンテキストと同義語のコンテキストの間の類似度を計算する。
その後、重み係数を用いて計算した類似度を重み付けすることにより、最適な類似度結果を取得する。さらに、最適な類似度に基づいて未登録語が属するカテゴリを決定する。具体的には、まず、同義語のコンテキストを文集から生成する。未登録語のコンテキストと同義語のコンテキストの間の類似度を計算する。同義語が属するカテゴリについて統計を取る。同義語に関連した所定の重み係数を入力する。入力した所定の重み係数に基づいて関連する同義語に対応する類似度を重み付けする。
そして、重み付けした類似度に基づいて未登録語の同義語から集合（この集合は予め定めた数の同義語を含む）を抽出する。集合中同じカテゴリに属する同義語の重み付けした類似度を合計する。合計した類似度に基づいて未登録語が属するカテゴリを決定する。 In the embodiment shown in FIG. 5, first, a synonym context is generated from a collection of sentences, and a similarity between the unregistered word context and the synonym context is calculated.
Thereafter, an optimum similarity result is obtained by weighting the similarity calculated using the weighting coefficient. Further, the category to which the unregistered word belongs is determined based on the optimum similarity. Specifically, first, a synonym context is generated from a collection of sentences. Calculate the similarity between the unregistered word context and the synonym context. Take statistics on the category to which the synonym belongs. A predetermined weighting factor related to the synonym is input. The similarity corresponding to the related synonyms is weighted based on the input predetermined weighting factor.
Then, a set (this set includes a predetermined number of synonyms) is extracted from synonyms of unregistered words based on the weighted similarity. Summing weighted similarities of synonyms belonging to the same category in the set. The category to which the unregistered word belongs is determined based on the total similarity.

以下、図３〜５の実施例について詳細に説明する。 Hereinafter, the embodiment of FIGS. 3 to 5 will be described in detail.

図３は、本発明の他の実施例による未登録語のカテゴリを決定する方法のフローチャートを示す。 FIG. 3 shows a flowchart of a method for determining a category of unregistered words according to another embodiment of the present invention.

ステップ３０１で、未登録語を入力する。 In step 301, an unregistered word is input.

この実施例においては、入力した未登録語が”冰晶”であると仮定する。 In this embodiment, it is assumed that the input unregistered word is “冰晶”.

ステップ３０２で、未登録語と１つ以上の語構成要素を共有する語を、未登録語の同義語として辞書から選択する。 In step 302, a word that shares one or more word components with an unregistered word is selected from the dictionary as a synonym for the unregistered word.

上述したように、語構成規則は、語構成要素、要素属性および要素関係等を含んでいる。一方、語構成要素は、語を形成する文字および（または）直接要素等を含んでいる。未登録語と辞書が与えられると、辞書中の語がその未登録語と１つ以上の語構成要素を共有すれば、それらは全て未登録語の同義語として認められ、同義語集合内に置かれる。上記は、語構成規則に基づいて辞書から未登録語の同義語を選択するのための具体的な実現方法と認められる。 As described above, the word composition rules include word composition elements, element attributes, element relationships, and the like. On the other hand, the word component includes characters and / or direct elements that form a word. Given an unregistered word and a dictionary, if a word in the dictionary shares one or more word components with the unregistered word, they are all accepted as synonyms for the unregistered word and are in the synonym set. Placed. The above is recognized as a specific implementation method for selecting a synonym of an unregistered word from a dictionary based on word composition rules.

以下に、同じ文字を共有する例を提示する。
例えば、未登録語が２文字”冰”および”晶”からなる”冰晶”であると想定する。
辞書中の文字”冰”を含む語が”冰刀”、

、”冰雨”および”冰雪”であり、文字”晶”を含む語が”水晶”、”晶粒”および”晶体”あると想定すると、このとき、未登録語の同義語の集合={“冰刀”,

,”冰雨”,”冰雪”,”水晶”,”晶粒”,”晶体”}となる。 The following is an example of sharing the same character.
For example, assume that the unregistered word is “冰晶”, consisting of the two characters “冰” and “晶”.
If the dictionary words containing the character “冰” are “冰刀”, […], “冰雨”, and “冰雪”, and the words containing the character “晶” are “水晶”, “晶粒”, and “晶体”, then the synonym set of the unregistered word is {“冰刀”, […], “冰雨”, “冰雪”, “水晶”, “晶粒”, “晶体”}.
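The shared-character selection of step 302 can be sketched as below. This is an illustrative assumption about how the rule might be coded: `select_synonyms` is a hypothetical name, the dictionary is a toy list built from words that appear in this description, and only shared characters are checked (direct elements longer than one character are not handled).

```python
from typing import Iterable, List


def select_synonyms(unregistered: str, dictionary: Iterable[str]) -> List[str]:
    """Return dictionary words sharing at least one character with the unregistered word."""
    chars = set(unregistered)
    return [w for w in dictionary if w != unregistered and chars & set(w)]


# Toy dictionary; "大刀" and "人民" share no character with "冰晶" and are filtered out.
dictionary = ["冰刀", "冰雨", "冰雪", "水晶", "晶粒", "晶体", "大刀", "人民"]
synonyms = select_synonyms("冰晶", dictionary)
print(synonyms)
```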

ステップ３０３で、未登録語のコンテキストを文集から生成する。 In step 303, the context of the unregistered word is generated from the sentence collection.

未登録語のコンテキストは、ウィンドウ方式を適用することにより、依存木方式あるいは当業者に既知の他の方式で生成される。具体的な実現方法はステップ２０２において説明したので、ここでは説明を省略する。 The context of the unregistered word is generated by applying the window method, the dependency tree method, or another method known to those skilled in the art. The specific implementation was described for step 202, so its description is omitted here.

ステップ３０４で、同義語が属するカテゴリについて統計を取る。このステップにおいては、未登録語のそれぞれの同義語のカテゴリを取得し、その後、それらについて統計を取り、それらの同義語が属する全てのカテゴリをそれぞれ決定する。 At step 304, statistics are taken for the category to which the synonym belongs. In this step, the category of each synonym of the unregistered word is acquired, and then statistics are collected for them to determine all the categories to which the synonym belongs.

例えば、”冰刀”はカテゴリＣ１に属し、

はカテゴリＣ２に属する。”冰雨”と”冰雪”はカテゴリＣ４に属する。また、”水晶”、”晶粒”、”晶体”はカテゴリＣ３に属する。
前述したように、辞書中の語はそれぞれ、品詞、カテゴリ、語義、例文及び他の情報のタグが付けられている。したがって、各語がどのカテゴリに属するかは直接辞書から導き出すことが可能である。さらに、語のカテゴリは手作業で設定することも可能である。 For example, “冰刀” belongs to category C1, and […] belongs to category C2. “冰雨” and “冰雪” belong to category C4, and “水晶”, “晶粒”, and “晶体” belong to category C3.
As described above, each word in the dictionary is tagged with parts of speech, categories, meanings, example sentences, and other information. Therefore, it is possible to derive directly from the dictionary which category each word belongs to. Furthermore, word categories can be set manually.

この例において、カテゴリＣ１に属する語が”冰刀”を含み、カテゴリＣ２に属する語が

を含み、カテゴリＣ３に属する語が”水晶”、”晶粒”、”晶体”を含み、カテゴリＣ４に属する語が”冰雨”と”冰雪”を含む。 In this example, the words belonging to category C1 include “冰刀”, the words belonging to category C2 include […], the words belonging to category C3 include “水晶”, “晶粒”, and “晶体”, and the words belonging to category C4 include “冰雨” and “冰雪”.

このことから、未登録語”冰晶”の同義語の属するカテゴリがそれぞれＣ１、Ｃ２、Ｃ３およびＣ４であることが取得される。 From this, it is acquired that the categories to which the synonyms of the unregistered word “冰晶” belong are C1, C2, C3 and C4, respectively.

ステップ３０５で、各カテゴリ内に含まれる全ての語のコンテキストを、各カテゴリのコンテキストとして文集から生成する。 In step 305, contexts of all words included in each category are generated from the sentence collection as contexts of each category.

このステップでは、まず、それぞれのカテゴリ内に含まれている語を全て確定する。
例えば、カテゴリＣ１が”冰刀”の他に、さらに

および”大刀”を含むとすると、Ｃ１＝｛”冰刀”、

、”大刀”｝と記録する。
カテゴリＣ２が

の他にさらに

を含むとすると、

と記録する。
一方、カテゴリＣ３が”水晶”、”晶粒”、”晶体”だけを含むならば、Ｃ３＝｛”水晶”、”晶粒”、”晶体”｝と記録し、カテゴリＣ４が”冰雨”、”冰雪”だけを含むならば、Ｃ４＝｛”冰雨”、”冰雪”｝と記録する。 In this step, first, all the words included in each category are determined.
For example, if category C1 contains […] and “大刀” in addition to “冰刀”, it is recorded as C1 = {“冰刀”, […], “大刀”}.
If category C2 contains […] in addition to […], it is recorded as […].
If category C3 contains only “水晶”, “晶粒”, and “晶体”, it is recorded as C3 = {“水晶”, “晶粒”, “晶体”}, and if category C4 contains only “冰雨” and “冰雪”, it is recorded as C4 = {“冰雨”, “冰雪”}.

ステップ２０２で説明したような文集から語のコンテキストを生成する方法によれば、上記４つ以上のカテゴリＣ１−Ｃ４内に含まれる各語のコンテキストが生成される。各カテゴリに含まれる全ての語のコンテキストは、各カテゴリのコンテキストとして考えられる。
例えば、カテゴリＣ１に含まれる

のコンテキスト、”大刀”のコンテキスト及び”冰刀”のコンテキストは、共にカテゴリＣ１のコンテキストとすることができ、Ｃ１のコンテキスト＝｛”冰刀”のコンテキスト、

のコンテキストおよび”大刀”のコンテキスト｝と記録する。 Using the method of generating a word context from the collection of sentences described in step 202, the context of each word included in the four categories C1 to C4 is generated. The contexts of all the words contained in a category are together regarded as the context of that category.
For example, the context of […], the context of “大刀”, and the context of “冰刀”, all of which belong to category C1, together form the context of category C1, which is recorded as context of C1 = {context of “冰刀”, context of […], context of “大刀”}.

ステップ３０６で、未登録語のコンテキストと各カテゴリのコンテキストとの間の類似度を計算する。 In step 306, the similarity between the unregistered word context and the context of each category is calculated.

前述したように、未登録語のコンテキストはベクトルと見なすことができ、かつカテゴリのコンテキストについてもそこに含まれるすべての語のコンテキストを組み合わせたものあることからベクトルと見なすことができる。
したがって、ベクトルのコサイン距離を利用して２つのベクトル間の類似度を計算することができる。
コサイン距離は、式（１）で表される。

ここで、ＸとＹは２つのベクトルであり、ｎはＸとＹのベクトルの長さを示し、ｘ_ｊとｙ_ｊはＸベクトルとＹベクトル中のｊ番目の要素を示す。 As described above, the context of an unregistered word can be regarded as a vector, and the context of a category can be regarded as a vector because there is a combination of contexts of all words included therein.
Therefore, the similarity between two vectors can be calculated using the cosine distance of the vectors.
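The cosine distance is expressed by equation (1):

$$\cos(X, Y) = \frac{\sum_{j=1}^{n} x_j y_j}{\sqrt{\sum_{j=1}^{n} x_j^{2}}\,\sqrt{\sum_{j=1}^{n} y_j^{2}}} \qquad (1)$$

Here, X and Y are two vectors, n is the length of the vectors X and Y, and x_j and y_j are the j-th elements of X and Y, respectively.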

本発明に関する例によれば、Ｘは未登録語のコンテキストであり、Ｙはカテゴリのコンテキストであり、一方、ｘ_ｊとｙ_ｊはＸとＹのコンテキスト中のｊ番目の要素に対応する重みをそれぞれ示している。
２つのコンテキスト内に含まれている要素の数が異なるならば、２つのベクトルの全ての要素が、新たなコンテキスト・ベクトルＸ’およびＹ’を再構成するために抽出される。
Ｘ’について、Ｘ’内の要素がＸ内に出現しなければ、対応する重みを０として設定する。
しかしながら、ＸとＹの間の類似度の計算は、式（１）によってＸ’とＹ’の間の類似度を計算することにより実施される。
上記のコサイン距離の計算を通して、未登録語のコンテキストとそれぞれのカテゴリのコンテキストの間の類似度は、次のように取得される。
Sim(context(冰晶），context(C1)) = 0.71，
Sim(context(冰晶），context(C2)) = 0.67，
Sim(context(冰晶)，context(C3)) = 0.81，
Sim(context(冰晶)，context(C4)) = 0.65，

ここで、context（冰晶）は、語（冰晶）のコンテキストを示し、context（Ｃ１）は、カテゴリＣ１のコンテキストを示す。そして、Ｓｉｍ（Ａ、Ｂ）は、ＡとＢの間の類似度を示す。
従って、未登録語”冰晶”のコンテキストとカテゴリＣ１、Ｃ２、Ｃ３、Ｃ４のコンテキスト間の類似度は、それぞれ、０．７１，０．６７，０．８１，０．６５である。 According to an example related to the present invention, X is an unregistered word context, Y is a category context, while x _j and y _j are weights corresponding to the j th element in the X and Y contexts. Each is shown.
If the two contexts contain different numbers of elements, all elements of both vectors are collected to construct new context vectors X′ and Y′.
For X′, if an element of X′ does not appear in X, its weight is set to 0; Y′ is handled in the same way.
The similarity between X and Y is then obtained by calculating the similarity between X′ and Y′ using equation (1).
Through the above cosine distance calculation, the similarities between the context of the unregistered word and the context of each category are obtained as follows.
Sim(context(冰晶), context(C1)) = 0.71,
Sim(context(冰晶), context(C2)) = 0.67,
Sim(context(冰晶), context(C3)) = 0.81,
Sim(context(冰晶), context(C4)) = 0.65,

Here, context(冰晶) denotes the context of the word “冰晶”, and context(C1) denotes the context of category C1. Sim(A, B) denotes the similarity between A and B.
Therefore, the similarities between the context of the unregistered word “冰晶” and the contexts of categories C1, C2, C3, and C4 are 0.71, 0.67, 0.81, and 0.65, respectively.
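The cosine computation over contexts of different sizes can be sketched as follows, assuming each context is stored as a mapping from context word to weight. The function name `cosine_similarity` and the sample weights are illustrative assumptions. Taking the union of the keys and filling absent words with 0 corresponds to building the padded vectors X′ and Y′ described above.

```python
import math
from typing import Dict


def cosine_similarity(x: Dict[str, float], y: Dict[str, float]) -> float:
    """Cosine of two sparse context vectors aligned on the union of their words."""
    keys = sorted(set(x) | set(y))          # shared index set for X' and Y'
    xv = [x.get(k, 0.0) for k in keys]      # words absent from X get weight 0
    yv = [y.get(k, 0.0) for k in keys]      # words absent from Y get weight 0
    dot = sum(a * b for a, b in zip(xv, yv))
    norm = math.sqrt(sum(a * a for a in xv)) * math.sqrt(sum(b * b for b in yv))
    return dot / norm if norm else 0.0      # assumed convention for an empty context


# Hypothetical weights, chosen only to make the arithmetic easy to follow.
x = {"好": 1.0, "把握": 1.0}
y = {"把握": 1.0, "人": 1.0}
print(cosine_similarity(x, y))
```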

なお、２つの間の類似度については、当業者によって既知の他の方法によって計算することが可能である。 Note that the degree of similarity between the two can be calculated by other methods known to those skilled in the art.

ステップ３０７で、最大の類似度に対応するカテゴリを未登録語のカテゴリとして決定する。 In step 307, the category corresponding to the maximum similarity is determined as the category of unregistered words.

ステップ３０６で計算される類似度の比較によって、未登録語”冰晶”のコンテキストとカテゴリＣ３のコンテキストの間の類似度が最も高いことが分かる。その結果、未登録語”冰晶”のカテゴリはカテゴリＣ３として確定される。 By comparing the similarity calculated in step 306, it can be seen that the similarity between the context of the unregistered word “冰晶” and the context of category C3 is the highest. As a result, the category of the unregistered word “冰晶” is determined as category C3.
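As a small illustration of step 307, the category with the maximum similarity can be picked from the four values given in the example above. The helper name `best_category` is hypothetical.

```python
from typing import Dict


def best_category(similarities: Dict[str, float]) -> str:
    """Return the category whose context is most similar to the unregistered word's context."""
    return max(similarities, key=similarities.get)


# Similarities between context(冰晶) and each category context, as given in the example.
similarities = {"C1": 0.71, "C2": 0.67, "C3": 0.81, "C4": 0.65}
print(best_category(similarities))
```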

図４は、本発明の他の実施の形態による未登録語のカテゴリを決定する方法のフローチャートを示す。 FIG. 4 shows a flowchart of a method for determining a category of unregistered words according to another embodiment of the present invention.

ステップ４０１で、未登録語を入力する。 In step 401, an unregistered word is input.

この実施の形態においては、図３の実施の形態と同様に、受信した未登録語が”冰晶”であると仮定する。 In this embodiment, as in the embodiment of FIG. 3, it is assumed that the received unregistered word is “冰晶”.

ステップ４０２で、未登録語の品詞を決定する。 In step 402, the part of speech of the unregistered word is determined.

未登録語の品詞の決定するためには複数の方法が存在する。例えば、未登録語の品詞を既知の様々なモデルを使用して推測することも可能であるし、あるいは人為的なタグ付けによって決定することも可能である。この例において、未登録語”冰晶”の品詞が名詞であると仮定する。 There are several ways to determine the part of speech of an unregistered word. For example, it can be inferred using various known models, or it can be determined by manual tagging. In this example, the part of speech of the unregistered word “冰晶” is assumed to be noun.

ステップ４０３で、未登録語と語構成要素を共有する語を辞書から選択する。 In step 403, a word that shares a word component with an unregistered word is selected from the dictionary.

例えば、ステップ３０２と同じように、未登録語が”冰晶”であると仮定すると、未登録語”冰晶”と１文字を共有する集合は｛”冰刀”、

、”冰雨”、”冰雪”、”水晶”、”晶粒”｝であることが判定される。 For example, as in step 302, assuming that the unregistered word is “冰晶”, the set of words sharing one character with the unregistered word “冰晶” is determined to be {“冰刀”, […], “冰雨”, “冰雪”, “水晶”, “晶粒”}.

この時、ステップ３０２と異なるのは、上記集合は未登録語の同義語として直接取得されず、かつステップ４０４における品詞フィルタリング処理が続いて実行されることである。 At this time, the difference from step 302 is that the set is not directly acquired as a synonym of an unregistered word, and the part-of-speech filtering process in step 404 is subsequently executed.

ステップ４０４で、未登録語と同じ品詞を持つ語を未登録語の同義語として上記選択した語から選び出す。 In step 404, a word having the same part of speech as the unregistered word is selected from the selected words as a synonym for the unregistered word.

前述したように、語構成規則は、語構成要素、要素属性および要素関係を含んでおり、かつ要素属性は、語の注釈、長さ、品詞等を含んでいる。
図４の実施の形態においては、語構成規則中の品詞を利用して未登録語の同義語の選択を実行する。 As described above, the word composition rule includes a word composition element, an element attribute, and an element relationship, and the element attribute includes a word annotation, a length, a part of speech, and the like.
In the embodiment of FIG. 4, synonyms for unregistered words are selected using the part of speech in the word composition rules.

この実施の形態においては、ステップ４０２から未登録語”冰晶”の品詞が名詞であることが決定でき、また集合{"冰刀",

,"冰雨","冰雪","水晶","晶粒"}のそれぞれの語の品詞は辞書から取得できる。よって、ステップ４０４で、集合中の名詞が未登録語”冰晶”の同義語として選択される。 In this embodiment, step 402 determines that the part of speech of the unregistered word “冰晶” is noun, and the part of speech of each word in the set {“冰刀”, […], “冰雨”, “冰雪”, “水晶”, “晶粒”} can be obtained from the dictionary. Therefore, in step 404, the nouns in the set are selected as synonyms of the unregistered word “冰晶”.

ステップ４０５で、未登録語のコンテキストを文集から生成する。 In step 405, the context of the unregistered word is generated from the sentence collection.

語のコンテキストは、ウィンドウ方式を適用することにより、依存木方式あるいは当業者にとって既知の他の方式で生成される。具体的な実現方法はステップ２０２において説明したので、ここでは説明を省略する。 The context of the word is generated by applying the window method, the dependency tree method, or another method known to those skilled in the art. The specific implementation was described for step 202, so its description is omitted here.

ステップ４０６で、同義語のコンテキストを文集から生成する。 At step 406, a synonym context is generated from the collection.

同義語のコンテキストは、ウィンドウ方式を適用することにより、依存木方式あるいは当業者にとって既知の他の方式で生成される。具体的な実現方法はステップ２０２において説明したので、ここでは説明を省略する。 The contexts of the synonyms are generated by applying the window method, the dependency tree method, or another method known to those skilled in the art. The specific implementation was described for step 202, so its description is omitted here.

ステップ４０７で、未登録語のコンテキストと同義語のコンテキストの間の類似度を計算する。 In step 407, the similarity between the unregistered word context and the synonym context is calculated.

前述したように、未登録語のコンテキストはベクトルと見なすことができ、かつ同義語のコンテキストについてもベクトルと見なすことができる。したがって、ベクトルのコサイン距離の式（１）を利用して２つのベクトル間の類似度を計算することができる。 As described above, the context of unregistered words can be regarded as a vector, and the context of synonyms can also be regarded as a vector. Therefore, the similarity between two vectors can be calculated using the expression (1) of the cosine distance of the vectors.

本発明に関する例によれば、Ｘは未登録語のコンテキストであり、Ｙは未登録語の同義語のコンテキストであり、一方、ｘ_ｊとｙ_ｊはＸとＹのコンテキスト中のｊ番目の要素に対応する重みをそれぞれ示している。
上記のコサイン距離の計算を通して、未登録語のコンテキストとその同義語のコンテキストの間の類似度は、次のように取得される。
Sim(context(冰晶)，context(冰刀)) = 0.30，
Sim(context(冰晶)，context([…])) = 0.67，
Sim(context(冰晶），context(水晶)) = 0.81，
Sim(context(冰晶），context(晶粒)) = 0.74，
Sim(context(冰晶），context(冰雨)) = 0.69，
Sim(context(冰晶)，context(冰雪)) = 0.56，
ここで、context(冰晶)は未登録語”冰晶”のコンテキストを示し、context(冰刀)は未登録語”冰晶”の同義語”冰刀”のコンテキストを示す。そして、Ｓｉｍ（Ａ、Ｂ）は、ＡとＢの間の類似度を示す。
従って、未登録語”冰晶”と同義語”冰刀”，

，”水晶”，”晶粒”，”冰雨”，”冰雪”との間の類似度は、それぞれ、０．３０，０．６７，０．８１，０．７４，０．６９，０．５６である。 According to an example relating to the present invention, X is the context of an unregistered word, Y is the context of a synonym of an unregistered word, while x _j and y _j are the jth element in the context of X and Y The weights corresponding to are respectively shown.
Through the above cosine distance calculation, the similarities between the context of the unregistered word and the contexts of its synonyms are obtained as follows.
Sim(context(冰晶), context(冰刀)) = 0.30,
Sim(context(冰晶), context([…])) = 0.67,
Sim(context(冰晶), context(水晶)) = 0.81,
Sim(context(冰晶), context(晶粒)) = 0.74,
Sim(context(冰晶), context(冰雨)) = 0.69,
Sim(context(冰晶), context(冰雪)) = 0.56,
Here, context(冰晶) denotes the context of the unregistered word “冰晶”, and context(冰刀) denotes the context of “冰刀”, a synonym of the unregistered word “冰晶”. Sim(A, B) denotes the similarity between A and B.
Therefore, the similarities between the unregistered word “冰晶” and its synonyms “冰刀”, […], “水晶”, “晶粒”, “冰雨”, and “冰雪” are 0.30, 0.67, 0.81, 0.74, 0.69, and 0.56, respectively.

ステップ４０８で、類似度に基づいて未登録語の同義語から１つの集合を抽出する。 In step 408, one set is extracted from synonyms of unregistered words based on the similarity.

抽出される集合中の同義語の数を予め設定する。一例において、その集合は、所定数の同義語を含むように設定される。所定数は、未登録語の同義語の合計数以下の任意の数である。本実施の形態においては、所定数をＫとし、所定数が５であると仮定する。すなわち、Ｋ＝５と仮定する。 The number of synonyms in the set to be extracted is set in advance. In one example, the set is set to include a predetermined number of synonyms. The predetermined number is an arbitrary number equal to or less than the total number of synonyms of unregistered words. In the present embodiment, it is assumed that the predetermined number is K and the predetermined number is 5. That is, assume that K = 5.

まず、ステップ４０７で取得した類似度を、降順にソートする。 First, the similarities acquired in step 407 are sorted in descending order.

本実施の形態においては、ステップ４０７で、計算された合計６つの類似度がある。降順にそれらをソートすることで、以下の順序を取得する。
０．８１，０．７４，０．６９，０．６７，０．５６，０．３０
また、この順序の類似度に対応する同義語は、それぞれ、”水晶”、”晶粒”、”冰雨”、

、”冰雪”、”冰刀”である。 In the present embodiment, there are a total of six similarities calculated in step 407. Sort them in descending order to get the following order:
0.81, 0.74, 0.69, 0.67, 0.56, 0.30
The synonyms corresponding to the similarities in this order are “水晶”, “晶粒”, “冰雨”, […], “冰雪”, and “冰刀”, respectively.

その後、上位から順番に、所定数の類似度に対応する同義語を、集合に抽出する。 Then, in order from the top, synonyms corresponding to a predetermined number of similarities are extracted into a set.

この実施の形態においては、所定数Ｋ＝５であり、未登録語について合計で６つの同義語があるので、降順にソートされた類似度のうち最初の５つの類似度、すなわち、０．８１、０．７４、０．６９、０．６７、０．５６が選択される。
そして、これらの類似度に対応する同義語”水晶”,”晶粒”,”冰雨”,

,”冰雪”を集合に加えるために集合のメンバとして抽出する。 In this embodiment, since the predetermined number K = 5 and there are a total of six synonyms for unregistered words, the first five similarities sorted in descending order, ie 0.81 , 0.74, 0.69, 0.67, and 0.56 are selected.
The synonyms corresponding to these similarities, “水晶”, “晶粒”, “冰雨”, […], and “冰雪”, are then extracted as members of the set.

ステップ４０９で、集合中、同じカテゴリに属する同義語に対応する類似度を合計する。 In step 409, the similarities corresponding to synonyms belonging to the same category in the set are summed.

このステップにおいて、まず、未登録語の同義語が属するカテゴリを決定する。それはステップ３０４の方法に従って実施され、それによってステップ３０４と同じ結果が得られる。すなわち、カテゴリＣ２に属する語が

を含み、カテゴリＣ３に属する語は”水晶”、”晶粒”、”晶体”を含み、カテゴリＣ４に属する語が”冰雨”、”冰雪”を含む。
従って、ステップ４０８で抽出した集合に含まれる同義語は、それぞれ、Ｃ２、Ｃ３およびＣ４に属する。 In this step, the categories to which the synonyms of the unregistered word belong are first determined. This is done by the method of step 304 and yields the same result as step 304: the words belonging to category C2 include […], the words belonging to category C3 include “水晶”, “晶粒”, and “晶体”, and the words belonging to category C4 include “冰雨” and “冰雪”.
Therefore, the synonyms included in the set extracted in step 408 belong to C2, C3, and C4, respectively.

次に、未登録語のコンテキストと同じカテゴリに属する同義語のコンテキストの間の類似度を合計し、その結果として、未登録語とそれぞれのカテゴリの間の類似度を得る。例えば、以下のように未登録語とそれぞれのカテゴリの間の類似度を得る。
Sim(冰晶，C2)=Sim(context(冰晶)，context

) = 0.67，
Sim(冰晶，C3)=Sim(context(冰晶)，context(水晶))+Sim(context(冰晶)，
context(晶粒)) = 1.55，
Sim(冰晶，C4)=Sim(context(冰晶)，context(冰雨))+Sim(context(冰晶)，
context(冰雪)) = 1.25 Next, the similarities between the context of the unregistered word and the contexts of the synonyms belonging to the same category are summed, which gives the similarity between the unregistered word and each category. For example, the similarities between the unregistered word and the respective categories are obtained as follows.
Sim(冰晶, C2) = Sim(context(冰晶), context([…])) = 0.67,
Sim(冰晶, C3) = Sim(context(冰晶), context(水晶)) + Sim(context(冰晶), context(晶粒)) = 1.55,
Sim(冰晶, C4) = Sim(context(冰晶), context(冰雨)) + Sim(context(冰晶), context(冰雪)) = 1.25

ステップ４１０で、合計された類似度に従って未登録語のカテゴリを決定する。 In step 410, categories of unregistered words are determined according to the total similarity.

ステップ４０９で取得した未登録語とそれぞれのカテゴリの間の類似度をソートする。それにより、未登録語”冰晶”とカテゴリＣ３の間の類似度が最も高いことが分かる。その結果、カテゴリＣ３が未登録語のカテゴリとして決定される。 The similarities between the unregistered word and the respective categories obtained in step 409 are sorted. This shows that the similarity between the unregistered word “冰晶” and category C3 is the highest. As a result, category C3 is determined as the category of the unregistered word.
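Steps 408 to 410 amount to a similarity-weighted K-nearest-neighbor vote, which can be sketched as follows. The function name `classify_by_top_k` is hypothetical, and the key `"<C2 synonym>"` is a placeholder for the C2 word that is not reproduced in this text. The similarity values and category assignments are the ones given in the example.

```python
from collections import defaultdict
from typing import Dict, Tuple


def classify_by_top_k(sims: Dict[str, float], category_of: Dict[str, str],
                      k: int) -> Tuple[str, Dict[str, float]]:
    """Keep the k most similar synonyms, sum similarities per category, pick the largest sum."""
    top = sorted(sims.items(), key=lambda kv: kv[1], reverse=True)[:k]
    totals: Dict[str, float] = defaultdict(float)
    for synonym, similarity in top:
        totals[category_of[synonym]] += similarity
    best = max(totals, key=totals.get)
    return best, dict(totals)


sims = {"冰刀": 0.30, "<C2 synonym>": 0.67, "水晶": 0.81,
        "晶粒": 0.74, "冰雨": 0.69, "冰雪": 0.56}
category_of = {"冰刀": "C1", "<C2 synonym>": "C2", "水晶": "C3",
               "晶粒": "C3", "冰雨": "C4", "冰雪": "C4"}

best, totals = classify_by_top_k(sims, category_of, k=5)
print(best, totals)
```

With K = 5 the lowest-ranked synonym “冰刀” is dropped, so category C1 receives no vote at all.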

さらに、本発明のいくつかの実施の形態においては、合計した類似度に基づいて未登録語が属するカテゴリを決定するのにその他のルールを利用することも可能である。
例えば、未登録語とそれぞれのカテゴリの間の最大の類似度が選択されないことがあるが、この場合、類似度中の中間の値に対応するカテゴリを未登録語のカテゴリとして決定する。 Further, in some embodiments of the present invention, other rules may be used to determine the category to which the unregistered word belongs based on the total similarity.
For example, the maximum similarity between the unregistered word and each category may not be selected. In this case, the category corresponding to the intermediate value in the similarity is determined as the category of the unregistered word.

図５は、本発明のさらに他の実施の形態による未登録語のカテゴリを決定する方法のフローチャートを示す。 FIG. 5 shows a flowchart of a method for determining a category of unregistered words according to still another embodiment of the present invention.

ステップ５０１で、未登録語を入力する。 In step 501, an unregistered word is input.

この実施の形態において、入力した未登録語が

であると仮定する。 In this embodiment, it is assumed that the input unregistered word is […].

ステップ５０２で、未登録語と１つ以上の語構成要素を共有する語を、未登録語の同義語として辞書から選択する。 In step 502, a word that shares one or more word components with an unregistered word is selected from the dictionary as a synonym for the unregistered word.

ステップ３０２と同じように、
ステップ５０２で語構成規則に基づいて選択された未登録語の同義語は、

、”厂主”である。 As in step 302, the synonyms of the unregistered word selected in step 502 based on the word composition rules are […] and “厂主”.

ステップ５０３で、未登録語のコンテキストを文集から生成する。 In step 503, a context for unregistered words is generated from the collection of sentences.

未登録語のコンテキストは、ウィンドウ方式を適用することにより、依存木方式あるいは当業者にとって既知の他の方式で生成される。具体的な実現方法はステップ２０２において説明したので、ここでは説明を省略する。 The context of the unregistered word is generated by applying the window method, the dependency tree method, or another method known to those skilled in the art. The specific implementation was described for step 202, so its description is omitted here.

ステップ５０４で、同義語のコンテキストを文集から生成する。 In step 504, a synonym context is generated from the collection.

未登録語の同義語のコンテキストは、ウィンドウ方式を適用することにより、依存木方式あるいは当業者にとって既知の他の方式で生成される。具体的な実現方法はステップ２０２において説明したので、ここでは説明を省略する。 The contexts of the synonyms of the unregistered word are generated by applying the window method, the dependency tree method, or another method known to those skilled in the art. The specific implementation was described for step 202, so its description is omitted here.

ステップ５０５で、未登録語のコンテキストと同義語のコンテキストの間の類似度を計算する。 In step 505, the similarity between the unregistered word context and the synonym context is calculated.

このステップはステップ４０７と同様であるので、ここでは詳細を説明しない。ステップ５０５で、未登録語

のコンテキストとその同義語のコンテキストの間の類似度が、次のように取得される。

Since this step is the same as step 407, it is not described in detail here. In step 505, the similarities between the context of the unregistered word […] and the contexts of its synonyms are obtained as follows.

ステップ５０６で、同義語が属するカテゴリについて統計を取る。 At step 506, statistics are taken for the category to which the synonym belongs.

このステップは、ステップ３０４において説明した方法で実現され、以下の結果を得る。
カテゴリＣ１に属する語は

を含み、カテゴリＣ２に属する語は

と

を含み、カテゴリＣ３に属する語は

を含み、そして、カテゴリＣ４に属する語は

と”厂主”を含む。 This step is realized by the method described in step 304, and the following result is obtained.
The words belonging to category C1 include […]; the words belonging to category C2 include […] and […]; the words belonging to category C3 include […]; and the words belonging to category C4 include […] and “厂主”.

In step 507, a predetermined weighting factor associated with each synonym is input.

The context of a word is very important for determining its category, and the structure information of the word is equally important. The present invention therefore proposes the concept of mixed similarity, in which the similarity between the context of the unregistered word and the context of a synonym is weighted using word structure information.
In the present embodiment, the word structure information is, for example, a predetermined weighting factor λ(w, w_i). Weighting the similarity between the context of the unregistered word and the context of a synonym with the predetermined weighting factor is expressed by the following equation.

Sim(w, w_i) = λ(w, w_i) * CTS(w, w_i)    (2)

Here, w is the unregistered word and w_i is a synonym of the unregistered word. λ(w, w_i) is the weighting factor based on the structure information of the unregistered word w and the synonym w_i, and CTS(w, w_i) is the similarity between the context of the unregistered word w and the context of its synonym w_i.

Several methods can be used to set the weighting factor. In one example, the setting of the weighting factor must satisfy the following policy.
If the unregistered word w and its synonym w_i share both the last character and the second-to-last character, the predetermined weighting factor λ(w, w_i) is set to λ_1.
For example, [...]
Otherwise, if the unregistered word w and its synonym w_i share both the last character and the first character, the predetermined weighting factor λ(w, w_i) is set to λ_2.
For example, [...]
Otherwise, if the unregistered word w and its synonym w_i share the first character or the last character, the predetermined weighting factor λ(w, w_i) is set to λ_3.
For example, λ(基民, 市民) = λ_3.

In all other cases, the weighting factor λ(w, w_i) is set to λ_4.
Here, λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4, and the concrete values are obtained through experiments.
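The weighting policy can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `weight_coefficient` is hypothetical, "sharing" a character is interpreted as having the same character at the same position, and the caller supplies the four values, since the text says only that they are found experimentally.

```python
from typing import Sequence


def weight_coefficient(w: str, w_i: str, lambdas: Sequence[float]) -> float:
    """Return lambda(w, w_i) according to the four-case policy.

    `lambdas` is (lambda_1, lambda_2, lambda_3, lambda_4) and must satisfy
    lambda_1 >= lambda_2 >= lambda_3 >= lambda_4.
    Assumption: two words "share" a character when the characters at the
    same position (first, last, second-to-last) are identical.
    """
    lam1, lam2, lam3, lam4 = lambdas
    if not (lam1 >= lam2 >= lam3 >= lam4):
        raise ValueError("expected lambda_1 >= lambda_2 >= lambda_3 >= lambda_4")

    both_nonempty = bool(w) and bool(w_i)
    same_first = both_nonempty and w[0] == w_i[0]
    same_last = both_nonempty and w[-1] == w_i[-1]
    same_penultimate = len(w) >= 2 and len(w_i) >= 2 and w[-2] == w_i[-2]

    if same_last and same_penultimate:
        return lam1  # last and second-to-last characters shared
    if same_last and same_first:
        return lam2  # last and first characters shared
    if same_first or same_last:
        return lam3  # first or last character shared
    return lam4      # nothing shared


# lambda_1 and lambda_3 below are placeholder values chosen for illustration;
# 10 and 0.382 are the lambda_2 and lambda_4 values quoted in the example.
EXAMPLE_LAMBDAS = (10.0, 10.0, 1.0, 0.382)
print(weight_coefficient("基民", "市民", EXAMPLE_LAMBDAS))  # lambda_3 case
```

Checking the cases in order from most to least specific mirrors the "otherwise" chain of the policy, so a pair that satisfies several conditions always receives the largest applicable factor.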

In step 508, the similarity corresponding to each associated synonym is weighted using the predetermined weighting factor.

In one example, according to step 507, [...] are each assigned λ_4 = 0.382, and [...] is assigned λ_2 = 10.

The weighting factors obtained in step 507 and the similarities between the context of the unregistered word and the contexts of the synonyms obtained in step 505 are applied to equation (2), which yields the following weighted similarities: [...]
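As a minimal sketch of this weighting step, the snippet below applies equation (2) to each synonym. The synonym names and the context similarity (CTS) values are hypothetical placeholders, because the actual values are not available in this text; only the factors 10 and 0.382 come from the example.

```python
def weighted_similarity(cts: float, lam: float) -> float:
    """Equation (2): Sim(w, w_i) = lambda(w, w_i) * CTS(w, w_i)."""
    return lam * cts


def weight_all(cts_by_synonym: dict, lam_by_synonym: dict) -> dict:
    """Apply equation (2) to every synonym that has a context similarity."""
    return {
        synonym: weighted_similarity(cts, lam_by_synonym[synonym])
        for synonym, cts in cts_by_synonym.items()
    }


# Hypothetical inputs: "syn_a" and "syn_b" stand in for real synonyms.
cts = {"syn_a": 0.30, "syn_b": 0.45}
lam = {"syn_a": 10.0, "syn_b": 0.382}
weighted = weight_all(cts, lam)
print(weighted)
```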

In step 509, a set is extracted from the synonyms of the unregistered word based on the similarities.

This step is similar to step 408. First, the similarities weighted with the factors from step 507 are sorted in descending order. Then, starting from the top, the synonyms corresponding to a predetermined number of similarities are extracted into a set.

In this embodiment the predetermined number is again assumed to be K = 5, so the first five similarities in descending order are selected: 3.0, 0.172, 0.115, 0.103, and 0.076.
The synonyms corresponding to these similarities, [...] and "厂主", are extracted as members of the set.
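The top-K extraction of step 509 can be sketched as below. The function name `top_k_synonyms` and the synonym labels are hypothetical; the five largest values are the ones quoted in the embodiment, and the two smaller values are added only so that the cut-off has an effect.

```python
def top_k_synonyms(weighted: dict, k: int = 5) -> list:
    """Sort weighted similarities in descending order and keep the top k.

    Returns a list of (synonym, weighted_similarity) pairs.
    """
    ranked = sorted(weighted.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical synonym labels; 0.050 and 0.010 are made-up extra values.
weighted = {
    "syn_c": 0.115, "syn_a": 3.0, "syn_f": 0.050, "syn_b": 0.172,
    "syn_e": 0.076, "syn_g": 0.010, "syn_d": 0.103,
}
selected = top_k_synonyms(weighted, k=5)
print(selected)
```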

In step 510, the similarities corresponding to synonyms that belong to the same category within the set are summed.

Step 510 is similar to step 409.

First, the result of step 506 shows that, within the extracted set, [...] and [...] belong to category C2, [...] belongs to category C3, and [...] and "厂主" belong to category C4.
The synonyms in the set extracted in step 509 therefore belong to C2, C3, and C4, and these categories are the candidate categories of the unregistered word.

Next, the similarities between the context of the unregistered word and the contexts of the synonyms belonging to the same category are summed, which gives the similarity between the unregistered word and each category. For example, the following similarities are obtained: [...]

In step 511, the category of the unregistered word is determined according to the summed similarities.

The similarities between the unregistered word and each category obtained in step 510 are sorted. This shows that the similarity between the unregistered word [...] and category C3 is the highest. As a result, category C3 is determined to be the category of the unregistered word.
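Steps 510 and 511 together amount to a per-category sum followed by picking the maximum, which can be sketched as follows. The function name `decide_category`, the synonym labels, and the assignment of each similarity value to a category are hypothetical; the text states only that two set members belong to C2, one to C3, and two to C4, and that C3 is chosen.

```python
from collections import defaultdict


def decide_category(selected, category_of):
    """Sum weighted similarities per category and return the best category.

    `selected` is a list of (synonym, weighted_similarity) pairs and
    `category_of` maps each synonym to its category label.
    Returns (best_category, totals_by_category).
    """
    totals = defaultdict(float)
    for synonym, similarity in selected:
        totals[category_of[synonym]] += similarity
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical assignment of the five selected similarities to categories.
selected = [("syn_a", 3.0), ("syn_b", 0.172), ("syn_c", 0.115),
            ("syn_d", 0.103), ("syn_e", 0.076)]
category_of = {"syn_a": "C3", "syn_b": "C2", "syn_c": "C2",
               "syn_d": "C4", "syn_e": "C4"}
best, totals = decide_category(selected, category_of)
print(best, totals)
```

Summing rather than averaging means a category supported by several moderately similar synonyms can still outweigh a category with one synonym, unless, as in this hypothetical input, a single structurally close synonym carries a large weighting factor.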

According to the present invention, synonyms of an unregistered word are selected from a dictionary based on a word composition rule, and a context of the unregistered word is generated from a corpus. The category to which the unregistered word belongs is then determined based on the context of the unregistered word and the synonyms.
The present invention addresses the performance degradation of the related art and solves the problem of how to automatically select synonyms from an existing dictionary based on word composition rules so that category selection achieves high coverage.
It also solves the problem of how to merge word structure information and context information so that semantic similarity is calculated accurately.

The method according to the present invention can be implemented in software, hardware, or a combination of the two. The hardware part can be implemented using dedicated logic; the software part can be stored in memory and executed by a suitable instruction execution system such as a microprocessor, a personal computer (PC), or a mainframe.

To make the present invention easier to understand, the above description omits some technical details that are essential for implementing the invention but are well known to those skilled in the art.

The description of the present invention has been presented for purposes of illustration and explanation, and is not intended to be exhaustive or to limit the invention to the forms described above. Many modifications and variations will be apparent to those of ordinary skill in the art.

Accordingly, the above embodiments were chosen and described in order to explain the principles and practical applications of the present invention more clearly, and to help those of ordinary skill in the art understand that all changes and modifications made without departing from the spirit of the present invention fall within the scope of protection defined by the appended claims.

Furthermore, some or all of the above embodiments may also be described as in the following appendices, but are not limited thereto.

(Appendix 1)
A method for determining a category of an unregistered word, comprising:
selecting synonyms of the unregistered word from a dictionary based on a word composition rule;
generating a context of the unregistered word from a corpus; and
determining the category of the unregistered word according to the context of the unregistered word and the synonyms.

(Appendix 2)
The method for determining a category of an unregistered word according to Appendix 1, wherein the word composition rule includes word components, component attributes, and component relations.

(Appendix 3)
The method for determining a category of an unregistered word according to Appendix 2, wherein the step of selecting synonyms of the unregistered word from the dictionary based on the word composition rule includes selecting, from the dictionary, words that share one or more word components with the unregistered word as the synonyms of the unregistered word.

(Appendix 4)
The method for determining a category of an unregistered word according to Appendix 2, wherein the step of selecting synonyms of the unregistered word from the dictionary based on the word composition rule includes:
determining the part of speech of the unregistered word;
selecting, from the dictionary, words that share one or more word components with the unregistered word; and
selecting, from all the selected words, words having the same part of speech as the unregistered word as the synonyms of the unregistered word.

(Appendix 5)
The method for determining a category of an unregistered word according to Appendix 1, wherein the step of generating the context of the unregistered word from the corpus includes:
searching the corpus for the unregistered word;
extracting the characters adjacent to the unregistered word by applying a window method;
performing word segmentation on the extracted characters adjacent to the unregistered word; and
determining a weight for each word obtained by the word segmentation, and using the words obtained by the word segmentation together with their weights as the context of the unregistered word.

(Appendix 6)
The method for determining a category of an unregistered word according to Appendix 1, wherein the step of generating the context of the unregistered word from the corpus includes:
searching the corpus for the unregistered word; and
analyzing the dependency relations of the unregistered word by a dependency tree method so as to use the dependency relations as the context of the unregistered word.

(Appendix 7)
The method for determining a category of an unregistered word according to Appendix 1, wherein the step of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonyms includes:
compiling statistics on the categories to which the synonyms belong;
generating, from the corpus, the contexts of all words included in each category as the context of that category;
calculating the similarity between the context of the unregistered word and the context of each category; and
determining the category having the maximum similarity as the category to which the unregistered word belongs.

(Appendix 8)
The method for determining a category of an unregistered word according to Appendix 1, wherein the step of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonyms includes:
generating contexts of the synonyms from the corpus;
calculating the similarities between the context of the unregistered word and the contexts of the synonyms;
extracting a set from the synonyms based on the calculated similarities;
summing the similarities corresponding to synonyms that belong to the same category in the extracted set; and
determining the category to which the unregistered word belongs based on the summed similarities.

(Appendix 9)
The method for determining a category of an unregistered word according to Appendix 1, wherein the step of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonyms includes:
generating contexts of the synonyms from the corpus;
calculating the similarities between the context of the unregistered word and the contexts of the synonyms;
compiling statistics on the categories to which the synonyms belong;
inputting predetermined weighting factors associated with the synonyms;
weighting the similarities corresponding to the associated synonyms based on the input predetermined weighting factors;
extracting a set from the synonyms based on the weighted similarities;
summing the weighted similarities corresponding to synonyms that belong to the same category in the extracted set; and
determining the category to which the unregistered word belongs based on the summed similarities.

(Appendix 10)
The method for determining a category of an unregistered word according to Appendix 9, wherein the setting of the predetermined weighting factor satisfies the following policy:
if the unregistered word and a word in a category share the last character and the second-to-last character, the predetermined weighting factor associated with the category is set to λ_1;
otherwise, if the unregistered word and the word in the category share the last character and the first character, the predetermined weighting factor associated with the category is set to λ_2;
otherwise, if the unregistered word and the word in the category share the first character or the last character, the predetermined weighting factor associated with the category is set to λ_3;
otherwise, the predetermined weighting factor associated with the category is set to λ_4;
where λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4.

(Appendix 11)
The method for determining a category of an unregistered word according to Appendix 8 or Appendix 9, wherein the step of extracting a set from the synonyms based on the similarities includes:
sorting the similarities in descending order; and
extracting into the set, in order from the top, the synonyms corresponding to a predetermined number of similarities.

(Appendix 12)
A device for determining a category of an unregistered word, comprising:
a synonym selection unit that selects synonyms of the unregistered word from a dictionary based on a word composition rule;
a context generation unit that generates a context of the unregistered word from a corpus; and
a category determination unit that determines the category of the unregistered word according to the context of the unregistered word and the synonyms.

(Appendix 13)
The device for determining a category of an unregistered word according to Appendix 12, wherein the word composition rule includes word components, component attributes, and component relations.

(Appendix 14)
The device for determining a category of an unregistered word according to Appendix 13, wherein the synonym selection unit includes means for selecting, from the dictionary, words that share one or more word components with the unregistered word as the synonyms of the unregistered word.

(Appendix 15)
The device for determining a category of an unregistered word according to Appendix 13, wherein the synonym selection unit includes:
means for determining the part of speech of the unregistered word;
means for selecting, from the dictionary, words that share one or more word components with the unregistered word; and
means for selecting, from all the selected words, words having the same part of speech as the unregistered word as the synonyms of the unregistered word.

(Appendix 16)
The device for determining a category of an unregistered word according to Appendix 12, wherein the context generation unit includes:
means for searching the corpus for the unregistered word;
means for extracting the characters adjacent to the unregistered word by applying a window method;
means for performing word segmentation on the extracted characters adjacent to the unregistered word; and
means for determining a weight for each word obtained by the word segmentation, and using the words obtained by the word segmentation together with their weights as the context of the unregistered word.

(Appendix 17)
The device for determining a category of an unregistered word according to Appendix 12, wherein the context generation unit includes:
means for searching the corpus for the unregistered word; and
means for analyzing the dependency relations of the unregistered word by a dependency tree method so as to use the dependency relations as the context of the unregistered word.

(Appendix 18)
The device for determining a category of an unregistered word according to Appendix 12, wherein the category determination unit includes:
means for compiling statistics on the categories to which the synonyms belong;
means for generating, from the corpus, the contexts of all words included in each category as the context of that category;
means for calculating the similarity between the context of the unregistered word and the context of each category; and
means for determining the category having the maximum similarity as the category to which the unregistered word belongs.

(Appendix 19)
The device for determining a category of an unregistered word according to Appendix 12, wherein the context generation unit includes means for generating contexts of the synonyms from the corpus, and
the category determination unit includes:
means for calculating the similarities between the context of the unregistered word and the contexts of the synonyms;
means for extracting a set from the synonyms based on the calculated similarities;
means for summing the similarities corresponding to synonyms that belong to the same category in the extracted set; and
means for determining the category to which the unregistered word belongs based on the summed similarities.

(Appendix 20)
The device for determining a category of an unregistered word according to Appendix 12, wherein the context generation unit includes means for generating contexts of the synonyms from the corpus, and
the category determination unit includes:
means for calculating the similarities between the context of the unregistered word and the contexts of the synonyms;
means for compiling statistics on the categories to which the synonyms belong;
means for inputting predetermined weighting factors associated with the synonyms;
means for weighting the similarities corresponding to the associated synonyms based on the input predetermined weighting factors;
means for extracting a set from the synonyms based on the weighted similarities;
means for summing the weighted similarities corresponding to synonyms that belong to the same category in the extracted set; and
means for determining the category to which the unregistered word belongs based on the summed similarities.

(Appendix 21)
The device for determining a category of an unregistered word according to Appendix 20, wherein the setting of the predetermined weighting factor satisfies the following policy:
if the unregistered word and a word in a category share the last character and the second-to-last character, the predetermined weighting factor associated with the category is set to λ_1;
otherwise, if the unregistered word and the word in the category share the last character and the first character, the predetermined weighting factor associated with the category is set to λ_2;
otherwise, if the unregistered word and the word in the category share the first character or the last character, the predetermined weighting factor associated with the category is set to λ_3;
otherwise, the predetermined weighting factor associated with the category is set to λ_4;
where λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4.

(Appendix 22)
The device for determining a category of an unregistered word according to Appendix 19 or Appendix 20, wherein the means for extracting a set from the synonyms based on the similarities includes:
means for sorting the similarities in descending order; and
means for extracting into the set, in order from the top, the synonyms corresponding to a predetermined number of similarities.

110: Synonym selection unit
120: Context generation unit
130: Category determination unit

Claims

A method for determining a category of an unregistered word, comprising:
selecting synonyms of the unregistered word from a dictionary based on a word composition rule;
generating a context of the unregistered word from a corpus; and
determining the category of the unregistered word according to the context of the unregistered word and the synonyms.

The method for determining a category of an unregistered word according to claim 1, wherein the word composition rule includes word components, component attributes, and component relations.

The method for determining a category of an unregistered word according to claim 2, wherein the step of selecting synonyms of the unregistered word from the dictionary based on the word composition rule includes selecting, from the dictionary, words that share one or more word components with the unregistered word as the synonyms of the unregistered word.

The method for determining a category of an unregistered word according to claim 2, wherein the step of selecting synonyms of the unregistered word from the dictionary based on the word composition rule includes:
determining the part of speech of the unregistered word;
selecting, from the dictionary, words that share one or more word components with the unregistered word; and
selecting, from all the selected words, words having the same part of speech as the unregistered word as the synonyms of the unregistered word.

The method for determining a category of an unregistered word according to claim 1, wherein the step of generating the context of the unregistered word from the corpus includes:
searching the corpus for the unregistered word;
extracting the characters adjacent to the unregistered word by applying a window method;
performing word segmentation on the extracted characters adjacent to the unregistered word; and
determining a weight for each word obtained by the word segmentation, and using the words obtained by the word segmentation together with their weights as the context of the unregistered word.

The method for determining a category of an unregistered word according to claim 1, wherein the step of generating the context of the unregistered word from the corpus includes:
searching the corpus for the unregistered word; and
analyzing the dependency relations of the unregistered word by a dependency tree method so as to use the dependency relations as the context of the unregistered word.

The method for determining a category of an unregistered word according to claim 1, wherein the step of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonyms includes:
compiling statistics on the categories to which the synonyms belong;
generating, from the corpus, the contexts of all words included in each category as the context of that category;
calculating the similarity between the context of the unregistered word and the context of each category; and
determining the category having the maximum similarity as the category to which the unregistered word belongs.

The method for determining a category of an unregistered word according to claim 1, wherein the step of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonyms includes:
generating contexts of the synonyms from the corpus;
calculating the similarities between the context of the unregistered word and the contexts of the synonyms;
extracting a set from the synonyms based on the calculated similarities;
summing the similarities corresponding to synonyms that belong to the same category in the extracted set; and
determining the category to which the unregistered word belongs based on the summed similarities.

The method for determining a category of an unregistered word according to claim 1, wherein the step of determining the category to which the unregistered word belongs based on the context of the unregistered word and the synonyms includes:
generating contexts of the synonyms from the corpus;
calculating the similarities between the context of the unregistered word and the contexts of the synonyms;
compiling statistics on the categories to which the synonyms belong;
inputting predetermined weighting factors associated with the synonyms;
weighting the similarities corresponding to the associated synonyms based on the input predetermined weighting factors;
extracting a set from the synonyms based on the weighted similarities;
summing the weighted similarities corresponding to synonyms that belong to the same category in the extracted set; and
determining the category to which the unregistered word belongs based on the summed similarities.

A device for determining a category of an unregistered word, comprising:
a synonym selection unit that selects synonyms of the unregistered word from a dictionary based on a word composition rule;
a context generation unit that generates a context of the unregistered word from a corpus; and
a category determination unit that determines the category of the unregistered word according to the context of the unregistered word and the synonyms.