JP6612505B2 - Word segmentation processing system, program, and word segmentation processing method

Info

Publication number: JP6612505B2
Application number: JP2015023430A
Authority: JP
Inventors: 遼藤井
Original assignee: Spontena
Current assignee: Spontena
Priority date: 2014-11-19
Filing date: 2015-02-09
Publication date: 2019-11-27
Anticipated expiration: 2035-02-09
Also published as: JP2016105263A

Description

The present invention relates to a system, a program, and a method for segmenting an input character string into words (word-separated writing).

Conventionally, techniques for determining how to divide a character string by morphological analysis are known. Examples include an unsupervised morphological analysis technique using a Bayesian hierarchical language model (see Patent Document 1), a supervised morphological analysis technique using conditional random fields (see Patent Document 2), and a semi-supervised morphological analysis technique using both conditional random fields and a Bayesian hierarchical language model (see Patent Document 3).

Daichi Mochihashi and two others, "Unsupervised morphological analysis using a Bayesian hierarchical language model", [online], 2009, [searched November 7, 2014], Internet (URL: http://www.ism.ac.jp/~daichi/paper/nl190segment.pdf)
Taku Kudo and two others, "Japanese morphological analysis using Conditional Random Fields", [online], May 14, 2004, Information Processing Society of Japan, [searched November 7, 2014], Internet (URL: http://ci.nii.ac.jp/naid/110002911717)
Daichi Mochihashi and two others, "Semi-supervised morphological analysis by integrating conditional random fields and a Bayesian hierarchical language model", [online], 2011, [searched November 7, 2014], Internet (URL: http://www.ism.ac.jp/~daichi/paper/nlp2011semiseg.pdf)

The present inventors are considering building a system, usable from a plurality of devices over a network, that segments a character string input from a device and returns the segmented character string to that device.

However, if the system administrator prepares the training data for the language model (probability model) used for segmentation and the language model is trained on that data, the model cannot be trained quickly enough to follow changes in everyday language and the appearance of new words. It is then difficult to segment appropriately the character strings input from the devices that users use.

Accordingly, in one aspect of the present invention, it is desirable to provide a system, a program, and a method that can learn the regularities of word segmentation while responding quickly to changes in language and the appearance of new words, and that can segment an input character string appropriately.

A word segmentation processing system according to one aspect of the present invention includes a segmentation unit and a learning unit. The segmentation unit segments an input character string based on a probability model representing the regularities of word segmentation, and outputs the result. Each time a character string is input to the segmentation unit, the learning unit trains the probability model based on that character string. The learning unit may be configured to train the probability model by an unsupervised learning method or a semi-supervised learning method.

According to this word segmentation processing system, the character strings input for segmentation are also used to train the probability model, so the probability model can be trained while responding quickly to changes in language and the appearance of new words. Therefore, according to one aspect of the present invention, a word segmentation processing system capable of segmenting character strings appropriately can be constructed.

The word segmentation processing system may be configured to include a probability model for each user. In this case, the segmentation unit may be configured to segment an input character string based on the probability model of the user who input it. The learning unit may be configured so that, each time a character string is input to the segmentation unit, it trains the probability model of the input source user based on that character string.
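The per-user arrangement described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class `PerUserModels`, the word-count `Counter` standing in for a probability model, and the placeholder `whitespace_segmenter` are all hypothetical names that do not appear in the patent.

```python
from collections import Counter, defaultdict


class PerUserModels:
    """Hypothetical registry that keeps one (toy) model per user."""

    def __init__(self, segmenter):
        # segmenter(text, model) -> list of words; assumed interface.
        self.segmenter = segmenter
        # A Counter of word counts stands in for each user's probability model.
        self.models = defaultdict(Counter)

    def segment(self, user_id, text):
        model = self.models[user_id]
        # Segment with the model of the user who sent the string ...
        words = self.segmenter(text, model)
        # ... and immediately learn from the same input.
        model.update(words)
        return words


def whitespace_segmenter(text, model):
    """Placeholder segmenter that ignores the model and splits on spaces."""
    return text.split()
```

For example, `PerUserModels(whitespace_segmenter).segment("alice", "new word")` updates only the model stored under `"alice"`, leaving every other user's model untouched.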

The words people use differ depending on their generation, region, hobbies, interests, and so on. Therefore, a word segmentation processing system according to one aspect of the present invention, which includes a probability model for each user and trains each model on the character strings input by the corresponding user, can provide the user who input a character string with a more appropriate segmentation result.

In another aspect of the present invention, the word segmentation processing system may include, in addition to the segmentation unit described above, the following learning unit and registration unit. The learning unit in this aspect is configured to repeatedly execute a first learning process that trains the probability model based on a group of training character strings and, in parallel with the first learning process, to execute a second learning process that trains the probability model based on the segmented character string output from the segmentation unit. The registration unit is configured to register the character string input to the segmentation unit as a training character string.

The second learning process may be configured to train the probability model by updating it, each time a character string is input to the segmentation unit, based on the segmented character string. The second learning process may be executed so as to update the probability model independently of the first learning process. The first learning process may be configured to train the probability model by repeatedly selecting a processing target character string from the group of training character strings and updating the probability model based on the selected character string.

In the training of a known language model such as NPYLM (Nested Pitman-Yor Language Model), the probability model is trained by repeatedly selecting a processing target character string at random from a group of training character strings and updating the probability model. Consequently, even if a character string input to the segmentation unit is registered as a training character string, it takes time before the probability model has learned from that character string.

For a character string input to the segmentation unit, on the other hand, the probability model can be trained based on the segmented character string produced by the segmentation unit. Therefore, a word segmentation processing system according to one aspect of the present invention that executes the first learning process and the second learning process in parallel can quickly train the probability model on the character strings input to the segmentation unit.

According to one aspect of the present invention, the character strings input to the segmentation unit may be registered as training character strings each time an execution of the repeatedly executed first learning process ends. In this case, executing the second learning process independently of the first learning process is particularly effective for quickly training the probability model on the character strings input to the segmentation unit.

According to one aspect of the present invention, the word segmentation processing system may also include a probability model comprising a generative model and a discriminative model. The generative model can be trained on unsupervised data, and the discriminative model can be trained on supervised data. One example of the generative model is a Bayesian hierarchical language model such as NPYLM. One example of the discriminative model is a language model based on conditional random fields (CRF: Conditional Random Fields).

According to one aspect of the present invention, the first learning process and the second learning process may be configured to train a probability model comprising a generative model and a discriminative model as follows. The first learning process may use at least part of the group of training character strings as unsupervised data and update the generative model based on the unsupervised data and the discriminative model. The first learning process may also use, as supervised data, at least part of those training character strings that include teacher information, and update the discriminative model based on the supervised data and the generative model.

In the first learning process, the generative model may be updated based on, for example, the average (including a weighted average) of the conditional probability represented by the generative model and the conditional probability calculated from the discriminative model. More specifically, the first learning process may train the generative model by repeatedly selecting a processing target character string from the group of training character strings and updating the generative model based on the segmentation of that character string obtained from the averaged conditional probability.
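As a minimal sketch of the averaging step, the combination of the two conditional probabilities might look like the following. The function name `mixed_probability` and the parameter `weight` are assumptions introduced for illustration; the text does not fix how the weights are chosen at this point.

```python
def mixed_probability(p_generative, p_discriminative, weight=0.5):
    """Weighted average of two conditional probabilities for the same event.

    weight is the share given to the generative model (hypothetical parameter);
    weight=0.5 gives the plain arithmetic mean.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_generative + (1.0 - weight) * p_discriminative
```

Because both inputs are probabilities and the weights sum to one, the result also lies between 0 and 1, so it can be used wherever a single conditional probability is expected.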

In addition, the first learning process may train the discriminative model by updating it based on a feature vector whose elements are one or more features based on the supervised data and one or more features based on the generative model.

As an example, the discriminative model may be updated based on the feature vector by calculating the weight vector whose inner product with the feature vector is maximal (or minimal). In this case, the weights of the weighted average mentioned above may be determined from the values of those elements of the weight vector that correspond to the one or more features based on the generative model.

In addition, the second learning process may be configured to update the generative model based on the segmented character string. The generative model can be updated, for example, by updating the conditional probabilities it represents from the segmented character string.

With such a word segmentation processing system according to one aspect of the present invention, semi-supervised training of the probability model can be performed appropriately. Furthermore, the probability model can be trained quickly on the character strings input to the segmentation unit.

In addition, in a word segmentation processing system according to one aspect of the present invention that is configured to include a probability model for each user, the registration unit may be configured to register the character string input to the segmentation unit as a training character string for the probability model of the input source user.

The learning unit may then be configured to execute, for each user and as the first learning process, a process of training that user's probability model based on the group of training character strings for that user's probability model. The learning unit may be configured to execute, as the second learning process, a process of training the probability model of the input source user based on the segmented character string each time a character string is input to the segmentation unit.

According to one aspect of the present invention, the word segmentation processing system may include, in addition to the probability model for each user, a common probability model shared by all users. In this case, the first learning process that is repeatedly executed for each user may be configured to train the user's probability model based on the group of training character strings for that user's probability model and on the common probability model. Such a word segmentation processing system can segment and output the character strings input by a user appropriately from the very beginning of the user's use of the system.

In one aspect of the present invention, the word segmentation processing system may include, for each user, a probability model comprising a generative model and a discriminative model, and may additionally include the common probability model.

In this case, the learning unit may be configured to execute, for each user and as the first learning process, a process that uses at least part of the group of training character strings for the user's probability model as unsupervised data and updates the user's generative model based on the unsupervised data, the user's discriminative model, and the common probability model. The learning unit may further be configured to execute a process that uses, as supervised data, at least part of those training character strings for the user's probability model that include teacher information, and updates the user's discriminative model based on the supervised data, the user's generative model, and the common probability model.

The learning unit may also be configured to execute, as the second learning process, a process of updating the generative model of the input source user based on the segmented character string each time a character string is input to the segmentation unit.

For example, in the first learning process, the user's generative model may be updated based on the average (including a weighted average) of the conditional probability represented by the user's generative model, the conditional probability calculated from the user's discriminative model, and the conditional probability specified from the common probability model. More specifically, the first learning process may train the user's generative model by repeatedly updating it based on the segmentation of the processing target character string obtained from this averaged conditional probability.

In addition, the first learning process may train the user's discriminative model by updating it based on a feature vector whose elements are one or more features based on the user's supervised data, one or more features based on the user's generative model, and one or more features based on the common probability model. With a word segmentation processing system according to one aspect of the present invention in which the learning unit is configured in this way, the probability model can be trained so that the character strings input by the user are segmented even more appropriately.

According to one aspect of the present invention, some or all of the functions of the segmentation unit, the learning unit, and the registration unit described above may be realized by hardware circuits. According to another aspect of the present invention, some or all of the functions of the segmentation unit, the learning unit, and the registration unit may be realized on a computer by a program.

A program for causing a computer to function as at least part of the segmentation unit, the learning unit, and the registration unit can be provided recorded on a computer-readable non-transitory recording medium.

According to one aspect of the present invention, a word segmentation processing method may also be provided that includes, as steps executed by a computer, a step of segmenting an input character string based on a probability model representing the regularities of word segmentation and outputting the result, and a step of training the probability model, each time a character string is input, based on that character string.

According to another aspect of the present invention, a word segmentation processing method may be provided that includes, as steps executed by a computer: a step of segmenting an input character string based on a probability model representing the regularities of word segmentation and outputting the result; a step of training the probability model, in which a first learning process that trains the probability model based on a group of training character strings is repeatedly executed while a second learning process that trains the probability model based on the segmented character string corresponding to the input character string is executed in parallel with the first learning process; and a step of registering the input character string as a training character string. In these methods, the probability model may be trained by an unsupervised learning method or a semi-supervised learning method.

A block diagram showing the configuration of the word segmentation processing system.
A block diagram showing the functions realized by the control device.
A block diagram showing the details of each unit.
A block diagram showing the functions realized by the control device of the second embodiment.
A block diagram showing the details of the CRF learning unit of the second embodiment.
A block diagram showing the details of the NPYLM learning unit of the second embodiment.
A block diagram showing the functions realized by the control device of the third embodiment.
A block diagram showing the details of the individual processing unit of the third embodiment.
A block diagram showing the details of the CRF learning unit of the third embodiment.
A block diagram showing the details of the NPYLM learning unit of the third embodiment.
A block diagram showing the details of the CRF learning unit of the fourth embodiment.
A block diagram showing the details of the CRF learning unit of the fifth embodiment.

Embodiments of the present invention will be described below with reference to the drawings.
[First embodiment]
As shown in FIG. 1, the word segmentation processing system 1 of this embodiment has the function of converting a segmentation target character string D1, input from a user device 5 through a network, into a segmented character string D2 and returning it to the user device 5. This function is provided to the user device 5 as, for example, a function realized by a web API.

Specifically, the word segmentation processing system 1 includes a control device 10, a storage device 20, and a communication device 30. The communication device 30 is connected to the network and is arranged so that it can communicate with a plurality of user devices 5. The control device 10 includes a CPU 10A that executes various programs stored in the storage device 20, and the CPU 10A realizes various functions by executing these programs. Although not shown, the control device 10 further includes a working memory (RAM). In the following, the processes, operations, and functions realized by the CPU 10A executing the programs are described, for convenience, as processes, operations, and functions realized by the control device 10.

The storage device 20 stores the various programs as well as the various data needed when the programs are executed. The storage device 20 includes a non-transitory recording medium readable by the control device 10 (CPU 10A), for example a hard disk drive.

By executing the programs, the control device 10 functions as a segmentation unit 110, a first learning unit 140, a second learning unit 160, and a registration unit 180, as shown in FIG. 2. By reading data from the storage device 20, the control device 10 additionally holds a language model LM1 and a training character string group LD1.

The segmentation unit 110 is configured to segment the segmentation target character string D1, which the communication device 30 has received from the user device 5, based on the language model LM1, and to return the resulting character string D2 (hereinafter, the segmented character string D2) to the user device 5 through the communication device 30. The language model LM1 may be a Bayesian hierarchical language model based on NPYLM. NPYLM is a type of probability model that probabilistically expresses the regularities in sequences of character sets by the conditional probability that one character set appears given another, and is thus a type of probability model representing the regularities of word segmentation. A character set here means a sequence of one or more characters. NPYLM is a type of generative model.

That is, the segmentation unit 110 is configured to determine the most likely division pattern for the segmentation target character string D1 based on the conditional probabilities specified from the language model LM1, and to output the segmented character string D2 that follows the determined division pattern. The segmentation unit 110 may be configured to perform the segmentation according to the known Viterbi algorithm.
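The Viterbi-style search for the most likely division pattern can be sketched as follows. This is a simplified illustration, not the configuration of the segmentation unit 110: it assumes a plain unigram word-probability table instead of the conditional probabilities of NPYLM, and `viterbi_segment`, `word_prob`, `max_len`, and `unknown_prob` are hypothetical names.

```python
import math


def viterbi_segment(text, word_prob, max_len=4, unknown_prob=1e-6):
    """Return the highest-probability segmentation of text.

    word_prob maps a word to its probability (a unigram simplification).
    Unknown words get unknown_prob ** len(word), an ad hoc penalty.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            p = word_prob.get(word, unknown_prob ** len(word))
            score = best[start] + math.log(p)
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Follow the back pointers from the end of the string.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]
```

With a table such as `{"東京": 0.3, "都": 0.2, "に": 0.3, "住む": 0.2}`, calling `viterbi_segment("東京都に住む", table)` recovers the four known words, because every alternative path has to pass through a heavily penalized unknown word.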

The first learning unit 140 is configured to repeatedly execute a first learning process that trains the language model LM1 based on the given training character string group LD1. The second learning unit 160 is configured to execute, in parallel with the first learning process, a second learning process that trains the language model LM1 based on the segmented character string D2 output from the segmentation unit 110.

The registration unit 180 is configured to register the segmentation target character string D1, input from the user device 5 to the segmentation unit 110, in the training character string group LD1 as a training character string. The training character string group LD1 thus consists of the character strings provided by the system administrator as initial training data and the segmentation target character strings D1 input from the user devices 5.

Next, the configurations of the first learning unit 140, the second learning unit 160, and the registration unit 180 are described in detail with reference to FIG. 3. As shown in FIG. 3, the first learning unit 140 includes a selection unit 141, an extraction unit 143, an update unit 145, and an all-selection determination unit 147.

Like known NPYLM-based language learning, the first learning process repeatedly executed by the first learning unit 140 trains the language model by randomly selecting the character strings belonging to the given training character string group LD1 one at a time as the processing target character string and repeatedly updating the language model LM1 based on the processing target character string. One execution of the first learning process ends when every character string in the training character string group LD1 has been selected as the processing target character string and the language model LM1 has been updated accordingly.

The selection unit 141 is configured to randomly select one processing target character string from the training character string group LD1. The random selection is made from among the character strings that have not yet been selected as the processing target character string.

The extraction unit 143 is configured to remove the information on the processing target character string from the language model LM1, thereby replacing the language model LM1 with a language model that does not contain that information. Information on the character strings belonging to the training character string group LD1 is registered in the language model LM1 in advance. The character strings corresponding to the initial training data are registered in the language model LM1 before the word segmentation processing system 1 is made available to the user devices 5. A character string D1 input to the segmentation unit 110 is registered in the language model LM1 by the second learning unit 160 as information on the segmented character string D2.

When the processing target character string does not include teacher information, the update unit 145 segments the processing target character string based on the conditional probabilities specified from the replaced language model, and re-registers the resulting segmented character string in the language model LM1. In doing so, it updates the parameters of the language model LM1 (the parameters representing the conditional probabilities) according to the division pattern of the processing target character string.

The learning character string group LD1 includes character strings without teacher information (unsupervised data) and character strings with teacher information (supervised data). Teacher information here means information indicating the correct division pattern (segmentation) of a character string.

When the processing target character string includes teacher information, the update unit 145 re-registers in the language model LM1 the segmented character string that the teacher information specifies for the processing target character string, and in doing so updates the above parameters of the language model LM1.

The all-selection determination unit 147 determines whether every character string belonging to the learning character string group LD1 has been selected as the processing target character string. If so, it causes the registration unit 180 to register the character strings accumulated in the pool 183 in the learning character string group LD1, and it resets the selection history held by the selection unit 141 so that all character strings in the learning character string group LD1 are returned to the unselected state. It then instructs the selection unit 141 to select a new processing target character string from the learning character string group LD1.

If the learning character string group LD1 still contains a character string that has not been selected as the processing target character string, the all-selection determination unit 147 skips the above registration and reset operations and instructs the selection unit 141 to select a new processing target character string from the learning character string group LD1. By repeating this sequence of selection, extraction, update, and determination operations, the first learning unit 140 realizes the repeated execution of the first learning process.
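
The selection, extraction, update, and determination sequence described above can be sketched as follows. This is only an illustrative sketch: `ToyLanguageModel`, `first_learning_pass`, and the representation of the learning character string group as a list of `(text, teacher)` pairs are hypothetical names and choices, and `sample_segmentation` is a placeholder that does not implement NPYLM sampling.

```python
import random


class ToyLanguageModel:
    """Hypothetical stand-in for LM1: keeps the current segmentation of each string."""

    def __init__(self):
        self.segmentations = {}

    def remove(self, text):
        # Extraction: drop the information on this string from the model.
        self.segmentations.pop(text, None)

    def add(self, text, words):
        # Update: (re-)register the segmented string.
        self.segmentations[text] = list(words)

    def sample_segmentation(self, text):
        # Placeholder for choosing a division pattern from the model's
        # conditional probabilities; here every character becomes one word.
        return list(text)


def first_learning_pass(model, training_set, rng):
    """One pass: each (text, teacher) pair is selected exactly once, in random order."""
    unselected = list(training_set)
    rng.shuffle(unselected)
    while unselected:                      # determination: anything left unselected?
        text, teacher = unselected.pop()   # selection among unselected strings
        model.remove(text)                 # extraction
        if teacher is not None:
            words = teacher                # supervised: use the given division pattern
        else:
            words = model.sample_segmentation(text)
        model.add(text, words)             # update
```

Calling `first_learning_pass` again corresponds to starting the next pass with the selection history reset.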

The registration unit 180, for its part, includes a reception unit 181, a pool 183, and an addition unit 185. Each time a segmentation target character string D1 is input to the segmentation unit 110, the reception unit 181 stores that character string D1 (a character string without teacher information) in the pool 183. Each time a pass of the first learning process ends in the first learning unit 140, the addition unit 185 takes each character string D1 accumulated so far out of the pool 183 and registers it in the learning character string group LD1.

In this way, at the end of each first learning process, the registration unit 180 updates the learning character string group LD1 so that the character strings D1 input during that first learning process are used for learning the language model LM1 in the next first learning process.
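
The pooling behaviour of the registration unit can be illustrated with a small sketch. The class and method names (`RegistrationUnit`, `accept`, `flush`) are hypothetical, and the lock is an assumption made here because inputs may arrive while a learning pass is running; the source does not describe a synchronization mechanism.

```python
import threading


class RegistrationUnit:
    """Hypothetical sketch of the reception unit, pool, and addition unit."""

    def __init__(self, training_set):
        self.training_set = training_set   # stand-in for LD1: list of (text, teacher)
        self._pool = []                    # stand-in for the pool 183
        self._lock = threading.Lock()      # assumption: guards concurrent input

    def accept(self, text):
        """Reception: buffer an input string D1, which has no teacher information."""
        with self._lock:
            self._pool.append(text)

    def flush(self):
        """Addition: call at the end of each learning pass; returns how many strings moved."""
        with self._lock:
            pending, self._pool = self._pool, []
        self.training_set.extend((text, None) for text in pending)
        return len(pending)
```

Strings accepted during one pass stay out of the learning set until `flush` is called, so they first take part in the following pass.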

In addition, the second learning unit 160 executes the second learning process in parallel with, and independently of, the first learning process. Each time a segmentation target character string D1 is input to the segmentation unit 110, it trains the language model LM1 on the resulting segmented character string D2.

Specifically, like the update unit 145, the second learning unit 160 registers the segmented character string D2 obtained from the segmentation unit 110 in the language model LM1, and in doing so updates the parameters of the language model LM1 (the parameters representing the conditional probabilities) according to the division pattern of the segmented character string D2.

Thus, in this embodiment, learning of the language model LM1 based on a character string D1 input to the segmentation unit 110 is executed by the second learning unit 160 immediately, as soon as the segmented character string D2 is obtained, without depending on the execution cycle of the first learning process. As a result, the language model LM1 is trained efficiently and quickly on the character strings D1 input to the segmentation unit 110.

The segmentation processing system 1 of this embodiment has been described above. According to this embodiment, the segmentation target character strings D1 are used for learning the language model LM1, so the language model LM1 can be trained to respond quickly to changes in language and the appearance of new words, and a segmentation processing system 1 that can appropriately segment the character strings D1 from the user device 5 can be constructed. In particular, because the language model LM1 is updated promptly each time a character string D1 is input to the segmentation unit 110, a high-performance segmentation processing system 1 can be provided to the user device 5.

Although the learning character string group LD1 of this embodiment includes both unsupervised data and supervised data, it may consist of unsupervised data only. Furthermore, the language model LM1 is not limited to NPYLM; another language model having characteristics of the same kind as, or similar to, those of NPYLM may be adopted as the language model LM1.

[Second Embodiment]
Next, the segmentation processing system 1 of the second embodiment is described. It differs from the segmentation processing system 1 of the first embodiment in the functions realized by the control device 10. Its hardware configuration, on the other hand, is identical to that of the segmentation processing system 1 shown in FIG. 1. The following description of the second embodiment therefore focuses on the functions realized by the control device 10.

By executing a program, the control device 10 of this embodiment functions as a segmentation unit 210, a CRF learning unit 220, an NPYLM learning unit 230, and a registration unit 280, as shown in FIG. 4. The control device 10 holds a learning character string group LD2, which is obtained by reading data from the storage device 20. The control device 10 further holds an identification model DM2 in the CRF learning unit 220 and a language model LM2 in the NPYLM learning unit 230.

Like the segmentation unit 110, the segmentation unit 210 is configured to segment the segmentation target character string D1 received by the communication device 30 and to return the segmented character string D2 to the user device 5 through the communication device 30. The segmentation unit 210 is configured to segment the segmentation target character string D1 using at least one of the identification model DM2 and the language model LM2.

The CRF learning unit 220 is configured to learn the identification model DM2 based on the learning character string group LD2. The identification model DM2 is an identification (discriminative) model based on a conditional random field (CRF). It includes identification scores S and Sn, which are learned from the supervised data and the language model LM2. That is, the CRF learning unit 220 learns the identification model DM2 by learning the identification scores S and Sn based on the language model LM2 and on those learning character strings in the learning character string group LD2 that include teacher information. Like the NPYLM-based language model LM2, the identification model DM2 is a kind of probabilistic model representing the regularities of segmentation.

The NPYLM learning unit 230 is configured to learn the language model LM2 based on the learning character string group LD2 and the identification model DM2. As in the first embodiment, the language model LM2 is a Bayesian hierarchical language model based on NPYLM.

Like the registration unit 180, the registration unit 280 registers the segmentation target character strings D1 input to the segmentation unit 210 in the learning character string group LD2 as learning character strings. As in the first embodiment, the learning character string group LD2 includes the character strings provided by the system administrator as initial learning data and the segmentation target character strings D1 input from the user device 5.

Next, the configurations of the CRF learning unit 220, the NPYLM learning unit 230, and the registration unit 280 are described in detail with reference to FIGS. 5 and 6. As shown in FIG. 5, the CRF learning unit 220 includes a character string capturing unit 221, a feature vector generation unit 222, a conditional probability acquisition unit 223, an NPYLM feature generation unit 224, an integrated feature vector generation unit 227, an update unit 228, an output unit 229, and the identification model DM2.

The CRF learning unit 220 can be configured to update the identification model DM2 based on the supervised data and the language model LM2, using the same approach as the known NPYCRF technique (semi-supervised morphological analysis that integrates a conditional random field with a Bayesian hierarchical language model).

The character string capturing unit 221 captures, from the learning character string group LD2, the character strings that include teacher information as supervised data. The feature vector generation unit 222 generates feature vectors F based on this supervised data. A feature vector F is generated for each character string.

The conditional probability acquisition unit 223 acquires from the NPYLM learning unit 230 the conditional probability P, specified from the language model LM2, of the character string corresponding to each feature vector F. The NPYLM feature generation unit 224 converts the acquired conditional probability P into a feature fn to be added to the feature vector F. This feature fn is referred to below as the NPYLM feature. The probability-to-feature conversion can be realized by a computation according to the known NPYCRF technique, or by an alternative computation, such as one that replaces the computation of the known technique with an approximation.

The integrated feature vector generation unit 227 appends the corresponding NPYLM feature fn to the feature vector F as a new element, generating the integrated feature vector FE = (F, fn). An integrated feature vector FE is generated for each character string. When the NPYLM feature fn is a scalar and the feature vector F is M-dimensional, the integrated feature vector FE is (M+1)-dimensional.

For the joint probability Pe corresponding to the inner product F·S + fn·Sn of the integrated feature vector FE and the identification scores (S, Sn), the update unit 228 calculates the identification scores (S, Sn) that maximize the average of the per-string joint probabilities Pe. It then updates the identification model DM2 by replacing the identification scores S and Sn held by the identification model DM2 with the calculated values. Here, the identification score S is the score for the feature vector F, and the identification score Sn is the score for the NPYLM feature fn. The initial value of the identification score Sn is zero; the initial value of the identification score S is set by the designer.
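
As a rough illustration of the integrated feature vector FE = (F, fn) and the inner product F·S + fn·Sn, consider the sketch below. The function names are hypothetical, and taking fn as the logarithm of the conditional probability P is an assumption made only for this example; the text leaves the conversion to the known NPYCRF technique or an approximation of it. The mapping from the inner product to the joint probability Pe is not shown.

```python
import math


def npylm_feature(p):
    """Assumed conversion (illustration only): fn = log P."""
    return math.log(p)


def integrated_feature_vector(f, fn):
    """Append the scalar NPYLM feature fn to the M-dimensional vector F."""
    return list(f) + [fn]


def inner_product_score(f, fn, s, sn):
    """Compute F.S + fn.Sn as the inner product of FE with the scores (S, Sn)."""
    fe = integrated_feature_vector(f, fn)
    weights = list(s) + [sn]
    return sum(x * w for x, w in zip(fe, weights))
```

With Sn at its initial value of zero, the NPYLM feature has no effect on the inner product, so the score reduces to F·S.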

The output unit 229 is configured to output the identification scores S and Sn and the joint probabilities Pe and Pj in response to requests from the NPYLM learning unit 230 and the segmentation unit 210. The joint probability Pj corresponds to the inner product F·S of the feature vector F and the identification score S; it is the joint probability, over the character boundaries in the character string corresponding to the feature vector F, that each boundary is or is not a break. In other words, the joint probability Pj represents the probability of occurrence of the corresponding character string and division pattern.

The NPYLM learning unit 230, as shown in FIG. 6, includes a first learning unit 240, a second learning unit 260, an output unit 270, and the language model LM2. The first learning unit 240 includes a selection unit 241, an extraction unit 243, an update unit 245, an all-selection determination unit 247, a weighted average calculation unit 250, an NPYLM feature score input unit 251, and a conversion unit 253. The first learning unit 240 can be configured to learn the language model LM2 based on the learning character string group LD2 and the identification model DM2, using the same approach as the known NPYCRF technique.

The selection unit 241, the extraction unit 243, and the all-selection determination unit 247 are configured in the same way as the selection unit 141, the extraction unit 143, and the all-selection determination unit 147 of the first embodiment. The update unit 245 is also configured basically like the update unit 145 of the first embodiment. However, instead of using the conditional probability P specified from the language model LM2, the update unit 245 segments the processing target character string based on a conditional probability Pa, which is the conditional probability P corrected using the identification score Sn and the joint probability Pj provided by the CRF learning unit 220. The conditional probability Pa is provided by the weighted average calculation unit 250.

That is, when the processing target character string does not include teacher information, the update unit 245 segments it based on the conditional probability Pa from the weighted average calculation unit 250, which corresponds to the conditional probability P specified from the language model LM2 from which the information on the processing target character string has been removed. It then registers the segmented character string in the language model LM2 and updates the parameters of the language model LM2.

The weighted average calculation unit 250 calculates the conditional probability Pa = (P + Sn·P') / (1 + Sn), the weighted average of the conditional probabilities P and P', from the conditional probability P specified from the language model LM2, the identification score Sn acquired from the CRF learning unit 220 through the NPYLM feature score input unit 251, and the conditional probability P' input from the conversion unit 253.
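
The weighted average Pa = (P + Sn·P') / (1 + Sn) can be written directly as a small function. The function name is hypothetical, and rejecting a negative Sn is an assumption of this sketch; the text does not say how such a value would be handled.

```python
def corrected_conditional_probability(p, p_prime, sn):
    """Return Pa = (P + Sn * P') / (1 + Sn), the weighted average of P and P'."""
    if sn < 0:
        # Assumption of this sketch: a negative weight is not a valid average weight.
        raise ValueError("sn must be non-negative in this sketch")
    return (p + sn * p_prime) / (1.0 + sn)
```

When Sn is zero, its initial value, Pa equals P, so the language model's own probability is used unchanged; as Sn grows, Pa moves toward P'.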

The conversion unit 253 converts the joint probability Pj obtained from the CRF learning unit 220 into the conditional probability P'. The conditional probability P' is the conditional probability that corresponds to the joint probability Pj and is the counterpart of the conditional probability P of the language model LM2. Alternatively, the conditional probability P' may be calculated from the joint probability Pe obtained from the CRF learning unit 220.

The second learning unit 260 is configured in the same way as the second learning unit 160. In parallel with the first learning process executed by the first learning unit 240, it executes the second learning process, which trains the language model LM2 on the segmented character string D2 each time a segmentation target character string D1 is input. That is, for each input of a character string D1, the second learning unit 260 registers the segmented character string D2 obtained from the segmentation unit 210 in the language model LM2, and in doing so updates the parameters of the language model LM2 according to the division pattern of the segmented character string D2.

In response to requests from the CRF learning unit 220 and the segmentation unit 210, the output unit 270 specifies the requested conditional probabilities P and Pa based on the language model LM2 and outputs them to the requester.

The segmentation unit 210 may be configured, for example, to segment the segmentation target character string D1 based on the conditional probability P acquired from the NPYLM learning unit 230. Alternatively, it may be configured to segment the segmentation target character string D1 based on the conditional probability Pa acquired from the NPYLM learning unit 230.

Alternatively, the segmentation unit 210 may be configured to segment the segmentation target character string D1 based on the identification score S acquired from the CRF learning unit 220. For example, by searching for the feature vector F of the segmentation target character string D1 that maximizes the joint probability Pj corresponding to its inner product with the identification score S, the segmentation unit 210 can determine the most likely division pattern for the segmentation target character string D1 and segment it accordingly. The segmentation unit 210 may also be configured to segment the segmentation target character string D1 based on the identification scores S and Sn acquired from the CRF learning unit 220 and the conditional probability P acquired from the NPYLM learning unit 230.
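
To make the idea of choosing the most likely division pattern concrete, the following sketch searches all divisions of a string by dynamic programming. It is a deliberate simplification and not the method of the embodiment: each word is scored independently by a hypothetical `word_prob` function (a unigram model), whereas NPYLM conditions each word on the preceding words. `best_segmentation` and `max_len` are likewise hypothetical.

```python
import math


def best_segmentation(text, word_prob, max_len=4):
    """Return the division of text into words that maximizes the product of word_prob values."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last word)
    best = [(-math.inf, 0)] * (n + 1)
    best[0] = (0.0, 0)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            p = word_prob(text[start:end])
            if p <= 0.0 or best[start][0] == -math.inf:
                continue
            score = best[start][0] + math.log(p)
            if score > best[end][0]:
                best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("no division pattern with non-zero probability")
    # Walk the back-pointers from the end of the string to recover the words.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]
```

The same search structure applies when the per-word score comes from a corrected probability such as Pa instead of the toy unigram probability used here.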

Like the registration unit 180, the registration unit 280 includes a reception unit 281, a pool 283, and an addition unit 285. That is, each time the first learning process by the first learning unit 240 ends, the registration unit 280 additionally registers the character strings D1 accumulated in the pool 283 in the learning character string group LD2, so that the character strings D1 input during that first learning process are used for learning the language model LM2 in the next first learning process.

The segmentation processing system 1 of the second embodiment has been described above. The system has an advanced model learning function (semi-supervised learning of the language model LM2 and the identification model DM2) and, in addition, can realize efficient and quick model learning based on the segmented character strings D2. According to this embodiment, therefore, an even higher-performance segmentation processing system 1 can be provided to the user device 5.

[Third Embodiment]
Next, the segmentation processing system 1 of the third embodiment is described. It differs from the segmentation processing systems 1 of the first and second embodiments in the functions realized by the control device 10. Its hardware configuration, on the other hand, is identical to that of the segmentation processing system 1 shown in FIG. 1. The following description of the third embodiment therefore focuses on the functions realized by the control device 10.

By executing a program, the control device 10 of this embodiment functions as a master processing unit 300, a switching unit 350, and a plurality of individual processing units 400, as shown in FIG. 7. The master processing unit 300 has functions similar to those realized by the control device 10 of the second embodiment. However, it has neither the segmentation unit 210 nor the registration unit 280, and it holds an identification model DM2 and a language model LM2 that were learned from a learning character string group LD2 consisting only of the initial learning data.

That is, the master processing unit 300 holds the identification model DM2 learned by the CRF learning unit 220 from the initial learning data and the language model LM2 learned by the NPYLM learning unit 230 from the initial learning data.

Furthermore, the master processing unit 300 is configured to provide the conditional probabilities P and Pa based on the language model LM2 and the joint probabilities Pj and Pe based on the identification model DM2 to an individual processing unit 400 at that unit's request. The conditional probabilities P and Pa provided from the master processing unit 300 to an individual processing unit 400 are referred to below as the master conditional probability Pm. The master conditional probability Pm may be either the conditional probability P or the conditional probability Pa.

An individual processing unit 400, on the other hand, is provided for each user. The switching unit 350 inputs a segmentation target character string D1 received via the communication device 30 to the individual processing unit 400 corresponding to the user who sent it.

The switching unit 350 is also configured to receive, from the user device 5 through the communication device 30, new character strings that include teacher information as supervised data, and it inputs each received new character string to the individual processing unit 400 of the sending user. Such a new character string corresponds to a word registered by the user.

The switching unit 350 can identify the correspondence between the user device 5 and the user, or between the user device 5 and the individual processing unit 400, by performing user authentication, for example when the connection between the segmentation processing system 1 and the user device 5 is established. The switching unit 350 may include a database for this user authentication and correspondence identification.
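
The routing role of the switching unit can be sketched as a lookup from a connection to a per-user unit. Everything named here (`SwitchingUnit`, `IndividualProcessingUnit`, `authenticate`, `route`, the session identifiers) is hypothetical, and the dictionary merely stands in for the authentication and correspondence database; actual credential checking is omitted.

```python
class IndividualProcessingUnit:
    """Hypothetical per-user unit that just records what it receives."""

    def __init__(self, user_id):
        self.user_id = user_id
        self.inputs = []       # segmentation target strings D1
        self.registered = []   # (text, teacher) pairs registered by the user


class SwitchingUnit:
    def __init__(self):
        self._session_to_user = {}   # stand-in for the correspondence database
        self._units = {}             # one unit per user (or per group)

    def authenticate(self, session_id, user_id):
        """Record which user a connection belongs to (credential check omitted)."""
        self._session_to_user[session_id] = user_id
        if user_id not in self._units:
            self._units[user_id] = IndividualProcessingUnit(user_id)

    def route(self, session_id, text, teacher=None):
        """Forward a string to the unit of the user who sent it."""
        unit = self._units[self._session_to_user[session_id]]
        if teacher is None:
            unit.inputs.append(text)
        else:
            unit.registered.append((text, teacher))
        return unit
```

Because the key is a user identifier rather than a device, several connections can map to the same unit, which also covers the case where one unit serves a group.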

Through these per-user individual processing units 400, the segmentation processing system 1 of this embodiment learns an identification model DM3 and a language model LM3 for each user so that they fit the user's usage history and the words registered by the user. Through this learning, the segmentation processing system 1 is configured to be able to return a segmentation result (segmented character string D2) appropriate for each user to the user device 5 that sent the segmentation target character string D1.

A note on the term "user": one user in this embodiment is not necessarily one person. That is, the individual processing unit 400 for each user may be a unit for each individual person or a unit for each group. A group may be, for example, a group classified by attributes such as gender, age, area of residence, occupation, and affiliated organization.

Next, the individual processing unit 400 is described in detail with reference to FIGS. 8 to 10. As shown in FIG. 8, each individual processing unit 400 has a segmentation unit 410, a CRF learning unit 420, an NPYLM learning unit 430, a registration unit 480, and a learning character string group LD3.

Like the segmentation unit 210, the segmentation unit 410 is configured to segment the corresponding user's segmentation target character string D1 input via the switching unit 350, and to return the segmented character string D2 to the user device 5 that sent the segmentation target character string D1. The corresponding user here means the user associated with the individual processing unit 400. The character string D1 can be segmented based on one or both of the corresponding user's language model LM3 and identification model DM3 held by the same individual processing unit 400.

As shown in FIG. 9, the CRF learning unit 420 is configured to have a character string capturing unit 421, a feature vector generation unit 422, a conditional probability acquisition unit 423, an NPYLM feature generation unit 424, a master conditional probability acquisition unit 425, a master feature generation unit 426, an integrated feature vector generation unit 427, an update unit 428, an output unit 429, and the identification model DM3.

The character string capturing unit 421 operates to capture, from the learning character string group LD3, the group of character strings containing teacher information as supervised data. The feature vector generation unit 422 generates feature vectors F based on this supervised data. One feature vector F is generated per character string.

The conditional probability acquisition unit 423 acquires, from the NPYLM learning unit 430 in the same individual processing unit 400, the conditional probability P of the character string corresponding to each feature vector F, as determined from the corresponding user's language model LM3. The NPYLM feature generation unit 424 converts the conditional probability P acquired by the conditional probability acquisition unit 423 into an NPYLM feature fn to be added to the feature vector F.

Meanwhile, the master conditional probability acquisition unit 425 acquires, from the master processing unit 300, the master conditional probability Pm of the character string corresponding to each feature vector F. The master feature generation unit 426 converts this master conditional probability Pm into a feature fm to be added to the feature vector F. Hereinafter, this feature fm is referred to as the master feature.

The integrated feature vector generation unit 427 adds the corresponding NPYLM feature fn and master feature fm to the feature vector F as new elements, generating the integrated feature vector FE = (F, fn, fm). One integrated feature vector FE is generated per character string. When the NPYLM feature fn and the master feature fm are scalars, the integrated feature vector FE is an (M + 2)-dimensional vector for an M-dimensional feature vector F.

With respect to the joint probability Pe corresponding to the inner product F·S + fn·Sn + fm·Sm of the integrated feature vector FE and the identification scores (S, Sn, Sm), the update unit 428 calculates the identification scores (S, Sn, Sm) that maximize the average of the joint probabilities Pe over the character strings. It then updates the identification model DM3 by replacing the identification scores S, Sn, Sm held in the identification model DM3 with the calculated identification scores S, Sn, Sm. Here, the identification score Sm is the score for the master feature fm, and its initial value is set by the designer.
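The construction of FE and the inner product F·S + fn·Sn + fm·Sm can be illustrated with a short sketch. It assumes that fn and fm are scalars and that F and S are plain lists of equal length; the function names are hypothetical. It computes only the score itself, since the text does not spell out here how the score is mapped to the joint probability Pe.

```python
def integrated_feature_vector(F, fn, fm):
    """Append the scalar features fn and fm to F, giving FE = (F, fn, fm)."""
    return list(F) + [fn, fm]


def integrated_score(F, fn, fm, S, Sn, Sm):
    """Inner product of FE with (S, Sn, Sm): F.S + fn*Sn + fm*Sm."""
    if len(F) != len(S):
        raise ValueError("F and S must have the same dimension M")
    FE = integrated_feature_vector(F, fn, fm)
    weights = list(S) + [Sn, Sm]
    return sum(x * w for x, w in zip(FE, weights))
```

For an M-dimensional F the returned vector has M + 2 elements, matching the dimension stated above.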

The output unit 429 is configured to output the identification scores S, Sn, Sm and the joint probabilities Pe, Pj in response to requests from the NPYLM learning unit 430 and the segmentation unit 410. The joint probability Pj here corresponds to the inner product F·S of the feature vector F and the identification score S.

With the above configuration, the CRF learning unit 420 learns the corresponding user's identification model DM3 based on the group of character strings containing teacher information within the corresponding user's learning character string group LD3, the corresponding user's language model LM3, and the language model of the master processing unit 300 shared by all users.

As shown in FIG. 10, the NPYLM learning unit 430 is configured to have a first learning unit 440, a second learning unit 460, an output unit 470, and the language model LM3. The first learning unit 440 is configured to have a selection unit 441, an extraction unit 443, an update unit 445, an all-selection determination unit 447, a weighted average calculation unit 450, an NPYLM feature score input unit 451, a conversion unit 453, a master feature score input unit 455, and a master conditional probability input unit 457.

The first learning unit 440 learns the corresponding user's language model LM3 held by the NPYLM learning unit 430, following basically the same procedure as the first learning unit 240. It differs from the first learning unit 240 of the second embodiment, however, in that it segments the processing target character string using a conditional probability Pa based on the identification scores Sn, Sm provided by the CRF learning unit 420 in the same individual processing unit 400, the joint probability Pj (or joint probability Pe) provided by that CRF learning unit 420, and the master conditional probability Pm provided by the master processing unit 300.

The selection unit 441, extraction unit 443, update unit 445, all-selection determination unit 447, NPYLM feature score input unit 451, and conversion unit 453 basically operate in the same way as the selection unit 241, extraction unit 243, update unit 245, all-selection determination unit 247, NPYLM feature score input unit 251, and conversion unit 253 of the second embodiment.

However, the language model LM3 is learned by selecting processing target character strings one at a time from the corresponding user's learning character string group LD3. The update unit 445 segments the processing target character string based on the conditional probability Pa provided by the weighted average calculation unit 450 and updates the language model LM3 based on the result, but the conditional probability Pa that the weighted average calculation unit 450 provides to the update unit 445 differs from that of the second embodiment, as follows.

The weighted average calculation unit 450 calculates the conditional probability Pa = (P + Sn·P' + Sm·Pm) / (1 + Sn + Sm), which is a weighted average of the conditional probability P, the conditional probability P', and the master conditional probability Pm. It does so based on the conditional probability P determined from the corresponding user's language model LM3; the identification score Sn acquired through the NPYLM feature score input unit 451 and the identification score Sm acquired through the master feature score input unit 455, both from the CRF learning unit 420 of the same individual processing unit 400; the conditional probability P' input from the conversion unit 453; and the master conditional probability Pm acquired from the master processing unit 300 through the master conditional probability input unit 457.
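The weighted average above is simple enough to state directly in code. The sketch below assumes scalar inputs; the function name and the check on the denominator are illustrative additions, not part of the specification.

```python
def weighted_conditional_probability(P, P_prime, Pm, Sn, Sm):
    """Pa = (P + Sn*P' + Sm*Pm) / (1 + Sn + Sm).

    P        conditional probability from the user's language model LM3
    P_prime  conditional probability converted from the identification model
    Pm       master conditional probability from the master processing unit
    Sn, Sm   identification scores used as weights
    """
    denom = 1.0 + Sn + Sm
    if denom <= 0.0:
        # Assumed guard: the text does not discuss this case.
        raise ValueError("1 + Sn + Sm must be positive")
    return (P + Sn * P_prime + Sm * Pm) / denom
```

With Sn = Sm = 0 the result reduces to P, and with non-negative weights the result always lies between the smallest and largest of P, P', and Pm.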

In this way, the first learning unit 440 learns the corresponding user's language model LM3 based on the corresponding user's learning character string group LD3, the corresponding user's identification model DM3, and the language model of the master processing unit 300.

The second learning unit 460 is configured in the same way as the second learning units 160 and 260. In parallel with the first learning process executed by the first learning unit 440, each time a segmentation target character string D1 is input to the corresponding user's segmentation unit 410, it learns the corresponding user's language model LM3 based on the resulting segmented character string D2. That is, for each character string D1 input by the corresponding user, the second learning unit 460 registers the segmented character string D2 obtained from the segmentation unit 410 in the language model LM3, updating the parameters of the language model LM3 according to the division pattern of the segmented character string D2.

The output unit 470 operates, in response to requests from the segmentation unit 410 and the CRF learning unit 420, to determine the requested conditional probabilities P, Pa based on the language model LM3 and output them to the requester.

Like the registration units 180 and 280, the registration unit 480 has a reception unit 481, a pool 483, and an addition unit 485. The registration unit 480 operates in the same way as the registration units 180 and 280: it pools the segmentation target character strings D1 input to the corresponding user's segmentation unit 410 and, each time the first learning process by the first learning unit 440 finishes, registers the character strings D1 accumulated in the pool 483 in the corresponding user's learning character string group LD3 as learning character strings.

In addition, the reception unit 481 of this embodiment operates to accumulate in the pool 483 the new character strings containing teacher information that the corresponding user inputs through the switching unit 350. Each time the first learning process by the first learning unit 440 finishes, the addition unit 485 takes these character strings containing teacher information out of the pool 483, just as it does the character strings D1, and registers them in the corresponding user's learning character string group LD3 as learning character strings.
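The pooling behavior of the registration unit 480 can be sketched as a small class. The names below are hypothetical, and representing teacher information as an optional second tuple element is an assumption made only for illustration.

```python
class RegistrationUnit:
    """Hypothetical sketch of the registration unit 480."""

    def __init__(self):
        self.pool = []          # stands in for pool 483
        self.learning_set = []  # stands in for learning character string group LD3

    def accept(self, text, teacher_info=None):
        """Reception unit 481: pool a string; teacher_info is None for
        unsupervised input (D1) and set for supervised new strings."""
        self.pool.append((text, teacher_info))

    def on_first_learning_finished(self):
        """Addition unit 485: move everything pooled so far into LD3.

        Returns the number of strings moved.
        """
        moved = len(self.pool)
        self.learning_set.extend(self.pool)
        self.pool.clear()
        return moved
```

Deferring registration until the first learning process finishes means the learning set does not change while a learning pass is iterating over it.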

In this embodiment, as described above, the registration unit 480 is configured to register in the learning character string group LD3 the segmentation target character strings D1 input by the corresponding user as unsupervised data, and the new character strings containing teacher information as supervised data. Accordingly, the learning character string group LD3 is configured not to contain initial learning data from the system administrator.

Based on the identification model DM3 and the language model LM3 learned as described above, the segmentation unit 410 may be configured to segment the segmentation target character string D1 as follows. For example, the segmentation unit 410 may be configured to segment the corresponding user's segmentation target character string D1, in the same way as the segmentation unit 210, based on the conditional probability P or the conditional probability Pa acquired from the NPYLM learning unit 430 in the same individual processing unit 400. Alternatively, the segmentation unit 410 may be configured to segment the segmentation target character string D1, in the same way as the segmentation unit 210, based on the identification score S acquired from the CRF learning unit 420 in the same individual processing unit 400.

Alternatively, the segmentation unit 410 may be configured to segment the segmentation target character string D1 based on the identification scores S, Sn, Sm acquired from the CRF learning unit 420 in the same individual processing unit 400, the NPYLM feature fn corresponding to the conditional probability P acquired from the NPYLM learning unit 430, and the master feature fm corresponding to the master conditional probability Pm acquired from the master processing unit 300.

For example, the segmentation unit 410 can search for the feature vector F of the segmentation target character string D1 that maximizes the joint probability Pe corresponding to the inner product of the integrated feature vector (F, fn, fm) and the identification scores (S, Sn, Sm), thereby determine the most likely division pattern for the segmentation target character string D1, and segment the segmentation target character string D1 accordingly.
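One common way to search over division patterns is dynamic programming. The sketch below is not the search procedure of the specification: it assumes, purely for illustration, that the score of a division pattern is the sum of independent per-word scores supplied by a hypothetical `word_score` callable, and that words are at most `max_len` characters long. Under those assumptions it returns a highest-scoring split.

```python
import math


def best_segmentation(text, word_score, max_len=8):
    """Return a highest-scoring split of text into words.

    Assumes the total score is the sum of word_score(word) over the words,
    which is a simplification of the integrated score described above.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            score = best[start] + word_score(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Walk the back-pointers from the end to recover the words.
    words = []
    pos = n
    while pos > 0:
        words.append(text[back[pos]:pos])
        pos = back[pos]
    return words[::-1]
```

A usage example with a toy lexicon: known words receive a positive score and unknown substrings a length-proportional penalty, so `best_segmentation("abc", score)` prefers the split that uses the known words.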

The segmentation processing system 1 of this embodiment has been described above. According to this embodiment, the identification model DM3 and the language model LM3 are learned for each user based on the segmentation target character strings D1 input by that user and the group of character strings containing teacher information. Model learning adapted to the user can therefore be performed.

Furthermore, according to this embodiment, a segmentation target character string D1 input by a user is segmented based on that user's identification model DM3 and language model LM3, and the segmented character string D2 is provided to that user, so a segmentation result appropriate for the user can be provided. This embodiment can thus provide segmentation results suited to the vocabulary each user uses, and can provide a high-performance segmentation processing system 1.

In addition, in this embodiment, the weighted-average conditional probability Pa is used when updating the language model LM3 based on a processing target character string. The weights of the conditional probabilities P, P', Pm used in calculating the conditional probability Pa are determined automatically based on the identification scores Sn, Sm. The segmentation processing system 1 of this embodiment can therefore reflect the learning results based on user input and the learning results based on the initial learning data in the language model LM3 at a suitable ratio. This allows the segmentation processing system 1 of this embodiment to segment character strings input by the user very appropriately.

Modifications of the third embodiment are now described. The example above has the first learning unit 440 acquire the master conditional probability Pm, but an equivalent of the master conditional probability Pm can also be obtained by converting the joint probabilities Pj, Pe calculated from the identification model of the master processing unit 300 into a conditional probability.

Accordingly, instead of the master conditional probability input unit 457, the first learning unit 440 may be configured with a conversion unit that converts the joint probability Pj or joint probability Pe acquired from the master processing unit 300 into a conditional probability Pm' and provides it to the weighted average calculation unit 450. In this case, the weighted average calculation unit 450 may be configured to calculate the conditional probability Pa = (P + Sn·P' + Sm·Pm') / (1 + Sn + Sm), using the conditional probability Pm' in place of the conditional probability Pm.

The weighted average calculation unit 450 may also be configured to calculate the conditional probability Pa using both the conditional probability Pm and the conditional probability Pm'. For example, the conditional probability Pa may be calculated using the average of the conditional probability Pm and the conditional probability Pm' in place of the conditional probability Pm.
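The variants described in this modification differ only in which value fills the master term of the weighted average. A minimal sketch, with hypothetical function names and with `None` used as an assumed marker for "not available":

```python
def master_term(Pm=None, Pm_prime=None):
    """Pick the master term: Pm, Pm', or their average when both are given."""
    values = [v for v in (Pm, Pm_prime) if v is not None]
    if not values:
        raise ValueError("at least one of Pm or Pm_prime is required")
    return sum(values) / len(values)


def weighted_conditional_probability_variant(P, P_prime, Sn, Sm,
                                             Pm=None, Pm_prime=None):
    """Pa = (P + Sn*P' + Sm*m) / (1 + Sn + Sm), where m is the master term."""
    m = master_term(Pm, Pm_prime)
    return (P + Sn * P_prime + Sm * m) / (1.0 + Sn + Sm)
```

Passing only `Pm` gives the formula of the main example, passing only `Pm_prime` gives the first modification, and passing both gives the averaged form.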

Similarly, rather than acquiring the master conditional probability Pm from the master processing unit 300 and generating the master feature fm from it, the CRF learning unit 420 may be configured to generate the master feature fm from the joint probability Pj or joint probability Pe obtained from the master processing unit 300. The CRF learning unit 420 may also be configured to generate the master feature fm based on both the master conditional probability Pm and the joint probabilities Pj, Pe obtained from the master processing unit 300.

[Fourth Embodiment]
Next, the segmentation processing system 1 of the fourth embodiment is described. The hardware configuration of the segmentation processing system 1 of the fourth embodiment is the same as in the above embodiments. The segmentation processing system 1 of the fourth embodiment modifies the configuration of the CRF learning unit 220 (see FIG. 4) realized by the control device 10 in the segmentation processing system 1 of the second embodiment; its other components basically have the same configuration as in the second embodiment. The following description therefore focuses on the configuration of the CRF learning unit 520, which replaces the CRF learning unit 220 of the second embodiment.

By executing a program, the control device 10 of this embodiment functions as the CRF learning unit 520 shown in FIG. 11.
As in the second embodiment, the CRF learning unit 520 is configured to learn the identification model DM2 based on the group of learning character strings containing teacher information (supervised data) within the learning character string group LD2, and on the language model LM2. The identification model DM2 includes the identification scores S, Sn.

Specifically, without converting the conditional probability P obtained from the NPYLM learning unit 230 into an NPYLM feature, the CRF learning unit 520 updates the identification scores S, Sn so as to maximize the average of the appearance probabilities of the learning character strings corresponding to the supervised data, where each appearance probability is calculated based on the conditional probability Pa. The identification model DM2 is learned in this way. As described in the second embodiment, the conditional probability Pa for a character sequence is the integrated conditional probability Pa = (P + Sn·P') / (1 + Sn), corresponding to a weighted average of the conditional probability P based on the language model LM2 and the conditional probability P' based on the identification model DM2. The appearance probability of a learning character string can be obtained by calculating the product of the integrated conditional probabilities Pa that follow the division pattern and character sequence of that learning character string.
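The appearance probability described above is a product of integrated conditional probabilities along one division pattern. The sketch below assumes that the per-step values of P and P' have already been obtained and are passed in as pairs; the function names and this input format are hypothetical.

```python
import math


def integrated_probability(P, P_prime, Sn):
    """Pa = (P + Sn*P') / (1 + Sn)."""
    return (P + Sn * P_prime) / (1.0 + Sn)


def appearance_probability(steps, Sn):
    """Product of Pa over the steps of one teacher-specified division pattern.

    steps: iterable of (P, P_prime) pairs, one per conditional step.
    """
    return math.prod(integrated_probability(P, Pp, Sn) for P, Pp in steps)


def mean_appearance_probability(strings, Sn):
    """Average appearance probability over all supervised strings."""
    probs = [appearance_probability(steps, Sn) for steps in strings]
    return sum(probs) / len(probs)
```

`mean_appearance_probability` is the quantity the CRF learning unit 520 is described as maximizing; in the full method P' also depends on the identification score S, which this sketch treats as already evaluated.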

As described in the second embodiment, the conditional probability P' is obtained by converting the joint probability Pj calculated from the identification model DM2 into the corresponding conditional probability, and has the identification score S as a parameter. Therefore, if the conditional probability P based on the language model LM2 is regarded as a constant, the integrated conditional probability Pa is a function with the identification scores S, Sn as parameters, and the average of the above appearance probabilities is likewise a function with the identification scores S, Sn as parameters. The identification scores S, Sn that maximize this average can be found using a known method such as a quasi-Newton method.
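Written as a formula, the maximization described above can be sketched as follows. The index i over learning character strings, the index t over steps along the teacher-specified division pattern, the count N, and the notation P'^{(i,t)}(S) are introduced here only for illustration and do not appear in the source.

```latex
\[
(\hat{S}, \hat{S}_n)
  = \arg\max_{S,\,S_n}\; \frac{1}{N} \sum_{i=1}^{N} \prod_{t} P_a^{(i,t)}(S, S_n),
\qquad
P_a^{(i,t)}(S, S_n) = \frac{P^{(i,t)} + S_n \, P'^{(i,t)}(S)}{1 + S_n}
\]
```

Here P^{(i,t)} is held constant, so the objective depends only on S and S_n, which is what makes a gradient-based method such as a quasi-Newton method applicable.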

In this embodiment, following this principle, the identification scores S, Sn are updated, and the identification model DM2 is learned, based on the group of learning character strings corresponding to the supervised data and the conditional probability P based on the language model LM2.

As shown in FIG. 11, the CRF learning unit 520 of this embodiment includes a character string capturing unit 521, a conditional probability acquisition unit 523, an appearance probability maximization unit 528, an output unit 529, and the identification model DM2.

The character string capturing unit 521 captures, from the learning character string group LD2, the group of character strings containing teacher information as supervised data. The conditional probability acquisition unit 523 acquires, from the NPYLM learning unit 230, the group of conditional probabilities P based on the language model LM2 that correspond to each learning character string captured by the character string capturing unit 521. The conditional probabilities P to be acquired are, for each learning character string, those needed to calculate the appearance probability of the learning character string under the division pattern indicated by the teacher information.

Based on the above principle, the appearance probability maximization unit 528 updates the identification scores S, Sn of the identification model DM2 in the direction that maximizes the average of the appearance probabilities of the learning character strings. Each appearance probability is calculated using the integrated conditional probability Pa = (P + Sn·P') / (1 + Sn), defined as a weighted average of the conditional probability P' based on the identification model DM2 and the corresponding conditional probability P based on the language model LM2. The identification model DM2 is updated in this way.

In response to requests from the NPYLM learning unit 230 and the segmentation unit 210, the output unit 529 outputs the identification scores S, Sn updated as described above and the joint probability Pj based on the identification model DM2. The NPYLM learning unit 230 of this embodiment can calculate the conditional probability Pa based on the joint probability Pj and the identification score Sn from the CRF learning unit 520, and learn the language model LM2. Like the above embodiments, this embodiment can also provide a high-performance segmentation processing system 1.

[Fifth Embodiment]
Next, the segmentation processing system 1 of the fifth embodiment is described. The hardware configuration of the segmentation processing system 1 of the fifth embodiment is the same as in the above embodiments. The segmentation processing system 1 of the fifth embodiment modifies the configuration of the CRF learning unit 420 (see FIG. 9) realized by the control device 10 in the segmentation processing system 1 of the third embodiment; its other components basically have the same configuration as in the third embodiment. The following description therefore focuses on the configuration of the CRF learning unit 620, which replaces the CRF learning unit 420 of the third embodiment.

By executing a program, the control device 10 of this embodiment functions as the CRF learning unit 620 shown in FIG. 12. The CRF learning unit 620 differs from the CRF learning unit 420 of the third embodiment in that, as in the fourth embodiment, it learns the identification model DM3 by updating the identification scores S, Sn, Sm so as to maximize the average of the appearance probabilities of the learning character strings based on the integrated conditional probability Pa = (P + Sn·P' + Sm·Pm) / (1 + Sn + Sm).

The conditional probability P is the conditional probability based on the language model LM3, provided by the NPYLM learning unit 430 of the same individual processing unit 400 (see FIG. 8). The conditional probability Pm is the master conditional probability based on the language model LM2, provided by the master processing unit 300 (see FIG. 7). The master processing unit 300 in this embodiment may be configured to have an identification model DM2 and a language model LM2 learned, on the same principle as in the fourth embodiment, from a learning character string group LD2 consisting only of initial learning data.

The conditional probability P' is obtained by converting the joint probability Pj calculated from the identification model DM3 into the corresponding conditional probability, and has the identification score S as a parameter. Therefore, if the conditional probability P based on the language model LM3 of the same individual processing unit 400 and the master conditional probability Pm based on the language model LM2 of the master processing unit 300 are regarded as constants, the integrated conditional probability Pa is a function with the identification scores S, Sn, Sm as parameters. The average of the above appearance probabilities is thus also a function with the identification scores S, Sn, Sm as parameters, and it follows that the identification model DM3 can be learned by updating the identification scores S, Sn, Sm on the same principle as in the fourth embodiment.

As shown in FIG. 12, the CRF learning unit 620 includes a character string capturing unit 621, a conditional probability acquisition unit 623, a master conditional probability acquisition unit 625, an appearance probability maximization unit 628, an output unit 629, and an identification model DM3.

The character string capturing unit 621 captures, as supervised data, a group of character strings including teacher information from the learning character string group LD3 of the corresponding user. The conditional probability acquisition unit 623 acquires, from the NPYLM learning unit 430, a group of conditional probabilities P based on the language model LM3 of the corresponding user, corresponding to each learning character string captured by the character string capturing unit 621.

The master conditional probability acquisition unit 625 acquires, from the master processing unit 300, a group of conditional probabilities Pm (master conditional probabilities Pm) based on the language model LM2 common to all users, corresponding to each learning character string captured by the character string capturing unit 621.

The appearance probability maximization unit 628 updates the identification scores S, Sn, and Sm of the identification model DM3 in the direction that maximizes the average value of the appearance probabilities of the learning character strings. Each appearance probability is calculated using the integrated conditional probability Pa (= (P + Sn·P' + Sm·Pm) / (1 + Sn + Sm)), which is defined as a weighted average of the conditional probability P' based on the identification model DM3 of the corresponding user and the conditional probabilities Pm and P based on the language models LM2 and LM3 that correspond to P'. The identification model DM3 is updated in this way.
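The update performed by the appearance probability maximization unit 628 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the appearance probability of a learning string is assumed to be the product of Pa over its words, the gradient is taken by finite differences, only Sn and Sm are updated (S is held inside P'), and all function names are hypothetical.

```python
from math import prod
from typing import List, Sequence, Tuple

# One entry per word of a learning string: (P, P', Pm) for that word in its context.
WordProbs = Tuple[float, float, float]


def integrated_probability(p: float, p_prime: float, p_m: float,
                           s_n: float, s_m: float) -> float:
    """Pa = (P + Sn*P' + Sm*Pm) / (1 + Sn + Sm), as defined in the text."""
    return (p + s_n * p_prime + s_m * p_m) / (1.0 + s_n + s_m)


def mean_appearance_probability(strings: Sequence[Sequence[WordProbs]],
                                s_n: float, s_m: float) -> float:
    """Average over learning strings of the product of Pa over each string's words
    (the product form is an assumption of this sketch)."""
    totals = [
        prod(integrated_probability(p, pp, pm, s_n, s_m) for p, pp, pm in words)
        for words in strings
    ]
    return sum(totals) / len(totals)


def ascent_step(strings: Sequence[Sequence[WordProbs]], s_n: float, s_m: float,
                lr: float = 0.1, eps: float = 1e-6) -> Tuple[float, float]:
    """One finite-difference gradient-ascent step on (Sn, Sm), clipped at zero."""
    base = mean_appearance_probability(strings, s_n, s_m)
    g_n = (mean_appearance_probability(strings, s_n + eps, s_m) - base) / eps
    g_m = (mean_appearance_probability(strings, s_n, s_m + eps) - base) / eps
    return max(0.0, s_n + lr * g_n), max(0.0, s_m + lr * g_m)
```

Repeating `ascent_step` moves weight toward whichever of P' and Pm tends to exceed the current Pa on the supervised strings.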

The output unit 629 outputs the identification scores S, Sn, and Sm and the joint probability Pj in response to requests from the NPYLM learning unit 430 and the segmentation unit 410. The NPYLM learning unit 430 of this embodiment can calculate the conditional probability Pa and learn the language model LM3 using the joint probability Pj and the identification scores Sn and Sm from the CRF learning unit 620 in place of those from the CRF learning unit 420. This embodiment, like the embodiments described above, can also provide a high-performance word segmentation processing system 1.

[Other Embodiments]
Embodiments of the present invention have been described above, but the present invention is in no way limited to those embodiments and can take various forms. For example, the registration units 180 and 280 in the word segmentation processing system 1 of the first and second embodiments may, as in the third embodiment, be configured to receive a new character string including teacher information from the user device 5 through the communication device 30, and to register the received new character string in the learning character string groups LD1 and LD2.

In addition, part of the configuration of the above system may be replaced with a known configuration having the same function, or may be omitted. Part of the system configuration of one embodiment may be added to, or substituted for part of, the system configuration of another embodiment. A function of one component may be distributed among a plurality of components, and functions of a plurality of components may be integrated into one component. Every aspect included in the technical idea specified by the wording of the claims is an embodiment of the present invention.

DESCRIPTION OF SYMBOLS 1 ... word segmentation processing system, 5 ... user device, 10 ... control device, 10A ... CPU, 20 ... storage device, 30 ... communication device, 110, 210, 410 ... segmentation unit, 140, 240, 440 ... first learning unit, 141, 241, 441 ... selection unit, 143, 243, 443 ... extraction unit, 145, 245, 445 ... update unit, 147, 247, 447 ... all-selection determination unit, 160, 260, 460 ... second learning unit, 180, 280, 480 ... registration unit, 181, 281, 481 ... reception unit, 183, 283, 483 ... pool, 185, 285, 485 ... addition unit, 220, 420, 520, 620 ... CRF learning unit, 221, 421, 521, 621 ... character string capturing unit, 222, 422 ... feature vector generation unit, 223, 423, 523, 623 ... conditional probability acquisition unit, 224, 424 ... NPYLM feature generation unit, 227, 427 ... integrated feature vector generation unit, 228, 428 ... update unit, 229, 429, 529, 629 ... output unit, 230, 430 ... NPYLM learning unit, 250, 450 ... weighted average calculation unit, 251, 451 ... feature score input unit, 253, 453 ... conversion unit, 270, 470 ... output unit, 300 ... master processing unit, 350 ... switching unit, 400 ... individual processing unit, 425, 625 ... master conditional probability acquisition unit, 426 ... master feature generation unit, 455 ... master feature score input unit, 457 ... master conditional probability input unit, 528, 628 ... appearance probability maximization unit, D1 ... character string to be segmented, D2 ... segmented character string, LD1, LD2, LD3 ... learning character string group, DM2, DM3 ... identification model, LM1, LM2, LM3 ... language model.

Claims

A word segmentation processing system comprising:
a segmentation unit that segments an input character string based on a language model and outputs the result; and
a learning unit that learns the language model each time the character string is input to the segmentation unit,
wherein the language model corresponds to a Bayesian hierarchical language model, and
the learning of the language model is realized by updating, each time the character string is input to the segmentation unit, a conditional probability concerning the sequence of character strings defined by the language model, based on the sequence of a plurality of partial character strings constituting the segmented character string output from the segmentation unit.
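As an informal illustration of the segment-then-learn loop recited above (not part of the claim language), the sketch below updates a model on every input from the segmentation it has just produced. It assumes a simple smoothed unigram count model as a stand-in for the Bayesian hierarchical language model; the class name, the smoothing scheme, and the parameters `alpha` and `charset_size` are hypothetical.

```python
import math
from collections import Counter
from typing import List


class OnlineSegmenter:
    """Toy stand-in for the segmentation unit plus learning unit (unigram counts, not NPYLM)."""

    def __init__(self, max_word_len: int = 4, alpha: float = 1.0, charset_size: int = 100):
        self.counts: Counter = Counter()
        self.total = 0
        self.max_word_len = max_word_len
        self.alpha = alpha                # weight of the character-level base measure
        self.charset_size = charset_size  # assumed alphabet size for unseen words

    def word_prob(self, word: str) -> float:
        # Smoothed unigram probability; unseen words fall back to a length-penalised base.
        base = (1.0 / self.charset_size) ** len(word)
        return (self.counts[word] + self.alpha * base) / (self.total + self.alpha)

    def segment(self, text: str) -> List[str]:
        # Viterbi search over segmentations whose words have at most max_word_len characters.
        n = len(text)
        best = [0.0] + [-math.inf] * n
        back = [0] * (n + 1)
        for end in range(1, n + 1):
            for start in range(max(0, end - self.max_word_len), end):
                score = best[start] + math.log(self.word_prob(text[start:end]))
                if score > best[end]:
                    best[end], back[end] = score, start
        words, end = [], n
        while end > 0:
            words.append(text[back[end]:end])
            end = back[end]
        return words[::-1]

    def learn(self, words: List[str]) -> None:
        # Update the model from the sequence of partial character strings.
        self.counts.update(words)
        self.total += len(words)

    def process(self, text: str) -> List[str]:
        words = self.segment(text)
        self.learn(words)  # learning happens on every input
        return words
```

Each call to `process` both returns a segmentation and shifts the probabilities used for the next input, which is the behaviour the claim describes at a high level.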

The word segmentation processing system according to claim 1,
wherein a language model for each user is provided as the language model,
the segmentation unit segments the input character string based on the language model of the input source user, and
the learning unit learns the language model of the input source user by updating, each time the character string is input to the segmentation unit, the conditional probability defined by the language model of the input source user, based on the sequence of the plurality of partial character strings constituting the segmented character string output from the segmentation unit.

A word segmentation processing system comprising:
a segmentation unit that segments an input character string based on a language model and outputs the result;
a learning unit that learns the language model, the learning unit repeatedly executing a first learning process of learning the language model based on a group of learning character strings while executing, in parallel with the first learning process, a second learning process of learning the language model based on the segmented character string output from the segmentation unit; and
a registration unit that registers the character string input to the segmentation unit as a learning character string,
wherein the language model corresponds to a Bayesian hierarchical language model,
the learning of the language model in the first learning process is realized by updating, based on the group of learning character strings, a conditional probability concerning the sequence of character strings defined by the language model, and
the learning of the language model in the second learning process is realized by updating the conditional probability of the language model based on the sequence of a plurality of partial character strings constituting the segmented character string.
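The arrangement recited above, a repeated batch pass running alongside per-input updates with new inputs registered as learning strings, can be sketched with a lock-protected shared model. This is an illustration under stated assumptions, not the claimed implementation: the model is a plain word counter, the segmenter is an arbitrary caller-supplied function, registration is deferred to the end of a batch pass, and how a real system avoids counting a registered string twice is not modelled. All names are hypothetical.

```python
import threading
from collections import Counter
from typing import Callable, Dict, List

Segmenter = Callable[[str], List[str]]  # hypothetical: any function that segments a string


class ParallelLearner:
    def __init__(self, segment: Segmenter, learning_strings: List[str]):
        self.segment = segment
        self.learning_strings = list(learning_strings)
        self.previous: Dict[int, List[str]] = {}  # last segmentation of each learning string
        self.pool: List[str] = []                 # inputs awaiting registration
        self.counts: Counter = Counter()          # toy model state
        self.lock = threading.Lock()              # guards counts and pool

    def first_learning_pass(self) -> None:
        """Batch pass over all learning strings (first learning process)."""
        for i, text in enumerate(self.learning_strings):
            words = self.segment(text)
            with self.lock:
                # Replace the string's old contribution with its new segmentation.
                self.counts.subtract(self.previous.get(i, []))
                self.counts.update(words)
            self.previous[i] = words
        with self.lock:
            # Register pooled inputs as learning strings at the end of the pass.
            self.learning_strings.extend(self.pool)
            self.pool.clear()

    def on_input(self, text: str) -> List[str]:
        """Segment one input and learn from it immediately (second learning process)."""
        words = self.segment(text)
        with self.lock:
            self.counts.update(words)
            self.pool.append(text)
        return words
```

A caller would typically run `first_learning_pass` in a loop on a background thread while serving `on_input` on the request path.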

The word segmentation processing system according to claim 3,
wherein the first learning process is a process of learning the language model based on the group of learning character strings by repeatedly executing, while switching the processing target character string, a process of selecting a processing target character string from the group of learning character strings and updating the conditional probability of the language model based on the sequence of a plurality of partial character strings constituting the processing target character string,
the character string input to the segmentation unit is registered as a learning character string each time the repeatedly executed first learning process ends, and
the second learning process is a process of learning the language model, independently of the first learning process, by updating the conditional probability of the language model each time the character string is input to the segmentation unit, based on the sequence of the plurality of partial character strings constituting the segmented character string.

A word segmentation processing system comprising:
a segmentation unit that segments an input character string based on a language model and outputs the result;
a learning unit that learns the language model, the learning unit repeatedly executing a first learning process of learning the language model based on a group of learning character strings while executing, in parallel with the first learning process, a second learning process of learning the language model based on the segmented character string output from the segmentation unit; and
a registration unit that registers the character string input to the segmentation unit as a learning character string,
wherein the language model includes a generative model learned based on unsupervised data and an identification model learned based on supervised data, the generative model corresponding to a Bayesian hierarchical language model and the identification model corresponding to a language model based on conditional random fields,
the first learning process is a process of learning the language model by updating the generative model based on the unsupervised data and the identification model, using at least part of the group of learning character strings as the unsupervised data, while updating the identification model based on the supervised data and the generative model, using, as the supervised data, at least part of those learning character strings in the group that include teacher information,
the update of the generative model in the first learning process is realized by segmenting the character string corresponding to the unsupervised data based on a corrected conditional probability, obtained by correcting, based on the identification model, the conditional probability concerning the sequence of character strings defined by the generative model, and updating the conditional probability of the generative model based on the sequence of a plurality of partial character strings constituting the segmented character string corresponding to the unsupervised data,
the update of the identification model in the first learning process is realized by converting into a feature the conditional probability, defined by the generative model, of the character string corresponding to the supervised data, and updating the identification score of the identification model using, as an index, a joint probability corresponding to the inner product of an extended feature vector, obtained by adding the converted feature to the feature vector corresponding to the supervised data, and the identification score specified from the identification model,
the second learning process is a process of learning the language model by updating the generative model, each time the character string is input to the segmentation unit, based on the segmented character string, and
the update of the generative model in the second learning process is realized by updating the conditional probability of the generative model based on the sequence of a plurality of partial character strings constituting the segmented character string output from the segmentation unit.
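The identification-model update recited above (a language-model probability converted into a feature, appended to the feature vector, and scored by an inner product) can be illustrated with a small log-linear sketch. The assumptions are mine, not the claim's: the conversion is taken to be a logarithm, the joint probability is taken to be proportional to the exponential of the inner product, candidates are whole segmentations, and the update is one plain gradient step. All names are hypothetical.

```python
import math
from typing import List, Sequence


def extend_features(features: Sequence[float], lm_prob: float) -> List[float]:
    """Append the language-model probability, converted to a feature (log assumed)."""
    return list(features) + [math.log(lm_prob)]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def candidate_probabilities(candidates: Sequence[Sequence[float]],
                            id_scores: Sequence[float]) -> List[float]:
    """Normalised exp(inner product) over the candidate segmentations."""
    raw = [dot(f, id_scores) for f in candidates]
    top = max(raw)
    exps = [math.exp(r - top) for r in raw]  # subtract the max for numerical stability
    z = sum(exps)
    return [e / z for e in exps]


def update_scores(candidates: Sequence[Sequence[float]], gold_index: int,
                  id_scores: Sequence[float], lr: float = 0.1) -> List[float]:
    """One gradient step raising the log-probability of the teacher segmentation."""
    probs = candidate_probabilities(candidates, id_scores)
    expected = [
        sum(p * f[k] for p, f in zip(probs, candidates))
        for k in range(len(id_scores))
    ]
    gold = candidates[gold_index]
    return [w + lr * (gold[k] - expected[k]) for k, w in enumerate(id_scores)]
```

The last component of the score vector plays the role of a weight on the generative model's opinion: it grows when the generative model tends to agree with the teacher segmentation.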

The word segmentation processing system according to any one of claims 3 to 5,
wherein a language model for each user is provided as the language model,
the segmentation unit segments the input character string based on the language model of the input source user,
the registration unit registers the character string input to the segmentation unit as a learning character string for the language model of the input source user, and
the learning unit
repeatedly executes, for each user, as the first learning process, a process of learning the language model of the user based on the group of learning character strings for the language model of the user, and
executes, as the second learning process, a process of learning the language model of the input source user, each time the character string is input to the segmentation unit, based on the segmented character string.

The word segmentation processing system according to claim 3 or claim 4,
wherein a language model for each user and, in addition, a common language model that is a language model common to the users are provided as the language model,
the segmentation unit segments the input character string based on the language model of the input source user,
the registration unit registers the character string input to the segmentation unit as a learning character string for the language model of the input source user,
the learning unit
repeatedly executes, for each user, as the first learning process, a process of learning the language model of the user based on the group of learning character strings for the language model of the user, and
executes, as the second learning process, a process of learning the language model of the input source user, each time the character string is input to the segmentation unit, based on the segmented character string, and
the first learning process repeatedly executed for each user is a process of learning the language model of the user by repeatedly executing, while switching the processing target character string, a process of selecting a processing target character string from the group of learning character strings for the language model of the user, segmenting the processing target character string based on a weighted average of the conditional probability defined by the language model of the user and the conditional probability defined by the common language model, and updating the conditional probability of the language model of the user based on the sequence of a plurality of partial character strings constituting the segmented processing target character string.
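The weighted average of the user's model and the common model recited above can be illustrated as follows. This is a hedged sketch: the claim does not fix the form of the weights, so a single non-negative `weight` on the common model is assumed, the models are plain word-probability dictionaries, and unseen words receive a hypothetical `floor` probability.

```python
from math import log
from typing import Dict, List, Sequence


def blended(p_user: float, p_common: float, weight: float) -> float:
    """Weighted average of the user-model and common-model probabilities."""
    return (p_user + weight * p_common) / (1.0 + weight)


def best_segmentation(candidates: Sequence[List[str]],
                      user_lm: Dict[str, float],
                      common_lm: Dict[str, float],
                      weight: float,
                      floor: float = 1e-6) -> List[str]:
    """Pick the candidate segmentation with the highest blended log-probability."""
    def log_prob(words: List[str]) -> float:
        return sum(
            log(blended(user_lm.get(w, floor), common_lm.get(w, floor), weight))
            for w in words
        )
    return max(candidates, key=log_prob)
```

With a small `weight` the user's own vocabulary dominates the choice; with a large one the common model does, which is useful while a user's model is still sparse.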

The word segmentation processing system according to claim 5,
wherein a language model for each user and, in addition, a common language model that is a language model common to the users are provided as the language model, each of the language model for each user and the common language model including the generative model and the identification model,
the segmentation unit segments the input character string based on the language model of the input source user,
the registration unit registers the character string input to the segmentation unit as a learning character string for the language model of the input source user,
the learning unit
executes, for each user, as the first learning process, a process of updating the generative model of the user based on the unsupervised data, the identification model of the user, and the common language model, using at least part of the group of learning character strings for the language model of the user as the unsupervised data, while updating the identification model of the user based on the supervised data, the generative model of the user, and the common language model, using, as the supervised data, at least part of those learning character strings for the language model of the user that include teacher information, and
executes, as the second learning process, a process of updating the generative model of the input source user, each time the character string is input to the segmentation unit, based on the segmented character string,
the update of the generative model of the user in the first learning process is realized by segmenting the character string corresponding to the unsupervised data based on a corrected conditional probability, obtained by correcting the conditional probability of the generative model of the user based on the identification model of the user and on the identification model and the generative model of the common language model, and updating the conditional probability of the generative model of the user based on the sequence of a plurality of partial character strings constituting the segmented character string corresponding to the unsupervised data, and
the update of the identification model of the user in the first learning process is realized by converting into respective features the conditional probabilities, defined by the generative model of the user and by the generative model of the common language model, of the character string corresponding to the supervised data, and updating the identification score of the identification model of the user using, as an index, a joint probability corresponding to the inner product of an extended feature vector, obtained by adding the converted features to the feature vector corresponding to the supervised data, and the identification score specified from the identification model of the user.

A program for causing a computer to realize the functions of the segmentation unit and the learning unit included in the word segmentation processing system according to claim 1 or claim 2.

A program for causing a computer to realize the functions of the segmentation unit, the learning unit, and the registration unit included in the word segmentation processing system according to any one of claims 3 to 8.

A word segmentation processing method comprising, as procedures executed by a computer:
a procedure of segmenting an input character string based on a language model and outputting the result; and
a procedure of learning the language model each time the character string is input,
wherein the language model corresponds to a Bayesian hierarchical language model, and
the learning of the language model is realized by updating, each time the character string is input, a conditional probability concerning the sequence of character strings defined by the language model, based on the sequence of a plurality of partial character strings constituting the segmented character string corresponding to the input character string.

A word segmentation processing method comprising, as procedures executed by a computer:
a procedure of segmenting an input character string based on a language model and outputting the result;
a procedure of learning the language model, in which a first learning process of learning the language model based on a group of learning character strings is repeatedly executed while a second learning process of learning the language model based on the segmented character string corresponding to the input character string is executed in parallel with the first learning process; and
a procedure of registering the input character string as a learning character string,
wherein the language model corresponds to a Bayesian hierarchical language model,
the learning of the language model in the first learning process is realized by updating, based on the group of learning character strings, a conditional probability concerning the sequence of character strings defined by the language model, and
the learning of the language model in the second learning process is realized by updating the conditional probability of the language model based on the sequence of a plurality of partial character strings constituting the segmented character string.