JPH01274272A - Word tone separating system and word tone kanji conversion system for chinese language - Google Patents

Word tone separating system and word tone kanji conversion system for chinese language

Info

Publication number
JPH01274272A
JPH01274272A JP63105030A JP10503088A JPH01274272A JP H01274272 A JPH01274272 A JP H01274272A JP 63105030 A JP63105030 A JP 63105030A JP 10503088 A JP10503088 A JP 10503088A JP H01274272 A JPH01274272 A JP H01274272A
Authority
JP
Japan
Prior art keywords
speech
sound
monosyllabic
word
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
JP63105030A
Other languages
Japanese (ja)
Other versions
JP2798931B2 (en)
Inventor
Takeshi Kusui
楠井 健
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to JP63105030A (patent JP2798931B2)
Priority to CN 89102915 (patent CN1019233B)
Publication of JPH01274272A
Application granted
Publication of JP2798931B2
Anticipated expiration
Expired - Fee Related (current legal status)

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

PURPOSE: To obtain a phonetic-character-to-KANJI (Chinese character) conversion system suited to the Chinese language by performing word tone KANJI conversion on the basis of a word tone separating system to which a tone frequency method is applied. CONSTITUTION: When a syllable is input from a syllable input means 11, a word tone and tone frequency retrieval means 2 retrieves the corresponding data from a dictionary 1 and sends it to a storage means 3. A syllable judging means 4 judges the node at which a tone frequency phrase ends. An optimum word tone separation type generating means 5 generates the word tone separation types for that tone frequency phrase and sends them to a separation type storage means 6; it also generates the optimum word tone separation type and sends it to a KANJI string conversion means 7. The KANJI string conversion means 7 receives the optimum word tone separation type from the generating means 5, looks up the KANJI vocabulary of the dictionary 1 in sequence using each separated word tone as a headword, selects from among the homophones the KANJI word judged closest to the intended word, and outputs a KANJI word string by concatenating the selected KANJI words. The KANJI word string is sent to a KANJI word string storage means 8 and a document storage means 12.

Description

[Detailed description of the invention]

3.1 Industrial field of application

The present invention relates to a Chinese word processor.

More specifically, the invention relates to a Chinese word-sound segmentation system, which takes as input a string of word sounds (語音, rendered "word tones" in the title and abstract) expressed as a sequence of Chinese syllables and divides that string into individual word sounds, and to a Chinese word-sound-to-kanji conversion system, which converts the word sounds so segmented into kanji (Chinese characters).

3.2 Prior art

3.2.1 In Japanese word processors, the mainstream of kana-kanji conversion is morphological analysis of sentences by the longest-match method

Among word processors that handle Chinese-language text (hereinafter simply "Chinese text"), the Chinese M-type keyboard (made by NEC Corporation) is known as one of the romanized-input keyboards. Entering Chinese kanji sounds from the Chinese M-type keyboard is, like entering the kanji sounds of Japanese-language text (hereinafter "Japanese text") from the Japanese M-type keyboard, very easy to learn and remember, and input is fast.

The Japanese word processors of the various manufacturers now in widespread use mainly employ kana-input kanji conversion. The unit of kana-kanji conversion is a run of consecutive bunsetsu (phrase units) or a whole sentence, and the procedure is as follows:

① taking the bunsetsu as the unit, separate independent words from attached words by the longest-match method;
② look up the dictionary data while referring to the connection relations between the sounds of independent words and attached words that are considered plausible;
③ convert the independent words into kanji and the attached words into kana; and
④ generate the most probable kanji-kana sentence.

This is called the morphological analysis method.

There is also an M-type input method for Japanese, devised before the Chinese M-type keyboard mentioned above was developed (Japanese Unexamined Patent Publication No. Sho 56-149631). In this M-type input method, kanji and kana are separated automatically at the time of input, so the corresponding step of the procedure becomes very simple; this is a major reason for the high success rate of kana-kanji conversion with the M-type input method.

3.2.2 Differences between Japanese and Chinese

Any Japanese person would expect this kind of Japanese-word-processor-style romanization-to-kanji conversion to work well for Chinese input too.

However, the structure of Chinese text differs greatly from that of Japanese text. The concept of the bunsetsu, the sharpest weapon of the Japanese word processor, does not exist in Chinese, so independent words cannot be separated from attached words within it.

(1) Chinese has few attached words

Fig. 2 shows an example of a Chinese sentence and the Japanese sentence corresponding to it.

In the figure, the words of the Chinese sentence marked with ゛ are attached words. In the Japanese sentence, the words marked with ゛ are attached words corresponding to those of the Chinese sentence, the words marked with − are attached words peculiar to Japanese, and unmarked words are independent words in both sentences. These examples make it clear that, in Japanese text that uses kanji as heavily as possible, what remains in kana is the attached words, and that Chinese text has few attached words to begin with.

(2) Chinese has no concept of the bunsetsu

Dividing the Japanese sentence of Fig. 2(b) into bunsetsu gives Fig. 3(a). If the Chinese sentence of Fig. 2(a) is broken up rhythmically in imitation of the bunsetsu of Fig. 3(a), the result is Fig. 3(b). Figs. 3(a) and 3(b) show still more clearly how greatly the two languages and their sentence structures differ.

(3) Words are short in Chinese and long in Japanese

Furthermore, dividing the Chinese sentence of Fig. 2(a) into words gives the division shown in Fig. 4.

Bearing in mind that in Chinese one kanji is always read as one syllable, it can be seen that words in Chinese text are short, usually only one or two syllables long.

In Japanese text, on the other hand, as Fig. 5 shows, one-syllable words are rare among both independent words and attached words, apart from particles.

In the figure, one kana character represents one syllable, or one key press on a kana keyboard.

In Chinese text, by contrast, the situation is as shown in Fig. 6.

The katakana readings given there are only rough approximations of the sounds.

In the Chinese-style romanization, the uppercase letters are the consonant and the lowercase letters that follow are the vowel; each such pair forms one syllable. In the M-type Chinese input method, the consonant takes one stroke of the right hand and the vowel one stroke of the left hand, so one syllable is typed with two strokes, right then left.
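The notation just described can be illustrated with a small sketch. It assumes that every syllable is written as one or more uppercase letters (the consonant) followed by one or more lowercase letters (the vowel part), and that a multi-letter consonant such as ZH still counts as a single key stroke. The sample string and the function names are hypothetical; Fig. 6 itself is not reproduced here, and syllables without an initial consonant are not handled.

```python
import re

# One syllable = uppercase consonant part + lowercase vowel part (assumed notation).
SYLLABLE = re.compile(r"[A-Z]+[a-z]+")

def split_syllables(text: str) -> list[str]:
    """Split a string of consonant+vowel pairs into syllables."""
    syllables = SYLLABLE.findall(text)
    # Reject input that contains anything other than whole pairs.
    if "".join(syllables) != text:
        raise ValueError(f"not a sequence of consonant+vowel pairs: {text!r}")
    return syllables

def keystrokes(text: str) -> int:
    """Two strokes per syllable: one for the consonant, one for the vowel."""
    return 2 * len(split_syllables(text))

print(split_syllables("ZHeSHiZHong"))  # three syllables
print(keystrokes("ZHeSHiZHong"))       # two strokes each
```

Under these assumptions, the number of key strokes is simply twice the number of syllables, which is why the syllable count of a sentence is a direct measure of typing effort.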

In the example sentences of Figs. 5 and 6, not counting the numerals, the Japanese sentence has 50 syllables and the Chinese sentence has 23. This shows how concise Chinese text is compared with Japanese text. The difference clearly arises because Chinese text has no kana, whereas Japanese text is lengthened by its kana.

(4) Linguistically, Chinese is an "isolating language" and Japanese an "agglutinative language"; the longest-match method suits agglutinative languages

In linguistic classification, Chinese is a typical isolating language: no part of speech inflects at all, there are few particles to indicate the relations between words, and the meaning of a sentence is determined almost entirely by word order.

Japanese, on the other hand, is an agglutinative language, in which the relations between words are expressed by attaching particles and suffix elements one after another to an independent word. The bunsetsu of Japanese corresponds roughly to "independent word + suffix elements". It was therefore only natural that, in the development of Japanese word processors, kana-kanji conversion by bunsetsu became the basic idea and the longest-match method came to be used as a powerful tool for isolating the independent word within a bunsetsu.

Given 「こきゅうは」 (kokyūwa), the dictionary is consulted from the beginning of the string to see whether each candidate word is in the dictionary.

The lookups give: 「こ」 (子, 粉, 濃, ...), 「こき」 (濃き, 扱き, 古希, ...), 「こきゅ」 (none), 「こきゅう」 (呼吸, 故宮, ...), 「こきゅうは」 (none). The longest word here is 「こきゅう」, and the remaining 「は」 matches the particle 「は」. Hence 「こきゅうは」 is probably 「呼吸は」 ("breathing ...") or 「故宮は」 ("the old imperial palace ..."). If the Japanese example sentence is converted from kana to kanji bunsetsu by bunsetsu with this longest-match approach, the sentence of Fig. 7 comes out splendidly correct, apart from the homophone error between 呼吸 and 故宮.
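The lookup just walked through can be sketched as follows. This is a minimal illustration, not the implementation of any actual word processor: the toy dictionary contains only the entries quoted in the text, and the particle set, candidate order and function names are assumptions.

```python
# Toy dictionary of independent words; entries are taken from the example in the text.
INDEPENDENT = {
    "こ": ["子", "粉", "濃"],
    "こき": ["濃き", "扱き", "古希"],
    "こきゅう": ["呼吸", "故宮"],
}
# A few particles (attached words); an illustrative assumption.
PARTICLES = {"は", "が", "を", "に"}

def longest_match(bunsetsu: str):
    """Return (reading, candidates, rest) for the longest dictionary prefix, or None."""
    best = None
    for end in range(1, len(bunsetsu) + 1):
        prefix = bunsetsu[:end]
        if prefix in INDEPENDENT:
            # Later (longer) prefixes overwrite earlier ones.
            best = (prefix, INDEPENDENT[prefix], bunsetsu[end:])
    return best

def convert(bunsetsu: str) -> list[str]:
    """List kanji-kana candidates for one bunsetsu; fall back to the kana input."""
    match = longest_match(bunsetsu)
    if match is None:
        return [bunsetsu]
    reading, candidates, rest = match
    if rest and rest not in PARTICLES:
        return [bunsetsu]
    return [kanji + rest for kanji in candidates]

print(convert("こきゅうは"))
```

The sketch returns both homophone candidates, which mirrors the point made in the text: the segmentation is right, and only the choice between homophones remains open.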

The homophone problem in Japanese word processors lies mainly in kanji compounds read with on-yomi (Sino-Japanese readings), almost all of which are independent words that are nouns. The stems of predicative parts of speech such as verbs and adjectives usually have kun-yomi (native readings) and, apart from one-syllable stems, have few homophones. Particles and the suffix elements (inflectional endings) of predicative parts of speech also have few homophones. For Japanese text written in kanji and kana, therefore, bunsetsu segmentation and the longest-match method can be said to be the most suitable processing for kana-kanji conversion.

3.3 Problems to be solved by the invention

We next examine how effectively the methodology of Japanese kana-kanji conversion described above can be applied to the conversion of romanized readings into kanji for Chinese text.

(1) A concept different from the bunsetsu: hierarchical analysis of Chinese text and the constituents of a sentence

Hierarchical analysis is a method that derives a tree expressing the hierarchical relations among the constituents of a sentence and so clarifies the meaning from the structure of the sentence. In the example sentence of Fig. 8(a), the V marks are pauses (停頓): places where a short rest may be taken when the sentence is read aloud slowly and in an explanatory manner. When the maximum number of pauses is taken, the example sentence divides naturally as shown in Fig. 8(b). A division like that of Fig. 8(b) is the most natural way to divide Chinese text and follows the flow of the language, so it would be very convenient for the design of the conversion system if the user of a Chinese word processor pressed the conversion key at each pause boundary. However, ordinary Chinese readers cannot be forced to press the conversion key at every V in the example sentence.

In Japanese, a bunsetsu is whatever can be set off by adding "ne", as in 「故宮はネ」 or 「一四〇六年からネ」, so the concept of the bunsetsu is perfectly clear to everyone. Pause boundaries in Chinese, by contrast, are less definite than bunsetsu in Japanese, and Chinese has no all-purpose, convenient word comparable to "ne". A Chinese word processor based on romanization-to-kanji conversion must therefore be able to divide a sentence into words automatically and convert the romanized readings into kanji one after another, even when the operator ignores pauses and types straight through from one punctuation mark to the next.

(2) The notion of the "word" in Chinese and its indeterminacy

Grammatically, Chinese is an isolating language; lexically, it has historically been a monosyllabic language, that is, a language in which one word is basically expressed by one syllable. Two-syllable vocabulary is now increasing rapidly in Chinese with the modernization of society, but the most frequently used basic words are still largely monosyllabic. In the example sentence, 従, 年, 到, 有, 多, 的 and 了 are such words. (In Japanese these correspond to より, 年, まで, ある, あまり, の and だ, respectively, most of which are function words or attached words written in kana.)

Japanese has 102 kana syllables, whereas standard Chinese has 411 syllables, about four times as many. Even so, according to statistics over 1.8 million characters (syllables), monosyllabic words account for about 48% of the words in Chinese text. Disyllabic words account for about 50%, words of three or four syllables for the remaining 2%, and words of five or more syllables can be regarded as nonexistent. Monosyllabic words thus occur with very high frequency in Chinese.

This is a major reason why the correct-conversion rate does not become high when the longest-match method is adopted for romanization-to-kanji conversion of Chinese text in imitation of Japanese processing. The most troublesome consequence of applying the longest-match method to Chinese is that segmentation errors occur frequently.

Attempting kanji conversion of the example sentence of Fig. 9(a) by the longest-match method gives the result of Fig. 9(b). The unit of conversion is batch conversion from one punctuation mark to the next. Numerals, however, are entered directly from the numeric keys.

Misconverted words are marked with ゛. As can be seen, the longest-match method achieves a correct rate of only 65%. The cause, as Fig. 10 illustrates, is segmentation errors arising from the entanglement of homophones. In Fig. 10, the thick lines and shaded characters show the path of kanji conversion taken by the longest-match method, and the X marks show where it went wrong. The run of consecutive errors in which 到現在已経有 was converted to 導線在意経由 arose because the longest-match method did not take 到 but took the longer 導線 instead. It is an error of the kind described as "fastening the first button of a vest in the wrong hole", and a typical segmentation error.
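The cascade described here can be reproduced with a greedy forward longest-match over syllables. The sketch below is only an illustration: the toy word list, the Pinyin readings (dao xian zai yi jing you) and the function name are assumptions, and real dictionaries would contain many more homophones per entry.

```python
# Toy word list keyed by syllable tuples; one kanji candidate per entry (assumption).
WORDS = {
    ("dao",): "到",
    ("zai",): "在",
    ("you",): "有",
    ("dao", "xian"): "導線",
    ("xian", "zai"): "現在",
    ("zai", "yi"): "在意",
    ("yi", "jing"): "已経",
    ("jing", "you"): "経由",
}

def greedy_longest_match(syllables, words, max_len=2):
    """At each position take the longest dictionary word, then move past it."""
    out, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            key = tuple(syllables[i:i + n])
            if key in words:
                out.append(words[key])
                i += n
                break
        else:
            # Unknown syllable: pass it through unchanged.
            out.append(syllables[i])
            i += 1
    return out

print(greedy_longest_match("dao xian zai yi jing you".split(), WORDS))
```

Because the first step prefers the two-syllable entry over 到, every later boundary shifts by one syllable, so the intended 到 / 現在 / 已経 / 有 is never considered.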

As shown above, the longest-match method, although effective for Japanese kana-kanji conversion, is not necessarily effective when applied to Chinese. The object of the present invention is therefore to provide a phonetic-character-to-kanji conversion system suited to Chinese.

3.4 Means for solving the problems

One means provided by the present invention to solve the above problem is a Chinese word-sound segmentation system that receives, as an input word-sound string, a string of word sounds expressed as a sequence of Chinese syllables and divides the input word-sound string into individual word sounds, the system comprising:

a dictionary that stores, for the monosyllabic and disyllabic word sounds among the word sounds of Chinese, the absolute value of the logarithm of the statistical frequency with which each word sound appears in Chinese text, as the frequency grade (頻級) of that word sound;

a word-sound and frequency-grade retrieval means that searches the dictionary for each monosyllabic word sound of the input word-sound string and reads out from the dictionary that monosyllabic word sound and its frequency grade, and that, when another monosyllabic word sound has been input immediately before it, searches the dictionary for a disyllabic word sound formed by the two monosyllabic word sounds and, if the dictionary contains that disyllabic word sound, reads out the disyllabic word sound and its frequency grade from the dictionary;

a first storage means that stores the monosyllabic word sounds, their frequency grades, the disyllabic word sounds and their frequency grades read out by the retrieval means;

a node determination means that, taking as a node the virtual point on the input word-sound string between a monosyllabic word sound for which the retrieval means found no disyllabic word sound and the monosyllabic word sound input immediately before it, and taking as a sound-frequency phrase (音頻句) the part of the input word-sound string lying between the two most recent nodes, reads out from the first storage means the monosyllabic and disyllabic word sounds corresponding to that sound-frequency phrase together with their frequency grades;

an optimum segmentation pattern generating means that receives the monosyllabic and disyllabic word sounds and their frequency grades read out by the node determination means, generates, as segmentation patterns, the ways of dividing the sound-frequency phrase into word-sound units, and selects and outputs the optimum segmentation pattern from among them; and

a second storage means that stores the segmentation patterns generated by the optimum segmentation pattern generating means,

wherein the optimum segmentation pattern generating means obtains, for each segmentation pattern, the sum of the frequency grades of the word sounds in it, and takes the segmentation pattern whose sum is smallest as the optimum segmentation pattern.
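The segmentation means described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: the frequency values are invented, base-10 logarithms are assumed (the text only says "logarithm"), and the names `FREQ`, `GRADE`, `split_at_nodes` and `best_segmentation` are hypothetical. The search over segmentation patterns is done here with dynamic programming rather than by enumerating every pattern.

```python
import math

# Hypothetical relative frequencies of word sounds; the numbers are made up.
FREQ = {
    ("dao",): 0.004, ("xian",): 0.0005, ("zai",): 0.006,
    ("yi",): 0.008, ("jing",): 0.0004, ("you",): 0.005,
    ("dao", "xian"): 0.00001, ("xian", "zai"): 0.002,
    ("zai", "yi"): 0.00002, ("yi", "jing"): 0.002,
    ("jing", "you"): 0.00003,
}
# Frequency grade: absolute value of the logarithm of the frequency (base 10 assumed).
GRADE = {sound: abs(math.log10(p)) for sound, p in FREQ.items()}

def split_at_nodes(syllables):
    """Cut the input wherever two neighbours do not form a disyllabic entry."""
    phrases, start = [], 0
    for i in range(1, len(syllables)):
        if (syllables[i - 1], syllables[i]) not in GRADE:
            phrases.append(syllables[start:i])
            start = i
    phrases.append(syllables[start:])
    return phrases

def best_segmentation(phrase):
    """Return (grade sum, units) for the split of one phrase with the smallest sum."""
    n = len(phrase)
    # best[i] holds the cheapest way found so far to segment phrase[:i].
    best = [(0.0, [])] + [(math.inf, [])] * n
    for i in range(1, n + 1):
        for size in (1, 2):
            if size > i:
                continue
            unit = tuple(phrase[i - size:i])
            if unit in GRADE:
                cost = best[i - size][0] + GRADE[unit]
                if cost < best[i][0]:
                    best[i] = (cost, best[i - size][1] + [unit])
    return best[n]

for phrase in split_at_nodes("dao xian zai yi jing you".split()):
    print(best_segmentation(phrase))
```

Since each grade is |log p| with p below 1, minimizing the sum of grades is the same as maximizing the product of the frequencies, so frequent short words can outweigh a rare longer word. With the toy values above, the chosen pattern is dao / xian zai / yi jing / you rather than the greedy dao xian / zai yi / jing you.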

Another means provided by the present invention to solve the above problem is a Chinese word-sound-to-kanji conversion system that receives, as an input word-sound string, a string of word sounds expressed as a sequence of Chinese syllables, divides the input word-sound string into individual word sounds, and converts each word sound so divided into kanji, the system comprising:

a dictionary that stores the kanji of each Chinese word under its word sound as the headword, and that also stores, for the monosyllabic and disyllabic word sounds among the word sounds of Chinese, the absolute value of the logarithm of the statistical frequency with which each word sound appears in Chinese text, as the frequency grade of that word sound;

a word-sound and frequency-grade retrieval means that searches the dictionary for each monosyllabic word sound of the input word-sound string and reads out from the dictionary that monosyllabic word sound and its frequency grade, and that, when another monosyllabic word sound has been input immediately before it, searches the dictionary for a disyllabic word sound formed by the two monosyllabic word sounds and, if the dictionary contains that disyllabic word sound, reads out the disyllabic word sound and its frequency grade from the dictionary;

a first storage means that stores the monosyllabic word sounds, their frequency grades, the disyllabic word sounds and their frequency grades read out by the retrieval means;

a node determination means that, taking as a node the virtual point on the input word-sound string between a monosyllabic word sound for which the retrieval means found no disyllabic word sound and the monosyllabic word sound input immediately before it, and taking as a sound-frequency phrase the part of the input word-sound string lying between the two most recent nodes, reads out from the first storage means the monosyllabic and disyllabic word sounds corresponding to that sound-frequency phrase together with their frequency grades;

an optimum segmentation pattern generating means that receives the monosyllabic and disyllabic word sounds and their frequency grades read out by the node determination means, generates, as segmentation patterns, the ways of dividing the sound-frequency phrase into word-sound units, and selects and outputs the optimum segmentation pattern from among them;

a second storage means that stores the segmentation patterns generated by the optimum segmentation pattern generating means; and

a kanji string generating means that searches the dictionary for each word sound delimited in the optimum segmentation pattern, reads out from the dictionary one of the kanji entries having that word sound as its headword, and generates the kanji string corresponding to the input word-sound string,

wherein the optimum segmentation pattern generating means obtains, for each segmentation pattern, the sum of the frequency grades of the word sounds in it, and takes the segmentation pattern whose sum is smallest as the optimum segmentation pattern.
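The kanji string generating step can be pictured as a dictionary lookup per segmented word sound. The sketch below is an assumption-laden illustration: the toy dictionary, the candidate order and the rule of taking the first-listed candidate are all hypothetical, since the text only says that one of the kanji entries under the headword is read out.

```python
# Toy kanji dictionary keyed by word sound; candidate order is an assumed preference.
KANJI = {
    ("dao",): ["到", "道"],
    ("xian", "zai"): ["現在"],
    ("yi", "jing"): ["已経"],
    ("you",): ["有", "又"],
}

def to_kanji(segmentation, kanji_dict=KANJI):
    """Convert a list of word sounds (syllable tuples) into a kanji string."""
    out = []
    for unit in segmentation:
        candidates = kanji_dict.get(unit)
        if candidates:
            # Take the first-listed homophone; a real system would rank candidates.
            out.append(candidates[0])
        else:
            # No entry: leave the romanized syllables in place.
            out.append(" ".join(unit))
    return "".join(out)

segmentation = [("dao",), ("xian", "zai"), ("yi", "jing"), ("you",)]
print(to_kanji(segmentation))
```

Keeping segmentation and kanji lookup as separate steps reflects the structure of the means listed above: the choice among homophones is made only after the word boundaries have been fixed.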

3.5 Operation

The present invention consists of a word-sound segmentation system to which the sound-frequency method (音頻法) is applied and a system that performs word-sound-to-kanji conversion on the basis of that segmentation system. The sound-frequency method is a method of dividing Chinese text into word sounds and is entirely without precedent. Compared with the basic techniques now widely used for Japanese kana-kanji conversion, namely the concept of the bunsetsu, the longest-match method and morphological analysis of bunsetsu, word-sound-segmented kanji conversion by the sound-frequency method rests on an entirely different idea and may be hard to grasp for those accustomed to Japanese-language processing techniques. The following therefore explains mainly the sound-frequency method in detail, using example sentences.

3.5.1 Word sounds and word-sound information content in Chinese text

Fig. 11 shows an example of word sounds in Chinese text and their information content. Here a "word sound" (語音) means the reading of an individual word in a text written in kanji. The reading of an individual kanji is a "kanji sound" (漢字音), not a word sound.

The meanings of "word sound" and "kanji sound" are clarified here.

Kanji are ideographs, and each kanji carries its own information in three aspects: meaning (義), sound (音) and form (形). The sound of a kanji is, without exception, a single syllable, and is called the "character sound" (字音). Since Chinese text is written using kanji alone, the reading of a Chinese text can, from one point of view, be regarded as a string of one-syllable character sounds. Buddhist sutras are an example of text read by character sound: in a sutra the atmosphere of the sound matters more than the meaning, so reading by character sound suffices. In classical Chinese, one kanji was by and large one word grammatically as well, but in modern Chinese two-character vocabulary predominates. Nevertheless the force of tradition is strong: Chinese text is written without spaces between words, and Chinese dictionaries are dominated by the arrangement in which one first looks up a kanji and then looks up the words beginning with that kanji (as in Japanese kanji dictionaries).

(There are no dictionaries arranged by pronunciation in China.) When Chinese people read Chinese text aloud, too, a tendency toward monotone, character-by-character reading remains.

On the other hand, one can set kanji aside and write the reading of a Chinese text syllable by syllable, using roman letters or the like as phonetic symbols. In that case, writing with spaces between words becomes indispensable.

The syllable string of each individual word written in this spaced form is called a "word sound". Depending on the number of syllables, word sounds are monosyllabic, disyllabic, three-syllable, four-syllable, and so on.

Even when a text is written in kanji alone, the rational way to read Chinese aloud is with the word sounds in mind.

The essence of the present invention lies in the idea of processing the syllable string of a Chinese text input as word sounds, segmenting it into words automatically, and only then converting it into kanji word by word. In the system of the invention, however, the input syllables are divided into word sounds automatically, so even if the person entering the syllables thinks of them as character sounds, they are processed exactly as if they had been entered as word sounds. In other words, the information accepted as input need only be a syllable string; it makes no difference whether it was conceived as word sounds or as character sounds.

The present invention presupposes that Chinese text is input sequentially, one syllable at a time. Row (1) of FIG. 11 gives the numbers indicating the order in which the syllables are input.

Row (2) of the figure shows a 14-character Chinese example sentence, which means "A Chinese input system of this type has many advantages." Row (3) shows the character sounds of the kanji in row (2); the romanization follows China's official pinyin system. One kanji always corresponds to exactly one syllable. From here on, setting the kanji aside entirely and attending only to the reading, row (3) is a syllable string of 14 syllables.

Examining row (3) from number 1 onward for one-syllable words shows, as in row (8), that a monosyllabic word actually exists for every syllable. Hence, as in row (5), the monosyllabic word-sound symbols a through n for positions 1 to 14 all have content: zhe, shi, zhong, ..., dian, respectively.

In Chinese there is a principle that every syllable has a corresponding monosyllabic word sound. No syllable in Chinese is used only as part of a word of two or more syllables (a polysyllabic word) without also being a monosyllabic word sound in its own right. In the figure, monosyllabic word sounds are denoted RS, where R stands for Read and S for Single.

The frequency in row (6) is the probability of occurrence of each monosyllabic word sound in Chinese text; the values were obtained from a survey of literature totaling about 1.8 million characters. For example, the occurrence probability $p_s$ (p for probability) of the monosyllabic word sound read shi is

$p_s = 159.04 \times 10^{-4} = 1.59\%$

so this word sound appears on average about once in every 63 characters.

Row (7) shows the information-content frequency class IS of each monosyllabic word sound (hereinafter abbreviated as frequency class, IS, or frequency class IS). IS (I for Information) is defined as

$IS = -\log_2 p_s$

The base of the logarithm is taken to be 2 here, but in theory any real number greater than 1 would do. The value of IS is truncated to an integer by discarding the fractional part and is expressed in steps labeled as classes.

The most frequent monosyllabic word sound in Chinese is de (the word sound of 的, 地, and 得, which function roughly as "of", an adjectival or adverbial marker, and a verbal connective, respectively), for which $p_s = 0.0472$ and IS = class 4. In this example IS is capped at class 18: every word sound with $p_s \le 2^{-18}$ (about 0.000004) is simplified to IS = class 18.

IS is thus normalized to 15 steps (classes 4 through 18). The smaller IS is, the more often the word sound occurs: when IS falls by one class, the occurrence probability doubles and the information content decreases by one bit.
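The class computation described above can be sketched in a few lines of Python. This is a minimal illustration under the assumptions stated in the text (base-2 logarithm, truncation to an integer, cap at class 18); the function name `frequency_class` and the constant `MAX_CLASS` are hypothetical and do not come from the patent.

```python
import math

MAX_CLASS = 18  # cap taken from the text: p <= 2**-18 is treated as class 18


def frequency_class(p: float, max_class: int = MAX_CLASS) -> int:
    """Return the truncated value of -log2(p), capped at max_class."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be a probability in (0, 1]")
    return min(math.floor(-math.log2(p)), max_class)


print(frequency_class(0.0472))     # de: class 4, as stated in the text
print(frequency_class(159.04e-4))  # shi: -log2(0.015904) is about 5.97, truncated to 5
print(frequency_class(1e-9))       # rarer than 2**-18, so capped at 18
```

The same probability for shi also gives the "once in every 63 characters" figure, since 1 / 0.015904 is about 62.9.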

The monosyllabic vocabulary in row (8) lists, in kanji, the actual monosyllabic words (those listed in the dictionary) that have each of the monosyllabic word sounds given in row (5). They are listed in order of statistical frequency. For the word sound shi, for example, 是 ("to be") is the most frequent, followed by 使 ("to cause to") and 時 ("when"). The 式 of the example sentence ("type" or "style", as in "Morita-style") ranks 11th in frequency.

In Chinese, monosyllabic word sounds have a considerable number of homophones, as the example shows. Among the homophones, however, a small number of high-frequency words often account for most of the occurrences of the word sound. For the word sound shi, for example, the top-ranked 是 alone accounts for 76% of occurrences; the cumulative share is 82% through the second-ranked 使 and 87% through the third-ranked 時.

Rows (9) to (13) show the same kind of data for disyllabic word sounds (the sounds of words consisting of two syllables, that is, of words written with two kanji). The information-content frequency class ID of a disyllabic word sound (D for Double) is defined, with $p_d$ being the ratio of that word sound's frequency to the total number of syllables in the statistics, as

$ID = -\log_2 (2 \times p_d)$

and is normalized to 15 steps in the same way as IS (hereinafter abbreviated as frequency class, ID, or frequency class ID).

As rows (9) to (13) show, the disyllabic symbols hi, jk, and lm of the example sentence have no word-sound information.

This is because the disyllabic word sounds tongju, youxu, and duoyou do not exist in Chinese: there are no two-syllable words with these sounds. Among the disyllabic word sounds there are many rarely used ones with ID = 18 or close to 18.

On the other hand, there are also high-frequency word sounds that reach ID = 7, such as women ("we") and zhuyi ("doctrine", "attention").
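As a worked check of what ID = 7 means, assume that ID is the truncated value of $-\log_2(2 p_d)$, in the same way that IS is the truncated value of $-\log_2 p_s$ (the truncation step for ID is an assumption carried over from the IS definition):

```latex
ID = \lfloor -\log_2 (2 p_d) \rfloor = 7
\iff 2^{-8} < 2 p_d \le 2^{-7}
\iff 2^{-9} < p_d \le 2^{-8}
\iff 0.00195 < p_d \le 0.00391
```

Under that assumption, a disyllabic word sound of class 7 occurs roughly once in every 256 to 512 syllables of running text.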

Disyllabic word sounds generally have fewer homophones than monosyllabic word sounds.

Word sounds of three or four syllables are, on average, extremely rare in Chinese. They also have almost no homophones. These word sounds almost never interfere with monosyllabic and disyllabic word sounds in a way that would confuse word-sound segmentation by the sound-frequency method. "No interference" means, for example, that when the syllable string abcde contains the trisyllabic word sound cde and the three disyllabic word sounds ab, bc, and de, the segmentation a/bc/de is almost never correct and ab/cde is almost always correct. Therefore, when a three- or four-syllable word sound is detected in the syllable string while the sound-frequency method is being executed, it is removed from sound-frequency segmentation and converted to kanji independently. Sound-frequency segmentation (word-sound segmentation by the sound-frequency method) is applied only to the remaining portions of the string, which consist solely of monosyllabic and disyllabic word sounds. This greatly simplifies the algorithm for automatic sound-frequency segmentation.

For these reasons, the present invention applies word-sound segmentation by the sound-frequency method only to word-sound strings composed solely of monosyllabic and disyllabic word sounds.

3.5.2 Structural model of a Chinese syllable string

FIG. 12 shows a model of a Chinese syllable string constructed by abstracting away the kanji. The model is based on the example sentence of FIG. 11.

(1) of FIG. 12 is the system of the 14 syllables a through n of the example sentence. Input in units of syllables starts at the first point on the axis and ends at the last point. The model assumes that a punctuation mark is input after that last point.

The straight axis connecting the first point to the last point is a "time" axis that shows the passage of time during input. Each point represents the moment immediately before a syllable is input. The monosyllabic word-sound symbols a to n that have been input are written between adjacent pairs of points. The route along the axis is the path that passes through monosyllabic word sounds only; as shown in (2) of FIG. 12, it is called the "axis path".

When a disyllabic word sound exists in the syllable string, a bypass is drawn, as shown in (2) of the same figure, that straddles two adjacent monosyllables and short-circuits the two points on their outer sides, and the corresponding disyllabic word-sound symbol is placed on it. With one axis path and many bypasses, the system diagram takes on a chain-like shape. A diagram that models a Chinese word-sound string as a word-sound connection diagram in this way is named the "87i, TF network".

The start point of the syllable string corresponding to the beginning of the source text, and the "points" immediately before and after any single non-converted character or symbol, or any run of them (punctuation marks, digits, Roman letters, and the like), divide the syllable string according to external conditions when the sound-frequency method is executed. They are therefore called "external break points". In the example of FIG. 12, the start point of the syllable string and the points adjoining the closing punctuation mark are external break points.

A point is called a node when no disyllabic word sound exists that joins the two monosyllables immediately before and after it. A point is also a node when a three- or four-syllable word sound exists immediately before or after it, regardless of whether a disyllabic word sound straddling the point exists. External break points are nodes as well.

The word-sound string formed by all the monosyllabic and disyllabic word sounds on the axis path and bypasses between two adjacent nodes is named a "sound-frequency phrase".

When the span between two adjacent nodes is a three- or four-syllable word sound, that span is named a "polysyllabic phrase".

When two adjacent nodes are both external break points and the span between them contains only a single non-converted character or symbol, or a run of them (Roman letters, digits, and the like), that span is named a "non-converted phrase".

The span between two adjacent external break points is called a "sentence". A sentence usually consists of sound-frequency phrases and polysyllabic phrases. The sentence illustrated in FIGS. 11 and 12 consists of four sound-frequency phrases.
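The node rule can be illustrated with a short Python sketch that cuts a syllable string wherever no disyllabic word sound straddles a point. It is a simplified illustration: it ignores three- and four-syllable word sounds and external break points, and the dictionary contents are an assumption. The text states only that tongju, youxu, and duoyou do not exist; treating every other adjacent pair of the example sentence as a dictionary entry is assumed here so that the example runs. The name `split_into_phrases` is hypothetical.

```python
def split_into_phrases(syllables, disyllabic):
    """Cut the string at every node, i.e. wherever no disyllabic word sound joins two neighbours."""
    if not syllables:
        return []
    phrases = [[syllables[0]]]
    for prev, cur in zip(syllables, syllables[1:]):
        if (prev, cur) in disyllabic:
            phrases[-1].append(cur)   # a bypass exists, so the phrase continues
        else:
            phrases.append([cur])     # node: start a new sound-frequency phrase
    return phrases


# The 14 syllables a..n of the example sentence.
SYLLABLES = "zhe shi zhong wen shu ru xi tong ju you xu duo you dian".split()
# Pairs the text says do not exist as disyllabic word sounds (hi, jk, lm).
MISSING = {("tong", "ju"), ("you", "xu"), ("duo", "you")}
# Assumption for illustration: every other adjacent pair is a dictionary entry.
DISYLLABIC = {pair for pair in zip(SYLLABLES, SYLLABLES[1:]) if pair not in MISSING}

for phrase in split_into_phrases(SYLLABLES, DISYLLABIC):
    print(" ".join(phrase))
```

With these assumed entries the sketch yields four phrases of 8, 2, 2, and 2 syllables, which agrees with the statement that the example sentence consists of four sound-frequency phrases.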

The sound-frequency method operates on individual sound-frequency phrases only and finds the optimum word-sound segmentation of each such syllable string. The word-sound-to-kanji conversion system of the present invention then converts each word sound obtained by this segmentation to kanji.

3.5.3 Sound-frequency network

The form in which a frequency class is attached to each monosyllabic and disyllabic symbol in a sound-frequency phrase, as shown in FIG. 12, is called a "sound-frequency network". (3) of FIG. 12 is the sound-frequency network for the example sentence of FIG. 11 (2). In its pattern, the sound-frequency network resembles a PERT project network, which links many events by arrowed activities and shows the temporal order and connections of the activities. The sound-frequency method itself, however, is better regarded as a kind of DP (dynamic programming) as used for finding an optimal path.

As seen in FIG. 12, a sound-frequency network composed solely of monosyllabic and disyllabic word sounds has an extremely simple pattern. It is a continuous pattern made up of monosyllabic symbols lined up at equal intervals on the axis path and disyllabic symbols on bypasses arranged above and below the axis path, offset from one another by one syllable period.

The representation in (3) of FIG. 12 is easier to draw and to understand when redrawn in the two-tier stacked-block form of (4) of the same figure. From here on, sound-frequency networks are described using the pattern of FIG. 12 (4).

3.5.4 Minimum sound-frequency path and the theory of optimum word-sound segmentation

The minimum sound-frequency path is the path for which the sum IT of the frequency classes of the individual word sounds on it (hereinafter IT is called the class sum) is smallest. The thick line in (4) of FIG. 12 is the minimum sound-frequency path of the example sentence. In sound-frequency phrase 1 of the example, the class sum IT along the thick-line path is 52, which is smaller than the class sum of any other path. If the string is segmented along the minimum sound-frequency path and each word sound is then converted to the most frequent kanji word among its homophones, a result very close to the correct text is obtained, as in (5) of FIG. 12. In this example only the conversion of shi at position b is wrong, yielding 是 instead of the correct 式, which is a homophone error. Note that the word-sound segmentation itself is entirely correct.
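Since the text describes the sound-frequency method as a kind of dynamic programming, the search for the minimum class-sum path can be sketched as follows. This is an illustrative shortest-path formulation, not the procedure of the embodiment, and all class values are made up for the example; the function name `min_class_sum_path` and the single-letter syllables are hypothetical.

```python
def min_class_sum_path(syllables, mono_class, di_class):
    """Return (segments, class_sum) for the segmentation with the smallest class sum IT."""
    n = len(syllables)
    best = [0] + [float("inf")] * n  # best[i]: smallest class sum over the first i syllables
    step = [0] * (n + 1)             # step[i]: length (1 or 2) of the last word sound on that path
    for i in range(1, n + 1):
        syl = syllables[i - 1]
        # Extend along the axis path with a monosyllabic word sound.
        if syl in mono_class and best[i - 1] + mono_class[syl] < best[i]:
            best[i], step[i] = best[i - 1] + mono_class[syl], 1
        # Extend along a bypass with a disyllabic word sound, if one exists.
        if i >= 2:
            pair = (syllables[i - 2], syl)
            if pair in di_class and best[i - 2] + di_class[pair] < best[i]:
                best[i], step[i] = best[i - 2] + di_class[pair], 2
    if best[n] == float("inf"):
        raise ValueError("no segmentation exists for this syllable string")
    segments, i = [], n
    while i > 0:                     # walk back from the end along the recorded steps
        segments.append(tuple(syllables[i - step[i]:i]))
        i -= step[i]
    return segments[::-1], best[n]


# Made-up frequency classes for a four-syllable phrase a b c d.
MONO = {"a": 6, "b": 5, "c": 7, "d": 8}
DI = {("a", "b"): 9, ("b", "c"): 8, ("c", "d"): 10}

print(min_class_sum_path(list("abcd"), MONO, DI))
# ([('a', 'b'), ('c', 'd')], 19)
```

In this sketch ties are resolved in favor of the path found first (the one ending in a monosyllabic word sound), which is only one possible policy.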

FIG. 16, described later, shows an example in which word sounds spanning 216 syllables are converted to kanji by the system of the present invention. The figure shows that, with the sound-frequency method adopted in the present invention, the path of optimum word-sound segmentation comes very close to the true word segmentation.

The present invention is based on the law that "the word-sound segmentation along the minimum sound-frequency path is the segmentation closest to the true one". This law was discovered by the present inventor. That it fits Chinese well is demonstrated by the kanji conversion results for the long example text of 216 syllables in FIG. 16.

The reasons why this law holds are thought to be as follows.

(1) Chinese has 411 monosyllabic word sounds, but their statistical frequencies of use are by no means uniform; the distribution is highly skewed.

(2) The disyllabic word sounds of Chinese do not number anywhere near the simple theoretical maximum of $411^2$ (about 169,000). When tones (the four accents per syllable characteristic of Chinese) are ignored, as in the present invention, there are about 30,000, and the principal ones number far fewer than 10,000. The pairs of syllables that can combine into a disyllabic word sound are therefore quite restricted.

The statistical frequencies of use of the disyllabic word sounds that actually exist are also very unevenly distributed across the individual word sounds.

(3) For polysyllabic word sounds of three or four syllables, the peculiarity of the syllable combinations and the skew in the frequencies of the existing word sounds are even greater.

(4) In a time series of word sounds, that is, a "phrase", the way the constituent word sounds follow one another is by no means random. It is influenced by grammar and context and has the order and bias characteristic of natural language.

(5) To summarize the above:

"When individual syllables form monosyllabic, disyllabic, and other word sounds, and when those word sounds are connected in sequence to form phrases and sentences, their formation and connection follow an order inherent in natural language."

As a result:

"In natural language, the syllable structure of each word sound, the segmentation into word sounds, and the connection of word sounds must arise of their own accord in such a way that the sum of the information content of the word sounds is minimized."

This law might be called the "law of minimum word-sound information entropy in natural language".

This law fits Chinese very well.

The automatic word-sound segmentation system of the present invention applies this law. For each sound-frequency phrase of Chinese text input in units of syllables, it performs automatic word-sound segmentation by the sound-frequency method, dispensing with the operation of a so-called conversion key. The word-sound-to-kanji conversion system of the present invention uses this automatic segmentation to divide Chinese text into individual word sounds as close to the true segmentation as possible, and converts each resulting word sound to kanji.

3.6 Embodiment

The present invention is now described in detail with reference to an embodiment.

Because the core of the present invention lies in the sequential automatic word-sound segmentation of sound-frequency phrases, which consist solely of monosyllabic and disyllabic word sounds, the processing of three- and four-syllable word sounds is described only briefly.

3.6.1 Configuration of sound-frequency word-sound segmentation processing

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention (a word-sound-to-kanji conversion system) that applies word-sound segmentation to a sound-frequency phrase and converts the word sounds produced by that segmentation to kanji. FIG. 13 is a table showing the memory contents required in the storage means of FIG. 1 other than the dictionary. The word-sound segmentation system and the word-sound-to-kanji conversion system according to the present invention are described concretely with reference to the embodiment of FIG. 1.

The functions of the means blocks in FIG. 1 are as follows.

1: Dictionary
A storage means that holds the following data:
(1) word-sound headings expressed as syllable strings;
(2) the frequency class of each word sound;
(3) the homophonic kanji vocabulary for each word sound (in kanji code).

2: Word-sound and frequency-class retrieval means
Each time a syllable is input from the syllable input means (keyboard) 11, means 2 retrieves the following data from dictionary 1 and sends it to storage means 3.

(1) The monosyllabic word sound RS_n of that syllable.
(2) The frequency class IS_n of RS_n.
(3) When n is 2 or more, whether the disyllabic word sound RD_n obtained by joining RS_{n-1} and RS_n exists. (If RD_n does not exist, the information that it is absent is sent to the node judging means 4 and not to storage means 3.)
(4) If RD_n exists, its content and its frequency class ID_n.

(5) When n is 3 or more, whether a trisyllabic word sound RT_n ending in that syllable exists and, if so, the content of RT_n.
(6) When n is 4 or more, whether a four-syllable word sound RQ_n ending in that syllable exists and, if so, the content of RQ_n.

RT_n and RQ_n are polysyllabic word-sound data. When they are absent, that information is sent to the node judging means 4; when one is present, the word sound RT_n or RQ_n is sent to storage means 3. When RQ_n is present, RT_n, RD_n, RS_n, ID_n, and IS_n are not sent to storage means 3. When RT_n is present, RD_n, RS_n, ID_n, and IS_n are not sent to storage means 3.
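The retrieval rules for means 2 can be sketched as a single lookup function. This is a rough illustration under simplifying assumptions: the dictionary is modeled as a mapping from syllable tuples to frequency classes plus a set of polysyllabic entries, the "absent" notifications sent to the node judging means are not modeled, and every syllable is assumed to have a monosyllabic entry. The function name `retrieve`, the data layout, and the toy entries are all hypothetical.

```python
def retrieve(syllables, classes, polysyllabic):
    """Return the data forwarded to storage after the latest syllable of `syllables` is input."""
    n = len(syllables)
    # A four-syllable word sound ending here suppresses everything else.
    if n >= 4 and tuple(syllables[-4:]) in polysyllabic:
        return {"RQ": tuple(syllables[-4:])}
    # A trisyllabic word sound ending here suppresses RD, RS, ID and IS.
    if n >= 3 and tuple(syllables[-3:]) in polysyllabic:
        return {"RT": tuple(syllables[-3:])}
    rs = (syllables[-1],)
    out = {"RS": rs, "IS": classes[rs]}
    if n >= 2:
        rd = tuple(syllables[-2:])
        if rd in classes:            # the disyllabic word sound exists in the dictionary
            out["RD"], out["ID"] = rd, classes[rd]
    return out


# Toy dictionary with made-up classes (not data from the patent).
CLASSES = {("a",): 6, ("b",): 5, ("c",): 7, ("a", "b"): 9}
POLY = {("a", "b", "c")}

print(retrieve(["a"], CLASSES, POLY))            # only RS and IS
print(retrieve(["a", "b"], CLASSES, POLY))       # RS, IS, RD and ID
print(retrieve(["a", "b", "c"], CLASSES, POLY))  # only RT
```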

3: Storage means for the monosyllabic word sound (RS_n), the frequency class of the monosyllabic word sound (IS_n), the disyllabic word sound (RD_n), the frequency class of the disyllabic word sound (ID_n), the trisyllabic word sound (RT_n), and the four-syllable word sound (RQ_n)

4: Node judging means
Judges the node at which a sound-frequency phrase ends.

(1) When neither RT_n nor RQ_n has been retrieved by the word-sound and frequency-class retrieval means 2:
(i) When RD_n is absent: the point immediately before RS_n is judged to be a node, the syllable input number is reset to 1, and RS_n and IS_n are stored again in storage means 3 as RS_1 and IS_1.
(ii) When RD_n is present: the information RS_n, IS_n, RD_n, and ID_n is sent from storage means 3 to the optimum segmentation pattern generating means 5.

(2) When RQ_n or RT_n has been retrieved by the word-sound and frequency-class retrieval means 2:
(i) When RQ_n exists, processing proceeds as though there were an external break at n-4; when RT_n exists, processing proceeds as though there were an external break at n-3.

5: Optimum word-sound segmentation pattern generating means
For a sound-frequency phrase, generates the word-sound segmentation patterns for the syllable string from the beginning of the phrase being processed up to the most recently input syllable, and sends them to the segmentation pattern storage means 6.

The main function of the optimum word-sound segmentation pattern generating means 5 is to generate the optimum word-sound segmentation pattern and send it to the kanji string conversion means 7; the method of generation is described later.

6: Word-sound segmentation pattern storage means

7: Kanji string conversion means
Receives the data of the optimum word-sound segmentation pattern from the optimum word-sound segmentation pattern generating means 5, looks up the kanji vocabulary of dictionary 1 in turn using each segmented word sound as a heading, selects from among the homophones the kanji word judged at that moment to be closest to the intended word, and outputs a kanji word string by concatenating the selected words. The kanji word string is sent to the kanji word string storage means 8 and the document storage means 12.
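The simplest selection policy for means 7, the one used for FIG. 12 (5), is to take the most frequent homophone for each word sound. A minimal sketch, assuming the dictionary stores the homophones of each word sound in descending order of frequency: the order 是, 使, 時 for shi comes from the text, while the entry for zhongwen and the names `HOMOPHONES` and `to_kanji` are assumptions made for illustration.

```python
# Homophones per word sound, most frequent first.
HOMOPHONES = {
    ("shi",): ["是", "使", "時"],   # order given in the text
    ("zhong", "wen"): ["中文"],     # assumed entry, for illustration only
}


def to_kanji(segments, homophones=HOMOPHONES):
    """Convert each segmented word sound to its most frequent kanji word and concatenate."""
    return "".join(homophones[segment][0] for segment in segments)


print(to_kanji([("shi",), ("zhong", "wen")]))  # 是中文
```

This policy reproduces the homophone error discussed for FIG. 12: shi is rendered as 是 even where 式 is intended, because 是 is the most frequent candidate.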

8: Kanji string storage means

3.6.2 Generation of the optimum word-sound segmentation pattern

The function and operation of the optimum word-sound segmentation pattern generating means 5 in FIG. 1 are described in detail here.

FIG. 14 shows, for each syllable input number n, all the segmentation patterns that are possible in an n-syllable phrase; n = 1 to 8 are given as examples.

In the figure, syllables are represented by Roman letters such as a, b, and c.

"/" marks a word-sound break. The binary number shown to the right of each segmentation pattern is the numerical representation T_n of that pattern. The meaning of T_n is shown by an example:

Segmentation pattern: ab/cd/e
T_5: 10101

The most significant digit, 1, marks the starting point of the romanized syllable string and is always present. In the digits that follow, 0 indicates that there is no break between two syllables and 1 indicates that there is a break. No "/" is placed at the end of the romanized form because, until the next syllable is input, it is not yet determined whether the last syllable is a monosyllabic word sound or the first syllable of a disyllabic word sound.

T_n is called the "segmentation variable". T_n gives a concise general representation of the segmentation patterns of a sound-frequency phrase.
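The binary encoding of a segmentation pattern can be sketched as a pair of helper functions. This follows the rule given in the text (a leading 1, then 0 for "no break" and 1 for "break" between consecutive syllables); the function names `encode_pattern` and `decode_pattern` are hypothetical.

```python
def encode_pattern(segments):
    """Encode a segmentation, given as a list of syllable lists, as a bit string."""
    bits = ["1"]                                # leading 1 marks the start of the string
    for k, segment in enumerate(segments):
        bits.extend("0" * (len(segment) - 1))   # no break inside a word sound
        if k < len(segments) - 1:
            bits.append("1")                    # break before the next word sound
    return "".join(bits)


def decode_pattern(syllables, bits):
    """Rebuild the segmentation of `syllables` described by the bit string `bits`."""
    segments = [[syllables[0]]]
    for syllable, bit in zip(syllables[1:], bits[1:]):
        if bit == "1":
            segments.append([syllable])
        else:
            segments[-1].append(syllable)
    return segments


print(encode_pattern([["a", "b"], ["c", "d"], ["e"]]))  # 10101
print(decode_pattern(list("abcde"), "10101"))
```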

The "count" in the bottom row of FIG. 14 is the number of possible segmentation patterns for each n. If the number of possible patterns is N_n, the following rule holds:

$N_n = N_{n-1} + N_{n-2}$

N_n therefore grows rapidly as n increases, but during sequential input of syllables the number of patterns remains small as long as n is small.
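The growth of the pattern count follows directly from the recurrence, starting from the one pattern for n = 1 (a) and the two patterns for n = 2 (ab and a/b). A minimal sketch; the function name `pattern_counts` is hypothetical.

```python
def pattern_counts(max_n):
    """Return [N_1, ..., N_max_n] using N_n = N_(n-1) + N_(n-2) with N_1 = 1 and N_2 = 2."""
    counts = [1, 2]
    while len(counts) < max_n:
        counts.append(counts[-1] + counts[-2])
    return counts[:max_n]


print(pattern_counts(8))  # [1, 2, 3, 5, 8, 13, 21, 34]
```

Even an eight-syllable phrase has only 34 candidate patterns under this recurrence, so exhaustive comparison is cheap for phrases of typical length.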

For each segmentation pattern, the frequency classes of its constituent word sounds are summed to give the class sum IT_n of that pattern. The values of IT_n for all patterns are then compared, and the pattern with the smallest IT_n is identified as the optimum segmentation variable Topt_n.
, and the type with the smallest IT can be found as the optimal delimited variable T op t .

The general method for generating all the segmentation patterns possible for a given n is as follows.

(1) A sound-frequency phrase contains only monosyllabic and disyllabic word sounds. It follows that the set of patterns obtained by appending a new disyllabic word sound to each pattern of T_n, together with the set of patterns obtained by appending a new monosyllabic word sound to each pattern of T_{n+1}, is the set of patterns of T_{n+2}. FIG. 15 illustrates an example in which T_4 is generated from T_2 and T_3.

(2) Following this method, start from the pattern a for n = 1 and the patterns ab and a/b for n = 2. Each time n increases by 1, find the patterns for the new n, and add the frequency class of the appended monosyllabic or disyllabic word sound to the IT value of the corresponding pattern for n-1 or n-2. In this way all possible patterns and their class sums IT_n can be obtained in sequence for every n.
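The incremental generation described in (1) and (2) can be sketched as follows: the patterns for n syllables are built from those for n-1 (by appending a monosyllabic word sound) and n-2 (by appending a disyllabic word sound), carrying the class sum along. The frequency classes are made up for illustration, and the name `build_levels` and the data layout are hypothetical.

```python
def build_levels(syllables, mono_class, di_class):
    """Return {n: [(segments, class_sum), ...]} for n = 0..len(syllables)."""
    levels = {0: [([], 0)]}
    for n in range(1, len(syllables) + 1):
        syl = syllables[n - 1]
        level = []
        # Append a monosyllabic word sound to every pattern for n-1.
        if syl in mono_class:
            for segments, total in levels[n - 1]:
                level.append((segments + [(syl,)], total + mono_class[syl]))
        # Append a disyllabic word sound to every pattern for n-2.
        if n >= 2:
            pair = (syllables[n - 2], syl)
            if pair in di_class:
                for segments, total in levels[n - 2]:
                    level.append((segments + [pair], total + di_class[pair]))
        levels[n] = level
    return levels


# Made-up frequency classes for a four-syllable phrase a b c d.
MONO = {"a": 6, "b": 5, "c": 7, "d": 8}
DI = {("a", "b"): 9, ("b", "c"): 8, ("c", "d"): 10}

levels = build_levels(list("abcd"), MONO, DI)
for segments, total in sorted(levels[4], key=lambda item: item[1]):
    print(total, "/".join("".join(s) for s in segments))
```

With these values the five patterns for n = 4 have class sums 19, 21, 22, 24, and 26, so ab/cd is the optimum pattern.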

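FIG. 15 shows the result of calculating the class sum IT_n for every segmentation variable T_n of FIG. 14 and finding, for each n, the minimum value ITopt_n and the corresponding pattern Topt_n. In the figure, Topt_n and ITopt_n are shown in bold.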
FIG. 15 shows the results of calculating the frequent sum IT for all the delimited variables T shown in FIG. 14, and finding the minimum value ITopt of ITka and the corresponding type Toptfi for each n. In this figure, the large letters are T op t and ITopt.

The bottom row of FIG. 15 shows the result of word-sound-to-kanji conversion performed by the kanji string conversion means 7 according to each Topt_n. For n = 3, the pattern with the minimum IT_n is not unique. If the sound-frequency phrase ends at n = 3, the solution is therefore undetermined. In that case the solution is fixed by one or both of the following conditions:

(1) Take as correct the pattern whose final word sound is monosyllabic, or alternatively the one whose final word sound is disyllabic, according to a fixed choice.
(2) Take as correct the pattern with more breaks, or alternatively the one with fewer breaks, according to a fixed choice.
The solution shall be determined by one or both of the conditions.

3.7 Effects of the invention

3.7.1 Example results of optimum sound-frequency segmentation and kanji conversion

FIG. 16 shows the results of optimum word-sound segmentation by the sound-frequency method, and of kanji conversion, for a somewhat long example text. In this text, 219 syllables require kanji conversion; of these, the 216 syllables that remain after excluding the 3 syllables of a personal name are subject to the sound-frequency method.
There are 219 syllables that require Kanji conversion, of which 216 syllables, excluding the 3 syllables of personal names, are subject to the phonetic method.

The example sentence is written in 12 rows, (1) through (12). The symbols are defined as follows.

1) In each row, line 1 is the example sentence, line 2 is its reading in the Chinese national romanization (pinyin), and line 3 shows the nodes and phrases. Lines 4 to 6 show the sound-frequency structure in the format of FIG. 12(5). However, each monosyllabic or disyllabic word sound subject to sound-frequency processing is represented not by a word-sound symbol such as a or ab, but by the most frequent kanji word among the homophones that have that word sound. Strictly, a, ab, etc. should be written either as romanized syllables or as the full set of homophonous kanji words, but because of space limitations FIG. 16 shows one representative kanji word.

2) In line 3, "■" is a node, "□" is a sound-frequency phrase, "" is an externally delimited node, "多" is a polysyllabic phrase, "非" is a non-conversion phrase, and "人" is a phrase treated as a personal name.

3) In the sound-frequency network of lines 4 to 6, the thick solid line is the optimal word-sound segmentation path. At two places in row (6) and one place in row (9), a thick dotted path coexists with the thick solid path. In these cases two types gave the minimum IT. In accordance with 3.6.2, the type whose final word sound is disyllabic was taken as true and drawn as the thick solid path, and the type whose final word sound is monosyllabic was taken as false and drawn as the thick dotted path.

4) Underlining indicates polysyllabic word sounds. Polysyllabic word sounds appear at six places in this example sentence. All of them are three-syllable word sounds, and no four-syllable word sounds occur.

5) The small numerals below the word sounds are frequency classes. The frequency-class values were calculated from the Modern Chinese Frequency Dictionary (現代漢語頻率詞典), 1986 edition, compiled by the Beijing Language Institute. The word vocabulary was based on the 1963 edition of the Hanyu Pinyin Cihui of the Chinese Character Reform Committee. Disyllabic word sounds found in the latter but not in the former were uniformly assigned frequency class 18. For commonly used disyllabic words (including place names such as Beijing and other common place names) that appear in neither source but were judged necessary for the dictionary system, the frequency class was set to the sum of the frequency classes of the two constituent monosyllables minus 1.
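
The three-way rule for assigning a frequency class to a disyllabic word sound can be written as a short lookup. This is a hedged sketch: the function name `disyllable_class`, the argument layout, and the sample values are hypothetical, and the decision that a word is "necessary for the dictionary system" is assumed to have been made by the caller.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]

DEFAULT_CLASS = 18  # class given to pairs found only in the word list


def disyllable_class(
    pair: Pair,
    measured: Dict[Pair, int],   # classes derived from the frequency dictionary
    word_list: Set[Pair],        # pairs listed in the vocabulary source
    mono_class: Dict[str, int],  # classes of single syllables
) -> int:
    """Frequency class of a disyllabic word sound (illustrative only)."""
    if pair in measured:
        return measured[pair]
    if pair in word_list:
        return DEFAULT_CLASS
    # In neither source: sum of the two monosyllable classes minus 1.
    return mono_class[pair[0]] + mono_class[pair[1]] - 1


# Hypothetical data for demonstration.
measured = {("p", "q"): 9}
word_list = {("p", "q"), ("r", "s")}
mono_class = {"x": 5, "y": 7}
```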

6) "△" in line 7 indicates a homophone error. For a homophone error in a disyllabic word, the error is counted as two characters even if one of the two syllables was converted correctly, because the unit of processing is the word, not the character. The correct answer is shown in bold below the △.

7) "×" indicates a segmentation error. In the entire example sentence there is only one segmentation error, spanning three syllables in row (9). Word-sound segmentation based on the sound-frequency method requires no conversion key at all, yet this result shows that its segmentation capability is very close to that of so-called manual word-by-word conversion, in which the operator judges each word boundary and presses a conversion key after every word.

3.7.2 Effects of Chinese kanji conversion methods other than the sound-frequency method

FIG. 17 shows an analysis of the correct conversion rate of kanji conversion for the example sentence of FIG. 16. In FIG. 17, (1) is the whole example sentence, (2) is the word segmentation of the example sentence, and (3) is the correct conversion rate of kanji conversion by the sound-frequency method. For reference, (4) and (5) show the correct conversion rates of longest-match kanji conversion and single-kanji conversion, respectively.

In the text of (3) and (4), bold type marks character errors counted in units of words. In (5), bold type marks character errors. The right-hand columns of (3) to (5) show the content of the misconversions. They are evaluated as follows.

1) The single-kanji conversion of (5) has no practical value as a Chinese input method. Its raw conversion rate (the rate without a kanji frequency learning function) is as low as 37.0%.

A simulation was run with a learning function that places the most recently used candidate at the head of the dictionary entry, where each wrong character is corrected by selecting the right one from the candidates and re-entering it. The correct conversion rate rises to 54.6%, but conversion performance is still very low. Single-kanji conversion is as ineffective for Chinese input as it is for Japanese input.
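
The most-recently-used learning behaviour described above can be illustrated with a short simulation. This is a sketch under assumptions: the function name `mru_conversion_rate`, the candidate lists, and the input sequence are hypothetical and are not data from the patent.

```python
from typing import Dict, List, Tuple


def mru_conversion_rate(
    inputs: List[Tuple[str, str]],      # (reading, intended character)
    candidates: Dict[str, List[str]],   # reading -> candidates, best first
) -> float:
    """Share of inputs whose first candidate is already the intended one."""
    # Copy the lists so the caller's dictionary is not modified.
    state = {reading: list(chars) for reading, chars in candidates.items()}
    correct = 0
    for reading, target in inputs:
        chars = state.setdefault(reading, [])
        if chars and chars[0] == target:
            correct += 1
        else:
            # Operator picks the right character; it moves to the front.
            if target in chars:
                chars.remove(target)
            chars.insert(0, target)
    return correct / len(inputs) if inputs else 0.0


# Hypothetical example: one reading with three homophonous characters.
rate = mru_conversion_rate(
    [("shi", "事"), ("shi", "事"), ("shi", "是"), ("shi", "是")],
    {"shi": ["是", "事", "市"]},
)
```

In this made-up sequence every change of intended character costs one miss, which is why learning alone cannot lift single-kanji conversion very far.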

2) With manual word-by-word segmented input, segmentation errors are naturally zero, so the correct conversion rate is high. In the example of (3) it is 95.8%. However, this method has the drawback that the segmentation/conversion key must be pressed as many as 107 times to enter this example sentence.

Unlike phrase-unit (bunsetsu) kanji conversion in Japanese input, Chinese text is a string of kanji only, with no spaces between words. In Chinese input, therefore, the number of mental judgments about where to cut a word is a more serious cause of operator error and fatigue than the number of physical keystrokes on the word conversion key. In this respect, this method, which is currently the most widespread for pinyin input, can be judged insufficiently effective in terms of work efficiency.

3) The longest-match automatic segmentation kanji conversion of (4) is a method devised to overcome the above drawback of manual word segmentation. It applies to Chinese a technique that is now widely and very successfully used for Japanese input. However, because the structure of Chinese differs considerably from that of Japanese, the correct conversion rate is not high. In the example sentence the error rate reaches 17.6%, and a considerable amount of re-segmentation is unavoidable. Most of the wrong characters are caused by segmentation errors. Correcting a segmentation error costs the operator great effort and mental strain, so it is a far more harmful error than a homophone error made after correct word segmentation. Because it produces many segmentation errors, longest-match automatic segmentation kanji conversion remains insufficiently effective for Chinese input.
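
For readers unfamiliar with the baseline, greedy longest-match segmentation can be sketched as follows. This is the generic textbook technique, not code from the patent, and the function name `longest_match`, the lexicon, and the `max_len` limit are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Unit = Tuple[str, ...]


def longest_match(
    syllables: List[str],
    lexicon: Set[Unit],
    max_len: int = 3,
) -> List[Unit]:
    """Greedy left-to-right segmentation taking the longest known word."""
    result: List[Unit] = []
    i = 0
    while i < len(syllables):
        # Try the longest span first; a single syllable is always accepted.
        for width in range(min(max_len, len(syllables) - i), 0, -1):
            unit = tuple(syllables[i:i + width])
            if width == 1 or unit in lexicon:
                result.append(unit)
                i += width
                break
    return result


# Hypothetical lexicon in which both "a b" and "b c" are words.
units = longest_match(["a", "b", "c"], {("a", "b"), ("b", "c")})
```

Because the choice is made greedily from the left, the sketch always commits to ab/c here and never considers a/bc, which is the kind of situation in which segmentation errors arise.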

3.7.3 Key points and effects of sound-frequency automatic word-sound segmentation kanji conversion

1) The operator simply types the pinyin syllable keys as the text is read, without thinking about word boundaries at all, and a kanji word string with nearly correct word segmentation is output automatically, following the keystrokes.

2) Thanks to sound-frequency automatic segmentation, segmentation errors occur only rarely, and conversion errors are almost entirely limited to homophone errors. The operator can therefore correct mistakes without much effort.

3) Because segmentation errors occur only rarely, homophone errors can be reduced and the correct kanji conversion rate raised further by converting each word sound only to homophones that satisfy the grammatical relationship holding between two consecutive words.

In a method with many segmentation errors, such as the longest-match method, grammatical processing would be applied on top of the segmentation errors, so no benefit can be expected from it.

4) In summary, the present invention provides a Chinese word-sound-to-kanji conversion system that (1) requires no conversion key operation at all, the input being typed exactly as the kanji are read, (2) automatically performs nearly correct word segmentation following the input and automatically converts each word to kanji according to that segmentation, and thereby (3) reduces the operator's burden and fatigue and improves the productivity of input work.
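
The automatic segmentation summarized here first divides the input into sound-frequency phrases at nodes. According to the claims, a node lies between two adjacent monosyllabic word sounds for which the dictionary contains no disyllabic word sound. The following is a minimal sketch of that rule only: the function name `split_at_nodes` and the sample dictionary are hypothetical, and polysyllabic word sounds and externally supplied delimiters are ignored.

```python
from typing import List, Set, Tuple


def split_at_nodes(
    syllables: List[str],
    disyllables: Set[Tuple[str, str]],
) -> List[List[str]]:
    """Split a syllable string into phrases at every node."""
    phrases: List[List[str]] = []
    current: List[str] = []
    for syllable in syllables:
        # A node precedes this syllable if it forms no disyllabic word
        # sound with the syllable entered immediately before it.
        if current and (current[-1], syllable) not in disyllables:
            phrases.append(current)
            current = []
        current.append(syllable)
    if current:
        phrases.append(current)
    return phrases


# Hypothetical dictionary: only "a b" and "c d" are disyllabic word sounds.
phrases = split_at_nodes(["a", "b", "c", "d"], {("a", "b"), ("c", "d")})
```

Each resulting phrase can then be segmented independently, which is what allows the output to follow the keystrokes.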

[Brief description of the drawings]

FIG. 1 is a block diagram showing the configuration of an embodiment of the present invention.
FIG. 2(a) shows an example of a Chinese sentence, and FIG. 2(b) shows the Japanese sentence corresponding to that Chinese sentence.
FIG. 3(a) shows the Japanese sentence of FIG. 2(b) with phrase (bunsetsu) breaks inserted, and FIG. 3(b) shows an example of a Chinese sentence divided with a rhythm resembling the phrase breaks of the Japanese sentence in FIG. 3(a).
FIG. 4 shows the Chinese sentence of FIG. 2(a) divided into words.
FIG. 5 shows the Japanese sentence of FIG. 3(a) with each syllable written in kana.
FIG. 6 shows the Chinese sentence of FIG. 4 with each syllable written in Chinese romanization.
FIG. 7 shows an example of the Japanese sentence obtained by entering the sentence of FIG. 2(b) syllable by syllable, dividing it into phrases by the longest-match method, and converting it to kanji.
FIG. 8(a) shows the concept of the hierarchical analysis method for Chinese, and FIG. 8(b) illustrates how a Chinese sentence is divided at the positions where a reader pauses briefly when reading aloud.
FIG. 9(a) shows the same example Chinese sentence as FIG. 2(a), and FIG. 9(b) shows the kanji sentence obtained by entering the Chinese sentence of FIG. 9(a) by syllable and converting it to kanji by the longest-match method.
FIG. 10 shows the concept of the syllable-to-kanji conversion of FIG. 9.
FIG. 11 shows examples of word sounds and information content in Chinese sentences.
FIG. 12(1) to (4) show the sound-frequency network of the Chinese syllable string obtained by abstracting away the kanji of the Chinese sentence in FIG. 11(2), and FIG. 12(5) shows the kanji sentence obtained by converting the word sounds on the minimum sound-frequency path of the network in FIG. 12(4) into their most frequent kanji.
FIG. 13 shows the data stored in the storage means of the embodiment of FIG. 1.
FIG. 14 shows the possible segmentation types of an n-syllable phrase.
FIG. 15 shows the frequency-class sums IT_n for the segmentation variables T_n of FIG. 14.
FIG. 16 shows an example of the kanji sentence obtained by converting a Chinese sentence to kanji with the sound-frequency method used in the present invention.
FIG. 17 shows the correct conversion rates of kanji conversion for the example sentence of FIG. 16.

1: dictionary; 2: word-sound/frequency-class retrieval means; 3: means for storing monosyllabic word sounds and their frequency classes, disyllabic word sounds and their frequency classes, and three-syllable and polysyllabic word sounds; 4: node determination means; 5: optimal word-sound segmentation type generation means; 6: word-sound segmentation type storage means; 7: kanji string generation means; 8: kanji string storage means.

Claims (1)

[Claims]

1. A Chinese word-sound segmentation system in which a string of word sounds expressed as a sequence of Chinese syllables is input as an input word-sound string and the input word-sound string is divided into individual word sounds, the system comprising:

a dictionary that stores, for the monosyllabic and disyllabic word sounds among Chinese word sounds, the absolute value of the logarithm of the statistical frequency with which each word sound appears in Chinese text, as the frequency class of that word sound;

word-sound/frequency-class retrieval means that searches the dictionary for each monosyllabic word sound of the input word-sound string and reads the monosyllabic word sound and its frequency class from the dictionary, and that, when another monosyllabic word sound has been input immediately before that monosyllabic word sound, searches the dictionary for a disyllabic word sound consisting of the two monosyllabic word sounds and, if the dictionary contains it, reads the disyllabic word sound and its frequency class from the dictionary;

first storage means that stores the monosyllabic word sound, the frequency class of the monosyllabic word sound, the disyllabic word sound, and the frequency class of the disyllabic word sound read by the word-sound/frequency-class retrieval means;

node determination means that, where a node is the virtual point on the input word-sound string between a monosyllabic word sound for which the retrieval means found no disyllabic word sound and the monosyllabic word sound input immediately before it, and a sound-frequency phrase is the part of the input word-sound string lying between the two most recent nodes, reads from the first storage means the monosyllabic and disyllabic word sounds corresponding to the sound-frequency phrase together with their frequency classes;

optimal word-sound segmentation type generation means that receives the monosyllabic and disyllabic word sounds and their frequency classes read by the node determination means, generates as word-sound segmentation types the ways of dividing the sound-frequency phrase into word-sound units, and selects and outputs the optimal word-sound segmentation type among them; and

second storage means that stores the word-sound segmentation types generated by the optimal word-sound segmentation type generation means,

wherein the optimal word-sound segmentation type generation means obtains, for each word-sound segmentation type, the sum of the frequency classes of its word sounds and takes the word-sound segmentation type with the smallest sum as the optimal word-sound segmentation type.

2. A Chinese word-sound-to-kanji conversion system in which a string of word sounds expressed as a sequence of Chinese syllables is input as an input word-sound string, the input word-sound string is divided into individual word sounds, and each word sound so divided is converted to kanji, the system comprising:

a dictionary that stores the kanji of each Chinese word sound with the word sound as the headword, and that stores, for the monosyllabic and disyllabic word sounds among Chinese word sounds, the absolute value of the logarithm of the statistical frequency with which each word sound appears in Chinese text, as the frequency class of that word sound;

word-sound/frequency-class retrieval means that searches the dictionary for each monosyllabic word sound of the input word-sound string and reads the monosyllabic word sound and its frequency class from the dictionary, and that, when another monosyllabic word sound has been input immediately before that monosyllabic word sound, searches the dictionary for a disyllabic word sound consisting of the two monosyllabic word sounds and, if the dictionary contains it, reads the disyllabic word sound and its frequency class from the dictionary;

first storage means that stores the monosyllabic word sound, the frequency class of the monosyllabic word sound, the disyllabic word sound, and the frequency class of the disyllabic word sound read by the word-sound/frequency-class retrieval means;

node determination means that, where a node is the virtual point on the input word-sound string between a monosyllabic word sound for which the retrieval means found no disyllabic word sound and the monosyllabic word sound input immediately before it, and a sound-frequency phrase is the part of the input word-sound string lying between the two most recent nodes, reads from the first storage means the monosyllabic and disyllabic word sounds corresponding to the sound-frequency phrase together with their frequency classes;

optimal word-sound segmentation type generation means that receives the monosyllabic and disyllabic word sounds and their frequency classes read by the node determination means, generates as word-sound segmentation types the ways of dividing the sound-frequency phrase into word-sound units, and selects and outputs the optimal word-sound segmentation type among them;

second storage means that stores the word-sound segmentation types generated by the optimal word-sound segmentation type generation means; and

kanji string generation means that searches the dictionary for each word sound delimited by the optimal word-sound segmentation type, reads from the dictionary one of the kanji having that word sound as headword, and generates a kanji string corresponding to the input word-sound string,

wherein the optimal word-sound segmentation type generation means obtains, for each word-sound segmentation type, the sum of the frequency classes of its word sounds and takes the word-sound segmentation type with the smallest sum as the optimal word-sound segmentation type.

JP63105030A 1988-04-26 1988-04-26 Chinese phonetic delimiter and phonetic kanji conversion Expired - Fee Related JP2798931B2 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP63105030A JP2798931B2 (en) 1988-04-26 1988-04-26 Chinese phonetic delimiter and phonetic kanji conversion
CN 89102915 CN1019233B (en) 1988-04-26 1989-04-26 Chinese characters transforming mode

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
JP63105030A JP2798931B2 (en) 1988-04-26 1988-04-26 Chinese phonetic delimiter and phonetic kanji conversion

Publications (2)

Publication Number Publication Date
JPH01274272A true JPH01274272A (en) 1989-11-02
JP2798931B2 JP2798931B2 (en) 1998-09-17

Family

ID=14396627

Family Applications (1)

Application Number Title Priority Date Filing Date
JP63105030A Expired - Fee Related JP2798931B2 (en) 1988-04-26 1988-04-26 Chinese phonetic delimiter and phonetic kanji conversion

Country Status (1)

Country Link
JP (1) JP2798931B2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2010157260A (en) * 1998-02-13 2010-07-15 Microsoft Corp Word segmentation method in chinese text

Also Published As

Publication number Publication date
JP2798931B2 (en) 1998-09-17

Legal Events

Date Code Title Description
LAPS Cancellation because of no payment of annual fees