JP5481957B2

JP5481957B2 - Speech synthesizer

Info

Publication number: JP5481957B2
Application number: JP2009143779A
Authority: JP
Inventors: 敏雄茂出木
Original assignee: Dai Nippon Printing Co Ltd
Current assignee: Dai Nippon Printing Co Ltd
Priority date: 2009-06-17
Filing date: 2009-06-17
Publication date: 2014-04-23
Anticipated expiration: 2029-06-17
Also published as: JP2011002524A

Description

The present invention relates to a technique for obtaining, from a human voice, code data that can be used with electronic musical instruments, musical scores, and the like.

Methods for artificially synthesizing the human voice have long been used in various fields. The applicant has proposed a technique that performs speech synthesis by digitizing a human voice with PCM, applying a Fourier transform, and obtaining the codes that correspond to frequencies with high effective intensity (see Patent Document 1).

The applicant has also proposed a technique that converts the code group of each phoneme into a simpler code group, so that the data can be played even on low-performance MIDI sound sources such as those built into toys, and so that a legible score results when the data is loaded into an existing score editing tool and converted to staff notation (see Patent Document 2).

Patent Document 1: JP-A-11-95798
Patent Document 2: Japanese Patent Application No. 2009-41165

In the technique described in Patent Document 1, the code groups encoded in MIDI data format for each phoneme of the human voice assume that the speech will be reproduced by automatic performance on an electronic musical instrument. As a result, note timing and velocity control are complicated, and the data could not be played on low-performance MIDI sound sources such as those built into toys. Moreover, when such a code group is loaded into an existing score editing tool and converted to staff notation, the resulting score is illegible, so it was difficult for a human to reproduce the code group by playing an instrument.

The technique described in Patent Document 2 represents all phonemes other than vowels uniformly as simplified MIDI data consisting of two consecutive chords, but the speech lacks clarity even when it is played on a MIDI sound source with various instrument timbres. There are three causes of this lack of clarity. First, simplifying all non-vowel phonemes to, for example, two consecutive four-note chords is itself a problem. In real phonemes the formants fluctuate strongly along the time axis, so it is difficult to omit the transitional sound that connects the noise component of the consonant to the vowel component and leave it to the interpolation phenomenon of human hearing (auditory stream segregation); two consecutive chords provide too little variation. Second, unifying all non-vowel phonemes to two consecutive four-note chords is a problem. In real Japanese phonemes, the makeup of the transitional sound and the sounding time differ from phoneme to phoneme, so forcing a uniform number of chords is unreasonable. Third, the pitches of the phonemes registered in the database vary. If the phonemes are simply placed along the time axis, unnatural intonation is added to the synthesized speech and the speech becomes harder to understand.

In addition, when the MIDI-format phoneme database was built, a set of waveform audio recordings of the Japanese syllabary was prepared and converted to high-resolution MIDI data using the technique disclosed in Japanese Patent No. 4132362 and elsewhere, and an operator then indicated on screen the positions in the converted MIDI data that correspond to two phoneme sections. Because this depends on the operator's subjective judgment, the shaped data were inconsistent, and the operator's workload was heavy.

Accordingly, an object of the present invention is to provide a phoneme code correction device, a phoneme code database, and a speech synthesizer that reduce the workload and improve the clarity of the reproduced speech when a speech synthesis function is realized on the basis of simplified code data (such as MIDI data) that can be automatically converted to staff notation simple enough for a human to play on an instrument.

To solve the above problem, the present invention provides a speech synthesizer comprising: a phoneme code database that records corrected phoneme codes in association with phoneme code identification information identifying each phoneme code; and phoneme editing processing means that extracts from the phoneme code database the corrected phoneme code corresponding to the phoneme code identification information given in supplied synthesis instruction data, sets a sounding time and a silent time according to the type of the phoneme, and sets the times that specify the start and end of sounding, thereby generating synthesized speech data. The corrected phoneme codes are produced by a phoneme code correction device having: phoneme code reading means that reads a phoneme code in which one phoneme is represented by a group of codes; phoneme code conversion means that, from the codes making up the read phoneme code, extracts those with the highest energy values (the energy value being the product of the intensity of the code and the time difference between its sounding start time and sounding end time) such that the temporally overlapping codes do not exceed a predetermined number of kinds, and converts the phoneme code into one made up of the extracted code group; and code time correction means that corrects the sounding start time and sounding end time of each code making up that phoneme code to an integer multiple of a predetermined time unit. Each corrected phoneme code corresponds to a phoneme of the Japanese kana characters and consists of a plurality of chords (two or more for a vowel, three or more for a consonant), each having at most the predetermined number of simultaneous pitches and a length that is an integer multiple of the predetermined time unit.

According to the present invention, a phoneme code made up of a code group is read and converted into a phoneme code made up of a code group from which a predetermined number of high-energy codes have been extracted, and the sounding start time and sounding end time of each code making up the phoneme code are corrected to an integer multiple of a predetermined time unit. When a speech synthesis function is realized on the basis of simplified code data that can be automatically converted to staff notation simple enough for a human to play on an instrument, this reduces the workload and improves the clarity of the reproduced speech.

According to the present invention, the phoneme code representing each phoneme of the Japanese kana characters is made up of a plurality of chords, each with at most a predetermined number of simultaneous pitches and a length that is an integer multiple of a predetermined time unit. The code group produced by speech synthesis can therefore be played on low-performance MIDI sound sources such as those built into toys, and can be converted by an existing score editing tool into staff notation that a performer can read easily while playing.

According to the present invention, the synthesizer has a database of phoneme codes, each made up of a plurality of chords with at most a predetermined number of simultaneous pitches and a length that is an integer multiple of a predetermined time unit. It extracts the phoneme code corresponding to the input phoneme code identification information, sets a sounding time and a silent time according to the type of the phoneme, and sets the times that specify the start and end of sounding to generate synthesized speech data. The code group produced by speech synthesis can therefore be played on low-performance MIDI sound sources such as those built into toys, and can be converted by an existing score editing tool into legible staff notation that a performer can reproduce on an instrument.

The present invention thus has the effect of reducing the workload and improving the clarity of the reproduced speech when a speech synthesis function is realized on the basis of simplified code data that can be automatically converted to staff notation simple enough for a human to play on an instrument.

A diagram showing the basic concept of speech synthesis in the present invention.
A diagram showing the basic concept of speech synthesis in the present invention.
A block diagram showing an embodiment of the phoneme code correction device according to the present invention.
A diagram showing a phoneme code before correction as displayed on the phoneme code display means 30.
A diagram showing a phoneme code after correction as displayed on the phoneme code display means 30.
A diagram showing how a phoneme code changes before and after processing by the phoneme code correction means 21.
A diagram showing an example of male-voice phoneme codes stored in the corrected phoneme code storage unit 12.
A diagram showing an example of male-voice phoneme codes stored in the corrected phoneme code storage unit 12.
A diagram showing an example of female-voice phoneme codes stored in the corrected phoneme code storage unit 12.
A diagram showing an example of female-voice phoneme codes stored in the corrected phoneme code storage unit 12.
A diagram showing an example of the male-voice phoneme codes of FIGS. 7 and 8 rendered in staff notation.
A diagram showing an example of the female-voice phoneme codes of FIGS. 9 and 10 rendered in staff notation.
A block diagram showing an embodiment of the speech synthesizer according to the present invention.
A diagram showing pitch correction by the phoneme editing processing means 50.
A block diagram showing an embodiment of the digital watermark embedding device according to the present invention.
A diagram showing an example in which 32 codes were processed by the phoneme code correction means 21.

(1. Basic concept of the present invention)
Preferred embodiments of the present invention are described in detail below with reference to the drawings. First, the basic concept of the invention is explained. It is known that a Japanese vowel can be approximated by a chord of four or more simultaneous notes that includes the two characteristic speech formant components. A consonant theoretically requires three connected chords: in addition to the vowel component, a chord representing noise such as frication and a chord representing the transition to the vowel. If the transitional sound is left to the interpolation phenomenon of human hearing (auditory stream segregation), however, a consonant can be approximated by two chords, the initial noise and the vowel. The applicant therefore reasoned that speech synthesis could be realized by representing every phoneme on the basis of a single four-note chord for a Japanese vowel and two consecutive four-note chords for a consonant, and joining these along the time axis, and proposed this in Patent Document 2.

As noted above, however, for various reasons the speech lacked clarity even when it was played on a MIDI sound source with various instrument timbres. In the present invention, the structure of phonemes such as vowels and consonants is fundamentally different from that of Patent Document 1. Each phoneme is recorded in its own specific form, and these phonemes are then combined in a wider variety of ways according to their characteristics.

In this embodiment, every phoneme, whether vowel or consonant, has a variable structure of two to six consecutive chords. Each chord has six simultaneous notes as standard, and its length is an integer multiple of 1/16 second. Chords of different lengths may also be overlaid, within a total of twelve simultaneous notes. FIG. 1 shows the basic structure of a phoneme in this embodiment for the four-chord case. The example of FIG. 1(a) consists of four consecutive chords: the first and second chords each last the minimum unit interval of 1/16 second, the third chord lasts 2/16 second (twice the minimum unit), and the fourth chord lasts 3/16 second (three times the minimum unit). The example of FIG. 1(b) uses the same chords as FIG. 1(a), but the fourth chord is overlaid on the second and third chords.
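
The chord structure described for FIG. 1 can be sketched as a small data structure. This is only an illustrative sketch: the `Chord` class, its field names, and `phoneme_length` are hypothetical names not found in the source, and the pitches are left empty because the description of FIG. 1 gives none.

```python
from dataclasses import dataclass
from fractions import Fraction

UNIT = Fraction(1, 16)  # minimum unit interval in seconds, as stated in the text


@dataclass(frozen=True)
class Chord:
    start_units: int     # onset, counted in minimum units from the phoneme start
    length_units: int    # duration, counted in minimum units
    pitches: tuple = ()  # MIDI note numbers; empty here (not given for FIG. 1)


def phoneme_length(chords):
    """Length of the phoneme in seconds: the latest chord end time."""
    return max(c.start_units + c.length_units for c in chords) * UNIT


# FIG. 1(a): four chords played back to back (1, 1, 2, and 3 units long).
fig_1a = [Chord(0, 1), Chord(1, 1), Chord(2, 2), Chord(4, 3)]

# FIG. 1(b): the same chords, but the fourth is overlaid on the second and third.
fig_1b = [Chord(0, 1), Chord(1, 1), Chord(2, 2), Chord(1, 3)]
```

Under these assumptions the back-to-back layout lasts 7/16 second and the overlaid layout 4/16 second, which shows how overlaying shortens a phoneme without dropping any chord.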

In this embodiment, speech synthesis places phonemes of the above structure in time series, and the phoneme section, that is, the length of the phoneme, is variable. FIGS. 2(a) and 2(b) show a fixed phoneme arrangement, in which phoneme sections have a fixed length, and a variable phoneme arrangement, in which they vary. In both examples each phoneme has three consecutive chords. In the fixed arrangement of FIG. 2(a), the phoneme section stays fixed at 4/16 second even though the second phoneme from the left totals only 3/16 second. In the variable arrangement of FIG. 2(b), when the second phoneme from the left totals 3/16 second, its phoneme section also becomes 3/16 second.
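
The difference between the two arrangements of FIG. 2 can be illustrated with a few lines of Python. This is a sketch under assumptions: `section_starts` is a hypothetical helper, silent phoneme intervals are ignored, and the lengths of the first and third phonemes (4 units each) are assumed, since the text only states that the second phoneme is 3/16 second long.

```python
def section_starts(lengths, fixed_section=None):
    """Return the start of each phoneme section, in minimum units (1/16 s).

    lengths: phoneme lengths in minimum units.
    fixed_section: if given, every section has this length (fixed arrangement);
                   otherwise each section equals its phoneme length (variable).
    """
    starts, t = [], 0
    for length in lengths:
        starts.append(t)
        t += fixed_section if fixed_section is not None else length
    return starts


lengths = [4, 3, 4]                                # assumed, except the middle value
fixed = section_starts(lengths, fixed_section=4)   # FIG. 2(a)
variable = section_starts(lengths)                 # FIG. 2(b)
```

In the fixed case the third phoneme starts at unit 8 regardless of the shorter second phoneme; in the variable case it moves up to unit 7.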

FIG. 3(a) shows the synthesis pattern for an ordinary phoneme. For an ordinary phoneme, the phoneme section equals the length of the phoneme and the phoneme interval is set to the standard value. As shown in FIG. 3(a), if the phoneme "カ" (ka) is 4/16 second long, the phoneme section is 4/16 second, the phoneme interval is the standard 4/16 second, and the following phoneme comes after that. FIG. 3(b) shows the synthesis pattern for a long-vowel phoneme. For a long-vowel phoneme, the phoneme section is set to twice the length of the phoneme and the phoneme interval to twice the standard value. As shown in FIG. 3(b), if the phoneme "カ" is 4/16 second long, the phoneme section is 8/16 second (twice the phoneme length) and the phoneme interval is 8/16 second (twice the standard).

FIG. 4 explains the synthesis pattern for contracted sounds (yōon) and geminate consonants (sokuon) by comparison with ordinary phonemes. When the middle "ツ" is an ordinary phoneme, as in "アツタ" (a-tsu-ta), the phoneme intervals before and after it are set to the standard 4/16 second, as shown in FIG. 4(a). The phoneme section is not shown, but as described above it equals the length of the phoneme. When the middle "ッ" is a geminate marker, as in "アッタ" (atta), the phoneme intervals before and after it are set to 2/16 second, half the standard, as shown in FIG. 4(b). The phoneme section is exactly the same as for an ordinary phoneme. In this embodiment, then, the phoneme itself is not modified; contracted and geminate sounds are produced by shortening the phoneme interval. Some preceding and following phonemes are omitted in FIGS. 2 to 4, but in principle a silent phoneme interval is set between all phonemes. Because FIGS. 1 to 4 show only the basic forms, the ratio of the chord sounding times within each phoneme, the sounding time of each phoneme (phoneme section), and the silent time (phoneme interval) can be changed as appropriate. For example, the speaking rate can be changed by varying the interval between adjacent phonemes (the phoneme interval).
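
The timing rules of FIGS. 3 and 4 can be summarized in a short layout routine. This is a sketch, not the patented implementation: `layout`, `STANDARD_GAP`, and the `kind` labels are hypothetical, all times are counted in units of 1/16 second, and the phoneme lengths in the examples are assumed to be 4 units (the text states this only for "カ"). How a long-vowel phoneme fills its doubled section is not modeled.

```python
STANDARD_GAP = 4  # standard phoneme interval in 1/16 s units (4/16 s in the text)


def layout(phonemes):
    """Place phonemes on a timeline.

    phonemes: list of (label, length_units, kind), where kind is
              "normal", "long", or "geminate".
    Returns a list of (label, onset_units, section_units).
    """
    placed, t = [], 0
    for i, (label, length, kind) in enumerate(phonemes):
        # Long-vowel phonemes get a section twice the phoneme length.
        section = 2 * length if kind == "long" else length
        placed.append((label, t, section))
        # Interval after this phoneme: doubled for long vowels ...
        gap = 2 * STANDARD_GAP if kind == "long" else STANDARD_GAP
        # ... and halved on both sides of a geminate marker.
        next_kind = phonemes[i + 1][2] if i + 1 < len(phonemes) else None
        if kind == "geminate" or next_kind == "geminate":
            gap = STANDARD_GAP // 2
        t += section + gap
    return placed


plain = layout([("ア", 4, "normal"), ("ツ", 4, "normal"), ("タ", 4, "normal")])
geminate = layout([("ア", 4, "normal"), ("ッ", 4, "geminate"), ("タ", 4, "normal")])
long_vowel = layout([("カ", 4, "long"), ("タ", 4, "normal")])
```

With these assumed lengths, "アツタ" places its phonemes at units 0, 8, and 16, while "アッタ" places them at 0, 6, and 12: the phonemes are unchanged and only the silent intervals shrink.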

(2. Preparation)
Next, the preprocessing that uses the prior art is described. As preparation, the human voice is encoded phoneme by phoneme. This corresponds to the phoneme encoding process shown in Patent Document 1. However, whereas the encoding process of Patent Document 1 defines 4, 8, or 16 codes in one unit section, this embodiment defines more. Specifically, this embodiment defines 32 codes corresponding to note numbers and performs the encoding with them.

In the encoding process, the human voice is first digitized phoneme by phoneme. As in Patent Document 1, a person actually utters each phoneme, and the recording is digitized by a method such as PCM. The speaker should be a native Japanese man or woman, and the same person should utter all 71 syllables with pitch and utterance length kept as uniform as possible. Because this is difficult for an ordinary speaker, it is preferable to use someone with professional voice training, such as an announcer or a vocalist. It is further desirable to collect several sets of 71-syllable recordings from several speakers, male and female. Next, the digital data of each phoneme is converted into a group of 128 codes. This conversion is broadly the same as that shown in Patent Document 1, so each code consists of a pitch, an intensity, a sounding start time, and a sounding end time (in this embodiment, which uses MIDI for the codes: note number, velocity, note-on time, and note-off time). As noted above, however, instead of the roughly 4 to 16 codes of Patent Document 1, this embodiment converts to a group of 128 codes covering every note number and then selects 32 of them. The conversion from PCM data to the 32-code group uses known techniques such as those disclosed in Japanese Patent No. 4037542 and Japanese Patent No. 4132362. The code group obtained in this way for each phoneme is recorded in the phoneme database as a phoneme code. In this embodiment, the phoneme database holds 71 sounds in total: the so-called fifty sounds (gojūon) plus the moraic nasal (hatsuon), voiced sounds (dakuon), and semi-voiced sounds (handakuon).

(3. Correction of phoneme codes)
Next, the correction of phoneme codes is described. FIG. 5 is a block diagram showing an embodiment of the phoneme code correction device according to the present invention. The storage means 10 has a phoneme code storage unit 11 and a corrected phoneme code storage unit 12, and is realized by an external storage device, such as a hard disk, connected to a computer. The processing control means 20 supervises the processing of the entire phoneme code correction device and has phoneme code conversion means 21 and code time correction means 22. The processing control means 20 is a computer main body including a CPU and memory, and the phoneme code conversion means 21 and code time correction means 22 are realized by the CPU executing a dedicated program. The phoneme code display means 30 displays the phoneme code read by the processing control means 20 and is realized by a display device such as a liquid crystal display.

Next, the processing operation of the phoneme code correction device shown in FIG. 5 is described. First, the processing control means 20 reads a phoneme code for each phoneme from the phoneme code storage unit 11.

Once the processing control means 20 has read a phoneme code, the phoneme code conversion means 21 calculates, over all the codes making up that phoneme code, a total energy value for each pitch (note number in MIDI). The total energy value is calculated as the sound intensity at each pitch (velocity in MIDI) multiplied by the sounding time (duration in MIDI: note-off time minus note-on time). Once total energy values have been calculated for all pitches, the phoneme code correction means 21 selects the pitches whose intensity is at least a predetermined value (for example 64, for a velocity that ranges from 0 to 127) and whose total energy value is highest, such that the designated number of chord notes (for example, six) is not exceeded in any sounding interval.
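
One possible reading of this selection step is sketched below. It is an assumption-laden illustration: the text does not say how ties or partial overlaps are resolved, so the sketch uses a simple greedy pass in descending order of per-pitch total energy. The function name `select_codes`, the tuple layout, and the parameter names are hypothetical.

```python
from collections import defaultdict


def select_codes(codes, max_notes=6, min_velocity=64):
    """Keep high-energy codes without exceeding max_notes at any instant.

    codes: list of (note_number, velocity, t_on, t_off).
    Returns the kept codes in order of decreasing per-pitch total energy.
    """
    # Total energy per pitch: sum of velocity x duration over its codes.
    energy = defaultdict(float)
    for note, velocity, t_on, t_off in codes:
        energy[note] += velocity * (t_off - t_on)

    # Discard weak codes, then visit the rest from the strongest pitch down.
    candidates = [c for c in codes if c[1] >= min_velocity]
    candidates.sort(key=lambda c: energy[c[0]], reverse=True)

    kept = []
    for cand in candidates:
        _, _, on, off = cand
        # The number of sounding notes can only rise where a note starts,
        # so it is enough to check the candidate's own onset and the onsets
        # of already kept codes that fall inside the candidate's span.
        checkpoints = [on] + [k[2] for k in kept if on <= k[2] < off]
        if all(sum(k[2] <= p < k[3] for k in kept) < max_notes
               for p in checkpoints):
            kept.append(cand)
    return kept


example = [(60, 100, 0, 4), (64, 90, 0, 4), (67, 80, 0, 2),
           (72, 50, 0, 4), (69, 70, 2, 4)]
two_voice = select_codes(example, max_notes=2)
three_voice = select_codes(example, max_notes=3)
```

In the example, note 72 is dropped because its velocity is below 64. With a limit of two notes only the two strongest pitches survive; with a limit of three, notes 67 and 69 also fit because they occupy different halves of the interval.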

The designated number of chord notes is set in advance and can be chosen freely; in this embodiment it is 6, as stated above. In this embodiment, therefore, the phoneme code conversion means 21 extracts 6 codes from the 32 codes. Next, the code time correction means 22 corrects the note-on time and note-off time of each extracted code to an integer multiple of a predetermined unit time, sets these as the start time and end time, and sets the velocity to a prescribed value.

The start time and end time are set to Ton' and Toff', obtained by performing the processing of [Formula 1] below, where Ton is the note-on time of the extracted code, Toff is its note-off time, and Tq is the minimum unit time.

[Formula 1]
Td' = Floor[(Td + Tq * 2/3) / Tq] * Tq
Ton' = Floor[{(Toff + Ton) / 2 - Td' / 2} / Tq] * Tq
Toff' = Ton' + Td'

In [Formula 1], Td is the duration, Toff - Ton. Floor[] is a function that truncates the fractional part of a real value to give an integer.
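
[Formula 1] translates directly into code. The sketch below follows the formula term by term; the function name `quantize` and the use of exact fractions (to avoid floating-point rounding at unit boundaries) are choices made for this illustration, not details from the source.

```python
import math
from fractions import Fraction


def quantize(t_on, t_off, tq):
    """Apply [Formula 1]: snap a code's on/off times to multiples of tq."""
    td = t_off - t_on                                          # Td = Toff - Ton
    td_q = math.floor((td + tq * Fraction(2, 3)) / tq) * tq    # Td'
    # Ton': keep the midpoint of the note, then snap down to the grid.
    t_on_q = math.floor(((t_off + t_on) / 2 - td_q / 2) / tq) * tq
    return t_on_q, t_on_q + td_q                               # (Ton', Toff')


tq = Fraction(1, 16)
# A note from 0.07 s to 0.21 s (duration 0.14 s, about 2.24 units).
result = quantize(Fraction(7, 100), Fraction(21, 100), tq)
```

For this input the duration becomes 2/16 second and the note is placed from 1/16 to 3/16 second. Because of the 2/3 offset, a duration is rounded up once its fractional part reaches one third of a unit; a code shorter than Tq/3 therefore gets Td' = 0, and the text does not say how such codes are handled.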

Because velocity can take values from 0 to 127, the prescribed velocity in this embodiment is the maximum, 127.

FIG. 6 shows how a phoneme code changes before and after processing by the phoneme code conversion means 21 and the code time correction means 22. In FIG. 6, the horizontal axis is time and the vertical axis is frequency (note number). Each rectangle in the graph represents a code. Its horizontal length indicates its duration along the time axis, but its vertical length does not follow the vertical axis: it indicates intensity (velocity), not frequency.

FIG. 6(a) shows a phoneme code before processing by the phoneme code conversion means 21 and the code time correction means 22. As described above, this embodiment normally builds a phoneme code from 32 codes at any one time and sets the designated number of chord notes to 6. For ease of explanation, however, FIG. 6(a) shows a case with at most 5 codes at any one time and a designated number of chord notes of 2. As the horizontal and vertical lengths of the rectangles show, the codes also differ in playing time (end time minus start time) and intensity. After processing by the phoneme code correction means 21, as shown in FIG. 6(b), the start and end times are corrected to positions at integer multiples of the predetermined time interval and the intensity of every code is unified to the prescribed value, so in FIG. 6(b) the horizontal and vertical lengths of the rectangles are all either identical or integer multiples of a predetermined length. FIG. 16 shows an example in which 32 codes were actually processed.

The phoneme code conversion means 21 and the code time correction means 22 process each phoneme code stored in the phoneme code storage unit 11 and store each corrected phoneme code in the corrected phoneme code storage unit 12. FIGS. 7 to 10 show examples of phoneme codes stored in the corrected phoneme code storage unit 12; FIGS. 7 and 8 encode a male voice, and FIGS. 9 and 10 encode a female voice. In FIGS. 7 to 10, "C, C#, D, D#, E, F, F#, G, G#, A, A#, B" are the English names of the notes "do, do#, re, re#, mi, fa, fa#, sol, sol#, la, la#, si", and the number that follows each name is the octave number. A note name paired with an octave number identifies a MIDI note number. This document writes MIDI note number 69 as A3 (internationally, many sources instead write MIDI note number 69 as A4). In FIGS. 7 to 10, "---" means that the note to its left is sustained. For example, for (1) phoneme A in FIG. 7, with a minimum unit interval of 1/16 second, "F5", "D#5", "A4", and "F#4" sound for 2/16 second from the start of phoneme A; "G4", "G#2", and "G#1" sound for 1/16 second from the start of phoneme A; and "B2" and "B1" sound for 1/16 second beginning 1/16 second after the start of phoneme A. Likewise, (2) phoneme I, (3) phoneme U, (4) phoneme E, and (5) phoneme O in FIG. 7 sound for 4/16, 3/16, 3/16, and 3/16 second, respectively, with a minimum unit interval of 1/16 second. When the codes making up a phoneme code are defined according to the MIDI standard, a commercially available score editing tool can convert them to staff notation. FIG. 11 shows an example of the male-voice phoneme codes of FIGS. 7 and 8 in staff notation, and FIG. 12 shows an example of the female-voice phoneme codes of FIGS. 9 and 10 in staff notation.

(4. Speech synthesis)
Next, speech synthesis using the corrected phoneme codes is described. FIG. 13 is a block diagram showing an embodiment of the speech synthesizer according to the present invention. In FIG. 13, the phoneme code database 12a records the corrected phoneme codes in association with the phoneme code identification information indicated by the synthesis instruction data. The phoneme codes stored in the phoneme code database 12a are the same as those corrected by the phoneme code correction device described above and stored in the corrected phoneme code storage unit 12. The phoneme code correction device can therefore also be regarded as a device for creating the phoneme code database 12a. The synthesized speech data storage means 13 stores the synthesized speech data produced by the phoneme editing processing means 50 and is implemented by a storage device such as a hard disk.

The phoneme editing processing means 50 extracts the corresponding phoneme codes from the phoneme code database 12a in accordance with the synthesis instruction data, applies predetermined processing to generate synthesized speech data, and outputs the data to a predetermined destination. Depending on the settings, the generated synthesized speech data is output to one or more of the synthesized speech data storage means 13, the speech output means 60, and the printing means 70. The speech output means 60 sounds the synthesized speech data received from the phoneme editing processing means 50 as actual speech and is implemented by a MIDI playback device equipped with a MIDI sound source. The printing means 70 converts the synthesized speech data received from the phoneme editing processing means 50 into staff notation and prints it; the conversion is performed by running known conversion software, and the printing function is provided by a known printer or the like. In practice, the speech synthesizer shown in FIG. 13 is implemented by installing a dedicated program on a computer that has an input device and an external storage device and to which a MIDI playback device is connected.

The synthesis instruction data input to the speech synthesizer is a sequence of phoneme identification information arranged in a predetermined order. The phoneme identification information may take any form as long as it can identify a phoneme code. In this embodiment, text data recording the character code corresponding to each phoneme is used as the phoneme identification information. In this case, each phoneme code in the phoneme code database 12a must be recorded in association with the character code of the corresponding phoneme.

Next, the processing operation of the speech synthesizer shown in FIG. 13 is described. First, synthesis instruction data is input to the speech synthesizer. Once the synthesis instruction data has been read, the phoneme editing processing means 50 processes it in order, starting from the first item of phoneme identification information. Specifically, the phoneme editing processing means 50 uses each item of phoneme identification information in the synthesis instruction data to extract the corresponding phoneme code from the phoneme code database 12a. If the extracted phoneme code is a normal phoneme, it creates MIDI events in which the note-on time is the note-off time of the preceding phoneme plus the phoneme interval, the note-off time is the note-on time plus the length of the phoneme (taken as the phoneme section), and the note number and velocity are the values recorded in the phoneme code database 12a.

If the extracted phoneme code is a long sound, then, as shown in FIG. 3(b), it creates MIDI events in which the note-on time is the note-off time of the preceding phoneme plus the phoneme interval, the note-off time is the note-on time plus twice the length of the phoneme (taken as the phoneme section), and the note number and velocity are the values recorded in the phoneme code database 12a. For the phoneme that follows a long sound, whatever its type, the note-on time is set 0.5 second after the note-off time of the long sound. This 0.5 second is only a standard value; the ratio of the sounding times of the chords within each phoneme, the sounding time of each phoneme (phoneme section), and the length of the silent time (phoneme interval) can be changed as appropriate.

If the extracted phoneme code is a contracted sound (yoon) or a geminate consonant (sokuon), the phoneme interval is half that of a normal phoneme, as shown in FIG. 4. The note-on time is therefore set 0.125 second after the note-off time of the preceding phoneme, and the note-off time is set according to the length of the phoneme. The note-on time of the following phoneme is then set 0.125 second after the note-off time of this phoneme. This 0.125 second is only a standard value; the ratio of the sounding times of the chords within each phoneme, the sounding time of each phoneme (phoneme section), and the length of the silent time (phoneme interval) can be changed as appropriate.
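The timing rules for normal phonemes, long sounds, and contracted or geminate sounds can be summarized in a short sketch. The names `Phoneme`, `gap_before`, and `schedule` are hypothetical. The normal phoneme interval of 0.25 second is inferred from the statement that 0.125 second is half the normal interval, the first phoneme is assumed to start at time 0, and giving the long-sound rule priority when a contracted sound follows a long sound is likewise an assumption.

```python
from dataclasses import dataclass

NORMAL, LONG, SHORT = "normal", "long", "contracted_or_geminate"

NORMAL_GAP = 0.25      # assumed normal phoneme interval in seconds
AFTER_LONG_GAP = 0.5   # standard interval after a long sound
SHORT_GAP = 0.125      # standard interval around a contracted/geminate sound


@dataclass(frozen=True)
class Phoneme:
    kind: str
    length: float  # phoneme section in seconds, as stored in the database


def gap_before(prev: Phoneme, cur: Phoneme) -> float:
    """Silent time between the note-off of `prev` and the note-on of `cur`."""
    if prev.kind == LONG:
        return AFTER_LONG_GAP
    if SHORT in (prev.kind, cur.kind):
        return SHORT_GAP
    return NORMAL_GAP


def schedule(phonemes):
    """Return a (note_on, note_off) pair in seconds for each phoneme."""
    times = []
    prev, prev_off = None, 0.0
    for p in phonemes:
        on = 0.0 if prev is None else prev_off + gap_before(prev, p)
        # A long sound is held for twice the stored phoneme length.
        off = on + (2 * p.length if p.kind == LONG else p.length)
        times.append((on, off))
        prev, prev_off = p, off
    return times


if __name__ == "__main__":
    seq = [Phoneme(NORMAL, 0.25), Phoneme(LONG, 0.25), Phoneme(SHORT, 0.125)]
    print(schedule(seq))
```

Because all three intervals are described as standard values that may be changed, they are kept as module-level constants rather than being written into the logic.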

Whether the phoneme is a normal phoneme, a long sound, or a contracted sound or geminate consonant, the pitch (the note number in the case of MIDI) may be left at the value recorded in the phoneme code database 12a. In that case, however, if the pitches of the phonemes recorded in the phoneme code database 12a are not aligned, the synthesized speech takes on an unnatural intonation. In this embodiment, therefore, for each phoneme other than the first, the difference between its lowest note and the lowest note of the first phoneme is determined, and all pitches of that phoneme are shifted by this difference so that its lowest note coincides with the lowest note of the first phoneme.

FIG. 14 illustrates the pitch correction performed by the phoneme editing processing means 50. For ease of explanation, FIG. 14 shows an example in which each phoneme is composed of four code codes. FIG. 14(a) shows the case without pitch correction (the pitches recorded in the database are used as they are), and FIG. 14(b) shows the case with pitch correction. Suppose that, without pitch correction, the lowest note of the first phoneme (the lowest pitch among the code codes constituting the first phoneme), the lowest note of following phoneme 1, and the lowest note of following phoneme 2 are as shown in FIG. 14(a). Speech synthesized as in FIG. 14(a) has an unnatural intonation when played back. Therefore, difference 1 between the lowest note of the first phoneme and that of following phoneme 1, and difference 2 between the lowest note of the first phoneme and that of following phoneme 2, are determined. The pitches of all four code codes of following phoneme 1 are shifted by difference 1, and the pitches of all four code codes of following phoneme 2 are shifted by difference 2. As a result, as shown in FIG. 14(b), the lowest notes of the first phoneme, following phoneme 1, and following phoneme 2 have the same pitch, and the unnatural intonation disappears on playback.
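The lowest-note alignment of FIG. 14 can be sketched as follows. The function name `align_lowest_notes` and the representation of a phoneme as a plain list of MIDI note numbers are assumptions made for illustration, and the note values in the usage example are invented, not taken from the figures.

```python
def align_lowest_notes(phonemes):
    """Shift every phoneme after the first so that all lowest notes match.

    `phonemes` is a list of lists of MIDI note numbers, one list per phoneme.
    Intervals inside each phoneme are preserved because every note of a
    phoneme is shifted by the same difference.
    """
    if not phonemes:
        return []
    base = min(phonemes[0])
    aligned = [list(phonemes[0])]
    for notes in phonemes[1:]:
        diff = base - min(notes)
        aligned.append([n + diff for n in notes])
    return aligned


if __name__ == "__main__":
    first = [48, 55, 60, 64]
    following_1 = [50, 57, 62, 69]
    following_2 = [45, 52, 57, 60]
    print(align_lowest_notes([first, following_1, following_2]))
```

The sketch does not check that shifted notes stay within the MIDI range 0 to 127; a fuller implementation would need to decide how to handle that case.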

Optionally, a pitch offset parameter specified separately by the user can be added to the note numbers to raise or lower them as appropriate, thereby achieving pitch conversion.

The phoneme editing processing means 50 performs synthesis one item of phoneme identification information at a time on the synthesis instruction data it has read, and passes the synthesized speech data (MIDI data) to the speech output means 60 phoneme by phoneme as each is completed. The speech output means 60 plays back the MIDI data received from the phoneme editing processing means 50 in order. In this way the speech synthesizer can play back speech in accordance with the synthesis instruction data it has read.

For output as staff notation, the printing means 70 converts the synthesized speech data into staff notation data and prints it. As in the example above, speech can be synthesized in real time from the synthesis instruction data and then played back or output as staff notation. Alternatively, the MIDI data produced by the phoneme editing processing means 50 may be accumulated in the synthesized speech data storage means 13 and played back separately on a MIDI playback device. If the MIDI data is stored as an SMF (Standard MIDI File) format file, it can be passed to a wide range of commercially available music software, and a score can be created from it with a commercially available score creation tool. In this case the score is created from the phoneme codes recorded in the SMF file. Printing the resulting score yields an easy-to-read score that can be used when playing an instrument.

As described above, the phoneme editing processing means 50 extracts from the phoneme code database 12a the phoneme code corresponding to the phoneme identification information in the synthesis instruction data and creates MIDI events. In doing so, it can optionally add a pitch offset parameter, specified separately by the user, to the note number of each note of the chords that make up that phoneme code in the phoneme code database 12a, raising or lowering the notes as appropriate to achieve pitch conversion. This raises or lowers the pitch of the synthesized speech data as a whole. If, however, a pitch offset parameter is defined for each phoneme together with the phoneme identification information in the synthesis instruction data, the pitch can be raised or lowered phoneme by phoneme. That is, singing voice synthesis can be achieved by supplying the pitch changes between adjacent notes of a previously composed melody (interval information) as the pitch offset parameters defined together with the phoneme identification information in the synthesis instruction data.
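A per-phoneme pitch offset can be sketched as below. The function names are hypothetical, and deriving each offset as the distance of a melody note from the first melody note is only one possible reading of "pitch changes between adjacent notes"; the text does not fix the exact encoding of the offsets.

```python
def offsets_from_melody(melody):
    """Assumed encoding: offset of each melody note from the first one."""
    return [note - melody[0] for note in melody]


def apply_pitch_offsets(phonemes, offsets, global_offset=0):
    """Add a per-phoneme offset (plus an optional global one) to every note.

    `phonemes` is a list of lists of MIDI note numbers, one list per phoneme;
    `offsets` holds one offset in semitones per phoneme.
    """
    if len(phonemes) != len(offsets):
        raise ValueError("one offset is required per phoneme")
    shifted = []
    for notes, offset in zip(phonemes, offsets):
        moved = [n + offset + global_offset for n in notes]
        if any(not 0 <= n <= 127 for n in moved):
            raise ValueError("shifted note outside the MIDI range 0-127")
        shifted.append(moved)
    return shifted


if __name__ == "__main__":
    chords = [[48, 55, 60]] * 3   # three phonemes with aligned lowest notes
    melody = [60, 62, 64]         # invented three-note melody
    print(apply_pitch_offsets(chords, offsets_from_melody(melody)))
```

Setting every per-phoneme offset to 0 and using only `global_offset` corresponds to shifting the whole of the synthesized speech data by a single user-specified amount.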

(5. Application to digital watermarking)
The speech synthesizer according to the present invention can be applied to an audible "digital watermark", a technique for embedding specific information such as copyright holder information in music data in the form of a voice message. FIG. 15 shows a digital watermark embedding device that uses the basic configuration of the speech synthesizer according to the present invention. In FIG. 15, the phoneme code database 12a is the same as the phoneme code database 12a shown in FIG. 13 and records the corrected phoneme codes in association with the phoneme code identification information indicated by the synthesis instruction data. The embedding processing means 51 embeds the message specified by a message text (synthesis instruction data) in music content, which is digital data described in SMF format or the like. Specifically, the embedding processing means 51 has the functions of the phoneme editing processing means 50 shown in FIG. 13: in accordance with the message text (synthesis instruction data), it extracts the corresponding phoneme codes from the phoneme code database 12a and applies predetermined processing to generate synthesized speech. If the music content to be output has multiple tracks and includes a dedicated message track, the synthesized speech is embedded in that track and the result is output to the sound output means 61 as a single piece of MIDI music data. If the music content has no dedicated track, the synthesized speech is stored in silent portions of the music content and output to the sound output means 61.

The sound output means 61 is substantially the same as the speech output means 60 shown in FIG. 13 and sounds the acoustic data received from the embedding processing means 51 as actual sound. The digital watermark embedding device shown in FIG. 15 embeds the message text and outputs sound in real time. Alternatively, the MIDI data produced by the embedding processing means 51 may be stored in a storage device in SMF format or the like, distributed separately over a network or the like, and output as sound by a MIDI playback device on the recipient's side. In practice, the digital watermark embedding device shown in FIG. 15 is implemented by installing a dedicated program on a computer that has an external storage device and to which a MIDI playback device is connected.

For sound output, two modes of operation are possible. In one, the embedded message text is combined with the music content and output as a voice message, serving as an audible digital watermark. In the other, the voice message is not played back and only the music content is heard, serving as an inaudible digital watermark; this is achieved either by setting the channel volume of the MIDI data corresponding to the embedded message text to the minimum, or by changing to 0 the velocity values of all of its MIDI events, which are set to the fixed value 127. For example, a trial version distributed to general users as a sample carries an audible digital watermark, whereas a product version distributed to legitimate purchasers carries an inaudible one. When the music content provider needs to determine whether music content carrying an inaudible digital watermark is a legitimately purchased copy, it reverses the above operation as preprocessing: it changes the channel volume of the MIDI data to the maximum value, or changes the velocity values of all MIDI events set to 0 to 127. The watermark is thereby restored to its audible form, and the digital watermark extraction device described below can then be applied.
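The velocity-based switching between audible and inaudible watermarks can be sketched on a simplified event model. `NoteOn`, `make_inaudible`, and `make_audible` are hypothetical names and the event layout is not a real MIDI library API. Since a note-on with velocity 0 is conventionally interpreted as a note-off in MIDI, the sketch assumes that genuine note-offs in the message track are stored as explicit note-off messages, so that every velocity-0 note-on in that track is a muted watermark note.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NoteOn:
    track: int     # track index within the music data
    time: int      # event time in ticks
    note: int      # MIDI note number
    velocity: int  # 0-127


def make_inaudible(events, message_track):
    """Set the fixed velocity 127 of watermark notes to 0."""
    return [
        replace(e, velocity=0)
        if e.track == message_track and e.velocity == 127 else e
        for e in events
    ]


def make_audible(events, message_track):
    """Reverse operation: restore velocity 0 in the message track to 127."""
    return [
        replace(e, velocity=127)
        if e.track == message_track and e.velocity == 0 else e
        for e in events
    ]


if __name__ == "__main__":
    song = [NoteOn(0, 0, 60, 100), NoteOn(1, 0, 72, 127)]
    muted = make_inaudible(song, message_track=1)
    print(muted)
    print(make_audible(muted, message_track=1) == song)
```

Restricting both operations to the message track leaves the velocities of the music content untouched, so the round trip restores the original data exactly under the stated assumption.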

Next, a digital watermark extraction device that extracts a digital watermark embedded in MIDI data in the audible form described above is described. The digital watermark extraction device is implemented by a computer that has an acoustic signal acquisition device such as a microphone and the phoneme code database 12a, and on which a dedicated program for watermark extraction is installed. The dedicated program causes the computer to function as frequency analysis means and phoneme code identification information decoding means. These means can be implemented by the procedures disclosed in Japanese Patent No. 4037542, Japanese Patent No. 4132362, and the like. The device records part of the acoustic signal emitted into the air by the sound output means 61 through a microphone or the like connected to the device, and the frequency analysis means performs frequency analysis on the recorded PCM data to extract chord data. Specifically, the PCM data is converted into a time series of groups of 32 code codes using the known techniques disclosed in Japanese Patent No. 4037542, Japanese Patent No. 4132362, and the like. The phoneme code identification information decoding means then collates the extracted chord data with the phoneme code database 12a, extracts the phoneme codes that have similar chord data, and decodes the phoneme code identification information. Specifically, the message text (synthesis instruction data) of FIG. 15 is restored by collating the extracted data sequentially against the code code groups recorded in the phoneme code database 12a and extracting the matching phoneme codes in order.
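The collation step can be illustrated with a toy matcher. The text does not specify how similarity between chord data is measured, so the Jaccard index over (time step, note number) pairs used here is an assumption, as are the names `jaccard` and `best_match`, the threshold value, and the database entries in the usage example.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, 1.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def best_match(observed, database, threshold=0.3):
    """Return the identifier of the most similar phoneme code, or None.

    `observed` is a set of (time step, MIDI note number) pairs extracted
    from the recording; `database` maps identifiers to sets of the same form.
    """
    best_id, best_score = None, 0.0
    for phoneme_id, pattern in database.items():
        score = jaccard(observed, pattern)
        if score > best_score:
            best_id, best_score = phoneme_id, score
    return best_id if best_score >= threshold else None


if __name__ == "__main__":
    database = {
        "A": {(0, 89), (0, 87), (1, 47)},   # invented entries
        "I": {(0, 70), (1, 72), (2, 74)},
    }
    observed = {(0, 89), (1, 47), (0, 60)}  # one spurious note, one missing
    print(best_match(observed, database))
```

A set-based measure tolerates a few missing or spurious notes in the recording; if the embedded speech was pitch-shifted before output, the observed notes would first have to be normalized, for example relative to the lowest note.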

The present invention can be used in industries that support the production and composition of musical works imitating human speech for events and entertainment. In the entertainment field, it can be used in industries that add a speech synthesis function to scale-playing media such as toys based on electronic musical instruments (including robots and stuffed toys), toy acoustic instruments (miniature pianos for interior decoration), music boxes, and mobile phone ringtone melodies. It can also be used in industries such as copyright protection for the distribution of MIDI music content in SMF (Standard MIDI File) or similar formats.

DESCRIPTION OF SYMBOLS
10 ... storage means
11 ... phoneme code storage unit
12 ... corrected phoneme code storage unit
12a ... phoneme code database
13 ... synthesized speech data storage means
20 ... processing control means
21 ... phoneme code conversion means
22 ... code time correction means
30 ... phoneme code display means
40 ... start/end time instruction means
50 ... phoneme editing processing means
51 ... embedding processing means
60 ... speech output means
61 ... sound output means
70 ... printing means

Claims

A speech synthesizer comprising: a phoneme code database in which corrected phoneme codes are recorded in association with phoneme code identification information that identifies each phoneme code, the corrected phoneme codes having been corrected by a phoneme code correction device that has (i) phoneme code reading means for reading a phoneme code in which one phoneme is expressed by a plurality of code code groups, (ii) phoneme code conversion means for extracting, from the code codes constituting the read phoneme code, those with the highest energy values, the energy value being given by the product of the intensity of the code code and the time difference between its sounding start time and sounding end time, within a range in which the temporally overlapping code codes do not exceed a predetermined number of kinds, and for converting the phoneme code into a phoneme code composed of the extracted code code group, and (iii) code time correction means for correcting each of the sounding start time and the sounding end time of the code codes constituting that phoneme code to an integer multiple of a predetermined time unit, each corrected phoneme code corresponding to a phoneme of the Japanese kana characters and being composed of a plurality of chords (two or more for a vowel, three or more for a consonant), each chord having simultaneous pitches of no more than the predetermined number of kinds and a duration specified as an integer multiple of the predetermined time unit; and
phoneme editing processing means for extracting from the phoneme code database the corrected phoneme code corresponding to the phoneme code identification information described in given synthesis instruction data, setting a sounding time and a silent time according to the type of the phoneme, and generating synthesized speech data by setting times that specify the start and end of sounding.

The speech synthesizer according to claim 1, wherein the phoneme editing processing means determines, for each phoneme other than the first, the difference between the lowest pitch among the code codes constituting the phoneme code corresponding to the first phoneme of the synthesis instruction data and the lowest pitch among the code codes constituting the phoneme code corresponding to that phoneme, and changes the pitch of all code codes constituting the phoneme code corresponding to that phoneme by the difference.

The speech synthesizer according to claim 1 or claim 2, further comprising speech output means for outputting, as speech, the synthesized speech data generated by the phoneme editing processing means.

The speech synthesizer according to any one of claims 1 to 3, further comprising printing means for converting the synthesized speech data generated by the phoneme editing processing means into staff notation and printing it.

The speech synthesizer according to any one of claims 1 to 4, wherein the phoneme editing processing means increases the sounding time by a predetermined value when the type of the phoneme is a long sound of the Japanese kana characters.

The speech synthesizer according to any one of claims 1 to 5, wherein, when the type of the phoneme is the geminate consonant "ッ" (small "ツ") or a contracted sound with "ヤ", "ユ", or "ヨ" of the Japanese kana characters, the phoneme editing processing means reduces by a predetermined value the silent time between the phoneme and the immediately preceding phoneme, the silent time between the phoneme and the immediately following phoneme, and the sounding time of the phoneme.

The speech synthesizer according to any one of claims 1 to 6, wherein, when extracting from the phoneme code database the phoneme code corresponding to given phoneme code identification information and setting the times that specify the start and end of sounding according to the type of the phoneme, the phoneme editing processing means multiplies the silent time by a set time scaling factor, thereby applying a predetermined modification to the times that specify the start and end of sounding.

The speech synthesizer according to any one of claims 1 to 7, wherein, when extracting from the phoneme code database the phoneme code corresponding to given phoneme code identification information and setting the times that specify the start and end of sounding according to the type of the phoneme, the phoneme editing processing means adds a set pitch offset parameter to the pitch of each code code constituting the phoneme code recorded in the phoneme code database, thereby applying a predetermined modification to the pitches of all code codes constituting the synthesized speech data.

The speech synthesizer according to claim 8, wherein the pitch offset parameter is defined in the synthesis instruction data for each phoneme together with the phoneme code identification information, and wherein, when extracting from the phoneme code database the phoneme code corresponding to given phoneme code identification information and setting the times that specify the start and end of sounding according to the type of the phoneme, the phoneme editing processing means adds the pitch offset parameter defined for each phoneme to the pitch of each code code constituting the phoneme code recorded in the phoneme code database, thereby modifying the pitches of all code codes constituting the synthesized speech data.

A program for causing a computer to function as the speech synthesizer according to any one of claims 1 to 9.