JPS5950079B2

JPS5950079B2 - Speech synthesis method

Info

Publication number: JPS5950079B2
Application number: JP8290080A
Authority: JP
Inventors: 春司岩崎; 昭広浅田; 義注太田; 規斉藤; 重光樋口; 徹三瓶
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 1980-06-20
Filing date: 1980-06-20
Publication date: 1984-12-06
Also published as: JPS5710194A

Description

[Detailed description of the invention]

The present invention relates to a method of obtaining synthesized speech that is easy to listen to at the points where the speech changes in PARCOR-type speech synthesis.

One speech synthesis method is the PARCOR synthesis method, which uses partial autocorrelation coefficients (hereinafter called PARCOR coefficients), a type of linear prediction coefficient.

This method is already widely known in the field of speech research, so a detailed explanation is omitted; only the speech spectrum and PARCOR speech synthesis are described here.

Figure 1 shows a frequency analysis of the sound "a". Curve a represents the entire spectrum. It can be decomposed into the product of the spectral envelope b, which varies slowly with frequency, and the spectral fine structure c, which varies rapidly. The spectral envelope mainly reflects the resonance characteristics of the vocal tract and carries the phonemic information, that is, whether the sound is "a" or "i". The spectral fine structure, on the other hand, carries the periodicity of the speech, that is, whether the sound is voiced or unvoiced (unvoiced sounds generally have no periodicity); for voiced sounds, the period (pitch information), that is, the pitch of the sound; and, for each sound, the strength of the vocal cord vibration (volume information). Physically, the PARCOR coefficients are feature parameters that represent the transmission characteristics of the vocal tract: when the vocal tract is regarded as a combination of several acoustic tubes, they are the reflection coefficients at the junctions between the tubes. Speech can therefore be synthesized by using the PARCOR coefficients to reproduce the filter characteristics that represent the phoneme. Figure 2 shows the configuration of a conventional PARCOR speech synthesizer.
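As a rough illustration of how reflection coefficients can define a synthesis filter, the following Python sketch implements a generic all-pole lattice filter. The function name `lattice_synthesize`, the sign convention, and the coefficient values are assumptions made for illustration; the patent does not describe the internal structure of its digital filter.

```python
def lattice_synthesize(excitation, k):
    """All-pole lattice filter driven by `excitation`.

    k[i-1] is the reflection coefficient of stage i (hypothetical values;
    the sign convention used here is one common choice, not the patent's).
    """
    p = len(k)
    b = [0.0] * (p + 1)  # b[i] holds the backward signal of stage i, delayed by one sample
    out = []
    for x in excitation:
        f = x                      # forward signal entering the top stage
        new_b = [0.0] * (p + 1)
        for i in range(p, 0, -1):
            f = f - k[i - 1] * b[i - 1]          # forward path of stage i
            new_b[i] = b[i - 1] + k[i - 1] * f   # backward path of stage i
        new_b[0] = f               # the output feeds the backward chain
        b = new_b
        out.append(f)
    return out


# Impulse response of a single stage with k = 0.5
print(lattice_synthesize([1.0, 0.0, 0.0, 0.0], [0.5]))
```

With a single stage the filter reduces to y(n) = x(n) - k·y(n-1), so the response decays geometrically; with all coefficients set to zero the excitation passes through unchanged.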

In the figure, 1 is a noise generator, 2 is a pulse generator, 3 is a voiced/unvoiced changeover switch, 4 is a multiplier, 5 is a digital filter, 6 is a D/A converter, 7 is a speaker, 8 is pitch information, 9 is a voiced/unvoiced switching signal, 10 is volume information, 11 is the PARCOR coefficients, and 12 is a storage unit. The storage unit 12 stores, in time series, the voiced/unvoiced discrimination information, pitch information, volume information, and PARCOR coefficients obtained in advance by analyzing natural speech.

When speech is synthesized, the digital filter 5 reproduces the filter characteristics representing the phoneme using the PARCOR coefficients 11 from the storage unit 12. For a voiced sound, the pulse generator 2 applies pulses to the filter at the period given by the pitch information 8; for an unvoiced sound, the noise generator 1 applies noise to the filter.

Next, the multiplier 4 multiplies the pulse or noise by the volume information 10, which determines the loudness of the synthesized sound. The digital filter 5 reproduces the filter characteristics representing the phoneme using the PARCOR coefficients 11 from the storage unit 12, the speech is synthesized, and it is output through the D/A converter 6 and the speaker 7. Figure 3 shows the natural speech waveform of the "ko" portion of "kochira wa".
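The excitation path described above (noise generator 1, pulse generator 2, switch 3, multiplier 4) might be sketched as follows. The function name `make_excitation`, the unit pulse amplitude, and the uniform noise distribution are assumptions for illustration; the patent does not specify them.

```python
import random


def make_excitation(voiced, pitch_period, volume, n_samples, seed=0):
    """Return n_samples of excitation scaled by the volume value.

    voiced       -- True selects the pulse source, False the noise source
    pitch_period -- pulse spacing in samples (used only when voiced)
    """
    if voiced:
        # Pulse source: one unit pulse every pitch_period samples.
        source = [1.0 if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    else:
        # Noise source: uniform noise in [-1, 1] (the distribution is an assumption).
        rng = random.Random(seed)
        source = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
    # Multiplier: scale the selected source by the volume value.
    return [volume * s for s in source]


print(make_excitation(True, 4, 2.0, 8))
```

The result would then be fed to the digital filter whose characteristics are set by the PARCOR coefficients.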

Similarly, Figure 4 shows the natural speech waveform of the "gin" portion of "ginkou". As the figure shows, for the sound "ko" the waveform stays nearly the same for about 20 msec. For "gin", the waveform stays nearly the same for about 60 to 80 msec. This is because, when a person speaks at normal speed, the shape of the mouth, the shape of the throat, the vibration of the vocal cords, and so on do not change abruptly; even at the fastest they are thought to stay in the same state for about 20 msec. It follows that, when synthesized sound is produced by the synthesizer shown in Figure 2, the PARCOR coefficients 11 and the other parameters representing the phoneme may be held constant for 20 msec.

It is therefore sufficient to analyze natural speech every 20 msec and store information such as the PARCOR coefficients in the storage unit 12 in time series. Figures 5 and 6 show the waveforms of the synthesized sound obtained by analyzing the speech of Figures 3 and 4 on a computer and synthesizing it with the speech synthesizer shown in Figure 2.

In Figure 5 in particular, the data can be seen to change every 20 msec. When the synthesized sound is actually heard, it lacks smoothness and its naturalness is impaired. This is because the various data change stepwise, which increases the temporal discontinuity of the speech spectrum. In the case of "gin" in Figure 6, the onset does not rise from 0 but starts abruptly as a sinusoidal waveform that already has a certain amplitude, so a pulse-like clicking noise ("puchi") is introduced and the word is heard as a click followed by "ginkou". Interpolation is one way to make the synthesized sound smoother, and Figure 7 shows a PARCOR-type speech synthesizer that uses it.
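The difference between stepwise parameter updates and interpolated updates can be sketched in Python as follows. This is a minimal illustration assuming linear interpolation between consecutive frame values; the function names, the number of sub-steps per frame, and the sample values are hypothetical.

```python
def hold_frames(frame_values, steps):
    """Stepwise update: each frame value is held for `steps` sub-intervals."""
    return [v for v in frame_values for _ in range(steps)]


def interpolate_frames(frame_values, steps):
    """Linear interpolation from each frame value toward the next one."""
    out = []
    for cur, nxt in zip(frame_values, frame_values[1:]):
        for s in range(steps):
            out.append(cur + (nxt - cur) * s / steps)
    out.append(frame_values[-1])  # end exactly on the last frame value
    return out


frames = [0.0, 4.0, 2.0]              # one parameter, one value per 20 msec frame
print(hold_frames(frames, 4))         # jumps at every frame boundary
print(interpolate_frames(frames, 4))  # moves gradually between frame values
```

The held sequence jumps at each frame boundary, which corresponds to the stepwise change blamed for the spectral discontinuity, while the interpolated sequence moves in small increments.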

In the figure, 13 is an interpolation circuit. The interpolation circuit 13 linearly interpolates the various parameters at a suitable interpolation period and sends the data to the digital filter 5 and the other components, giving the synthesized sound smoothness so as to obtain natural speech. Figures 8 and 9 show the speech waveforms obtained with the PARCOR-type speech synthesizer of Figure 7. For "ko", the waveform itself shows that smoothness has improved compared with Figure 5, but because the unvoiced sound at the head of "ko" increases gradually from 0, the unvoiced sound is indistinct and the onset is weak. For "gin", on the other hand, smooth speech is obtained without noise such as the click.

An object of the present invention is to eliminate the above drawbacks of the prior art and provide a speech synthesis method that synthesizes speech that is easy to listen to even at points of discontinuity in the speech.

In a speech synthesizer, the present invention stores whether sound was present or absent one unit time earlier (silence, or sound, meaning a voiced or unvoiced sound) and switches the method of interpolating the volume information according to whether the next sound is voiced or unvoiced, thereby obtaining synthesized sound that is easy to listen to.
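The switching rule can be sketched at the level of a volume contour. The sketch below assumes that interpolation ramps linearly from the previous frame's volume to the current one, that silence precedes the first frame, and that a frame counts as silent when its stored volume is 0; the function name `volume_track` and the frame representation are hypothetical.

```python
def volume_track(frames, steps):
    """Expand per-frame (volume, voiced) pairs into a per-step volume contour.

    A change from silence to an unvoiced sound uses the stored volume as is;
    every other case ramps linearly from the previous volume.
    """
    out = []
    prev_volume = 0.0  # assumption: silence before the first frame
    for volume, voiced in frames:
        onset = prev_volume == 0 and volume != 0
        if onset and not voiced:
            # Unvoiced onset: no interpolation, full volume from the start.
            out.extend([volume] * steps)
        else:
            # Voiced onset or continuing sound: interpolated volume.
            out.extend(prev_volume + (volume - prev_volume) * (s + 1) / steps
                       for s in range(steps))
        prev_volume = volume
    return out


print(volume_track([(0.0, False), (4.0, False), (8.0, True)], 2))
print(volume_track([(0.0, True), (4.0, True)], 4))
```

In the first call the unvoiced onset jumps straight to its stored volume, while in the second the voiced onset rises gradually from 0.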

The speech synthesizer according to the present invention is described below with reference to the embodiment shown in the drawings.

Figure 10 shows an embodiment of a PARCOR-type speech synthesizer according to the present invention.

Components with the same reference numerals as in Figure 7 are the same components and operate in the same way. In Figure 10, 14 is a delay circuit, 15 is a determination circuit, 16 is a changeover switch, and 17 is a zero detection circuit. The zero detection circuit 17 outputs 0 if the volume information from the storage unit 12 is 0, and 1 otherwise. The delay circuit 14 delays the 1/0 signal of the zero detection circuit 17 and holds the value from one unit time earlier. When the changeover switch 16 is connected to side B (ロ), the value interpolated by the interpolation circuit 13 is used as the volume information 10; when it is connected to side A (イ), the value from the storage unit 12 is used as is. The changeover switch 16 is controlled by the determination circuit 15. When the state changes from silence to sound, the determination circuit 15 outputs 0 for a voiced sound and 1 for an unvoiced sound, and the changeover switch 16 is connected to side A for an output of 1 and to side B for an output of 0. A specific circuit for the determination circuit 15 is shown in Figure 11. Before the circuit is explained, note that the voiced/unvoiced switching signal 9, supplied as data from the storage unit 12, is 1 for a voiced sound and 0 for an unvoiced sound. In the figure, components with the same reference numerals as in Figure 10 are the same components.

(1) Consider the case where a sound state continues.

The zero detection circuit 17 always outputs 1, so the output of the delay circuit 14 is also 1 and the output of the inverter 18 is 0. The inputs of the AND gate 19 are therefore (1, 0) and its output is 0. The inputs of the AND gate 20 then become (0, 1 or 0), so its output is always 0 regardless of the output of the inverter 21. The changeover switch 16 is connected to side B (ロ), and the interpolated value is used as the volume information 10.

(2) Consider the case where the state changes from silence to sound (a voiced sound).

The output of the zero detection circuit 17 changes from 0 to 1, and the output of the delay circuit 14 is 0. The input of the inverter 18 is therefore 0 and its output is 1, so the inputs of the AND gate 19 are (1, 1) and its output is 1. Next, the input of the inverter 21 is 1 and its output is 0, so the inputs of the AND gate 20 are (1, 0) and its output is 0. The changeover switch 16 is connected to side B (ロ), and the interpolated value is used as the volume information 10, just as in case (1).

(3) Consider the case where the state changes from silence to sound (an unvoiced sound).

Up to the point where the output of the AND gate 19 becomes 1, the operation is the same as in case (2). The input of the inverter 21 is 0 and its output is 1, so the inputs of the AND gate 20 are (1, 1) and its output is 1.

As a result, the changeover switch 16 is connected to side A (イ), and the value from the storage unit 12 is used as is as the volume information 10. Figures 12 and 13 show the synthesized speech waveforms obtained with the PARCOR-type speech synthesizer of Figure 10.
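The three cases above can be checked with a small gate-level model of the determination circuit. This is a sketch that follows the gate numbering in the description (inverters 18 and 21, AND gates 19 and 20); the function name and the wiring inferred from the text are assumptions, since Figure 11 itself is not reproduced here.

```python
def determination_circuit(zero_now, zero_prev, voiced_signal):
    """Return 1 to select the stored volume as is, 0 to select the interpolated value.

    zero_now      -- zero detection output for the current unit time (1 = sound)
    zero_prev     -- delayed zero detection output (one unit time earlier)
    voiced_signal -- voiced/unvoiced switching signal (1 = voiced, 0 = unvoiced)
    """
    inv18 = 1 - zero_prev       # 1 when the previous unit time was silent
    and19 = zero_now & inv18    # 1 only on a change from silence to sound
    inv21 = 1 - voiced_signal   # 1 when the sound is unvoiced
    and20 = and19 & inv21       # 1 only for an unvoiced onset
    return and20


# Case (1): sound continues; case (2): voiced onset; case (3): unvoiced onset.
for case, args in [("1", (1, 1, 1)), ("2", (1, 0, 1)), ("3", (1, 0, 0))]:
    print(case, determination_circuit(*args))
```

Only the unvoiced onset of case (3) produces an output of 1, which matches the walkthrough: interpolation is bypassed only when an unvoiced sound follows silence.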

In Figure 12, the unvoiced portion at the head of "ko" rises sharply, and a distinct unvoiced sound can be heard. In the "gin" portion of Figure 13, likewise, there is no noise such as the click, and overall a synthesized sound with both clarity and smoothness is obtained. As described above, the conventional methods had problems such as indistinct unvoiced sounds and pulse-like noise such as a click at the onset of voiced sounds; the present invention solves these problems and yields clear, smooth synthesized sound.

[Brief description of the drawings]

Figure 1 is a waveform diagram of the spectral decomposition of natural speech; Figure 2 is a block diagram of a conventional PARCOR-type speech synthesizer; Figures 3 and 4 are waveform diagrams of natural speech; Figures 5 and 6 are waveform diagrams of synthesized speech obtained with the synthesizer of Figure 2; Figure 7 is a block diagram of a conventional PARCOR-type speech synthesizer; Figures 8 and 9 are waveform diagrams of synthesized speech obtained with the synthesizer of Figure 7; Figure 10 is a block diagram showing an embodiment of the speech synthesizer according to the present invention; Figure 11 is a block diagram showing a specific example of the determination circuit; and Figures 12 and 13 are waveform diagrams of synthesized speech obtained with the synthesizer according to the present invention.

1: noise generator, 2: pulse generator, 4: multiplier, 5: digital filter, 6: D/A converter, 12: storage unit, 14: delay circuit, 15: determination circuit, 17: zero detection circuit.

Claims

[Claims]

1. A speech synthesis method in which speech is synthesized by a digital filter on the basis of spectral information, pitch information, and volume information extracted from a waveform cut out of natural speech, characterized in that the volume information is continuously interpolated when the state changes from silence to a voiced sound source, and the volume information is used without interpolation when the state changes from silence to an unvoiced state, to obtain the synthesized sound.