JP2004061753A

JP2004061753A - Method and device for synthesizing singing voice

Info

Publication number: JP2004061753A
Application number: JP2002218583A
Authority: JP
Inventors: Shigeki Fujii; 藤井　茂樹
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2002-07-26
Filing date: 2002-07-26
Publication date: 2004-02-26
Anticipated expiration: 2022-07-26
Also published as: JP4300764B2

Abstract

PROBLEM TO BE SOLVED: To provide a method and a device for singing voice synthesis capable of synthesizing a more natural singing voice.

SOLUTION: A singing voice synthesis part 20 synthesizes a singing voice corresponding to syllables constituting lyrics. A breath sound synthesis part 40 synthesizes a breath sound corresponding to a breath mark included in lyrics information. At this time, the breath sound synthesis part 40 selects a phoneme waveform of the breath sound from a breath sound database in a breath phoneme storage part 50 according to a combination of phonemes of the singing voice before and after the breath sound and synthesizes the breath sound by using the selected phoneme waveform.

COPYRIGHT: (C)2004, JPO

Description

【０００１】
【発明の属する技術分野】
この発明は、歌唱音および吸気音の混ざった歌唱音声を合成する方法および装置に関する。
【０００２】
【従来の技術】
人工的に音声を作り出す音声合成技術が種々提案されている。このような音声合成技術を利用するものとして、歌詞を複数の音節に分解し、各音節に対応した音素を順次合成する歌唱音声合成装置がある。
【０００３】
【発明が解決しようとする課題】
しかしながら、従来の歌唱音声合成装置においては、歌唱音声の合成に際して歌詞を分解した各音節に対応する音素のみが合成されており、歌詞と歌詞との間で息継ぎをする際に発せられる吸気音（ブレス音）については考慮されていない。このため、発声区間と非発声区間との差が顕著になりすぎてしまい、これらの区間の切り替わりが唐突な印象を与えてしまうことから、非人間的、かつ不自然であるという評価にもつながっていた。
【０００４】
この発明は、上述した事情に鑑みてなされたものであり、より自然な歌唱音声を合成できる歌唱音声合成方法および歌唱音声合成装置を提供することを目的とする。
【０００５】
【課題を解決するための手段】
この発明は、時系列的な歌唱音の合成指示に従い、歌唱音を順次合成する歌唱音合成過程と、時間的に前後した２つの歌唱音の合成指示の間に吸気音の合成指示が与えられた場合に、少なくとも当該吸気音の直後の歌唱音の音素が関与した選択方法に従って、吸気音を決定付けるパラメータを選択し、該パラメータを用いて吸気音を合成する吸気音合成過程とを具備することを特徴とする歌唱音声合成方法を提供する。
【０００６】
好ましい態様において、前記吸気音合成過程では、当該吸気音の直後の歌唱音の音素と当該吸気音の直前の歌唱音の音素の両方が関与した選択方法に従って、前記吸気音を決定付けるパラメータを選択する。
【０００７】
また、好ましい態様において、前記吸気音合成過程では、当該吸気音の直後の歌唱音の音素と当該吸気音の直前の歌唱音の音素の組み合わせに基づいて、記憶手段に予め記憶された複数種類の吸気音の波形データの中から１種類の吸気音の波形データを選択し、前記吸気音を決定付けるパラメータとして用いる。
また、好ましい態様において、前記吸気音合成過程では、当該吸気音の直後の歌唱音の音素に応じて、当該吸気音の振幅を制御する。
【０００８】
この発明は、以上掲げたような歌唱音声合成方法として実施される他、これらの方法に従って、歌唱音と吸気音を含んだ歌唱音声を合成する歌唱音声合成装置を生産しあるいは譲渡するといった態様でも実施される。
【０００９】
【発明の実施の形態】
以下、図面を参照して本発明の実施形態について説明する。
図１はこの発明の一実施形態に係る歌唱音声合成装置の構成を示すブロック図である。図１に示すように、この歌唱音声合成装置は、歌唱情報解析部１０と、歌唱音合成部２０と、歌唱音素片記憶部３０と、ブレス音合成部４０と、ブレス音素片記憶部５０と、加算器６０とを有する。
【００１０】
歌唱情報解析部１０は、時系列の歌唱情報を解析する装置である。好ましい態様において、この歌唱情報は通信手段を介して歌唱情報解析部１０に入力される。また、別の好ましい態様においては、ハードディスクなどの記憶手段から歌唱情報が読み出され、歌唱情報解析部１０に入力される。
【００１１】
図２には、ある曲の歌唱情報が例示されている。図２に示すように、歌唱情報は、曲を表す一連の音符＃１、＃２、…の各々に対応した情報セグメントにより構成されている。そして、１つの音符に対応した情報セグメントは、その音符の音高を示す音高情報、その音符の符長を示す符長情報およびその音符に合わせて発音すべき１または複数の音節を表す歌詞情報を含み、さらに、その音符に適用されるべきテンポ情報、ダイナミックス情報およびビブラート情報を含んでいる。図２に示す例において、音符＃４は、符長１／４、つまり、４分音符の符長を有する休符である。この休符のタイミングにおいて、発音すべき音節はなく、歌唱者は息継ぎを行う。このため、音符＃４に対応した歌詞情報として、ブレス音の合成を指示するブレスマーク＄が用いられている。これらの歌唱情報は、合成する歌唱音声の歌唱者の情報である歌唱者情報に対応していてもよい。
【００１２】
図３に示すように、歌唱情報解析部１０は、歌詞フィルタ１１と、言語処理部１２と、シーケンサ１３とを有している。
【００１３】
歌詞フィルタ１１は、歌詞情報中にブレスマーク＄がある場合に、そのブレスマーク＄に応じて発音すべきブレス音を特定するブレス制御情報を生成し、歌詞情報を言語処理部１２へ、ブレス制御情報をシーケンサ１３へ出力する。ここで、歌詞が日本語である場合を例にブレス制御情報の生成方法を説明すると、次の通りである。例えば図２に示す歌詞情報は、音符＃４に対応したブレスマーク＄の前後に、音節情報「た」と音節情報「さ」を有している。ここで、ブレスマーク＄の直前の音節“た”は２個の音素／ｔ／および／ａ／に分解することができ、ブレスマーク＄の直後の音節“さ”は２個の音素／ｓ／および／ａ／に分解することができる。そこで、歌詞フィルタ１１は、ブレスマーク＄の前の音節“た”の最後の音素／ａ／を表す先行音素記号と、ブレスマーク＄の後の音節“さ”の最初の音素／ｓ／を表す後続音素記号を生成し、それらの組をブレス制御情報として出力するのである。
【００１４】
言語処理部１２は、歌詞フィルタ１１から与えられる歌詞情報中の音節情報を音素記号に分解して出力する。
【００１５】
シーケンサ１３は、曲の進行に合わせて、歌唱情報中の各情報、歌詞フィルタ１１によって生成されるブレス制御情報および言語処理部１２によって生成される音素記号を歌唱音合成部２０またはブレス音合成部４０に供給するためのタイミング制御を行う装置である。
【００１６】
図１において、歌唱音素片記憶部３０は、歌唱音素片データベースを記憶している。この歌唱音素片データベースは、人によって発声される各種の音声波形を収集し、これらの音声波形を音素の波形に分割し、各音素波形を符号化することにより得られたデータの集合体である。各音素の波形データは、その音素の音素記号をキーとして歌唱音素片データベースから読み出すことができる。音声波形の波形データは、例えばＬＰＣ（Ｌｉｎｅａｒ　Ｐｒｅｄｉｃｔｉｖｅ　Ｃｏｄｉｎｇ：線形予測分析）合成技術、波形重畳合成技術、フォルマント合成技術等が利用して得られたものでもよい。
【００１７】
歌唱音合成部２０は、上述した歌詞情報中の音節情報に対応した歌唱音を合成する装置である。ある音符に対応した音節の歌唱音を合成すべきとき、シーケンサ１３は、この歌唱音合成部２０に対し、その音符に対応した音高情報と、符長情報と、テンポ情報と、ダイナミックス情報と、ビブラート情報を供給する。また、シーケンサ１３は、その音符に対応した音節情報から得られた音素記号を言語処理部１２から受け取り、歌唱音合成部２０に供給する。さらに、シーケンサ１３は、その音符に対応した符長情報を符長情報として歌唱音合成部２０に供給する。これに応じて、歌唱音合成部２０は、ピッチエンベロープと振幅エンベロープをを生成する。そして、歌唱音合成部２０は、シーケンサ１３から受け取った音素記号により指示された音素の波形データを、時々刻々と変化するピッチエンベロープの瞬時値に応じた読み出し速度で、歌唱音素片記憶部３０から読み出し、振幅エンベロープにより振幅変調し、歌唱音波形として出力する。ピッチエンベロープおよび振幅エンベロープは、シーケンサ１３から与えられたテンポ情報と符長情報によって定まる時間だけ持続する。また、ピッチエンベロープの波形は、音高情報およびビブラート情報により決定され、振幅エンベロープの波形は、ダイナミックス情報により決定される。
【００１８】
ブレス音素片記憶部５０には、予め人が発したブレス音の波形を表すデータの集合体であるブレス音データベースが記憶されている。ブレス音の波形データは、音素波形の波形データと同様に種々の合成技術等を利用して得られたものであってもよい。ブレス音合成部４０には、歌詞情報中のブレスマーク＄のタイミングにおいて、ブレス音データベースを参照してブレス音を合成する装置である。
【００１９】
図４はブレス音素片記憶部５０およびブレス音合成部４０の構成を示すブロック図である。本実施形態では、複数の歌唱者の各々についてブレス音データベースがブレス音素片記憶部５０に記憶されている。各ブレス音データベースは、複数種類のブレス音の波形データの集合体である。人から発声されるブレス音の波形は、そのブレス音の直後の音素の影響を強く受ける。また、ブレス音の波形には、その直前に発声された音素の影響も現れる。そこで、本実施形態では、あるブレスマーク＄の発生に応じてブレス音を合成する場合に、そのブレスマーク＄の直前の先行音素記号と直後の後続音素記号の組み合わせに応じてブレス音の音素波形を決定し、その音素波形を用いてブレス音を合成する。このようなブレス音の合成を可能にするため、本実施形態におけるブレス音データベースは、先行音素記号と後続音素記号の可能な組み合わせのすべてについて、ブレス音の音素波形の波形データを含んでいる。
【００２０】
ブレス音素片選択部４１には、歌唱者情報が与えられる。好ましい態様において、この歌唱者情報は、図示しない操作部から入力される。ブレス音素片選択部４１は、ブレス音素片記憶部５０に記憶された複数の歌唱者のブレス音データベースの中から歌唱者情報によって指定されたものを選択する。また、ブレスマーク＄に対応したタイミングにおいて、シーケンサ１３は、ブレス制御情報を出力する。ブレス音素片選択部４１は、このブレス制御情報中の先行音素記号および後続音素記号の組み合わせに対応したブレス音の音素の波形データを、選択したブレス音データベースの中から読み出し、ブレス音振幅制御部４２に出力する。
【００２１】
ブレス音振幅制御部４２は、ブレス制御情報中の後続音素記号に基づいて、ブレス音素片選択部４１から出力されたブレス音の波形データの振幅を制御する。さらに詳述すると、ブレス音振幅制御部４２は、後続音素記号が特定の音素、具体的には母音を表している場合に、ブレス音波形がその終期付近において急激に立ち上がり、その後に急激に減衰するように、波形データに振幅変調処理を施す。
【００２２】
ブレス区間長計算部４３には、シーケンサ１３から符長情報とテンポ情報が与えられる。ブレス区間長計算部４３は、符長情報とテンポ情報に基づいて休符の実時間長ｔを求め、これを所定の内分比によりブレス音長ｔ’と無音区間長ｔｓとに分ける。ここで、ｔ、ｔ’、ｔｓの間には、
ｔ＝ｔ’＋ｔｓ
の関係がある。
【００２３】
ブレス音音長制御部４４には、ブレス音長ｔ’がブレス区間長計算部４３から通知される。ブレス音音長制御部４４は、ブレス音振幅制御部４２によって振幅が制御されたブレス音の波形データを受け取ると、ブレス音波形の持続時間がこのブレス音長ｔ’に相当する期間となるように、波形データの調整を行う。好ましい態様において、この調整は、ブレス音波形の前縁部分と後縁部分（すなわち、上記振幅変調処理の対象となる部分）との間の中間部分の波形データを一旦出力した後、再度、この中間部分を１または複数回出力してブレス音波形の持続時間を長くしたり、あるいはその中間部分を間引くことにより持続時間を短くするという方法により行われる。
【００２４】
無音区間付加部４５は、ブレス音音長制御部４４から出力されたブレス音長ｔ’のブレス音の波形データをそのまま出力するとともに、これに続けて、無音区間ｔｓに相当する期間、無音状態を表す波形データを出力する。
【００２５】
図１における加算器６０は、このようにして無音区間付加部４５から出力されるブレス音の波形データと、歌唱音合成部２０から出力される歌唱音の波形データとを加算し、歌唱合成音の波形データとして出力する。この波形データは、図示しないＤ／Ａ変換器、アンプおよびスピーカを介することにより歌唱音声として出力される。
【００２６】
以下、図５に示すタイムチャートを参照し、本実施形態の動作を説明する。図示のような歌詞情報、音高情報、符長情報およびその他の情報が与えられた場合、歌詞フィルタ１１は、歌詞情報を先頭から順に読み、ブレスマーク＄を発見した場合、ブレスマーク＄の直前直後の各音素を表す先行音素記号と後続音素記号とを求め、これらの情報によりブレス制御情報を構成する。また、言語処理部１２は、歌詞情報中の音節情報を音素記号に分解する。なお、歌詞フィルタ１１および言語処理部１２は、１曲分の歌詞情報を取得したときに、それらの全てを対象として以上の処理を一括して行い、音素記号列とブレス制御情報を生成してもよい。あるいは歌詞フィルタ１１および言語処理部１２は、シーケンサ１３によって行われる歌唱音またはブレス音の合成のためのタイミング制御に対し、例えば音符１個分だけ進んだ位相で以上の処理を逐次実行してもよい。要するに、音素記号およびブレス制御情報の生成は、シーケンサ１３がそれらの情報を必要とするときまでに行われればよい。
【００２７】
歌唱音声の合成を開始するとき、シーケンサ１３は、最初の音符に対応した音高情報、符長情報、テンポ情報、ダイナミックス情報、ビブラート情報を歌唱データから取り込むとともに、最初の音符に対応した音節の音素記号を言語処理部１２から取り込む。
【００２８】
図５に示す例では、音高が“ド”であり、符長が４分の１拍である最初の音符の音高情報および符長情報とこれに適用されるテンポ、ダイナミックス、ビブラートの各情報がシーケンサ１３に取り込まれる。また、最初の音符に合わせて発声する音節“さ”を分解した音素の音素記号／ｓ／および／ａ／が言語処理部１２から出力され、シーケンサ１３に取り込まれる。なお、この最初の音符に対応した情報の送信時、シーケンサ１３に送るべきブレス制御情報はない。
【００２９】
このようにして最初の音符に対応した各情報を取得すると、シーケンサ１３は、音素記号／ｓ／および／ａ／を歌唱音合成部２０に送る。同時にシーケンサ１３は、その音符の音高情報“ド”、符長情報「１／４」、テンポ情報、ダイナミックス情報、ビブラート情報を歌唱音合成部２０に送る。
【００３０】
この結果、音素記号／ｓ／および／ａ／に対応した音素の波形データが歌唱音素片記憶部３０から読み出され、音高が“ド”である音節“さ”の歌唱音の波形データが歌唱音合成部２０から出力され、加算器６０を介することにより歌唱音として出力される。
【００３１】
以上の動作が行われている間、シーケンサ１３は、最初の音符に対応したテンポ情報と符長情報「１／４」により決定される時間の計時を行う。そして、計時が終了したときに、後続の音符に対応した各情報を取り込むのである。
【００３２】
図５に示す例では、２番目の音符と３番目の音符についても以上の同様な動作が行われる。そして、３番目の音符の符長に対応した計時が終了すると、シーケンサ１３は、３番目の音符の次の休符に対応した符長情報と、テンポ情報と、ダイナミックス情報とを歌唱データから取り込むとともに、ブレスマーク＄に応じて生成したブレス制御情報を歌詞フィルタ１１から取り込む。そして、シーケンサ１３は、取り込んだ各情報をブレス音合成部４０に送り、４分の１拍相当の時間の計時を開始する。
【００３３】
ブレス音合成部４０は、ブレス制御情報により特定されるブレス音を合成する。この例の場合、ブレス制御情報は、先行音素記号／ａ／および後続音素記号／ｓ／を含んでいる。これらのうち先行音素記号／ａ／は、図５において休符の直前に発声する音節“た”の最後の音素を表しており、後続音素記号／ｓ／は休符の直後に発声する音節“さ”の最初の音素を表している。ブレス音合成部４０のブレス音素片選択部４１は、これらの先行音素記号／ａ／および後続音素記号／ｓ／の組み合わせに対応したブレス音の音素の波形データを、歌唱者情報により選択されたブレス音データベースの中から読み出し、ブレス音振幅制御部４２に出力する。
【００３４】
ブレス音振幅制御部４２は、ブレス制御情報中の後続音素記号に基づいて、ブレス音素片選択部４１から出力されたブレス音の波形データの振幅変調を行う。そして、ブレス区間長計算部４３には、シーケンサ１３からの符長情報とテンポ情報に基づいて休符の実時間長ｔを求め、これからブレス音長ｔ’と無音区間長ｔｓとを求める。例えば、テンポ情報が１分間に４分音符１１０個分の歌唱が行われるような速度を示しており、ブレス符長情報が１／４拍、つまり４分休符である場合には、休符の実時間長ｔは、“６０／１１０秒＝５４５ｍｓ”となる。ブレス区間長計算部４３は、この実時間長ｔを所定の比で内分し、ブレス音長ｔ’と無音区間長ｔｓを求める。好ましい態様において、この比は例えば９：１である。この場合、ブレス音長ｔ’は４９０ｍｓ、無音区間長ｔｓは５５ｍｓとなる。
【００３５】
ブレス音音長制御部４４および無音区間付加部４５は、ブレス区間長計算部４３の計算結果に従い、ブレス音振幅制御部４２から受け取ったブレス音の波形データを用いて、ブレス音長ｔ’相当の時間継続し、後は無音状態となるブレス音の波形データを生成する。
【００３６】
このようにして得られたブレス音の波形データがブレス音合成部４０から出力され、加算器６０を介することによりブレス音として出力される。
【００３７】
このブレス音の合成の後は、図５において休符の後の音節“さ”“い”等の歌唱音の合成が行われるが、それらの動作は既に説明したものと同様なので説明を省略する。
【００３８】
図６は本実施形態の効果を説明するものである。図６（ａ）に示されるブレス音波形Ｋ１は、ブレス音の直前および直後に発音される音素の音素記号がともに／ａ／である。一方、図６（ｂ）に示されるブレス音波形Ｋ２は、ブレス音の直前に発音される音素の音素記号が／ｅ／であり、ブレス音の直後に発音される音素の音素記号が／ｔ／である。ブレス音波形Ｋ１の始期部分Ｆ１およびブレス音波形Ｋ２の始期部分Ｆ２は、先行音素の影響を受け、特に先行音素の音色による影響が反映される。ブレス音波形Ｋ１の終期部分Ｂ１およびブレス音波形Ｋ２の終期部分Ｂ２は、後続音素の影響を受け、後続音素が有声音であるか無声子音であるかにより受ける影響が異なる。例えば、後続音素が有声音、特に母音・鼻音である場合には、図６（ａ）に示されるように、ブレス音波形Ｋ１の終期部分Ｂ１の振幅が急激に増大して減衰するという特徴が見受けられる。また、例えば、後続音素が無声子音である場合には、図６（ｂ）に示されるように、ブレス音波形Ｋ２の終期部分Ｂ２の振幅がゆるやかに減衰する。
【００３９】
本実施形態によれば、ブレス音の直前の先行音素と直後の後続音素との組み合わせによりブレス音の波形を選択するので、以上のような現象を再現し、自然なブレス音を合成することができる。
【００４０】
本実施形態には、次のような変形例が考えられる。
＜変形例１＞
上記実施形態では、ブレス音の直後の音節に関しては、最初の音素のみを考慮してブレス音の波形の制御を行った。これに対し、本変形例では、ブレス音の直後の連続した２個の音素の組み合わせが特定の組み合わせである場合、ブレス音波形の後縁の部分を急激に立ち上げ急激に減衰させる振幅変調を行う。特定の組み合わせとは、例えば図６（ａ）に示されるような、母音／ａ／の後に鼻音／ｎ／が続くような組み合わせである。本変形例によれば、より自然なブレス音を合成することができる。
【００４１】
＜変形例２＞
本変形例では、ブレス音の直後に発音される音素の音素記号のみに基づいて、ブレス音波形を選択する。本変形例によれば、ブレス音データベースのデータ量を削減することが可能になる。
【００４２】
＜変形例３＞
本変形例における歌唱音声合成装置は、ブレス音採否制御部を有している。これは、歌唱音声を合成するに当たって、歌唱音声にブレス音を含めるか否かの切り換え制御を行う装置である。好ましい態様においては、歌唱情報にこの切り換え制御のための制御情報が含まれている。この態様において、ブレス音採否制御部は、この歌唱情報に含まれる制御情報に基づいて歌唱音声にブレス音を含めるか否かの切り換えを行う。また、別の好ましい態様において、ブレス音採否制御部は、図示しない操作部から与えられる指令に従い、歌唱音声にブレス音を含めるか否かの切り換えを行う。
【００４３】
【発明の効果】
以上説明したように本発明によれば、歌唱音に続けてブレス音を合成する場合に、少なくともそのブレス音の直後の歌唱音の音素に基づいてブレス音波形を決定付けるパラメータを選択するので、より自然な歌唱音声を合成することができる。
【図面の簡単な説明】
【図１】この発明の一実施形態に係る歌唱音声合成装置の構成を示すブロック図である。
【図２】同実施形態において取り扱う歌唱情報を示す図である。
【図３】同実施形態における歌唱情報解析部の構成を示すブロック図である。
【図４】同実施形態におけるブレス音記憶部およびブレス音合成部の構成を示すブロック図である。
【図５】同実施形態の動作を示すフローチャートである。
【図６】同実施形態の効果を説明する図である。
【符号の説明】
１０……歌唱情報解析部、２０……歌唱音合成部、３０……歌唱音素片記憶部、４０……ブレス音合成部、５０……ブレス音素片記憶部、６０……加算器。

[0001]
[Technical Field of the Invention]
The present invention relates to a method and an apparatus for synthesizing a singing voice in which singing sounds and inhalation sounds are mixed.
[0002]
[Prior Art]
Various speech synthesis techniques for artificially generating speech have been proposed. One device that uses such techniques is the singing voice synthesizer, which decomposes lyrics into a plurality of syllables and sequentially synthesizes the phonemes corresponding to each syllable.
[0003]
[Problems to Be Solved by the Invention]
However, conventional singing voice synthesizers synthesize only the phonemes corresponding to the syllables obtained by decomposing the lyrics, and do not consider the inhalation sound (breath sound) produced when the singer takes a breath between passages of the lyrics. As a result, the difference between voiced sections and non-voiced sections becomes too pronounced, and the switch between these sections gives an abrupt impression, which has led to the synthesized voice being judged inhuman and unnatural.
[0004]
The present invention has been made in view of the above-described circumstances, and has as its object to provide a singing voice synthesizing method and a singing voice synthesizing apparatus capable of synthesizing a more natural singing voice.
[0005]
[Means for Solving the Problems]
The present invention provides a singing voice synthesis method comprising: a singing sound synthesis step of sequentially synthesizing singing sounds in accordance with time-series singing sound synthesis instructions; and an inhalation sound synthesis step of, when an inhalation sound synthesis instruction is given between two temporally successive singing sound synthesis instructions, selecting a parameter that determines the inhalation sound in accordance with a selection method involving at least the phoneme of the singing sound immediately after the inhalation sound, and synthesizing the inhalation sound using that parameter.
[0006]
In a preferred aspect, in the inhalation sound synthesis step, the parameter that determines the inhalation sound is selected in accordance with a selection method involving both the phoneme of the singing sound immediately after the inhalation sound and the phoneme of the singing sound immediately before it.
[0007]
In a preferred aspect, in the inhalation sound synthesis step, one type of inhalation sound waveform data is selected from a plurality of types of inhalation sound waveform data stored in advance in storage means, based on the combination of the phoneme of the singing sound immediately after the inhalation sound and the phoneme of the singing sound immediately before it, and the selected waveform data is used as the parameter that determines the inhalation sound.
In a preferred aspect, in the inhalation sound synthesis step, the amplitude of the inhalation sound is controlled according to the phoneme of the singing sound immediately after the inhalation sound.
[0008]
In addition to being implemented as the singing voice synthesis method described above, the present invention may also be implemented in the form of producing or transferring a singing voice synthesizing apparatus that synthesizes, in accordance with these methods, a singing voice including singing sounds and inhalation sounds.
[0009]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
FIG. 1 is a block diagram showing the configuration of a singing voice synthesizer according to one embodiment of the present invention. As shown in FIG. 1, the singing voice synthesizer includes a singing information analyzing unit 10, a singing sound synthesizing unit 20, a singing phoneme segment storage unit 30, a breath sound synthesizing unit 40, a breath phoneme storage unit 50, and an adder 60.
[0010]
The singing information analyzing unit 10 is a device that analyzes time-series singing information. In a preferred embodiment, the singing information is input to the singing information analyzing unit 10 via communication means. In another preferred embodiment, the singing information is read from a storage unit such as a hard disk and is input to the singing information analyzing unit 10.
[0011]
FIG. 2 illustrates the singing information of a certain song. As shown in FIG. 2, the singing information is composed of information segments corresponding to each of a series of notes #1, #2, ... that represent the song. The information segment corresponding to one note includes pitch information indicating the pitch of the note, note length information indicating the note length of the note, and lyrics information representing one or more syllables to be pronounced with the note, and further includes tempo information, dynamics information, and vibrato information to be applied to the note. In the example shown in FIG. 2, note #4 is a rest having a note length of 1/4, that is, the length of a quarter note. At the timing of this rest there is no syllable to be pronounced, and the singer takes a breath. For this reason, a breath mark $, which instructs the synthesis of a breath sound, is used as the lyrics information corresponding to note #4. The singing information may be associated with singer information, which is information about the singer of the singing voice to be synthesized.
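As an illustration only, the information segments described above could be modeled as follows. This is a minimal Python sketch: the class name, the field names, the pitch labels, the first two syllables, and the numeric values for tempo, dynamics, and vibrato are assumptions and do not appear in the specification. Only the quarter rest carrying the breath mark $ at note #4, preceded by "ta" and followed by "sa", follows the text.

```python
from dataclasses import dataclass
from fractions import Fraction
from typing import Optional

BREATH_MARK = "$"  # breath mark used as lyrics information for a rest


@dataclass
class NoteSegment:
    """One information segment of the singing information (hypothetical layout)."""
    pitch: Optional[str]    # pitch information; None for a rest
    note_length: Fraction   # note length, e.g. Fraction(1, 4) for a quarter note
    lyrics: str             # syllable(s) to pronounce, or the breath mark
    tempo: int = 110        # quarter notes per minute (illustrative value)
    dynamics: float = 1.0   # illustrative scalar
    vibrato: float = 0.0    # illustrative scalar

    def is_breath(self) -> bool:
        """True when this segment instructs the synthesis of a breath sound."""
        return self.lyrics == BREATH_MARK


# Notes #1 to #5; note #4 is the quarter rest with the breath mark.
song = [
    NoteSegment("C4", Fraction(1, 4), "sa"),
    NoteSegment("D4", Fraction(1, 4), "i"),
    NoteSegment("E4", Fraction(1, 4), "ta"),
    NoteSegment(None, Fraction(1, 4), BREATH_MARK),
    NoteSegment("C4", Fraction(1, 4), "sa"),
]
```

A sequencer built on such a structure could route each segment to the singing sound path or the breath sound path by checking `is_breath()`.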
[0012]
As shown in FIG. 3, the singing information analyzing unit 10 includes a lyrics filter 11, a language processing unit 12, and a sequencer 13.
[0013]
When a breath mark $ is present in the lyrics information, the lyrics filter 11 generates breath control information specifying the breath sound to be produced in response to that breath mark $, and outputs the lyrics information to the language processing unit 12 and the breath control information to the sequencer 13. The method of generating the breath control information is described below, taking the case where the lyrics are in Japanese as an example. For example, the lyrics information shown in FIG. 2 has the syllable information “ta” and the syllable information “sa” before and after the breath mark $ corresponding to note #4. Here, the syllable “ta” immediately before the breath mark $ can be decomposed into the two phonemes /t/ and /a/, and the syllable “sa” immediately after the breath mark $ can be decomposed into the two phonemes /s/ and /a/. The lyrics filter 11 therefore generates a preceding phoneme symbol representing the last phoneme /a/ of the syllable “ta” before the breath mark $ and a succeeding phoneme symbol representing the first phoneme /s/ of the syllable “sa” after the breath mark $, and outputs the pair as the breath control information.
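The generation of breath control information described above can be sketched as follows. This is a hedged Python illustration, not the patented implementation: the function name, the romanized syllable table, and the three-syllable lyric sequence before the breath mark are assumptions chosen to reproduce the “ta” / “sa” example.

```python
from typing import List, Optional, Tuple

BREATH_MARK = "$"

# Hypothetical syllable-to-phoneme table covering only the syllables used here.
SYLLABLE_TO_PHONEMES = {
    "sa": ["s", "a"],
    "i": ["i"],
    "ta": ["t", "a"],
}


def breath_control_info(
    lyrics: List[str],
) -> List[Tuple[int, Optional[str], Optional[str]]]:
    """Return (index, preceding phoneme, succeeding phoneme) for each breath mark."""
    result = []
    for i, token in enumerate(lyrics):
        if token != BREATH_MARK:
            continue
        # Nearest syllable before the breath mark, if any.
        prev = next((t for t in reversed(lyrics[:i]) if t != BREATH_MARK), None)
        # Nearest syllable after the breath mark, if any.
        nxt = next((t for t in lyrics[i + 1:] if t != BREATH_MARK), None)
        # Last phoneme of the preceding syllable, first phoneme of the next one.
        preceding = SYLLABLE_TO_PHONEMES[prev][-1] if prev is not None else None
        succeeding = SYLLABLE_TO_PHONEMES[nxt][0] if nxt is not None else None
        result.append((i, preceding, succeeding))
    return result


print(breath_control_info(["sa", "i", "ta", "$", "sa", "i"]))
# [(3, 'a', 's')]
```

For the breath mark at index 3, the pair is /a/ (last phoneme of “ta”) and /s/ (first phoneme of “sa”), matching the example in the text.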
[0014]
The language processing unit 12 decomposes syllable information in the lyrics information provided from the lyrics filter 11 into phoneme symbols and outputs the phoneme symbols.
[0015]
The sequencer 13 is a device that performs timing control for supplying, in step with the progress of the song, each piece of information in the singing information, the breath control information generated by the lyrics filter 11, and the phoneme symbols generated by the language processing unit 12 to the singing sound synthesizing unit 20 or the breath sound synthesizing unit 40.
[0016]
In FIG. 1, the singing phoneme segment storage unit 30 stores a singing phoneme segment database. This database is a collection of data obtained by collecting various speech waveforms uttered by a person, dividing these speech waveforms into phoneme waveforms, and encoding each phoneme waveform. The waveform data of each phoneme can be read from the singing phoneme segment database using the phoneme symbol of that phoneme as a key. The waveform data of the speech waveforms may be obtained by using, for example, an LPC (Linear Predictive Coding) synthesis technique, a waveform superposition synthesis technique, a formant synthesis technique, or the like.
[0017]
The singing sound synthesizing unit 20 is a device that synthesizes singing sounds corresponding to the syllable information in the lyrics information described above. When the singing sound of a syllable corresponding to a certain note is to be synthesized, the sequencer 13 supplies the singing sound synthesizing unit 20 with the pitch information, note length information, tempo information, dynamics information, and vibrato information corresponding to that note. The sequencer 13 also receives the phoneme symbols obtained from the syllable information corresponding to the note from the language processing unit 12 and supplies them to the singing sound synthesizing unit 20. In response, the singing sound synthesizing unit 20 generates a pitch envelope and an amplitude envelope. It then reads the waveform data of the phonemes indicated by the phoneme symbols received from the sequencer 13 from the singing phoneme segment storage unit 30 at a reading speed corresponding to the instantaneous value of the constantly changing pitch envelope, amplitude-modulates the data with the amplitude envelope, and outputs the result as a singing sound waveform. The pitch envelope and the amplitude envelope last for the time determined by the tempo information and the note length information supplied from the sequencer 13. The shape of the pitch envelope is determined by the pitch information and the vibrato information, and the shape of the amplitude envelope is determined by the dynamics information.
[0018]
The breath phoneme storage unit 50 stores a breath sound database, which is a collection of data representing the waveforms of breath sounds produced in advance by a person. Like the waveform data of the phoneme waveforms, the waveform data of the breath sounds may be obtained by using various synthesis techniques or the like. The breath sound synthesizing unit 40 is a device that synthesizes a breath sound with reference to the breath sound database at the timing of a breath mark $ in the lyrics information.
[0019]
FIG. 4 is a block diagram showing the configurations of the breath phoneme storage unit 50 and the breath sound synthesizing unit 40. In the present embodiment, a breath sound database is stored in the breath phoneme storage unit 50 for each of a plurality of singers. Each breath sound database is a collection of waveform data for a plurality of types of breath sounds. The waveform of a breath sound produced by a person is strongly affected by the phoneme immediately after the breath sound. The influence of the phoneme uttered immediately before also appears in the breath sound waveform. Therefore, in the present embodiment, when a breath sound is synthesized in response to a breath mark $, the phoneme waveform of the breath sound is determined according to the combination of the preceding phoneme symbol immediately before the breath mark $ and the succeeding phoneme symbol immediately after it, and the breath sound is synthesized using that phoneme waveform. To make this possible, the breath sound database in the present embodiment contains waveform data of a breath sound phoneme waveform for every possible combination of preceding and succeeding phoneme symbols.
[0020]
Singer information is given to the breath phoneme selection unit 41. In a preferred embodiment, the singer information is input from an operation unit (not shown). The breath phoneme selection unit 41 selects the breath sound database specified by the singer information from among the breath sound databases of the plurality of singers stored in the breath phoneme storage unit 50. At the timing corresponding to a breath mark $, the sequencer 13 outputs the breath control information. The breath phoneme selection unit 41 reads, from the selected breath sound database, the waveform data of the breath sound phoneme corresponding to the combination of the preceding phoneme symbol and the succeeding phoneme symbol in the breath control information, and outputs it to the breath sound amplitude control unit 42.
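The two-stage selection described above (first a database per singer, then a waveform per phoneme pair) can be pictured as a nested lookup. The following Python sketch is only an assumption about one possible layout: the singer identifiers, the dictionary structure, and the placeholder sample values are hypothetical, and real entries would hold recorded breath sound data.

```python
from typing import Dict, List, Tuple

Waveform = List[float]
# One breath sound database per singer, keyed by (preceding, succeeding) phoneme.
BreathDatabase = Dict[Tuple[str, str], Waveform]

# Placeholder contents standing in for the breath phoneme storage unit 50.
BREATH_STORAGE: Dict[str, BreathDatabase] = {
    "singer_a": {
        ("a", "s"): [0.0, 0.1, 0.2, 0.1, 0.0],
        ("a", "a"): [0.0, 0.2, 0.4, 0.6, 0.1],
    },
    "singer_b": {
        ("a", "s"): [0.0, 0.05, 0.1, 0.05, 0.0],
    },
}


def select_breath_waveform(singer: str, preceding: str, succeeding: str) -> Waveform:
    """Pick the breath waveform for a singer and a phoneme pair."""
    database = BREATH_STORAGE[singer]  # database chosen by the singer information
    # Return a copy so later stages can modify it without touching the database.
    return list(database[(preceding, succeeding)])


wave = select_breath_waveform("singer_a", "a", "s")
```

A missing singer or phoneme pair raises `KeyError` in this sketch; the text avoids that case by storing a waveform for every possible combination.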
[0021]
The breath sound amplitude control unit 42 controls the amplitude of the breath sound waveform data output from the breath phoneme selection unit 41 based on the succeeding phoneme symbol in the breath control information. More specifically, when the succeeding phoneme symbol represents a specific phoneme, namely a vowel, the breath sound amplitude control unit 42 applies amplitude modulation to the waveform data so that the breath sound waveform rises sharply near its end and then decays sharply.
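One way to picture this amplitude modulation is a gain envelope applied only to the tail of the waveform. The sketch below is an assumption-laden illustration: the specification does not give the envelope shape, so the triangular gain curve, the `tail_fraction` and `peak_gain` parameters, and the vowel set are all hypothetical choices.

```python
from typing import List

VOWELS = {"a", "i", "u", "e", "o"}  # assumed vowel set for Japanese lyrics


def shape_breath_tail(wave: List[float], succeeding: str,
                      tail_fraction: float = 0.25,
                      peak_gain: float = 2.0) -> List[float]:
    """Make the tail rise sharply and then decay sharply when a vowel follows."""
    if succeeding not in VOWELS or len(wave) < 2:
        return list(wave)  # other phonemes: leave the waveform unchanged
    tail_len = max(2, int(len(wave) * tail_fraction))
    start = len(wave) - tail_len
    shaped = list(wave)
    for k in range(tail_len):
        p = k / (tail_len - 1)  # position within the tail, from 0.0 to 1.0
        if p <= 0.5:
            gain = 1.0 + (peak_gain - 1.0) * (p / 0.5)  # rise from 1 to peak_gain
        else:
            gain = peak_gain * (1.0 - p) / 0.5          # fall from peak_gain to 0
        shaped[start + k] = wave[start + k] * gain
    return shaped


print(shape_breath_tail([1.0] * 20, "a")[-5:])
# [1.0, 1.5, 2.0, 1.0, 0.0]
```

With a constant input of 20 samples, only the last five samples are reshaped; a succeeding /s/ would return the input unchanged.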
[0022]
The breath section length calculation unit 43 receives the note length information and the tempo information from the sequencer 13. Based on these, it calculates the actual time length t of the rest and divides it, at a predetermined internal division ratio, into a breath sound length t' and a silent section length ts. Here, t, t', and ts satisfy the relationship
t = t' + ts
[0023]
The breath sound length control unit 44 is notified of the breath sound length t' by the breath section length calculation unit 43. On receiving the breath sound waveform data whose amplitude has been controlled by the breath sound amplitude control unit 42, the breath sound length control unit 44 adjusts the waveform data so that the duration of the breath sound waveform equals the breath sound length t'. In a preferred embodiment, this adjustment operates on the intermediate portion between the leading edge portion and the trailing edge portion (that is, the portion subjected to the amplitude modulation described above) of the breath sound waveform: the duration is lengthened by outputting the waveform data of the intermediate portion once and then outputting it again one or more times, or shortened by thinning out the intermediate portion.
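The length adjustment described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name and the `head_len` / `tail_len` parameters are hypothetical, lengths are counted in samples, and thinning is done by keeping evenly spaced samples, which is only one possible reading of "thinning out". Plain repetition may also leave audible discontinuities that a real implementation would need to smooth.

```python
from typing import List


def fit_breath_length(wave: List[float], target_len: int,
                      head_len: int, tail_len: int) -> List[float]:
    """Repeat or thin the middle portion so the result has target_len samples."""
    head = wave[:head_len]                          # leading edge, kept as is
    tail = wave[len(wave) - tail_len:]              # trailing edge, kept as is
    middle = wave[head_len:len(wave) - tail_len]    # portion that is adjusted
    need = target_len - len(head) - len(tail)
    if need < 0 or not middle:
        raise ValueError("target too short or no middle portion to adjust")
    if need >= len(middle):
        # Lengthen: output the middle once, then again until enough samples exist.
        repeats = -(-need // len(middle))  # ceiling division
        new_middle = (middle * repeats)[:need]
    else:
        # Shorten: keep evenly spaced samples of the middle portion.
        new_middle = [middle[(i * len(middle)) // need] for i in range(need)]
    return head + new_middle + tail


wave = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(fit_breath_length(wave, 16, head_len=2, tail_len=2))
# [0, 1, 2, 3, 4, 5, 6, 7, 2, 3, 4, 5, 6, 7, 8, 9]
print(fit_breath_length(wave, 7, head_len=2, tail_len=2))
# [0, 1, 2, 4, 6, 8, 9]
```

In both calls the first two and last two samples are untouched, so the shaped trailing edge survives the length change.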
[0024]
The silent section addition unit 45 outputs the breath sound waveform data of breath sound length t' received from the breath sound length control unit 44 as is, and then outputs waveform data representing silence for a period corresponding to the silent section length ts.
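Appending the silent section amounts to padding the breath waveform with zero-valued samples. The following minimal Python sketch assumes a sampled representation; the function name and the default sample rate of 44100 Hz are assumptions, since the specification does not state a sample rate.

```python
from typing import List


def append_silence(breath_wave: List[float], silent_ms: float,
                   sample_rate: int = 44100) -> List[float]:
    """Return the breath waveform followed by silent_ms milliseconds of silence."""
    n_silent = round(sample_rate * silent_ms / 1000)  # number of zero samples
    return list(breath_wave) + [0.0] * n_silent


# With a 1 kHz sample rate, 55 ms of silence is 55 samples.
padded = append_silence([0.1] * 10, silent_ms=55, sample_rate=1000)
print(len(padded))
# 65
```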
[0025]
The adder 60 in FIG. 1 adds the breath sound waveform data output from the silent section addition unit 45 in this manner and the singing sound waveform data output from the singing sound synthesizing unit 20, and outputs the sum as the waveform data of the synthesized singing voice. This waveform data is output as a singing voice via a D/A converter, an amplifier, and a speaker (not shown).
[0026]
Hereinafter, the operation of the present embodiment will be described with reference to the time chart shown in FIG. 5. When lyrics information, pitch information, note length information, and other information as shown in the figure are given, the lyrics filter 11 reads the lyrics information in order from the beginning and, when it finds a breath mark $, obtains the preceding phoneme symbol and the succeeding phoneme symbol representing the phonemes immediately before and after the breath mark $, and forms the breath control information from them. The language processing unit 12 decomposes the syllable information in the lyrics information into phoneme symbols. The lyrics filter 11 and the language processing unit 12 may, on acquiring the lyrics information for one song, perform the above processing on all of it at once to generate the phoneme symbol string and the breath control information. Alternatively, they may execute the above processing sequentially, running ahead of the timing control performed by the sequencer 13 for synthesizing singing sounds or breath sounds by, for example, one note. In short, the phoneme symbols and the breath control information need only be generated by the time the sequencer 13 needs them.
[0027]
When synthesis of the singing voice starts, the sequencer 13 fetches the pitch information, note length information, tempo information, dynamics information, and vibrato information corresponding to the first note from the singing data, and fetches the phoneme symbols of the syllable corresponding to the first note from the language processing unit 12.
[0028]
In the example shown in FIG. 5, the sequencer 13 fetches the pitch information and note length information of the first note, whose pitch is “do” and whose note length is a quarter beat, together with the tempo, dynamics, and vibrato information applied to it. The phoneme symbols /s/ and /a/, obtained by decomposing the syllable “sa” uttered with the first note, are output from the language processing unit 12 and fetched by the sequencer 13. At the time the information corresponding to this first note is transferred, there is no breath control information to be sent to the sequencer 13.
[0029]
Upon acquiring each piece of information corresponding to the first note in this way, the sequencer 13 sends the phoneme symbols /s/ and /a/ to the singing sound synthesizing unit 20. At the same time, the sequencer 13 sends the pitch information “do”, the note length information “1/4”, the tempo information, the dynamics information, and the vibrato information of the note to the singing sound synthesizing unit 20.
[0030]
As a result, the waveform data of the phonemes corresponding to the phoneme symbols /s/ and /a/ is read from the singing phoneme segment storage unit 30, and the waveform data of the singing sound of the syllable “sa” at pitch “do” is output from the singing sound synthesizing unit 20 and, via the adder 60, output as a singing sound.
[0031]
While the above operation is being performed, the sequencer 13 times the period determined by the tempo information and the note length information “1/4” corresponding to the first note. When the timing is completed, it fetches each piece of information corresponding to the next note.
[0032]
In the example shown in FIG. 5, the same operation is performed for the second and third notes. When the timing corresponding to the note length of the third note is completed, the sequencer 13 fetches the note length information, tempo information, and dynamics information corresponding to the rest that follows the third note from the singing data, and fetches the breath control information generated in response to the breath mark $ from the lyrics filter 11. The sequencer 13 then sends the fetched information to the breath sound synthesizing unit 40 and starts timing a period equivalent to a quarter beat.
[0033]
The breath sound synthesizing unit 40 synthesizes the breath sound specified by the breath control information. In this example, the breath control information contains the preceding phoneme symbol /a/ and the succeeding phoneme symbol /s/. Of these, the preceding phoneme symbol /a/ represents the last phoneme of the syllable “ta” uttered immediately before the rest in FIG. 5, and the succeeding phoneme symbol /s/ represents the first phoneme of the syllable “sa” uttered immediately after the rest. The breath phoneme selection unit 41 of the breath sound synthesizing unit 40 reads the waveform data of the breath sound phoneme corresponding to the combination of the preceding phoneme symbol /a/ and the succeeding phoneme symbol /s/ from the breath sound database selected by the singer information, and outputs it to the breath sound amplitude control unit 42.
[0034]
The breath sound amplitude control unit 42 amplitude-modulates the breath sound waveform data output from the breath phoneme selection unit 41 based on the succeeding phoneme symbol in the breath control information. The breath section length calculation unit 43 then calculates the actual time length t of the rest based on the note length information and tempo information from the sequencer 13, and from it obtains the breath sound length t' and the silent section length ts. For example, if the tempo information indicates a speed at which 110 quarter notes are sung per minute and the note length information of the breath is 1/4 beat, that is, a quarter rest, the actual time length t of the rest is 60/110 s ≈ 545 ms. The breath section length calculation unit 43 internally divides this time length t at a predetermined ratio to obtain the breath sound length t' and the silent section length ts. In a preferred embodiment, this ratio is, for example, 9:1. In this case, the breath sound length t' is 490 ms and the silent section length ts is 55 ms.
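The arithmetic in this example can be checked with a short calculation. The Python sketch below is illustrative only: the function name is hypothetical, and the rounding convention (round t to whole milliseconds first, then truncate t' and take ts as the remainder) is an assumption chosen because it reproduces the figures of 545 ms, 490 ms, and 55 ms given in the text.

```python
from fractions import Fraction
from typing import Tuple


def breath_timing_ms(tempo_qpm: int, note_length: Fraction,
                     ratio: Tuple[int, int] = (9, 1)) -> Tuple[int, int, int]:
    """Return (t, t_breath, t_silent) in milliseconds for a rest."""
    # A note length of 1/4 is one quarter note; tempo is quarter notes per minute.
    quarter_notes = note_length / Fraction(1, 4)
    t = round(Fraction(60_000, tempo_qpm) * quarter_notes)  # rest duration t
    t_breath = t * ratio[0] // (ratio[0] + ratio[1])         # breath sound length t'
    t_silent = t - t_breath                                  # silent section length ts
    return t, t_breath, t_silent


print(breath_timing_ms(110, Fraction(1, 4)))
# (545, 490, 55)
```

Because ts is computed as the remainder, the relationship t = t' + ts holds exactly regardless of rounding.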
[0035]
In accordance with the calculation result of the breath section length calculation unit 43, the breath sound length control unit 44 and the silent section addition unit 45 use the breath sound waveform data received from the breath sound amplitude control unit 42 to generate breath sound waveform data that lasts for a time corresponding to the breath sound length t' and is silent thereafter.
[0036]
The breath sound waveform data obtained in this way is output from the breath sound synthesizing unit 40 and, via the adder 60, output as a breath sound.
[0037]
After the breath sound has been synthesized, the singing sounds of the syllables “sa”, “i”, and so on that follow the rest in FIG. 5 are synthesized; these operations are the same as those already described, so their description is omitted.
[0038]
FIG. 6 illustrates the effect of the present embodiment. In the breath sound waveform K1 shown in FIG. 6A, the phonemes pronounced immediately before and immediately after the breath sound both have the phoneme symbol /a/. In the breath sound waveform K2 shown in FIG. 6B, on the other hand, the phoneme pronounced immediately before the breath sound has the phoneme symbol /e/, and the phoneme pronounced immediately after it has the phoneme symbol /t/. The beginning portion F1 of the breath sound waveform K1 and the beginning portion F2 of the breath sound waveform K2 are affected by the preceding phoneme, and in particular reflect the timbre of the preceding phoneme. The ending portion B1 of the breath sound waveform K1 and the ending portion B2 of the breath sound waveform K2 are affected by the succeeding phoneme, and the effect differs depending on whether the succeeding phoneme is a voiced sound or an unvoiced consonant. When the succeeding phoneme is a voiced sound, in particular a vowel or a nasal, the amplitude of the ending portion B1 of the breath sound waveform K1 rises rapidly and then attenuates, as shown in FIG. 6A. When the succeeding phoneme is an unvoiced consonant, the amplitude of the ending portion B2 of the breath sound waveform K2 decreases gradually, as shown in FIG. 6B.
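The two ending behaviours can be mimicked with a simple gain curve applied to the last n samples of a breath waveform. This is only an illustrative sketch: the set of voiced phoneme symbols, the function name `ending_gains`, and the piecewise-linear shapes and peak value are assumptions, not values taken from FIG. 6.

```python
from typing import List

# Assumed set of voiced (vowel or nasal) phoneme symbols; not from the source.
VOICED_PHONEMES = {"a", "i", "u", "e", "o", "n", "m"}


def ending_gains(n: int, succeeding: str) -> List[float]:
    """Gain values for the n samples of the ending portion of a breath sound.

    Voiced successor: gain rises to a peak at the midpoint, then drops to 0.
    Unvoiced successor: gain falls linearly from 1 to 0.
    Both shapes are hypothetical stand-ins for the behaviour described.
    """
    if n <= 0:
        return []
    gains = []
    for i in range(n):
        x = i / (n - 1) if n > 1 else 0.0  # position in the ending, 0..1
        if succeeding in VOICED_PHONEMES:
            gain = 1.0 + x if x <= 0.5 else 3.0 * (1.0 - x)
        else:
            gain = 1.0 - x
        gains.append(gain)
    return gains


voiced_tail = ending_gains(5, "a")    # like ending portion B1 (successor /a/)
unvoiced_tail = ending_gains(5, "t")  # like ending portion B2 (successor /t/)
```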
[0039]
According to the present embodiment, the waveform of the breath sound is selected according to the combination of the preceding phoneme immediately before the breath sound and the succeeding phoneme immediately after it, so the above phenomenon can be reproduced and a natural breath sound can be synthesized.
[0040]
The following modifications are conceivable in the present embodiment.
<Modification 1>
In the above embodiment, only the first phoneme of the syllable immediately after the breath sound was taken into account in controlling the breath sound waveform. In this modification, by contrast, when the two consecutive phonemes immediately after the breath sound form a specific combination, amplitude modulation is applied that makes the amplitude of the trailing portion of the breath sound waveform rise rapidly and then attenuate rapidly. The specific combination is, for example, a combination in which a vowel /a/ is followed by a nasal /n/ as shown in FIG. According to this modification, a more natural breath sound can be synthesized.
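The condition in this modification amounts to testing the first two phonemes after the breath sound against a set of registered pairs. A minimal sketch, assuming a hypothetical set `SPECIFIC_COMBINATIONS` that holds only the /a/ followed by /n/ example named in the text, and a hypothetical function name:

```python
from typing import Sequence

# Only the example pair named in the text; any further pairs would be assumptions.
SPECIFIC_COMBINATIONS = {("a", "n")}


def needs_tail_modulation(following_phonemes: Sequence[str]) -> bool:
    """True when the two phonemes right after the breath form a registered pair."""
    if len(following_phonemes) < 2:
        return False
    pair = (following_phonemes[0], following_phonemes[1])
    return pair in SPECIFIC_COMBINATIONS
```

When the function returns True, the rapid rise and rapid attenuation would be applied to the trailing portion; otherwise the waveform is left as selected.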
[0041]
<Modification 2>
In this modification, a breath sound waveform is selected based only on the phoneme symbol of the phoneme pronounced immediately after the breath sound. According to this modification, the data amount of the breath sound database can be reduced.
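The saving can be seen by counting database keys. The sketch below is illustrative only: the five-vowel phoneme list and the function name `build_keys` are assumptions, and a real database need not store an entry for every possible key.

```python
from itertools import product
from typing import List, Sequence, Tuple


def build_keys(phonemes: Sequence[str], use_preceding: bool) -> List[Tuple[str, ...]]:
    """Enumerate every possible lookup key for a given phoneme inventory."""
    if use_preceding:
        # Embodiment: one key per (preceding, succeeding) pair.
        return [(p, s) for p, s in product(phonemes, repeat=2)]
    # Modification 2: one key per succeeding phoneme only.
    return [(s,) for s in phonemes]


vowels = ["a", "i", "u", "e", "o"]   # hypothetical phoneme inventory
pair_keys = build_keys(vowels, use_preceding=True)
single_keys = build_keys(vowels, use_preceding=False)
```

For an inventory of P phonemes, the pair-keyed table has up to P × P entries while the table keyed on the succeeding phoneme alone has at most P.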
[0042]
<Modification 3>
The singing voice synthesizing device according to the present modification has a breath sound adoption/non-adoption control unit. This unit switches whether or not a breath sound is included in the singing voice when the singing voice is synthesized. In a preferred aspect, the singing information includes control information for this switching control, and the breath sound adoption/non-adoption control unit switches whether to include the breath sound in the singing voice based on that control information. In another preferred aspect, the breath sound adoption/non-adoption control unit switches whether or not to include the breath sound in the singing voice according to a command given from an operation unit (not shown).
[0043]
[Effect of the invention]
As described above, according to the present invention, when a breath sound is synthesized following a singing sound, a parameter that determines the breath sound waveform is selected based on at least the phoneme of the singing sound immediately after the breath sound, so that a more natural singing voice can be synthesized.
[Brief description of the drawings]
FIG. 1 is a block diagram illustrating a configuration of a singing voice synthesis device according to an embodiment of the present invention.
FIG. 2 is a diagram showing singing information handled in the embodiment.
FIG. 3 is a block diagram showing a configuration of a singing information analyzing unit in the embodiment.
FIG. 4 is a block diagram illustrating a configuration of a breath sound storage unit and a breath sound synthesis unit according to the first embodiment.
FIG. 5 is a flowchart showing the operation of the embodiment.
FIG. 6 is a diagram illustrating the effect of the embodiment.
[Explanation of symbols]
10 singing information analyzing unit, 20 singing sound synthesizing unit, 30 singing voice unit storing unit, 40 breath sound synthesizing unit, 50 breath unit storing unit, 60 adder.

Claims

A singing voice synthesis method comprising:
a singing sound synthesis process of sequentially synthesizing singing sounds in accordance with time-series singing sound synthesis instructions; and
a breath sound synthesis process of, when a breath sound synthesis instruction is given between the synthesis instructions for two temporally successive singing sounds, selecting a parameter that determines the breath sound according to a selection method involving at least the phoneme of the singing sound immediately after the breath sound, and synthesizing the breath sound using the parameter.

The singing voice synthesis method according to claim 1, wherein, in the breath sound synthesis process, the parameter that determines the breath sound is selected according to a selection method involving both the phoneme of the singing sound immediately after the breath sound and the phoneme of the singing sound immediately before the breath sound.

The singing voice synthesis method according to claim 2, wherein, in the breath sound synthesis process, waveform data of one type of breath sound is selected from waveform data of a plurality of types of breath sounds stored in advance in storage means, based on the combination of the phoneme of the singing sound immediately after the breath sound and the phoneme of the singing sound immediately before the breath sound, and is used as the parameter that determines the breath sound.

The singing voice synthesis method according to any one of claims 1 to 3, wherein, in the breath sound synthesis process, the amplitude of the breath sound is controlled according to the phoneme of the singing sound immediately after the breath sound.

A singing voice synthesizing device comprising:
a singing sound synthesizing unit that sequentially synthesizes singing sounds in accordance with time-series singing sound synthesis instructions; and
a breath sound synthesizing unit that, when a breath sound synthesis instruction is given between the synthesis instructions for two temporally successive singing sounds, selects a parameter that determines the breath sound according to a selection method involving at least the phoneme of the singing sound immediately after the breath sound, and synthesizes the breath sound using the parameter.