JP2017107228A

JP2017107228A - Singing voice synthesis device and singing voice synthesis method

Info

Publication number: JP2017107228A
Application number: JP2017028630A
Authority: JP
Inventors: Keiichi Tokuda (徳田 恵一); Keiichiro Oura (大浦 圭一郎); Kazuhiro Nakamura (中村 和寛)
Original assignee: Techno Speech Inc
Current assignee: Techno Speech Inc
Priority date: 2017-02-20
Filing date: 2017-02-20
Publication date: 2017-06-15

Abstract

PROBLEM TO BE SOLVED: To synthesize a singing voice by freely combining a plurality of singing styles. SOLUTION: For at least one of the singing voices of a plurality of singing styles, an acoustic model obtained by learning acoustic parameters, including at least a parameter that reflects the singing expression of the singing voice, is stored as a base model for that singing style. The degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles is then adjusted. From among the sets of acoustic parameters capable of reproducing the singing expressions included in the singing styles, at least two sets are selected so as to include at least one set obtained using the base model, and a singing voice is synthesized using acoustic parameters for synthesis determined by interpolating the acoustic parameters that affect the singing expression at the adjusted degree of combination. SELECTED DRAWING: Figure 1

Description

The present invention relates to a singing voice synthesis technique.

Various devices that attempt to synthesize natural speech by computer have long been proposed. Speech synthesis began with formant synthesis, which generates speech from rules, and later developed into concatenative (waveform-concatenation) speech synthesis, in which a database of waveforms recorded from a specific speaker is built and the required speech segments are retrieved from it and joined. The latter is also called corpus-based speech synthesis.

These techniques made it possible to synthesize speech that is joined fairly smoothly, but the result could not be called natural utterance, and they could not adequately produce natural-sounding emotional expression or singing voices. In recent years, therefore, approaches have been proposed that move away from the idea of concatenating phoneme segments and instead aim for more natural synthesis by simulating the speech production process; some of them have begun to be put to practical use.

This approach synthesizes speech as follows. First, fundamental frequency and spectral parameters are extracted from a speech database, and the text corresponding to the speech is analyzed, to build a statistical model (acoustic model) that has learned the correspondence between the acoustic features of the speech and the text. Then, given text to be synthesized, a sequence of acoustic parameters is generated from the acoustic model and speech is synthesized by simulating the speech production process. A hidden Markov model (HMM) can be used as the statistical acoustic model; speech synthesis using hidden Markov models is described in detail in, for example, Patent Document 1 below. Besides hidden Markov models, deep neural networks (DNNs) are also known as statistical acoustic models.

Various attempts have also been made to add variation to synthesized speech. For example, Patent Document 2 below proposes generating different kinds of voices by specifying various parameters, and synthesizing several different voices overlaid on one another. Patent Document 3 proposes a technique for changing the mixing ratio of a plurality of timbres partway through a text, and describes a method of synthesizing speech while gradually transitioning from one timbre to another along the time axis.

Patent Document 1: JP 2013-190792 A
Patent Document 2: JP 2006-337468 A
Patent Document 3: JP 2015-049253 A

However, all of these conventional techniques are limited to specifying parameters or changing the ratio at which a plurality of timbres are mixed, and it has been difficult to synthesize a variety of voices without impairing the natural feel of the synthesized speech. In particular, when speech synthesis is used to create a singing voice, it has been difficult to freely adjust the distinctive expressions found in the various singing styles that singers have. A singing style here means a particular combination of singing expressions, such as vibrato and kobushi (melismatic ornamentation), contained in a performance. Each singer is generally recognized as having a characteristic singing style, but even the same singer may sing in different styles depending on the type of song, for example Western music versus folk songs. The same singer may of course also sing the same song in different singing styles.

When speech synthesis is based on singing voice data for such different singing styles A, B, C, and so on, it has been difficult to blend the characteristics of singing style A and singing style B naturally, or to change the degree of blending. Moreover, no interface was known that lets a user of speech synthesis make such adjustments naturally.

The present invention has been made to solve at least part of the problems described above, and can be realized as the following forms or application examples.

(1) As a first embodiment of the present invention, a singing voice synthesis device is provided. The singing voice synthesis device may comprise: a storage unit that stores, for at least one of the singing voices of a plurality of singing styles, an acoustic model obtained by learning acoustic parameters, including at least a parameter that reflects the singing expression of the singing voice, as a base model for that singing style; an interface unit that adjusts the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles; a parameter determination unit that selects, from among sets of acoustic parameters capable of reproducing the singing expressions included in the singing styles, at least two sets of acoustic parameters including at least one set obtained using the stored base model, and determines acoustic parameters for synthesis by interpolating the acoustic parameters that affect the singing expression in the selected at least two sets at the degree of combination adjusted through the interface unit; and a synthesis unit that synthesizes a singing voice using the acoustic parameters for synthesis.

By learning an acoustic model with a statistical method, this singing voice synthesis device can store the acoustic parameters that affect singing expression as a base model for each singing style, and can generate a singing voice having the characteristics of that style from it. In addition, it can select, from among sets of acoustic parameters capable of reproducing the singing expressions included in the singing styles, at least two sets including at least one set obtained using the stored base model, and synthesize a singing voice using acoustic parameters for synthesis in which the acoustic parameters affecting the singing expression in the selected sets are interpolated at an arbitrary degree of combination. Moreover, the degree of combination can be set easily through the interface unit.

(2) In this singing voice synthesis device, the set of acoustic parameters may include at least one of fundamental frequency, volume, and parameters corresponding to singing expressions. Interpolating these parameters makes it easy to adjust the degree of combination of singing expressions.

(3) In this singing voice synthesis device, the set of acoustic parameters may further include spectral parameters. Using spectral parameters allows the degree of combination to be adjusted for a wide variety of singing expressions.
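The relationship between a set of acoustic parameters and the parameters that are actually interpolated can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described in this publication: the function name `blend_parameter_sets`, the track names `"f0"`, `"volume"`, and `"spectrum"`, and the choice to copy non-expression tracks from the first set are all hypothetical.

```python
def blend_parameter_sets(set_a, set_b, weight, expression_keys=("f0", "volume")):
    """Blend only the tracks named in expression_keys; copy the rest from set_a.

    set_a, set_b : dicts mapping a track name to a list of per-frame values
                   (hypothetical layout chosen for this sketch).
    weight       : degree of combination; 0.0 gives set_a, 1.0 gives set_b.
    """
    blended = {}
    for name, track_a in set_a.items():
        if name in expression_keys:
            track_b = set_b[name]
            if len(track_a) != len(track_b):
                raise ValueError(f"track {name!r} has mismatched frame counts")
            blended[name] = [
                (1.0 - weight) * a + weight * b for a, b in zip(track_a, track_b)
            ]
        else:
            # Tracks not treated as expression-related are left untouched.
            blended[name] = list(track_a)
    return blended


style_a = {"f0": [100.0], "volume": [0.5], "spectrum": [1.0]}
style_b = {"f0": [200.0], "volume": [1.0], "spectrum": [3.0]}
mixed = blend_parameter_sets(style_a, style_b, 0.5)
```

Passing `expression_keys=("f0", "volume", "spectrum")` would also blend the spectral track, which corresponds to the variation in (3).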

(4) In this singing voice synthesis device, the at least two selected sets of acoustic parameters may all be sets obtained using the stored base models. A singing voice can then be synthesized by interpolating the singing expressions of natural singing voices derived from base models.

(5) In this singing voice synthesis device, one of the at least two selected sets of acoustic parameters may be a set generated by a rule-based method. In this way, singing voice synthesis with an adjusted degree of combination of singing expressions is possible even with rule-based acoustic parameters.

(6) As a second embodiment of the present invention, another singing voice synthesis device for synthesizing a singing voice is provided. The singing voice synthesis device according to the second embodiment may comprise: a storage unit that stores acoustic models, obtained by using a statistical method to learn acoustic parameters including at least a parameter that reflects the singing expression contained in each of the singing voices of a plurality of singing styles, as base models for the respective singing styles; an interface unit that adjusts the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles; an interpolation extraction unit that, based on the base models corresponding to at least two singing styles selected from the plurality of singing styles among the plurality of base models stored in the storage unit, extracts acoustic parameters for synthesis in which the singing expressions are interpolated at the degree of combination adjusted through the interface unit; and a synthesis unit that synthesizes a singing voice using the acoustic parameters for synthesis.

With this singing voice synthesis device, the degree of combination of the singing expressions included in the singing styles is adjusted at the level of the base models, so a smoother singing voice can be synthesized with the adjusted degree of combination.

(7) In the above singing voice synthesis device, the parameters that reflect singing expression may include a parameter corresponding to at least one of vibrato, shakuri (pitch scoop), attack/release, and kobushi. Such a device can perform synthesis that combines the characteristics of a plurality of singing styles for at least one of shakuri (including upward and downward scoops), attack/release, and kobushi.

(8) In the above singing voice synthesis device, the parameters that reflect singing expression may include a parameter corresponding to at least one of the vocalization start timing and the vocalization end timing. Such a device can perform synthesis that combines the characteristics of a plurality of singing styles for at least one of the vocalization start timing and the vocalization end timing.

(9) In the singing voice synthesis device of the first embodiment above, the degree of combination of the singing expressions may be adjusted by interpolating the values of the acoustic parameters. In this way, the degree of combination of at least two singing expressions can be determined easily.

(10) In this singing voice synthesis device, the interpolation may be performed by linearly or nonlinearly combining the acoustic parameters. A linear combination simplifies the interpolation computation, while a nonlinear combination allows the degree of interpolation to be set flexibly.
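The difference between a linear and a nonlinear combination can be illustrated with a short sketch. The publication does not specify which nonlinear combination is used; blending in the log domain, shown here, is only one possible choice and is an assumption of this example, as are the function names. The log-domain version also assumes every frame is voiced (values greater than zero).

```python
import math


def interpolate_linear(track_a, track_b, weight):
    """Frame-wise linear combination: (1 - weight) * a + weight * b."""
    if len(track_a) != len(track_b):
        raise ValueError("parameter tracks must have the same number of frames")
    return [(1.0 - weight) * a + weight * b for a, b in zip(track_a, track_b)]


def interpolate_log_domain(track_a, track_b, weight):
    """One possible nonlinear combination: linear blending of log values.

    Assumes all values are strictly positive (for example, F0 of voiced frames).
    """
    if len(track_a) != len(track_b):
        raise ValueError("parameter tracks must have the same number of frames")
    return [
        math.exp((1.0 - weight) * math.log(a) + weight * math.log(b))
        for a, b in zip(track_a, track_b)
    ]


# Hypothetical F0 contours (Hz) for the same notes sung in two styles.
f0_style_a = [220.0, 220.0, 233.0]
f0_style_b = [220.0, 247.0, 262.0]
linear_mix = interpolate_linear(f0_style_a, f0_style_b, 0.25)
log_mix = interpolate_log_domain(f0_style_a, f0_style_b, 0.25)
```

For pitch, the log-domain blend moves in equal musical intervals rather than equal numbers of hertz, which is one reason a nonlinear combination might be preferred for some parameters.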

(11) In the above singing voice synthesis device, the interpolation may be performed in units of any one of notes, phonemes, syllables, phrases, songs, or a predetermined time. In this case the target of interpolation can be set at a fine granularity, which makes combining more convenient.
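Interpolation in units of notes can be sketched as below, assuming each frame is labelled with the index of the note it belongs to and each note carries its own degree of combination. The names `interpolate_per_note`, `note_of_frame`, and `weight_of_note` are hypothetical; the same pattern would apply to phonemes, syllables, phrases, or fixed time windows by changing what the index refers to.

```python
def interpolate_per_note(track_a, track_b, note_of_frame, weight_of_note):
    """Blend two frame-level tracks using a separate weight for each note.

    note_of_frame  : note index for every frame (same length as the tracks).
    weight_of_note : degree of combination for each note; 0.0 keeps track_a,
                     1.0 keeps track_b.
    """
    if not (len(track_a) == len(track_b) == len(note_of_frame)):
        raise ValueError("tracks and note labels must have the same length")
    blended = []
    for a, b, note in zip(track_a, track_b, note_of_frame):
        w = weight_of_note[note]
        blended.append((1.0 - w) * a + w * b)
    return blended


# Three frames: the first two belong to note 0, the last to note 1.
mixed = interpolate_per_note(
    [100.0, 100.0, 100.0],
    [200.0, 200.0, 200.0],
    note_of_frame=[0, 0, 1],
    weight_of_note=[0.0, 1.0],
)
```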

(12) In the singing voice synthesis device of the second embodiment above, the degree of combination of the singing expressions may be adjusted by interpolating internal parameters of the base models. In this way, the degree of combination of at least two singing expressions can be determined easily using the internal parameters of the base models, and singing voice synthesis with an adjusted combination of singing expressions can be performed even more smoothly.

(13) In this singing voice synthesis device, the interpolation may be performed by linearly or nonlinearly combining the internal parameters of the base models. A linear combination simplifies the interpolation computation, while a nonlinear combination allows the degree of interpolation to be set flexibly.

(14) In the above singing voice synthesis device, the interpolation may be performed in units of states of the base models. Because the interpolation is performed state by state, the computation is simplified. Although the interpolation itself is performed per model state, the user interface of the interface unit may still let the user specify the degree of combination in units of notes, phonemes, syllables, phrases, songs, a predetermined time, and so on.
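State-level interpolation of base-model parameters can be sketched as follows, assuming each model state is described by a one-dimensional Gaussian (mean and variance). The publication does not say which internal parameters are combined or how; blending both mean and variance with the same weights is only one of several conceivable schemes, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class GaussianState:
    """One model state, reduced to a 1-D Gaussian for this sketch."""
    mean: float
    variance: float


def interpolate_states(states, weights):
    """Linearly combine corresponding states of several base models.

    states  : one GaussianState per base model, all for the same state position.
    weights : degree of combination per base model; assumed to sum to 1.
    """
    if len(states) != len(weights):
        raise ValueError("need exactly one weight per base model")
    mean = sum(w * s.mean for s, w in zip(states, weights))
    variance = sum(w * s.variance for s, w in zip(states, weights))
    return GaussianState(mean, variance)


state_style_a = GaussianState(mean=0.0, variance=1.0)
state_style_b = GaussianState(mean=10.0, variance=3.0)
blended_state = interpolate_states([state_style_a, state_style_b], [0.5, 0.5])
```

Acoustic parameters for synthesis would then be generated from the blended states rather than by mixing two already generated parameter tracks, which is the distinction drawn between the first and second embodiments.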

(15) In each of the above singing voice synthesis devices, the interpolation may be interpolation in the narrow sense (between the given values) or extrapolation. Interpolation between the values yields characteristics intermediate between those of a plurality of singing styles, while extrapolation yields a combination that moves away from the characteristics of one singing style.
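In terms of a single blending weight, interpolation and extrapolation differ only in whether the weight lies inside or outside the range 0 to 1. The sketch below assumes a two-style blend and uses a hypothetical vibrato depth (in cents) as the parameter; the function names are illustrative.

```python
def blend(value_a, value_b, weight):
    """Return value_a + weight * (value_b - value_a).

    weight in [0, 1] interpolates between the two styles;
    weight < 0 or weight > 1 extrapolates beyond one of them.
    """
    return value_a + weight * (value_b - value_a)


def is_extrapolation(weight):
    """True when the weight lies outside the interpolation range [0, 1]."""
    return weight < 0.0 or weight > 1.0


vibrato_depth_a = 20.0  # hypothetical depth in cents for style A
vibrato_depth_b = 60.0  # hypothetical depth in cents for style B

halfway = blend(vibrato_depth_a, vibrato_depth_b, 0.5)       # between A and B
exaggerated = blend(vibrato_depth_a, vibrato_depth_b, 1.5)   # beyond B
suppressed = blend(vibrato_depth_a, vibrato_depth_b, -0.5)   # away from B
```

A practical system would probably limit how far the weight may leave the range, since extreme extrapolation can produce values with no physical meaning, such as a negative depth.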

(16) In the above singing voice synthesis device, one of the stored base models may be a base model consisting of standard acoustic parameters prepared in advance. In this way, a combination with standard acoustic parameters can be realized easily.

(17) This singing voice synthesis device may further comprise an image display device and a pointing device. The interface unit may be a graphical user interface drawn on the image display device, and the degree of combination may be changed by operating, with the pointing device, the screen drawn on the image display device as the graphical user interface. In this way, the degree of combination can be changed intuitively.

(18) As a third embodiment of the present invention, a singing voice synthesis method is provided. The singing voice synthesis method may: store, for at least one of the singing voices of a plurality of singing styles, an acoustic model obtained by learning acoustic parameters, including at least a parameter that reflects the singing expression of the singing voice, as a base model for that singing style; adjust the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles; select, from among sets of acoustic parameters capable of reproducing the singing expressions included in the singing styles, at least two sets of acoustic parameters including at least one set obtained using the stored base model, and determine acoustic parameters for synthesis by interpolating the acoustic parameters that affect the singing expression in the selected at least two sets at the adjusted degree of combination; and synthesize a singing voice using the acoustic parameters for synthesis.

With this singing voice synthesis method, learning an acoustic model with a statistical method makes it possible to store the acoustic parameters that affect singing expression as a base model for each singing style and to generate a singing voice having the characteristics of that style from it. In addition, at least two sets of acoustic parameters, including at least one set obtained using the stored base model, can be selected from among sets of acoustic parameters capable of reproducing the singing expressions included in the singing styles, and a singing voice can be synthesized using acoustic parameters for synthesis in which the acoustic parameters affecting the singing expression in the selected sets are interpolated at an arbitrary degree of combination. Moreover, the degree of combination can be set easily.

(19) As a fourth embodiment of the present invention, another singing voice synthesis method for synthesizing a singing voice is provided. The singing voice synthesis method may: store in a storage unit acoustic models, obtained by using a statistical method to learn acoustic parameters including at least a parameter that reflects the singing expression contained in each of the singing voices of a plurality of singing styles, as base models for the respective singing styles; adjust the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles; extract, based on the base models corresponding to at least two singing styles selected from the plurality of singing styles among the plurality of base models stored in the storage unit, acoustic parameters for synthesis in which the singing expressions are interpolated at the adjusted degree of combination; and synthesize a singing voice using the acoustic parameters for synthesis.

With this singing voice synthesis method, acoustic parameters for synthesis in which the singing expressions are interpolated are extracted based on the base models stored in the storage unit, so a singing voice with an adjusted degree of combination of singing expressions can be synthesized more smoothly.

FIG. 1 is a schematic configuration diagram of an embodiment of the singing voice synthesis device.
FIG. 2 is an explanatory diagram showing an acoustic model using an HMM and the principle of its learning.
FIG. 3 is a flowchart showing a singing voice synthesis preparation routine.
FIG. 4 is an explanatory diagram showing representative parameters extracted from singing voice data.
FIG. 5 is an explanatory diagram showing context-dependent phonemes, which are the basic units used when learning an acoustic model.
FIG. 6 is an explanatory diagram showing how a set of HMM states is clustered.
FIG. 7 is an explanatory diagram showing a state duration model and the decision trees for each parameter.
FIG. 8 is a flowchart showing a singing voice synthesis processing routine.
FIG. 9 is an explanatory diagram showing an example of a user interface that shows the relationship between lyrics and the timing of pitches.
FIG. 10 is an explanatory diagram showing an example of an editing screen for the interpolation ratios of the singing styles of a plurality of singers.
FIG. 11 is an explanatory diagram showing an example of a specific method for changing the interpolation ratio.
FIG. 12 is an explanatory diagram showing an example of an editing screen for the base models prepared for each singer's singing style.
FIG. 13 is an explanatory diagram showing an example of a screen that displays the pitch of the synthesized singing voice.
FIG. 14 is an explanatory diagram showing a method of setting the interpolation ratio in the second embodiment.

Several embodiments of the present invention will be described with reference to the drawings. FIG. 1 is a schematic configuration diagram showing a first embodiment of the singing voice processing device of the present invention. The singing voice processing device 100 shown in FIG. 1 includes both a configuration for processing acoustic parameters in advance to learn acoustic models and a configuration for actually synthesizing a singing voice, that is, a configuration serving as a singing voice synthesis device. If only singing voice synthesis is to be performed, the former configuration is unnecessary. Both are described together here, but for synthesis alone it is sufficient to store trained acoustic models in a storage unit such as a hard disk and perform synthesis using those models.

The singing voice processing device 100 shown in FIG. 1 consists of a computer PC1 connected via a network NW, a server 30, and another computer PC2 connected to the server 30. The singing voice processing device 100 can of course also be implemented on a single computer, or on a plurality of computers distributed across a network.

Computer PC1 is provided for inputting singing voices and includes a score input unit 10 and a voice input unit 20 for inputting singing (singing voices). A keyboard is typically used as the score input unit 10 and a microphone as the voice input unit 20. When a singer sings into the microphone serving as the voice input unit 20 and a score including the lyrics is entered from the keyboard serving as the score input unit 10, the score and the singing voice are associated with each other, given an index for each singing style, and stored in computer PC1 as, for example, data A, data B, ..., data N. The score may be entered in the form of staff notation, or using a piano-style keyboard. The score may also be entered before the singing voice. Because the pitches and note lengths actually sung do not necessarily match a score prepared in advance, a score entered beforehand can be corrected after the singing voice is input so that it matches what was actually sung.

When the score is entered using a musical instrument or the like, a piano-style keyboard is used as the score input unit 10 together with the text keyboard, and the pitches and note lengths entered on the piano-style keyboard are associated with the lyrics entered on the text keyboard (for Japanese, the kana string corresponding to each note). At least several minutes' worth of score and singing voice data is accumulated for each singing style. As described later, this data is analyzed by the acoustic parameter learning unit in the server 30. For the analysis, the score and singing voice data need not contain every phoneme, every phoneme combination, and every singing expression, but they should preferably contain enough variety of phonemes, combinations, and singing expressions to permit statistical learning. In general, therefore, roughly several minutes to several tens of minutes of singing voice is required.

The computer PC1 for singing voice input is separated from the server 30 to make it easier to input data for a plurality of singing styles. Computer PC1 can be realized by, for example, a notebook computer equipped with a microphone, which can easily be carried around to collect and store scores and singing voices. In this example the singing voice is input directly from the voice input unit 20, such as a microphone, but the source need not be a live performance; the singing voice may instead be taken from a recording on a CD, DVD, or the like, or input via the network NW. As described later, an acoustic model is built for each singing style, so the singing voices for a given style, including their singing expressions, should preferably come from sources sung in the same or a similar manner (usually recordings of the same singer). It is even more preferable to use sources in which the characteristics of that singing style are well expressed.

The singing voice data collected and recorded in this way is sent to the server 30 via the network NW and stored on a hard disk 31 in the server 30. The server 30 contains a score analysis unit 33, an acoustic parameter learning unit 40, and an acoustic model storage unit 50. The acoustic model storage unit 50 corresponds to a more specific form (subordinate concept) of the storage unit that stores an acoustic model for each singing style. The server 30 is also provided with a parameter adjustment unit 55, a score analysis unit 57, and a speech synthesis unit 60, which together with the acoustic model storage unit 50 constitute the singing voice synthesis device. The parameter adjustment unit 55 corresponds to a more specific form of the parameter determination unit, and the speech synthesis unit 60 to a more specific form of the synthesis unit.

The parameter adjustment unit 55 and the score analysis unit 57 exchange data with the computer PC2. The computer PC2 is provided with a keyboard 51, a pointing device 52 such as a mouse, and a display unit 53. The keyboard 51 is used mainly to input the score data of the singing voice to be synthesized. The display unit 53 displays a graphical interface that shows the combination of acoustic parameters, described later, and its degree. Using this graphical interface and the pointing device 52, the computer PC2 can specify or modify the combination of acoustic parameters and its degree (the ratio of the combination). The computer PC2 corresponds to a subordinate concept of the interface unit. The method of adjusting the combination of acoustic parameters and its degree is described in detail later.

The learning unit 40 in the server 30 is described next. The learning unit 40 performs learning to build an acoustic model from the score and voice data stored on the hard disk 31 for each singing style. Because this learning is ultimately performed in order to synthesize singing voices, the singing voice synthesis technique used in this embodiment is outlined first. In this embodiment, the organs a person uses to produce a singing voice, such as the vocal cords and the palate, are regarded as a sound source (excitation source) and a filter with a given transfer characteristic, and are simulated with a digital filter. For this purpose, a sequence along the time axis of acoustic parameters is used, consisting of information such as spectral parameters, fundamental frequency, and voiced/unvoiced flags extracted from the speech waveform. If this sequence of acoustic parameters can be estimated from a score, the corresponding voice can be synthesized from the score. The acoustic model is therefore trained by learning, from actual singing voice data and scores, the relationship between acoustic parameter sequences and the corresponding scores. A hidden Markov model (HMM) can be used as such an acoustic model.

FIG. 2 is an explanatory diagram showing an HMM-based acoustic model and the principle of its training. The utterance of a phoneme is affected by the phonemes before and after it (the preceding and succeeding phonemes); if these differ, the acoustic parameters of the uttered phoneme differ as well. In FIG. 2, "1" denotes the region around the beginning of a phoneme, "2" the region around its middle, and "3" the region toward its end. The model represents one uttered phoneme as these three states. The "1" part is easily affected by the phoneme that precedes the phoneme in question, and the "3" part by the phoneme that follows it. The preceding and succeeding phonemes are the most basic context for the uttered phoneme.

In FIG. 2, a_ij denotes a transition probability. For i = j it is the probability of remaining in the same part of the phoneme, and for j = i + 1 it is the probability of moving to the next part. The observation sequence o takes values given by the output probability density function b_q(o_t), where q is the state sequence. To keep the explanation simple, FIG. 2 shows a phoneme as consisting of three parts and as being affected by the preceding and succeeding phonemes, but actual speech synthesis also refers to other contexts. That is, a context-dependent model is used, as described in detail later with reference to FIG. 5. In this embodiment the HMM is trained from the score and the singing voice data, and once an HMM has been trained for each singing style, it is used to synthesize a singing voice from a score. The state transition probabilities a_ij and the output probability density functions b_q(o_t) learned by the HMM can be estimated with the expectation-maximization (EM) algorithm, which is one of the maximum likelihood estimation methods.
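The three-state, left-to-right structure of FIG. 2 can be illustrated with a small likelihood computation. The following is only a sketch under stated assumptions: each state has a one-dimensional Gaussian output density, the model always starts in the first state, and the function names (`gaussian_pdf`, `forward_likelihood`) are hypothetical and do not come from the specification. It evaluates a given model with the forward algorithm; it does not perform EM training.

```python
import math


def gaussian_pdf(x, mean, var):
    """Output probability density b_q(o_t) for a 1-D Gaussian state (assumption)."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def forward_likelihood(obs, trans, means, variances):
    """Likelihood of an observation sequence under a left-to-right HMM.

    trans[i][j] plays the role of a_ij; only i == j (stay) and j == i + 1
    (advance) are used, matching the topology described for FIG. 2.
    """
    n_states = len(means)
    alpha = [0.0] * n_states
    alpha[0] = gaussian_pdf(obs[0], means[0], variances[0])  # always start in state 0
    for o in obs[1:]:
        new_alpha = [0.0] * n_states
        for j in range(n_states):
            total = alpha[j] * trans[j][j]                # stayed in the same part
            if j > 0:
                total += alpha[j - 1] * trans[j - 1][j]   # moved on from the previous part
            new_alpha[j] = total * gaussian_pdf(o, means[j], variances[j])
        alpha = new_alpha
    return sum(alpha)


# Illustrative values only: three states for the beginning, middle, and end of a phoneme.
TRANS = [[0.6, 0.4, 0.0],
         [0.0, 0.6, 0.4],
         [0.0, 0.0, 1.0]]
MEANS = [1.0, 2.0, 3.0]
VARIANCES = [0.1, 0.1, 0.1]
```

A sequence that passes through the three state means in order scores far higher than the same values in reverse order, because the topology does not allow backward transitions.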

Next, the acoustic parameters used in HMM training are described. As already explained, the acoustic parameters are basically the spectral parameters, fundamental frequency, and voiced/unvoiced information extracted from the speech waveform. The mel-cepstrum, line spectral pairs (LSP), or the like can be used as spectral parameters; this embodiment uses the mel-cepstrum. The mel-cepstrum is obtained by applying an inverse Fourier transform to the logarithm of the Fourier transform of the speech signal, adjusted to carry more information in the low-frequency region in consideration of human auditory characteristics. The fundamental frequency is generally handled in the logarithmic domain. Voiced/unvoiced information distinguishes periodic sounds, such as vowels, from non-periodic sounds, such as consonants. The fundamental frequency takes continuous values in voiced sections and has no value in unvoiced sections. Parameters called dynamic features are also used. Dynamic features correspond to the first-order derivative (delta) and second-order derivative (delta-delta) in the time direction of parameters such as the fundamental frequency and the mel-cepstrum. They compensate for the fact that an HMM has difficulty modeling correlations along the time axis of time-series data. Handling dynamic features makes the joins smoother when a phoneme sequence is synthesized.
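As an illustration of the dynamic features mentioned above, the sketch below computes delta and delta-delta values for one parameter track. The specification does not state which difference windows are used; the windows here (0.5 * (x[t+1] - x[t-1]) for delta and x[t+1] - 2 * x[t] + x[t-1] for delta-delta, with the edge frames repeated) are an assumption, chosen because they are common in HMM-based synthesis, and the function names are hypothetical.

```python
def delta(seq):
    """First-order dynamic feature: 0.5 * (x[t+1] - x[t-1]), edges repeated."""
    n = len(seq)
    return [0.5 * (seq[min(t + 1, n - 1)] - seq[max(t - 1, 0)]) for t in range(n)]


def delta_delta(seq):
    """Second-order dynamic feature: x[t+1] - 2*x[t] + x[t-1], edges repeated."""
    n = len(seq)
    return [seq[min(t + 1, n - 1)] - 2.0 * seq[t] + seq[max(t - 1, 0)] for t in range(n)]


def with_dynamic_features(seq):
    """Pair each static value with its delta and delta-delta, frame by frame."""
    return list(zip(seq, delta(seq), delta_delta(seq)))
```

For a steadily rising track the interior delta values are constant and the interior delta-delta values are zero, which is the kind of smoothness information the static values alone do not carry.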

Up to this point the acoustic model has been described as a hidden Markov model over spectral parameters and fundamental frequency, but the model actually used is more elaborate. The models actually adopted are outlined below.
(A) State duration model: The length of each phoneme in a singing voice varies with the singing style and other factors. An explicit state duration distribution is therefore used to model the temporal structure of the voice (how long each phoneme lasts) more accurately. An HMM extended with such a state duration model is called a hidden semi-Markov model.
(B) Context-dependent model: The fundamental frequency and durations are easily affected by the linguistic information in the lyrics. The model therefore takes into account the linguistic information of the lyrics together with contexts obtained from the score, such as pitch, tempo, key, and time signature.
(C) Multi-space probability distribution HMM: Speech, including singing, contains unvoiced sections in which no fundamental-frequency time-series data exists at all. This embodiment uses a multi-space probability distribution HMM (MSD-HMM) to handle this special kind of time series.

(D) Singing expression model: Viewed against the score, a singing voice contains various deviations. These are called singing expression in the broad sense. Singing expression generally differs from one singing style to another and characterizes singing in a particular style, so it is also used to learn the characteristics of a singing style. The items covered by singing expression in the broad sense are listed below. Singing expression need not include all of them, but at least one of them is used to characterize a singing style and is treated as the base model for that style.
(1) Timing: An actual singing voice may deviate, unintentionally or intentionally, from the note positions on the time axis calculated from the score. For example, a consonant is often uttered slightly before the start of its note. There are also singing expressions that deliberately shift the timing of utterance, such as singing ahead of the beat (mae-nori), singing behind the beat (ato-nori), and holding back (tame). The time offset between the actual utterance and the absolute time calculated from the score is therefore modeled for each phoneme.
(2) Vibrato: Vibrato is a singing expression that periodically oscillates at least one of pitch and volume. The timing at which vibrato is applied, and the way its period and amplitude change, differ from one singing style to another, so they are used in training the acoustic model for each singing style. Vibrato is handled as two parameters, its period and its amplitude, and incorporated into the acoustic model.
(3) Other singing expressions: Many singing expressions exist besides vibrato, for example kobushi (a melismatic ornament), scooping up, scooping down, and attack/release. Such expressions can be handled as period and amplitude parameters, or as the amount by which the fundamental frequency varies partway through a phoneme, and are incorporated into the acoustic model.
In this specification, the term HMM (hidden Markov model) is used to include all of the models described above.
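To make the two vibrato parameters in item (2) concrete, the sketch below generates a fundamental-frequency contour from a period and an amplitude. This is an illustration only, not the model of the specification: it assumes a purely sinusoidal pitch vibrato with constant period and constant depth, expresses the amplitude in cents, and uses a hypothetical function name and hypothetical argument names.

```python
import math


def vibrato_f0(base_hz, n_frames, frame_s, period_s, depth_cents):
    """Return an F0 contour in Hz with sinusoidal vibrato around base_hz.

    period_s    : vibrato period in seconds (assumed constant)
    depth_cents : vibrato amplitude in cents (assumed constant)
    """
    contour = []
    for k in range(n_frames):
        t = k * frame_s
        cents = depth_cents * math.sin(2.0 * math.pi * t / period_s)
        contour.append(base_hz * 2.0 ** (cents / 1200.0))
    return contour
```

With a 5 ms frame shift and a 0.2 s period, one vibrato cycle spans 40 frames, and the contour peaks a quarter period after it starts.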

Returning to FIG. 1, the configuration of the server 30 is described further. The server 30 includes an F0 extraction unit 41, which extracts the fundamental frequency of the singing voice and its derivatives (delta parameters) from the singing voice data stored on the hard disk 31; an SP extraction unit 43, which extracts the spectral parameters of the singing voice and their derivatives (delta parameters); a singing P extraction unit 44, which extracts the broad-sense singing expression parameters described above; and an HMM learning unit 45, which learns the acoustic parameters from these extracted parameters using a hidden Markov model (HMM). Each of the extraction units 41, 43, and 44 extracts features of the acoustic parameters in a unit suited to the feature, such as per note or per frame. The F0 extraction unit 41 extracts the logarithmic fundamental frequency and its derivatives as parameters. The spectral parameters extracted by the SP extraction unit 43 include the mel-cepstrum and its derivatives. The singing P extraction unit 44 extracts the remaining singing expression parameters: those related to vibrato and their derivatives, as well as parameters for scoops, kobushi, and attack/release.

The processing executed in preparation for singing voice synthesis is described with reference to FIG. 3. The first half (steps S100 to S130) of the singing voice synthesis preparation routine shown in FIG. 3 is executed by the computer PC1. The second half (steps S140 to S170) is executed by the server.

When the singing voice synthesis preparation routine starts, a singing style is specified first (step S100). Once a style such as singing style A or singing style B has been specified, singing voice data is input (step S110). The singing voice data is input by capturing at least several minutes of singing through the voice input unit 20, such as a microphone, and storing it as digital data. The score is then input through the score input unit 10 (step S120). The pitches and lyrics (pronunciation) extracted from the input score are associated with the singing voice data.

Next, it is determined whether processing has been completed for every singing style for which an acoustic model is to be created (step S130), and steps S100 to S120 are repeated until all singing styles have been processed.

When the singing voice data and the corresponding scores have been input for all singing styles (step S130), the server 30, having received the singing voice data and scores, analyzes the data (step S140). The analysis is performed by taking the singing voices of each singing style from the hard disk 31 one after another. It comprises analysis of the score by the score analysis unit 33, analysis of the fundamental frequency and its related parameters by the F0 extraction unit 41 of the learning unit 40, analysis of the spectral parameters (SP) and their related parameters by the SP extraction unit 43, and analysis of the parameters related to singing expression. FIG. 4 shows examples of the parameters extracted by this analysis.

The fundamental frequency is handled as the logarithmic fundamental frequency p_t. Its related parameters include the voiced/unvoiced distinction and the first derivative (Δp_t) and second derivative (Δ²p_t) of the logarithmic fundamental frequency. These are sometimes called sound source information. The spectral parameters include the mel-cepstrum c_t and its first derivative (Δc_t) and second derivative (Δ²c_t). These are sometimes called spectral information. In addition to this sound source information and spectral information, this embodiment handles singing expression information.
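The sound source information described above can be sketched as a per-frame conversion from fundamental frequency in Hz to logarithmic fundamental frequency plus a voiced/unvoiced flag. The convention that an extractor reports 0 Hz for unvoiced frames is an assumption, as are the function names; the specification states only that unvoiced sections carry no fundamental-frequency value, which is represented here by `None`.

```python
import math


def to_log_f0(f0_hz):
    """Log F0 per frame; unvoiced frames (assumed to be reported as 0 Hz) have no value."""
    return [math.log(f) if f > 0.0 else None for f in f0_hz]


def voiced_flags(f0_hz):
    """Voiced/unvoiced distinction per frame under the same 0 Hz convention."""
    return [f > 0.0 for f in f0_hz]
```

Keeping unvoiced frames as "no value" rather than as a numeric placeholder mirrors the reason the multi-space probability distribution HMM is used for the fundamental frequency.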

The singing expression information includes the vibrato period Vf and amplitude Va, their first derivatives (ΔVf, ΔVa) and second derivatives (Δ²Vf, Δ²Va), the scoop parameters S1 to S6, the attack/release parameters AR1 to AR6, and so on. Information such as kobushi may of course be analyzed as well. In this embodiment, scoop and attack/release each have three parameters, "length", "height", and "steepness", at both the head and the tail of a note, and therefore each consist of six parameters. The method of learning the scoop and similar parameters is described later. Among the parameters above, the first and second derivatives of each parameter, beginning with the mel-cepstrum c_t, are used to take temporal variation into account. Taking these dynamic features into account makes the transitions from sound to sound smooth when the singing voice is synthesized. A description of speech synthesis using dynamic features is omitted.
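The six-parameter layout described above (length, height, and steepness at the head and at the tail of a note) could be held in a small data structure such as the following. The class and field names are hypothetical, and the ordering of the six values in `as_vector` is an assumption; the specification does not say which of S1 to S6 or AR1 to AR6 corresponds to which quantity.

```python
from dataclasses import dataclass


@dataclass
class EdgeShape:
    """Shape of an expression at one edge of a note."""
    length: float
    height: float
    steepness: float


@dataclass
class NoteExpression:
    """Scoop or attack/release parameters for one note: head and tail shapes."""
    head: EdgeShape
    tail: EdgeShape

    def as_vector(self):
        # Assumed ordering: head (length, height, steepness), then tail.
        return [self.head.length, self.head.height, self.head.steepness,
                self.tail.length, self.tail.height, self.tail.steepness]
```

The same structure serves for both scoop and attack/release, since the text gives them identical parameter sets.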

HMM training is then performed using the analyzed data (step S150). HMM training learns the extracted parameters in order to obtain a hidden semi-Markov model, and proceeds roughly as follows. As described above, HMM training is performed for each phoneme in the original singing voice data, but phonemes are not handled in isolation: training takes into account the many factors that cause the voice to vary in speech synthesis. Such factors include the combination of phonemes around the phoneme in question (a triphone, which is the phoneme together with its preceding and succeeding phonemes, or a quinphone, which also considers the phonemes one step further out), score information, and linguistic information. Score information includes the pitches of the preceding and succeeding phonemes and the lengths of rests. Linguistic information includes the part of speech of the word to which the phoneme belongs, its conjugated form, the position of the accent, and the accent type. These factors are collectively called the context.

Many factors must be considered to achieve smooth speech synthesis, but to outline the training method, FIG. 5 shows the triphone described above as an example of a phoneme with context. FIG. 5 is an explanatory diagram showing how triphones are taken from a singing voice, using the phrase "sappari wakaranai" ("I don't understand at all") as an example. The phoneme a appears several times in the voice data for "sappari wakaranai", but even for the same phoneme, the acoustic characteristics of the voice differ when the context, such as the neighboring phonemes, differs. Each occurrence of a is therefore modeled separately as a triphone that takes its neighboring phonemes into account. Phonemes are taken one after another, together with their context, from the singing voice stored on the hard disk 31. A phoneme with its context taken into account is hereinafter called a context-dependent phoneme. The number of context-dependent phonemes obtained from several minutes to several tens of minutes of singing ranges from hundreds to tens of thousands. The state transition probabilities a_ij and output probability density functions b_q(o_t) shown in FIG. 2 are learned for all context-dependent phonemes taken from the hard disk 31. That is, the parameters shown in FIG. 4 are extracted for each frame belonging to a context-dependent phoneme, and the HMM of each context-dependent phoneme is trained.
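The triphone example can be reproduced with a few lines of code. This is a sketch under assumptions: the phoneme segmentation of "sappari wakaranai" (with `cl` standing for the geminate closure), the `sil` boundary symbol, and the `prev-cur+next` label format are illustrative choices, not taken from FIG. 5, and the function names are hypothetical.

```python
def to_triphones(phonemes, boundary="sil"):
    """Label each phoneme with its left and right neighbours as 'prev-cur+next'."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
            for i in range(1, len(padded) - 1)]


def centre_of(label):
    """Recover the centre phoneme from a 'prev-cur+next' label."""
    return label.split("-")[1].split("+")[0]


# Assumed segmentation of "sappari wakaranai".
SAPPARI_WAKARANAI = ["s", "a", "cl", "p", "a", "r", "i",
                     "w", "a", "k", "a", "r", "a", "n", "a", "i"]
```

Under this segmentation the phoneme a occurs six times, each time with a different pair of neighbours, so each occurrence yields its own context-dependent phoneme.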

Context clustering is then performed on the context-dependent phonemes for which HMM training has been carried out (step S160). This corresponds to clustering the HMM-trained context-dependent phonemes and obtaining a representative Gaussian distribution for each cluster. Once a representative Gaussian distribution has been obtained for each cluster, a model capable of speech synthesis is available for each clustered context. This model is called a context-dependent model with state sharing. In other words, clustering turns a context-dependent model without state sharing into a context-dependent model with state sharing that is used for speech synthesis, and as a result a decision tree for selecting among the state-shared context-dependent models is built. The following description deals with the context-dependent model after clustering, that is, with state sharing, so it is simply called the "context-dependent model". In speech synthesis it is desirable, where possible, to use a context-dependent model whose context is identical. As noted above, however, limited voice data does not provide phonemes for every combination of contexts, so context-dependent models are not available for every combination. A decision tree is therefore built by clustering so that the most suitable context-dependent model can be selected at synthesis time. FIG. 6 shows an example of a decision tree obtained by clustering. In FIG. 6, a thick arrow indicates that the answer to the branch condition of a binary node is "YES", and a thin arrow indicates that the answer is "NO". What questions (branch conditions) are placed in the decision tree, and how, is described below.

The set of context-dependent phonemes obtained from the voice data stored on the hard disk 31 depends on that voice data, so the questions (branch conditions) that make up the decision tree and the contents of each leaf node are not known in advance. Consequently, neither the appropriate questions for splitting the set of context-dependent phonemes nor the appropriate shape of the resulting decision tree can be fixed beforehand. Instead, candidate questions for splitting the set are prepared in advance, all context-dependent phonemes at the same state position are collected, every prepared question is applied, and the question that yields the best representative context-dependent models for the nodes after the split is selected to split the set. Specifically, all context-dependent phonemes are first placed in a single leaf node, and every question is applied to find the most suitable one. Once it is found, the leaf node is split by that question, dividing the set of context-dependent phonemes in two. All remaining questions are then applied in the same way to each newly created leaf node to find the most suitable question for that node. When the best combination of a node to split and a question for it is found, the node is split by that question and the same processing is repeated on the resulting leaf nodes. This is repeated until the decision tree reaches an appropriate size. An appropriate size is a balanced one that expresses the diversity of the training data without depending on it excessively: there are no empty leaf nodes, and each leaf node is assigned from several to several hundred context-dependent phonemes.
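The greedy splitting procedure described above can be sketched as follows. Several simplifications are assumptions rather than statements of the specification: each context-dependent phoneme is reduced to one scalar value, the quality of a split is measured by the reduction in summed squared error (a stand-in for whatever criterion the actual clustering uses), each question is used at most once along a path, and all names (`build_tree`, `lookup`, and so on) are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(items, questions, min_leaf):
    """Apply every remaining question and keep the one with the largest gain."""
    base = sse([v for _, v in items])
    best = None
    for name, pred in questions:
        yes = [it for it in items if pred(it[0])]
        no = [it for it in items if not pred(it[0])]
        if len(yes) < min_leaf or len(no) < min_leaf:
            continue  # never create empty or under-filled leaf nodes
        gain = base - sse([v for _, v in yes]) - sse([v for _, v in no])
        if best is None or gain > best[0]:
            best = (gain, name, pred, yes, no)
    return best


def build_tree(items, questions, min_leaf=2, min_gain=1e-6):
    """items: list of (context dict, value); questions: list of (name, predicate)."""
    split = best_split(items, questions, min_leaf)
    if split is None or split[0] <= min_gain:
        values = [v for _, v in items]
        # Leaf: representative Gaussian (mean, variance) for the cluster.
        return {"leaf": True, "mean": sum(values) / len(values),
                "var": sse(values) / len(values), "count": len(values)}
    _, name, pred, yes, no = split
    rest = [q for q in questions if q[0] != name]
    return {"leaf": False, "question": name, "pred": pred,
            "yes": build_tree(yes, rest, min_leaf, min_gain),
            "no": build_tree(no, rest, min_leaf, min_gain)}


def lookup(tree, context):
    """Follow YES/NO branches until a leaf is reached, even for unseen contexts."""
    while not tree["leaf"]:
        tree = tree["yes"] if tree["pred"](context) else tree["no"]
    return tree
```

Because every split is required to leave at least `min_leaf` items on both sides, a context that never occurred in the training data still reaches a populated leaf when `lookup` is called.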

When clustering is performed in this way, a decision tree for speech synthesis is obtained even if the voice data stored on the hard disk 31 does not provide enough context-dependent phonemes for a particular phoneme. Because questions are selected so that no empty leaf node arises when the tree is built, an attempt to synthesize a context-dependent phoneme that does not exist in the original voice data arrives at a leaf node in which similar context-dependent phonemes have been gathered, and speech can be synthesized with the context-dependent model of that leaf node.

When clustering is finished and the decision tree has been obtained, a representative Gaussian distribution is determined for every leaf node (step S170). Once the context-dependent phonemes have been HMM-trained and clustered, and a representative Gaussian distribution has been determined for each resulting leaf node, a context-dependent model for synthesizing speech in that singing style has been obtained. As many such decision trees are built as there are states, as shown schematically in FIG. 7. Looking at one context-dependent phoneme, each state belonging to it is given the length for which the state continues by the state duration model. A binary tree that determines this state duration is learned from a large number of context-dependent phonemes; this is called the state duration decision tree. Likewise, a mel-cepstrum decision tree, a fundamental-frequency decision tree, a pitch-vibrato decision tree, and so on are built for each analyzed parameter. For the other singing expression features, a timing decision tree is built from the timing model, a scoop decision tree from the scoop model, which covers scooping up and scooping down, and an attack/release decision tree from the attack/release model.

How the models for singing expression and their decision trees are built is outlined below, taking the scoop as an example. For parameters such as scoop and attack/release, an acoustic model is first built by HMM training and clustering with the scoop parameters ignored, and a singing voice is synthesized with it. The result is a singing voice without scoops. The fundamental-frequency sequence of the stored voice data, which does contain scoops, is then compared with the fundamental-frequency sequence of the singing voice generated from the scoop-free acoustic model. Since the difference between the two reflects the presence or absence of scoops, a scoop model that takes context into account can be built from it, and a scoop decision tree can be created by context clustering. For attack/release, the same procedure is applied to the difference in volume, from which an attack/release model is built and an attack/release decision tree is created. Obtaining this set of decision trees from the singing voice data of a particular singing style is, in the end, exactly what it means to have learned the acoustic model of that singing style.
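The comparison step described above can be sketched as a frame-wise difference between two logarithmic fundamental-frequency tracks. This is an illustration only: it assumes the recorded and synthesized tracks are already time-aligned frame by frame, represents unvoiced frames as `None`, and uses a hypothetical function name. How the resulting residual is turned into the six scoop parameters is not shown.

```python
def f0_residual(recorded_log_f0, baseline_log_f0):
    """Frame-wise difference between recorded log F0 and scoop-free synthesized log F0.

    Frames where either track is unvoiced (None) yield None.
    """
    if len(recorded_log_f0) != len(baseline_log_f0):
        raise ValueError("tracks must be aligned to the same number of frames")
    residual = []
    for rec, base in zip(recorded_log_f0, baseline_log_f0):
        residual.append(None if rec is None or base is None else rec - base)
    return residual
```

A residual that is nonzero only near the head of a note would correspond to a scoop at the note onset; the analogous difference taken on volume rather than pitch would feed the attack/release model.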

Learning by the learning unit 40 of the server 30 is performed as described above. The server 30 then determines whether the data analysis (step S140), HMM training (step S150), construction of decision trees by clustering (step S160), and determination of a representative Gaussian distribution for each leaf node (step S170) have been completed for all singing styles (step S180); if not, it repeats the processing for the next singing style. When it determines that the processing has been completed for all singing styles, the singing voice synthesis preparation routine shown in FIG. 3 ends. The learned acoustic model of each singing style is stored in the acoustic model storage unit 50 as the model of singing style A (written as the A model; the same applies hereinafter), the B model, the C model, ... the N model.

This completes the preparation for singing voice synthesis. Although this embodiment has been described starting from this preparation, the acoustic model for each singing style may be learned separately and only the result used. The acoustic models need not be learned by the method above; other methods may be used. If only the singing voice synthesis described below, which uses the singing styles, is to be performed, the acoustic model storage unit 50, the parameter adjustment unit 55, the score analysis unit 57, the speech synthesis unit 60, and the like in FIG. 1 are sufficient.

Next, the singing voice synthesis process is described. Once the singing voice synthesis preparation routine (FIG. 3) has been executed, the acoustic model storage unit 50 of the server 30 holds an acoustic model for each singing style. Speech synthesis is performed with these per-style acoustic models, using the server 30 and the computer PC2. The server 30 is provided with the parameter adjustment unit 55, the score analysis unit 57, and the speech synthesis unit 60. As described later, the parameter adjustment unit 55 adjusts the acoustic parameters according to the combination of the singing expression base models of the plurality of singing styles. Parameter adjustment using the base models is described in detail later. The score analysis unit 57 analyzes the score representing the singing voice to be synthesized and outputs the information needed for singing voice synthesis, beginning with the phoneme sequence (combinations of pitch and phoneme) to be synthesized. This information is what is needed to traverse the decision trees described above during synthesis: for example, linguistic information such as the position of a phoneme within its syllable, and score information such as whether a staccato is marked. Using this information, the decision trees are traversed and the parameters needed for singing voice synthesis are retrieved. The speech synthesis unit 60 receives the outputs of the parameter adjustment unit 55 and the score analysis unit 57 and synthesizes speech.

The speech synthesis unit 60 includes a speech parameter generation unit 61, a sound source generation unit 63, a synthesis filter 65, and the like. The speech parameter generation unit 61 receives the output of the score analysis unit 57 and generates various parameters, such as the fundamental frequency, mel-cepstrum parameters, and singing expression parameters, from the trained acoustic model of the singing style in which the singing voice is to be synthesized. The sound source generation unit 63 receives the parameters that affect pitch, such as the fundamental frequency, vibrato, scoop-up, and scoop-down, and generates an excitation source along the time axis. The synthesis filter 65 is a filter that synthesizes speech mainly from the mel-cepstrum; the MLSA filter is a known example. Singing expression parameters that are not involved in sound source generation by the sound source generation unit 63 are input to the synthesis filter 65 as part of the mel-cepstrum.

The singing voice synthesis process that the server 30 executes together with the computer PC2 is described with reference to FIG. 8. The server 30 first accepts the designation of singing styles from the computer PC2 (step S200). The singing styles whose acoustic models are stored in the acoustic model storage unit 50 are displayed on the display unit 53 of the computer PC2, and the user selects among them with the pointing device 52. One or more singing styles may be designated in general, but because this embodiment synthesizes singing that combines a plurality of singing styles, two or more are selected. Here, the three singing styles A, B, and C are assumed to have been selected and designated.

Next, the input score is analyzed (step S210). When the user enters the score of the song to be synthesized from the computer PC2 using the keyboard 51 or the like, the server 30 analyzes it with the score analysis unit 57. The score can be entered, for example, on the piano roll screen shown in FIG. 9. In this case, a piano-style keyboard is used as the keyboard 51 together with a text input keyboard: pitch and note length are entered on the piano-style keyboard, and the lyrics (for Japanese, the kana string corresponding to each note) are entered on the text input keyboard, with the two kept in correspondence. To enter other score information such as dynamics marks and staccato, dedicated buttons may be provided on the screen so that clicking a button writes that information onto the score. The dedicated keyboard can be connected to the computer PC2 through an interface such as USB or MIDI.

In the example of FIG. 9, a piano roll is displayed on the display unit 53. On the piano roll screen, the vertical axis corresponds to pitch and the horizontal axis to note length and position in time; the screen is divided into a grid whose cells are one pitch of the equal-tempered scale high and one quarter note long. Operating the piano-style keyboard specifies the pitch according to the key position, and the time for which the key is held specifies the horizontal length, with the quarter note as the reference length. The kana string of the lyrics for each note is entered from the text input keyboard. In the example of FIG. 9, "sa" and "i" are each one quarter note long and "ta" is two quarter notes long. Eighth notes, sixteenth notes, or other values shorter than a quarter note may be used as the input unit for note length. To enter a note shorter than the input unit, or a triplet or the like, the user designates the corresponding cell with the pointing device 52, right-clicks to display a menu, and selects a command such as "Split" or "Triplet" from the menu.

In step S210, the score entered in this way is analyzed by the score analysis unit 57 and stored in a storage unit (RAM or the like, not shown) as phoneme sequence data carrying context such as pitch, so that it can be used in the speech synthesis described later. Next, speech parameters are generated (step S220). This corresponds to the processing of the speech parameter generation unit 61 of the speech synthesis unit 60. Specifically, the acoustic parameters needed for singing voice synthesis are generated from the phonemes and pitches entered from the computer PC2 and analyzed by the score analysis unit 57 in step S210, together with their accompanying context. Because the acoustic model of each singing style is stored in the acoustic model storage unit 50, the acoustic models of the designated singing styles are looked up there. Each acoustic model is stored as a statistical model obtained by HMM training and clustered with the various decision trees, so the parameters needed to synthesize a singing voice in a designated singing style are generated from that model.

Next, the interpolation ratio setting process is performed (step S230). In this process, the singing styles designated in step S200 are displayed on the display unit 53 of the computer PC2 and the ratio of the singing expression of each singing style is set. Such a profile of the strength of a singing expression along the time axis is called a preset. It is called a "preset" because how the singing expression features of the plurality of singing styles are to be reflected in the synthesized voice is set before the actual speech synthesis. A preset is a relative specification of how strongly or weakly the singing expression features of each singing style should be reflected at each point on the time axis. Because this embodiment synthesizes speech by superimposing the singing expressions of a plurality of singing styles, interpolation between the base models corresponding to those singing styles is required. FIG. 10 shows an example of the editing screen for the interpolation ratios. In this example, the interpolation ratios are shown as a stacked graph with time on the horizontal axis and the interpolation ratios of the singing expressions of the singing styles stacked vertically. By default, each of the singing styles A, B, and C has an equal share determined by the number of selected singing styles (three in this example), that is, 1/3 each. On the screen, a boundary line LAB is shown between singing styles A and B, and a boundary line LBC between singing styles B and C. The user can move the boundary lines LAB and LBC freely with the pointing device 52. Because interpolation ratios are being edited, the singing styles always sum to 100%, so moving a boundary line does not change the overall height on the screen. For example, when a point PBC on the boundary line LBC is grabbed with the pointing device 52 and moved up or down, the ratios of the singing styles on either side of that boundary line increase and decrease complementarily, while the ratio of a singing style that does not share the moved boundary line does not change.
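The relationship between the two boundary lines and the three ratios can be illustrated with a short sketch. It assumes, purely for illustration, that the graph is stacked from bottom to top in the order C, B, A and that boundary heights are expressed as fractions of the total height at one point in time; the function name `ratios_from_boundaries` is hypothetical.

```python
from fractions import Fraction


def ratios_from_boundaries(l_ab, l_bc):
    """Convert boundary heights (fraction of total height, measured from the
    bottom) into per-style ratios, assuming the stacking order C, B, A."""
    return {"A": 1 - l_ab, "B": l_ab - l_bc, "C": l_bc}


# Default state: three equal shares.
default = ratios_from_boundaries(Fraction(2, 3), Fraction(1, 3))

# Dragging the point PBC on boundary LBC upward changes only B and C.
moved = ratios_from_boundaries(Fraction(2, 3), Fraction(1, 2))

print(default)
print(moved)
```

Because each ratio is a difference of adjacent boundary heights, the ratios always sum to 1, and a style that does not touch the moved boundary keeps its ratio. In the interface the boundaries are functions of time, so the same conversion would be applied at every time point.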

In the simplest case, a boundary line is moved by dragging a point designated with the pointing device 52 up or down, so that the line follows a predetermined curve and the neighborhood of the designated point changes as well. For finer adjustment, the boundary line may be treated as a free-form curve (such as a Bézier curve), as shown for example in FIG. 11, with an anchor point and handles displayed when a point on the boundary line is designated with the pointing device 52. In the example of FIG. 11, when the cursor KS is placed on a point PAB on the boundary line LAB (FIG. 11(A)) and a button on the pointing device 52 is clicked (FIG. 11(B)), an anchor point is displayed at the point PAB, and handles Hu and Hd are displayed extending from the point PAB in the tangential direction (FIG. 11(C)).

In this state, grabbing the handle Hu or Hd with the cursor KS by means of the pointing device 52 and moving it deforms the boundary line LAB according to the amount of movement of the handle (FIG. 11(D)). The range over which the boundary line is deformed by a handle operation depends on the length of the handle. The length of a handle (its distance from the anchor point) can be changed by moving the handle Hu or Hd along its own direction with the cursor KS. A longer handle widens the range affected by the handle operation, so the boundary line can be changed gently; a shorter handle allows the boundary line to be changed sharply. Furthermore, when the anchor point is grabbed with the cursor KS and moved, the whole portion of the curve within the range influenced by the handles moves in the direction of the cursor movement while the boundary line stays smooth.
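If the boundary line is represented as a chain of cubic Bézier segments, which the specification mentions only as one possible free-form curve, a point on one segment could be evaluated as in the sketch below. The function name `bezier_point`, the representation of handles as offsets from their anchor points, and the coordinates (time, ratio) are assumptions of this sketch.

```python
def bezier_point(a0, h0, a1, h1, t):
    """Point at parameter t in [0, 1] on the cubic Bezier segment from anchor a0
    to anchor a1. h0 and h1 are handle offsets (time, ratio) relative to their
    anchors, so the inner control points are a0 + h0 and a1 + h1."""
    p1 = (a0[0] + h0[0], a0[1] + h0[1])
    p2 = (a1[0] + h1[0], a1[1] + h1[1])
    u = 1.0 - t
    # Bernstein weights of the four control points.
    b0, b1, b2, b3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
    x = b0 * a0[0] + b1 * p1[0] + b2 * p2[0] + b3 * a1[0]
    y = b0 * a0[1] + b1 * p1[1] + b2 * p2[1] + b3 * a1[1]
    return (x, y)


# Two anchors at the same height; both handles point straight up.
short = bezier_point((0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (0.0, 1.0), 0.5)
long_ = bezier_point((0.0, 0.0), (0.0, 2.0), (1.0, 0.0), (0.0, 2.0), 0.5)
print(short, long_)
```

Lengthening the handles pulls the midpoint of the segment farther from the straight line between the anchors, which corresponds to the wider range of influence of a longer handle described above.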

Through these operations, the ratios at which the singing expressions of the plurality of singing styles are superimposed can be set via the graphical interface displayed on the display unit 53. In the method above, the ratios of the singing expressions of the singing styles sum to a constant, so the singing expression features of the singing styles are combined by interpolation in the strict sense. Alternatively, in FIG. 10 for example, the point PAB may be made movable beyond the upper limit line of singing style A, so that the ratio between the features of singing style A and singing style B is determined by extrapolation. With extrapolation, in this example, the resulting features move away from those of singing style A.
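The difference between interpolation and extrapolation can be shown with a two-style blend of a single scalar parameter. This is an illustrative sketch only: the function name `blend` and the parameter values (imagined here as a vibrato depth) are hypothetical.

```python
def blend(p_a, p_b, w_a):
    """Blend one parameter of styles A and B; the weights w_a and 1 - w_a sum to 1.

    0 <= w_a <= 1 interpolates between the two styles;
    w_a < 0 extrapolates beyond B, away from A.
    """
    return w_a * p_a + (1.0 - w_a) * p_b


p_a, p_b = 6.0, 4.0           # made-up values for styles A and B
print(blend(p_a, p_b, 0.5))   # halfway between A and B
print(blend(p_a, p_b, -0.5))  # beyond B, on the side away from A
```

With a negative weight for style A the result lies outside the range spanned by the two styles, on the far side of B, which is the sense in which the extrapolated features move away from singing style A.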

FIG. 10 shows collectively the ratios at which the singing expression features of the three singing styles are reflected, but they may instead be displayed individually. FIG. 12 shows, as singing styles, the base models of singing expression for individual singers. The same singer may of course adopt different singing styles, but here each singer is shown as having a different singing style so that the synthesis is easier to picture. The display unit 53 of the computer PC2 may display the graphical interface shown in FIG. 10, the one shown in FIG. 12, or both together. In the example of FIG. 12, the strength of each singer's singing expression along the time axis is easy to grasp. The base models shown in FIG. 12 can be edited with the pointing device 52 as easily as in the example of FIG. 10.

The base models of the singing styles of the individual singers shown in FIG. 12 may be individually settable or may be linked to one another. In the former case, the base model of each singing style can be edited independently; but because interpolation is performed by ratio, when the base models are summed, each is converted into its ratio of the whole, as shown in FIG. 10. In this case, the per-singer interpolation ratios shown in FIG. 12 can also be regarded as weighting coefficients in the computation that performs the synthesis. Setting a weighting coefficient to zero over the entire time range is equivalent to not selecting that singing style. In the latter case, when an arbitrary point on the base model of one singing style (for example, singing style A) is grabbed with the pointing device 52 and moved up or down, half of the amount moved is applied, with the opposite sign, to the base model of each of the remaining singing styles (singing styles B and C), and the display keeps the total unchanged. The user may choose which interface to use.
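The two editing modes can be sketched as follows for a single time point. The function names `normalize` and `linked_move` are hypothetical, and generalizing the "half each" rule of the three-style example to an equal split among however many other styles there are is an assumption of this sketch.

```python
def normalize(weights):
    """Independent mode: convert freely edited weights into ratios that sum to 1."""
    total = sum(weights.values())
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    return {style: w / total for style, w in weights.items()}


def linked_move(ratios, moved, delta):
    """Linked mode: add delta to one style and take it in equal parts from the rest."""
    others = [s for s in ratios if s != moved]
    share = delta / len(others)
    return {s: (r + delta if s == moved else r - share) for s, r in ratios.items()}


independent = normalize({"A": 2.0, "B": 1.0, "C": 1.0})
linked = linked_move({"A": 0.25, "B": 0.5, "C": 0.25}, "A", 0.25)
print(independent)
print(linked)
```

In independent mode a weight of zero simply yields a ratio of zero, which matches the remark that a zero weighting coefficient is the same as not selecting the style. In linked mode the total is preserved by construction, so no renormalization is needed.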

In the example of FIG. 10, immediately after the singing styles are designated, the ratio of singing expression for each singing style is 1/3 and is flat along the time axis. Alternatively, the ratio of the singing expression of each singing style immediately after designation may be set in advance along the time axis; that is, an initial time profile of the ratio is defined beforehand for each singing style. For example, if the singing expression of singing style A is normally weakened in the second half of a song, a preset that decreases gradually toward the second half can be prepared in advance, as illustrated in FIG. 12. The check boxes at the right edge of FIG. 12 specify whether such a preset is used. A check box can be switched on or off with the pointing device 52. If a check box is not checked, the prepared preset is not used for that singing style, and the ratio at which its base model is used is set starting from the default flat ratio.

In the description above, the base model was handled as a whole, as the singing expression of a singing style. This is because the singing expressions, together with the fundamental frequency, strongly reflect the characteristics of a singing style, so handling them together makes it possible to synthesize a singing voice that resembles that style. A singing expression naturally comprises several elements, such as vibrato, kobushi, scoop-up, scoop-down, and timing. The base models shown in FIG. 10 and FIG. 12 handle these together, but they may instead be settable for each individual singing expression. In that case, raising the ratio of singer A's singing style in the vibrato preset and the ratio of singer B's singing style in the scoop-up preset, for example, yields synthesized singing whose vibrato resembles singer A's style and whose scoop-up resembles singer B's.

After the interpolation ratios have been set using the graphical interface shown in FIG. 10 (step S230), the parameters are adjusted (step S250). In parameter adjustment, the per-style parameters generated in step S220 are linearly combined using the interpolation ratios set in step S230. If presets are set for individual singing expressions, the combination is likewise performed separately for the parameters of each singing expression. The combination need not be linear; it may be nonlinear, in which case a nonlinear combination model is defined in advance.
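The linear combination can be written frame by frame as a weighted sum over the selected styles. The sketch below assumes one scalar parameter per frame and one ratio per style per frame; the function name `combine` and all values are hypothetical.

```python
def combine(params_by_style, ratios_by_style):
    """Frame-wise linear combination: out[t] = sum over styles s of r_s[t] * p_s[t]."""
    styles = list(params_by_style)
    n_frames = len(params_by_style[styles[0]])
    return [
        sum(ratios_by_style[s][t] * params_by_style[s][t] for s in styles)
        for t in range(n_frames)
    ]


# Made-up per-frame values of one singing expression parameter for each style.
params = {"A": [4.0, 4.0, 4.0], "B": [2.0, 2.0, 2.0], "C": [0.0, 0.0, 0.0]}
# Time-varying interpolation ratios; they sum to 1 in every frame.
ratios = {"A": [1.0, 0.5, 0.0], "B": [0.0, 0.5, 0.5], "C": [0.0, 0.0, 0.5]}

mixed = combine(params, ratios)
print(mixed)
```

When presets are set per singing expression, the same function would be called once per expression parameter, each time with that expression's own ratios. A nonlinear combination would replace the weighted sum with whatever model is defined in advance.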

Once the parameters have been adjusted, they are used to configure the sound source generation unit 63 and the synthesis filter 65, and the settings are stored in a RAM (not shown) (step S260).

The sound source generation unit 63 and the synthesis filter 65 have now been configured. Next, it is determined whether the adjustment is complete (step S270). If the result of synthesizing speech with the interpolation ratios set in step S230 is not what the user wanted, the user repeats the process from the setting of the interpolation ratios. One way to judge whether the interpolation ratios set in step S230 are acceptable is to actually synthesize the singing voice and decide whether it is satisfactory. Alternatively, the pitch of the singing voice to be synthesized can be calculated, sent to the computer PC2, and displayed on its display unit 53 for the user to inspect. FIG. 13 shows an example of such a pitch display screen.

In the embodiment above, once the parameters have been generated (step S220), changing the interpolation ratio settings (step S230) only requires adjusting the ratio of each parameter; the parameters are not generated anew. However, when interpolation of, for example, the start and end timing of context-dependent phonemes is required, the durations of the HMM states change, and the parameters must be generated again. In that case, when the determination in step S270 is "NO", the process may return to step S220 and repeat the processing above.

When the singing voice processing apparatus 100 of this embodiment has set the interpolation ratios (step S230), adjusted the parameters (step S240), and configured the sound source and the filter (step S260), it displays the resulting pitch and asks the user whether the adjustment can be considered complete. When the user inspects the pitch and volume shown on the display unit 53 of the computer PC2, judges that the interpolation ratio settings are as desired, and indicates that adjustment is complete, the singing voice processing apparatus 100 synthesizes the singing voice using the parameters set in the sound source generation unit 63 and the synthesis filter 65 (step S280). The synthesized singing voice is played back from 70. This synthesized singing voice interpolates the singing expressions of the plurality of singing styles at the designated ratios. After synthesizing the singing voice, the singing voice processing apparatus 100 ends the process.

According to the singing voice processing apparatus 100 described above, HMM-based learning makes it possible to generate, from a small amount of source audio, an acoustic model that reflects a singer's voice quality and singing style, and to generate from it a singing voice with the characteristics of that singing style. In addition, a singing voice can be synthesized by interpolating the singing expressions of a plurality of singing styles at arbitrary ratios. The ratios of the singing expressions to be interpolated can be set easily through a graphical interface, and because the result of changing the interpolation ratios can be checked readily, the ratios are also easy to adjust.

Next, a second embodiment of the present invention is described. The singing voice processing apparatus 100 of the second embodiment has the same hardware configuration as that of the first embodiment, and the singing voice synthesis preparation process and the singing voice synthesis process are basically the same. The second embodiment differs in how the interpolation ratios are set. FIG. 14 is an explanatory diagram showing the method of setting the interpolation ratios in the second embodiment.

As illustrated, in this example the score is entered using the piano roll shown in FIG. 9, as in the first embodiment. In the second embodiment, the interpolation ratios of the singing expressions can be set per note rather than for the whole song. Specifically, one note TTG in the score entered with the piano roll is designated with the pointing device 52, and the interpolation ratios of the singing expression for this note TTG are then set. FIG. 14 takes vibrato as the singing expression and illustrates the setting of its interpolation ratios.

In the example of FIG. 14 as well, the singing expressions (vibrato) of the singing styles A, B, and C of three singers A, B, and C are superimposed, and the ratios are set by sliding the bars BAB and BBC displayed at the boundaries between the singing styles of singers A, B, and C. Sliding the boundary bars BAB and BBC also moves the sliders BA, BB, and BC displayed below them, which indicate the vibrato preset for each singing style. Conversely, moving the sliders BA, BB, and BC individually also moves the bars BAB and BBC that indicate the interpolation ratios.

Although not illustrated, here too the pitch may be displayed each time the vibrato interpolation ratios are changed so that the user can judge whether the ratios are as desired. The synthesized singing voice may of course also be played back from 70 for this judgment.

In addition to the effects of the first embodiment, the singing voice processing apparatus 100 of the second embodiment described above allows the interpolation ratios of the singing expressions to be set for each note, so the singing expression can be specified in fine detail. Although the interpolation ratios were set per note in the description above, they may also be set per phoneme, per phrase, or in other units.

Next, several modifications of the embodiments will be described. In the above embodiments, parameters relating to vibrato, shakuri (pitch scoop), and the like were prepared as parameters that affect the singing expression, as shown in FIG. 4, but these singing expressions can also be contained in the sound source information and the spectrum information. Accordingly, the parameters affecting the singing expression that are used when adjusting the degree of combination of singing expressions may be limited to parameters that correspond directly to vibrato, shakuri, and the like, or may include some or all of the sound source information, the spectrum information, and so on. It is also possible to reproduce the singing expression from the sound source information, the spectrum information, and the like without providing parameters that correspond directly to the singing expression (such as the vibrato frequency and amplitude or the shakuri parameters in FIG. 4). In that case, when adjusting the degree of combination of singing expressions, the fundamental frequency and the volume, optionally together with some or all of the spectrum information and the like, may be used as the parameters affecting the singing expression.

In the above embodiments, the acoustic models are learned on the basis of singing styles, so every acoustic model contains some singing expression. Alternatively, an acoustic model may be learned with all singing expressions removed and treated as one that corresponds to no particular singing style ("normal"). This "normal" singing style may be designated as one of the plural singing styles and combined with other singing styles A, B, and so on, with interpolation ratios set among them. For example, by designating the "normal" singing style and a specific singing style A and setting the interpolation ratio, the strength of the singing expression of singing style A can be set freely.
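
As a rough illustration of how interpolating between the "normal" style and style A controls the strength of style A's expression, the sketch below blends two pitch contours frame by frame. The function name, the list-of-floats representation, and the simple linear form are assumptions for the example; the description does not fix a particular formula.

```python
from typing import List, Sequence


def style_strength(normal: Sequence[float],
                   style: Sequence[float],
                   alpha: float) -> List[float]:
    """Linear blend per frame: alpha = 0 gives the 'normal' contour,
    alpha = 1 gives style A, and alpha > 1 exaggerates style A
    (extrapolation)."""
    return [(1.0 - alpha) * n + alpha * s for n, s in zip(normal, style)]


# Hypothetical F0 values (Hz): a flat 'normal' note and a style-A note
# whose second frame is raised by vibrato.
normal_f0 = [100.0, 100.0]
style_a_f0 = [100.0, 104.0]
half_strength = style_strength(normal_f0, style_a_f0, 0.5)
```

Here `half_strength` lies midway between the two contours, so the vibrato excursion of style A is halved.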

In the above embodiments, singing expressions such as vibrato and shakuri (pitch scoop) were each learned individually by HMM training or the like, but the difference from the singing expression of the normal singing style described above may instead be learned with a statistical model. With this approach, when several normal singing styles are defined, the same singing expression can be applied to each of them.
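
The idea of holding a singing expression as a difference from the normal style, so that it can be applied to more than one normal style, can be sketched as below. This is a deliberately simplified stand-in: the description speaks of learning the difference with a statistical model, whereas this example uses a plain per-frame subtraction, and all names and values are hypothetical.

```python
from typing import List, Sequence


def expression_delta(normal: Sequence[float],
                     expressive: Sequence[float]) -> List[float]:
    """Per-frame difference between an expressive contour and the
    corresponding 'normal' contour (stand-in for a learned model)."""
    return [e - n for n, e in zip(normal, expressive)]


def apply_delta(base: Sequence[float],
                delta: Sequence[float],
                weight: float = 1.0) -> List[float]:
    """Add the (optionally scaled) difference to any normal contour."""
    return [b + weight * d for b, d in zip(base, delta)]


# Difference obtained against one normal style ...
delta = expression_delta([200.0, 200.0, 200.0], [200.0, 204.0, 196.0])
# ... applied at half strength to a different normal style.
result = apply_delta([300.0, 300.0, 300.0], delta, 0.5)
```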

Besides the examples above, the parameters to be interpolated may include, for example, the utterance start timing and the utterance end timing, and these timings may be interpolated across plural singing styles. For at least one of the utterance start timing and the utterance end timing, speech can then be synthesized by combining the features of plural singing styles, just as for the other parameters. Furthermore, for the various singing expressions described above, interpolation may be performed per phoneme, per syllable, or per unit of time instead of per note as described above. This allows the unit of interpolation to be set appropriately for the target. The unit of interpolation may also be switched within a single song being synthesized.

In the above embodiments, the ratio of the singing expressions of the plural singing styles is adjusted using a graphical interface. Although the change may then be less intuitive, the ratio may instead be specified numerically without a graphical interface. In that case, the time axis may be divided into several parts (for example, introduction, first half 1, first half 2, second half 1, second half 2, and ending), and the ratio within each range specified as a number. At the boundaries between ranges, the ratio may be gradually increased or decreased so that the proportion of each singing style's singing expression changes smoothly.
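
A numerical specification with gradual transitions at range boundaries might look like the following sketch. The section start times, the two-style setup (the ratio is the weight of one style, the other receiving the remainder), the linear ramp, and the `fade` length are all assumptions chosen for the example.

```python
from bisect import bisect_right
from typing import Sequence


def ratio_at(t: float,
             starts: Sequence[float],
             ratios: Sequence[float],
             fade: float) -> float:
    """Ratio of one singing style at time t.

    starts: ascending start times of the sections.
    ratios: ratio specified numerically for each section.
    fade:   length of the linear ramp that begins at each boundary;
            assumed to be no longer than the shortest section.
    """
    i = max(bisect_right(starts, t) - 1, 0)  # section containing t
    if i == 0 or fade <= 0.0:
        return ratios[i]
    # Ramp from the previous section's ratio to this section's ratio.
    u = min((t - starts[i]) / fade, 1.0)
    return (1.0 - u) * ratios[i - 1] + u * ratios[i]


# Hypothetical sections: introduction, first half, second half.
starts = [0.0, 10.0, 20.0]
ratios = [0.0, 1.0, 0.5]
midway = ratio_at(11.0, starts, ratios, fade=2.0)
```

One second into a two-second ramp, `midway` is halfway between the ratios of the first and second sections.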

In the above embodiments, the same kind of HMM is learned for each of the plural singing styles, acoustic parameters are generated from these models, and interpolation is performed. However, provided the parameters for each singing style are standardized, acoustic parameters generated by different methods may be interpolated with one another at the designated interpolation ratio. That is, if the parameters corresponding to the singing expression are prepared in advance for each singing style, then as long as at least one of them is derived from an acoustic model learned as an HMM, the other acoustic parameters may be derived by, for example, a rule-based method.

In the above embodiments, the degree of combination of singing expressions is adjusted by interpolating acoustic parameters. However, when all of the acoustic parameters to be combined are derived from acoustic models (base models) learned by a statistical method, the interpolation may be performed at the base model stage. Specifically, when the degree of combination of plural singing styles for a given singing expression is designated through the interface unit, the internal parameters of the base models that affect this singing expression are interpolated, and the acoustic parameters needed to synthesize the singing voice are then extracted from the base model using the interpolated internal parameters. Speech is synthesized by supplying the extracted acoustic parameters to the sound source generation unit 63 and the synthesis filter 65. The extracted acoustic parameters then correspond to the singing expression after the degree of combination has been adjusted, and speech can be synthesized even more smoothly.

When speech is synthesized by this approach, the unit of interpolation can be each state of the base model. When the base model is an HMM, interpolation is performed with the HMM state as the unit. Of course, even though the interpolation itself is performed per HMM state, the user interface for adjusting the degree of combination of singing expressions exemplified in FIGS. 10 to 12 may let the user designate the degree of combination per note, per phoneme, per syllable, per phrase, per song, per predetermined unit of time, or the like. In that case, the range of HMM states is obtained from the temporal range designated through the interface in that unit, and the interpolation is then computed.
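
The step of converting a designated temporal range into a range of HMM states and then interpolating per state can be sketched as follows. The representation is an assumption for illustration only: each state is reduced to a single scalar mean and a duration in frames, states outside the designated range simply keep style A's values, and the function names are hypothetical.

```python
from typing import List, Sequence


def states_in_range(durations: Sequence[int], t0: int, t1: int) -> List[int]:
    """Indices of the states that overlap the frame range [t0, t1),
    given the duration (in frames) of each consecutive state."""
    indices, start = [], 0
    for i, d in enumerate(durations):
        end = start + d
        if start < t1 and end > t0:
            indices.append(i)
        start = end
    return indices


def interpolate_state_means(means_a: Sequence[float],
                            means_b: Sequence[float],
                            weight_b: float,
                            state_idx: Sequence[int]) -> List[float]:
    """Interpolate per-state means of two base models, but only for the
    designated states; other states keep model A's means (assumption)."""
    out = list(means_a)
    for i in state_idx:
        out[i] = (1.0 - weight_b) * means_a[i] + weight_b * means_b[i]
    return out


# A note designated in the interface covers frames 4..7 (hypothetical).
idx = states_in_range([3, 2, 4, 1], 4, 8)
mixed = interpolate_state_means([1.0, 2.0, 3.0, 4.0],
                                [3.0, 4.0, 5.0, 6.0], 0.5, idx)
```

Acoustic parameters would then be generated from the interpolated state parameters rather than from either model alone.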

Several embodiments and modifications of the present invention have been described above, but the present invention is not limited to these embodiments and can of course be implemented in various forms without departing from its gist. For example, it may be implemented solely as a singing voice synthesizing apparatus without a singing voice learning function. The singing voice is also not limited to one based on equal temperament; one that follows its own tuning system, as in folk music, may be used. For example, the invention may be applied to the synthesis of singing voices that follow tuning systems predating equal temperament, such as gagaku, yokyoku (Noh chant), shomyo (Buddhist chant), and sutra chanting in Japan, or Gregorian chant in Europe. Furthermore, the singer's voice is not limited to that of a real singer; mechanically synthesized speech or the like may also be used.

DESCRIPTION OF SYMBOLS
10 … Score input unit
20 … Voice input unit
30 … Server
31 … Hard disk
33 … Score analysis unit
40 … Learning unit
41 … F0 extraction unit
43 … SP extraction unit
44 … Singing P extraction unit
45 … HMM learning unit
50 … Acoustic model storage unit
51 … Keyboard
52 … Pointing device
53 … Display unit
55 … Parameter adjustment unit
57 … Score analysis unit
60 … Speech synthesis unit
61 … Speech parameter generation unit
63 … Sound source generation unit
65 … Synthesis filter
100 … Singing voice processing apparatus

Claims

A singing voice synthesizing apparatus that synthesizes a singing voice, comprising:
a storage unit that stores, for at least one of the singing voices of a plurality of singing styles, an acoustic model obtained by learning acoustic parameters including a parameter in which at least a singing expression of the singing voice is reflected, as a base model for that singing style;
an interface unit that adjusts the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles;
a parameter determination unit that selects, from among sets of acoustic parameters capable of reproducing the singing expression included in a singing style, at least two sets of acoustic parameters including at least one set obtained using the stored base model, and interpolates the acoustic parameters affecting the singing expression in the selected at least two sets with the degree of combination adjusted by the interface unit, thereby determining synthesis acoustic parameters; and
a synthesis unit that synthesizes a singing voice using the synthesis acoustic parameters.

The singing voice synthesizing apparatus according to claim 1, wherein the sets of acoustic parameters include at least one of a fundamental frequency, a volume, and a parameter corresponding to a singing expression.

The singing voice synthesizing apparatus according to claim 2, wherein the sets of acoustic parameters further include a spectral parameter.

The singing voice synthesizing apparatus according to any one of claims 1 to 3, wherein each of the selected at least two sets of acoustic parameters is a set of acoustic parameters obtained using the stored base model.

The singing voice synthesizing apparatus according to any one of claims 1 to 3, wherein one of the selected at least two sets of acoustic parameters is a set of acoustic parameters generated by a rule-based method.

A singing voice synthesizing apparatus that synthesizes a singing voice, comprising:
a storage unit that stores, as a base model for each singing style, an acoustic model obtained by learning, using a statistical method, acoustic parameters including a parameter in which at least the singing expression included in each of the singing voices of a plurality of singing styles is reflected;
an interface unit that adjusts the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles;
an interpolation extraction unit that extracts, from the plurality of base models stored in the storage unit and on the basis of the base models corresponding to the at least two singing styles selected from the plurality of singing styles, synthesis acoustic parameters in which the singing expression is interpolated with the degree of combination adjusted by the interface unit; and
a synthesis unit that synthesizes a singing voice using the synthesis acoustic parameters.

The singing voice synthesizing apparatus according to any one of claims 1 to 6, wherein the parameters in which the singing expression is reflected include a parameter corresponding to at least one of vibrato, shakuri (pitch scoop), attack/release, and kobushi (melismatic ornament).

The singing voice synthesizing apparatus according to any one of claims 1 to 6, wherein the parameters in which the singing expression is reflected include a parameter corresponding to at least one of an utterance start timing and an utterance end timing.

The singing voice synthesizing apparatus according to any one of claims 1 to 5,
wherein the degree of combination of the singing expressions is adjusted by interpolating the values of the respective acoustic parameters.

The singing voice synthesizing apparatus according to claim 9, wherein the interpolation is performed by linearly or nonlinearly combining the acoustic parameters.

The singing voice synthesizing apparatus according to claim 9 or claim 10, wherein the interpolation is performed in any one of the following units: per note, per phoneme, per syllable, per phrase, per song, or per predetermined unit of time.

The singing voice synthesizing apparatus according to claim 6,
wherein the degree of combination of the singing expressions is adjusted by interpolating internal parameters of the base models.

The singing voice synthesizing apparatus according to claim 12, wherein the interpolation is performed by linearly or nonlinearly combining the internal parameters of the base models.

The singing voice synthesizing apparatus according to claim 12 or claim 13, wherein the interpolation is performed per state of the base model.

The singing voice synthesizing apparatus according to any one of claims 9 to 14, wherein the interpolation is interpolation in the narrow sense (between the given values) or extrapolation.

The singing voice synthesizing apparatus according to any one of claims 1 to 15, wherein one of the stored base models is a base model consisting of standard acoustic parameters prepared in advance.

The singing voice synthesizing apparatus according to any one of claims 1 to 16,
further comprising an image display device and a pointing device,
wherein the interface unit is
a graphical user interface drawn on the image display device, and
the degree of combination is changed by operating, with the pointing device, a screen drawn on the image display device as the graphical user interface.

A singing voice synthesis method for synthesizing a singing voice, comprising:
storing, for at least one of the singing voices of a plurality of singing styles, an acoustic model obtained by learning acoustic parameters including a parameter in which at least a singing expression of the singing voice is reflected, as a base model for that singing style;
adjusting the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles;
selecting, from among sets of acoustic parameters capable of reproducing the singing expression included in a singing style, at least two sets of acoustic parameters including at least one set obtained using the stored base model, and interpolating the acoustic parameters affecting the singing expression in the selected at least two sets with the adjusted degree of combination, thereby determining synthesis acoustic parameters; and
synthesizing a singing voice using the synthesis acoustic parameters.

A singing voice synthesis method for synthesizing a singing voice, comprising:
storing in a storage unit, as a base model for each singing style, an acoustic model obtained by learning, using a statistical method, acoustic parameters including a parameter in which at least the singing expression included in each of the singing voices of a plurality of singing styles is reflected;
adjusting the degree of combination of the singing expressions included in at least two singing styles selected from the plurality of singing styles;
extracting, from the plurality of base models stored in the storage unit and on the basis of the base models corresponding to the at least two singing styles selected from the plurality of singing styles, synthesis acoustic parameters in which the singing expression is interpolated with the adjusted degree of combination; and
synthesizing a singing voice using the synthesis acoustic parameters.