JP6470586B2

JP6470586B2 - Audio processing apparatus and program

Info

Publication number: JP6470586B2
Application number: JP2015029995A
Authority: JP
Inventors: 齋藤 礼子; 清山 信正; 今井 篤; 都木 徹
Original assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Current assignee: Japan Broadcasting Corp; NHK Engineering System Inc
Priority date: 2015-02-18
Filing date: 2015-02-18
Publication date: 2019-02-13
Anticipated expiration: 2035-02-18
Also published as: JP2016151715A

Description

The present invention relates to a speech processing apparatus and a program.

A speech processing method that imparts diverse speaking styles to speech can broaden the range of vocal expression needed in content production, voice interfaces, and similar applications. Among such speaking styles, methods for processing speech into emotional expression have been studied particularly often.
One method devised for converting calm speech into emotional speech applies voice quality conversion based on a probabilistic model (see, for example, Non-Patent Document 1). However, building the probabilistic model requires parallel data of calm speech and emotional speech from the conversion target speaker. The method therefore cannot be used when no emotional speech of the conversion target speaker is available in advance.
As a method for synthesizing emotional speech of an arbitrary speaker from text, a technique based on HMM speech synthesis has also been devised: an emotion-imparting model, learned from a training speaker's calm speech model and emotional speech, is applied to the calm speech model of the arbitrary speaker (see, for example, Non-Patent Document 2). However, this technique requires a calm speech database to be prepared in advance for the arbitrary speaker as well. Because this database contains linguistic information in addition to acoustic features, creating a new one is costly.

Non-Patent Document 1: Yohei Iwami and 4 others, "Emotional speech synthesis using voice quality conversion based on GMM", The Institute of Electronics, Information and Communication Engineers, IEICE Technical Report, SP (Speech), 102 (619), 2003, pp. 11-16.
Non-Patent Document 2: Yamato Otani and 5 others, "Examination of how to give emotions to an arbitrary speaker based on addition model in HMM speech synthesis", Proceedings of the Acoustical Society of Japan, 2014 Spring Meeting, 2-7-2, 2014, pp. 233-236.

To realize a speech processing method that can impart diverse speaking styles to speech in a variety of situations, it must be possible to add emotional expression to speech of any utterance content by any speaker, even when no emotional speech data exists in advance and only a small amount of calm speech data is available.

The present invention has been made in view of these circumstances, and provides a speech processing apparatus and a program that can, at reduced cost, process calm speech of an arbitrary utterance by an arbitrary speaker into emotional speech, even when no emotional speech of that speaker has been prepared in advance.

One aspect of the present invention is a speech processing apparatus comprising: a speech analysis unit that acoustically analyzes speech data of calm speech of a conversion target speaker to acquire an acoustic feature for each frame; a first conversion unit that converts the acoustic feature of each frame acquired by the speech analysis unit, using a first conversion rule for converting an acoustic feature of calm speech of the conversion target speaker into an acoustic feature of calm speech of a reference speaker; a second conversion unit that converts the acoustic feature of each frame acquired by the speech analysis unit, using a second conversion rule for converting an acoustic feature of calm speech of the conversion target speaker into an acoustic feature of emotional speech of the reference speaker; a difference acquisition unit that calculates, for each frame, the difference of the acoustic feature obtained by the second conversion unit relative to the acoustic feature obtained by the first conversion unit; and a processing unit that adds, for each frame, the difference calculated by the difference acquisition unit to the acoustic feature acquired by the speech analysis unit.
According to this aspect, the speech processing apparatus uses the first conversion rule, which converts acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's calm speech, to convert the acoustic features of the conversion target speaker's calm speech data and obtain acoustic features of the reference speaker's calm speech. It likewise uses the second conversion rule, which converts acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's emotional speech, to convert the same acoustic features and obtain acoustic features of the reference speaker's emotional speech. The apparatus then obtains the difference of the reference speaker's emotional-speech features relative to the reference speaker's calm-speech features, and adds that difference to the acoustic features of the conversion target speaker's calm speech to obtain acoustic features of the conversion target speaker's emotional speech.
As a result, the speech processing apparatus can process calm speech of an arbitrary utterance by an arbitrary speaker into emotional speech through simple processing, without preparing emotional speech of the conversion target speaker in advance.
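The per-frame operation described in this aspect can be written compactly as follows. The symbols are notation introduced here for illustration and do not appear in the source: $x_t$ is the acoustic feature of frame $t$ of the conversion target speaker's calm speech, $F_1$ and $F_2$ are the first and second conversion rules, $d_t$ is the difference, and $y_t$ is the processed feature.

```latex
\begin{aligned}
d_t &= F_2(x_t) - F_1(x_t), \\
y_t &= x_t + d_t .
\end{aligned}
```

Read this way, $d_t$ is the calm-to-emotional shift estimated on the reference speaker's side, and adding it to $x_t$ leaves the conversion target speaker's own calm-speech feature as the baseline.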

One aspect of the present invention is the speech processing apparatus described above, wherein the difference acquisition unit normalizes the acoustic feature obtained by the first conversion unit and the acoustic feature obtained by the second conversion unit, and then calculates the difference for each frame.
According to this aspect, the speech processing apparatus normalizes the acoustic features of the reference speaker's calm speech, obtained by converting the acoustic features of the conversion target speaker's calm speech data with the first conversion rule, and the acoustic features of the reference speaker's emotional speech, obtained by converting the same acoustic features with the second conversion rule. It then calculates their difference and adds it to the acoustic features of the conversion target speaker's calm speech.
As a result, the speech processing apparatus can accurately convert an arbitrary utterance of the conversion target speaker into emotional speech of the conversion target speaker.

One aspect of the present invention is the speech processing apparatus described above, further comprising a conversion rule learning unit that performs: a process of acquiring the first conversion rule based on first speech data, which is training calm speech data of the conversion target speaker, and second speech data, which is training calm speech data of the reference speaker with the same utterance content as the first speech data; and a process of acquiring the second conversion rule based on the first speech data and third speech data, which is training emotional speech data of the reference speaker with the same utterance content as the first speech data.
According to this aspect, the speech processing apparatus uses calm speech data of the conversion target speaker, together with calm speech data and emotional speech data of the reference speaker having the same utterance content, to learn the first conversion rule for converting acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's calm speech, and the second conversion rule for converting acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's emotional speech.
As a result, even without emotional speech of the conversion target speaker, the speech processing apparatus can obtain the conversion rules at reduced cost from a small amount of training data consisting of the conversion target speaker's calm speech and the reference speaker's calm and emotional speech.

One aspect of the present invention is the speech processing apparatus described above, wherein the acoustic feature is a feature relating to the frequency spectrum.
According to this aspect, the speech processing apparatus uses, as the acoustic feature, a feature relating to the frequency spectrum obtained from the speech waveform.
As a result, the speech processing apparatus can convert the voice quality of an arbitrary utterance of the conversion target speaker and thereby convert the utterance into emotional speech of the conversion target speaker.

One aspect of the present invention is a program for causing a computer to function as a speech processing apparatus comprising: speech analysis means for acoustically analyzing speech data of a conversion target speaker to acquire an acoustic feature for each frame; first conversion means for converting the acoustic feature of each frame acquired by the speech analysis means, using a first conversion rule for converting an acoustic feature of calm speech of the conversion target speaker into an acoustic feature of calm speech of a reference speaker; second conversion means for converting the acoustic feature of each frame acquired by the speech analysis means, using a second conversion rule for converting an acoustic feature of calm speech of the conversion target speaker into an acoustic feature of emotional speech of the reference speaker; difference acquisition means for calculating, for each frame, the difference of the acoustic feature obtained by the second conversion means relative to the acoustic feature obtained by the first conversion means; and processing means for adding, for each frame, the difference calculated by the difference acquisition means to the acoustic feature acquired by the speech analysis means.

According to the present invention, calm speech of an arbitrary utterance by an arbitrary speaker for whom no emotional speech has been prepared in advance can be processed into emotional speech at reduced cost.

FIG. 1 is a functional block diagram of a speech processing apparatus according to an embodiment of the present invention.
FIG. 2 is a process flow showing the conversion rule learning process of the speech processing apparatus according to the embodiment.
FIG. 3 is a diagram for explaining the conversion rule learning process of the speech processing apparatus according to the embodiment.
FIG. 4 is a process flow showing the speech processing of the speech processing apparatus according to the embodiment.
FIG. 5 is a diagram for explaining the conversion into acoustic features of the reference speaker in the speech processing of the speech processing apparatus according to the embodiment.
FIG. 6 is a diagram for explaining the acquisition of the difference between acoustic features in the speech processing of the speech processing apparatus according to the embodiment.
FIG. 7 is a diagram for explaining the processing of the acoustic features of the conversion target speech of the conversion target speaker in the speech processing of the speech processing apparatus according to the embodiment.

Embodiments of the present invention are described in detail below with reference to the drawings.
FIG. 1 is a functional block diagram showing the configuration of a speech processing apparatus 1 according to an embodiment of the present invention; only the functional blocks relevant to this embodiment are shown. The speech processing apparatus 1 is realized by one or more computer devices. When the speech processing apparatus 1 is realized by multiple computer devices, any functional unit may be assigned to any computer device, and a single functional unit may also be realized by multiple computer devices. As shown in the figure, the speech processing apparatus 1 comprises a training speech analysis unit 11, a conversion rule learning unit 12, a conversion rule storage unit 13, a speech analysis unit 14, a spectrum conversion unit 15, and a speech synthesis unit 16.

The training speech analysis unit 11 analyzes the speech waveform represented by the training speech data and acquires an acoustic feature for each frame, using a predetermined frame shift and frame length. A feature relating to the frequency spectrum can be used as the acoustic feature. In this embodiment, a 50-dimensional spectral parameter (for example, mel-cepstrum) obtained from the frequency spectrum of the speech waveform is used as the acoustic feature. Any conventional technique can be used to obtain the frequency spectrum from the speech waveform and to obtain features relating to the frequency spectrum. The training speech data consists of training calm speech data of the conversion target speaker (first speech data), and training calm speech data (second speech data) and training emotional speech data (third speech data) of the reference speaker, all with the same utterance content. The conversion target speaker is the speaker whose calm speech data of arbitrary utterances is to be converted into emotional speech data. The reference speaker is a speaker, different from the conversion target speaker, who provides speech data for training. Training calm speech data is speech data of calm speech used for training, and training emotional speech data is speech data of emotional speech used for training. Calm speech is speech that carries no emotion, and emotional speech is speech that carries emotion. The reference speaker's training emotional speech data is speech carrying the emotion that is to be added to the calm speech of arbitrary utterances of the conversion target speaker.
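The frame-wise analysis mentioned at the start of this paragraph can be illustrated as follows. This is a minimal sketch under stated assumptions: the function name `split_into_frames`, the sample-based units, and the toy values are introduced here for illustration. The source only says that the frame shift and frame length are predetermined, and it leaves the extraction of spectral parameters (for example, mel-cepstrum) to conventional techniques, which this sketch does not attempt.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Cut a waveform (a sequence of samples) into overlapping frames.

    frame_length and frame_shift are given in samples. The source does not
    state concrete values, so any values passed here are assumptions.
    """
    if frame_length <= 0 or frame_shift <= 0:
        raise ValueError("frame_length and frame_shift must be positive")
    last_start = len(samples) - frame_length
    # Trailing samples that do not fill a whole frame are dropped.
    return [list(samples[start:start + frame_length])
            for start in range(0, last_start + 1, frame_shift)]


# Toy waveform of 10 samples, frame length 4, frame shift 2.
frames = split_into_frames(list(range(10)), frame_length=4, frame_shift=2)
```

Each resulting frame would then be passed to whatever spectral analysis is used to produce the per-frame acoustic feature.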

The conversion rule learning unit 12 acquires the first conversion rule based on the acoustic features of the conversion target speaker's training calm speech data and the acoustic features of the reference speaker's training calm speech data, both acquired by the training speech analysis unit 11. The first conversion rule is a rule for converting acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's calm speech. The conversion rule learning unit 12 also acquires the second conversion rule based on the acoustic features of the conversion target speaker's training calm speech data and the acoustic features of the reference speaker's training emotional speech data, both acquired by the training speech analysis unit 11. The second conversion rule is a rule for converting acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's emotional speech.
The conversion rule storage unit 13 stores the first conversion rule and the second conversion rule acquired by the conversion rule learning unit 12.

The speech analysis unit 14 acoustically analyzes the speech waveform represented by the conversion target speech data of the conversion target speaker and acquires an acoustic feature for each frame. The conversion target speech data is speech data of calm speech of an arbitrary utterance by the conversion target speaker.
The spectrum conversion unit 15 converts the spectrum of calm speech of an arbitrary utterance by the conversion target speaker into the spectrum of emotional speech. The spectrum conversion unit 15 comprises a first conversion unit 151, a second conversion unit 152, a difference acquisition unit 153, and a processing unit 154.
The first conversion unit 151 converts the acoustic feature of each frame, obtained from the conversion target speech data by the speech analysis unit 14, using the first conversion rule stored in the conversion rule storage unit 13.
The second conversion unit 152 converts the acoustic feature of each frame, obtained from the conversion target speech data by the speech analysis unit 14, using the second conversion rule stored in the conversion rule storage unit 13.
The difference acquisition unit 153 calculates, for each frame, the difference of the acoustic feature obtained by the second conversion unit 152 relative to the acoustic feature obtained by the first conversion unit 151.
The processing unit 154 adds, for each frame, the difference calculated by the difference acquisition unit 153 to the acoustic feature acquired by the speech analysis unit 14.
The speech synthesis unit 16 synthesizes speech data from the frame-wise acoustic features produced by the processing unit 154 and outputs it.
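The data flow through units 151 to 154 can be sketched as below. This is an illustrative sketch, not the apparatus's implementation: `add_emotion`, `toy_first`, and `toy_second` are hypothetical names, and the two toy functions merely stand in for the stored first and second conversion rules so that the arithmetic is easy to follow.

```python
def add_emotion(frames, first_convert, second_convert):
    """Apply the per-frame processing of units 151 to 154.

    frames: list of acoustic feature vectors (lists of numbers), one per frame.
    first_convert, second_convert: callables standing in for the first and
    second conversion rules (hypothetical placeholders).
    """
    processed = []
    for x in frames:
        ref_calm = first_convert(x)        # first conversion unit 151
        ref_emotional = second_convert(x)  # second conversion unit 152
        # Difference acquisition unit 153: emotional minus calm, per dimension.
        diff = [e - c for e, c in zip(ref_emotional, ref_calm)]
        # Processing unit 154: add the difference to the original feature.
        processed.append([xi + di for xi, di in zip(x, diff)])
    return processed


def toy_first(v):
    """Toy stand-in for the first conversion rule: shift every value by 1."""
    return [a + 1 for a in v]


def toy_second(v):
    """Toy stand-in for the second conversion rule: shift every value by 3."""
    return [a + 3 for a in v]


result = add_emotion([[0, 10], [5, -5]], toy_first, toy_second)
```

With these stand-ins the difference is 2 in every dimension, so each input value is shifted by 2. The output of `add_emotion` corresponds to what the speech synthesis unit 16 would receive.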

Note that the training speech analysis unit 11, the conversion rule learning unit 12, and the conversion rule storage unit 13 may instead be provided in an external device. In that case, the first and second conversion rules are learned in advance and stored in the conversion rule storage unit 13, and the first conversion unit 151 and the second conversion unit 152 fetch the first and second conversion rules, respectively, from the conversion rule storage unit 13 whenever they are needed.

FIG. 2 is a process flow showing the conversion rule learning process of the speech processing apparatus 1.
First, training calm speech data of the conversion target speaker, and training calm speech data and training emotional speech data of the reference speaker, all recorded by reading the same sentences aloud, are input to the speech processing apparatus 1. It is desirable to use phoneme-balanced sentences, which contain a wide variety of phonemes and phoneme sequences in a balanced manner.

The training speech analysis unit 11 acoustically analyzes the speech waveform represented by the conversion target speaker's training calm speech data and acquires an acoustic feature for each frame (step S110). Similarly, the training speech analysis unit 11 acoustically analyzes the speech waveform represented by the reference speaker's training calm speech data to acquire an acoustic feature for each frame (step S120), and acoustically analyzes the speech waveform represented by the reference speaker's training emotional speech data to acquire an acoustic feature for each frame (step S130).

The conversion rule learning unit 12 associates, frame by frame, the acoustic features obtained from the conversion target speaker's training calm speech data with the acoustic features obtained from the reference speaker's training calm speech data, based on the similarity of their values (step S140). The conversion rule learning unit 12 then calculates the first conversion rule from the acoustic features of the conversion target speaker's training calm speech data and the acoustic features of the reference speaker's training calm speech data in the associated frames (step S150). The first conversion rule is a function for converting acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's calm speech. The function obtained as the first conversion rule is referred to as the "first conversion function". The conversion rule learning unit 12 writes the calculated first conversion function to the conversion rule storage unit 13.

Further, the conversion rule learning unit 12 associates, frame by frame, the acoustic features obtained from the conversion target speaker's training calm speech data with the acoustic features obtained from the reference speaker's training emotional speech data, based on the similarity of their values (step S160). The conversion rule learning unit 12 then calculates the second conversion rule from the acoustic features of the conversion target speaker's training calm speech data and the acoustic features of the reference speaker's training emotional speech data in the associated frames (step S170). The second conversion rule is a function for converting acoustic features of the conversion target speaker's calm speech into acoustic features of the reference speaker's emotional speech. The function obtained as the second conversion rule is referred to as the "second conversion function". The second conversion function is calculated in the same way as the first conversion function. The conversion rule learning unit 12 writes the calculated second conversion function to the conversion rule storage unit 13.

Note that the speech processing apparatus 1 may execute steps S110 to S130 in parallel or in any order. Likewise, it may execute the processing of steps S140 to S150 and the processing of steps S160 to S170 in parallel or in any order.

FIG. 3 is a diagram for explaining the conversion rule learning process of the speech processing apparatus 1 shown in FIG. 2.
In step S110 of FIG. 2, the training speech analysis unit 11 acquires frame-wise acoustic features A1, A2, A3, ... from the conversion target speaker's training calm speech data. Ai (i is an integer of 1 or more) is the acoustic feature of the i-th frame obtained from the speech waveform represented by the conversion target speaker's training calm speech data.
In step S120 of FIG. 2, the training speech analysis unit 11 acquires frame-wise acoustic features B1, B2, B3, ... from the reference speaker's training calm speech data. Bj (j is an integer of 1 or more) is the acoustic feature of the j-th frame obtained from the speech waveform represented by the reference speaker's training calm speech data.
In step S130 of FIG. 2, the training speech analysis unit 11 acquires frame-wise acoustic features C1, C2, C3, ... from the reference speaker's training emotional speech data. Ck (k is an integer of 1 or more) is the acoustic feature of the k-th frame obtained from the speech waveform represented by the reference speaker's training emotional speech data.

In step S140 of FIG. 2, the conversion rule learning unit 12 associates the acoustic features A1, A2, A3, ... with the acoustic features B1, B2, B3, ... by dynamic programming (DTW, dynamic time warping) or a similar method, using a distance measure over the 50-dimensional spectral parameters.
In step S150 of FIG. 2, the conversion rule learning unit 12 calculates the first conversion function from the pairs of associated acoustic features Ai and Bj. For example, a conversion function calculated by the technique described in Reference 1 below can be used as the first conversion function. With this technique, a probabilistic model that represents the joint probability density of one speaker's acoustic features and another speaker's acoustic features as a GMM (Gaussian mixture model) is obtained as the conversion function.
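The frame association of step S140 can be illustrated with a textbook dynamic time warping routine. This is a sketch under assumptions: the source only specifies a distance measure over the spectral parameters, so the Euclidean distance, the step pattern (diagonal, vertical, horizontal), and the names `euclidean` and `dtw_align` are choices made here for illustration. One-dimensional toy features are used instead of 50-dimensional ones.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors (assumed distance measure)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def dtw_align(a_frames, b_frames):
    """Return the DTW path as a list of 0-based (i, j) frame index pairs."""
    n, m = len(a_frames), len(b_frames)
    inf = float("inf")
    # cost[i][j]: minimal accumulated distance aligning the first i frames
    # of a_frames with the first j frames of b_frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(a_frames[i - 1], b_frames[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])
    # Backtrack from the end to recover which frames were paired.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return path


# Toy example: the second sequence lingers on its first value for two frames.
A = [[0.0], [1.0], [2.0]]
B = [[0.0], [0.0], [1.0], [2.0]]
pairs = dtw_align(A, B)
```

The resulting index pairs play the role of the associated (Ai, Bj) pairs from which the conversion function is then estimated. The same routine would apply to the (Ai, Ck) association of step S160.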

(Reference 1) Tomoki Toda and 2 others, "Voice Conversion Based on Maximum-Likelihood Estimation of Spectral Parameter Trajectory", IEEE Trans. ASLP, Vol. 15, No. 8, pp. 2222-2235, 2007.
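For orientation, a joint-density GMM of the kind mentioned above can be written as follows. The notation is introduced here and is not taken from the source: $x$ is a source-speaker feature, $y$ the paired target feature, and $w_m$, $\mu_m$, $\Sigma_m$ are the weight, mean, and covariance of mixture component $m$.

```latex
p(x, y) = \sum_{m=1}^{M} w_m \,
  \mathcal{N}\!\left(
    \begin{bmatrix} x \\ y \end{bmatrix};
    \begin{bmatrix} \mu_m^{x} \\ \mu_m^{y} \end{bmatrix},
    \begin{bmatrix} \Sigma_m^{xx} & \Sigma_m^{xy} \\ \Sigma_m^{yx} & \Sigma_m^{yy} \end{bmatrix}
  \right)
```

One common way to turn such a model into a frame-wise conversion function is the conditional expectation of $y$ given $x$:

```latex
F(x) = \sum_{m=1}^{M} P(m \mid x)
  \left[ \mu_m^{y} + \Sigma_m^{yx} \left( \Sigma_m^{xx} \right)^{-1} \left( x - \mu_m^{x} \right) \right],
\qquad
P(m \mid x) = \frac{ w_m \, \mathcal{N}\!\left( x; \mu_m^{x}, \Sigma_m^{xx} \right) }
                   { \sum_{n=1}^{M} w_n \, \mathcal{N}\!\left( x; \mu_n^{x}, \Sigma_n^{xx} \right) }
```

This frame-wise form is shown only as an illustration. Reference 1 builds on the joint-density GMM with maximum-likelihood estimation of the whole parameter trajectory, and the source does not state which variant the embodiment uses.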

In step S160 of FIG. 2, the conversion rule learning unit 12 associates the acoustic features A1, A2, A3, ... with the acoustic features C1, C2, C3, ... by dynamic programming (DTW) or a similar method, in the same way as in step S140. In step S170 of FIG. 2, the conversion rule learning unit 12 calculates the second conversion rule from the pairs of associated acoustic features Ai and Ck, in the same way as in step S150.

FIG. 4 is a processing flow showing the speech processing performed by the speech processing apparatus 1.
Conversion target speech data, which is calm speech data of an arbitrary utterance by the conversion target speaker, is input to the speech processing apparatus 1. The speech analysis unit 14 acoustically analyzes the speech waveform represented by the conversion target speech data and acquires acoustic feature quantities frame by frame (step S210).

The first conversion unit 151 of the spectrum conversion unit 15 converts the acoustic feature quantity of each frame obtained from the conversion target speech data using the first conversion function stored in the conversion rule storage unit 13 (step S220). In this way, using existing voice quality conversion based on a probability model, the spectral parameters representing the acoustic feature quantity of each frame obtained from the conversion target speech data are converted into spectral parameters representing the acoustic feature quantity of the reference speaker's calm speech.

The second conversion unit 152 converts the acoustic feature quantity of each frame obtained from the conversion target speech data using the second conversion function stored in the conversion rule storage unit 13 (step S230). In this way, using existing voice quality conversion based on a probability model, the spectral parameters representing the acoustic feature quantity of each frame obtained from the conversion target speech data are converted into spectral parameters representing the acoustic feature quantity of the reference speaker's emotional speech.

The difference acquisition unit 153 normalizes the acoustic feature quantities of the reference speaker's calm speech obtained through conversion by the first conversion unit 151 and the acoustic feature quantities of the reference speaker's emotional speech obtained through conversion by the second conversion unit 152, for example by Cepstrum Mean Normalization (step S240). For each frame, the difference acquisition unit 153 calculates the difference of the normalized acoustic feature quantity of the reference speaker's emotional speech with respect to the normalized acoustic feature quantity of the reference speaker's calm speech (step S250). For each frame, the processing unit 154 processes the spectral parameters represented by the acoustic feature quantity acquired by the speech analysis unit 14 by adding the difference calculated by the difference acquisition unit 153 (step S260). The speech synthesis unit 16 synthesizes speech data based on the frame-by-frame acoustic feature quantities obtained through processing by the processing unit 154, and outputs the speech data (step S270).

Note that the speech processing apparatus 1 may execute the processes of steps S220 and S230 in parallel or in any order.

FIG. 5 is a diagram for explaining the conversion into the reference speaker's acoustic feature quantities in the speech processing of the speech processing apparatus 1. The figure shows the processes of steps S210 to S230 of FIG. 4.
In step S210 of FIG. 4, the speech analysis unit 14 acquires frame-by-frame acoustic feature quantities D1, D2, D3, ... from the conversion target speech data. Di (where i is an integer of 1 or more) is the acoustic feature quantity of the i-th frame obtained from the speech waveform represented by the conversion target speech data.
In step S220 of FIG. 4, the first conversion unit 151 converts each of the acoustic feature quantities D1, D2, D3, ... using the first conversion function and obtains the acoustic feature quantities E1, E2, E3, ... of the reference speaker's calm speech. The acoustic feature quantity Ei is obtained by converting the acoustic feature quantity Di.
In step S230 of FIG. 4, the second conversion unit 152 converts each of the acoustic feature quantities D1, D2, D3, ... using the second conversion function and obtains the acoustic feature quantities F1, F2, F3, ... of the reference speaker's emotional speech. The acoustic feature quantity Fi is obtained by converting the acoustic feature quantity Di.

FIG. 6 is a diagram for explaining the acquisition of the difference between acoustic feature quantities in the speech processing of the speech processing apparatus 1. The figure shows the processes of steps S240 to S250 of FIG. 4.
In step S240 of FIG. 4, the difference acquisition unit 153 normalizes the acoustic feature quantities E1, E2, E3, ... of the reference speaker's calm speech, obtained by converting the acoustic feature quantities of the conversion target speech data using the first conversion function, and obtains the acoustic feature quantities E1', E2', E3', .... The difference acquisition unit 153 also normalizes the acoustic feature quantities F1, F2, F3, ... of the reference speaker's emotional speech, obtained by converting the acoustic feature quantities of the conversion target speech data using the second conversion function, and obtains the acoustic feature quantities F1', F2', F3', .... In step S250 of FIG. 4, the difference acquisition unit 153 calculates the difference Gi of the acoustic feature quantity Fi' of the reference speaker's emotional speech with respect to the acoustic feature quantity Ei' of the reference speaker's calm speech. That is, the difference acquisition unit 153 calculates difference Gi = acoustic feature quantity Fi' - acoustic feature quantity Ei'.
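Written as equations, steps S240 and S250 amount to the following. This assumes that Cepstrum Mean Normalization subtracts the mean taken over all N frames of the utterance; the averaging span N is an assumption, since the text names CMN only as one example of normalization.

```latex
\begin{align}
E_i' &= E_i - \frac{1}{N}\sum_{n=1}^{N} E_n, &
F_i' &= F_i - \frac{1}{N}\sum_{n=1}^{N} F_n, \\
G_i &= F_i' - E_i'.
\end{align}
```

Under this assumption, the difference Gi keeps the frame-by-frame deviation between the emotional and calm conversions while discarding their constant offset.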

FIG. 7 is a diagram for explaining the processing of the acoustic feature quantities of the conversion target speech of the conversion target speaker in the speech processing of the speech processing apparatus 1. The figure shows the process of step S260 of FIG. 4.
In step S260 of FIG. 4, the processing unit 154 adds the difference Gi calculated by the difference acquisition unit 153 to the acoustic feature quantity Di of the conversion target speech data, thereby processing it into the acoustic feature quantity Hi of the conversion target speaker's emotional speech. That is, the processing unit 154 calculates acoustic feature quantity Hi = acoustic feature quantity Di + difference Gi. In step S270 of FIG. 4, the speech synthesis unit 16 synthesizes speech data based on the acoustic feature quantities H1, H2, H3, ... and outputs the speech data.
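Steps S220 to S260 can be put together in a short sketch. It assumes that the two conversion functions are supplied as plain callables acting on one frame at a time and that normalization is a mean subtraction over the utterance; the names `cmn`, `add_emotion`, `to_ref_calm`, and `to_ref_emotion` are hypothetical and not taken from the source. Acoustic analysis (S210) and synthesis (S270) are outside the scope of the sketch.

```python
import numpy as np


def cmn(frames):
    """Subtract the per-dimension mean over all frames (assumed CMN form)."""
    return frames - frames.mean(axis=0, keepdims=True)


def add_emotion(d, to_ref_calm, to_ref_emotion):
    """Turn calm features D (N, dim) into emotional features H (N, dim).

    to_ref_calm / to_ref_emotion stand in for the first and second
    conversion functions; each maps one frame to one frame.
    """
    e = np.stack([to_ref_calm(x) for x in d])      # S220: E_i from D_i
    f = np.stack([to_ref_emotion(x) for x in d])   # S230: F_i from D_i
    g = cmn(f) - cmn(e)                            # S240-S250: G_i = F_i' - E_i'
    return d + g                                   # S260: H_i = D_i + G_i


# Toy conversion functions, used only to exercise the data flow.
calm = lambda x: x + 1.0
emotion = lambda x: 2.0 * x + 5.0
frames = np.array([[0.0, 0.0], [2.0, 2.0]])
converted = add_emotion(frames, calm, emotion)
```

Because E_i, F_i, and G_i are all derived from the same frame D_i, the addition needs no further time alignment.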

According to the embodiment described above, the speech processing apparatus 1 only needs, for prior learning, parallel data consisting of several tens of sentences of the conversion target speaker's calm speech and of the reference speaker's calm speech and emotional speech. Therefore, compared with conventional techniques that use a speech database, such as HMM speech synthesis, less data is required for prior learning and the cost of learning can be reduced. Further, even when there are a plurality of conversion target speakers, only several tens of sentences of calm speech data are needed for each conversion target speaker, so advance preparation is easy. In addition, since learning does not require emotional speech of the conversion target speaker, speech data synthesized from text data can be used as the speech data of the conversion target speaker.

Further, according to the embodiment described above, the speech processing apparatus 1 uses the conversion rules obtained in prior learning to process the spectrum of calm speech of an arbitrary utterance by the conversion target speaker into the spectrum of the reference speaker's calm speech and the spectrum of the reference speaker's emotional speech, and calculates the difference between them for each frame. For each frame, the speech processing apparatus 1 adds the calculated difference to the spectrum of the calm speech of the arbitrary utterance by the conversion target speaker to obtain the spectrum of the conversion target speaker's emotional speech. The frames of the spectrum obtained from the calm speech of the arbitrary utterance by the conversion target speaker and the frames of the difference to be added correspond one-to-one in time order, so no processing such as association between frames is required and the processing can be performed simply. In this way, the speech processing apparatus 1 imparts the spectral characteristics of the reference speaker's emotional speech to the spectrum of the calm speech of an arbitrary utterance by the conversion target speaker, and can convert the voice quality of the conversion target speaker's calm speech into the voice quality of an emotional expression.

Note that the speech processing apparatus 1 described above contains a computer system. The operation of the speech processing apparatus 1 is stored in the form of a program on a computer-readable recording medium, and the above processing is performed when the computer system reads and executes this program. The computer system referred to here includes a CPU, various memories, an OS, and hardware such as peripheral devices.

Further, the "computer system" also includes a homepage providing environment (or display environment) when a WWW system is used.
The "computer-readable recording medium" refers to a portable medium such as a flexible disk, a magneto-optical disk, a ROM, or a CD-ROM, or a storage device such as a hard disk built into a computer system. Furthermore, the "computer-readable recording medium" also includes a medium that holds the program dynamically for a short time, such as a communication line used when the program is transmitted over a network such as the Internet or over a communication line such as a telephone line, and a medium that holds the program for a certain period of time, such as volatile memory inside a computer system serving as the server or client in that case. The program may realize only part of the functions described above, or may realize the functions described above in combination with a program already recorded in the computer system.

DESCRIPTION OF SYMBOLS
1 Speech processing apparatus
11 Learning speech analysis unit
12 Conversion rule learning unit
13 Conversion rule storage unit
14 Speech analysis unit
15 Spectrum conversion unit
151 First conversion unit
152 Second conversion unit
153 Difference acquisition unit
154 Processing unit
16 Speech synthesis unit

Claims

A speech processing apparatus comprising:
a speech analysis unit that acoustically analyzes speech data of calm speech of a conversion target speaker and acquires an acoustic feature quantity for each frame;
a first conversion unit that converts the acoustic feature quantity of each frame acquired by the speech analysis unit, using a first conversion rule for converting an acoustic feature quantity of calm speech of the conversion target speaker into an acoustic feature quantity of calm speech of a reference speaker;
a second conversion unit that converts the acoustic feature quantity of each frame acquired by the speech analysis unit, using a second conversion rule for converting an acoustic feature quantity of calm speech of the conversion target speaker into an acoustic feature quantity of emotional speech of the reference speaker;
a difference acquisition unit that calculates, for each frame, a difference of the acoustic feature quantity obtained through conversion by the second conversion unit with respect to the acoustic feature quantity obtained through conversion by the first conversion unit; and
a processing unit that adds, for each frame, the difference calculated by the difference acquisition unit to the acoustic feature quantity acquired by the speech analysis unit.

The speech processing apparatus according to claim 1, wherein the difference acquisition unit normalizes the acoustic feature quantity obtained through conversion by the first conversion unit and the acoustic feature quantity obtained through conversion by the second conversion unit, and then calculates the difference for each frame.

The speech processing apparatus according to claim 1 or claim 2, further comprising a conversion rule learning unit that performs a process of acquiring the first conversion rule based on first speech data, which is calm speech data for learning of the conversion target speaker, and second speech data, which is calm speech data for learning of the reference speaker having the same utterance content as the first speech data, and a process of acquiring the second conversion rule based on the first speech data and third speech data, which is emotional speech data for learning of the reference speaker having the same utterance content as the first speech data.

The speech processing apparatus according to any one of claims 1 to 3, wherein the acoustic feature quantity is a feature quantity relating to a frequency spectrum.

A program for causing a computer to function as a speech processing apparatus comprising:
speech analysis means that acoustically analyzes speech data of a conversion target speaker and acquires an acoustic feature quantity for each frame;
first conversion means that converts the acoustic feature quantity of each frame acquired by the speech analysis means, using a first conversion rule for converting an acoustic feature quantity of calm speech of the conversion target speaker into an acoustic feature quantity of calm speech of a reference speaker;
second conversion means that converts the acoustic feature quantity of each frame acquired by the speech analysis means, using a second conversion rule for converting an acoustic feature quantity of calm speech of the conversion target speaker into an acoustic feature quantity of emotional speech of the reference speaker;
difference acquisition means that calculates, for each frame, a difference of the acoustic feature quantity obtained through conversion by the second conversion means with respect to the acoustic feature quantity obtained through conversion by the first conversion means; and
processing means that adds, for each frame, the difference calculated by the difference acquisition means to the acoustic feature quantity acquired by the speech analysis means.