JP2001184099A

JP2001184099A - Device and method for voice conversion

Info

Publication number: JP2001184099A
Application number: JP36527199A
Authority: JP
Inventors: Takahiro Kawashima; 隆宏川嶋; Shiimentsu Marc; シーメンツマーク; Bonada Jordi; ボナダジョルディ
Original assignee: Universitat Pompeu Fabra UPF; Yamaha Corp
Current assignee: Universitat Pompeu Fabra UPF; Yamaha Corp
Priority date: 1999-12-22
Filing date: 1999-12-22
Publication date: 2001-07-06
Anticipated expiration: 2019-12-22
Also published as: JP4509273B2

Abstract

PROBLEM TO BE SOLVED: To make the voice quality after pitch shifting natural. SOLUTION: This device has an analysis part 13 which analyzes the pitch of an input voice from the input signal representing that voice; a pitch-after-conversion calculation part 15 which calculates the pitch after conversion from the analyzed pitch and a given pitch shift quantity; a mean gain calculation part 14 which calculates the mean gain of the input signal; a feature information database 12 in which feature information for generating a spectrum shape is stored for each phoneme; a phoneme recognition part 11 which recognizes a phoneme from the input signal; a spectrum shape generation part 16 which obtains from the feature information database 12 the feature information corresponding to the phoneme recognized by the phoneme recognition part 11 and generates a spectrum shape from that feature information and the pitch after conversion; and a composition part 20 which outputs a signal based on the spectrum shape and the mean gain.

Description

DETAILED DESCRIPTION OF THE INVENTION

[0001]

[Technical Field of the Invention] The present invention relates to a voice conversion device and a voice conversion method that obtain an output voice in which the pitch of an input voice has been shifted, and in particular to a voice conversion device and a voice conversion method suitable for use in a karaoke apparatus.

[0002]

[Prior Art] Various voice conversion devices that shift the pitch of an input voice and output the result have been developed. For example, some karaoke apparatuses convert the pitch of a singer's voice so that a male voice is output as a female voice, or vice versa (see, for example, Published Japanese Translation of PCT Application No. H8-508581).

[0003] The pitch-shifting methods used by this kind of voice conversion device fall into time-domain methods and frequency-domain methods. The former shift the pitch by decimating samples from, or applying a prescribed interpolation to, the sampled input signal representing the singer's voice. The latter shift the pitch by shifting, in the frequency domain, the sine wave components (the harmonic series) obtained from the input signal.
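As an illustration of the two families of methods described in [0003], the following Python sketch shifts a sampled tone by resampling (time domain) and shifts a list of partials by scaling their frequencies (frequency domain). The function names, the use of linear interpolation, the 8 kHz sampling rate, and the semitone-based ratio are assumptions made for the example and are not taken from the publication.

```python
import math

def shift_time_domain(samples, ratio):
    """Resample by linear interpolation: reading the input `ratio` times
    faster raises the pitch by `ratio` and shortens the signal accordingly."""
    n_out = int(len(samples) / ratio)
    out = []
    for n in range(n_out):
        pos = n * ratio                      # fractional read position
        i = int(pos)
        frac = pos - i
        nxt = samples[min(i + 1, len(samples) - 1)]
        out.append((1.0 - frac) * samples[i] + frac * nxt)
    return out

def shift_frequency_domain(partials, ratio):
    """Scale the frequency of every (frequency_hz, amplitude) partial."""
    return [(freq * ratio, amp) for freq, amp in partials]

ratio = 2 ** (3 / 12)  # three semitones up (the unit is an assumption)
tone = [math.sin(2 * math.pi * 220.0 * n / 8000.0) for n in range(800)]
shifted = shift_time_domain(tone, ratio)
harmonics = shift_frequency_domain([(220.0, 1.0), (440.0, 0.5)], 2.0)
print(len(tone), len(shifted), harmonics)
```

Both functions move every component by the same factor, so formant positions move together with the pitch. This is one commonly cited reason such shifts can sound unnatural, although the publication itself does not state the cause.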

[0004]

[Problem to Be Solved by the Invention] However, whichever of the conventional methods is adopted, the voice after pitch shifting unavoidably has an unnatural quality. The present invention has been made in view of these circumstances, and its object is to provide a voice conversion device and a voice conversion method that can give the pitch-shifted voice a natural quality.

[0005]

[Means for Solving the Problem] To solve the above problem, the voice conversion device according to claim 1 is a voice conversion device that obtains, from an input signal representing an input voice, an output signal representing a voice whose pitch differs from that of the input voice, and comprises: pitch analysis means for analyzing the pitch of the input voice based on the input signal; average gain analysis means for analyzing the average gain of the input signal; converted pitch calculation means for calculating a converted pitch based on the pitch analyzed by the pitch analysis means and a given pitch shift amount; a feature information database storing feature information for generating a spectrum shape in association with phonemes; phoneme recognition means for recognizing a phoneme from the input signal; and output means for acquiring from the feature information database the feature information corresponding to the phoneme recognized by the phoneme recognition means, generating a first spectrum shape based on that feature information, and outputting, as the output signal, a signal corresponding to the average gain analyzed by the average gain analysis means and the first spectrum shape.

[0006] To solve the above problem, the voice conversion device according to claim 2 is a voice conversion device that obtains, from an input signal representing an input voice, an output signal representing a voice whose pitch differs from that of the input voice, and comprises: pitch analysis means for analyzing the pitch of the input voice based on the input signal; average gain analysis means for analyzing the average gain of the input signal; pitch shift amount calculation means for calculating a pitch shift amount based on the pitch analyzed by the pitch analysis means and a given converted pitch; a feature information database storing feature information for generating a spectrum shape in association with phonemes; phoneme recognition means for recognizing a phoneme from the input signal; and output means for acquiring from the feature information database the feature information corresponding to the phoneme recognized by the phoneme recognition means, generating a first spectrum shape based on that feature information and the converted pitch, and outputting, as the output signal, a signal corresponding to the average gain and the first spectrum shape. With each of the above configurations, the output signal is based on a spectrum shape corresponding to the phoneme of the input voice and on the average gain of the input signal.

[0007] The voice conversion device according to claim 1 or 2 may further comprise frequency analysis means for performing frequency analysis on the input signal frame by frame, and the output means may generate a second spectrum shape based on the first spectrum shape, the frequency analysis result from the frequency analysis means, and the pitch shift amount, and output a signal corresponding to the second spectrum shape as the output signal (claim 3).

[0008] In the voice conversion device according to claim 3, the frequency analysis means may perform frequency analysis on the input signal frame by frame to separate it into a sine wave component and a residual component, and the output means may generate a third spectrum shape based on the first spectrum shape, the sine wave component, and the pitch shift amount, and output, as the output signal, a signal corresponding to the third spectrum shape and the residual component (claim 4). With this configuration, the output signal corresponds to the third spectrum shape (after pitch shifting), which is based on the spectrum shape corresponding to the phoneme and on the sine wave component of the input signal, and to the residual component of the input signal (before pitch shifting).

[0009] In the voice conversion device according to claim 3, the output means may generate the second spectrum shape by interpolating, in accordance with a given parameter, between the first spectrum shape and a spectrum shape obtained from the frequency analysis result and the pitch shift amount (claim 5). With the configuration of claim 3 or 5, the output signal is based on the spectrum shape corresponding to the phoneme of the input voice and on the frequency analysis result of the input signal.

[0010] In the voice conversion device according to claim 1 or 2, the feature information database may store, for each of a plurality of parameter sets, feature information for generating a spectrum shape in association with phonemes, and the output means may acquire from the feature information database the feature information corresponding to a designated parameter set and to the phoneme recognized by the phoneme recognition means (claim 6).

[0011] The voice conversion device according to claim 1 or 2 may comprise spectrum tilt correction means for generating a fourth spectrum shape by correcting the tilt of the first spectrum shape in accordance with the average gain, and the output means may output a signal corresponding to the fourth spectrum shape as the output signal (claim 7). With this configuration, the output signal corresponds to the spectrum shape obtained by correcting, in accordance with the average gain of the input signal, the tilt of the spectrum shape corresponding to the phoneme of the input voice.

[0012] The voice conversion device according to claim 1 or 2 may comprise previous frame information storage means for storing the immediately preceding spectrum shape, and the output means may generate a fifth spectrum shape based on the first spectrum shape, the average gain, and the immediately preceding spectrum shape stored in the previous frame information storage means, output a signal corresponding to the fifth spectrum shape as the output signal, and store the fifth spectrum shape in the previous frame information storage means as the immediately preceding spectrum shape (claim 8). With this configuration, the output signal is based on the spectrum shape corresponding to the phoneme of the input voice, the pitch shift amount, and the immediately preceding spectrum shape.

[0013] To solve the above problem, the voice conversion method according to claim 9 is a voice conversion method that obtains, from an input signal representing an input voice, an output signal representing a voice with a converted pitch different from that of the input voice, and comprises the steps of: analyzing the average gain of the input signal and recognizing a phoneme from the input signal; acquiring feature information that is used for generating a spectrum shape and that corresponds to the phoneme recognized by the phoneme recognition means; generating a spectrum shape based on the feature information and the converted pitch; and outputting, as the output signal, a signal based on the average gain and the spectrum shape. With this method, the output signal is based on a spectrum shape corresponding to the phoneme of the input voice and on the average gain of the input signal.

[0014]

[Embodiments of the Invention] Voice conversion devices according to embodiments of the present invention are described below with reference to the drawings. A first embodiment, in which the pitch of the sine wave components contained in the output signal obtained by converting the input signal is specified relatively, and a second embodiment, in which it is specified absolutely, are described in that order.

[0015] [A-1. Configuration of the First Embodiment] FIG. 1 is a block diagram showing the overall configuration of the voice conversion device according to the first embodiment of the present invention. As the figure shows, the device obtains an output signal representing the voice after pitch shifting from an input signal representing the voice before pitch shifting.

[0016] In FIG. 1, reference numeral 11 denotes a phoneme recognition unit that performs phoneme recognition on the input signal. Phoneme recognition identifies the phoneme corresponding to the input signal, and the phoneme recognition unit 11 outputs information representing the identified phoneme. When more than one phoneme is identified, the phoneme recognition unit 11 outputs, for each of those phonemes, information representing the phoneme together with a correlation value (for example, a real value greater than 0 and less than 1) representing the correlation with that phoneme. The phoneme recognition unit 11 may use any method capable of recognizing phonemes: for example, a method using a bank of band-pass filters, a method using the FFT (fast Fourier transform), a method using a correlation function, a method using LPC (linear predictive coding) analysis, or the delta-cepstrum method.

[0017] Reference numeral 12 denotes a feature information database that stores feature information for each phoneme. The feature information database 12 has a feature information table TBL as shown in FIG. 2. It receives the information output from the phoneme recognition unit 11, extracts from the feature information table TBL the feature information corresponding to the phoneme represented by that information, and outputs the feature information together with the correlation value for that phoneme.

[0018] As shown in FIG. 2, each phoneme "a", "i", ... in the feature information table TBL is associated, pitch bank by pitch bank, with feature information of the default set. The feature information database 12 can therefore extract from the default set the feature information corresponding to the phoneme represented by the information from the phoneme recognition unit 11 and to the pitch bank containing the converted pitch, that is, the pitch after pitch shifting. A pitch bank denotes a specific pitch range, and each parameter set is provided with a plurality of pitch banks that together cover the entire pitch range. The feature information of the default set is preset and cannot be changed, but the feature information table TBL also has a male voice set, a female voice set, and individual sets as parameter sets whose feature information can be changed.

[0019] The male voice set and the female voice set are sets of feature information to be used when designation information indicating male or female voice is input from outside. When such designation information is input, the feature information database 12 extracts the feature information corresponding to the phoneme and pitch bank from the male voice set or the female voice set accordingly. The feature information of the male voice set and the female voice set is preset, but the user can change it freely.

[0020] The Taro Suzuki set in FIG. 2 is an example of an individual set: a set of feature information to be used when designation information indicating a personal name (for example, Taro Suzuki) is input. For example, when designation information indicating Taro Suzuki is input from outside, the feature information database 12 extracts the feature information corresponding to the phoneme and pitch bank from the Taro Suzuki set. Individual sets are not preset; the user can add them freely. The user can also freely change the feature information of an individual set.

[0021] When extracting feature information, the feature information database 12 gives priority to the individual sets, then to the male voice set or female voice set, then to the default set. If designation information indicating a personal name is input, the individual set corresponding to that information is the parameter set from which feature information is extracted. If no personal name is designated but designation information indicating male or female voice is input, the corresponding male voice set or female voice set is used. If neither kind of designation information is input, the default set is used.
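The lookup described in [0017] to [0021] can be sketched as follows. This is a minimal Python illustration, not the data layout of FIG. 2: the `PitchBank` class, the dictionary structure, the bank edges in hertz, the `formant_hz` values, and the fallback to a lower-priority set when a designated set is absent are all assumptions introduced for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PitchBank:
    low_hz: float   # inclusive lower edge (assumed representation)
    high_hz: float  # exclusive upper edge

def select_set(table, person=None, gender=None):
    """Priority from [0021]: individual set, then male/female set, then default."""
    if person is not None and person in table:
        return person
    if gender is not None and gender in table:
        return gender
    return "default"

def lookup_features(table, phoneme, converted_pitch_hz, person=None, gender=None):
    """Return (set name, feature info) for the bank containing the converted pitch."""
    set_name = select_set(table, person, gender)
    for bank, features in table[set_name][phoneme]:
        if bank.low_hz <= converted_pitch_hz < bank.high_hz:
            return set_name, features
    raise KeyError(f"no pitch bank covers {converted_pitch_hz} Hz")

# Hypothetical table: every number below is a placeholder, not a value from FIG. 2.
BANKS = [PitchBank(50.0, 200.0), PitchBank(200.0, 1000.0)]
TABLE = {
    "default": {"a": [(BANKS[0], {"formant_hz": 700.0}),
                      (BANKS[1], {"formant_hz": 750.0})]},
    "female": {"a": [(BANKS[0], {"formant_hz": 800.0}),
                     (BANKS[1], {"formant_hz": 850.0})]},
}

print(lookup_features(TABLE, "a", 300.0, gender="female"))
print(lookup_features(TABLE, "a", 120.0))
```

Keying the banks by the converted pitch rather than the input pitch follows [0018], which selects the pitch bank containing the pitch after shifting.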

[0022] The feature information that the feature information table TBL holds for each phoneme and parameter set contains parameters from which a spectrum shape (the first spectrum shape) can be generated such that the voice after pitch shifting does not sound unnatural. A "spectrum shape" represents the characteristics of a waveform and, in this embodiment, is defined by feature information containing the following parameters:
・Formant frequency
・Formant bandwidth
・Spectrum tilt
・Spectrum envelope

[0023] Reference numeral 13 denotes an analysis unit that performs frequency analysis on the input signal frame by frame. For each frame it separates the input signal into a sine wave component and a residual component and outputs both. The analysis unit 13 also analyzes the fundamental frequency (pitch) of the frame under analysis and outputs it as the frame pitch of the input signal. In this embodiment, a "frame" means a segment of the waveform signal delimited by a prescribed unit of time, and the output signal is generated by generating an output-signal frame for each input-signal frame.

[0024] The frequency analysis performed by the analysis unit 13 is SMS (Spectral Modeling Synthesis) analysis. The SMS analysis of this embodiment is described here with reference to FIG. 3. As the figure shows, the analysis unit 13 first multiplies the sampled input signal by a window function to cut out a frame, and then extracts the sine wave component and the residual component from the frequency spectrum obtained by applying a fast Fourier transform (FFT) to that frame. The sine wave component consists of the components at the fundamental frequency and at the frequencies that are multiples of the fundamental frequency (harmonics).
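The windowing, transform, and separation steps in [0024] can be sketched roughly as below. This is a deliberately crude Python illustration and not the SMS analysis of the publication: it assumes the fundamental frequency is already known, uses a Hann window, replaces the FFT with a direct DFT for self-containment, and treats the single bin nearest each harmonic as the sine wave component and everything else as the residual. All function names and parameters are hypothetical.

```python
import cmath
import math

def hann(n_points):
    """Periodic Hann window (the window type is an assumption)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * n / n_points) for n in range(n_points)]

def dft(frame):
    """Direct DFT of a real frame, bins 0..N/2 (stands in for the FFT)."""
    n_points = len(frame)
    return [sum(frame[n] * cmath.exp(-2j * math.pi * k * n / n_points)
                for n in range(n_points))
            for k in range(n_points // 2 + 1)]

def analyze_frame(samples, start, size, sample_rate, f0_hz, n_harmonics):
    """Cut out one windowed frame and split its spectrum into
    (frequency_hz, magnitude) harmonics and a residual spectrum."""
    window = hann(size)
    frame = [samples[start + n] * window[n] for n in range(size)]
    spectrum = dft(frame)
    bin_hz = sample_rate / size
    harmonic_bins = set()
    sinusoids = []
    for h in range(1, n_harmonics + 1):
        k = round(h * f0_hz / bin_hz)        # bin nearest the h-th harmonic
        if k >= len(spectrum):
            break
        harmonic_bins.add(k)
        sinusoids.append((k * bin_hz, abs(spectrum[k])))
    # Everything not attributed to a harmonic is kept as the residual.
    residual = [0j if k in harmonic_bins else spectrum[k]
                for k in range(len(spectrum))]
    return sinusoids, residual

signal = [math.sin(2 * math.pi * 250 * n / 8000) +
          0.5 * math.sin(2 * math.pi * 500 * n / 8000) for n in range(64)]
sinusoids, residual = analyze_frame(signal, 0, 64, 8000.0, 250.0, 2)
print(sinusoids)
```

Because the window spreads each harmonic over neighbouring bins, the residual in this sketch still contains leakage from the harmonics; a practical analysis would detect and subtract whole spectral peaks instead.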

[0025] Reference numeral 14 denotes an average gain calculation unit that calculates the average gain of the sine wave component output from the analysis unit 13 and outputs the result. Reference numeral 15 denotes a converted pitch calculation unit that obtains the pitch produced by shifting the frame pitch output from the analysis unit 13 by an externally supplied pitch shift amount, and outputs that pitch as the converted pitch.
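The two calculations in [0025] might look like the following Python sketch. The publication does not state how the average is taken or in what unit the pitch shift amount is expressed, so the arithmetic mean of linear magnitudes and the semitone unit used here are assumptions, as are the function names.

```python
def average_gain(sinusoids):
    """Mean magnitude of the (frequency_hz, magnitude) sine wave components.
    A linear arithmetic mean is an assumption; a dB average is equally plausible."""
    if not sinusoids:
        return 0.0
    return sum(mag for _, mag in sinusoids) / len(sinusoids)

def converted_pitch(frame_pitch_hz, shift_semitones):
    """Shift the frame pitch by a relative amount, assumed to be in semitones."""
    return frame_pitch_hz * 2.0 ** (shift_semitones / 12.0)

print(average_gain([(220.0, 1.0), (440.0, 0.5)]))
print(converted_pitch(220.0, 12))
```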

[0026] Reference numeral 16 denotes a spectrum shape generation unit, 17 a spectrum tilt correction unit, and 18 a previous frame information storage unit. These units work together: they take the spectrum shape based on the information output from the feature information database 12 and transform it according to the new pitch, the average gain of the sine wave component of the input signal, and the phoneme spectrum shape for the immediately preceding frame (the immediately preceding spectrum shape), thereby generating the phoneme spectrum shape for the current frame. In this embodiment, "phoneme spectrum shape" means the spectrum shape that is generated according to the phoneme and output from the spectrum tilt correction unit 17.

[0027] The phoneme spectrum shape generation process carried out by units 16, 17, and 18 is now described in more detail with reference to FIG. 4. The spectrum shape generation unit 16 first generates a spectrum shape from the feature information output from the feature information database 12 (step S1). If only one spectrum shape is generated (one phoneme), that spectrum shape becomes the spectrum shape to be processed. If several are generated (several phonemes), spectrum interpolation (inter-phoneme interpolation) is applied to them to produce a single spectrum shape, which becomes the spectrum shape to be processed (steps S2 and S3).
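Steps S2 and S3 can be illustrated with a short Python sketch. The publication does not say how the correlation values enter the inter-phoneme interpolation, so two assumptions are made here and should be read as such: the correlation values act as weights, and the interpolation is a pointwise linear blend of magnitudes sampled on a shared frequency grid (a stand-in for the anchor-point interpolation of FIG. 5). The shapes are assumed to have been generated already in step S1.

```python
def blend(shape_a, shape_b, x):
    """Stand-in interpolation: pointwise linear blend at position x in [0, 1]."""
    return [(1.0 - x) * a + x * b for a, b in zip(shape_a, shape_b)]

def target_shape(shapes_with_corr):
    """shapes_with_corr: list of (magnitudes, correlation) pairs, one per phoneme.
    S2: a single shape is used as is. S3: several shapes are folded into one,
    each weighted by its correlation value (assumed weighting)."""
    shape, weight = shapes_with_corr[0]
    for nxt, corr in shapes_with_corr[1:]:
        x = corr / (weight + corr)   # position of the new shape in the blend
        shape = blend(shape, nxt, x)
        weight += corr
    return shape

print(target_shape([([1.0, 0.0], 0.75), ([0.0, 1.0], 0.25)]))
```

Folding the shapes one at a time in this way yields the correlation-weighted mean of all of them, regardless of the order in which they are processed.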

[0028] Next, the spectrum shape generation unit 16 generates a spectrum shape in which the sine wave components of the spectrum shape to be processed are shifted to match the converted pitch calculated by the converted pitch calculation unit 15 (step S4). It then uses spectrum interpolation (inter-frame interpolation) to generate the intermediate, interpolated spectrum shape needed to change smoothly from the immediately preceding phoneme spectrum shape stored in the previous frame information storage unit 18 to the shifted spectrum shape (step S5).

[0029] Spectrum interpolation is described below with reference to FIGS. 5(a) and 5(b). First, as shown in FIG. 5(a), the two spectrum shapes on which the interpolation is based (hereinafter, the first spectrum shape SS11 and the second spectrum shape SS12) are each divided into a plurality of frequency regions Z1, Z2, .... The frequencies at the region boundaries (hereinafter, anchor points) are denoted RB_{1,1}, RB_{2,1}, ..., RB_{N,1} for the first spectrum shape SS11 and RB_{1,2}, RB_{2,2}, ..., RB_{M,2} for the second spectrum shape SS12.

[0030] Next, the processing shown in FIG. 5(b) is performed. In FIG. 5(b), the interpolation position x is a parameter indicating where the spectrum shape generated by the interpolation lies relative to the first spectrum shape SS11 and the second spectrum shape SS12, and takes a real value greater than 0 and less than 1. The spectrum shape generated by the interpolation coincides with the first spectrum shape SS11 itself when x = 0 and with the second spectrum shape SS12 itself when x = 1. The figure shows the example x = 0.35. The white circles (○) on the vertical axes in the figure indicate the individual frequency-magnitude pairs that make up each spectrum shape.

[0031] In FIG. 5(b), the vertical axis represents frequency, and the magnitude axis is assumed to rise perpendicular to the page. Let the anchor points corresponding to the target region Z_i of the first spectrum shape SS11 (x = 0) be RB_{i,1} and RB_{i+1,1}, let f_{i1} be the frequency of any one of the actual frequency-magnitude pairs belonging to region Z_i, and let S_1(f_{i1}) be its magnitude.

[0032] Likewise, let the anchor points corresponding to the target region Z_i of the second spectrum shape SS12 (x = 1) be RB_{i,2} and RB_{i+1,2}, let f_{i2} be the frequency of any one of the actual frequency-magnitude pairs belonging to region Z_i, and let S_2(f_{i2}) be its magnitude.

[0033] In this embodiment, the spectrum transition function f_{trans1}(x), corresponding to the actual pairs on the first spectrum shape SS11, and the spectrum transition function f_{trans2}(x), corresponding to the actual pairs on the second spectrum shape SS12, are the simplest possible linear functions. They are given by equations (1) and (2):

f_{trans1}(x) = m_1 x + b_1   ... (1)
f_{trans2}(x) = m_2 x + b_2   ... (2)

where
m_1 = RB_{i,2} - RB_{i,1}
b_1 = RB_{i,1}
m_2 = RB_{i+1,2} - RB_{i+1,1}
b_2 = RB_{i+1,2}
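Read from their slopes, the two transition functions trace the lower and upper anchor points of region Z_i as the interpolation position x moves from the first spectrum shape to the second. The Python sketch below evaluates them under one assumption that goes beyond the printed text: each function should return the first shape's anchor at x = 0 and the second shape's anchor at x = 1, which requires taking RB_{i+1,1} as the intercept of the second function (the printed b_2 = RB_{i+1,2} would not satisfy this endpoint condition). The function and argument names are hypothetical.

```python
def region_bounds(rb_i_1, rb_i_2, rb_next_1, rb_next_2, x):
    """Lower and upper boundary of region Z_i at interpolation position x.
    rb_i_1, rb_next_1: anchors RB_{i,1}, RB_{i+1,1} of the first shape.
    rb_i_2, rb_next_2: anchors RB_{i,2}, RB_{i+1,2} of the second shape."""
    m1 = rb_i_2 - rb_i_1
    b1 = rb_i_1
    m2 = rb_next_2 - rb_next_1
    b2 = rb_next_1          # assumed intercept, see the note above
    return m1 * x + b1, m2 * x + b2

for x in (0.0, 0.35, 1.0):
    print(x, region_bounds(100.0, 200.0, 500.0, 700.0, x))
```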

[0034] On this basis, the process of obtaining pairs on the interpolated spectrum shape from the pairs on the first spectrum shape SS11 is described first. The frequency f_{i1,2} on the second spectrum shape SS12 that corresponds to an actual pair of frequency f_{i1} and magnitude S_1(f_{i1}) on the first spectrum shape SS11 is given by equation (3):

[Equation 1: equation (3), not reproduced in this text]

where W_1 = RB_{i+1,1} - RB_{i,1} and W_2 = RB_{i+1,2} - RB_{i,2}.

[0035] Using the two actual pairs on the second spectrum shape SS12 that lie closest to this frequency f_{i1,2} on either side of it, S_2(f_{i1,2}) is given by equation (4). In that equation, the lower of the two frequencies carries the suffix "-" and the higher carries the suffix "+".

[Equation 2: equation (4), not reproduced in this text]

[0036] As is readily inferred from equations (3) and (4), the frequency f_{i1,x} and magnitude S_x(f_{i1,x}) on the interpolated spectrum shape that correspond to an actual frequency-magnitude pair on the first spectrum shape SS11 are given by equations (5) and (6):

[Equation 3: equation (5), not reproduced in this text]
S_x(f_{i1,x}) = S_1(f_{i1}) + {S_2(f_{i1,2}) - S_1(f_{i1})} x   ... (6)

Using equations (5) and (6), the pairs on the interpolated spectrum shape corresponding to all actual pairs on the first spectrum shape SS11 can be obtained.
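The procedure in [0034] to [0036] can be sketched in Python as follows. Only equation (6) is available in this text; the forms used for equations (3), (4), and (5) are assumptions chosen to be consistent with the surrounding definitions (W_1, W_2, the "-" and "+" neighbouring pairs, and the linear form of (6)) and may differ from the published equations. All function names are hypothetical.

```python
def map_frequency(f_i1, rb_i_1, rb_next_1, rb_i_2, rb_next_2):
    """Assumed form of (3): keep the relative position of f_i1 inside
    region Z_i when moving from the first shape to the second."""
    w1 = rb_next_1 - rb_i_1
    w2 = rb_next_2 - rb_i_2
    return (f_i1 - rb_i_1) * w2 / w1 + rb_i_2

def magnitude_at(shape, freq):
    """Assumed form of (4): linear interpolation between the two actual
    pairs that bracket freq. shape is a list of (frequency, magnitude)
    pairs with strictly increasing frequencies."""
    for (f_lo, s_lo), (f_hi, s_hi) in zip(shape, shape[1:]):
        if f_lo <= freq <= f_hi:
            return s_lo + (s_hi - s_lo) * (freq - f_lo) / (f_hi - f_lo)
    raise ValueError("frequency lies outside the spectrum shape")

def interpolate_pair(f_i1, s1, f_i1_2, s2, x):
    """Pair on the interpolated shape at position x."""
    f_x = f_i1 + (f_i1_2 - f_i1) * x   # assumed form of (5)
    s_x = s1 + (s2 - s1) * x           # equation (6) as printed
    return f_x, s_x

# Hypothetical numbers: region Z_i spans 100-300 Hz on SS11 and 200-600 Hz on SS12.
shape2 = [(200.0, 1.0), (400.0, 3.0), (600.0, 2.0)]
f_i1_2 = map_frequency(150.0, 100.0, 300.0, 200.0, 600.0)
s2 = magnitude_at(shape2, f_i1_2)
print(f_i1_2, s2, interpolate_pair(150.0, 4.0, f_i1_2, s2, 0.5))
```

With these forms, x = 0 returns the original pair on the first spectrum shape and x = 1 returns the corresponding point on the second, which matches the behaviour described for the interpolation position in [0030].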

【００３７】次に、第２スペクトルシェイプＳＳ１２上
の組に基づいて補間スペクトルシェイプ上の組を求める
処理について説明する。第２スペクトルシェイプＳＳ１
２上に実在する周波数ｆi2及びマグニチュードＳ₂（ｆ
_i2）の組に対応した第１スペクトルシェイプＳＳ１１上
の周波数ｆ_i2,1は下式（７）で表される。Next, the process of obtaining pairs on the interpolated spectrum shape from pairs on the second spectrum shape SS12 is described. The frequency f_{i2,1} on the first spectrum shape SS11 that corresponds to a pair of frequency f_{i2} and magnitude S₂(f_{i2}) actually existing on the second spectrum shape SS12 is given by equation (7) below.

【数４】ただし、Ｗ₁=ＲＢ_i+1,1-ＲＢ_i,1 Ｗ₂=ＲＢ_i+1,2-ＲＢ_i,2 である。(Equation 4) where W₁ = RB_{i+1,1} − RB_{i,1} and W₂ = RB_{i+1,2} − RB_{i,2}.

【００３８】この周波数ｆ_i2,1を挟むように最も近接し
て実在する第１スペクトルシェイプＳＳ１１上の２つの
組を用いると、Ｓ1(ｆ_i2,1)は下式（８）で表される。
ただし、下式では、上記２つの組の低い方の周波数に
“−”、高い方の周波数に“＋”のサフィックスを付し
ている。Using the two pairs that actually exist on the first spectrum shape SS11 and most closely bracket this frequency f_{i2,1}, S₁(f_{i2,1}) is given by equation (8) below. In that equation, the suffix “−” denotes the lower and “+” the higher of the two bracketing frequencies.

【数５】 (Equation 5)

【００３９】式（７），（８）から容易に推測されるよ
うに、第２スペクトルシェイプＳＳ１２上に実在する周
波数及びマグニチュードの組に対応する補間スペクトル
シェイプ上の周波数ｆ_i2,x及びマグニチュードＳ_x（ｆ
_i2,x）は下式（９），（１０）で表される。As can readily be inferred from equations (7) and (8), the frequency f_{i2,x} and magnitude S_x(f_{i2,x}) on the interpolated spectrum shape that correspond to a frequency-magnitude pair actually existing on the second spectrum shape SS12 are given by equations (9) and (10) below.

【数６】Ｓ_x(ｆ_i2,x)=Ｓ₂(ｆ_i2)+{Ｓ₁(ｆ_i2,1)-Ｓ₂(ｆ_i2)｝・(ｘ-１) …（１０）上式（９），（１０）を用いることにより、第２スペク
トルシェイプＳＳ１２上の実在する全ての組に対応した
補間スペクトルシェイプ上の組を求めることができる。(Equation 6) S_x(f_{i2,x}) = S₂(f_{i2}) + {S₁(f_{i2,1}) − S₂(f_{i2})}·(x − 1) … (10) Using equations (9) and (10), the pairs on the interpolated spectrum shape corresponding to all pairs actually existing on the second spectrum shape SS12 can be obtained.

【００４０】こうして得られた補間スペクトルシェイプ
上の全ての組を周波数順に並べることにより領域Ｚiに
対する補間スペクトルシェイプが得られる。本実施形態
では、全ての領域Ｚ1、Ｚ2、…について上述の処理を行
うことで、複数のスペクトルシェイプに対する補間スペ
クトルシェイプを得ている。なお、本実施形態における
補間位置ｘは、音素間補間においては特徴情報データベ
ース１２からの相関値となり、フレーム間補正において
は予め設定された値となる。Arranging all pairs on the interpolated spectrum shape obtained in this way in order of frequency yields the interpolated spectrum shape for region Z_i. In this embodiment, the above processing is performed for all regions Z_1, Z_2, … to obtain an interpolated spectrum shape from the plural spectrum shapes. The interpolation position x is the correlation value from the feature information database 12 in inter-phoneme interpolation, and a preset value in inter-frame interpolation.
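
The per-region procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: equations (3), (4), (5), (7), (8) and (9) are not reproduced in this text, so the linear boundary mapping (`remap`), the linear bracketing interpolation (`sample`) and the linear frequency blend are assumed forms chosen to agree with the definitions of W₁ and W₂ and with equation (6). Pairs that originate on SS12 are weighted by (1 − x), so that x = 0 reproduces SS11 and x = 1 reproduces SS12; that reading of equation (10) is also an assumption. All function and variable names are hypothetical.

```python
def lerp(a, b, t):
    """Linear blend: a at t = 0, b at t = 1."""
    return a + (b - a) * t


def remap(f, src, dst):
    """Map f from region src = (RB_i, RB_i+1) onto region dst.

    Assumed form of equations (3) and (7): scale by W_dst / W_src.
    """
    w_src = src[1] - src[0]
    w_dst = dst[1] - dst[0]
    return dst[0] + (f - src[0]) * w_dst / w_src


def sample(pairs, f):
    """Magnitude at f from the two existing pairs that bracket it.

    Assumed form of equations (4) and (8). `pairs` is a list of
    (frequency, magnitude) sorted by frequency with at least two entries.
    """
    for (f_lo, s_lo), (f_hi, s_hi) in zip(pairs, pairs[1:]):
        if f <= f_hi:
            break
    return lerp(s_lo, s_hi, (f - f_lo) / (f_hi - f_lo))


def interpolate_region(pairs1, pairs2, rb1, rb2, x):
    """Interpolated pairs for one region Z_i at interpolation position x."""
    out = []
    for f1, s1 in pairs1:                      # pairs existing on SS11
        f12 = remap(f1, rb1, rb2)
        s12 = sample(pairs2, f12)
        out.append((lerp(f1, f12, x), lerp(s1, s12, x)))   # magnitude: eq. (6)
    for f2, s2 in pairs2:                      # pairs existing on SS12
        f21 = remap(f2, rb2, rb1)
        s21 = sample(pairs1, f21)
        out.append((lerp(f21, f2, x), lerp(s21, s2, x)))   # weight (1 - x) on SS11 side
    return sorted(out)                         # arrange in order of frequency
```

Calling `interpolate_region` once per region Z_1, Z_2, … and concatenating the results would give the full interpolated spectrum shape under these assumptions.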

【００４１】再び図４において、スペクトル傾き補正部
１７は生成された補間スペクトルシェイプに対して傾き
補正等を行う（ステップＳ６）。一般に、出力音量が大
の場合にはスペクトルシェイプの高域が豊か（リッチ）
となり、小の場合にはスペクトルシェイプの高域が乏し
くなる。この現象を再現するために、スペクトル傾き補
正部１７は平均ゲイン算出部１４から出力された平均ゲ
インに応じて、補正スペクトルシェイプの平均ゲインや
高域の形状（ここでは「傾き」）を補正し、補正後のス
ペクトルシェイプを音素スペクトルシェイプとして出力
する。Referring again to FIG. 4, the spectrum tilt correction unit 17 applies tilt correction and related adjustments to the generated interpolated spectrum shape (step S6). In general, the high-frequency range of a spectrum shape is rich when the output volume is high and poor when it is low. To reproduce this phenomenon, the spectrum tilt correction unit 17 corrects the average gain and the high-frequency shape (here, the "tilt") of the spectrum shape according to the average gain output from the average gain calculation unit 14, and outputs the corrected spectrum shape as the phoneme spectrum shape.
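
The text does not give a formula for the tilt correction, so the following Python fragment is only a hypothetical illustration of the idea that a louder input should yield a richer high range. It assumes a spectrum shape held as (frequency in Hz, magnitude in dB) pairs and a tilt, in dB per octave above a pivot frequency, that grows linearly with the average gain. The reference gain, the sensitivity and the pivot frequency are invented parameters, not values from the embodiment, and the sketch leaves the average-gain correction itself aside.

```python
import math


def apply_tilt(shape, avg_gain_db, ref_gain_db=-20.0,
               db_per_octave_per_db=0.1, pivot_hz=1000.0):
    """Tilt the high range of `shape` according to the average gain.

    shape: list of (freq_hz, mag_db) pairs.
    All keyword parameters are hypothetical tuning values.
    """
    # Positive slope boosts the highs (loud input); negative slope cuts them.
    slope = db_per_octave_per_db * (avg_gain_db - ref_gain_db)
    tilted = []
    for freq_hz, mag_db in shape:
        # Only the band above the pivot frequency is modified.
        octaves = math.log2(freq_hz / pivot_hz) if freq_hz > pivot_hz else 0.0
        tilted.append((freq_hz, mag_db + slope * octaves))
    return tilted
```

With these placeholder values, an average gain 10 dB above the reference raises a component two octaves above the pivot by 2 dB, and a gain 10 dB below the reference lowers it by the same amount.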

【００４２】そして、前フレーム情報記憶部１８はスペ
クトル傾き補正部１７から出力された音素スペクトルシ
ェイプを記憶し、スペクトルシェイプ生成部１６の使用
に供する（ステップＳ７）。以後、処理はステップＳ１
に戻る。Then, the previous frame information storage unit 18 stores the phoneme spectrum shape output from the spectrum tilt correction unit 17 and makes it available to the spectrum shape generation unit 16 (step S7). Processing then returns to step S1.

【００４３】再び図１において、１９はピッチシフトシ
ェイプ生成部であり、分析部１３からフレーム単位で出
力された入力信号の正弦波成分を外部から与えられたピ
ッチシフト量だけピッチシフトし、ピッチシフト後の正
弦波成分を表すスペクトルシェイプとスペクトル傾き補
正部１７から出力された音素スペクトルシェイプとの間
でスペクトル補間を行うことで新たなスペクトルシェイ
プを生成し、出力する。ここで行われるスペクトル補間
の詳細は前述の通りである。Referring again to FIG. 1, reference numeral 19 denotes a pitch shift shape generation unit. It pitch-shifts the sine wave components of the input signal, output frame by frame from the analysis unit 13, by the externally supplied pitch shift amount, and generates and outputs a new spectrum shape by performing spectrum interpolation between the spectrum shape representing the pitch-shifted sine wave components and the phoneme spectrum shape output from the spectrum tilt correction unit 17. The details of this spectrum interpolation are as described above.

【００４４】２０は合成部であり、分析部１３から出力
された入力信号の残差成分とピッチシフトシェイプ生成
部１９から出力されたスペクトルシェイプとに応じた出
力信号を生成し、これを出力する。Reference numeral 20 denotes a synthesis unit, which generates and outputs an output signal according to the residual component of the input signal output from the analysis unit 13 and the spectrum shape output from the pitch shift shape generation unit 19.

【００４５】［Ａ−２．第１実施形態の動作］次に、上
記構成の音声変換装置の動作について図１を参照して説
明する。入力信号が音声変換装置に入力されると、分析
部１３において当該入力信号に対してフレーム単位の周
波数分析が行われる。これにより、入力信号は正弦波成
分と残差成分とに分けられ、正弦波成分が平均ゲイン算
出部１４及びピッチシフトシェイプ生成部１９へ、残差
成分が合成部２０へ供給される。また、分析部１３によ
り、入力信号のフレームピッチが求められ、このフレー
ムピッチが変換後ピッチ算出部１５へ供給される。[A-2. Operation of First Embodiment] Next, the operation of the voice conversion device having the above configuration will be described with reference to FIG. When the input signal is input to the voice conversion device, the analysis unit 13 performs a frequency analysis on a frame basis for the input signal. As a result, the input signal is divided into a sine wave component and a residual component, the sine wave component is supplied to the average gain calculator 14 and the pitch shift shape generator 19, and the residual component is supplied to the synthesizer 20. Further, the analysis unit 13 determines the frame pitch of the input signal, and supplies the frame pitch to the converted pitch calculation unit 15.

【００４６】正弦波成分が供給された平均ゲイン算出部
１４では平均ゲインが求められる。この平均ゲインはス
ペクトル傾き補正部１７へ供給される。また、フレーム
ピッチが供給された変換後ピッチ算出部１５には外部か
らピッチシフト量が与えられており、変換後ピッチ算出
部１５では、これらのフレームピッチ及びピッチシフト
量とに基づいて変換後ピッチが算出される。この変換後
ピッチは特徴情報データベース１２、スペクトルシェイ
プ生成部１６及びピッチシフトシェイプ生成部１９へ供
給される。変換後ピッチが供給されたピッチシフトシェ
イプ生成部１９では、入力信号の正弦波成分に対し、外
部から与えられたピッチシフト量に従ったピッチシフト
が行われる。The average gain calculation unit 14, supplied with the sine wave components, calculates the average gain, which is supplied to the spectrum tilt correction unit 17. The converted pitch calculation unit 15, supplied with the frame pitch, also receives a pitch shift amount from outside and calculates the converted pitch from the frame pitch and the pitch shift amount. The converted pitch is supplied to the feature information database 12, the spectrum shape generation unit 16, and the pitch shift shape generation unit 19. The pitch shift shape generation unit 19, supplied with the converted pitch, pitch-shifts the sine wave components of the input signal according to the externally supplied pitch shift amount.
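
The calculation performed by the converted pitch calculation unit 15 can be illustrated with a short Python sketch. The text does not state the unit of the pitch shift amount, so the sketch assumes it is given in equal-tempered semitones and that the frame pitch is in Hz. The function name and the sample values are hypothetical.

```python
def converted_pitch(frame_pitch_hz, shift_semitones):
    """Converted pitch for one frame.

    Assumes the pitch shift amount is expressed in equal-tempered
    semitones; one octave (12 semitones) doubles the frequency.
    """
    return frame_pitch_hz * 2.0 ** (shift_semitones / 12.0)


# Relative specification: a harmony line a major third (4 semitones)
# above the singer follows the singer's pitch frame by frame.
frame_pitches_hz = [220.0, 221.5, 219.0]
harmony_hz = [converted_pitch(p, 4) for p in frame_pitches_hz]
```

Because the shift is relative, the converted pitch tracks every fluctuation of the input pitch, which is what makes this form convenient for key control and fixed-interval harmony.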

【００４７】一方、音素認識部１１においては入力信号
から音素が認識され、当該音素を表す情報が特徴情報デ
ータベース１２へ供給される。なお、認識された音素が
複数である場合には、各音素に対する相関値も特徴情報
データベース１２へ供給される。特徴情報データベース
１２においては、当該音素に対応した特徴情報が抽出さ
れ、当該音素を表す情報（及び当該音素との相関値）と
ともにスペクトルシェイプ生成部１６へ供給される。前
述のように、特徴情報データベース１２がパラメータセ
ットの抽出における優先順位は個人別セット、男声／女
声セット、デフォルトセットとなっているため、特徴情
報データベース１２に個人名を示す情報が入力されてい
る場合には個人別セットが、個人名ではなく男声／女声
の別を示す情報が入力されている場合には男声／女声セ
ットが、いずれも入力されていない場合にはデフォルト
セットが、抽出対象のパラメータセットとなる。特徴情
報データベース１２では、このパラメータセットから、
音素認識部１１により認識された音素と変換後ピッチ算
出部１５により算出された変換後ピッチを包含したピッ
チバンクとに対応した特徴情報が抽出される。Meanwhile, the phoneme recognition unit 11 recognizes a phoneme from the input signal and supplies information representing the phoneme to the feature information database 12. If more than one phoneme is recognized, the correlation value for each phoneme is also supplied to the feature information database 12. The feature information database 12 extracts the feature information corresponding to the phoneme and supplies it to the spectrum shape generation unit 16 together with the information representing the phoneme (and its correlation value). As described above, the feature information database 12 selects a parameter set in the priority order personal set, male/female voice set, default set. Accordingly, the personal set is selected when information indicating a personal name has been input, the male/female voice set when information indicating male or female voice has been input instead of a personal name, and the default set when neither has been input. From the selected parameter set, the feature information database 12 extracts the feature information corresponding to the phoneme recognized by the phoneme recognition unit 11 and to the pitch bank that contains the converted pitch calculated by the converted pitch calculation unit 15.
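
The selection order described in this paragraph can be sketched as a two-step lookup. The layout of the feature information table is not specified at this level of detail, so the nested-dictionary structure below, the key names and the representation of a pitch bank as a (low, high) range in Hz are all assumptions made for illustration.

```python
def select_parameter_set(table, person=None, voice=None):
    """Pick a parameter set: personal set, then male/female voice set, then default set."""
    if person is not None:
        return table["personal"][person]
    if voice is not None:
        return table["voice"][voice]
    return table["default"]


def lookup_features(parameter_set, phoneme, converted_pitch_hz):
    """Feature information for `phoneme` from the pitch bank containing the converted pitch.

    parameter_set maps a hypothetical (low_hz, high_hz) pitch bank range
    to a dictionary from phoneme to feature information.
    """
    for (low_hz, high_hz), bank in parameter_set.items():
        if low_hz <= converted_pitch_hz < high_hz:
            return bank[phoneme]
    raise KeyError(f"no pitch bank contains {converted_pitch_hz} Hz")
```

A caller would first choose the parameter set from whatever designation was input and then query it with the recognized phoneme and the converted pitch, for example `lookup_features(select_parameter_set(table, voice="female"), "a", 300.0)`.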

【００４８】スペクトルシェイプ生成部１６では、特徴
情報データベース１２から供給された特徴情報に従って
スペクトルシェイプが生成される。ここで生成されたス
ペクトルシェイプが複数の場合には、スペクトルシェイ
プ生成部１６において前述の音素間補間が行われて１つ
のスペクトルシェイプが生成される。１つだけ生成され
たスペクトルシェイプは、その正弦波成分のピッチが変
換後ピッチ算出部１５から供給された変換後ピッチに一
致するようにシフトされた後に、前フレーム情報記憶部
１８に記憶された直前のスペクトルシェイプから滑らか
につながるように前述のフレーム間補間で利用される。
この結果として得られたスペクトルシェイプはスペクト
ル傾き補正部１７へ供給される。The spectrum shape generation unit 16 generates a spectrum shape according to the feature information supplied from the feature information database 12. If more than one spectrum shape is generated, the spectrum shape generation unit 16 performs the above-described inter-phoneme interpolation to produce a single spectrum shape. The single spectrum shape is shifted so that the pitch of its sine wave components matches the converted pitch supplied from the converted pitch calculation unit 15, and is then used in the above-described inter-frame interpolation so that it connects smoothly with the immediately preceding spectrum shape stored in the previous frame information storage unit 18. The resulting spectrum shape is supplied to the spectrum tilt correction unit 17.

【００４９】スペクトル傾き補正部１７ではスペクトル
シェイプに対して、入力信号の正弦波成分の平均ゲイン
に応じた傾き補正等が施され、この結果として得られた
スペクトルシェイプが音素スペクトルシェイプとしてピ
ッチシフトシェイプ生成部１９へ供給される。この際、
当該音素スペクトルシェイプはスペクトル傾き補正部１
７に記憶される。音素スペクトルシェイプが供給された
ピッチシフトシェイプ生成部１９では、このスペクトル
シェイプと入力信号の正弦波成分に基づいたスペクトル
シェイプとの間でスペクトル補間が行われ、新しいスペ
クトルシェイプが生成される。The spectrum tilt correction unit 17 applies tilt correction and related adjustments to the spectrum shape according to the average gain of the sine wave components of the input signal, and the resulting spectrum shape is supplied to the pitch shift shape generation unit 19 as the phoneme spectrum shape. At this time, the phoneme spectrum shape is stored in the spectrum tilt correction unit 17. The pitch shift shape generation unit 19, supplied with the phoneme spectrum shape, performs spectrum interpolation between this spectrum shape and the spectrum shape based on the sine wave components of the input signal, and generates a new spectrum shape.

【００５０】ピッチシフトシェイプ生成部１９により生
成されたスペクトルシェイプは合成部２０へ供給され
る。合成部２０ではこのスペクトルシェイプと入力信号
の残差成分とに応じた信号が生成され、本音声変換装置
の出力信号として出力される。なお、この出力信号のピ
ッチは変換後ピッチとなっている。The spectrum shape generated by the pitch shift shape generator 19 is supplied to the synthesizer 20. The synthesizing unit 20 generates a signal corresponding to the spectrum shape and the residual component of the input signal, and outputs the signal as an output signal of the present voice conversion device. The pitch of this output signal is the converted pitch.

【００５１】上述した第１実施形態によれば、入力信号
から、自然な声質のピッチシフト後の出力信号を得るこ
とができる。特に、ピッチシフト後の変換後ピッチを相
対的に指定することができるため、例えば、カラオケ装
置において、ユーザによるキーの制御の際や、ボーカル
に対してハーモニーを生成する際、男声と女声との変換
の際に用いて好適である。According to the first embodiment described above, a pitch-shifted output signal with natural voice quality can be obtained from the input signal. In particular, because the converted pitch can be specified relatively, the embodiment is well suited, for example in a karaoke apparatus, to key control by the user, to generating harmony for a vocal, and to conversion between male and female voices.

【００５２】［Ｂ−１．第２実施形態の構成］図６は本
発明の第２実施形態に係る音声変換装置の全体構成を示
すブロック図であり、この図において、図１と共通する
部分には同一の符号を付し、その説明を省略する。[B-1. Configuration of Second Embodiment] FIG. 6 is a block diagram showing the overall configuration of a voice conversion device according to a second embodiment of the present invention. In this figure, parts common to FIG. 1 are given the same reference numerals, and their description is omitted.

【００５３】第２実施形態に係る音声変換装置が第１実
施形態に係る音声変換装置と大きく相違する点は、外部
から与えられるパラメータがピッチシフト量ではなく、
変換後ピッチである点である。このため、第２実施形態
に係る音声変換装置においては、外部から与えられた変
換後ピッチがそのまま特徴情報データベース１２及びス
ペクトルシェイプ生成部１６へ供給されている。The voice conversion device according to the second embodiment differs from that of the first embodiment mainly in that the externally supplied parameter is not a pitch shift amount but the converted pitch itself. Accordingly, in the voice conversion device according to the second embodiment, the externally supplied converted pitch is supplied as is to the feature information database 12 and the spectrum shape generation unit 16.

【００５４】また、上記相違点に起因して、図６では、
変換後ピッチ算出部１５（図１参照）に代えてピッチシ
フト量算出部２１が設けられている。このピッチシフト
量算出部２１は、分析部１３から出力されたフレームピ
ッチと外部から与えられた変換後ピッチとの差を求め、
この差をシフトすべき量（以後、ピッチシフト量）とし
て出力する。このピッチシフト量がピッチシフトシェイ
プ生成部１９へ供給されるように第２実施形態に係る本
音声変換装置は構成されている。Because of this difference, FIG. 6 provides a pitch shift amount calculation unit 21 in place of the converted pitch calculation unit 15 (see FIG. 1). The pitch shift amount calculation unit 21 obtains the difference between the frame pitch output from the analysis unit 13 and the externally supplied converted pitch, and outputs this difference as the amount to be shifted (hereinafter, the pitch shift amount). The voice conversion device according to the second embodiment is configured so that this pitch shift amount is supplied to the pitch shift shape generation unit 19.

【００５５】［Ｂ−２．第２実施形態の動作］次に、上
記構成の音声変換装置の動作について説明する。ただ
し、第１実施形態に係る音声変換装置の動作と同様の動
作については、その説明を省略する。入力信号が音声変
換装置に入力されると、分析部１３において入力信号の
フレームピッチが求められ、このフレームピッチがピッ
チシフト量算出部２１へ供給される。ピッチシフト量算
出部２１では、外部から与えられた変換後ピッチに対す
るフレームピッチの差であるピッチシフト量が算出さ
れ、ピッチシフトシェイプ生成部１９へ供給される。シ
フト量が供給されたピッチシフトシェイプ生成部１９で
は、入力信号の正弦波成分に対し、シフト量に従ったピ
ッチシフトが行われる。[B-2. Operation of Second Embodiment] Next, the operation of the voice conversion device having the above configuration will be described. However, the description of the same operation as the operation of the voice conversion device according to the first embodiment will be omitted. When the input signal is input to the speech converter, the frame pitch of the input signal is obtained in the analysis unit 13, and this frame pitch is supplied to the pitch shift amount calculation unit 21. The pitch shift amount calculation unit 21 calculates a pitch shift amount, which is a difference between a frame pitch and an externally applied converted pitch, and supplies the difference to the pitch shift shape generation unit 19. The pitch shift shape generation unit 19 to which the shift amount is supplied performs a pitch shift on the sine wave component of the input signal according to the shift amount.
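
The role of the pitch shift amount calculation unit 21 can be illustrated with a short Python sketch. The text describes the pitch shift amount only as the difference between the frame pitch and the converted pitch, so the sketch assumes that this difference is taken on a logarithmic scale and expressed in semitones, with both pitches in Hz. The function names and sample values are hypothetical.

```python
import math


def pitch_shift_amount(frame_pitch_hz, target_pitch_hz):
    """Shift, in semitones (assumed unit), that moves the frame pitch onto the target pitch."""
    return 12.0 * math.log2(target_pitch_hz / frame_pitch_hz)


def apply_shift(pitch_hz, shift_semitones):
    """Pitch after shifting by the given number of semitones."""
    return pitch_hz * 2.0 ** (shift_semitones / 12.0)


# Absolute specification: whatever the singer produces in a frame,
# the shift is chosen so that the output lands on the target pitch.
target_hz = 220.0
sung_hz = [210.0, 220.0, 233.0]
shifts = [pitch_shift_amount(p, target_hz) for p in sung_hz]
corrected_hz = [apply_shift(p, s) for p, s in zip(sung_hz, shifts)]
```

In contrast to a fixed relative shift, the shift here changes from frame to frame, so the output pitch stays on the target even when the input pitch drifts.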

【００５６】また、特徴情報データベース１２及びスペ
クトルシェイプ生成部１６には、外部から与えられた変
換後ピッチが供給され、最終的に、スペクトル傾き補正
部１７から音素スペクトルシェイプが出力される。以降
の動作は第１実施形態における動作と同一であることか
ら、その説明を省略する。The characteristic information database 12 and the spectrum shape generator 16 are supplied with an externally applied converted pitch, and the spectrum inclination corrector 17 finally outputs a phoneme spectrum shape. Subsequent operations are the same as the operations in the first embodiment, and a description thereof will be omitted.

【００５７】上述した第２実施形態によれば、入力信号
から、自然な声質のピッチシフト後の出力信号を得るこ
とができる。特に、変換後ピッチを絶対的に指定するこ
とから、例えば、カラオケ装置において、強制的に正し
いピッチの出力音声を得る際や、曲の進行に応じてボー
カルとの度数差が変わる変則的なハーモニーを付加する
際に用いて好適である。According to the second embodiment described above, a pitch-shifted output signal with natural voice quality can be obtained from the input signal. In particular, because the converted pitch is specified absolutely, the embodiment is well suited, for example in a karaoke apparatus, to forcibly obtaining output voice at the correct pitch and to adding irregular harmony whose interval from the vocal changes as the song progresses.

【００５８】［Ｃ．補足］上述した各実施形態は例に過
ぎず、本発明は上記構成に限定されるものではなく、以
下に例示するような様々な態様を包含する。[C. Supplement] Each of the above-described embodiments is merely an example, and the present invention is not limited to the above configuration, but includes various aspects as exemplified below.

【００５９】本実施形態ではパラメータセット毎に複数
のピッチバンクを設けるようにしたが、図７に示すよう
に、ピッチバンクと他のパラメータセットとを独立して
設けてもよい。この場合には、特徴情報データベース１
２における特徴情報の抽出の優先順位は、例えば、個人
別セット、男声セット／女声セット、ピッチバンク、デ
フォルトセットとなり、個人名を示す指定情報が入力さ
れた場合には当該指定情報に対応した個人別セットが、
個人名を示す指定情報が入力されずに男声／女声の別を
示す指定情報が入力された場合には当該指定情報に対応
した男声セット／女声セットが、個人名を示す指定情報
及び男声／女声の別を示す指定情報のいずれも入力され
ずにピッチを示す指定情報が入力された場合には当該ピ
ッチを包含するピッチバンクが、上記のいずれでもない
場合にはデフォルトセットが、特徴情報の抽出対象のパ
ラメータセットとなる。In this embodiment a plurality of pitch banks is provided for each parameter set, but, as shown in FIG. 7, pitch banks may instead be provided independently of the other parameter sets. In this case, the priority order for extracting feature information in the feature information database 12 is, for example, personal set, male voice set/female voice set, pitch bank, default set. The parameter set from which feature information is extracted is then: the personal set corresponding to the designation information, when designation information indicating a personal name is input; the male voice set/female voice set corresponding to the designation information, when designation information indicating male or female voice is input without a personal name; the pitch bank that contains the pitch, when designation information indicating a pitch is input without either a personal name or a male/female designation; and the default set in all other cases.

【００６０】また、特徴情報テーブルＴＢＬに複数種の
パラメータセットを格納した例を示したが、これらのパ
ラメータセットとは異なる種類のパラメータセットを追
加してもよいし、逆にパラメータセットの種類を削減し
てもよい。An example has been shown in which plural types of parameter sets are stored in the feature information table TBL, but parameter sets of other types may be added, or conversely the number of parameter set types may be reduced.

【００６１】また、本実施形態では、分析部１３が入力
信号のフレームピッチを特徴情報データベース１２へ供
給するようにしたが、特徴情報データベース１２がピッ
チバンク毎の特徴情報を持たない場合には、この供給を
省略してもよい。さらに、本実施形態では各種のスペク
トル補間を行うようにしたが、スペクトル補間を行わな
い態様も実現可能である。例えば、認識される音素を必
ず１つとして音素間補間を省略し、さらにフレーム間補
間を省略し、加えて入力信号の正弦波成分のスペクトル
シェイプを考慮せずに出力信号を生成・出力するように
してもよい。In this embodiment the analysis unit 13 supplies the frame pitch of the input signal to the feature information database 12, but this supply may be omitted if the feature information database 12 does not hold feature information for each pitch bank. Furthermore, although various kinds of spectrum interpolation are performed in this embodiment, a mode without spectrum interpolation is also feasible. For example, the number of recognized phonemes may always be limited to one so that inter-phoneme interpolation is omitted, inter-frame interpolation may also be omitted, and the output signal may be generated and output without considering the spectrum shape of the sine wave components of the input signal.

【００６２】また、本実施形態では入力信号の残差成分
をも考慮に入れて出力信号を生成するようにしたが、残
差成分を考慮しない態様も実現可能である。この場合、
分析部１３が残差成分を破棄し、合成部２０がピッチシ
フトシェイプ生成部１９からのスペクトルシェイプのみ
を用いて出力信号を得ることになる。In this embodiment the output signal is generated taking the residual component of the input signal into account as well, but a mode that ignores the residual component is also feasible. In that case, the analysis unit 13 discards the residual component, and the synthesis unit 20 obtains the output signal using only the spectrum shape from the pitch shift shape generation unit 19.

【００６３】さらに、スペクトル補間において用いられ
るスペクトル遷移関数は線形関数に限定されない。２次
関数、指数関数などの非線形関数であってもよいし、離
散的な関数であってもよい。さらに言えば、変数に対応
した変化をテーブルとして用意し、このテーブルを関数
のように用いてもよい。また、スペクトル補間におい
て、アンカーポイント毎に遷移関数を変更するようにし
てもよい。さらに、遷移元と遷移先の音素を比較し、音
素の遷移状態に応じて遷移関数及び補間位置の少なくと
も一方を変更するようにしてもよい。Further, the spectrum transition function used in the spectrum interpolation is not limited to a linear function. It may be a non-linear function such as a quadratic function or an exponential function, or may be a discrete function. Furthermore, a change corresponding to a variable may be prepared as a table, and this table may be used like a function. In the spectrum interpolation, the transition function may be changed for each anchor point. Further, the transition source and the transition destination phonemes may be compared, and at least one of the transition function and the interpolation position may be changed according to the transition state of the phoneme.
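
The alternatives listed in this paragraph can be sketched as interchangeable transition functions that map the interpolation position x in [0, 1] to a blend weight. The specific curves, the steepness constant `k` and the step table below are hypothetical examples and are not taken from the embodiment.

```python
import math
from bisect import bisect_right


def linear(x):
    return x


def quadratic(x):
    return x * x


def exponential(x, k=4.0):
    """Exponential curve normalized so that f(0) = 0 and f(1) = 1."""
    return (math.exp(k * x) - 1.0) / (math.exp(k) - 1.0)


def make_table_function(points):
    """Discrete (step) transition function built from (x, weight) entries sorted by x."""
    xs = [p[0] for p in points]

    def table_function(x):
        index = max(bisect_right(xs, x) - 1, 0)   # last entry whose x is not above the query
        return points[index][1]

    return table_function


def interpolate(a, b, x, transition=linear):
    """Blend a toward b at position x using the chosen transition function."""
    return a + (b - a) * transition(x)
```

Because the transition function is passed as an argument, a different one could be supplied for each anchor point or for each phoneme transition, as the paragraph suggests.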

【００６４】[0064]

【発明の効果】以上説明したように、本発明によれば、
音声のピッチシフトにおいて、入力音声の音素毎の特徴
情報と変換後ピッチと入力音声を表す入力信号の平均ゲ
インとに基づいてピッチシフト後の音声を表す出力信号
が出力されるため、ピッチシフト後の音声を、各音素に
適合した、より自然な声質の音声とすることができる。As described above, according to the present invention, in pitch-shifting a voice, an output signal representing the pitch-shifted voice is output based on feature information for each phoneme of the input voice, the converted pitch, and the average gain of the input signal representing the input voice. The pitch-shifted voice can therefore be given a more natural voice quality suited to each phoneme.

【００６５】さらに、入力信号の周波数分析結果を考慮
して出力信号を生成するようにすれば、入力音声に近い
声質の音声を得ることができる。また、入力信号の正弦
波成分及び残差成分を分離し、両者の取り扱いを分けれ
ば、より入力音声に近い声質の音声を得ることができ
る。さらに、入力信号の平均ゲインを考慮してスペクト
ルシェイプの傾きを補正したり、直前のスペクトルシェ
イプとの滑らかなつながりを実現する処理を行ったりす
ることにより、ピッチシフト後の音声をより自然な声質
の音声とすることもできる。Further, if the output signal is generated taking the frequency analysis result of the input signal into account, a voice with a quality close to the input voice can be obtained. If the sine wave components and the residual component of the input signal are separated and handled separately, a voice quality still closer to the input voice can be obtained. In addition, correcting the tilt of the spectrum shape according to the average gain of the input signal, or performing processing that achieves a smooth connection with the immediately preceding spectrum shape, can make the pitch-shifted voice sound more natural.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明の第１実施形態に係る音声変換装置の
全体構成を示すブロック図である。FIG. 1 is a block diagram illustrating an overall configuration of a voice conversion device according to a first embodiment of the present invention.

【図２】同音声変換装置の特徴情報データベース１２
が有する特徴情報テーブルＴＢＬの構成を示す概念図で
ある。FIG. 2 is a conceptual diagram showing the configuration of the feature information table TBL held by the feature information database 12 of the voice conversion device.

【図３】同音声変換装置において行われるＳＭＳ分析
を説明するための図である。FIG. 3 is a diagram for explaining SMS analysis performed in the voice conversion device.

【図４】同音声変換装置において行われる音素スペク
トルシェイプ生成処理の流れを示すフローチャートであ
る。FIG. 4 is a flowchart showing a flow of a phoneme spectrum shape generation process performed in the voice conversion device.

【図５】（ａ）及び（ｂ）はそれぞれ同音声変換装置
において行われるスペクトル補間について説明するため
の図である。FIGS. 5A and 5B are diagrams for explaining spectrum interpolation performed in the voice conversion device.

【図６】本発明の第２実施形態に係る音声変換装置の
全体構成を示すブロック図である。FIG. 6 is a block diagram illustrating an overall configuration of a voice conversion device according to a second embodiment of the present invention.

【図７】各音声変換装置の特徴情報データベース１２
が有する特徴情報テーブルＴＢＬの他の構成例を示す概
念図である。FIG. 7 is a conceptual diagram showing another configuration example of the feature information table TBL held by the feature information database 12 of each voice conversion device.

【符号の説明】[Explanation of symbols]

１１……音素認識部、１２……特徴情報データベース、１３……分析部、１４……平均ゲイン算出部、１５……変換後ピッチ算出部、１６……スペクトルシェイプ生成部（出力手段）、１７……スペクトル傾き補正部（出力手段）、１８……前フレーム情報記憶部、１９……ピッチシフトシェイプ生成部（出力手段）、２０……合成部（出力手段）、ＴＢＬ……特徴情報テーブル。 11: phoneme recognition unit; 12: feature information database; 13: analysis unit; 14: average gain calculation unit; 15: converted pitch calculation unit; 16: spectrum shape generation unit (output means); 17: spectrum tilt correction unit (output means); 18: previous frame information storage unit; 19: pitch shift shape generation unit (output means); 20: synthesis unit (output means); TBL: feature information table.

フロントページの続き (72)発明者マークシーメンツスペインバルセロナ 08002 メルセ 12 (72)発明者ジョルディボナダスペインバルセロナ 08002 メルセ 12 Ｆターム(参考） 5D015 BB02 CC13 5D045 BA01 9A001 EZ02 KK45 KK62 Continuation of front page: (72) Inventor: Marc Shiimentsu, Merce 12, 08002 Barcelona, Spain; (72) Inventor: Jordi Bonada, Merce 12, 08002 Barcelona, Spain; F-terms (reference): 5D015 BB02 CC13 5D045 BA01 9A001 EZ02 KK45 KK62

Claims

【特許請求の範囲】[Claims]

【請求項１】入力音声を表す入力信号から入力音声と
異なるピッチの音声を表す出力信号を得る音声変換装置
において、前記入力信号に基づいて前記入力音声のピッチを分析す
るピッチ分析手段と、前記入力信号の平均ゲインを分析する平均ゲイン分析手
段と、前記ピッチ分析手段により分析されたピッチと与えられ
たピッチシフト量とに基づいて変換後ピッチを算出する
変換後ピッチ算出手段と、スペクトルシェイプを生成するための特徴情報を音素に
対応付けて格納した特徴情報データベースと、前記入力信号から音素を認識する音素認識手段と、前記音素認識手段により認識された音素に対応した特徴
情報を前記特徴情報データベースから取得し、該特徴情
報と前記変換後ピッチとに基づいて第１のスペクトルシ
ェイプを生成し、前記平均ゲイン分析手段により分析さ
れた平均ゲインと前記第１のスペクトルシェイプとに応
じた信号を前記出力信号として出力する出力手段とを具
備することを特徴とする音声変換装置。1. A voice conversion device for obtaining, from an input signal representing an input voice, an output signal representing a voice having a pitch different from that of the input voice, comprising: pitch analysis means for analyzing the pitch of the input voice based on the input signal; average gain analysis means for analyzing the average gain of the input signal; converted pitch calculation means for calculating a converted pitch based on the pitch analyzed by the pitch analysis means and a given pitch shift amount; a feature information database storing feature information for generating a spectrum shape in association with phonemes; phoneme recognition means for recognizing a phoneme from the input signal; and output means for acquiring, from the feature information database, feature information corresponding to the phoneme recognized by the phoneme recognition means, generating a first spectrum shape based on the feature information and the converted pitch, and outputting, as the output signal, a signal according to the average gain analyzed by the average gain analysis means and the first spectrum shape.

【請求項２】入力音声を表す入力信号から入力音声と
異なるピッチの音声を表す出力信号を得る音声変換装置
において、前記入力信号に基づいて前記入力音声のピッチを分析す
るピッチ分析手段と、前記入力信号の平均ゲインを分析する平均ゲイン分析手
段と、前記ピッチ分析手段により分析されたピッチと与えられ
た変換後ピッチとに基づいてピッチシフト量を算出する
ピッチシフト量算出手段と、スペクトルシェイプを生成するための特徴情報を音素に
対応付けて格納した特徴情報データベースと、前記入力信号から音素を認識する音素認識手段と、前記音素認識手段により認識された音素に対応した特徴
情報を前記特徴情報データベースから取得し、該特徴情
報と前記変換後ピッチとに基づいて第１のスペクトルシ
ェイプを生成し、前記平均ゲインと前記第１のスペクト
ルシェイプとに応じた信号を前記出力信号として出力す
る出力手段とを具備することを特徴とする音声変換装
置。2. A voice conversion device for obtaining, from an input signal representing an input voice, an output signal representing a voice having a pitch different from that of the input voice, comprising: pitch analysis means for analyzing the pitch of the input voice based on the input signal; average gain analysis means for analyzing the average gain of the input signal; pitch shift amount calculation means for calculating a pitch shift amount based on the pitch analyzed by the pitch analysis means and a given converted pitch; a feature information database storing feature information for generating a spectrum shape in association with phonemes; phoneme recognition means for recognizing a phoneme from the input signal; and output means for acquiring, from the feature information database, feature information corresponding to the phoneme recognized by the phoneme recognition means, generating a first spectrum shape based on the feature information and the converted pitch, and outputting, as the output signal, a signal according to the average gain and the first spectrum shape.

【請求項３】前記入力信号をフレーム単位で周波数分
析する周波数分析手段を具備し、前記出力手段は、前記第１のスペクトルシェイプと前記
周波数分析手段による周波数分析結果と前記ピッチシフ
ト量とに基づいて第２のスペクトルシェイプを生成し、
該第２のスペクトルシェイプに応じた信号を前記出力信
号として出力することを特徴とする請求項１または２に
記載の音声変換装置。3. The voice conversion device according to claim 1 or 2, further comprising frequency analysis means for frequency-analyzing the input signal frame by frame, wherein the output means generates a second spectrum shape based on the first spectrum shape, the frequency analysis result from the frequency analysis means, and the pitch shift amount, and outputs a signal according to the second spectrum shape as the output signal.

【請求項４】前記周波数分析手段は前記入力信号をフ
レーム単位で周波数分析して正弦波成分と残差成分に分
離し、前記出力手段は、前記第１のスペクトルシェイプと前記
正弦波成分と前記ピッチシフト量とに基づいて第３のス
ペクトルシェイプを生成し、該第３のスペクトルシェイ
プと前記残差成分とに応じた信号を前記出力信号として
出力することを特徴とする請求項３に記載の音声変換装
置。4. The voice conversion device according to claim 3, wherein the frequency analysis means frequency-analyzes the input signal frame by frame and separates it into a sine wave component and a residual component, and the output means generates a third spectrum shape based on the first spectrum shape, the sine wave component, and the pitch shift amount, and outputs a signal according to the third spectrum shape and the residual component as the output signal.

【請求項５】前記出力手段は、前記周波数分析結果と
前記ピッチシフト量とに基づいて得られるスペクトルシ
ェイプと前記第１のスペクトルシェイプとを与えられた
パラメータに従って補間することで前記第２のスペクト
ルシェイプを生成することを特徴とする請求項３に記載
の音声変換装置。5. The voice conversion device according to claim 3, wherein the output means generates the second spectrum shape by interpolating, according to a given parameter, between the first spectrum shape and a spectrum shape obtained based on the frequency analysis result and the pitch shift amount.

【請求項６】前記特徴情報データベースは複数のパラ
メータセットの各々について、スペクトルシェイプを生
成するための特徴情報を音素に対応付けて格納し、前記出力手段は、指定されたパラメータセットと前記音
素認識手段により認識された音素とに対応した特徴情報
を前記特徴情報データベースから取得することを特徴と
する請求項１または２に記載の音声変換装置。6. The voice conversion device according to claim 1 or 2, wherein the feature information database stores, for each of a plurality of parameter sets, feature information for generating a spectrum shape in association with phonemes, and the output means acquires from the feature information database the feature information corresponding to a designated parameter set and the phoneme recognized by the phoneme recognition means.

【請求項７】前記平均ゲインに応じて前記第１のスペ
クトルシェイプの傾きを補正して第４のスペクトルシェ
イプを生成するスペクトル傾き補正手段を具備し、前記出力手段は、前記第４のスペクトルシェイプに応じ
た信号を前記出力信号として出力することを特徴とする
請求項１または２に記載の音声変換装置。7. The voice conversion device according to claim 1 or 2, further comprising spectrum tilt correction means for correcting the tilt of the first spectrum shape according to the average gain to generate a fourth spectrum shape, wherein the output means outputs a signal according to the fourth spectrum shape as the output signal.

【請求項８】直前のスペクトルシェイプを記憶する前
フレーム情報記憶手段を具備し、前記出力手段は、前記第１のスペクトルシェイプと前記
平均ゲインと前記前フレーム情報記憶手段に記憶された
前記直前のスペクトルシェイプとに基づいて第５のスペ
クトルシェイプを生成し、該第５のスペクトルシェイプ
に応じた信号を前記出力信号として出力するとともに、
該第５のスペクトルシェイプを前記直前のスペクトルシ
ェイプとして前記前フレーム情報記憶手段に記憶させる
ことを特徴とする請求項１または２に記載の音声変換装
置。8. The voice conversion device according to claim 1 or 2, further comprising previous frame information storage means for storing an immediately preceding spectrum shape, wherein the output means generates a fifth spectrum shape based on the first spectrum shape, the average gain, and the immediately preceding spectrum shape stored in the previous frame information storage means, outputs a signal according to the fifth spectrum shape as the output signal, and stores the fifth spectrum shape in the previous frame information storage means as the immediately preceding spectrum shape.

【請求項９】入力音声を表す入力信号から入力音声と
異なる変換後ピッチの音声を表す出力信号を得る音声変
換方法において、前記入力信号の平均ゲインを分析するとともに該入力信
号から音素を認識するステップと、スペクトルシェイプを生成するための特徴情報であっ
て、前記音素認識手段により認識された音素に対応した
特徴情報を取得するステップと、前記特徴情報と前記変換後ピッチとに基づいてスペクト
ルシェイプを生成するステップと、前記平均ゲインと前記スペクトルシェイプとに基づいた
信号を前記出力信号として出力するステップとを有する
ことを特徴とする音声変換方法。9. A voice conversion method for obtaining, from an input signal representing an input voice, an output signal representing a voice having a converted pitch different from that of the input voice, comprising the steps of: analyzing the average gain of the input signal and recognizing a phoneme from the input signal; acquiring feature information for generating a spectrum shape, the feature information corresponding to the phoneme recognized by the phoneme recognition means; generating a spectrum shape based on the feature information and the converted pitch; and outputting a signal based on the average gain and the spectrum shape as the output signal.