JP6167063B2

JP6167063B2 - Utterance rhythm transformation matrix generation device, utterance rhythm transformation device, utterance rhythm transformation matrix generation method, and program thereof

Info

Publication number: JP6167063B2
Application number: JP2014082920A
Authority: JP
Inventors: 定男廣谷
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2014-04-14
Filing date: 2014-04-14
Publication date: 2017-07-19
Anticipated expiration: 2034-04-14
Also published as: JP2015203766A

Description

The present invention relates to a technique for converting the utterance rhythm of one piece of speech data into the utterance rhythm of another piece of speech data.

A method called non-negative spatiotemporal decomposition has been proposed for obtaining a time function Φ from a time-series signal Y of a vocal tract spectrum, for example LSP parameters (see Non-Patent Document 1). Let T be the time length of the time-series signal Y and t the index representing time, so that Y={Y(1),…,Y(t),…,Y(T)}. Let p be the total number of analysis orders of the LSP parameters and i the index representing the analysis order, so that Y(t)={y_1(t),…,y_i(t),…,y_p(t)}. Thus y_i(t) is the value of the LSP parameter of analysis order i at time t, and is also written simply as the LSP parameter y_i(t). Likewise, Φ={Φ(1),…,Φ(t),…,Φ(T)}. Let (m-2) be the total number of phonemes contained in the speech data corresponding to the time-series signal Y, and let k=1,…,m be the index representing a phoneme (where k=1 and k=m are the indices representing the beginning and the end, respectively). Then Φ(t)={φ_1(t),…,φ_k(t),…,φ_m(t)}. Thus φ_k(t) is the value of the time function of phoneme k at time t, and is also written simply as the time function φ_k(t). In the spatiotemporal decomposition method, the following decomposition is performed.

[Equation not reproduced in this text]

Here, t_k represents the center time of phoneme k. This model expresses the utterance rhythm in units of phonemes and assumes that only adjacent phonemes influence each other.
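The equation for this decomposition did not survive in this text. As a hedged reconstruction only, a form consistent with the definitions above (per-phoneme time functions, influence limited to adjacent phonemes) and with the embodiment's later choice a_{i,k}=y_{Φ,i}(t_k) would be the following, where a_{i,k} is assumed to denote the spectral target of phoneme k. The published equation may differ in detail.

```latex
\begin{aligned}
y_i(t) &\approx \sum_{k=1}^{m} a_{i,k}\,\varphi_k(t), && 1 \le t \le T,\\
y_i(t) &\approx a_{i,k-1}\,\varphi_{k-1}(t) + a_{i,k}\,\varphi_k(t), && t_{k-1} \le t \le t_k .
\end{aligned}
```

The second line is the first restricted to the two time functions that are assumed to be non-zero on the interval between adjacent phoneme centers.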

Now consider converting the utterance rhythm of the LSP-parameter time-series signal Y into that of the LSP-parameter time-series signal Z={Z(1),…,Z(s),…,Z(S)}, obtained from the same sentence uttered by a different (or the same) speaker, according to the time function Ω obtained from Z by non-negative spatiotemporal decomposition. The time length S of the time function Ω may differ from T. If the phonemes present in the time function Φ and in the time function Ω correspond to each other, then replacing the time function Φ with the time function Ω, that is,

[Equation not reproduced in this text]

converts the utterance rhythm of the LSP-parameter time-series signal Y into that of the LSP-parameter time-series signal Z (see Non-Patent Document 2).
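The replacement of Φ by Ω can be sketched as follows. This is a minimal illustration, not the method as published: it assumes the decomposition has the weighted-sum form y_i(t) ≈ Σ_k a_{i,k} φ_k(t), and the function name `synthesize` and the toy data are hypothetical.

```python
def synthesize(targets, time_funcs):
    """Return frames y(s) = sum_k targets[k] * time_funcs[k][s].

    targets:    m vectors (one per phoneme event), each of length p
    time_funcs: m lists, each one time function sampled over the frames
    (Assumed weighted-sum model; names are illustrative only.)
    """
    n_frames = len(time_funcs[0])
    p = len(targets[0])
    frames = []
    for s in range(n_frames):
        frame = [0.0] * p
        for a_k, f_k in zip(targets, time_funcs):
            for i in range(p):
                frame[i] += a_k[i] * f_k[s]
        frames.append(frame)
    return frames


# Toy data: m = 2 events, p = 2 spectral coefficients.
targets = [[1.0, 0.0], [0.0, 2.0]]
phi = [[1.0, 0.5, 0.0], [0.0, 0.5, 1.0]]            # source rhythm, T = 3
omega = [[1.0, 0.75, 0.5, 0.25, 0.0],
         [0.0, 0.25, 0.5, 0.75, 1.0]]               # target rhythm, S = 5

original = synthesize(targets, phi)     # 3 frames, rhythm of Y
converted = synthesize(targets, omega)  # 5 frames, same targets, rhythm of Z
```

Only the time functions change between the two calls, so the converted trajectory passes through the same spectral targets but on the target speaker's timing.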

Non-Patent Document 1: S. Hiroya, “Non-negative temporal decomposition of speech parameters by multiplicative update rules”, IEEE Trans. Audio, Speech, Lang. Process., 2013, pp. 2108-2117.
Non-Patent Document 2: S. Hiroya (廣谷定男), “Speech signal processing to extract and control utterance rhythm” (発話リズムを抽出・制御する音声信号処理), NTT Technical Journal (NTT技術ジャーナル), 2013, pp. 26-29.

However, when speech is synthesized from a vocal tract spectrum whose utterance rhythm has been converted by the method of Non-Patent Document 2, within each interval t_{k-1}≦t≦t_k the LSP parameter values y_i(t_{k-1}) and y_i(t_k) actually obtained from the speech data are used only at the times t=t_{k-1} and t=t_k. The LSP parameters at the intermediate times t_{k-1}<t<t_k are obtained by interpolating between y_i(t_{k-1}) and y_i(t_k). As a result, the converted speech is not sufficiently natural.

To improve the naturalness of converted speech over the prior art, an object of the present invention is to provide an utterance rhythm transformation matrix generation device that generates a transformation matrix, that is, a matrix for converting the utterance rhythm corresponding to a time series of one vocal tract spectrum into the utterance rhythm corresponding to a time series of another speech spectrum, together with an utterance rhythm transformation device, an utterance rhythm transformation matrix generation method, and a program therefor.

To solve the above problem, according to one aspect of the present invention, an utterance rhythm transformation matrix generation device includes: a non-negative spatiotemporal decomposition unit that, where a time function is a function indicating the relationship between the weight for a phoneme and time, obtains by non-negative spatiotemporal decomposition a first time function, which is the time function for a first vocal tract spectrum of first speech data; and a transformation matrix generation unit that, where a second time function is a time function indicating the relationship between time and the weights for the phonemes corresponding to those of the first time function, and a second vocal tract spectrum is the vocal tract spectrum corresponding to the second time function, uses the first time function and the second time function to generate a transformation matrix, which is a matrix for converting the utterance rhythm of the first vocal tract spectrum into the utterance rhythm of the second vocal tract spectrum.

To solve the above problem, according to another aspect of the present invention, an utterance rhythm transformation matrix generation method includes: a non-negative spatiotemporal decomposition step of, where a time function is a function indicating the relationship between the weight for a phoneme and time, obtaining by non-negative spatiotemporal decomposition a first time function, which is the time function for a first vocal tract spectrum of first speech data; and a transformation matrix generation step of, where a second time function is a time function indicating the relationship between time and the weights for the phonemes corresponding to those of the first time function, and a second vocal tract spectrum is the vocal tract spectrum corresponding to the second time function, using the first time function and the second time function to generate a transformation matrix, which is a matrix for converting the utterance rhythm of the first vocal tract spectrum into the utterance rhythm of the second vocal tract spectrum.

Converting the utterance rhythm with the transformation matrix generated by the present invention makes the converted speech more natural than with the conventional method.

FIG. 1 is a functional block diagram of the utterance rhythm transformation matrix generation device according to the first embodiment.
FIG. 2 is a diagram showing an example of the processing flow of the utterance rhythm transformation matrix generation device according to the first embodiment.
FIG. 3 is a diagram for explaining the time function.
FIG. 4 is a functional block diagram of the utterance rhythm transformation device according to the first embodiment.
FIG. 5 is a diagram showing an example of the processing flow of the utterance rhythm transformation device according to the first embodiment.
FIG. 6 is a functional block diagram of the utterance rhythm transformation matrix generation device according to a modification of the first embodiment.
FIG. 7 is a diagram showing an example of the processing flow of the utterance rhythm transformation matrix generation device according to the modification of the first embodiment.

Embodiments of the present invention are described below. In the drawings used in the following description, components having the same function and steps performing the same processing are given the same reference numerals, and duplicate description is omitted. Unless otherwise noted, processing performed on an element of a vector or matrix is applied to all elements of that vector or matrix.

<First embodiment>
Let Y be the time series of the vocal tract spectrum to be converted and Φ its time function. A transformation matrix G, H, or K is determined using a given time function Ω, and this transformation matrix is used to convert the utterance rhythm of the vocal tract spectrum time series Y into the utterance rhythm corresponding to the time function Ω. A time function is a function indicating the relationship between the weight for a phoneme and time, and it represents the utterance rhythm.

<Utterance rhythm transformation matrix generation device 10 according to the first embodiment>
FIG. 1 is a functional block diagram of the utterance rhythm transformation matrix generation device 10 according to this embodiment, and FIG. 2 shows an example of its processing flow.

The utterance rhythm transformation matrix generation device 10 includes a vocal tract spectrum extraction unit 11, a non-negative spatiotemporal decomposition unit 12, and a transformation matrix generation unit 13.

The utterance rhythm transformation matrix generation device 10 receives speech data X_Φ of speaker A and outputs a transformation matrix G, H, or K. In this embodiment, it also receives speech data X_Ω of speaker B. The speech data X_Ω is assumed to contain the same phoneme sequence as the speech data X_Φ. For example, speaker A is a native Japanese speaker and speaker B is a native English speaker; each utters the same sentence, the utterances are picked up with a microphone or the like, and the resulting speech data X_Φ and X_Ω are used.

<Vocal tract spectrum extraction unit 11>
The vocal tract spectrum extraction unit 11 receives the speech data X_Φ and X_Ω, performs vocal tract spectrum analysis on each using a known technique, extracts the time series Y_Φ and Y_Ω of the vocal tract spectrum (for example, LSP parameters) (s11), and outputs them to the non-negative spatiotemporal decomposition unit 12. For example, the vocal tract spectrum can be extracted using the method of Non-Patent Document 1 or Reference 1.
(Reference 1) S. Hiroya and T. Mochida, “Robust estimation of vocal tract spectrum using linear prediction based on phase equalization”, IEICE Technical Report, 2010, pp. 41-46.

<Non-negative spatiotemporal decomposition unit 12>
The non-negative spatiotemporal decomposition unit 12 receives the vocal tract spectrum time series Y_Φ and Y_Ω and, using non-negative spatiotemporal decomposition, which is a known technique, determines the time functions Φ and Ω so that

[Equations (11) and (12) not reproduced in this text]

hold (s12), and outputs them to the transformation matrix generation unit 13. For example, the time functions can be determined using the method of Non-Patent Document 1.

Here, let T and S be the time lengths of the vocal tract spectrum time series Y_Φ and Y_Ω, respectively, with t=1,…,T and s=1,…,S. Then Y_Φ={Y_Φ(1),…,Y_Φ(t),…,Y_Φ(T)} and Y_Ω={Y_Ω(1),…,Y_Ω(s),…,Y_Ω(S)}. Let p be the total number of analysis orders of the vocal tract spectrum and i the index representing the analysis order. (In this embodiment the total number of analysis orders is p for both Y_Φ and Y_Ω, but the two may differ.) Then Y_Φ(t)={y_{Φ,1}(t),…,y_{Φ,i}(t),…,y_{Φ,p}(t)} and Y_Ω(s)={y_{Ω,1}(s),…,y_{Ω,i}(s),…,y_{Ω,p}(s)}. Thus y_{Φ,i}(t) and y_{Ω,i}(s) are the values of analysis order i of the vocal tract spectrum at times t and s, respectively, and are also written simply as the vocal tract spectra y_{Φ,i}(t) and y_{Ω,i}(s).

Likewise, Φ={Φ(1),…,Φ(t),…,Φ(T)} and Ω={Ω(1),…,Ω(s),…,Ω(S)}. Let (m-2) be the total number of phonemes contained in the speech data corresponding to the time series Y_Φ and Y_Ω, and let k=1,…,m be the index representing a phoneme (where k=1 and k=m are the indices representing the beginning and the end of the phoneme sequence, respectively). Then Φ(t)={φ_1(t),…,φ_k(t),…,φ_m(t)} and Ω(s)={ω_1(s),…,ω_k(s),…,ω_m(s)}. Thus φ_k(t) and ω_k(s) are the values of the time function of phoneme k at times t and s, respectively, and are also written simply as the time functions φ_k(t) and ω_k(s).

Here, t_k and s_k represent the center time of phoneme k in the speech data X_Φ and X_Ω, respectively. The phonemes k and the center times t_k and s_k are assumed to have been attached to the speech data X_Φ and X_Ω in advance. For example, they may be attached to the speech data manually, or automatically using an existing technique (for example, a known speech recognition technique). The models of Equations (11) and (12) express the utterance rhythm in units of phonemes and assume that only adjacent phonemes influence each other.

For example, the time functions φ_{k-1}(t) and φ_k(t) are obtained by

[Equations not reproduced in this text]

The time functions φ_{k-1}(t) and φ_k(t) are obtained for every time t, with the phoneme index k changed according to the time t so that t_{k-1}≦t≦t_k is satisfied. In this embodiment, a_{i,k}=y_{Φ,i}(t_k). β is a weight; for example, β=100.

The time functions ω_{k-1}(s) and ω_k(s) can be obtained in the same way. Setting a_{i,k}=y_{Ω,i}(s_k) and replacing y_{Φ,i}(t_k) with y_{Ω,i}(s_k), t with s, t_k with s_k, and so on in the equations yields the time differences δ_k(s) and δ_{k-1}(s), from which the time functions ω_k(s) and ω_{k-1}(s) are obtained.

<Transformation matrix generation unit 13>
The transformation matrix generation unit 13 receives the time functions Φ and Ω, generates a transformation matrix G, H, or K from their values (s13), and outputs it. The transformation matrix is a matrix for converting the utterance rhythm of the vocal tract spectrum Y_Φ into the utterance rhythm of the vocal tract spectrum Y_Ω. Methods for generating the transformation matrices G, H, and K are described below.

(Generation method 1)
Because the time functions obtained by non-negative spatiotemporal decomposition are limited to the range [0,1], the value of the time function φ_k(t) at time t that is closest to the value of the time function ω_k(s) at time s is found by search, and the vocal tract spectrum Y_Φ(t) at that time t is output as the vocal tract spectrum Y'_Φ(s) (=Y_Φ(t)) at time s. To realize this with a transformation matrix G, it suffices to associate each time s of the time function Ω with a time t of the time function Φ. For example, for s_{k-1}≦s≦s_k and t_{k-1}≦t≦t_k, each time s is associated with the time t at which the difference between the time function ω_k(s) and the time function φ_k(t) is smallest. The transformation matrix G is a matrix with S rows and T columns. Denoting its element in row s and column t by G(s,t), it can be determined as

[Equation not reproduced in this text]

Performing this computation for all k yields the S-by-T transformation matrix G. The time t associated with time s is denoted t^*(s). A numerical example of the transformation matrix G is

[Numerical example not reproduced in this text]

in which the components near the diagonal of the S-by-T matrix are 1 and all other components are 0.
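The search described in generation method 1 can be sketched as follows. This is an illustrative reading of the text, not the published equation: it handles a single phoneme segment (the whole of `phi_k` and `omega_k` is treated as the interval being searched), uses 0-based frame indices, and the function name `build_g_nearest` is hypothetical. Ties are resolved toward the earliest frame, which is an assumption.

```python
def build_g_nearest(phi_k, omega_k):
    """Build a 0/1 matrix G (S rows, T columns) for one phoneme segment.

    For each target frame s, the source frame t*(s) whose phi_k value is
    closest to omega_k[s] gets G[s][t*(s)] = 1; every other entry is 0.
    """
    n_src = len(phi_k)
    G, t_star = [], []
    for target in omega_k:
        best_t = min(range(n_src), key=lambda t: abs(target - phi_k[t]))
        row = [0.0] * n_src
        row[best_t] = 1.0
        G.append(row)
        t_star.append(best_t)
    return G, t_star


phi_k = [0.0, 0.5, 1.0]                 # source time function, T = 3
omega_k = [0.0, 0.2, 0.5, 0.9, 1.0]     # target time function, S = 5
G, t_star = build_g_nearest(phi_k, omega_k)
```

For a full utterance the same search would be repeated for each k over its own intervals s_{k-1}≦s≦s_k and t_{k-1}≦t≦t_k, filling the corresponding rows of G.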

(Generation method 2)
When the vocal tract spectrum Y is transformed using the transformation matrix G obtained by generation method 1, the values of G change discretely, so the vocal tract spectrum may change discontinuously. Therefore, a weighted sum of the vocal tract spectrum Y_Φ(t-1) at time t-1 and the vocal tract spectrum Y_Φ(t) at time t is considered, and the transformation matrix G is generated so that the vocal tract spectrum changes continuously in time. For example, the transformation matrix G is generated using Equations (18) to (22). Using t^*(s) obtained by generation method 1 (the time t associated with time s), the elements are set as

[Equations (18) to (21) not reproduced in this text]

and otherwise
G(s,t^*(s))=0 (22)
In Equations (18) and (19), the elements G(s,t^*(s)-1) and G(s,t^*(s)) are obtained by internally dividing the time functions φ_k(t^*(s)-1) and φ_k(t^*(s)) by the time function ω_k(s) (see FIG. 3).

In Equations (20) and (21), the elements G(s,t^*(s)) and G(s,t^*(s)+1) are obtained by internally dividing the time functions φ_k(t^*(s)) and φ_k(t^*(s)+1) by the time function ω_k(s). A numerical example of G is

[Numerical example not reproduced in this text]

in which the components near the diagonal of the S-by-T matrix are non-zero and all other components are 0.
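One way to read the internal division in generation method 2 is as linear-interpolation weights on two adjacent source frames whose time-function values bracket ω_k(s). The sketch below follows that reading. It is not a transcription of Equations (18) to (22): the function name `two_point_row` is hypothetical, indices are 0-based, and the fallback to the nearest frame when no bracketing pair exists is an assumption of this sketch.

```python
def two_point_row(phi_k, omega_s, t_star):
    """Return one row of G with weights on two adjacent source frames.

    The weights are chosen so that
    row[lo] * phi_k[lo] + row[hi] * phi_k[hi] == omega_s and they sum to 1.
    """
    row = [0.0] * len(phi_k)
    # Try the pair (t*-1, t*) first, then (t*, t*+1).
    for lo, hi in ((t_star - 1, t_star), (t_star, t_star + 1)):
        if lo < 0 or hi >= len(phi_k):
            continue
        a, b = phi_k[lo], phi_k[hi]
        if a != b and min(a, b) <= omega_s <= max(a, b):
            w_hi = (omega_s - a) / (b - a)   # internal division ratio
            row[lo], row[hi] = 1.0 - w_hi, w_hi
            return row
    row[t_star] = 1.0  # assumed fallback: keep the nearest frame only
    return row


phi_k = [0.0, 0.5, 1.0]
row_a = two_point_row(phi_k, 0.2, 0)   # bracketed by frames 0 and 1
row_b = two_point_row(phi_k, 0.9, 2)   # bracketed by frames 1 and 2
```

Because the weights vary continuously with ω_k(s), the transformed spectrum no longer jumps from one source frame to the next.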

(Generation method 3)
In generation method 2, the elements G(s,t) of the transformation matrix were determined so as to form a weighted sum of two points, but they may instead be determined so as to form a weighted sum of three or more points. In that case it is difficult to obtain the values of G(s,t) analytically, so the multiplicative update used in Non-Patent Document 1 is used to iteratively find the values of G(s,t) that minimize the following evaluation function Q.

[Equation not reproduced in this text]

Here, the first term is the squared error between the time function φ_k(s) and the time function ω_k(s), the second term is a constraint that makes the weights sum to 1, and α is the weight coefficient for that constraint. For example, α=1. With the multiplicative update used in Non-Patent Document 1, if the initial values of the transformation matrix G are non-negative, G converges to non-negative values, so no constraint forcing G(s,t) to be non-negative needs to be added to the evaluation function Q. The multiplicative update rule is obtained by differentiating the evaluation function Q and placing the terms with a positive sign in the denominator and the terms with a negative sign in the numerator, as follows.

[Equation not reproduced in this text]

For example, to use a weighted sum of five points, the range of τ is set to t^*(s)-2 through t^*(s)+2 using t^*(s) obtained by generation method 1. For example, the initial values of the elements G(s,t) are set to random numbers in the range [0,1], and the value of the transformation matrix G is output when the change in the evaluation function Q caused by an update has become sufficiently small, or when a pre-specified number of updates has been reached (that is, when a predetermined condition is satisfied). Instead of the squared error, a distance measure such as the Kullback-Leibler distance or the Itakura-Saito distortion can also be used.
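For the squared-error case, the iteration can be sketched as below. The evaluation function and update rule are not reproduced in this text, so the code assumes Q = Σ_s [ Σ_k (ω_k(s) − Σ_τ G(s,τ) φ_k(τ))² + α (1 − Σ_τ G(s,τ))² ] and derives the update from it in the way the text describes (negative-sign gradient terms in the numerator, positive-sign terms in the denominator). Summing over k, the names `objective`, `multiplicative_update`, and `support`, and the initial range (0.1, 1.0) are assumptions of this sketch.

```python
import random


def objective(G, phi, omega, alpha=1.0):
    """Assumed Q: squared error plus alpha * (1 - row sum)^2, summed over s."""
    m, T, S = len(phi), len(phi[0]), len(omega[0])
    q = 0.0
    for s in range(S):
        for k in range(m):
            est = sum(G[s][t] * phi[k][t] for t in range(T))
            q += (omega[k][s] - est) ** 2
        q += alpha * (1.0 - sum(G[s])) ** 2
    return q


def multiplicative_update(phi, omega, support, alpha=1.0, n_iter=200, seed=0):
    """phi: m x T, omega: m x S, support[s]: source frames tau usable for row s."""
    rng = random.Random(seed)
    m, T, S = len(phi), len(phi[0]), len(omega[0])
    G = [[0.0] * T for _ in range(S)]
    for s in range(S):
        for t in support[s]:
            G[s][t] = rng.uniform(0.1, 1.0)   # strictly positive start
    for _ in range(n_iter):
        for s in range(S):
            taus = support[s]
            est = [sum(G[s][t] * phi[k][t] for t in taus) for k in range(m)]
            total = sum(G[s][t] for t in taus)
            new = {}
            for t in taus:
                num = sum(omega[k][s] * phi[k][t] for k in range(m)) + alpha
                den = sum(est[k] * phi[k][t] for k in range(m)) + alpha * total
                new[t] = G[s][t] * num / den
            for t in taus:
                G[s][t] = new[t]
    return G


phi = [[0.0, 0.5, 1.0]]                    # one time function, T = 3
omega = [[0.25, 0.75]]                     # target values, S = 2
support = [[0, 1, 2], [0, 1, 2]]           # three-point weighted sums
G0 = multiplicative_update(phi, omega, support, n_iter=0)
G200 = multiplicative_update(phi, omega, support, n_iter=200)
```

Entries outside `support[s]` start at 0 and stay at 0, which is how the range of τ restricts each row to a few points around t^*(s).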

For example, when the Kullback-Leibler distance is used, the evaluation function Q is expressed by the following equation.

[Equation not reproduced in this text]

The multiplicative update rule that minimizes this evaluation function Q is expressed by the following equation.

[Equation not reproduced in this text]

When the Itakura-Saito distortion is used, the evaluation function Q is expressed by the following equation.

[Equation not reproduced in this text]

(Generation method 4)
A transformation matrix H is obtained that interpolates between the vocal tract spectra at the two driving time points t_{k-1} and t_k. The elements of the matrix are given by

[Equation not reproduced in this text]

A numerical example of H is

[Numerical example not reproduced in this text]

which is a matrix having values in the t_{k-1}-th column and the t_k-th column. The vocal tract spectrum obtained by transformation with this matrix H matches the one obtained in Non-Patent Document 2 when the time function Φ is replaced with the time function Ω.
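A matrix with this column structure can be sketched as follows. The element equation is not reproduced in this text, so the sketch assumes H(s,t_{k-1})=ω_{k-1}(s) and H(s,t_k)=ω_k(s) for s_{k-1}≦s≦s_k, which is the choice that reproduces the replacement of Φ by Ω. The function name `build_h` and the 0-based event indices `t_events` and `s_events` are hypothetical.

```python
def build_h(omega, t_events, s_events, n_src):
    """Build H (S rows, n_src columns) with values only in event columns.

    omega:    m target time functions, each sampled over S frames
    t_events: source frame index of each event (0-based)
    s_events: target frame index of each event (0-based)
    """
    m, S = len(omega), len(omega[0])
    H = [[0.0] * n_src for _ in range(S)]
    for k in range(1, m):
        for s in range(s_events[k - 1], s_events[k] + 1):
            row = [0.0] * n_src
            row[t_events[k - 1]] = omega[k - 1][s]
            row[t_events[k]] = omega[k][s]
            H[s] = row
    return H


omega = [[1.0, 0.75, 0.5, 0.25, 0.0],
         [0.0, 0.25, 0.5, 0.75, 1.0]]
H = build_h(omega, t_events=[0, 2], s_events=[0, 4], n_src=3)
```

Column 1 stays all zero here: source frames between the events never contribute, which is what lets such a matrix skip material lying between two event times.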

For example, when a native Japanese speaker and a native English speaker utter the same sentence or word, the phonemes contained in the respective speech data may differ. When reading English text, native English speakers tend to omit vowels compared with native Japanese speakers, and in such cases the phoneme sequence contained in one set of speech data differs from that contained in the other. The transformation matrix G obtained by generation methods 1 to 3 cannot handle this case, but the transformation matrix H obtained by this generation method can omit phonemes appropriately.

For example, focusing on the “du” in the word “roundup”, native Japanese speakers tend to pronounce it as /doa/, inserting the vowel /o/ after /d/, whereas native English speakers tend to pronounce it as /da/. When converting the utterance rhythm of a native Japanese speaker into that of a native English speaker, it is appropriate to remove the Japanese speaker's /o/. Using the Japanese speaker's vocal tract spectra for /d/ and /a/ together with the transformation matrix H removes the vowel /o/ and lets the vocal tract spectrum change smoothly between /d/ and /a/.

In this generation method, the only relationship between the time function φ_k(t) and the time function ω_k(s) is that φ_k(t_k)=ω_k(s_k)=1 holds at times t_k and s_k, and a characteristic of the transformation matrix H is that it is determined almost entirely by the time function φ_k(t).

(Generation method 5)
The transformation matrix G or H obtained by generation methods 1 to 4 can be used as is, but to avoid abrupt changes in the vocal tract spectrum caused by the transformation, the transformation matrices G and H can also be interpolated with a weight w. For example, using the transformation matrix G generated by any of generation methods 1 to 3 and the transformation matrix H generated by generation method 4, a transformation matrix K is generated by the following equation (see FIG. 1).

[Equation (30) not reproduced in this text]

Here 0≦w≦1; for example, w=0.5. Alternatively, denoting the transformation matrices generated by generation methods 1 to 3 as G_1, G_2, and G_3, with w_1≧0, w_2≧0, w_3≧0, and 0≦w_1+w_2+w_3≦1, K can also be generated as

[Equation (31) not reproduced in this text]

(see FIG. 1). For example, w_1+w_2+w_3=0.5. When two of the three weights w_1, w_2, w_3 in Equation (31) are 0, it reduces to Equation (30). To ensure that at least one of G_1, G_2, and G_3 influences the transformation matrix K, set 0<w≦1 in Equation (30) and 0<w_1+w_2+w_3≦1 in Equation (31).
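The interpolation of generation method 5 can be sketched as below. Equations (30) and (31) are not reproduced in this text. The sketch assumes the forms K = wG + (1−w)H and K = w_1G_1 + w_2G_2 + w_3G_3 + (1−w_1−w_2−w_3)H, which agree with the stated weight ranges and with (31) reducing to (30) when two weights are 0. The function name `blend` is hypothetical.

```python
def blend(g_list, weights, H):
    """Return K = sum_j weights[j] * g_list[j] + (1 - sum(weights)) * H.

    g_list:  matrices from generation methods 1 to 3 (any subset)
    weights: non-negative weights, one per matrix, summing to at most 1
    H:       matrix from generation method 4 (same shape as each G)
    """
    total = sum(weights)
    if any(w < 0.0 for w in weights) or total > 1.0:
        raise ValueError("weights must be non-negative and sum to at most 1")
    rows, cols = len(H), len(H[0])
    K = [[(1.0 - total) * H[r][c] for c in range(cols)] for r in range(rows)]
    for w, G in zip(weights, g_list):
        for r in range(rows):
            for c in range(cols):
                K[r][c] += w * G[r][c]
    return K


G = [[1.0, 0.0], [0.0, 1.0]]
H = [[0.5, 0.5], [0.5, 0.5]]
K = blend([G], [0.5], H)          # single-weight case with w = 0.5
```

Passing one matrix and one weight gives the single-weight case; passing three matrices gives the three-weight case.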

<Utterance rhythm transformation device 20 according to the first embodiment>
FIG. 4 is a functional block diagram of the utterance rhythm transformation device 20 according to this embodiment, and FIG. 5 shows an example of its processing flow.
The utterance rhythm transformation device 20 includes a vocal tract spectrum extraction unit 21, an utterance rhythm transformation unit 22, and a speech synthesis unit 23.
The utterance rhythm transformation device 20 receives the speech data X_Φ of speaker A and outputs speech data X'_Φ. The speech data X'_Φ is the speech data X_Φ of speaker A with its utterance rhythm converted into the utterance rhythm of the speech data X_Ω of speaker B.

<Vocal tract spectrum extraction unit 21>
The vocal tract spectrum extraction unit 21 receives the speech data X_Φ, performs vocal tract spectrum analysis using a known technique, extracts the vocal tract spectrum time series Y_Φ and the sound source signal time series Z_Φ (s21), and outputs them to the utterance rhythm transformation unit 22. For example, the vocal tract spectrum and the sound source signal can be extracted using the method of Non-Patent Document 1 or Reference 1.

For example, in non-speech sections and unvoiced speech sections, the sound source signal is white noise multiplied by a white noise gain. In voiced speech sections, it is either a multipulse calculated from the fundamental frequency, the pulse gain, and a multipulse sound source model, or a single pulse train calculated from the fundamental frequency and the pulse gain. The fundamental frequency and the gains (white noise gain and pulse gain) may be extracted using a known technique, for example the method of Reference 2.
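As a rough illustration of the white-noise and single-pulse-train cases described above, the following sketch builds a sound source signal frame by frame. It assumes that a fundamental frequency of 0 marks an unvoiced or non-speech frame; the function name `make_excitation`, the frame length, and the sampling rate are hypothetical, and the multipulse model is not covered.

```python
import random


def make_excitation(f0, noise_gain, pulse_gain, frame_len, sample_rate, seed=0):
    """Build a sound source signal frame by frame.

    f0[i] == 0 marks an unvoiced or non-speech frame (assumption).
    """
    rng = random.Random(seed)
    signal = []
    next_pulse = 0.0  # sample position of the next pulse, carried across frames
    for i, f in enumerate(f0):
        start = i * frame_len
        if f <= 0:
            # Unvoiced / non-speech: white noise scaled by the white noise gain.
            signal.extend(noise_gain[i] * rng.gauss(0.0, 1.0)
                          for _ in range(frame_len))
            next_pulse = start + frame_len
        else:
            # Voiced: single pulse train with period sample_rate / f0.
            period = sample_rate / f
            frame = [0.0] * frame_len
            while next_pulse < start + frame_len:
                frame[int(next_pulse) - start] = pulse_gain[i]
                next_pulse += period
            signal.extend(frame)
    return signal


# One unvoiced frame followed by one voiced frame at 100 Hz (toy values).
signal = make_excitation(
    f0=[0.0, 100.0],
    noise_gain=[0.1, 0.0],
    pulse_gain=[0.0, 1.0],
    frame_len=160,
    sample_rate=8000,
)
```

With these toy values the voiced frame contains two pulses, 80 samples apart, while the unvoiced frame contains scaled Gaussian noise.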

<Utterance rhythm conversion unit 22>
The utterance rhythm conversion unit 22 receives the time series Y_Φ of the vocal tract spectrum, the time series Z_Φ of the sound source signal, and a transformation matrix G, H, or K. It multiplies Y_Φ and Z_Φ by the transformation matrix (for example, G) as in the following equations to obtain the time series Y'_Φ of the vocal tract spectrum and the time series Z'_Φ of the sound source signal (s22), and outputs them to the speech synthesis unit 23.
Y'_Φ = G × Y_Φ (32)
Z'_Φ = G × Z_Φ (33)
The utterance rhythm of the time series Y'_Φ of the vocal tract spectrum is identical to that of the time series Y_Ω of the vocal tract spectrum.

For example, to obtain the converted time series Z'_Φ of the sound source signal, the utterance rhythm conversion unit 22 multiplies the fundamental frequency and the gains (white noise gain and pulse gain) by the transformation matrix, which yields a fundamental frequency and gains whose time length corresponds to that of the time series Y'_Φ of the vocal tract spectrum.
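The multiplications in Equations (32) and (33) can be sketched as follows, treating the S × T transformation matrix and the T-frame time series as plain lists of rows. The function name `apply_transform` and all numerical values are hypothetical; in particular, the toy matrix G below simply interpolates between neighbouring frames and is not produced by any of the generation methods.

```python
def apply_transform(g, y):
    """Return G x Y, where g is S x T and y is T x p (lists of rows)."""
    t_len = len(y)
    if any(len(row) != t_len for row in g):
        raise ValueError("each row of G must have T entries")
    p = len(y[0])
    return [[sum(row[t] * y[t][i] for t in range(t_len)) for i in range(p)]
            for row in g]


# Toy example: T = 2 input frames stretched to S = 3 output frames.
g = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0]]
y = [[0.25, 0.5],         # frame 1, p = 2 spectral parameters
     [0.75, 1.0]]         # frame 2
f0 = [[100.0], [120.0]]   # fundamental frequency as a T x 1 column

y_conv = apply_transform(g, y)     # S x p, same form as Equation (32)
f0_conv = apply_transform(g, f0)   # S x 1, same operation applied to F0
```

The same call applies to any per-frame parameter track (gains, fundamental frequency), so the converted source parameters end up with the same number of frames S as the converted vocal tract spectrum.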

<Speech synthesis unit 23>
The speech synthesis unit 23 receives the time series Z'_Φ of the sound source signal and the time series Y'_Φ of the vocal tract spectrum and performs speech synthesis using them. For example, it performs speech synthesis by the method of Reference 2 (s23) and outputs the speech data X'_Φ as the output of the utterance rhythm conversion device 20.
(Reference 2) Japanese Unexamined Patent Application Publication No. 2011-150232
For example, speech synthesis is performed by convolving the time series Y'_Φ of the vocal tract spectrum with the time series Z'_Φ of the sound source signal.
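The convolution step can be illustrated with a minimal sketch. It assumes that the vocal tract spectrum of a frame has already been converted into a short impulse response (that conversion is not shown and is not specified here); the function name `convolve` and the sample values are hypothetical, and the method of Reference 2 may differ in detail.

```python
def convolve(excitation, impulse_response):
    """Full discrete convolution of two sequences."""
    out = [0.0] * (len(excitation) + len(impulse_response) - 1)
    for n, x in enumerate(excitation):
        if x == 0.0:
            continue  # zero samples contribute nothing
        for j, h in enumerate(impulse_response):
            out[n + j] += x * h
    return out


excitation = [1.0, 0.0, 0.0, 0.5]       # two pulses of different gain
impulse_response = [1.0, 0.5, 0.25]     # assumed vocal tract response
speech = convolve(excitation, impulse_response)
```

Each pulse in the excitation produces a scaled copy of the impulse response in the output, which is the basic source-filter relationship the convolution expresses.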

When adjacent frames switch between voiced and unvoiced (the fundamental frequency of an unvoiced frame is usually set to 0), the weighted sums used by generation methods 2 to 5 may not yield an appropriate conversion. Generation method 1 should be used for such frames.
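Frames affected by this caveat can be located from the fundamental frequency track alone. The following sketch flags every frame whose voicing differs from that of a neighbouring frame, assuming that unvoiced frames carry a fundamental frequency of 0; the function name `voicing_switch_frames` is hypothetical, and how the corresponding part of the transformation matrix is then regenerated with generation method 1 is not shown.

```python
def voicing_switch_frames(f0):
    """Return indices of frames whose voicing differs from a neighbouring frame."""
    voiced = [f > 0 for f in f0]
    flagged = []
    for i in range(len(voiced)):
        prev_differs = i > 0 and voiced[i] != voiced[i - 1]
        next_differs = i + 1 < len(voiced) and voiced[i] != voiced[i + 1]
        if prev_differs or next_differs:
            flagged.append(i)
    return flagged


# Unvoiced, unvoiced, three voiced frames, unvoiced (toy values in Hz).
f0 = [0.0, 0.0, 110.0, 115.0, 120.0, 0.0]
switch = voicing_switch_frames(f0)
```

A weighted sum across such a boundary would average a real fundamental frequency with 0, which is why these frames are singled out.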

<Effect>
With this configuration, converting the utterance rhythm using the generated transformation matrix makes the converted speech more natural than with conventional methods.

<Modifications>
The key point of the utterance rhythm transformation matrix generation device 10 is that it generates a transformation matrix using the time functions Ω and Φ. The device may therefore be configured to receive time functions Ω and Φ obtained by another device, perform the transformation matrix generation process (s13), and output the transformation matrix. In that case, the vocal tract spectrum extraction unit 11 and the non-negative spatio-temporal decomposition unit 12 need not be included. Alternatively, the device may take a vocal tract spectrum as input and omit the vocal tract spectrum extraction unit 11.

In the present embodiment, for simplicity, the phoneme sequence contained in the speech data X_Ω is assumed to be identical to the phoneme sequence contained in the speech data X_Φ, but they need not be identical. For example, when a native Japanese speaker and a native English speaker utter the same sentence or word, the phonemes contained in their speech data may differ. When reading English text, for instance, native English speakers tend to omit vowels more than native Japanese speakers do, so the phoneme sequence of one speech data differs from that of the other. In that case, before the processing of the present embodiment is performed, phonemes may be omitted from the speech data with more phonemes, or added to the speech data with fewer phonemes, so that the numbers of phonemes match. Thus, the phoneme sequences contained in X_Ω and X_Φ need not be identical; it suffices that they correspond to each other.
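One possible way to find which phonemes have no counterpart in the other utterance is a generic sequence alignment. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in; the text above does not specify how the correspondence is established, and the function name `unmatched_phonemes` and the phoneme labels (a rough rendering of the word "street") are hypothetical.

```python
from difflib import SequenceMatcher


def unmatched_phonemes(seq_a, seq_b):
    """Return (extra_in_a, extra_in_b): indices with no counterpart in the other sequence."""
    matcher = SequenceMatcher(a=seq_a, b=seq_b, autojunk=False)
    extra_a, extra_b = [], []
    for tag, a0, a1, b0, b1 in matcher.get_opcodes():
        if tag != "equal":
            extra_a.extend(range(a0, a1))
            extra_b.extend(range(b0, b1))
    return extra_a, extra_b


# Hypothetical phoneme labels for the same word spoken by two speakers.
japanese_speaker = ["s", "u", "t", "o", "r", "i", "i", "t", "o"]
english_speaker = ["s", "t", "r", "i", "i", "t"]

extra_a, extra_b = unmatched_phonemes(japanese_speaker, english_speaker)

# Omitting the unmatched phonemes from the longer sequence equalizes the counts.
reduced = [p for i, p in enumerate(japanese_speaker) if i not in extra_a]
```

Here the inserted vowels of the longer sequence are the unmatched entries; dropping them (or, conversely, adding them to the shorter sequence) gives two sequences with the same number of phonemes.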

The vocal tract spectrum to be converted is not limited to LSP parameters; other vocal tract spectra may be used.

The utterance rhythm transformation matrix generation device 10 may be provided inside the utterance rhythm conversion device 20 (see FIG. 6). In that case, the non-negative spatio-temporal decomposition process (s12) and the transformation matrix generation process (s13) must be performed before the utterance rhythm conversion process (s22) (see FIG. 7).

In the present embodiment, a_{i,k} = y_i(t_k) in Equations (15) and (16), but a_{i,k} may instead be obtained by the following Equations (33) and (34) (see Non-Patent Document 1).

In the present embodiment, the time function Ω is obtained from speaker B's speech data X_Ω by the vocal tract spectrum extraction process (s11) and the non-negative spatio-temporal decomposition process (s12), but it may be obtained by other methods. For example, Ω may be obtained by converting the time function Φ: if there is a conversion rule that maps a time function Φ characteristic of the native speakers of one language to a time function Ω characteristic of the native speakers of another, Ω may be obtained from Φ according to that rule. Ω may also be obtained by stretching or compressing the time axis of Φ, or by freely constructing any function that satisfies the constraints on the time function Ω (s_{k-1} ≤ s ≤ s_k, ω_{k-1}(s) + ω_k(s) = 1, ω_{k-1}(s) > 0, ω_k(s) > 0). Furthermore, although speech data uttered by a speaker other than speaker A is normally used to change the utterance rhythm, this is not required; Ω may be obtained from speech data uttered by speaker A.
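As one illustration of freely constructing a time function that satisfies these constraints, the sketch below interpolates linearly between consecutive phoneme center times. It is only one of many admissible constructions; the function name `make_time_function`, the small offset `eps` (used to keep both weights strictly positive), the 0-based frame indices, and the center times are all hypothetical. Each frame is assigned to a single interval so that exactly two weights are non-zero per frame.

```python
def make_time_function(centers, length, eps=1e-3):
    """Build omega[k][s] for s = 0..length-1 from phoneme center times.

    centers must be strictly increasing integer frame indices. Between
    centers[k-1] and centers[k] only omega[k-1] and omega[k] are non-zero;
    both are positive and they sum to 1.
    """
    m = len(centers)
    omega = [[0.0] * length for _ in range(m)]
    for k in range(1, m):
        lo, hi = centers[k - 1], centers[k]
        # Half-open interval, except that the last interval includes its end.
        last = hi + 1 if k == m - 1 else hi
        for s in range(lo, last):
            frac = (s - lo) / (hi - lo)
            w = eps + (1.0 - 2.0 * eps) * frac   # stays within [eps, 1 - eps]
            omega[k][s] = w
            omega[k - 1][s] = 1.0 - w
    return omega


# Three center times (start marker, one phoneme, end marker) over 11 frames.
centers = [0, 4, 10]
omega = make_time_function(centers, length=11)
```

Moving the center times changes where each phoneme's weight peaks, which is how a different utterance rhythm would be expressed in this representation.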

<Other modifications>
The present invention is not limited to the above embodiment and modifications. For example, the various processes described above need not be executed in time series in the order described; they may be executed in parallel or individually, depending on the processing capability of the device that executes them or as needed. Other changes may be made as appropriate without departing from the spirit of the present invention.

<Program and recording medium>
The various processing functions of each device described in the above embodiment and modifications may be realized by a computer. In that case, the processing content of the functions that each device should have is described by a program. Executing this program on a computer realizes the processing functions of each device on that computer.

The program describing this processing content can be recorded on a computer-readable recording medium. The computer-readable recording medium may be of any kind, for example a magnetic recording device, an optical disc, a magneto-optical recording medium, or a semiconductor memory.

The program is distributed, for example, by selling, transferring, or lending a portable recording medium such as a DVD or CD-ROM on which the program is recorded. The program may also be distributed by storing it in a storage device of a server computer and transferring it from the server computer to other computers via a network.

A computer that executes such a program first stores, for example, the program recorded on the portable recording medium or transferred from the server computer in its own storage unit. When executing a process, the computer reads the program stored in its own storage unit and executes the process according to the program it has read. As another form of execution, the computer may read the program directly from the portable recording medium and execute processing according to it. The computer may also execute processing according to the received program each time a program is transferred to it from the server computer. Alternatively, the above processing may be executed by a so-called ASP (Application Service Provider) type service, in which the program is not transferred from the server computer to the computer and the processing functions are realized only through execution instructions and retrieval of results. The program here includes information that is provided for processing by an electronic computer and that is equivalent to a program (such as data that is not a direct command to the computer but has the property of defining the computer's processing).

Although each device is configured above by executing a predetermined program on a computer, at least part of the processing content may be realized in hardware.

Claims

An utterance rhythm transformation matrix generation device comprising:
a non-negative spatio-temporal decomposition unit that, where a time function is a function indicating the relationship between time and the weight for a phoneme, obtains, by a non-negative spatio-temporal decomposition method, a first time function that is the time function for a first vocal tract spectrum of first speech data; and
a transformation matrix generation unit that, where a second time function is a time function indicating the relationship between time and the weights for phonemes corresponding to those of the first time function, and a second vocal tract spectrum is the vocal tract spectrum corresponding to the second time function, generates, using the first time function and the second time function, a transformation matrix that is a matrix for converting the utterance rhythm of the first vocal tract spectrum into the utterance rhythm of the second vocal tract spectrum.

The utterance rhythm transformation matrix generation device according to claim 1, wherein,
letting T be the time length of the first vocal tract spectrum, S be the time length of the second vocal tract spectrum, the transformation matrix be a matrix of S rows and T columns, s = 1, …, S, t = 1, …, T, G(s,t) be the element in row s and column t of the transformation matrix, K be the total number of phonemes contained in the first vocal tract spectrum, k (k = 1, …, K) be the phoneme number assigned to each phoneme in the time series of the first vocal tract spectrum, φ_k(t) be the first time function corresponding to the k-th phoneme at time t, ω_k(s) be the second time function corresponding to the k-th phoneme at time s, t_k be the center time of the k-th phoneme in the first vocal tract spectrum, and s_k be the center time of the k-th phoneme in the second vocal tract spectrum,
the transformation matrix generation unit generates the transformation matrix defined by

[equation]

The utterance rhythm transformation matrix generation device according to claim 1, wherein,
letting T be the time length of the first vocal tract spectrum, S be the time length of the second vocal tract spectrum, the transformation matrix be a matrix of S rows and T columns, s = 1, …, S, t = 1, …, T, G(s,t) be the element in row s and column t of the transformation matrix, K be the total number of phonemes contained in the first vocal tract spectrum, k (k = 1, …, K) be the phoneme number assigned to each phoneme in the time series of the first vocal tract spectrum, φ_k(t) be the first time function corresponding to the k-th phoneme at time t, ω_k(s) be the second time function corresponding to the k-th phoneme at time s, t_k be the center time of the k-th phoneme in the first vocal tract spectrum, s_k be the center time of the k-th phoneme in the second vocal tract spectrum, and t^*(s) be the t (where t_{k-1} ≤ t ≤ t_k) that minimizes |φ_k(t) − ω_k(s)| for time s in s_{k-1} ≤ s ≤ s_k,
the transformation matrix generation unit, with

[equation]

generates the transformation matrix defined as

[equation]

and, otherwise,
G(s, t^*(s)) = 0.

The utterance rhythm transformation matrix generation device according to claim 1, wherein,
letting T be the time length of the first vocal tract spectrum, S be the time length of the second vocal tract spectrum, the transformation matrix be a matrix of S rows and T columns, s = 1, …, S, t = 1, …, T, G(s,t) be the element in row s and column t of the transformation matrix, K be the total number of phonemes contained in the first vocal tract spectrum, k (k = 1, …, K) be the phoneme number assigned to each phoneme in the time series of the first vocal tract spectrum, φ_k(t) be the first time function corresponding to the k-th phoneme at time t, ω_k(s) be the second time function corresponding to the k-th phoneme at time s, t_k be the center time of the k-th phoneme in the first vocal tract spectrum, s_k be the center time of the k-th phoneme in the second vocal tract spectrum, t^*(s) be the t (where t_{k-1} ≤ t ≤ t_k) that minimizes |φ_k(t) − ω_k(s)| for time s in s_{k-1} ≤ s ≤ s_k, a be any integer of 1 or more, t^*(s) − a ≤ τ ≤ t^*(s) + a, and α be a weighting coefficient for the constraint that the weights sum to 1,
the transformation matrix generation unit updates the elements G(s,t), until a predetermined condition is satisfied, by

[equation]

or

[equation]

or

[equation]

and thereby generates the transformation matrix.

The utterance rhythm transformation matrix generation device according to claim 1, wherein,
letting T be the time length of the first vocal tract spectrum, S be the time length of the second vocal tract spectrum, the transformation matrix be a matrix of S rows and T columns, s = 1, …, S, t = 1, …, T, K(s,t) be the element in row s and column t of the transformation matrix, K be the total number of phonemes contained in the first vocal tract spectrum, k (k = 1, …, K) be the phoneme number assigned to each phoneme in the time series of the first vocal tract spectrum, φ_k(t) be the first time function corresponding to the k-th phoneme at time t, ω_k(s) be the second time function corresponding to the k-th phoneme at time s, t_k be the center time of the k-th phoneme in the first vocal tract spectrum, and s_k be the center time of the k-th phoneme in the second vocal tract spectrum,
with

[equation]

letting t^*(s) be the t (where t_{k-1} ≤ t ≤ t_k) that minimizes |φ_k(t) − ω_k(s)| for time s in s_{k-1} ≤ s ≤ s_k,
with

[equation]

and

[equation]

and, otherwise,
G₂(s, t^*(s)) = 0,
letting a be any integer of 1 or more, t^*(s) − a ≤ τ ≤ t^*(s) + a, and α be a weighting coefficient for the constraint that the weights sum to 1, G₃(s,t) being updated, until a predetermined condition is satisfied, by

[equation]

or

[equation]

or

[equation]

with

[equation]

and with w₁ ≥ 0, w₂ ≥ 0, w₃ ≥ 0, and 0 ≤ w₁ + w₂ + w₃ ≤ 1,
the transformation matrix generation unit generates the transformation matrix defined by

[equation]

An utterance rhythm conversion device that uses a transformation matrix generated by the utterance rhythm transformation matrix generation device according to any one of claims 1 to 5, comprising:
an utterance rhythm conversion unit that multiplies the first vocal tract spectrum by the transformation matrix to obtain a converted first vocal tract spectrum, and multiplies the sound source signal of the first speech data by the transformation matrix to obtain a converted sound source signal; and
a speech synthesis unit that performs speech synthesis using the converted sound source signal and the converted first vocal tract spectrum.

An utterance rhythm transformation matrix generation method comprising:
a non-negative spatio-temporal decomposition step of, where a time function is a function indicating the relationship between time and the weight for a phoneme, obtaining, by a non-negative spatio-temporal decomposition method, a first time function that is the time function for a first vocal tract spectrum of first speech data; and
a transformation matrix generation step of, where a second time function is a time function indicating the relationship between time and the weights for phonemes corresponding to those of the first time function, and a second vocal tract spectrum is the vocal tract spectrum corresponding to the second time function, generating, using the first time function and the second time function, a transformation matrix that is a matrix for converting the utterance rhythm of the first vocal tract spectrum into the utterance rhythm of the second vocal tract spectrum.

A program for causing a computer to function as the utterance rhythm transformation matrix generation device according to any one of claims 1 to 5 or as the utterance rhythm conversion device according to claim 6.