JP7274184B2

JP7274184B2 - A neural vocoder that implements a speaker-adaptive model to generate a synthesized speech signal and a training method for the neural vocoder

Info

Publication number: JP7274184B2
Application number: JP2021540067A
Authority: JP
Inventors: ソン，ウンウー; キム，ジンソプ; ビョン，キョングン; カン，ホング
Original assignee: Naver Corp
Current assignee: Naver Corp
Priority date: 2019-01-11
Filing date: 2019-08-16
Publication date: 2023-05-16
Anticipated expiration: 2039-08-16
Also published as: WO2020145472A1; JP2023089256A; JP2022516784A

Description

Article 30, Paragraph 2 of the Patent Act applied: disclosed on August 16, 2018, as "DEEP LEARNING-BASED SPEECH SYNTHESIS SYSTEM" at the 35th Speech Communication and Signal Processing Conference (2018) of the Acoustical Society of Korea, held at Yonsei University in Seoul, Republic of Korea.

Article 30, Paragraph 2 of the Patent Act applied: published on November 8, 2018, on the website (https://arxiv.org/abs/1811.03311) as "SPEAKER-ADAPTIVE NEURAL VOCODERS FOR STATISTICAL PARAMETRIC SPEECH SYNTHESIS SYSTEMS".

The following description relates to a method of generating a synthesized speech signal using a neural vocoder, to the neural vocoder itself, and to a method of training the neural vocoder.

The following description also relates to a neural vocoder that uses a speaker-adaptive model to generate a synthesized speech signal of a target speaker, and to a method of training the neural vocoder to implement the speaker-adaptive model.

Speech synthesis is a technology for producing synthesized sound similar to human speech based on input data. As one example, text-to-speech (TTS) converts input text into human speech.

Such synthesized speech is generated by a vocoder, which generates a speech signal based on input acoustic parameters. In recent years, with advances in artificial intelligence and deep learning, neural vocoders that use neural networks to generate synthesized speech have been proposed. A neural vocoder is trained, speaker-independently or speaker-dependently, on speech data from speakers, and uses the result of the training to generate a synthesized speech signal for input acoustic parameters.

For a neural vocoder to generate a synthesized speech signal corresponding to a particular target speaker, the neural vocoder must be trained using speech data of that target speaker. Generating a synthesized speech signal of at least a certain quality generally requires several hours or more of speech data containing recordings of the target speaker. If the speech data is insufficient, the quality of the generated synthesized speech signal degrades or distortion occurs. When the target speaker is not an ordinary person but a celebrity such as an entertainer or other public figure, it is often difficult to secure several hours or more of voice recordings as training data. There is therefore a need for a neural vocoder system that can improve the quality of the synthesized speech signal while minimizing the amount of target-speaker speech data used for training.

Meanwhile, when a synthesized speech signal is generated based on a speech signal, the speech signal has dynamic characteristics, so it is difficult for a neural network (for example, a CNN) to capture those characteristics completely. In particular, spectral distortion tends to occur in the high-frequency region of the speech signal, which can degrade the quality of the synthesized speech signal. There is therefore also a need for a neural vocoder system that can improve the quality of the synthesized speech signal by reducing spectral distortion in the high-frequency region, and that can simplify the process of training on speech data.

Patent Document 1 (Korean Patent Application Publication No. 10-2018-0113325, published October 16, 2018) describes a speech synthesis apparatus and method that provide a function of synthesizing a modulated speech waveform by encoding the speech model of a speech synthesizer, converting the speech model code, and decoding the speech model, so that the voice of the synthesized sound is modulated as intended by a developer or user when the apparatus synthesizes a speech waveform.

The information described above is provided only to aid understanding of the present invention. It may include content that does not form part of the prior art, and it may not include content that the prior art could present to a person of ordinary skill in the art.

Korean Patent Application Publication No. 10-2018-0113325

One object is to provide a speech signal generation method performed by a neural vocoder that obtains a plurality of acoustic parameters including spectrum-related parameters and excitation-related parameters, estimates an excitation signal based on the plurality of acoustic parameters, and generates a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters.

Another object is to provide a neural vocoder training method that sets, as initial values, weights from a source model trained speaker-independently on speech datasets from a plurality of speakers, and generates updated weights by training on a speech dataset from a target speaker starting from those initial values.

In one aspect, there is provided a speech signal generation method performed by a computer-implemented neural vocoder, the method including: obtaining a plurality of acoustic parameters including spectrum-related parameters (spectral parameters) and excitation-related parameters classified according to the periodicity of the excitation; estimating an excitation signal based on the plurality of acoustic parameters; and generating a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters.

The excitation-related parameters may include a first excitation parameter indicating excitation at or below a predetermined cutoff frequency and a second excitation parameter indicating excitation exceeding the cutoff frequency.

The first excitation parameter may indicate the harmonic spectrum of the excitation, and the second excitation parameter may indicate the remaining part of the excitation.
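The cutoff-based split described above can be pictured as a partition of the excitation spectrum into two bands. The following is a minimal sketch only: the function name `split_excitation_by_cutoff`, the bin frequencies, the magnitudes, and the 2000 Hz cutoff are all hypothetical, and the text here does not say how the two excitation parameters are actually computed.

```python
def split_excitation_by_cutoff(freqs_hz, magnitudes, cutoff_hz):
    """Partition spectral bins of one excitation frame at cutoff_hz.

    Bins at or below the cutoff go to the first (harmonic) group;
    bins above the cutoff go to the second (remaining) group.
    """
    pairs = list(zip(freqs_hz, magnitudes))
    low = [(f, m) for f, m in pairs if f <= cutoff_hz]
    high = [(f, m) for f, m in pairs if f > cutoff_hz]
    return low, high


# Hypothetical frame: five bins with made-up magnitudes.
freqs = [0.0, 1000.0, 2000.0, 3000.0, 4000.0]
mags = [0.9, 0.8, 0.5, 0.2, 0.1]
low_band, high_band = split_excitation_by_cutoff(freqs, mags, cutoff_hz=2000.0)
```

The bin exactly at the cutoff is placed in the first group, matching the "at or below" wording of the claim.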

The spectrum-related parameters may include a frequency parameter indicating the pitch of the speech signal, an energy parameter indicating the energy of the speech signal, a parameter indicating whether the speech signal is voiced or unvoiced, and a parameter indicating the line spectral frequencies (LSF) of the speech signal.

Generating the target speech signal may include converting the parameter indicating the LSF into linear predictive coding (LPC) coefficients, and applying, to the estimated excitation signal, the linear synthesis filter based on the converted LPC.
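The two steps above (LSF to LPC conversion, then linear synthesis filtering of the excitation) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the function names are hypothetical, the LSFs are assumed to be given in ascending order in radians with an even order p, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is chosen here for illustration.

```python
import math


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lsf_to_lpc(lsf):
    """Convert ascending LSFs (radians, even count p) to [1, a_1, ..., a_p]."""
    if len(lsf) % 2 != 0:
        raise ValueError("this sketch assumes an even LPC order")
    p_poly = [1.0, 1.0]    # symmetric polynomial P(z), fixed root at z = -1
    q_poly = [1.0, -1.0]   # antisymmetric polynomial Q(z), fixed root at z = +1
    for i, w in enumerate(lsf):
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if i % 2 == 0:     # 1st, 3rd, ... LSFs are roots of P(z)
            p_poly = poly_mul(p_poly, factor)
        else:              # 2nd, 4th, ... LSFs are roots of Q(z)
            q_poly = poly_mul(q_poly, factor)
    # A(z) = (P(z) + Q(z)) / 2; the z^-(p+1) terms cancel, so drop the last one.
    return [(pc + qc) / 2.0 for pc, qc in zip(p_poly, q_poly)][:-1]


def synthesis_filter(excitation, lpc):
    """All-pole filter 1/A(z): s[n] = e[n] - sum_k a_k * s[n-k]."""
    speech = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(lpc)):
            if n - k >= 0:
                acc -= lpc[k] * speech[n - k]
        speech.append(acc)
    return speech


# Hypothetical second-order example.
lpc = lsf_to_lpc([math.pi / 2, 2 * math.pi / 3])       # approximately [1, 0.5, 0.5]
speech = synthesis_filter([1.0, 0.0, 0.0], [1.0, 0.5, 0.5])
```

In a full vocoder the excitation passed to `synthesis_filter` would come from the trained network rather than being a unit impulse, and the LPC coefficients would be updated frame by frame.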

The plurality of acoustic parameters may be generated by an acoustic model based on input text or an input speech signal.

The neural vocoder may be one trained on a speech signal input for training. The training may include separating an excitation signal from the input speech signal by applying a linear prediction analysis filter to the input speech signal, and modeling the probability distribution of the separated excitation signal. Estimating the excitation signal may use the modeled probability distribution of the excitation signal to estimate the excitation signal for the plurality of acoustic parameters.

Separating the excitation signal may include converting a parameter indicating the LSF of the input speech signal into linear predictive coding (LPC) coefficients, and applying, to the input speech signal, the linear prediction analysis filter based on the converted LPC of the input speech signal.

The separated excitation signal may be a residual component of the input speech signal.
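The separation step described above can be sketched as inverse filtering with the LPC polynomial, which leaves the prediction residual. This is a minimal sketch under assumptions: the function names are hypothetical, a single fixed set of LPC coefficients with the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p is used instead of frame-wise coefficients, and the subsequent step of modeling the residual's probability distribution is not shown.

```python
def lp_analysis_filter(speech, lpc):
    """Inverse-filter speech with A(z): e[n] = s[n] + sum_k a_k * s[n-k]."""
    residual = []
    for n in range(len(speech)):
        acc = speech[n]
        for k in range(1, len(lpc)):
            if n - k >= 0:
                acc += lpc[k] * speech[n - k]
        residual.append(acc)
    return residual


def lp_synthesis_filter(excitation, lpc):
    """All-pole filter 1/A(z): s[n] = e[n] - sum_k a_k * s[n-k]."""
    speech = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, len(lpc)):
            if n - k >= 0:
                acc -= lpc[k] * speech[n - k]
        speech.append(acc)
    return speech


# Hypothetical data: a short "speech" segment and second-order coefficients.
lpc = [1.0, 0.5, 0.5]
speech = [1.0, -0.5, -0.25, 0.375]
residual = lp_analysis_filter(speech, lpc)        # the separated excitation
rebuilt = lp_synthesis_filter(residual, lpc)      # synthesis undoes analysis
```

Because the analysis and synthesis filters are exact inverses of each other, a network that predicts the residual well can be paired with the synthesis filter to recover the speech waveform.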

In another aspect, there is provided a computer-implemented method of training a neural vocoder, the method including: receiving an input of a speech signal; extracting, from the input speech signal, a plurality of acoustic parameters including spectrum-related parameters and excitation-related parameters classified according to the periodicity of the excitation; separating an excitation signal from the input speech signal by applying, to the input speech signal, a linear prediction analysis filter based on at least one of the spectrum-related parameters; and modeling the probability distribution of the separated excitation signal.

Separating the excitation signal may include converting, among the spectrum-related parameters, a parameter indicating the LSF of the input speech signal into LPC, and applying, to the input speech signal, the linear prediction analysis filter based on the converted LPC of the input speech signal.

In yet another aspect, there is provided a neural vocoder including: a parameter acquisition unit that obtains a plurality of acoustic parameters including spectrum-related parameters (spectral parameters) and excitation-related parameters classified according to the periodicity of the excitation; an excitation signal estimation unit that estimates an excitation signal based on the plurality of acoustic parameters; and a speech signal generation unit that generates a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters.

The speech signal generation unit may include a conversion unit that converts, among the spectrum-related parameters, a parameter indicating the LSF of the speech signal into linear predictive coding (LPC) coefficients, and may apply, to the estimated excitation signal, the linear synthesis filter based on the converted LPC.

The neural vocoder may be one trained on a speech signal input for training. The neural vocoder may further include an excitation signal separation unit that separates an excitation signal from the input speech signal by applying a linear prediction analysis filter to the input speech signal, and a modeling unit that models the probability distribution of the separated excitation signal. The excitation signal estimation unit may use the modeled probability distribution of the excitation signal to estimate the excitation signal for the plurality of acoustic parameters.

The excitation signal separation unit may include a conversion unit that converts a parameter indicating the LSF of the input speech signal into linear predictive coding (LPC) coefficients, and may apply, to the input speech signal, the linear prediction analysis filter based on the converted LPC of the input speech signal.

In yet another aspect, there is provided a computer-implemented method of training a neural vocoder, the method including: setting, as initial values, weights from a source model trained on speech datasets from a plurality of speakers; and generating updated weights by training on a speech dataset from a target speaker starting from the initial values, wherein the updated weights are used to generate a synthesized speech signal corresponding to the target speaker.

The weights from the source model may be values representing global characteristics that are not distinguished by speaker among the speakers included in the speech datasets. Generating the updated weights may include adjusting the weights from the source model so that they reflect the unique characteristics of the target speaker contained in the speech dataset from the target speaker.

Each of the speech datasets from the plurality of speakers may be larger than the speech dataset from the target speaker.

The training method may further include building a source model that is trained speaker-independently on the speech datasets from the plurality of speakers, and obtaining the weights from the source model. The source model may be used as an initializer of a model for training on the speech dataset from the target speaker.
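The idea of using the source model as an initializer can be illustrated with a deliberately tiny stand-in. In the sketch below a single scalar weight `w` in `y = w * x` stands in for the network weights, plain gradient descent stands in for the actual training procedure, and all data values, step counts, and learning rates are hypothetical. It shows only the general pattern (pretrain on pooled multi-speaker data, then fine-tune briefly on a small target-speaker set), not the patent's network or loss.

```python
def mse(w, data):
    """Mean squared error of the scalar model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)


def train(w_init, data, steps, lr):
    """Plain gradient descent on the mean squared error, starting from w_init."""
    w = w_init
    for _ in range(steps):
        grad = sum(2.0 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Pooled "multi-speaker" data: three speakers with slopes 1.8, 2.0 and 2.2.
xs = [0.5, 1.0, 1.5, 2.0]
source_data = [(x, s * x) for s in (1.8, 2.0, 2.2) for x in xs]

# Small "target speaker" dataset with slope 2.1.
target_data = [(x, 2.1 * x) for x in (1.0, 2.0)]

# Speaker-independent source model (long training on the large pooled set).
w_source = train(0.0, source_data, steps=200, lr=0.05)

# Adaptation: initialize from the source weight, then take a few steps.
w_adapted = train(w_source, target_data, steps=5, lr=0.05)

# Baseline: the same few steps from an uninformed initial value.
w_scratch = train(0.0, target_data, steps=5, lr=0.05)
```

With the same small training budget, the adapted weight ends up much closer to the target speaker's slope than the weight trained from the uninformed start, which is the effect the initialization is meant to achieve.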

There is also provided a speech signal generation method performed by a neural vocoder trained by the training method, the method including: obtaining a plurality of acoustic parameters that are generated by an acoustic model based on input text or an input speech signal and that include spectrum-related parameters (spectral parameters) and excitation-related parameters classified according to the periodicity of the excitation; estimating an excitation signal based on the plurality of acoustic parameters; and generating a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters, wherein the target speech signal is synthesized speech corresponding to the target speaker.

Generating the target speech signal may include converting, among the spectrum-related parameters, a parameter indicating the LSF of the speech signal into linear predictive coding (LPC) coefficients, and applying, to the estimated excitation signal, the linear synthesis filter based on the converted LPC.

Estimating the excitation signal may use a modeled probability distribution of the excitation signal to estimate the excitation signal for the plurality of acoustic parameters. The modeling of the probability distribution of the excitation signal may be performed by a method including separating an excitation signal from a speech signal input for training by applying a linear prediction analysis filter to the input speech signal, and modeling the probability distribution of the separated excitation signal.

In yet another aspect, there is provided a neural vocoder including a speaker-adaptive model building unit that builds a speaker-adaptive model, the speaker-adaptive model setting, as initial values, weights from a source model trained speaker-independently on speech datasets from a plurality of speakers, and generating updated weights by training on a speech dataset from a target speaker starting from the initial values, wherein the updated weights generated by the speaker-adaptive model are used to generate synthesized speech corresponding to the target speaker.

The neural vocoder may further include a source model building unit that builds a source model trained speaker-independently on the speech datasets from the plurality of speakers. The source model may operate as an initializer of a model for training on the speech dataset from the target speaker.

The neural vocoder may further include: a parameter acquisition unit that obtains a plurality of acoustic parameters that are generated by an acoustic model based on input text or an input speech signal and that include spectrum-related parameters (spectral parameters) and excitation-related parameters classified according to the periodicity of the excitation; an excitation signal estimation unit that estimates an excitation signal based on the plurality of acoustic parameters; and a speech signal generation unit that generates a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters. The target speech signal may be synthesized speech corresponding to the target speaker.

The neural vocoder may further include an excitation signal separation unit that separates an excitation signal from a speech signal input for training by applying a linear prediction analysis filter to the input speech signal, and a modeling unit that models the probability distribution of the separated excitation signal. The excitation signal estimation unit may use the modeled probability distribution of the excitation signal to estimate the excitation signal for the plurality of acoustic parameters.

Because the neural vocoder performs estimation with the excitation signal as its target, and the target speech signal is generated by applying a linear prediction filter to the estimated excitation signal, the quality of the generated target speech signal can be improved. In particular, spectral distortion in the high-frequency region of the speech signal can be reduced.

By training on the speech dataset from the target speaker using, as initial values, weights from a speaker-independently trained source model rather than random values, high-quality synthesized speech (a synthesized speech signal) of the target speaker can be generated even when training on only a relatively small (that is, short-duration) speech dataset.

A diagram showing a method of generating a synthesized speech signal based on input text or an input speech signal, in one embodiment.
A block diagram showing the structure of a neural vocoder system, in one embodiment.
A block diagram showing the structure of a processor of the neural vocoder system, in one embodiment.
A flowchart showing a speech signal generation method, in one embodiment.
A flowchart showing a method of training a neural vocoder, in one embodiment.
A diagram showing a method of building a speaker-adaptive model and generating synthesized speech of a target speaker, in one embodiment.
A block diagram showing the structure of a processor of a neural vocoder, in one embodiment.
A flowchart showing a method of training a neural vocoder to build a speaker-adaptive model, in one embodiment.
A diagram showing a speech signal, an excitation signal, and the relationship between them, in one example.
A diagram showing a statistical parametric speech synthesis (SPSS) system for generating a synthesized speech signal, using a different type of vocoder in each case.
A diagram showing a statistical parametric speech synthesis (SPSS) system for generating a synthesized speech signal, using a different type of vocoder in each case.
A diagram showing a statistical parametric speech synthesis (SPSS) system for generating a synthesized speech signal, using a different type of vocoder in each case.
A diagram showing a method of training a neural vocoder by separating an excitation signal from a speech signal input for training, in one embodiment.
A diagram showing a method of generating a synthesized speech signal by estimating an excitation signal from acoustic parameters generated by an acoustic model based on input text, in one embodiment.
A diagram showing a method of training a neural vocoder by separating an excitation signal from a speech signal input for training, in one embodiment.
A diagram showing a method of generating a synthesized speech signal by estimating an excitation signal from acoustic parameters generated by an acoustic model based on input text, in one embodiment.
A graph showing, in one example, the difference in negative log-likelihood (NLL) obtained during training and during synthesized speech signal generation, depending on whether parameters classified according to the periodicity of the excitation are used as acoustic parameters.
A diagram showing, in one example, speaker-dependent and speaker-independent features of speech signals from a plurality of speakers.
A diagram showing, in one example, a method of generating synthesized speech of a target speaker using a source model built by training on speech datasets from a plurality of speakers and a speaker-adaptive model built by training on a speech dataset from the target speaker.
A diagram showing, in one example, the results of a comparative evaluation of the quality of synthesized speech signals generated with and without the speaker adaptation algorithm.
A diagram showing, in one example, the results of a comparative evaluation of the quality of synthesized speech signals generated with and without the speaker adaptation algorithm.
A diagram showing, in one example, the results of a mean opinion score (MOS) evaluation of the ExcitNet vocoder and other vocoders.
A diagram showing, in one example, the change in performance of a neural vocoder that builds a speaker-adaptive model when the F0 scaling factor is varied.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram showing a method of generating a synthesized speech signal based on input text or an input speech signal, in one embodiment.

A speech signal is a signal that represents speech. In the following detailed description, for convenience, the terms "speech signal" and "speech" may be used interchangeably.

Acoustic model 110 may generate one or more acoustic parameters from text or a speech signal input for generating a synthesized speech signal. Acoustic model 110 may be one designed for a deep-learning-based statistical parametric speech synthesis (SPSS) system. Acoustic model 110 may consist of multiple feed-forward layers and long short-term memory (LSTM) layers trained to represent a non-linear mapping function between linguistic input and acoustic output parameters. Acoustic model 110 may be, for example, a DNN TTS module. An acoustic parameter may be a feature used to generate a synthesized speech signal, or a parameter used to construct such a feature.

Vocoder 120 may generate a synthesized speech signal by converting the acoustic parameters generated by acoustic model 110 into a speech signal. Vocoder 120 may be a neural vocoder. The neural vocoder may be one trained as a deep learning model. The neural vocoder may be, for example, WaveNet, SampleRNN, or WaveRNN. The neural vocoder is not limited to these and may be any general generative model.

The term "neural vocoder" may be used to denote an apparatus that includes a model trained to generate a (synthesized) speech signal (for example, WaveNet, SampleRNN, WaveRNN, or a general model) together with various filters.

Vocoder 120 may estimate the excitation signal of a speech signal based on the acoustic parameters obtained from acoustic model 110. That is, the excitation signal of the speech signal may be the target of vocoder 120.

The excitation signal is the component of the speech signal that represents the vibration of the voice, and may be distinguished from the component that represents changes in the speech signal caused by the shape of the speaker's mouth (the spectral component). Changes in the excitation signal may be constrained only by the movement of the speaker's vocal cords. The excitation signal may be the residual signal of the speech signal.

A target speech signal (that is, a synthesized speech signal) may be generated by applying, to the excitation signal estimated by vocoder 120, a linear prediction filter generated based on the acoustic parameters that indicate the spectral component of the speech signal.

Because vocoder 120 targets the excitation signal rather than the speech signal, and the target speech signal is generated by applying a linear prediction filter to the estimated excitation signal, the quality of the generated target speech signal can be improved. In particular, spectral distortion in the high-frequency region of the speech signal can be reduced.
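The relationship between the excitation signal and the speech signal described above can be written with the standard linear-prediction (source-filter) equations. The notation below (prediction order p, coefficients a_k, speech samples s[n], excitation samples e[n], and the sign convention of A(z)) is assumed here for illustration and is not taken from the patent text.

```latex
\begin{aligned}
A(z) &= 1 + \sum_{k=1}^{p} a_k z^{-k} \\
e[n] &= s[n] + \sum_{k=1}^{p} a_k \, s[n-k]
  && \text{(analysis filter: speech to excitation)} \\
s[n] &= e[n] - \sum_{k=1}^{p} a_k \, s[n-k]
  && \text{(synthesis filter: excitation to speech)}
\end{aligned}
```

Under this notation the analysis filter has transfer function A(z) and the synthesis filter has transfer function 1/A(z), so a vocoder that estimates e[n] can recover s[n] by filtering alone once the coefficients a_k are known.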

A more specific method of generating the target speech signal by estimating the excitation signal, and a more specific method of training the neural vocoder to estimate the excitation signal, are described in more detail below with reference to FIGS. 2 to 5.

FIG. 2 is a block diagram illustrating the structure of a neural vocoder system according to an embodiment.

A more detailed configuration of neural vocoder system 200 is described with reference to FIG. 2. The illustrated neural vocoder system 200 may represent a computer (computer system) configured to include a neural vocoder.

Neural vocoder system 200 may be a fixed terminal or a mobile terminal implemented by a computer system. For example, neural vocoder system 200 may be implemented by an AI speaker, a smartphone, a mobile phone, a navigation device, a PC (personal computer), a notebook PC, a digital broadcasting terminal, a PDA (Personal Digital Assistant), a PMP (Portable Multimedia Player), a tablet, a game console, a wearable device, an IoT (Internet of Things) device, a VR (Virtual Reality) device, an AR (Augmented Reality) device, or the like. Neural vocoder system 200 may also be implemented by a server or other computing device that communicates with such terminals over a network.

Neural vocoder system 200 may include memory 210, processor 220, communication module 230, and an input/output interface. Memory 210 is a non-transitory computer-readable recording medium and may include RAM (random access memory) and permanent mass storage devices such as ROM (read only memory), a disk drive, an SSD (solid state drive), and flash memory. Here, a permanent mass storage device such as ROM, an SSD, flash memory, or a disk drive may instead be included in neural vocoder system 200 as a separate permanent storage device distinct from memory 210. Memory 210 may also store an operating system and at least one program code (for example, code for a browser installed and executed in neural vocoder system 200, or for an application installed in neural vocoder system 200 to provide a particular service). Such software components may be loaded from a computer-readable recording medium separate from memory 210. Such a separate computer-readable recording medium may include a floppy drive, a disk, a tape, a DVD/CD-ROM drive, a memory card, or the like. In other embodiments, software components may be loaded into memory 210 through communication module 230 rather than from a computer-readable recording medium. For example, at least one program may be loaded into memory 210 based on a computer program installed from files provided by developers or by a file distribution system (for example, an external server) that distributes application installation files.

Processor 220 may be configured to process computer program instructions by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to processor 220 by memory 210 or communication module 230. For example, processor 220 may be configured to execute instructions received according to program code stored in a storage device such as memory 210.

Communication module 230 may provide functionality for neural vocoder system 200 to communicate with other electronic devices or other servers over a network. Communication module 230 may be a hardware module of neural vocoder system 200, such as a network interface card, a network interface chip, or a networking interface port, or a software module such as a network device driver or a networking program.

Input/output interface 240 may be a means for interfacing with an input/output device (not shown). For example, input devices may include a keyboard, a mouse, a microphone, and a camera, and output devices may include a display, a speaker, and a haptic feedback device. As another example, input/output interface 240 may be a means for interfacing with a device in which input and output functions are integrated, such as a touch screen. Input/output device 215 may be a component of neural vocoder system 200. When neural vocoder system 200 is implemented as a server, it need not include the input/output device or the input/output interface.

In other embodiments, neural vocoder system 200 may include more components than those shown in the figure. However, most conventional components need not be shown explicitly, so they are omitted.

With reference to FIG. 3, and focusing on a more detailed configuration of processor 220, the following describes a method of generating the target speech signal by estimating the excitation signal and a method of training the neural vocoder to estimate the excitation signal.

The description of the technical features given above with reference to FIG. 1 applies equally to FIG. 2, so redundant description is omitted.

FIG. 3 is a block diagram illustrating the structure of the processor of the neural vocoder system according to an embodiment.

Each of components 310 to 340 of processor 220 described below may be implemented by one or more software modules and/or hardware modules. Depending on the embodiment, components of processor 220 may be selectively included in or excluded from processor 220. Also, depending on the embodiment, components of processor 220 may be separated or merged to express the functions of processor 220.

The components of processor 220 may be representations of different functions that processor 220 performs according to instructions provided by program code stored in neural vocoder system 200.

Parameter acquisition unit 310 of processor 220 may acquire a plurality of acoustic parameters, including spectrum-related parameters and excitation-related parameters that are classified according to the periodicity of the excitation. The acoustic parameters acquired by parameter acquisition unit 310 may have been generated by an acoustic model based on text input by a user or a speech signal input by a speaker.

Excitation signal estimator 320 of processor 220 may estimate an excitation signal based on the plurality of acoustic parameters. Excitation signal estimator 320 (the neural vocoder) may have been trained on speech signals input for training. Excitation signal estimator 320 may estimate the excitation signal for the plurality of acoustic parameters using the probability distribution of the excitation signal modeled through training.

Processor 220 may include component 340 for training the neural vocoder. Excitation signal separator 342 of processor 220 may separate the excitation signal from a speech signal input for training by applying a linear prediction analysis filter to that speech signal. Excitation signal separator 342 may include conversion unit 343, which converts the parameter representing the line spectral frequency (LSF) of the speech signal input for training into linear predictive coding (LPC) coefficients. The linear prediction analysis filter is based on the parameter representing the LSF and may be generated from the converted LPC. Modeling unit 344 of processor 220 may model the probability distribution of the separated excitation signal.

Speech signal generator 330 of processor 220 may generate the target speech signal by applying, to the excitation signal estimated by excitation signal estimator 320, a linear (prediction) synthesis filter based on at least one of the spectrum-related parameters. The target speech signal may be a synthesized speech signal.

Speech signal generator 330 may include conversion unit 332, which converts the parameter representing the LSF of the speech signal, among the acquired spectrum-related parameters, into linear predictive coding (LPC) coefficients. The linear prediction synthesis filter is based on the parameter representing the LSF of the speech signal among the acquired spectrum-related parameters, and may be generated from the converted LPC. In other words, speech signal generator 330 may generate the target speech signal by applying the linear prediction synthesis filter based on the converted LPC to the estimated excitation signal.

A more specific method of generating the target speech signal by estimating the excitation signal is described in more detail with reference to FIG. 4, and a more specific method of training the neural vocoder to estimate the excitation signal is described in more detail with reference to FIG. 5.

The description of the technical features given above with reference to FIGS. 1 and 2 applies equally to FIG. 3, so redundant description is omitted.

FIG. 4 is a flowchart illustrating a speech signal generation method according to an embodiment.

At step 410, parameter acquisition unit 310 may acquire a plurality of acoustic parameters, including spectrum-related parameters and excitation-related parameters classified according to the periodicity of the excitation. The acoustic parameters acquired by parameter acquisition unit 310 may have been generated by an acoustic model based on text input by a user or a speech signal input by a speaker. That is, parameter acquisition unit 310 may receive the plurality of acoustic parameters from the acoustic model.

A spectrum-related parameter may be a parameter representing a spectral component of the speech signal. An excitation-related parameter may be a parameter representing a component corresponding to the residual signal (excitation signal) obtained by removing the spectral components from the speech signal. The spectral-component signal may represent the part of the speech signal that changes according to the shape of the speaker's mouth. The excitation signal may represent the part of the speech signal that reflects the vibration of the voice. Changes in the excitation signal may be constrained only by the movement of the speaker's vocal cords.

The spectrum-related parameters may include, for example, a frequency parameter (F0) representing the pitch of the speech signal, an energy parameter representing the energy of the speech signal (for example, a parameter representing gain), a parameter (v/uv) indicating whether the speech signal is voiced or unvoiced, and a parameter representing the line spectral frequency (LSF) of the speech signal.

The excitation-related parameters may include parameters classified according to the periodicity of the excitation. The excitation-related parameters may be, for example, TFTE (Time-Frequency Trajectory Excitation) parameters. The TFTE may represent the spectral shape of the excitation along the frequency axis and the evolution of that shape along the time axis. The excitation-related parameters may include a first excitation parameter (the SEW (Slowly Evolving Waveform) parameter), which represents the component of the excitation signal that changes more slowly on the time-frequency axis, and a second excitation parameter (the REW (Rapidly Evolving Waveform) parameter), which represents the component of the excitation signal that changes more rapidly on the time-frequency axis.

The first excitation parameter may represent excitation at or below a predetermined cutoff frequency, and the second excitation parameter may represent excitation above the cutoff frequency. The first excitation parameter may represent the harmonic spectrum of the excitation, and the second excitation parameter may represent the remaining part of the excitation. For example, the first excitation parameter (SEW parameter), which corresponds to the harmonic excitation spectrum, may be obtained by low-pass filtering each frequency component of the TFTE along the time axis (at the predetermined cutoff frequency). The residual noise spectrum above the predetermined cutoff frequency may be obtained as the second excitation parameter (REW parameter) by subtracting the SEW from the TFTE. Using the first excitation parameter (SEW parameter) and the second excitation parameter allows the periodicity of the excitation to be represented more effectively. The first and second excitation parameters may correspond to ITFTE (Improved Time-Frequency Trajectory Excitation) parameters.
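
The SEW/REW split described above can be illustrated with a minimal sketch. It assumes the TFTE is available as a list of frames, each holding one value per frequency bin, and it uses a centered moving average along the time axis as a stand-in for the low-pass filter. The embodiment does not specify the filter design, so the function name `split_sew_rew` and the `window` parameter are hypothetical.

```python
def split_sew_rew(tfte, window=5):
    """Split a TFTE matrix (frames x frequency bins) into SEW and REW parts.

    Assumption: a centered moving average over `window` frames plays the
    role of the low-pass filter applied along the time axis.
    """
    num_frames = len(tfte)
    num_bins = len(tfte[0]) if num_frames else 0
    half = window // 2
    sew = [[0.0] * num_bins for _ in range(num_frames)]
    for k in range(num_bins):              # each frequency component separately
        for t in range(num_frames):        # smooth along the time axis
            lo = max(0, t - half)
            hi = min(num_frames, t + half + 1)
            sew[t][k] = sum(tfte[u][k] for u in range(lo, hi)) / (hi - lo)
    # REW is whatever remains after removing the slowly evolving part.
    rew = [[tfte[t][k] - sew[t][k] for k in range(num_bins)]
           for t in range(num_frames)]
    return sew, rew


# Bin 0 is constant over time (purely slow); bin 1 alternates every frame.
example_tfte = [[1.0, 2.0], [1.0, 4.0], [1.0, 2.0], [1.0, 4.0]]
example_sew, example_rew = split_sew_rew(example_tfte, window=3)
```

In this toy input the constant bin ends up entirely in the SEW part, while the alternating bin leaves a non-zero REW part. By construction, SEW plus REW always reproduces the TFTE.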

At step 420, excitation signal estimator 320 may estimate an excitation signal based on the plurality of acoustic parameters. That is, excitation signal estimator 320 may estimate the excitation signal with the spectrum-related parameters and the excitation-related parameters as input. The estimated excitation signal may be a time sequence of the excitation signal.

Excitation signal estimator 320 has been trained on speech signals input for training, and may estimate the excitation signal for the acquired plurality of acoustic parameters using the probability distribution of the excitation signal modeled through training. The method of training the neural vocoder that includes excitation signal estimator 320 is described in more detail with reference to FIG. 5.

Excitation signal estimator 320 may be implemented by, for example, WaveNet, SampleRNN, or WaveRNN. It may also be implemented by a general generative model, without being limited to these.

At step 430, speech signal generator 330 may generate the target speech signal by applying, to the excitation signal estimated by excitation signal estimator 320, a linear (prediction) synthesis filter based on at least one of the spectrum-related parameters. The target speech signal may be a synthesized speech signal. Step 430 is described in more detail with reference to steps 432 and 434.

At step 432, conversion unit 332 may convert the parameter representing the LSF of the speech signal, among the acquired spectrum-related parameters, into linear predictive coding (LPC) coefficients. The linear prediction synthesis filter is based on the parameter representing the LSF of the speech signal among the acquired spectrum-related parameters, and may be generated from the converted LPC.
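
As an illustration of the kind of conversion performed at step 432, the following sketch uses the standard textbook construction in which the LSFs are the root angles of a symmetric polynomial P(z) and an antisymmetric polynomial Q(z), with A(z) = (P(z) + Q(z)) / 2. It is not taken from the embodiment. It assumes an even LPC order, LSFs given in ascending order in radians within (0, pi), and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The function names are hypothetical.

```python
import math


def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def lsf_to_lpc(lsf):
    """Convert ascending LSFs (radians) to LPC coefficients [1, a_1, ..., a_p]."""
    if len(lsf) % 2 != 0:
        raise ValueError("this sketch assumes an even LPC order")
    p_poly = [1.0, 1.0]    # P(z) has a fixed root at z = -1
    q_poly = [1.0, -1.0]   # Q(z) has a fixed root at z = +1
    for i, w in enumerate(lsf):
        # Each LSF contributes a conjugate root pair on the unit circle.
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if i % 2 == 0:     # 1st, 3rd, ... LSFs belong to P(z)
            p_poly = poly_mul(p_poly, factor)
        else:              # 2nd, 4th, ... LSFs belong to Q(z)
            q_poly = poly_mul(q_poly, factor)
    # A(z) = (P(z) + Q(z)) / 2; the z^-(p+1) terms cancel and are dropped.
    return [(pc + qc) / 2.0 for pc, qc in zip(p_poly, q_poly)][:-1]


# LSFs at pi/3 and 2*pi/3 correspond to the trivial filter A(z) = 1.
flat_lpc = lsf_to_lpc([math.pi / 3.0, 2.0 * math.pi / 3.0])
```

The resulting coefficient list can be used directly as the denominator of a synthesis filter or as the taps of an analysis filter under the same sign convention.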

At step 434, speech signal generator 330 may generate the target speech signal by applying, to the estimated excitation signal, the linear prediction synthesis filter based on the LPC converted at step 432.
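
The filtering at step 434 can be sketched as an all-pole recursion. This minimal example assumes a single fixed set of LPC coefficients [1, a_1, ..., a_p] with the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, whereas a practical system would typically update the coefficients frame by frame. The function name `lp_synthesis` is hypothetical.

```python
def lp_synthesis(excitation, lpc):
    """Apply the all-pole filter 1/A(z): s[n] = e[n] - sum_k a_k * s[n-k]."""
    order = len(lpc) - 1
    speech = []
    for n, e in enumerate(excitation):
        acc = e
        for k in range(1, order + 1):
            if n - k >= 0:                 # samples before the start are zero
                acc -= lpc[k] * speech[n - k]
        speech.append(acc)
    return speech


# A unit impulse through 1 / (1 - 0.5 z^-1) decays by half at every sample.
impulse_response = lp_synthesis([1.0, 0.0, 0.0, 0.0], [1.0, -0.5])
```

The filter reintroduces the spectral envelope that the excitation signal lacks, which is why the estimator only needs to model the excitation.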

The target speech signal generated through steps 410 to 430 has better quality than a speech signal generated by estimating the speech signal directly without targeting the excitation signal. In particular, spectral distortion in the high-frequency region of the speech signal can be reduced.

The description of the technical features given above with reference to FIGS. 1 to 3 applies equally to FIG. 4, so redundant description is omitted.

FIG. 5 is a flowchart illustrating a method of training a neural vocoder according to an embodiment.

With reference to FIG. 5, the following describes in detail a method of modeling the probability distribution of the excitation signal so that the excitation signal can be estimated based on acquired acoustic parameters.

At step 510, neural vocoder system 200 may receive a speech signal for training. The speech signal for training may be input to neural vocoder system 200 directly from a speaker, or may be input by having an electronic device that received the speech signal transmit data containing it to neural vocoder system 200.

At step 520, neural vocoder system 200 may extract, from the input speech signal, a plurality of acoustic parameters including spectrum-related parameters and excitation-related parameters classified according to the periodicity of the excitation. Neural vocoder system 200 may extract the plurality of acoustic parameters from the speech signal through speech analysis. For example, neural vocoder system 200 may extract the plurality of acoustic parameters from the speech signal using a parametric vocoder located inside or outside the system.

The spectrum-related parameters may include, for example, a frequency parameter (F0) representing the pitch of the speech signal, an energy parameter representing the energy of the speech signal (for example, a parameter representing gain), a parameter (v/uv) indicating whether the speech signal is voiced or unvoiced, and a parameter representing the line spectral frequency (LSF) of the speech signal. The excitation-related parameters may include parameters classified according to the periodicity of the excitation. The excitation-related parameters may be, for example, TFTE (Time-Frequency Trajectory Excitation) parameters. The TFTE may represent the spectral shape of the excitation along the frequency axis and the evolution of that shape along the time axis. The excitation-related parameters may include a SEW parameter, which represents the component of the excitation signal that changes more slowly on the time-frequency axis, and a REW parameter, which represents the component of the excitation signal that changes more rapidly on the time-frequency axis. The SEW parameter may represent excitation at or below a predetermined cutoff frequency, and the REW parameter may represent excitation above the cutoff frequency. The SEW parameter may represent the harmonic spectrum of the excitation, and the REW parameter may represent the remaining part of the excitation. For example, the SEW parameter, which corresponds to the harmonic excitation spectrum, may be obtained by low-pass filtering each frequency component of the TFTE along the time axis (at the predetermined cutoff frequency). The residual noise spectrum above the predetermined cutoff frequency may be obtained as the REW parameter by subtracting the SEW from the TFTE.

Steps 510 and 520 described above, like steps 530 and 540 described below, may be performed by processor 220 of neural vocoder system 200.

At step 530, excitation signal separator 342 may separate the excitation signal from the input speech signal by applying, to the input speech signal, a linear prediction analysis filter based on at least one of the spectrum-related parameters. The linear prediction analysis filter may be a filter that separates the spectral formant structure from the speech signal. The separated excitation signal may be the residual component (that is, the residual signal) of the input speech signal. To reduce the amount of information, the excitation signal may be an approximation of the residual signal by at least one of various types of excitation models, such as the pulse-or-noise (PoN), band aperiodicity (BAP), glottal excitation, and time-frequency trajectory excitation (TFTE) models.

The method of separating the excitation signal from the speech signal is described in more detail with reference to steps 532 and 534.

At step 532, conversion unit 343 of excitation signal separator 342 may convert the parameter representing the LSF of the input speech signal, among the spectrum-related parameters, into LPC coefficients. The linear prediction analysis filter is based on the parameter representing the LSF of the speech signal among the acquired spectrum-related parameters, and may be generated from the converted LPC.

At step 534, excitation signal separator 342 may separate the excitation signal from the speech signal by applying the linear prediction analysis filter based on the LPC to the input speech signal.
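
The separation at step 534 can be sketched as an FIR filter that is the inverse of the synthesis filter used at generation time. This minimal example assumes a single fixed set of LPC coefficients [1, a_1, ..., a_p] with the convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, whereas a practical system would typically apply frame-wise coefficients. The function name `lp_analysis` is hypothetical.

```python
def lp_analysis(speech, lpc):
    """Apply the FIR filter A(z): e[n] = s[n] + sum_k a_k * s[n-k]."""
    order = len(lpc) - 1
    residual = []
    for n in range(len(speech)):
        acc = speech[n]
        for k in range(1, order + 1):
            if n - k >= 0:                 # samples before the start are zero
                acc += lpc[k] * speech[n - k]
        residual.append(acc)
    return residual


# A signal that decays by half per sample is fully predicted by a_1 = -0.5,
# so only the initial impulse survives as the residual (excitation).
toy_residual = lp_analysis([1.0, 0.5, 0.25, 0.125], [1.0, -0.5])
```

The residual is what the modeling unit learns from. Filtering it back through 1/A(z) with the same coefficients would reproduce the input speech signal.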

At step 540, modeling unit 344 may model the probability distribution of the separated excitation signal. Modeling unit 344 may be implemented by, for example, WaveNet, SampleRNN, or WaveRNN. It may also be implemented by a general generative model, without being limited to these.

Excitation signal estimator 320 may perform the excitation signal estimation of step 420 described above by using the probability distribution of the excitation signal modeled by modeling unit 344.
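
How a modeled probability distribution might be used for estimation can be sketched as an autoregressive sampling loop. The following example rests on assumptions that the embodiment does not state: the excitation is quantized to 256 classes with 8-bit mu-law companding (as in the original WaveNet), the model outputs one categorical distribution per sample, and `predict_probs` is a hypothetical callable standing in for the trained model.

```python
import math
import random

MU = 255  # assumed 8-bit mu-law quantization (256 amplitude classes)


def mu_law_decode(index, mu=MU):
    """Map a class index in [0, mu] back to an amplitude in [-1, 1]."""
    y = 2.0 * index / mu - 1.0
    return math.copysign((math.pow(1.0 + mu, abs(y)) - 1.0) / mu, y)


def sample_excitation(predict_probs, conditions, seed=0):
    """Draw one excitation sample per conditioning vector.

    `predict_probs(history, cond)` is a hypothetical stand-in for the trained
    model: given previously drawn class indices and the acoustic parameters
    for the current sample, it returns a list of MU + 1 class probabilities.
    """
    rng = random.Random(seed)
    history, excitation = [], []
    for cond in conditions:
        probs = predict_probs(history, cond)
        index = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        history.append(index)              # feed the sample back as context
        excitation.append(mu_law_decode(index))
    return excitation


def dummy_model(history, cond):
    """Placeholder model that always predicts the largest amplitude class."""
    return [0.0] * MU + [1.0]


samples = sample_excitation(dummy_model, conditions=[None] * 4)
```

In an actual system the conditioning vectors would carry the spectrum-related and excitation-related parameters for each time step, and the dummy model would be replaced by the trained network.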

The neural vocoder of the embodiments described with reference to FIGS. 1 to 4 may be called an ExcitNet vocoder, in that it is trained on the excitation signal and generates the synthesized speech signal by estimating the excitation signal.

Because changes in the excitation signal are constrained only by the movement of the speaker's vocal cords, training on the excitation signal can be carried out far more simply than training on the speech signal. In addition, using the ITFTE parameters as conditional features that effectively represent the degree of periodicity of the excitation signal can greatly improve the accuracy of modeling the probability distribution of the excitation signal.

The description of the technical features given above with reference to FIGS. 1 to 4 applies equally to FIG. 5, so redundant description is omitted.

The following describes, with reference to FIGS. 6 to 8, a method of building a speaker-adaptive model that produces high-quality synthesized speech of a target speaker from only a small amount (that is, a short duration) of speech data from the target speaker, and of generating the target speaker's synthesized speech.

FIG. 6 is a diagram illustrating a method of building a speaker-adaptive model and generating synthesized speech of a target speaker according to an embodiment.

In the detailed description below, a speech data set may denote a speech signal or data containing a speech signal. For example, a speech data set may denote a speech signal recorded from a speaker over a certain period of time.

Source model 610 may be an acoustic model trained on speech data sets from multiple speakers. Source model 610 may be an acoustic model trained speaker-independently on the multiple speakers. For example, source model 610 may be an acoustic model trained speaker-independently using one hour of speech data from each of ten speakers. Source model 610 may be designed within a deep-learning-based statistical parametric speech synthesis (SPSS) system. Acoustic model 110 may be, for example, a DNN TTS module.

複数の話者からの音声データセットによって話者独立的に訓練されたソースモデル６１０は、話者適応型モデル６２０の初期化子（ｉｎｉｔｉａｌｉｚｅｒ）として使用されてよい。言い換えれば、ソースモデル６１０からの加重値（ｗｅｉｇｈｔ）は、話者適応型モデル６２０のターゲット話者からの音声データセットに対する訓練において初期値として設定されてよい。ソースモデル６１０からの加重値は、例えば、上述した音響パラメータに対応してよい。 A source model 610 trained speaker-independently with speech datasets from multiple speakers may be used as an initializer for a speaker-adaptive model 620 . In other words, the weights from the source model 610 may be set as initial values in training the speaker-adaptive model 620 on the speech dataset from the target speaker. Weights from source model 610 may correspond to, for example, the acoustic parameters described above.

話者適応型モデル６２０は、ニューラルボコーダによって実現されてよい。ニューラルボコーダは、ディープラーニングモデルに基づいて訓練されたものであってよい。ニューラルボコーダは、例えば、ＷａｖｅＮｅｔ、ＳａｍｐｌｅＲＮＮ、ＥｘｃｉｔＮｅｔ、またはＷａｖｅＲＮＮであってよい。また、ニューラルボコーダは、これらに制限されない、一般的な生成モデル（ｇｅｎｅｒａｔｉｖｅｍｏｄｅｌ）であってもよい。 Speaker adaptive model 620 may be implemented by a neural vocoder. A neural vocoder may be trained based on a deep learning model. A neural vocoder may be, for example, WaveNet, SampleRNN, ExcitNet, or WaveRNN. Also, the neural vocoder may be a general generative model that is not limited to these.

話者適応型モデル６２０は、話者適応（ｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎ）アルゴリズムを適用することにより、特定の話者に対して従属的に（ｓｐｅａｋｅｒ－ｄｅｐｅｎｄｅｎｔ）訓練されてよい。例えば、話者適応型モデル６２０は、特定のターゲット話者（例えば、芸能人や有名人などのようなセレブリティ）に対して話者従属的に訓練されてよい。話者適応型モデル６２０は、ターゲット話者からの音声データセットを訓練することによってアップデートされた加重値（複数可）を生成してよい。 Speaker-adaptive model 620 may be trained speaker-dependently on a particular speaker by applying a speaker adaptation algorithm. For example, speaker-adaptive model 620 may be trained speaker-dependently for a particular target speaker (eg, a celebrity such as an entertainer, celebrity, etc.). Speaker-adaptive model 620 may generate updated weight(s) by training on speech datasets from target speakers.

話者適応型モデル６２０は、ランダム値でない、話者独立的に訓練されたソースモデル６１０からの加重値を初期値として使用してターゲット話者からの音声データセットを訓練することにより、相対的に小さい（すなわち、短時間）音声データセットを訓練するだけでも高品質のターゲット話者の合成音声（合成音声信号）を生成することができる。例えば、話者適応型モデル６２０は、１０分前後のターゲット話者の音声データセットを訓練するだけでも高品質のターゲット話者の合成音声を生成することができる。 Because the speaker-adaptive model 620 is trained on the speech dataset from the target speaker using, as initial values, the weights from the speaker-independently trained source model 610 rather than random values, it can generate high-quality synthesized speech (a synthesized speech signal) of the target speaker even when trained on only a relatively small (i.e., short) speech dataset. For example, the speaker-adaptive model 620 can generate high-quality synthesized speech of the target speaker after training on only about 10 minutes of the target speaker's speech data.

実施形態によっては、数時間～数十時間以上の音声データセットの確保が困難なセレブリティに対して１０分前後の音声データセットを確保し、これを訓練データとして使用するだけでも、高品質のターゲット話者の合成音声を生成する話者適応型モデル６２０を構築することができる。 According to the embodiment, for a celebrity for whom it is difficult to secure a speech dataset of several hours to several tens of hours or more, a speaker-adaptive model 620 that generates high-quality synthesized speech of the target speaker can be built simply by securing a speech dataset of about 10 minutes and using it as training data.

以上、図１～５を参照しながら説明した技術的特徴の説明は、図６に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features described above with reference to FIGS. 1 to 5 can be applied to FIG. 6 as it is, so redundant description will be omitted.

図７は、一実施形態における、ニューラルボコーダのプロセッサの構造を示したブロック図である。 FIG. 7 is a block diagram illustrating the structure of a neural vocoder processor, in one embodiment.

図７を参照しながら説明するプロセッサ２２０は、図３を参照しながら説明したプロセッサ２２０に対応してよい。以下で説明するプロセッサ２２０の構成７１０～７２０のそれぞれは、１つ以上のソフトウェアモジュールおよび／またはハードウェアモジュールによって実現されてよい。実施形態によって、プロセッサ２２０の構成要素は、選択的にプロセッサ２２０に含まれても除外されてもよい。また、実施形態によって、プロセッサ２２０の構成要素は、プロセッサ２２０の機能の表現のために分離されても併合されてもよい。構成７１０～７２０は、ニューラルボコーダシステム２００に記録されたプログラムコードが提供する命令にしたがってプロセッサ２２０によって実行される、プロセッサ２２０の互いに異なる機能（ｄｉｆｆｅｒｅｎｔｆｕｎｃｔｉｏｎｓ）の表現であってよい。 The processor 220 described with reference to FIG. 7 may correspond to the processor 220 described with reference to FIG. 3. Each of the components 710 to 720 of the processor 220 described below may be implemented by one or more software modules and/or hardware modules. Depending on the embodiment, components of the processor 220 may be selectively included in or excluded from the processor 220. Also, depending on the embodiment, the components of the processor 220 may be separated or merged to represent the functions of the processor 220. The components 710 to 720 may be representations of different functions of the processor 220 that the processor 220 performs according to instructions provided by program code recorded in the neural vocoder system 200.

プロセッサ２２０は、話者適応型モデル構築部７２０を含んでよい。話者適応型モデル構築部７２０は、複数の話者からの音声データセットに対して話者独立的に訓練されたソースモデル６１０からの加重値（ｗｅｉｇｈｔ）を初期値として設定してよく、設定された初期値に対し、ターゲット話者からの音声データセットを訓練することによってアップデートされた加重値を生成する話者適応型モデル６２０を構築してよい。話者適応型モデル６２０によって生成されたアップデートされた加重値は、ターゲット話者に対応する合成音声を生成するために使用されてよい。 Processor 220 may include a speaker adaptive model builder 720 . The speaker-adaptive model builder 720 may set, as initial values, the weights from the source models 610 that have been trained speaker-independently on speech data sets from multiple speakers. A speaker-adaptive model 620 may be constructed that generates updated weights by training the speech dataset from the target speaker against the initial values. The updated weights generated by speaker-adaptive model 620 may be used to generate synthesized speech corresponding to the target speaker.

プロセッサ２２０は、ソースモデル構築部７１０をさらに含んでよい。ソースモデル構築部７１０は、複数の話者からの音声データセットを話者独立的に訓練するソースモデル６１０を構築してよい。構築されたソースモデル６１０は、ターゲット話者からの音声データセットを訓練するためのモデルの初期化子（ｉｎｉｔｉａｌｉｚｅｒ）として動作してよい。 Processor 220 may further include a source model builder 710 . Source model builder 710 may build source models 610 that are speaker-independently trained on speech datasets from multiple speakers. The constructed source model 610 may act as a model initializer for training a speech dataset from a target speaker.

ソースモデル構築部７１０は、プロセッサ２２０に含まれず、ニューラルボコーダシステム２００とは個別の装置内に実現されてもよい。話者適応型モデル構築部７２０は、このような個別の装置内に実現されたソースモデル構築部７１０によって構築されたソースモデル６１０から加重値を取得し、話者適応型モデル６２０を構築するためのターゲット話者の音声データセットを訓練してよい。 The source model construction unit 710 may be implemented in a device separate from the neural vocoder system 200 instead of being included in the processor 220 . The speaker-adaptive model builder 720 obtains weights from the source model 610 constructed by the source model builder 710 implemented in such a separate device, and constructs the speaker-adaptive model 620. target speakers' speech datasets may be trained.

以上、図１～６を参照しながら説明した技術的特徴についての説明は、図７に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features described above with reference to FIGS. 1 to 6 can also be applied to FIG. 7 as it is, so redundant description will be omitted.

図８は、一実施形態における、話者適応型モデルを構築するためのニューラルボコーダの訓練方法を示したフローチャートである。 FIG. 8 is a flowchart illustrating a method for training a neural vocoder to build a speaker-adaptive model in one embodiment.

段階８１０で、ソースモデル構築部７１０は、複数の話者からの音声データセットを話者独立的に訓練するソースモデル６１０を構築してよい。複数の話者は、ソースモデル６１０を訓練させるための音声データセットを提供する任意の利用者であってよい。 At step 810, the source model builder 710 may build a source model 610 that is speaker-independently trained on speech datasets from multiple speakers. The multiple speakers may be any user who provides speech datasets for training the source model 610 .

段階８２０で、話者適応型モデル構築部７２０は、ソースモデル６１０から加重値を取得してよい。ソースモデル６１０からの加重値は、複数の話者からの音声データセットに含まれた、話者ごとに区分されないグローバル特性を示す値を示してよい。グローバル特性とは、例えば、特定の発音（一例として、「あ（ａｈ）」または「い（ｅｅ）」など）に対するフォルマント（ｆｏｒｍａｎｔ）特性、または振幅－周波数特性（パターン）を示してよい。言い換えれば、ソースモデル６１０は、複数の話者からの音声データセットを使用してこのような音声の話者独立的なグローバル特性を訓練してよい。 At step 820 , speaker-adaptive model builder 720 may obtain weights from source model 610 . The weights from the source model 610 may represent values indicative of global characteristics not partitioned by speaker included in speech datasets from multiple speakers. Global characteristics may indicate, for example, formant characteristics or amplitude-frequency characteristics (patterns) for a particular pronunciation (eg, "ah" or "ee"). In other words, source model 610 may train speaker-independent global properties of such speech using speech datasets from multiple speakers.

段階８３０で、話者適応型モデル構築部７２０は、ソースモデル６１０から取得された加重値を初期値として設定してよい。言い換えれば、ソースモデル６１０は、話者適応型モデル構築部７２０によって構築される話者適応型モデル６２０の初期化子として使用されてよい。 In step 830, the speaker-adaptive model builder 720 may set weights obtained from the source model 610 as initial values. In other words, the source model 610 may be used as an initializer for the speaker-adaptive model 620 constructed by the speaker-adaptive model construction unit 720 .

段階８４０で、話者適応型モデル構築部７２０は、取得された初期値に対し、ターゲット話者からの音声データセットを訓練することによってアップデートされた加重値を生成してよい。言い換えれば、話者適応型モデル構築部７２０は、ソースモデル６１０からの初期値に対してターゲット話者からの音声データセットを訓練することにより、ターゲット話者に適応する（すなわち、ターゲット話者に従属的な）話者適応型モデル６２０を構築してよい。 At step 840, the speaker-adaptive model builder 720 may generate updated weights by training the speech data set from the target speaker against the obtained initial values. In other words, speaker-adaptive model builder 720 adapts to the target speaker by training the speech data set from the target speaker against initial values from source model 610 (i.e., dependent) speaker-adaptive model 620 may be constructed.

話者適応型モデル構築部７２０は、ソースモデル６１０からの加重値を、ターゲット話者からの音声データセットが含むターゲット話者の固有の特性が反映されるように調整することによってアップデートされた加重値を生成してよい。例えば、話者適応型モデル構築部７２０は、ターゲット話者からの音声データセットを訓練することにより、ソースモデル６１０からの話者ごとに区分されないグローバル特性を示す値をターゲット話者の固有の特性を含むように微調整することによってアップデートされた加重値を生成してよい。 The speaker-adaptive model builder 720 may generate updated weights by adjusting the weights from the source model 610 so that they reflect the characteristics unique to the target speaker that are contained in the speech dataset from the target speaker. For example, by training on the speech dataset from the target speaker, the speaker-adaptive model builder 720 may generate the updated weights by fine-tuning the values from the source model 610, which indicate global characteristics not distinguished by speaker, so that they include the characteristics unique to the target speaker.

生成された、アップデートされた加重値は、ターゲット話者に対応する合成音声信号を生成するために使用されてよい。ターゲット話者に対応する合成音声信号は、例えば、ターゲット話者に対応するセレブリティの合成音声であってよい。 The generated updated weights may be used to generate a synthesized speech signal corresponding to the target speaker. The synthesized speech signal corresponding to the target speaker may be, for example, the synthesized speech of a celebrity corresponding to the target speaker.
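The flow of steps 810 to 840 can be illustrated with a deliberately tiny stand-in. The sketch below is not the neural vocoder of the embodiment: it assumes a two-parameter linear model in place of the vocoder weights, plain gradient descent in place of the actual training procedure, and synthetic numbers in place of speech datasets. All names (`train`, `mse`, `source_data`, `target_data`) are hypothetical. It only shows why initializing from source weights helps when the target data and the number of adaptation steps are small.

```python
# Toy stand-in for steps 810-840: a 2-parameter linear model y = w*x + b
# plays the role of the vocoder weights (an assumption for illustration only).

def train(weights, data, steps, lr=0.05):
    """Gradient descent on mean squared error, starting from `weights`."""
    w, b = weights
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * ((w * x + b) - y) * x for x, y in data) / n
        grad_b = sum(2 * ((w * x + b) - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def mse(weights, data):
    """Mean squared error of the model on `data`."""
    w, b = weights
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)

# "Multiple speakers": a shared slope (global characteristic) plus a
# per-speaker offset (speaker-specific characteristic). Synthetic values.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
speaker_offsets = [-0.2, 0.0, 0.2]
source_data = [(x, 2.0 * x + off) for off in speaker_offsets for x in xs]

# "Target speaker": same slope, a different offset, and far fewer samples.
target_data = [(x, 2.0 * x + 0.5) for x in (-1.0, 0.0, 1.0)]

# Step 810: build the speaker-independent source model.
source_weights = train((0.0, 0.0), source_data, steps=500)

# Steps 820-840: take the source weights as initial values, then fine-tune
# on the small target dataset for only a few steps.
adapted = train(source_weights, target_data, steps=20)

# Baseline for comparison: the same few steps without the source initializer.
scratch = train((0.0, 0.0), target_data, steps=20)

print("adapted MSE:", mse(adapted, target_data))
print("scratch MSE:", mse(scratch, target_data))
```

In this toy setting the adapted model only has to learn the target offset, because the shared slope is already present in the initial values, so it reaches a lower error than the baseline for the same number of steps.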

ソースモデル６１０を訓練させるための複数の話者からの音声データセットのそれぞれの大きさ（すなわち、録音された音声信号の長さ、例えば、１時間以上）は、ターゲット話者からの音声データセットの大きさ（すなわち、録音された音声信号の長さ、例えば、１０分）よりも大きくてよい。 The size of each of the speech datasets from the multiple speakers used to train the source model 610 (i.e., the length of the recorded speech signal, e.g., one hour or more) may be larger than the size of the speech dataset from the target speaker (i.e., the length of the recorded speech signal, e.g., 10 minutes).

段階８３０で説明したような適応プロセスの微調整（ｆｉｎｅ－ｔｕｎｉｎｇ）メカニズムによっては、ターゲット話者からの音声データセットからターゲット話者の固有の特性がキャプチャされてよい。したがって、説明した実施形態の方法によっては、ターゲット話者からの訓練のための音声データセットが不十分であっても、ボコーディング性能を向上させることができる。 Depending on the fine-tuning mechanism of the adaptation process, such as that described in stage 830, the target speaker's unique characteristics may be captured from the speech data set from the target speaker. Thus, the methods of the described embodiments can improve vocoding performance even if the speech data set for training from the target speaker is insufficient.

図６～８を参照しながら説明したニューラルボコーダの訓練方法は、図１～４を参照しながら説明した実施形態のニューラルボコーダの訓練方法と合成音声信号の生成方法と組み合わされてよい。例えば、上述したＥｘｃｉｔＮｅｔボコーダは、図６～８を参照しながら説明した実施形態と組み合わされてよい。 The neural vocoder training method described with reference to FIGS. 6 to 8 may be combined with the neural vocoder training method and the synthesized speech signal generation method of the embodiments described with reference to FIGS. 1 to 4. For example, the ExcitNet vocoder described above may be combined with the embodiments described with reference to FIGS. 6 to 8.

一例として、段階８１０～８４０を実行することによって訓練されたニューラルボコーダは、図１～４を参照しながら説明したニューラルボコーダシステム２００に対応してよい。段階４３０で生成されたターゲット音声信号は、話者適応型モデル６２０が訓練したターゲット話者に対応する合成音声信号であってよい。 As an example, a neural vocoder trained by performing steps 810-840 may correspond to neural vocoder system 200 described with reference to FIGS. 1-4. The target speech signal generated in step 430 may be a synthesized speech signal corresponding to the target speaker that speaker-adaptive model 620 trained.

図６～８を参照しながら説明したニューラルボコーダの訓練方法と図１～４を参照しながら説明したＥｘｃｉｔＮｅｔモデルの技術的特徴とを組み合わせることにより、ターゲット話者に対応する合成音声の品質を高めることができる。 Combining the neural vocoder training method described with reference to FIGS. 6-8 with the technical features of the ExcitNet model described with reference to FIGS. 1-4 enhances the quality of the synthesized speech corresponding to the target speaker. be able to.

以上、図１～７を参照しながら説明した技術的特徴についての説明は、図８に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features described above with reference to FIGS. 1 to 7 can also be applied to FIG. 8 as it is, so redundant description will be omitted.

図９は、一例における、音声信号および励起信号とその関係を示した図である。 FIG. 9 is a diagram showing an audio signal and an excitation signal and their relationship in one example.

図に示すように、音声信号をＳ（ｎ）と仮定し、Ｓ（ｎ）が含む励起信号をｅ（ｎ）と仮定するとき、Ｓ（ｎ）とｅ（ｎ）との関係は、以下の数式（１）のように表現されてよい。 As shown in the figure, when the speech signal is denoted S(n) and the excitation signal contained in S(n) is denoted e(n), the relationship between S(n) and e(n) may be expressed as in equation (1) below.

\[ S(n) = h(n) * e(n) \qquad (1) \]

ｈ（ｎ）は、線形予測合成フィルタを示してよい。ｈ（ｎ）は、Ｓ（ｎ）のｅ（ｎ）成分を除いた残りの成分（すなわち、スペクトル成分）を示してよい。ｈ（ｎ）は、Ｓ（ｎ）のＬＳＦを示すパラメータに基づいて生成されてよい。 h(n) may denote a linear prediction synthesis filter. h(n) may represent the components of S(n) that remain after removing the e(n) component (i.e., the spectral components). h(n) may be generated based on parameters representing the LSF of S(n).

数式（１）の関係により、図４の段階４２０によって推定された励起信号（すなわち、ｅ（ｎ））に対して線形予測合成フィルタ（すなわち、ｈ（ｎ））を適用することによってターゲット音声信号（Ｓ（ｎ））が生成されてよい。線形予測合成フィルタの具体的な例については、図１４を参照しながらさらに詳しく説明する。 According to the relationship of equation (1), the target speech signal (S(n)) may be generated by applying the linear prediction synthesis filter (i.e., h(n)) to the excitation signal (i.e., e(n)) estimated in step 420 of FIG. 4. A specific example of the linear prediction synthesis filter is described in more detail with reference to FIG. 14.

数式（１）の関係は、図５の段階５３０の励起信号（すなわち、ｅ（ｎ））の分離に対しても類似に適用されてよい。言い換えれば、訓練のために入力された音声信号（Ｓ（ｎ））に対して線形予測分析フィルタが適用されることにより、音声信号（Ｓ（ｎ））から励起信号（ｅ（ｎ））が分離されてよい。線形予測分析フィルタの具体的な例については、図１３を参照しながらさらに詳しく説明する。 The relationship of equation (1) may be applied similarly to the separation of the excitation signal (i.e., e(n)) in step 530 of FIG. 5. In other words, the excitation signal (e(n)) may be separated from the speech signal (S(n)) input for training by applying a linear prediction analysis filter to that speech signal. A specific example of the linear prediction analysis filter is described in more detail with reference to FIG. 13.
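The analysis and synthesis directions of equation (1) can be sketched as a pair of inverse filters. The example below assumes the common linear prediction convention in which the analysis filter is the FIR polynomial A(z) = 1 + a1 z^-1 + ... + ap z^-p and the synthesis filter h(n) is the impulse response of 1/A(z). The text does not state this convention, the coefficient values and signal samples are made up, and the function names are hypothetical. Plain Python is used so the sketch has no dependencies.

```python
def lp_analysis(speech, a):
    """Separate the excitation: e(n) = sum_k a[k] * s(n-k), with a[0] == 1."""
    excitation = []
    for n in range(len(speech)):
        acc = 0.0
        for k, coeff in enumerate(a):
            if n - k >= 0:
                acc += coeff * speech[n - k]
        excitation.append(acc)
    return excitation

def lp_synthesis(excitation, a):
    """Rebuild speech: s(n) = e(n) - sum_{k>=1} a[k] * s(n-k)."""
    speech = []
    for n in range(len(excitation)):
        acc = excitation[n]
        for k in range(1, len(a)):
            if n - k >= 0:
                acc -= a[k] * speech[n - k]
        speech.append(acc)
    return speech

# Hypothetical second-order coefficients (stable: poles at 0.4 and 0.5).
a = [1.0, -0.9, 0.2]
speech = [0.0, 0.5, 1.0, 0.3, -0.7, -0.2, 0.4]

excitation = lp_analysis(speech, a)           # training side (cf. step 530)
reconstructed = lp_synthesis(excitation, a)   # generation side (cf. step 430)
```

With zero initial filter state, synthesis undoes analysis exactly, which is why a vocoder that models only e(n) can still recover S(n) once the spectral filter is known.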

以上、図１～８を参照しながら説明した技術的特徴についての説明は、図９に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features described above with reference to FIGS. 1 to 8 can also be applied to FIG. 9 as it is, so redundant description will be omitted.

図１０ａ～１０ｃは、それぞれ異なる種類のボコーダを使用する合成音声信号生成のための統計的パラメトリック音声合成（ＳｔａｔｉｓｔｉｃａｌＰａｒａｍｅｔｒｉｃＳｐｅｅｃｈＳｙｎｔｈｅｓｉｓ：ＳＰＳＳ）システムを示した図である。 10a-10c illustrate a Statistical Parametric Speech Synthesis (SPSS) system for synthetic speech signal generation using different types of vocoders.

図１０ａは、音響モデル１０１０と音響モデル１０１０からの音響フィーチャー（音響パラメータ）をＬＰＣ（ＬｉｎｅａｒＰｒｅｄｉｃｔｉｖｅＣｏｄｉｎｇ）合成することによって音声信号を生成する、ＬＰＣ合成モジュール１０２０を含む音声合成のためのフレームワークを示している。ＬＰＣ合成モジュール１０２０は、ＬＰＣボコーダであり、例えば、上述した線形予測合成フィルタに対応してよい。 FIG. 10a shows a framework for speech synthesis including an LPC synthesis module 1020 that generates a speech signal by linear predictive coding (LPC) synthesis of an acoustic model 1010 and acoustic features (acoustic parameters) from the acoustic model 1010. showing. LPC synthesis module 1020 is an LPC vocoder and may correspond, for example, to the linear prediction synthesis filter described above.

図１０ｂは、音響モデル１０１０と音響モデル１０１０からの音響フィーチャー（音響パラメータ）に基づいて音声信号を推定するニューラルボコーダであり、ＷａｖｅＮｅｔボコーダ１０２２を含む音声合成のためのフレームワークを示した図である。 FIG. 10b shows a framework for speech synthesis that includes an acoustic model 1010 and a WaveNet vocoder 1022, which is a neural vocoder that estimates a speech signal based on acoustic features (acoustic parameters) from the acoustic model 1010. .

図１０ｃは、図１～５で説明したような、ＥｘｃｉｔＮｅｔボコーダ１０２４を使用する音声合成のためのフレームワークを示している。図１０ｃに示した構造は、図１０ａのＬＰＣコーダ１０２０と図１０ｂのＷａｖｅＮｅｔボコーダ１０２２が組み合わされたものであってよい。 FIG. 10c shows a framework for speech synthesis using the ExcitNet vocoder 1024, as described in FIGS. 1-5. The structure shown in Figure 10c may be a combination of the LPC coder 1020 of Figure 10a and the WaveNet vocoder 1022 of Figure 10b.

図１０ｃの構造において、ＥｘｃｉｔＮｅｔボコーダ１０２４は、音響モデル１０１０からの音響フィーチャー（音響パラメータ）に基づいて励起信号を推定してよい。推定された励起信号は、線形予測合成フィルタ１０３０によるＬＰＣ（ＬｉｎｅａｒＰｒｅｄｉｃｔｉｖｅＣｏｄｉｎｇ）合成によってターゲット音声信号に変換されてよい。 In the structure of FIG. 10 c , the ExcitNet vocoder 1024 may estimate the excitation signal based on acoustic features (acoustic parameters) from the acoustic model 1010 . The estimated excitation signal may be transformed into a target speech signal by LPC (Linear Predictive Coding) synthesis by a linear predictive synthesis filter 1030 .

図１０ｃの構造のより詳細な例については、図１２および図１４を参照しながらさらに詳しく説明する。 A more detailed example of the structure of FIG. 10c is described with reference to FIGS. 12 and 14.

以上、図１～９を参照しながら説明した技術的特徴についての説明は、図１０ａ～図１０ｃに対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features given above with reference to FIGS. 1 to 9 also applies as-is to FIGS. 10a to 10c, so redundant description is omitted.

図１１および図１３は、一実施形態における、訓練のために入力された音声信号から励起信号を分離することによってニューラルボコーダを訓練させる方法を示した図である。 11 and 13 illustrate a method of training a neural vocoder by separating the excitation signal from the input audio signal for training, in one embodiment.

図１１に示すように、訓練のために入力された音声信号に対し、パラメトリックボコーダ１１１０は、音響パラメータを抽出してよい。入力された音声信号に対しては、抽出された音響パラメータのうちでスペクトル関連パラメータに基づいて生成された線形予測分析フィルタ１１４０が適用されることにより、入力された音声信号から励起信号が分離されてよい。 As shown in FIG. 11, parametric vocoder 1110 may extract acoustic parameters for an input speech signal for training. A linear predictive analysis filter 1140 generated based on spectral-related parameters among the extracted acoustic parameters is applied to the input speech signal to separate the excitation signal from the input speech signal. you can

ＷａｖｅＮｅｔボコーダ１１３０は、抽出された音響パラメータを補助フィーチャー（ａｕｘｉｌｉａｒｙｆｅａｔｕｒｅ）として構成１１２０して受信してよい。補助フィーチャーは、上述したスペクトル関連パラメータおよび励起関連パラメータを含んでよい。ＷａｖｅＮｅｔボコーダ１１３０は、補助フィーチャーおよび分離した励起信号に基づいて励起信号の確率分布をモデリングしてよい。ＷａｖｅＮｅｔボコーダ１１３０は、ＥｘｃｉｔＮｅｔボコーダまたはその他の一般的な生成モデル（ｇｅｎｅｒａｔｉｖｅｍｏｄｅｌ）のニューラルボコーダによって実現されてよい。 The WaveNet vocoder 1130 may receive the extracted acoustic parameters after they are composed 1120 as auxiliary features. The auxiliary features may include the spectrum-related parameters and excitation-related parameters described above. The WaveNet vocoder 1130 may model the probability distribution of the excitation signal based on the auxiliary features and the separated excitation signal. The WaveNet vocoder 1130 may be implemented by an ExcitNet vocoder or by a neural vocoder of another general generative model.

図１３を参照しながら、図１１の構造についてより詳しく説明する。訓練のために入力された音声信号は、音声分析１３１０によって音響フィーチャー（音響パラメータ）が抽出されてよい。音響パラメータのうちでＬＳＦを示すパラメータは、ＬＰＣに変換１３２０されてよい。変換されたＬＰＣに基づき、線形予測分析フィルタ１３４０が実現されてよい。入力された音声信号に対して線形予測分析フィルタ１３４０が適用されることにより、入力された音声信号から励起信号が分離されてよい。分離した励起信号は、ＥｘｃｉｔＮｅｔモデル（すなわち、ＥｘｃｉｔＮｅｔボコーダ）１３５０に入力されてよい。一方、音響パラメータは補助フィーチャー（ａｕｘｉｌｉａｒｙｆｅａｔｕｒｅ）として構成１３３０されてよく、補助フィーチャーはＥｘｃｉｔＮｅｔモデル１３５０に入力されてよい。ＥｘｃｉｔＮｅｔモデル１３５０は、入力された補助フィーチャー（すなわち、音響パラメータ）と分離した励起信号に基づいて励起信号の確率分布をモデリングしてよい。図に示した例において、ｅ_ｎは、分離した励起信号に対応してよい。 The structure of FIG. 11 is described in more detail with reference to FIG. 13. Acoustic features (acoustic parameters) may be extracted from the speech signal input for training by speech analysis 1310. Among the acoustic parameters, the parameters representing the LSF may be converted 1320 to LPC. A linear prediction analysis filter 1340 may be implemented based on the converted LPC. The excitation signal may be separated from the input speech signal by applying the linear prediction analysis filter 1340 to the input speech signal. The separated excitation signal may be input to the ExcitNet model (i.e., the ExcitNet vocoder) 1350. Meanwhile, the acoustic parameters may be composed 1330 as auxiliary features, and the auxiliary features may be input to the ExcitNet model 1350. The ExcitNet model 1350 may model the probability distribution of the excitation signal based on the input auxiliary features (i.e., the acoustic parameters) and the separated excitation signal. In the example shown in the figure, e_n may correspond to the separated excitation signal.
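The LSF-to-LPC conversion 1320 is not spelled out in the text. One standard way to do it, shown here as a sketch under stated assumptions, rebuilds the symmetric polynomial P(z) and the antisymmetric polynomial Q(z) from the sorted line spectral frequencies and averages them to obtain A(z). The sketch assumes an even prediction order and LSFs given in radians in (0, pi). The function names and the second-order example values are hypothetical.

```python
import math

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists in powers of z^-1."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def lsf_to_lpc(lsf):
    """Convert LSFs (radians, even count) to LPC coefficients [1, a1, ..., ap]."""
    if len(lsf) % 2 != 0:
        raise ValueError("this sketch assumes an even prediction order")
    ordered = sorted(lsf)
    p_poly = [1.0, 1.0]    # P(z) always contains the factor (1 + z^-1)
    q_poly = [1.0, -1.0]   # Q(z) always contains the factor (1 - z^-1)
    for i, w in enumerate(ordered):
        factor = [1.0, -2.0 * math.cos(w), 1.0]
        if i % 2 == 0:     # 1st, 3rd, ... LSFs are roots of P(z)
            p_poly = poly_mul(p_poly, factor)
        else:              # 2nd, 4th, ... LSFs are roots of Q(z)
            q_poly = poly_mul(q_poly, factor)
    a = [0.5 * (p + q) for p, q in zip(p_poly, q_poly)]
    return a[:-1]          # the z^-(p+1) terms of P and Q cancel

# Hypothetical second-order example: these two LSFs correspond to
# A(z) = 1 - 0.9 z^-1 + 0.2 z^-2.
example_lsf = [math.acos(0.85), math.acos(0.05)]
example_lpc = lsf_to_lpc(example_lsf)
```

The resulting coefficient list can be used directly as the `a` argument of a linear prediction analysis or synthesis filter.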

図１２および図１４は、一実施形態における、入力テキストに基づいて音響モデルによって生成された音響パラメータから励起信号を推定して合成音声信号を生成する方法を示した図である。 12 and 14 illustrate a method for estimating an excitation signal from acoustic parameters generated by an acoustic model based on an input text to generate a synthesized speech signal in one embodiment.

図１２に示すように、音響モデル１１５０は、受信した言語パラメータに基づいて音響パラメータを生成してよい。ＷａｖｅＮｅｔボコーダ１１７０は、音響パラメータを補助フィーチャーとして構成１１６０して受信してよい。補助フィーチャーは、上述したスペクトル関連パラメータおよび励起関連パラメータを含んでよい。ＷａｖｅＮｅｔボコーダ１１７０は、音響パラメータに基づいて励起信号を推定してよい。ＷａｖｅＮｅｔボコーダ１１７０は、ＥｘｃｉｔＮｅｔボコーダまたはその他の一般的な生成モデル（ｇｅｎｅｒａｔｉｖｅｍｏｄｅｌ）のニューラルボコーダによって実現されてよい。推定された励起信号に対しては、抽出された音響パラメータのうちでスペクトル関連パラメータに基づいて生成された線形予測合成フィルタ１１８０が適用されることにより、ターゲット合成音声が生成されてよい。 As shown in FIG. 12, acoustic model 1150 may generate acoustic parameters based on the received linguistic parameters. WaveNet vocoder 1170 may configure 1160 and receive acoustic parameters as auxiliary features. Auxiliary features may include spectral-related parameters and excitation-related parameters as described above. WaveNet vocoder 1170 may estimate the excitation signal based on the acoustic parameters. WaveNet vocoder 1170 may be implemented by an ExcitNet vocoder or other general generative model neural vocoder. A linear prediction synthesis filter 1180 generated based on spectrally related parameters among the extracted acoustic parameters may be applied to the estimated excitation signal to generate the target synthesized speech.

図１４を参照しながら、図１２の構造についてより詳しく説明する。合成音声信号の生成のために入力されたテキストに対してテキスト分析１４１０を実行することにより、（上述した言語パラメータに対応する）言語フィーチャーが抽出されてよい。言語フィーチャーの抽出においては、図に示すように、音素デュレーション（ｐｈｏｎｅｍｅｄｕｒａｔｉｏｎ）を推定するデュレーションモデル１４２０がさらに使用されてよい。音響モデル１４３０は、抽出された言語フィーチャーから音響フィーチャー（音響パラメータ）を生成してよい。音響パラメータのうちでＬＳＦを示すパラメータは、ＬＰＣに変換１４４０されてよい。変換されたＬＰＣに基づいて線形予測合成フィルタ１４７０が実現されてよい。音響パラメータは補助フィーチャー（ａｕｘｉｌｉａｒｙｆｅａｔｕｒｅ）として構成１４５０されてよく、補助フィーチャーはＥｘｃｉｔＮｅｔモデル（すなわち、ＥｘｃｉｔＮｅｔボコーダ）１４６０に入力されてよい。ＥｘｃｉｔＮｅｔモデル１４６０は、入力された補助フィーチャー（すなわち、音響パラメータ）に基づいて励起信号を推定してよい。推定された励起信号に対して変換されたＬＰＣに基づく線形予測合成フィルタ１４７０が適用されることにより、ターゲット音声信号が生成されてよい。図に示した例において、生成されたターゲット音声信号および推定された励起信号は、それぞれの記号で示されてよい。 The structure of FIG. 12 is described in more detail with reference to FIG. 14. Linguistic features (corresponding to the linguistic parameters described above) may be extracted by performing text analysis 1410 on the text input for generating the synthesized speech signal. As shown in the figure, a duration model 1420 that estimates phoneme duration may additionally be used in extracting the linguistic features. The acoustic model 1430 may generate acoustic features (acoustic parameters) from the extracted linguistic features. Among the acoustic parameters, the parameters representing the LSF may be converted 1440 to LPC. A linear prediction synthesis filter 1470 may be implemented based on the converted LPC. The acoustic parameters may be composed 1450 as auxiliary features, and the auxiliary features may be input to the ExcitNet model (i.e., the ExcitNet vocoder) 1460. The ExcitNet model 1460 may estimate the excitation signal based on the input auxiliary features (i.e., the acoustic parameters). The target speech signal may be generated by applying the linear prediction synthesis filter 1470, based on the converted LPC, to the estimated excitation signal. In the example shown in the figure, the generated target speech signal and the estimated excitation signal may each be denoted by their own symbol.

以上、図１～１０ｃを参照しながら説明した技術的特徴についての説明は、図１１～１４に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features given above with reference to FIGS. 1 to 10c also applies as-is to FIGS. 11 to 14, so redundant description is omitted.

図１５は、一例における、訓練過程／合成音声信号の生成過程で取得した負の対数尤度（ＮｅｇａｔｉｖｅＬｏｇ－Ｌｉｋｅｌｉｈｏｏｄ：ＮＬＬ）の音響パラメータとして、励起の周期性によって区分されるパラメータの使用の可否による差を示したグラフである。 FIG. 15 is a graph showing, in one example, how the negative log-likelihood (NLL) obtained in the training process and in the synthesized speech signal generation process differs depending on whether parameters distinguished by the periodicity of the excitation are used as acoustic parameters.

訓練（ｔｒａｉｎｉｎｇ）過程において、ＮＬＬが低いほどモデリングの正確度が高いと見ることができる。図に示したグラフでは、上述したＳＥＷパラメータおよびＲＥＷパラメータのようなＩＴＦＴＥパラメータを使用した場合のＮＬＬは、そうでない場合よりも低くなることを確認することができる。 In the training process, it can be seen that the lower the NLL, the higher the modeling accuracy. In the graphs shown, it can be seen that the NLL is lower when ITFTE parameters, such as the SEW and REW parameters mentioned above, are used than otherwise.

また、合成音声信号の検証（ｖａｌｉｄａｔｉｏｎ）過程においても、ＮＬＬが低いほど生成される合成音声の品質が優れると見なすことができる。図に示したグラフでは、ＳＥＷパラメータおよびＲＥＷパラメータのようなＩＴＦＴＥパラメータを使用した場合のＮＬＬが、そうでない場合よりも低くなることを確認することができる。 Likewise, in the validation process of the synthesized speech signal, a lower NLL can be taken to indicate better quality of the generated synthesized speech. The graph shows that the NLL is lower when ITFTE parameters such as the SEW and REW parameters are used than when they are not.

言い換えれば、図に示したグラフから、ニューラルボコーダの訓練においてＩＴＦＴＥパラメータを使用することによって励起信号の確率分布のモデリングのエラーを大きく減らすことができ、合成音声の生成のための励起信号の推定でＩＴＦＴＥパラメータを使用することによって合成音声信号の生成におけるエラーを大きく減らすことができるという事実を確認することができる。 In other words, the graph confirms that using the ITFTE parameters in training the neural vocoder can greatly reduce the error in modeling the probability distribution of the excitation signal, and that using the ITFTE parameters in estimating the excitation signal for synthesized speech generation can greatly reduce the error in generating the synthesized speech signal.
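The NLL plotted in FIG. 15 can be made concrete with a small calculation. The sketch below assumes that the model outputs, for each time step, a categorical distribution over quantized excitation levels (as WaveNet-style models commonly do) and that the NLL is the average negative log-probability assigned to the observed level. The text does not confirm the exact output distribution, and the logits, the four-level quantization, and the function names are hypothetical.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [x / total for x in exps]

def mean_nll(logits_per_step, targets):
    """Average negative log-likelihood of the observed quantized samples."""
    total = 0.0
    for logits, target in zip(logits_per_step, targets):
        total -= math.log(softmax(logits)[target])
    return total / len(targets)

# Hypothetical 4-level example with the same observed samples in both cases.
targets = [2, 1, 3]

# A model with no useful conditioning spreads probability evenly.
uninformed = [[0.0, 0.0, 0.0, 0.0]] * 3

# A better-conditioned model concentrates probability on the observed level.
informed = [
    [0.0, 0.0, 3.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]

nll_uninformed = mean_nll(uninformed, targets)
nll_informed = mean_nll(informed, targets)
```

The uniform predictor scores log(4) per sample, and any model that places more probability on the observed samples scores lower, which is the sense in which a lower curve in FIG. 15 indicates more accurate modeling.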

以上、図１～１４を参照しながら説明した技術的特徴についての説明は、図１５に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features described above with reference to FIGS. 1 to 14 can also be applied to FIG. 15 as it is, so redundant description will be omitted.

図１６は、一例における、複数の話者からの音声信号に対し、音声信号の話者従属的な特徴と話者独立的な特徴を示した図である。図１７は、一例における、複数の話者からの音声データセットを訓練させることによって構築されたソースモデルと、ターゲット話者からの音声データセットを訓練させることによって構築された話者適応型モデルを使用してターゲット話者の合成音声を生成する方法を示している。 FIG. 16 is a diagram illustrating speaker-dependent features and speaker-independent features of speech signals for speech signals from a plurality of speakers in one example. FIG. 17 shows, in one example, a source model constructed by training speech datasets from multiple speakers and a speaker-adaptive model constructed by training speech datasets from a target speaker. shows how to use it to generate synthesized speech for a target speaker.

図１６に示すように、話者独立的な特徴は、話者（話者１～３）の音声で共通する特徴であってよい。言い換えれば、話者独立的な特徴は、話者ごとに区分されない、グローバル特性を示してよい。話者従属的な特徴は、話者ごとの固有の特性を示してよい。 As shown in FIG. 16, speaker-independent features may be features that are common to the speech of speakers (speakers 1-3). In other words, speaker-independent features may indicate global characteristics that are not segmented by speaker. Speaker-dependent features may indicate characteristics specific to each speaker.

図１７に示すように、複数の話者からの音声データセットを話者独立的に訓練することによってソースモデル６１０が構築されてよく、このようなソースモデル６１０からの加重値に基づいてターゲット話者からの音声データセットを訓練することにより、ターゲット話者に従属的な話者適応型モデル６２０が構築されてよい。ソースモデル６１０からの加重値は、話者適応型モデル６２０でターゲット話者からの音声データセットが訓練されるにより、ターゲット話者の固有の特性を反映するように微調整されてよい。図に示すように、ソースモデル６１０および話者適応型モデル６２０は、ＥｘｃｉｔＮｅｔモデルを使用して実現されてよい。図に示すように、実施形態によっては、ニューラルボコーダに対して話者適応（ｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎ）アルゴリズムを適用してよい。図には示してはいないが、ソースモデル６１０に対応する音響モデル（例えば、ＤＮＮＴＴＳ）に対しても同じように話者適応アルゴリズムが適用されてよい。 As shown in FIG. 17, the source model 610 may be built by speaker-independent training on speech datasets from multiple speakers, and the speaker-adaptive model 620, which is dependent on the target speaker, may be built by training on the speech dataset from the target speaker starting from the weights of that source model 610. As the speaker-adaptive model 620 is trained on the speech dataset from the target speaker, the weights from the source model 610 may be fine-tuned to reflect the characteristics unique to the target speaker. As shown in the figure, the source model 610 and the speaker-adaptive model 620 may be implemented using the ExcitNet model. As shown in the figure, depending on the embodiment, a speaker adaptation algorithm may be applied to the neural vocoder. Although not shown in the figure, the speaker adaptation algorithm may be applied in the same way to the acoustic model (e.g., DNN TTS) corresponding to the source model 610.

以上、図１～１５を参照しながら説明した技術的特徴についての説明は、図１６および図１７に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features given above with reference to FIGS. 1 to 15 also applies as-is to FIGS. 16 and 17, so redundant description is omitted.

図１８および図１９は、一例における、話者適応（ｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎ）アルゴリズムの適用の可否によって生成された合成音声信号の品質を比較評価した結果を示した図である。 18 and 19 are diagrams showing results of comparative evaluation of the quality of synthesized speech signals generated depending on whether or not the speaker adaptation algorithm is applied in one example.

図１８および図１９のＳｃｏｒｅは、評価者が音声信号を聞き取って評価したスコアの平均を示している。ここで、ＲＡＷは、原本音声信号に該当してよい。 Score in FIGS. 18 and 19 indicates the average of the scores evaluated by the evaluator by listening to the voice signal. Here, RAW may correspond to an original audio signal.

図１８を参照すると、ＷａｖｅＮｅｔモデルおよびＥｘｃｉｔＮｅｔモデルの両方で話者適応アルゴリズムを適用した場合の合成音声信号の品質が高く評価されたことを確認することができる。言い換えれば、図６～８を参照しながら説明したように、話者適応型モデル６２０を構築して合成音声信号を生成する場合（ｗ／ｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎ）が、そうでない場合（ｗ／ｏｓｐｅａｋｅｒａｄａｐｔａｔｉｏｎ）に比べて優れた性能を示すことを確認することができる。 Referring to FIG. 18, it can be confirmed that the quality of the synthesized speech signal when applying the speaker adaptation algorithm was highly evaluated in both the WaveNet model and the ExcitNet model. In other words, as described with reference to FIGS. 6-8, if the speaker-adaptive model 620 is built to generate a synthesized speech signal (w/ speaker adaptation), otherwise (w/o speaker adaptation ) can be confirmed to exhibit superior performance.

図１９は、合成音声信号の品質を比較評価した、より詳細な結果を示した図である。図１９については、以下でさらに詳しく説明する。 FIG. 19 is a diagram showing more detailed results of comparative evaluation of the quality of synthesized speech signals. FIG. 19 is described in more detail below.

以上、図１～１７を参照しながら説明した技術的特徴についての説明は、図１８および図１９に対してもそのまま適用可能であるため、重複する説明は省略する。 The description of the technical features described above with reference to FIGS. 1 to 17 can also be applied to FIGS. 18 and 19 as they are, and redundant description will be omitted.

以下では、図１～５を参照しながら説明したＥｘｃｉｔＮｅｔモデルについてより詳しく説明し、他のモデルとの比較実験結果についてさらに説明する。 In the following, the ExcitNet model described with reference to FIGS. 1 to 5 will be described in more detail, and the results of comparison experiments with other models will be further described.

ＥｘｃｉｔＮｅｔモデル（ＥｘｃｉｔＮｅｔボコーダ）は、統計的パラメトリック音声合成（ＳＰＳＳ）システムのためのＷａｖｅＮｅｔに基づくニューラル励起モデルであってよい。ＷａｖｅＮｅｔに基づくニューラルボコーダシステムは、合成音声信号の認識品質を大きく向上させるが、音声信号の複雑な時変特性を捕捉できない場合があるためノイズを出力する場合がある。ＥｘｃｉｔＮｅｔに基づくニューラルボコーダシステムは、音声信号からスペクトル成分を分離する適応的エンボスフィルタを使用して（例えば、ＷａｖｅＮｅｔフレームワーク内で）残渣成分（すなわち、励起信号）を分離して訓練することができ、合成音声信号を生成するにあたり励起信号をターゲットとして推定することができる。このような方式により、ディープラーニングフレームワークによって音声信号のスペクトル成分がより適切に表現されるようになり、残渣成分はＷａｖｅＮｅｔフレームワークによって効率的に生成されるため、合成された音声信号の品質を向上することができる。 The ExcitNet model (ExcitNet vocoder) may be a WaveNet-based neural excitation model for the Statistical Parametric Speech Synthesis (SPSS) system. Although WaveNet-based neural vocoder systems greatly improve the recognition quality of synthesized speech signals, they may not capture the complex time-varying characteristics of speech signals and may output noise. A neural vocoder system based on ExcitNet can be trained (e.g., within the WaveNet framework) to isolate the residual component (i.e., the excitation signal) using an adaptive embossing filter that isolates the spectral component from the speech signal. , the excitation signal can be estimated as a target in generating the synthesized speech signal. Such schemes allow the spectral content of speech signals to be better represented by the deep learning framework, and the residual components are efficiently generated by the WaveNet framework, thus improving the quality of the synthesized speech signal. can be improved.

The experiments below also show that ExcitNet-based neural vocoder systems, trained both speaker-dependently and speaker-independently, outperform conventional linear prediction vocoders and WaveNet vocoders.

For the experiments, three phonetically and prosodically rich speech corpora were used to train the acoustic model and the speaker-dependent (SD) ExcitNet vocoder. Each corpus was recorded by a professional Korean female speaker (KRF) and a professional Korean male speaker (KRM). The speech signals were sampled at 24 kHz and each sample was quantized with 16 bits. Table 1 below shows the number of utterances in each set. To train the speaker-independent (SI) ExcitNet vocoder, a speech corpus recorded by five Korean female speakers and five Korean male speakers was used. A total of 6,422 utterances (10 hours) and 1,080 utterances (1.7 hours) were used for training and validation, respectively. Speech samples recorded by the same KRF and KRM speakers not included in the SI dataset were used for testing.

Tables 2 and 3 below show the results of the objective tests: the distortion between the original speech and the generated speech, measured as log-spectral distance (LSD, in dB) and F0 root mean square error (F0 RMSE, in Hz), respectively. WN denotes the WaveNet vocoder, WN-NS denotes the WaveNet vocoder with a noise-shaping method applied, and ExcitNet denotes the ExcitNet vocoder. The lowest errors are shown in bold. The results in Tables 2 and 3 may have been measured on voiced regions.
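As a rough illustration of the two objective measures named above, the following sketch computes a log-spectral distance and an F0 RMSE. The exact definitions used in the experiments are not given in this description, so the formulas here are common textbook forms and should be read as assumptions: LSD is taken as the per-frame RMS difference of log-magnitude spectra in dB, averaged over frames, and F0 RMSE is computed only over frames that are voiced (F0 > 0) in both trajectories. All input values are hypothetical.

```python
import math


def log_spectral_distance(ref_frames, gen_frames):
    """Mean over frames of the RMS log-magnitude difference (dB).

    Each frame is a list of strictly positive magnitude-spectrum values.
    """
    total = 0.0
    for ref, gen in zip(ref_frames, gen_frames):
        squared = [(20.0 * math.log10(r) - 20.0 * math.log10(g)) ** 2
                   for r, g in zip(ref, gen)]
        total += math.sqrt(sum(squared) / len(squared))
    return total / len(ref_frames)


def f0_rmse(ref_f0, gen_f0):
    """RMSE in Hz over frames voiced (F0 > 0) in both trajectories."""
    pairs = [(r, g) for r, g in zip(ref_f0, gen_f0) if r > 0 and g > 0]
    return math.sqrt(sum((r - g) ** 2 for r, g in pairs) / len(pairs))


# Hypothetical magnitude spectra (two frames, three bins) and F0 tracks.
ref_spec = [[1.0, 0.8, 0.5], [0.9, 0.7, 0.4]]
gen_spec = [[1.1, 0.7, 0.5], [0.8, 0.7, 0.5]]
ref_f0 = [100.0, 0.0, 200.0]
gen_f0 = [103.0, 120.0, 204.0]

print(log_spectral_distance(ref_spec, gen_spec))
print(f0_rmse(ref_f0, gen_f0))
```

Restricting F0 RMSE to frames voiced in both signals avoids counting voicing decision errors as pitch errors; that choice is a design assumption of this sketch.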

As shown in Tables 2 and 3, in most SD and SI cases the ExcitNet vocoder yields the lowest distortion between the original speech and the generated speech.

Table 4 below shows the LSD (dB) measured on unvoiced and transition regions.

As shown in Table 4, in all SD and SI cases the ExcitNet vocoder yields the lowest distortion between the original speech and the generated speech.

Tables 5 and 6 below show the results of the subjective tests, namely the preference test results (%). The entries that listeners preferred most are shown in bold. The perceptual quality of the synthesized speech is markedly better for the ExcitNet vocoder than for the others. The evaluators were 12 native Korean-speaking listeners, and the tests used 20 randomly selected utterances.

FIG. 20 shows, for one example, the mean opinion score (MOS) evaluation results for the ExcitNet vocoder and other vocoders.
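For readers unfamiliar with the measure, a MOS is simply the mean of listener ratings. The sketch below assumes a 5-point rating scale and adds an approximate 95% confidence half-width based on a normal approximation; neither the scale nor the confidence interval is stated in this description, and the ratings are hypothetical.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (MOS, approximate 95% confidence half-width)."""
    mos = statistics.mean(ratings)
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings on an assumed 1-to-5 scale.
ratings = [4, 5, 4, 3, 5, 4, 4, 5]
mos, half_width = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {half_width:.2f}")
```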

Results were evaluated both for analysis/synthesis (A/S), where acoustic features are extracted from recorded speech, and for SPSS, where acoustic features are generated by an acoustic model.

In A/S, the SI-ExcitNet vocoder performed similarly to the ITFTE vocoder but far better than the WORLD system. In all cases, the SD-ExcitNet vocoder showed the highest perceptual quality (4.35 and 4.47 MOS for the KRF and KRM speakers, respectively). Because high-pitched female voices are difficult to represent, the MOS results for the KRF speaker were worse than those for the KRM speaker with the SI vocoders (WORLD, ITFTE, and SI-ExcitNet). In contrast, the SD-ExcitNet results for the KRF speaker were similar to those for the KRM speaker, which indicates that each speaker's characteristics must be modeled to represent high-pitched voices effectively. In SPSS, both the SD-ExcitNet and SI-ExcitNet vocoders showed much better perceptual quality than the parametric ITFTE vocoder. Although the acoustic model generated overly smooth speech parameters, the ExcitNet vocoder was able to mitigate the smoothing effect by directly estimating the time-domain excitation signal. As a result, the SPSS system using the SD-ExcitNet vocoder achieved 3.78 and 3.85 MOS for the KRF and KRM speakers, respectively, and the system using the SI-ExcitNet vocoder achieved 2.91 and 2.89 MOS for the KRF and KRM speakers, respectively.

In the following, the neural vocoder that builds the speaker-adaptive model 620 described with reference to FIGS. 6 to 8 is described in more detail, along with the results of comparison tests against other models. The neural vocoder of the embodiments can build a high-quality speech synthesis system even when training data from the target speaker is insufficient, for example when only a 10-minute speech dataset is available.

To address the lack of target-speaker information caused by limited training data for the target speaker, the neural vocoder of the embodiments uses the weights of a speaker-independently trained source model 610, which extracts characteristics common to multiple speakers. The weights of the source model 610 are used to initialize the training of the speaker-adaptive model 620 and may be fine-tuned to represent the unique characteristics of the target speaker. Through this adaptation process the deep neural network can capture the characteristics of the target speaker, which reduces the discontinuity problems that occur with speaker-independent models. The experimental results described below also show that the neural vocoder of the embodiments markedly improves the perceptual quality of synthesized speech compared with conventional methods.
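The idea of initializing from source-model weights and then fine-tuning on a small amount of target-speaker data can be sketched with a deliberately tiny stand-in: a two-parameter linear model trained by gradient descent takes the place of the neural vocoder. The data, learning rate, and step counts are all hypothetical and chosen only to make the effect visible; this is not the training procedure of the embodiments.

```python
def train(weights, data, steps, lr):
    """Full-batch gradient descent on mean squared error for y = w * x + b."""
    w, b = weights
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(weights, data):
    """Mean squared error of the linear model on (x, y) pairs."""
    w, b = weights
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


# Hypothetical pooled "multi-speaker" data for the source (SI) model.
source_data = [(x, slope * x + 0.4)
               for slope in (1.8, 2.0, 2.2)
               for x in (0.0, 0.5, 1.0, 1.5, 2.0)]
# Hypothetical small target-speaker training set and held-out test set.
target_train = [(x, 2.3 * x + 0.5) for x in (0.0, 1.0, 2.0)]
target_test = [(x, 2.3 * x + 0.5) for x in (0.5, 1.5)]

# Source model: long training on plentiful multi-speaker data.
source_weights = train((0.0, 0.0), source_data, steps=2000, lr=0.05)
# SD: train from scratch on the small target set with a short budget.
sd_weights = train((0.0, 0.0), target_train, steps=20, lr=0.05)
# SA: same budget, but initialized from the source weights.
sa_weights = train(source_weights, target_train, steps=20, lr=0.05)

sd_loss = mse(sd_weights, target_test)
sa_loss = mse(sa_weights, target_test)
print(f"SD loss: {sd_loss:.5f}, SA loss: {sa_loss:.5f}")
```

With the same small training budget, the adapted model starts close to a good solution and ends with a lower held-out error than the model trained from scratch, which mirrors the motivation given above.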

SD denotes a model trained speaker-dependently (without using the weights of the source model 610 as initial values), SI denotes a model trained speaker-independently, and SA denotes a model trained speaker-adaptively as described with reference to FIGS. 6 to 8 (i.e., a model trained speaker-dependently with the weights of the source model 610 as initial values).

For the SD and SA models, a speech corpus recorded by a Korean female speaker was used. The speech signals were sampled at 24 kHz and each sample was quantized with 16 bits. A total of 90 utterances (10 minutes), 40 utterances (5 minutes), and 130 utterances (15 minutes) were used for training, validation, and testing, respectively. To train the SI model, speech data recorded by five Korean male speakers and five Korean female speakers, not included in the SD and SA model training, was used. For this purpose, 6,422 utterances (10 hours) and 1,080 utterances (1.7 hours) were used for training and validation, respectively. The test set of the SD and SA models was also used to evaluate the SI model.

Tables 7 and 8 below show the results of the objective tests: the distortion between the original speech and the generated speech, measured as log-spectral distance (LSD, in dB) and F0 root mean square error (F0 RMSE, in Hz), respectively. Table 7 shows the evaluation of the analysis/synthesis (A/S) results, where acoustic features extracted from recorded speech are used directly to construct the auxiliary features. Table 8 shows the evaluation of the SPSS results. The lowest errors are shown in bold.

Tables 7 and 8 show that, for both the WaveNet vocoder and the ExcitNet vocoder, the SA models yield the lowest distortion between the original speech and the generated speech.

FIG. 21 shows, for one example, how the performance of the neural vocoder that builds the speaker-adaptive model changes as the F0 scaling factor is varied.

To verify the effectiveness of the SA training method of the embodiments, the change in neural vocoder performance was investigated when F0 was modified manually. The SI model has been shown to be effective at generating pitch-modified synthesized speech. Because the SA model also builds on the SI model, it is expected to perform better than the SD approach.

In the test, the F0 trajectory was generated by the SPSS framework and then multiplied by a scaling factor (0.6, 0.8, 1.0, or 1.2) to modify the auxiliary feature vectors. Speech signals were then synthesized by the neural vocoder system.
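The F0 modification step can be pictured with the following minimal sketch, which multiplies a generated F0 trajectory by a scaling factor before it is placed in the auxiliary features. The convention that unvoiced frames are marked with F0 = 0 and left unchanged is an assumption of this sketch, and the trajectory values are hypothetical.

```python
def scale_f0(f0_trajectory, factor):
    """Scale voiced frames by `factor`; unvoiced frames (F0 == 0) stay 0."""
    return [f0 * factor if f0 > 0 else 0.0 for f0 in f0_trajectory]


# Hypothetical F0 trajectory in Hz (0.0 marks an unvoiced frame).
f0 = [0.0, 200.0, 210.0, 0.0, 190.0]

for factor in (0.6, 0.8, 1.0, 1.2):
    modified = scale_f0(f0, factor)
    print(factor, [round(value, 1) for value in modified])
```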

FIG. 21 shows the F0 RMSE (Hz) test results for the different F0 scaling factors. FIG. 21 confirms that the SA model has a much lower modification error than the conventional SD model. It also confirms that the SA-ExcitNet vocoder maintains quality comparable to that of the SI model, even though all of its weights are optimized to fit the characteristics of the target speaker.

The ExcitNet vocoder can also be seen to perform far better than the WaveNet vocoder. Because the ExcitNet vocoder is trained on variations in vocal cord movement (variations in the excitation signal), it is considered able to reconstruct F0-modified speech segments more flexibly than the WaveNet-based approach.

FIG. 19 shows the subjective test results, namely the MOS evaluation results for the SD, SI, and SA vocoders. Results were evaluated both for analysis/synthesis (A/S), where acoustic features are extracted from recorded speech, and for SPSS, where acoustic features are generated by an acoustic model.

In the A/S results, the SD-WaveNet vocoder gave the worst results, because the characteristics of the target speaker cannot be learned from the limited training data. The SI-WaveNet vocoder performed similarly to the ITFTE vocoder and better than the WORLD system. Among the WaveNet vocoders, using SA was confirmed to give superior performance. The results for the ExcitNet vocoders showed a similar trend to those for the WaveNet vocoders, but the ExcitNet vocoders performed much better overall, because separating the formant components of the speech signal with an LP inverse filter improves the modeling accuracy of the residual signal. As a result, the SA-ExcitNet vocoder achieved 4.40 MOS in the A/S results.

In the SPSS results, the SI-WaveNet and SI-ExcitNet vocoders provided better perceptual quality than the parametric ITFTE vocoder. The SA training method of the embodiments was confirmed to greatly improve the quality of the synthesized speech signal compared with the conventional speaker-dependent and speaker-independent methods. As in the A/S results, the ExcitNet vocoder outperformed the WaveNet vocoder in the SPSS results. Although the acoustic model generated overly smooth speech parameters, the ExcitNet vocoder was able to mitigate the smoothing effect by directly estimating the time-domain excitation signal. As a result, the SPSS system with the SA-ExcitNet vocoder achieved 3.77 MOS.

The apparatus described above may be implemented by hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications that run on the OS. The processing device may also access, record, manipulate, process, and generate data in response to the execution of software. For ease of understanding, a single processing device is sometimes described, but those skilled in the art will understand that the processing device may include multiple processing elements and/or multiple types of processing elements. For example, the processing device may include multiple processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, computer recording medium, or device in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over computer systems connected by a network and may be recorded or executed in a distributed manner. Software and data may be recorded on one or more computer-readable recording media.

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. Here, the medium may store the computer-executable program persistently, or may store it temporarily for execution or download. The medium may be any of various recording or storage means in the form of a single piece of hardware or a combination of several pieces of hardware; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by application stores that distribute applications, or by sites and servers that supply or distribute various other software.

Although the embodiments have been described above with reference to limited embodiments and drawings, those skilled in the art will be able to make various modifications and variations based on the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that described, and/or if components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that described, or are replaced or substituted by other components or equivalents.

Therefore, other embodiments that are equivalent to the claims also fall within the scope of the appended claims.

Claims

1. A speech signal generation method performed by a computer-implemented neural vocoder, the method comprising:
obtaining, based on an input text or speech signal, a plurality of acoustic parameters including spectrum-related parameters and excitation-related parameters classified by the periodicity of the excitation;
estimating an excitation signal based on the plurality of acoustic parameters; and
generating a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters,
wherein the neural vocoder has been trained in advance to estimate an excitation signal using a training speech signal for training the neural vocoder,
wherein the estimating of the excitation signal uses the pre-trained neural vocoder to estimate the excitation signal based on the plurality of acoustic parameters,
wherein the neural vocoder has been trained by steps comprising:
separating the excitation signal from the training speech signal by applying a linear prediction analysis filter to the training speech signal; and
modeling a probability distribution of the separated excitation signal, and
wherein the estimating of the excitation signal uses the modeled probability distribution of the excitation signal to estimate the excitation signal for the plurality of acoustic parameters.

2. The speech signal generation method according to claim 1,
wherein the excitation-related parameters include a first excitation parameter indicating excitation at or below a predetermined cutoff frequency and a second excitation parameter indicating excitation above the cutoff frequency.

3. The speech signal generation method according to claim 2,
wherein the first excitation parameter indicates a harmonic spectrum of the excitation and the second excitation parameter indicates the remaining part of the excitation.

4. The speech signal generation method according to claim 1,
wherein the spectrum-related parameters include a frequency parameter indicating the pitch of the speech signal, an energy parameter indicating the energy of the speech signal, a parameter indicating whether the speech signal is voiced or unvoiced, and a parameter indicating the line spectral frequency (LSF) of the speech signal.

5. The speech signal generation method according to claim 4,
wherein the generating of the target speech signal comprises:
converting the parameter indicating the LSF into a linear predictive code (LPC); and
applying, to the estimated excitation signal, the linear synthesis filter based on the converted LPC.

6. The speech signal generation method according to claim 1,
wherein the plurality of acoustic parameters are generated by an acoustic model based on the input text or the input speech signal.

7. The speech signal generation method according to claim 1,
wherein the separating of the excitation signal comprises:
converting a parameter indicating the line spectral frequency (LSF) of the input speech signal into a linear predictive code (LPC); and
applying, to the input speech signal, the linear prediction analysis filter based on the converted LPC of the input speech signal.

8. The speech signal generation method according to claim 1,
wherein the separated excitation signal is a residual component of the input speech signal.

9. A computer-implemented method of training a neural vocoder, the method comprising:
receiving a training speech signal for training the neural vocoder;
extracting, from the training speech signal, a plurality of acoustic parameters including spectrum-related parameters and excitation-related parameters classified by the periodicity of the excitation;
separating an excitation signal from the training speech signal by applying, to the training speech signal, a linear prediction analysis filter based on at least one of the spectrum-related parameters; and
modeling a probability distribution of the separated excitation signal,
wherein the neural vocoder, having been trained by the above steps, estimates an excitation signal based on other acoustic parameters, including other spectrum-related parameters and other excitation-related parameters, obtained based on an input text or speech signal, and
wherein the neural vocoder uses the modeled probability distribution of the excitation signal to estimate the excitation signal for the other acoustic parameters.

10. The method of training a neural vocoder according to claim 9,
wherein the separating of the excitation signal comprises:
converting, among the spectrum-related parameters, a parameter indicating the line spectral frequency (LSF) of the input speech signal into a linear predictive code (LPC); and
applying, to the input speech signal, the linear prediction analysis filter based on the converted LPC of the input speech signal.

11. The method of training a neural vocoder according to claim 9,
wherein the excitation-related parameters include a first excitation parameter indicating excitation at or below a predetermined cutoff frequency and a second excitation parameter indicating excitation above the cutoff frequency.

12. A neural vocoder comprising:
a parameter acquisition unit configured to obtain, based on an input text or speech signal, a plurality of acoustic parameters including spectrum-related parameters and excitation-related parameters classified by the periodicity of the excitation;
an excitation signal estimation unit configured to estimate an excitation signal based on the plurality of acoustic parameters; and
a speech signal generation unit configured to generate a target speech signal by applying, to the estimated excitation signal, a linear synthesis filter based on at least one of the spectrum-related parameters,
wherein the neural vocoder has been trained in advance to estimate an excitation signal using a training speech signal for training the neural vocoder,
wherein an excitation signal separation unit of the pre-trained neural vocoder estimates an excitation signal based on the plurality of acoustic parameters,
wherein the neural vocoder further comprises:
the excitation signal separation unit, configured to separate an excitation signal from the training speech signal by applying a linear prediction analysis filter to the training speech signal; and
a modeling unit configured to model a probability distribution of the separated excitation signal, and
wherein the excitation signal estimation unit uses the modeled probability distribution of the excitation signal to estimate the excitation signal for the plurality of acoustic parameters.

13. The neural vocoder according to claim 12,
wherein the speech signal generation unit includes a conversion unit configured to convert, among the spectrum-related parameters, a parameter indicating the line spectral frequency (LSF) of the speech signal into a linear predictive code (LPC), and
applies, to the estimated excitation signal, the linear synthesis filter based on the converted LPC.