KR20200088263A - Method and system of text to multiple speech

Info

Publication number: KR20200088263A
Application number: KR1020200087643A
Authority: KR
Inventors: 이수영; 이영근
Original assignee: 한국과학기술원
Priority date: 2018-05-29
Filing date: 2020-07-15
Publication date: 2020-07-22
Also published as: KR102272554B1

Abstract

The present invention relates to a method and system for text-to-multiple-speech conversion. More specifically, the method comprises the steps of: generating a character embedding vector, character by character, from a sentence for which speech is to be generated; generating a speech spectrum from the embedding vector; and outputting the speech spectrum as speech. The speech spectrum generation step receives multiple-voice data, converts it into a speech embedding vector, and uses the speech embedding vector in speech spectrum merging. The method and system can produce a new voice from a small amount of data.

Description

Method and system of text to multiple speech

The present invention relates to a method and system for text-to-multiple-speech conversion.

Speech is the most natural means of human communication and of conveying information; as the means by which language is realized, it is the meaningful sound that humans produce.

Attempts to implement speech-based communication between humans and machines have advanced steadily over time. More recently, speech information technology (SIT), the field concerned with processing speech information effectively, has made remarkable progress and is increasingly being applied in everyday life.

Speech information technology can be broadly classified into categories such as speech recognition, speech synthesis, speaker identification and verification, and speech coding.

Speech recognition is a technology that recognizes spoken speech and converts it into a character string. Speech synthesis is a technology that converts a character string back into speech using data or parameters obtained from speech analysis. Speaker identification and verification is a technology that estimates or authenticates the speaker from spoken speech, and speech coding is a technology that effectively compresses and encodes a speech signal.

Looking at how speech synthesis technology developed, most early speech synthesis adopted structures that imitated the human vocal organs using mechanical devices or electronic circuits. For example, in the 18th century Wolfgang von Kempelen devised a bellows-driven speech synthesis machine with a mouth and nostrils made of rubber that could imitate changes in the vocal tract. The field later moved to speech synthesis based on electrical analysis, and in the 1930s Dudley introduced an early form of the vocoder.

Today, owing to the rapid development of computers, computer-based speech synthesis has become the mainstream, and various approaches have been developed, including system-model approaches (such as articulatory synthesis) and signal-model approaches (such as rule-based formant synthesis or unit concatenation synthesis).

Speech synthesis technology can be broadly divided into two types according to how it is applied: limited-vocabulary synthesis, or automatic response systems (ARS), which synthesize only sentences with a limited vocabulary and syntactic structure; and unlimited-vocabulary synthesis, or text-to-speech (TTS) systems, which receive arbitrary sentences and synthesize speech for them.

Of these, a text-to-speech (TTS) system generates speech for an arbitrary sentence using small synthesis speech units and language processing. Language processing maps the input sentence to a combination of suitable synthesis units and extracts suitable intonation and duration from the sentence to determine the prosody of the synthesized sound. Because speech is synthesized by combining basic units of language such as phonemes and syllables, there is no restriction on the vocabulary that can be synthesized, and the approach is applied mainly to text-to-speech (TTS) devices and context-to-speech (CTS) devices.

Conventional speech synthesis techniques are inefficient because generating a new person's voice requires collecting a large amount of data and then spending time training a model on that data.

An object of the present invention is to provide a text-to-multiple-speech conversion method and system that, when a plurality of voices is implemented with an existing text-to-speech method and system, shortens the time consumed in producing speech and shortens the preliminary data collection step for synthesizing the plurality of voices.

To achieve the above object, a text-to-multiple-speech conversion method according to an embodiment of the present invention includes generating a character embedding vector, character by character, from a sentence for which speech is to be generated; generating a speech spectrum from the embedding vector; and outputting the speech spectrum as speech. The speech spectrum generation step may receive multiple-voice data, convert it into a speech embedding vector, and use the vector in speech spectrum merging.

According to an embodiment of the present invention, the multiple-voice data may be a speaker ID in the form of a one-hot vector.

According to an embodiment of the present invention, the multiple-voice data may be data in the form of a log-mel-spectrogram.

According to an embodiment of the present invention, the log-mel-spectrogram may be converted into a speech embedding vector by passing through an embedder network.

According to an embodiment of the present invention, the speech embedding vector may be input to the decoder RNN and the attention RNN of the spectrum merging unit.

To achieve the above object, a text-to-multiple-speech conversion system according to an embodiment of the present invention includes a text-to-multiple-speech conversion module that outputs, from text, speech of that text. The text-to-speech conversion module includes an encoder that generates a character embedding vector, character by character, from a sentence for which speech is to be generated; a decoder that generates a speech spectrum from the embedding vector; and a speech output unit that outputs the speech spectrum as speech. The decoder may receive multiple-voice data, convert it into a speech embedding vector, and use the vector in speech spectrum merging.

According to an embodiment of the present invention, the speech embedding vector may be input to the decoder RNN and the attention RNN of the spectrum merging unit.

With the text-to-speech conversion method of the present invention, a new voice can be implemented from only a small amount of data, and text-to-speech conversion is possible without training a separate model.

In addition, by receiving IDs corresponding to a plurality of voices together with the text information, the time required for speech conversion in multi-party conversations and the like can be shortened.

Furthermore, because speech from many speakers is synthesized from only a small amount of data by representing the data for the plurality of voices as log-mel-spectrograms, model training time can be reduced, and the time and cost of preparing data in advance can be decreased.

FIG. 1 is a schematic diagram showing a general text-to-speech conversion method.
FIG. 2 is a schematic diagram showing a gated recurrent unit (GRU) according to an embodiment of the present invention.
FIG. 3 is a flowchart of a text-to-multiple-speech conversion method according to an embodiment of the present invention.
FIG. 4 is a block diagram of an RNN encoder-decoder network according to an embodiment of the present invention.
FIG. 5 is a block diagram of an RNN encoder-decoder network with a content-based attention mechanism according to an embodiment of the present invention.
FIG. 6 is a schematic diagram of Tacotron.
FIG. 7 shows graphs of an ideal attention alignment and of the attention alignment problems that occur in practice.
FIG. 8 is a schematic diagram of the CBHG module.
FIG. 9 is a graph showing the parameters of Tacotron according to an embodiment of the present invention.
FIG. 10 is a schematic diagram of a text-to-multiple-speech Tacotron according to an embodiment of the present invention.
FIG. 11 is a schematic diagram of a speech embedding prediction network according to an embodiment of the present invention.
FIG. 12 is a table showing the voice profiles in a test set according to an embodiment of the present invention.
FIG. 13 is a parameter table of a multi-voice Tacotron according to an embodiment of the present invention.
FIG. 14 is an example of a generated speech spectrogram according to an embodiment of the present invention.
FIG. 15 is a graph showing speech embedding results according to a theoretical element according to an embodiment of the present invention.
FIG. 16 is a graph showing spectrograms as a theoretical element is varied according to an embodiment of the present invention.
FIG. 17 is a graph showing a spectrogram produced by mixed voices according to an embodiment of the present invention.

A text-to-speech conversion method according to an embodiment of the present invention receives text input and synthesizes and outputs spoken speech of the corresponding content.

That is, when the text of the content to be spoken is input, the method outputs a speech signal that reads it aloud.

Hereinafter, the text-to-speech conversion method according to an embodiment of the present invention is described step by step in more detail with reference to the drawings.

The text-to-speech conversion method according to an embodiment of the present invention includes generating, from text, a speech spectrum corresponding to each sub-band.

The method divides the text by band and generates a separate speech spectrum for each of the band-divided signals. Compared with the conventional method, which generates a single speech spectrum for the input text without distinguishing bands, this can significantly reduce the amount of computation and has the advantage that a difficult problem can be solved in a divide-and-conquer manner.

According to an embodiment of the present invention, in the step of generating a speech spectrum corresponding to each sub-band from text, the speech spectra for the sub-bands may be generated simultaneously, in parallel.

In the step of generating the speech spectrum, the signal for each band may be computed using a different method or system, and these methods or systems may be independent of one another. Accordingly, a plurality of text-to-speech systems may be used in this step to generate the speech spectrum corresponding to each band signal, and Tacotron or the WaveNet algorithm may be used as the text-to-speech system.

Here, Tacotron generates a speech spectrum from text as follows.

Tacotron is a sequence-to-sequence model that uses a recurrent neural network (RNN) encoder-decoder. It can be divided into an encoder part, which extracts the necessary information from the text, and a decoder part, which synthesizes speech from the encoded text.

In the encoder part, the input to the encoder network is a character embedding, obtained by decomposing the sentence into characters and representing them as vectors. After passing through a neural network, a text embedding vector (text encoding vector) is produced as the output.

As this neural network, a CBHG module may be used, that is, a neural network in which a convolutional neural network, a highway network, and a bidirectional recurrent neural network are stacked in that order.

In the decoder part, the input to the decoder network at time step t is the combination of the weighted sum of the text embedding vectors (text encoding vectors) and the last decoder output of the previous time step t-1. The decoder output is a mel-scale spectrogram, and r vectors are produced at each time step. Of the r vectors, only the last is used as the decoder input for the next time step. The mel-scale spectrogram vectors, generated r at a time at each time step, are concatenated along the decoder time-step direction to form the mel-scale spectrogram of the entire synthesized utterance, and this spectrogram is converted into a linear-scale spectrogram by an additional neural network. The linear-scale spectrogram is then converted into a waveform by the Griffin-Lim reconstruction algorithm, and writing that waveform to a '.wav' file produces a speech file.
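The frame-feedback loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Tacotron's actual implementation: `run_decoder` and `toy_step` are hypothetical names, the all-zero initial frame is an assumption, and `toy_step` is only a stand-in for the attention and decoder networks so that the data flow is visible.

```python
from typing import Callable, List

Frame = List[float]


def run_decoder(step_fn: Callable[[Frame], List[Frame]],
                n_steps: int, r: int, n_mels: int = 80) -> List[Frame]:
    """Collect r frames per decoder step, feeding back only the last one."""
    prev_frame: Frame = [0.0] * n_mels  # assumed all-zero start frame
    frames: List[Frame] = []
    for _ in range(n_steps):
        step_frames = step_fn(prev_frame)
        if len(step_frames) != r:
            raise ValueError("step_fn must return exactly r frames")
        frames.extend(step_frames)    # concatenate along the time axis
        prev_frame = step_frames[-1]  # only the last of the r frames is fed back
    return frames


def toy_step(prev_frame: Frame, r: int = 3) -> List[Frame]:
    """Hypothetical stand-in for the decoder network: counts up from its input."""
    return [[prev_frame[0] + k + 1.0] * len(prev_frame) for k in range(r)]


# 4 decoder steps with r = 3 yield 12 mel frames of 80 dimensions each.
mel = run_decoder(toy_step, n_steps=4, r=3)
```

With r frames emitted per step, an utterance of a given length needs roughly 1/r as many decoder steps, which is the practical reason for producing several frames at once.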

The step of generating, from text, a speech spectrum corresponding to each sub-band according to an embodiment of the present invention may be implemented using Tacotron.

That is, the output of the Tacotron decoder is r spectrogram vectors. Viewed as a matrix, this is an S x R matrix, where R is the number of spectrogram vectors (r) and S is the size of each spectrogram vector; for the mel-scale spectrogram, an 80-dimensional vector may be used.

The speech spectrum is converted into a linear spectrum, and the linear spectrum is converted into a waveform, which is finally output as speech.

Meanwhile, the step of generating the speech spectrum may include generating the speech spectrum under at least one utterance condition among the speaker's timbre, age, gender, and emotion.

When such an utterance condition is added in this step, a speech signal reflecting the condition can be generated.

The step of outputting the merged spectrum as speech may include converting the speech spectrum into a linear spectrum and converting the linear spectrum into a waveform.

In addition, the present invention provides a text-to-speech conversion system including

a text-to-speech conversion module that generates a signal for each sub-band from text and outputs speech,

wherein the text-to-speech conversion module includes:

a text-to-speech conversion unit that generates, from text, a speech spectrum corresponding to each sub-band;

a spectrum merging unit that merges the speech spectra corresponding to the respective bands generated by the text-to-speech conversion unit; and

a speech output unit that outputs the spectrum merged by the spectrum merging unit as speech.

A text-to-speech conversion system according to an embodiment of the present invention receives text input and synthesizes and outputs spoken speech of the corresponding content.

That is, when the text of the content to be spoken is input, the system outputs a speech signal that reads it aloud.

The text-to-speech conversion system according to an embodiment of the present invention synthesizes speech using an artificial neural network. In synthesizing speech, the speech signal to be synthesized is generated in one or more separate bands, which are then combined, thereby converting text into speech.

The text-to-speech conversion system according to an embodiment of the present invention includes a text-to-speech conversion module that computes the band signals generated by the band division module and outputs speech.

According to an embodiment of the present invention, the sub-bands in the text-to-speech conversion module may include sub-bands whose length is shorter than that of the full band.

According to an embodiment of the present invention, the speech waveform length of a sub-band may be 1/2 or less of the speech waveform length of the full band.

According to an embodiment of the present invention, when the original speech waveform has a length of 100, applying a wavelet transform with four sub-bands divides it into four sub-band signals with lengths {50, 25, 12.5, 12.5}.
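The lengths {50, 25, 12.5, 12.5} are what an idealized dyadic decomposition gives: each level halves the remaining low-frequency band, and the last two bands share the final half. The sketch below only reproduces that arithmetic; `subband_lengths` is a hypothetical helper, and a real wavelet transform produces integer lengths that also depend on the filter length and boundary handling.

```python
from typing import List


def subband_lengths(total_length: float, n_bands: int) -> List[float]:
    """Idealized sub-band lengths for a dyadic decomposition into n_bands bands."""
    if n_bands < 1:
        raise ValueError("n_bands must be at least 1")
    lengths: List[float] = []
    remaining = float(total_length)
    for _ in range(n_bands - 1):
        remaining /= 2.0           # each level halves the remaining band
        lengths.append(remaining)  # detail band produced at this level
    lengths.append(remaining)      # final approximation band
    return lengths


lengths = subband_lengths(100, 4)
```

The lengths always sum to the original length, and every sub-band is at most half the full-band length, consistent with the 1/2 bound stated above.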

Here, the text-to-speech conversion module includes a text-to-speech conversion unit that generates, from text, a speech spectrum corresponding to each sub-band.

According to an embodiment of the present invention, in generating a speech spectrum corresponding to each sub-band from text, the speech spectra for the sub-bands may be generated simultaneously, in parallel.

The text-to-speech conversion unit generates the corresponding speech spectrum for each band.

The speech spectrum for each band may be generated using a different method or system, and these methods or systems may be independent of one another. Accordingly, the text-to-speech conversion unit may use a plurality of text-to-speech systems to generate the speech spectrum corresponding to each band signal, and Tacotron or the WaveNet algorithm may be used as the text-to-speech system.

In addition, the text-to-speech conversion module may include a spectrum merging unit that merges the speech spectra corresponding to the respective bands generated by the text-to-speech conversion unit, and may further include a speech output unit that outputs the spectrum merged by the spectrum merging unit as speech.

Here, the speech output unit may convert the merged speech spectrum into a linear spectrum and convert the linear spectrum into a waveform to output the final speech.

FIG. 1 is a schematic diagram showing a voice user interface.

Referring to FIG. 1, the configuration of an existing voice user interface can be seen. Speech is one of the most basic and efficient human communication tools. People exchange information in everyday life through speech, for example in conversations and presentations. Because speech-based communication is intuitive and convenient for users, some devices use a voice user interface that interacts through speech. A simple way to implement spoken responses in a voice user interface is audio recording. However, a recording can only say what was recorded. Because the device cannot respond to situations that were not prepared for, its flexibility is limited. For example, artificial intelligence (AI) agents such as Apple Siri and Amazon Alexa must be able to use a wide variety of sentences, because user queries can be arbitrary. Recording every possible response for such applications would burden a company in both time and cost. Many researchers have therefore worked to build natural and fast speech synthesis models. In this sense, speech synthesis, also called text-to-speech (TTS), is an important topic because it can generate arbitrary speech.

Speech synthesis technology can add value to AI agents. Recently, and particularly with advances in machine learning, demand for conversational agents and AI assistants has been growing. As depicted in FIG. 1, the framework of a conversational model is formulated as a series of modules. For such human-like agents, not only the back-end algorithm of the dialogue system but also the front-end system response is important. People form their first impression of an agent through its appearance and voice. An agent can benefit from improved naturalness of its speech synthesis model. In this respect, if each AI agent could speak arbitrary sentences in its own voice, the agents would seem more natural and friendlier to users.

FIG. 2 is a schematic diagram showing a gated recurrent unit (GRU) according to an embodiment of the present invention.

Referring to FIG. 2, mapping a sequence to another sequence (not necessarily of the same type) is a widespread problem. For example, speech recognition must map a sequence of waveform samples to a sequence of characters, and speech synthesis must perform the inverse mapping. Machine translation is likewise a mapping from text in one language to text in another. Solving a sequence-to-sequence problem requires generating one sequence from another and knowing how to align the elements of the two sequences. To incorporate context information during the mapping, long-term dependencies must be taken into account. One of the best choices for implementing this kind of network is the recurrent neural network (RNN). An RNN receives sequential input and can accumulate context information in its hidden state. In particular, LSTM and GRU have been shown to retain relevant context information efficiently using their own gating mechanisms (GRU, GRUEMP, LSTM). Specifically, a GRU with a reset gate r and an update gate z can be described by Equation 1 below.

[Equation 1]

Here x, h, t, and j are the input, the hidden state, the time-step index, and the hidden-dimension index, respectively. Applying the sigmoid function as the nonlinearity for the gating variables forces them to take values in (0, 1). FIG. 2 shows the architecture of the GRU. In an embodiment of the present invention, the GRU may be used as the basic unit of the RNN.
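The body of Equation 1 is not reproduced in this text. For reference, a commonly used GRU formulation, written with the symbols x, h, t, j, r, and z defined here, is given below. The weight matrices W, U, W_r, U_r, W_z, U_z, the candidate state \tilde{h}, and the symbol \sigma for the sigmoid function are assumptions taken from that standard formulation, not from the patent text, and some references swap the roles of z_j and 1 - z_j in the last line.

```latex
\begin{aligned}
r_j &= \sigma\big([W_r x_t]_j + [U_r h_{t-1}]_j\big) \\
z_j &= \sigma\big([W_z x_t]_j + [U_z h_{t-1}]_j\big) \\
\tilde{h}_{t,j} &= \tanh\big([W x_t]_j + [U (r \odot h_{t-1})]_j\big) \\
h_{t,j} &= z_j \, h_{t-1,j} + (1 - z_j) \, \tilde{h}_{t,j}
\end{aligned}
```

In this form the reset gate r controls how much of the previous hidden state enters the candidate state, and the update gate z interpolates between the previous hidden state and the candidate.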

FIG. 3 is a flowchart of a text-to-multiple-speech conversion method according to an embodiment of the present invention.

Referring to FIG. 3, the text-to-multiple-speech conversion method according to an embodiment of the present invention may include dividing the full frequency band in which speech is to be generated into a plurality of sub-bands (S110); generating, from text, a speech spectrum corresponding to each sub-band (S120); merging the speech spectra corresponding to the respective bands (S130); and outputting the merged spectrum as speech (S140).

In step S130, multiple-voice data may be received, converted into a speech embedding vector, and used in speech spectrum merging.

The multiple-voice data may be a speaker ID in the form of a one-hot vector.

The multiple-voice data may be data in the form of a log-mel-spectrogram.

The log-mel-spectrogram may be converted into a speech embedding vector by passing through an embedder network.

The speech embedding vector may be input to the decoder RNN and the attention RNN of the spectrum merging unit.
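For the one-hot speaker ID case, converting the ID into a speech embedding vector can be pictured as multiplying the one-hot vector by a learned table, which simply selects one row. The sketch below assumes this lookup-table form; the patent text does not specify the layer, and `one_hot`, `embed`, and the table values are hypothetical.

```python
from typing import List


def one_hot(index: int, size: int) -> List[float]:
    """One-hot speaker ID vector of the given size."""
    if not 0 <= index < size:
        raise ValueError("speaker index out of range")
    return [1.0 if i == index else 0.0 for i in range(size)]


def embed(speaker_vec: List[float], table: List[List[float]]) -> List[float]:
    """Multiply a one-hot row vector by an embedding table (n_speakers x dim)."""
    dim = len(table[0])
    return [sum(s * row[d] for s, row in zip(speaker_vec, table))
            for d in range(dim)]


# Hypothetical table: 4 speakers, 3-dimensional embeddings.
embedding_table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]
speaker_embedding = embed(one_hot(2, 4), embedding_table)
```

A one-hot ID can only select voices seen during training, whereas the log-mel-spectrogram route passes audio through the embedder network, which is what allows a voice to be specified from a small amount of new data.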

도 4는 본 발명의 일 실시 예에 따른 RNN 인코더-디코더 네트워크의 블록도이다.4 is a block diagram of an RNN encoder-decoder network according to an embodiment of the present invention.

도 4를 참조하면, 장기 종속성을 모델링할 수 있는 RNN 기능을 사용하여 시퀀스간 네트워크를 형성할 수 있다. 도 2에서 인코더 RNN의 마지막 숨겨진 상태 h_T는 입력 시퀀스의 요약 정보를 포함할 수 있다. 이어서, 디코더 RNN은이 컨텍스트 벡터를 취하여 출력을 생성할 수 있다. 그런 다음 생성 된 출력은 다음 시간 단계에서 컨텍스트 벡터 h_T와 함께 입력으로 사용될 수 있다. 이러한 아키텍처의 단점은 컨텍스트 벡터h_T가 디코더 RNN의 모든 타임 스텝마다 고정된다는 것이다. 컨텍스트 벡터가 입력 시퀀스의 전체 정보를 손상시키지 않으면 문제가되지 않는다. 그러나, 많은 경우, 시퀀스가 그 용량보다 긴 경우, 컨텍스트 벡터의 고정 용량은 입력 시퀀스의 전체 정보를 포함 할 수 없다. 이 문제는 관심 메커니즘에 의해 해결될 수 있다.Referring to FIG. 4, an inter-sequence network may be formed using an RNN function capable of modeling long-term dependencies. In FIG. 2, the last hidden state h _T of the encoder RNN may include summary information of the input sequence. Subsequently, the decoder RNN can take this context vector and generate the output. The generated output can then be used as input with the context vector h _T in the next time step. The disadvantage of this architecture is that the context vector h _T is fixed at every time step of the decoder RNN. It doesn't matter if the context vector doesn't compromise the entire information of the input sequence. However, in many cases, if the sequence is longer than its capacity, the fixed capacity of the context vector cannot contain the entire information of the input sequence. This problem can be solved by the mechanism of interest.

A conventional attention mechanism called the soft window multiplies portions of the input sequence by attention weights. The attention weights are computed using a mixture of K Gaussian functions, where the fixed number K is the number of mixture components. Because a Gaussian form is assumed when computing the attention weights, the shape of the attention weights cannot be arbitrarily flexible.
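The soft window described above can be illustrated with a short sketch. The parameter names `alphas`, `betas`, and `kappas` (mixture weight, sharpness, and center of each Gaussian) are assumptions introduced here for illustration; the text states only that a mixture of K Gaussian functions is used.

```python
import math

def soft_window_weights(alphas, betas, kappas, input_length):
    """Attention weight for each input position u from a mixture of K Gaussians.

    alphas, betas, kappas are hypothetical per-component parameters:
    weight, sharpness, and center. K is the length of each list.
    """
    return [
        sum(a * math.exp(-b * (k - u) ** 2)
            for a, b, k in zip(alphas, betas, kappas))
        for u in range(input_length)
    ]

# Two components (K = 2) centered on input positions 3 and 6.
weights = soft_window_weights([1.0, 0.5], [2.0, 1.0], [3.0, 6.0], 10)
```

Whatever the parameter values, the weights are always a sum of bell-shaped bumps, which is the lack of flexibility noted above.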

FIG. 5 is a block diagram of an RNN encoder-decoder network with a content-based attention mechanism according to an embodiment of the present invention.

Referring to FIG. 5, a content-based attention mechanism is shown. The context vector of this network may be computed as a weighted sum of the hidden states of the encoder. Because the context vector can change flexibly at each decoder time step, the hidden state of the encoder does not need to remember all previous information. The attention weight vector may be computed by an alignment model that is itself modeled by a neural network. An RNN encoder-decoder network with a content-based attention mechanism that takes an input sequence of length T_enc may be defined as in Equation 2:

[Equation 2]

Here, c is the context vector, and d and h are the hidden states of the decoder and the encoder, respectively. Because the encoder is a bidirectional RNN, the encoder hidden state h_i is defined by concatenating the forward and backward hidden states. The matrices W_h and W_d and the bias vector b are learnable parameters. The softmax function normalizes its input vector so that the elements of the resulting vector sum to 1. The context vector therefore has the same size as a hidden state of the encoder. Finally, the context vector c and the decoder hidden state of the previous time step, d_{t-1}, are concatenated to form the input of the decoder RNN.
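The computation described around Equation 2 can be sketched as follows. This is a minimal illustration, not the equation of the original: the additive scoring form and the projection vector `v` are assumptions (the text names only W_h, W_d, and b), and all numeric values are made up.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Normalize scores so that they are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_context(enc_states, d_prev, W_h, W_d, b, v):
    """Return (attention weights, context vector) for one decoder step.

    enc_states: encoder hidden states h_1..h_T_enc
    d_prev:     previous decoder hidden state d_{t-1}
    v:          hypothetical vector mapping each score vector to a scalar
    """
    Wd_d = matvec(W_d, d_prev)
    scores = []
    for h in enc_states:
        Wh_h = matvec(W_h, h)
        hidden = [math.tanh(x + y + z) for x, y, z in zip(Wh_h, Wd_d, b)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    alpha = softmax(scores)
    dim = len(enc_states[0])
    # Context vector: weighted sum of encoder states, same size as one state.
    context = [sum(a * h[j] for a, h in zip(alpha, enc_states)) for j in range(dim)]
    return alpha, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
alpha, context = attention_context(
    enc_states, d_prev=[0.5],
    W_h=[[1.0, 0.0], [0.0, 1.0]], W_d=[[1.0], [-1.0]],
    b=[0.0, 0.0], v=[1.0, 1.0],
)
```

Because the weights are recomputed from `d_prev` at every decoder step, the context vector is no longer fixed as it is in the plain encoder-decoder network.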

A common TTS system consists of two main parts: a text encoding part and a speech generation part. Using prior knowledge of the target language, the system defines useful features of the language and extracts them from the input text. This process is called the text encoding part, and many natural language processing techniques may be used in it. For example, a grapheme-to-phoneme model may be applied to the input text to obtain a phoneme sequence, and a part-of-speech tagger may be applied to obtain part-of-speech information. In this way, the text encoding part takes text as input and outputs various linguistic features. The subsequent speech generation part takes the linguistic features and generates the speech waveform. Two common approaches to the speech generation part are concatenative TTS and parametric TTS. Concatenative TTS generates speech by concatenating short units of speech. The short units are on the scale of phonemes or sub-phonemes, and the order of the short units may be determined by a state transition network. This process is similar to HMM-based speech recognition. Parametric TTS, in contrast, uses a generative model and a vocoder. The generative model learns the parameters for modeling a joint probability p over o, l, and the model parameters, where o and l are the vocoder parameters and the linguistic features, respectively. In the training phase, parametric TTS learns the model parameters as described in Equation 3 below.

[Equation 3]

The vocoder parameters may consist of speech-related features such as mel-frequency cepstral coefficients (MFCC), fundamental frequency (F0), and aperiodicity. A pre-trained feature extractor may be used to obtain the vocoder parameters. The linguistic features are available from the output of the text encoding part. In the generation phase, the vocoder parameters are obtained from the linguistic features and the model parameters, as in Equation 4 below.

[Equation 4]

The obtained vocoder parameters are then processed by a vocoder such as Vocaine or WORLD, which finally yields the speech waveform.

FIG. 6 is a schematic diagram of Tacotron.

Referring to FIG. 6, Tacotron is an end-to-end TTS model composed of three modules: an encoder, a decoder, and a post-processor. Although Tacotron has a more complex structure than a vanilla sequence-to-sequence model, it basically follows the sequence-to-sequence framework, using an attention mechanism to convert a character sequence into the corresponding waveform. More specifically, the encoder takes a character sequence as input and generates a text encoding sequence of the same length as the character sequence. The decoder generates a mel-scale spectrogram in an autoregressive manner: the decoder output from the previous time step is used as the decoder input at the current time step. In the decoder module, the attention RNN predicts the attention alignment, as in an ordinary sequence-to-sequence model. Combining the attention alignment with the text encoding gives the context vector, and the decoder RNN takes the context vector and the output of the attention RNN as input. The decoder RNN predicts the mel-scale spectrogram, and the post-processor module then generates a linear-scale spectrogram from the mel-scale spectrogram. Since the linear-scale spectrogram itself carries no phase information, it cannot be converted directly back into a waveform. The Griffin-Lim reconstruction algorithm estimates the phase of the linear-scale spectrogram by repeatedly applying the inverse and forward short-time Fourier transforms. Tacotron may use Griffin-Lim reconstruction in the final step of the post-processor module to obtain the waveform from the linear-scale spectrogram.

An important network architecture used in Tacotron is the CBHG module. CBHG comprises a convolution bank, a highway network, and a bidirectional Gated Recurrent Unit (GRU). A GRU is a type of RNN. Conceptually, the convolutional layers of CBHG capture local patterns, while the bidirectional GRU captures long-term dependencies in a given sequence. To encode the features of a sequence, CBHG is used in both the encoder and the post-processor of Tacotron.

For training, the model has two targets in its objective function: (1) the mel-scale spectrogram target Y_mel and (2) the linear-scale spectrogram target Y_linear. The L1 distance between the predicted mel-scale spectrogram and Y_mel, and the L1 distance between the predicted linear-scale spectrogram and Y_linear, are added together to compute the objective function, Equation 5, as follows.

[Equation 5]

Here, the predicted mel-scale and linear-scale spectrograms are the outputs of Tacotron, and Y_mel and Y_linear are the ground-truth spectrograms.
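As a minimal sketch of the objective described for Equation 5, the two L1 distances can be summed as below. The function names are hypothetical, and whether the original sums or averages over spectrogram bins is not stated in the text; a plain sum is assumed here.

```python
def l1_distance(pred, target):
    """Sum of absolute differences between two spectrograms (lists of frames)."""
    return sum(
        abs(p - t)
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    )

def tacotron_loss(pred_mel, y_mel, pred_linear, y_linear):
    """Objective with two targets: mel-scale and linear-scale spectrograms."""
    return l1_distance(pred_mel, y_mel) + l1_distance(pred_linear, y_linear)

# Toy spectrograms: one frame each, values chosen only for illustration.
loss = tacotron_loss([[1.0, 2.0]], [[0.0, 4.0]], [[1.0]], [[1.5]])
```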

FIG. 7 shows graphs of an ideal attention alignment and of the problems found in actual attention alignments.

Referring to FIG. 7, when long speech was generated, the attention alignment showed irregularities in the middle part of the utterance. In addition, when the attention alignments were compared with the corresponding generated speech, a correlation was observed between the sharpness of the attention alignment and the quality of the generated speech. Intuitively, since the decoder of Tacotron generates the spectrogram from the attended context vector, blurred attention should lead to an indistinctly predicted spectrogram. In an embodiment of the present invention, the flow of information in Tacotron was therefore examined in order to improve the attention alignment prediction. In FIG. 6, the attention module of Tacotron receives input from two sources: the current hidden state of the attention RNN and the text encoding from the encoder. Based on these two sources, the prediction of the attention alignment may be improved as follows.

Because pronouncing one character generally requires more than one spectrogram frame, the attention part of the original Tacotron model often attends to similar parts of the text input over several decoder time steps. Even when the model needs to change the attention weights, the next part to attend to is likely to be adjacent to the currently attended text. Therefore, when determining the attention weights, the model benefits from information about the previously attended text, c_{t-1}, which is a weighted sum of the text encodings. The original Tacotron, however, does not use the information in c_{t-1}. Its attention RNN uses only the spectrogram of the previous time step, which contains little information about c_{t-1}. For this reason, c_{t-1} is concatenated to the input x_t of the attention RNN. The attention RNN thus takes one additional input, c_{t-1}, as in Equation 6 below.

[Equation 6]

Here, x_t and h_t are the input and the hidden state of the attention RNN at time step t, respectively.
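The context insertion described for Equation 6 amounts to concatenating the previous context vector onto the attention RNN input. The sketch below uses a plain tanh recurrent cell as a stand-in, since the text does not specify the cell type; `rnn_cell`, `W_in`, and `W_rec` are hypothetical names and the values are illustrative.

```python
import math

def rnn_cell(inp, h_prev, W_in, W_rec):
    """A simple tanh recurrent cell (assumed stand-in for the attention RNN)."""
    return [
        math.tanh(
            sum(w * i for w, i in zip(W_in[r], inp))
            + sum(w * h for w, h in zip(W_rec[r], h_prev))
        )
        for r in range(len(h_prev))
    ]

def attention_rnn_step(x_t, c_prev, h_prev, W_in, W_rec):
    """One step of the attention RNN with c_{t-1} concatenated to x_t."""
    augmented_input = x_t + c_prev  # list concatenation: [x_t ; c_{t-1}]
    return rnn_cell(augmented_input, h_prev, W_in, W_rec)

W_in = [[0.1] * 5, [0.2] * 5]      # 2 hidden units, input size 2 + 3
W_rec = [[0.0, 0.0], [0.0, 0.0]]
h_with_context = attention_rnn_step([1.0, 1.0], [1.0, 1.0, 1.0], [0.0, 0.0], W_in, W_rec)
h_zero_context = attention_rnn_step([1.0, 1.0], [0.0, 0.0, 0.0], [0.0, 0.0], W_in, W_rec)
```

With the same x_t, the two calls give different hidden states, which shows that the previously attended text now influences the next attention prediction.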

FIG. 8 is a schematic diagram of the CBHG module.

Referring to FIG. 8, the second idea for improving attention prediction is to change how the text input is encoded. The text encoding is also an information source for determining the attention alignment. The encoding may be generated by the CBHG module, as in FIG. 8. CBHG includes a bank of convolutional filters, each of which explicitly extracts local and contextual information. The last stage of CBHG includes a bidirectional RNN that can capture long-term dependencies in the text input. As the RNN reads the input sequentially, long-term dependencies accumulate in its hidden state. The problem is that the size of the hidden state is fixed: if the sequence is long enough, the hidden state cannot hold all the information of the sequence. Furthermore, the bidirectional RNN of CBHG must hold both the text input information of the current time step and the long-term dependency information, which places an additional burden on the hidden state. Referring to FIG. 7(a), when the input sequence is long, the attention alignment of the original Tacotron may contain gaps or blurred portions in the middle of the sequence. This irregularity is attributed to the insufficient capacity of the hidden state of the bidirectional RNN in CBHG. A residual connection linking the input and the output of the bidirectional RNN is therefore added. The output of CBHG is changed to include the additional term x_t, as in Equation 7 below.

[Equation 7]

Here, x_t, h_t, and y_t are the input, the hidden state, and the output of the bidirectional RNN, respectively. Since the residual connection now carries the information of the current time step, the hidden state of the bidirectional RNN no longer needs to contain it. The connection makes the hidden state less congested and helps encode the text information.
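Equation 7 itself is not reproduced in the text. One form consistent with the description (an additive residual term x_t on the output of the bidirectional RNN) would be the following; this is an assumed reconstruction, and it presumes that x_t and h_t have the same dimension.

```latex
% Without the residual connection (assumed baseline):
%   y_t = h_t
% With the residual connection described above:
\[
  y_t = h_t + x_t
\]
```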

FIG. 9 shows graphs of attention alignment results according to embodiments of the present invention.

Referring to FIG. 9, four models were compared to identify which method is effective in solving the problem. The four models are the baseline, (baseline + CI), (baseline + RC), and (baseline + CI + RC). The images in FIG. 9 are the attention alignments obtained while the four models generated speech for a text. Intuitively, text and speech should by nature have a monotonic alignment. The ideal attention alignment should therefore be continuous and monotonically decreasing (or increasing, depending on the y-axis of the attention alignment plot).

Context insertion (CI) and the residual connection (RC) were expected to make it easier for the network to predict the attention alignment, resulting in a sharp and clear alignment. A sentence for which the baseline model had difficulty synthesizing speech was selected. In FIG. 9(a), blurred portions and discontinuities can be seen in the attention alignment of the baseline model. In contrast, the attention alignment of the proposed model (baseline + CI + RC) shows a monotonic and continuous line in FIG. 9(b). On inspection of the generated results, the attention alignment of the proposed model was sharp and clean, and the quality of its generated speech was better than that of the baseline model. Applying only one of the proposed methods, however, did not yield a large improvement (see FIG. 9(c) and FIG. 9(d)). The result of applying each approach separately suggests that the two information flows, from the attention RNN and from the encoder output, are equally important. The model may fail to generate speech when only one of the information sources is improved.

Performance is expected to improve further if a monotonic attention mechanism and semi-teacher-forced training are applied. By exploiting the naturally monotonic alignment between text and speech, the model can learn the attention alignment more easily than before and can devote more capacity to predicting the spectrogram. Semi-teacher-forced training can reduce the mismatch between the input distributions of the decoder module in the training phase and the test phase. This mismatch, called exposure bias, arises because an autoregressive network takes ground-truth inputs in the training phase but takes its own outputs from the previous time step as inputs in the test phase. Scheduled sampling addresses this problem by randomly replacing ground-truth inputs with generated inputs, and semi-teacher-forced training likewise exposes the model to generated samples during the training phase. These techniques could be applied, but they were not, because the proposed method is orthogonal to them and can be used alongside them.
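The random replacement of ground-truth inputs mentioned above can be sketched as follows. This illustrates scheduled sampling in general, not a technique applied in the embodiment; the function name and the `sampling_prob` parameter are assumptions.

```python
import random

def choose_decoder_inputs(ground_truth, generated, sampling_prob, rng):
    """Pick, per time step, the generated frame with probability sampling_prob,
    otherwise the ground-truth frame (ordinary teacher forcing)."""
    return [
        gen if rng.random() < sampling_prob else gt
        for gt, gen in zip(ground_truth, generated)
    ]

rng = random.Random(0)
ground_truth = ["gt0", "gt1", "gt2", "gt3"]
generated = ["gen0", "gen1", "gen2", "gen3"]

teacher_forced = choose_decoder_inputs(ground_truth, generated, 0.0, rng)
free_running = choose_decoder_inputs(ground_truth, generated, 1.0, rng)
mixed = choose_decoder_inputs(ground_truth, generated, 0.5, rng)
```

A probability of 0 reproduces the training-phase behavior and a probability of 1 reproduces the test-phase behavior; intermediate values expose the model to both.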

FIG. 10 is a schematic diagram of a text-to-multiple-speech Tacotron according to an embodiment of the present invention.

Referring to FIG. 10, the present invention receives a speaker ID as an additional input, besides the characters, and generates speech. The speaker ID is usually specified as an integer 0, 1, 2, and so on, and enters the actual model in the form of a one-hot vector. The one-hot vector passes through a lookup table (conceptually equivalent to a single linear layer) and becomes a dense speech embedding vector, which is concatenated to the decoder RNN input and the attention RNN input of Tacotron. The output voice therefore changes according to which speaker is specified in the input. See FIG. 10(a). For the multi-speech Tacotron, it was confirmed experimentally that speech synthesis for multiple speakers is possible when about 30 minutes of data are collected per speaker and the data of all speakers together amount to 20 hours or more.
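The equivalence between the lookup table and a single linear layer applied to a one-hot vector can be checked with a small sketch. The table values, the embedding size, and the names `one_hot`, `linear`, and `prev_frame` are illustrative assumptions.

```python
def one_hot(speaker_id, num_speakers):
    """Encode an integer speaker ID as a one-hot vector."""
    return [1.0 if i == speaker_id else 0.0 for i in range(num_speakers)]

def linear(one_hot_vec, table):
    """Bias-free linear layer: one-hot vector times the embedding table."""
    dim = len(table[0])
    return [sum(o * row[j] for o, row in zip(one_hot_vec, table)) for j in range(dim)]

# Hypothetical embedding table: 2 speakers, 3-dimensional embeddings.
table = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]

embedding = linear(one_hot(1, 2), table)   # same values as table[1]

# The embedding is concatenated to an RNN input, here a toy previous frame.
prev_frame = [0.0, 1.0]
decoder_rnn_input = prev_frame + embedding
```

Multiplying by a one-hot vector selects one row of the table, so the layer is in practice implemented as a row lookup.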

The speaker embedding is the only variable that changes the voice of the generated speech. In other words, a new voice can be obtained by adjusting the speaker embedding appropriately. It is not known, however, what characteristic each dimension of the speaker embedding vector represents. Because the space of speaker embedding vectors is nonlinear, the voice cannot be changed intuitively by tuning the values of the speaker embedding vector. If each dimension of the speaker embedding vector could be made to represent an orthogonal feature, creating a new voice would be much easier and more intuitive. Adding a sparsity constraint can give this orthogonality to the embeddings used to generate the voices of multiple speakers. For this reason, a sparsity constraint is added to the training objective function. Each dimension e_d of the speaker embedding is assumed to follow a Bernoulli distribution and therefore takes a value in the closed interval [0,1]. If the random variables in the speaker embedding have a low probability of taking the value 1, the speaker embedding is sparse by definition. This property is obtained by minimizing the Kullback-Leibler divergence between the sample mean of each speaker embedding dimension and a target sparsity parameter. The sparsity parameter usually takes a value close to 0 (for example, 0.05). Adding the Kullback-Leibler divergence term to the original objective function gives the modified objective function, Equation 8.

[Equation 8]

Here, D is the dimension of the speech embedding, m is the mini-batch size, KL() is the Kullback-Leibler divergence function, and the divergence term is scaled by a regularization coefficient.
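A minimal sketch of the sparsity term described for Equation 8 follows. The symbols `rho` (target sparsity parameter) and `lam` (regularization coefficient) are names assumed here, and the direction of the divergence, KL(rho || sample mean), follows the usual sparse-autoencoder convention rather than anything stated in the text.

```python
import math

def bernoulli_kl(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat).

    Both arguments must lie strictly between 0 and 1.
    """
    return (rho * math.log(rho / rho_hat)
            + (1.0 - rho) * math.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(embeddings, rho=0.05, lam=1.0):
    """Penalty over a mini-batch of m embeddings, each with D dimensions."""
    m = len(embeddings)
    D = len(embeddings[0])
    sample_means = [sum(e[d] for e in embeddings) / m for d in range(D)]
    return lam * sum(bernoulli_kl(rho, mean) for mean in sample_means)

# Mini-batch of m = 2 embeddings with D = 2: dimension 0 is already sparse,
# dimension 1 is not, so only dimension 1 contributes to the penalty.
penalty = sparsity_penalty([[0.05, 0.5], [0.05, 0.5]])
```

The penalty would be added to the spectrogram objective, so that training trades reconstruction quality against embedding sparsity.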

How the learned speech embeddings form a manifold in the vector space was analyzed through a two-dimensional visualization of the points obtained by principal component analysis (PCA). How the sparsity constraint affects the learned speaker embedding vectors was also compared.

Referring to FIG. 10(b), the multi-speech Tacotron can synthesize speech for multiple speakers, and requiring only 30 minutes of speech per speaker is an advance over the original Tacotron, but obtaining 30 minutes of clean speech is still a laborious task. Moreover, once a multi-speech Tacotron has been trained, generating the voice of a new, previously unseen speaker requires retraining the model on the new voice, which takes several hours. The model according to an embodiment of the present invention has the advantage that it needs only a very small amount of data (6 seconds in the experiments) to generate a new voice, and can generate the new voice immediately without any additional training.

This model is a voice imitative TTS, and its structure is very similar to that of the multi-speech Tacotron. The one difference is that, whereas the multi-speech Tacotron takes (text, speaker_id) as input, the voice imitative TTS takes the pair (text, log-mel-spectrogram of the voice of the speaker to be generated). The input log-mel-spectrogram is converted into a speech embedding vector by the embedder network. The resulting speech embedding vector is fed to the decoder RNN and the attention RNN in the same way as in the multi-speech TTS.

Current multi-speaker TTS models that take a one-hot speaker ID vector as input cannot easily be extended to new voices whose speech data is absent from the training data. Because the model has learned embeddings only for the voices represented by one-hot vectors, there is no way to obtain the embedding of a new voice immediately. Generating a new voice requires either retraining the entire TTS model or fine-tuning its embedding layer, which is a time-consuming process even on a computer equipped with a GPU. The speaker embedding could be adjusted by hand to obtain the desired voice, but finding the right combination of values for the speaker embedding vector would be difficult. Such an approach is not only inaccurate but also labor-intensive. To solve this problem more efficiently, a new TTS architecture may be used that can immediately generate the voice of a new speaker without further training of the TTS model and without a manual search over speaker embedding vectors.

FIG. 11 is a schematic diagram of a speech embedding prediction network according to an embodiment of the present invention.

Referring to FIG. 11, a sub-network that predicts the speaker embedding vector may be added. FIG. 11 shows this sub-network, the speaker embedder, which comprises convolutional layers followed by fully connected layers. The network takes a log-mel-spectrogram as input and predicts a fixed-dimensional speaker embedding vector. Since there is no restriction on which spectrogram is used, an arbitrary spectrogram can be fed into this network, which allows the voice of a new speaker to be generated through immediate adaptation of the network. The input spectrogram can have various lengths, but the max-over-time pooling layer located at the end of the convolutional layers reduces the input to a fixed-dimensional vector of length 1 along the time axis. The speaker embedder replaces the one-hot vector input of the multi-speech Tacotron, as described in FIG. 10(b). The voice imitative model may be trained using the same objective function, Equation 5.
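The effect of max-over-time pooling on inputs of different lengths can be seen in a short sketch. The feature values and the channel count are illustrative assumptions; in the network described above, the features would be the outputs of the convolutional layers.

```python
def max_over_time(features):
    """Reduce T frames of C channels to a single vector of C channel maxima.

    features: list of T frames, each a list of C values.
    """
    return [max(channel) for channel in zip(*features)]

# Two inputs with different numbers of frames but the same 3 channels.
short_input = [[0.1, 0.9, 0.3],
               [0.4, 0.2, 0.8]]
long_input = [[0.1, 0.2, 0.3],
              [0.7, 0.1, 0.0],
              [0.2, 0.6, 0.5],
              [0.3, 0.3, 0.9],
              [0.0, 0.4, 0.1]]

pooled_short = max_over_time(short_input)
pooled_long = max_over_time(long_input)
```

Both results have three elements regardless of the number of frames, which is why the subsequent fully connected layers can produce a fixed-dimensional embedding from a spectrogram of any length.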

FIG. 12 is a table showing the voice profiles in a test set according to an embodiment of the present invention, and FIG. 13 is a table showing the parameters of the multi-speech Tacotron according to an embodiment of the present invention.

Text-to-multiple-speech conversion may be performed with the parameters of FIG. 12 and FIG. 13, and the results are shown in the drawings described below.

도 14는 본 발명의 일 실시 예에 따른 생성된 음성 스펙토그램의 예시이다.14 is an example of a generated voice spectrogram according to an embodiment of the present invention.

도 14를 참조하면, 본 발명의 일 실시 예에서는 140 개 에코 (epoch)의 다중 음성 타코트론을 훈련 시켰는데, 1 에포크는 음성 임베딩 벡터에 희소성 제약이 있거나 없는 1289 반복과 같다. 두 모델의 생성된 음성 샘플은 비슷한 품질을 보였다. 각 스피커의 성별과 악센트를 샘플을 통해 명확하게 구분할 수 있다. 도 14는 세 가지 다른 음성의 샘플 스펙트로그램을 보여준다. 스펙트로그램에 해당하는 텍스트는 음성마다 동일하며 교육 데이터에는 나타나지 않는다. 각 샘플의 피치와 속도가 다른 샘플과 다른 것을 확인할 수 있다. 도 14 (a), (b), (c)는 각각 (여성, 하이피치), (여성, 로우피치), (남성, 로우피치)의 결과이다.Referring to FIG. 14, in one embodiment of the present invention, 140 negative echo tacolons of 140 echoes were trained, and 1 epoch is equivalent to 1289 repetitions with or without sparse constraints in the speech embedding vector. The generated negative samples of both models showed similar quality. The gender and accent of each speaker can be clearly identified through samples. 14 shows sample spectrograms of three different voices. The text corresponding to the spectrogram is the same for each voice and does not appear in the training data. You can see that the pitch and speed of each sample is different from other samples. 14(a), (b), and (c) are the results of (female, high pitch), (female, low pitch), (male, low pitch), respectively.

FIG. 15 is a graph showing voice embedding results by principal component according to an embodiment of the present invention.

Referring to FIG. 15, principal component analysis (PCA) was used to examine how the speaker embeddings were learned. Selecting the two dimensions with the largest eigenvalues shows which factors account for most of the variance in the data. The points in FIG. 15 are colored to highlight linear separability by gender or accent. In FIGS. 15(a) and 15(c), where the model was trained without the sparsity constraint, all of the embeddings appear entangled. In contrast, FIGS. 15(b) and 15(d), where the model was trained with the sparsity constraint, show a clear separation by gender. In FIG. 15(d), a line drawn through the points (0, 1) and (3, 0) separates the English accents from the American accents. This result does not necessarily mean that the sparsity constraint yields a better representation of voice characteristics. Because only the two dimensions with the largest variance were selected, the model without the sparsity constraint may have found other factors that distinguish each individual's voice. However, if the voice embeddings are distributed on a manifold that can be understood intuitively, it can be much easier to tune an embedding to produce a desired voice. Voice characteristics can be controlled by directly manipulating the values of the voice embedding. In particular, in the model trained with the sparsity constraint, the principal components can represent gender, accent, speed, and amplitude, respectively.

FIG. 16 is a graph showing spectrograms as a principal component is varied according to an embodiment of the present invention.

Referring to FIG. 16, to demonstrate the voice embedding tuning approach, speech was generated by changing the value of one principal component at a time while keeping the other components fixed. Once a principal component is selected, the maximum and minimum values of that component are determined. Five values are then chosen between the maximum and the minimum, which yields five voice embeddings that differ only in the selected component and are identical in all other components. Because these voice embeddings lie in the space transformed by PCA, the inverse transform was applied to them so that they are compatible with the multi-voice Tacotron. Finally, speech samples were generated using the five voice embeddings.
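The tuning procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the embodiment: the embeddings are random stand-ins for learned 4-dimensional voice embeddings, PCA is computed with NumPy's SVD, the five values are taken to be evenly spaced including both extremes, and the name `sweep_component` is hypothetical.

```python
import numpy as np

# Hypothetical data: 10 voice embeddings of dimension 4 (random stand-ins).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))

# PCA via SVD of the centered data; rows of `components` are principal axes.
mean = embeddings.mean(axis=0)
_, _, components = np.linalg.svd(embeddings - mean, full_matrices=False)
scores = (embeddings - mean) @ components.T   # embeddings in PCA space

def sweep_component(base_score, k, n_steps=5):
    """Vary principal component k from its observed minimum to its maximum,
    keep the other components fixed, and map back to embedding space."""
    values = np.linspace(scores[:, k].min(), scores[:, k].max(), n_steps)
    swept = np.tile(base_score, (n_steps, 1))
    swept[:, k] = values
    return swept @ components + mean          # inverse PCA transform

# Five embeddings that differ only along the first principal component.
tuned = sweep_component(scores[0], k=0)
print(tuned.shape)  # (5, 4)
```

Each row of `tuned` lies in the original embedding space and could be passed to the synthesis model in place of a learned voice embedding.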

The first graph of FIG. 16 shows that the pitch rises as the value of the first principal component increases. As expected from FIG. 15(c), listening to the sample speech confirms that the voice changes from male to female as the value increases.

For the second principal component, FIG. 15(d) suggests that the speaker's accent will change. When the sentence "I need some water from fast river." is generated and listened to, a change of accent can be heard in the words "water", "fast", and "river". Unfortunately, there is no effective way to visualize this change, so the figure for this experiment is omitted.

No intuition can be obtained from the PCA plots of the third and fourth principal components, but a trend can be seen by changing the value of each component. In the second graph of FIG. 16, the width of the voiced segments in each spectrogram becomes smaller as the component value moves toward the median. When listening to the samples, the speech sounds faster as the width decreases. However, the trend is not as clear as for the first and second principal components.

Finally, the fourth principal component appears to represent the amplitude of the voice. A spectrogram shows a stronger pattern when the sound is loud. In the third graph of FIG. 16, the brightest region is smallest in the rightmost spectrogram, and the spectrograms on the right are generally darker than those on the left. The voices on the right also sound louder than those on the left.

Because the voice embedding had only four dimensions, only four features affected by the voice embedding vector could be observed. It can be hypothesized that more meaningful features would be found by increasing the dimension of the voice embedding vector. The analysis of the third and fourth principal components indicates that speed and loudness may be among the main factors in the variability of speech. To obtain a clearer distinction for these last two principal components with the speaker embedding tuning approach, data augmentation can be applied by varying the speed and volume of the existing data. Because the data set was not collected with the aim of distinguishing speed and volume, most of the data has average speed and average volume, and may therefore be insufficient for clearly separating speed and loudness. Audio files can be manipulated in several ways, for example with the program Sound eXchange (SoX). The manipulated audio may have characteristics different from those of the original data, but such tools can be used to increase the number of samples with different speeds and loudness levels.
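As an illustration of such augmentation, the sketch below builds SoX command lines that change the speed and the volume of a recording. It relies on the `tempo` and `vol` effects of SoX; the helper name `sox_augment_commands`, the file name, and the factor values are assumptions chosen for the example and do not come from the embodiment. The commands are only constructed here, not executed.

```python
# Sketch: build (not run) SoX commands for speed and loudness augmentation.
# The factors and file names are hypothetical examples.

def sox_augment_commands(src, tempos=(0.9, 1.1), volumes=(0.5, 1.5)):
    """Return one SoX argument list per augmented copy of `src`."""
    stem = src.rsplit(".", 1)[0]
    cmds = []
    for t in tempos:
        # `tempo` changes the speaking rate without changing the pitch.
        cmds.append(["sox", src, f"{stem}_tempo{t}.wav", "tempo", str(t)])
    for v in volumes:
        # `vol` scales the amplitude by the given factor.
        cmds.append(["sox", src, f"{stem}_vol{v}.wav", "vol", str(v)])
    return cmds

for cmd in sox_augment_commands("speech.wav"):
    print(" ".join(cmd))
```

Each argument list could be passed to `subprocess.run(cmd, check=True)` on a machine where SoX is installed.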

FIG. 17 is a graph showing a spectrogram produced by a mixed voice according to an embodiment of the present invention.

Referring to FIG. 17, when the same analysis was performed on the multi-voice Tacotron trained without the sparsity constraint, no trend could be recognized in the samples, unlike in the previous analysis. Although no intuition could be obtained from this analysis or from the two-dimensional visualization of the voice embeddings, it is worth noting that speech can be generated by using an interpolated voice embedding as the input to the model. Two voice embedding vectors were averaged, and the result was used to generate speech. As shown in FIG. 17, the generated sample has a pitch level between those of the two original voices. Further investigation is needed to determine the characteristic that each component represents.
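The averaging of two voice embedding vectors can be written as a short sketch. The function name `interpolate`, the mixing weight `alpha`, and the example vectors are hypothetical; with `alpha = 0.5` the function reduces to the plain average described above.

```python
# Sketch: linear interpolation between two voice embedding vectors.
# alpha = 0.5 gives the plain average; the vectors are hypothetical.

def interpolate(emb_a, emb_b, alpha=0.5):
    """Return (1 - alpha) * emb_a + alpha * emb_b, element by element."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    return [(1 - alpha) * a + alpha * b for a, b in zip(emb_a, emb_b)]

voice_a = [1.0, 0.0, 2.0, -1.0]
voice_b = [3.0, 2.0, 0.0, 1.0]

mixed = interpolate(voice_a, voice_b)
print(mixed)  # [2.0, 1.0, 1.0, 0.0]
```

The resulting vector would be supplied to the model in place of a single speaker's embedding.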

It is conjectured that the sparsity constraint may increase the complexity of the problem, which may lead the constrained model to converge faster than the model without the sparsity constraint. The complexity of the model may be too high relative to the complexity of the problem, in which case the model is prone to overfitting. Here the sparsity constraint acts as a regularization term and can bring the complexity of the problem and that of the model into agreement. Under this scenario, it may be beneficial to reduce the regularization coefficient ρ after the constrained model has converged to some extent. This would not only improve convergence in the early stage but also help find accurate optimal parameters in the later stage. It is also possible that the model with the sparsity constraint was not trained sufficiently.
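One way to reduce the regularization coefficient ρ after an initial training phase is sketched below. Everything in it is an assumption made for illustration: the exponential decay, the epoch at which decay starts, the initial value of ρ, and the use of an L1 penalty on the voice embedding as the sparsity term. The actual form of the constraint is given by the objective function of the embodiment, which is not reproduced here.

```python
# Sketch: hold rho constant early, then decay it once training has
# progressed. Schedule shape and L1 penalty are assumptions.

def rho_schedule(epoch, rho0=0.1, decay_start=50, decay=0.95, rho_min=0.0):
    """Constant rho0 before decay_start, exponential decay afterwards."""
    if epoch < decay_start:
        return rho0
    return max(rho_min, rho0 * decay ** (epoch - decay_start))

def regularized_loss(recon_loss, embedding, rho):
    """Reconstruction loss plus an (assumed) L1 sparsity penalty."""
    return recon_loss + rho * sum(abs(x) for x in embedding)

for epoch in (0, 50, 100, 139):
    print(epoch, rho_schedule(epoch))
```

In practice the point at which decay begins would be chosen by monitoring convergence of the constrained model rather than fixed in advance.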

It should be understood that the present invention may be implemented in various forms of hardware, software, firmware, special-purpose processors, or combinations thereof. Preferably, the present invention is implemented as a combination of hardware and software. Furthermore, the software is preferably implemented as an application program tangibly embodied on a program storage device. The application program may be uploaded to, and executed by, a machine having any suitable architecture. Preferably, the machine is implemented on a computer platform having hardware such as one or more central processing units (CPUs), random access memory (RAM), and input/output (I/O) interface(s). The computer platform also includes an operating system and microinstruction code. The various processes and functions described herein may be part of the microinstruction code or part of the application program (or a combination thereof) executed through the operating system. In addition, various other peripheral devices, such as an additional data storage device and a printing device, may be connected to the computer platform.

It should be further understood that, because some of the system components and method steps shown in the accompanying drawings are preferably implemented in software, the actual connections between the system components (or method steps) may differ depending on the manner in which the present invention is programmed. Given the teachings disclosed herein, a person of ordinary skill in the relevant art will be able to contemplate these and similar implementations or configurations of the present invention.

Claims

A text-to-multi-voice conversion method comprising:
generating, by an encoder, a character embedding vector, character by character, from a sentence for which speech is to be generated;
generating, by a decoder, a speech spectrum from the embedding vector; and
outputting, by a voice output unit, the speech spectrum as speech,
wherein, in the generating of the speech spectrum, the decoder
receives multi-voice data, converts the multi-voice data into a voice embedding vector, and uses the voice embedding vector for speech spectrum merging,
wherein the multi-voice data
is a speaker-id in the form of a one-hot vector and is converted into the voice embedding vector through a look-up table,
wherein the voice embedding vector is input to a decoder RNN and an attention RNN, and
wherein the decoder adjusts the speaker embedding vector by setting the space of the embedding vector of the speaker whose voice is input to be nonlinear.

The method of claim 1,
wherein the multi-voice data
is data in the form of a log-mel-spectrogram.

The method of claim 2,
wherein the log-mel-spectrogram is converted into a voice embedding vector through an embedder network.

A text-to-multi-voice conversion system comprising a text-to-multi-voice conversion module that outputs, from text, speech of the text,
wherein the text-to-speech conversion module comprises:
an encoder that generates a character embedding vector, character by character, from a sentence for which speech is to be generated;
a decoder that generates a speech spectrum from the embedding vector; and
a voice output unit that outputs the speech spectrum as speech,
wherein the decoder
receives multi-voice data, converts the multi-voice data into a voice embedding vector, and uses the voice embedding vector for speech spectrum merging,
wherein the multi-voice data
is a speaker-id in the form of a one-hot vector and is converted into the voice embedding vector through a look-up table,
wherein the voice embedding vector is input to a decoder RNN and an attention RNN, and
wherein the decoder adjusts the speaker embedding vector by setting the space of the embedding vector of the speaker whose voice is input to be nonlinear.

The system of claim 4,
wherein the multi-voice data
is data in the form of a log-mel-spectrogram.

The system of claim 5,
wherein the log-mel-spectrogram is converted into a voice embedding vector through an embedder network.