KR20100085433A

KR20100085433A - High quality voice synthesizing method using multiple target prosody

Info

Publication number: KR20100085433A
Application number: KR1020090004711A
Authority: KR
Inventors: 이종석; 이준우; 전원석; 나덕수
Original assignee: 주식회사 보이스웨어
Priority date: 2009-01-20
Filing date: 2009-01-20
Publication date: 2010-07-29

Abstract

PURPOSE: A high-quality speech synthesis method using multiple target prosodies is provided. It predicts whether each break is variable or fixed and uses that prediction to generate multiple target prosodies, thereby improving the quality of the synthesized speech. CONSTITUTION: Numbers, symbols, and special characters in an input sentence are converted into text. The parts of speech of the converted text are analyzed, and the analyzed sentence is converted into a phonetic transcription. Breaks, durations, and intonation are generated for the synthesis unit candidates. The final synthesis unit candidates to be used in synthesis unit selection are preselected from these candidates. The synthesis units used to synthesize the input text are then selected from the preselected final candidates.

Description

High Quality Voice Synthesizing Method using Multiple Target Prosody

FIG. 1 is a block diagram of the system configuration of the corpus-based speech synthesis system used in the present invention.

FIG. 2 is a tree diagram showing the prosodic structure used in the present invention.

FIG. 3 is a diagram illustrating the variable break modeling process according to the present invention.

FIG. 4 is a set of bar graphs showing the distribution of predicted probabilities for actual accent phrase boundaries and actual major phrase boundaries. FIG. 4a shows the accent phrase prediction probability distribution, and FIG. 4b shows the major phrase prediction probability distribution, in the variable break prediction according to the present invention.

FIG. 5 is a block diagram illustrating the variable break determination process according to the present invention.

FIG. 6 is a block diagram showing the multiple target prosody generation process according to the present invention.

FIG. 7 is a set of block diagrams showing target prosodies. FIG. 7a shows a conventional single target prosody, FIG. 7b shows an example of generating a conventional single target prosody, FIG. 7c shows an example of using multiple target prosodies according to the present invention, and FIG. 7d shows an example of generating multiple target prosodies according to the present invention.

FIG. 8 is a block diagram showing the synthesis unit selection process according to the present invention.

* Brief description of the main reference numerals in the drawings *

110: sentence input unit    120: linguistic processing unit

121: text preprocessing module    122: sentence analysis module

123: phonetic transcription conversion module    130: prosody processing unit

140: speech signal processing unit    141: synthesis unit selection module

142: speech waveform generation module    150: speech output unit

161: number/abbreviation/symbol dictionary    162: part-of-speech dictionary

163: pronunciation dictionary    164: speech database (DB)

Field of the Invention

The present invention relates to a speech synthesis method. More specifically, it relates to a speech synthesis method that, in the break prediction of a speech synthesizer, additionally predicts whether each break is variable or fixed and uses this information to generate multiple target prosodies, thereby improving the quality of the synthesized speech.

Background of the Invention

Of the synthesis methods proposed so far, corpus-based speech synthesis produces the highest-quality synthesized speech. A corpus-based method selects the units needed for synthesis from a database (DB) in which speech is stored as synthesis units, and concatenates them appropriately to generate high-quality synthesized speech.

FIG. 1 shows the basic configuration of such a corpus-based speech synthesis system. Referring to FIG. 1, a typical synthesis procedure is as follows. When a sentence is input, the text preprocessing module 121 of the linguistic processing unit 120 uses the number/abbreviation/symbol dictionary 161 to convert numbers, symbols, and the like in the sentence into text; the sentence analysis module 122 analyzes the sentence using the part-of-speech dictionary 162; and the phonetic transcription conversion module 123 converts the sentence into a phonetic transcription using the pronunciation dictionary 163. Once the linguistic processing unit has processed the input sentence, the prosody processing unit 130 uses the extracted information to generate prosodic information such as intonation and duration. The synthesis unit selection module 141 of the speech signal processing unit 140 then uses the information generated by the linguistic processing unit to select the optimal synthesis units from the speech DB 164, and the speech waveform generation module 142 concatenates those units into synthesized speech, which is output through the speech output unit 150.

To generate prosodic information from input text, the speech synthesis system needs three basic modules: prosodic phrase boundary (break) prediction, phoneme duration setting, and fundamental frequency contour setting. The break, which marks the boundary of a prosodic phrase, affects both the phoneme durations and the fundamental frequency contour, so it is the most basic and most important of the three.

Five or six kinds of breaks, called break indices (BI), are used to represent the structure of prosodic phrases. BIs can be generated by rule-based or corpus-based methods. Rule-based methods use information such as punctuation, parts of speech, and the phoneme stream together with linguistic knowledge, and obtaining accurate BIs this way requires very complex and elaborate work. Corpus-based methods include the classification and regression trees (CART) method, which automatically builds a decision tree from various features, and methods based on the hidden Markov model (HMM). However, modeling accuracy is limited because reading style differs slightly from person to person. According to a 1996 study by Cambell, automatically predicted BIs agreed with human-labeled BIs only about 69% of the time, but the agreement rose to 90% when a prediction within ±1 of the labeled value was counted as correct. This result reflects the uncertainty (subtleness) between adjacent BIs, which is what makes BI prediction difficult.
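The difference between strict agreement and agreement within ±1 can be illustrated with a short sketch. The function name `bi_agreement` and the toy label sequences are invented for illustration; they are not taken from the cited study or from the patent.

```python
def bi_agreement(predicted, reference, tolerance=0):
    """Fraction of boundaries where |predicted - reference| <= tolerance."""
    if len(predicted) != len(reference):
        raise ValueError("sequences must have the same length")
    if not predicted:
        raise ValueError("sequences must not be empty")
    hits = sum(abs(p - r) <= tolerance for p, r in zip(predicted, reference))
    return hits / len(predicted)


# Toy break-index sequences, invented for illustration only.
predicted = [1, 2, 3, 2, 4, 3, 1, 5]
reference = [1, 3, 3, 2, 4, 2, 1, 5]

exact = bi_agreement(predicted, reference)          # strict match
within_one = bi_agreement(predicted, reference, 1)  # +/-1 counted as correct
print(exact, within_one)
```

In this toy data the only mismatches are confusions between BI 2 and BI 3, so they disappear once a tolerance of 1 is allowed, which is the kind of ambiguity the paragraph above describes.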

The break prediction module of a typical synthesizer, however, generates only one break type (BI) per boundary, uses it to generate a single target prosody, and performs synthesis unit selection against that target. Data with a similar but different break type (BI) therefore cannot be used, and optimal synthesis quality cannot be obtained.

Rule-based speech synthesis likewise analyzes the text to generate break information, a fundamental frequency contour, and phoneme durations, and then generates the waveform from these by signal processing. With fixed breaks it therefore has the same problem as the corpus-based approach.

In short, conventional speech synthesis methods fix each break, which is central to generating prosodic information, to a single type, and so fail to achieve optimal quality close to natural speech.

Accordingly, an object of the present invention, made to solve the above problem, is to provide a method that uses the generated prosody more effectively to improve the naturalness of synthesized speech. Variable breaks are introduced into the break information, which is one component of the prosodic information, and are used to generate multiple target prosodies, enabling synthesis unit selection that exploits the varied prosodic information carried by each speech unit in the speech database.

Another object of the present invention is to provide a method that efficiently compensates for a synthesizer's inadequate break prediction rules by using, in the synthesis unit selection of a speech synthesizer based on a large speech corpus, multiple target prosody information that includes variable breaks as well as fixed breaks.

Still another object of the present invention is to provide a method that classifies each BI of 2 or 3 generated by the break prediction module as a fixed break (FB) or a variable break (VB). For a fixed break, a single target prosody is generated from the one BI (2 or 3) determined by the break prediction module and used for synthesis unit selection; for a variable break, multiple target prosodies covering both BI 2 and BI 3 are generated and used for synthesis unit selection. (BI 2 and BI 3 are defined in the detailed description.)

Still another object of the present invention is to provide a method that predicts variable breaks with a statistical modeling method and uses them to extract synthesis units with varied prosodic patterns as candidates, thereby improving the quality of the synthesized speech.

Still another object of the present invention is to provide a method that converts the single target prosody used in conventional synthesizers into multiple target prosodies using variable break information, compensating for prosody prediction errors and making the prosody of the synthesized speech more natural.

Still another object of the present invention is to provide a method that re-expresses the sentence-level or IP-level single target prosody of conventional synthesizers as word-level target prosodies using the break information of each word, and generates prosody according to whether each word's breaks are fixed or variable. Several prosodies can then be represented simultaneously within one sentence or IP, which raises the likelihood that the optimal prosodic pattern is chosen in synthesis unit selection.

Still another object of the present invention is to provide a method that, in the pre-selection step performed to shorten synthesis time, uses multiple target prosodies to build a candidate group in which synthesis units with different prosodic structures are suitably mixed, and performs synthesis unit selection on that group, thereby generating synthesized speech with natural prosody.

Still another object of the present invention is to provide a method that generates and uses multiple target prosodies in rule-based synthesis as well as corpus-based synthesis, thereby generating synthesized speech with natural prosody.

The above and other objects of the present invention can all be achieved by the invention described below.

Summary of the Invention

The present invention is a high-quality speech synthesis method using multiple target prosodies, in a corpus-based speech synthesis method that comprises: a linguistic processing step of preprocessing the input sentence by converting numbers, symbols, special characters, and the like into text, analyzing the parts of speech of the preprocessed text, and converting the analyzed sentence into a phonetic transcription; a prosody processing step of generating the breaks, durations, and intonation (fundamental frequency contour) for the synthesis unit candidates produced by the linguistic processing step; and a speech signal processing step of preselecting the final synthesis unit candidates to be used in synthesis unit selection, selecting from the preselected candidates the synthesis units to be used in synthesizing the input text, and concatenating them to generate synthesized speech; wherein each break in the prosody processing step is either a fixed break or a variable break.

The present invention is also a corpus-based high-quality speech synthesis method using multiple target prosodies, wherein the prosody processing step comprises predicting, for breaks of BI 2 and BI 3, whether the break is fixed or variable, and generating multiple prosodies when a variable break is predicted.

The present invention is also a corpus-based high-quality speech synthesis method using multiple target prosodies, wherein the break prediction is determined using a statistical modeling method.

The present invention is also a corpus-based high-quality speech synthesis method using multiple target prosodies, wherein the speech signal processing step selects from the large, expanded set of candidate synthesis units produced by the multiple prosody generation step, and further comprises a preselection process that, for this expanded set, generates a combined synthesis unit candidate group using Equation 7 as the preselection ratio.

Furthermore, because text analysis, linguistic processing, and target prosody generation in rule-based speech synthesis are the same as in the corpus-based method, the present invention is also a rule-based speech synthesis method that uses the above multiple target prosodies in generating the target prosody.

Detailed Description of Embodiments of the Invention

Preferred embodiments of the present invention are described in detail below with reference to the accompanying drawings. Detailed descriptions of related well-known functions or configurations are omitted where they would unnecessarily obscure the subject matter of the invention.

The terms used in the present invention are defined as follows. The main prosodic information used by a synthesizer generally consists of the break type, the fundamental frequency contour, and the phoneme duration. In generating prosody, the break type at each word junction is determined first, and the fundamental frequency contour and phoneme durations are then determined according to that break. A break at a word junction that is assigned exactly one break type is called a fixed break (FB); a break at a word junction that may vary, so that two types are possible, is defined as a variable break (VB). A conventional synthesizer uses a single target prosody because a BI, once generated by the break prediction module, does not change. With variable breaks, two break types (BI) are possible at one boundary (between two words), so a fundamental frequency contour and durations must be generated for each of them. This is defined as multiple target prosody.

BI 2 and BI 3, the break indices that can be variable breaks in the present invention, are now examined. The prosodic structure of speech may differ slightly between languages, but a fine-grained prosodic structure usable in a synthesizer is shown in FIG. 2. A sentence consists of several intonation phrases (IP), and an intonation phrase consists of several major phrases (MP; also called intermediate phrases). An MP consists of several accent phrases (AP), an accent phrase consists of several words, and a word finally consists of phones or syllables. The MP is a prosodic unit with characteristics intermediate between those of the AP and the IP: whereas an IP is generally bounded by a long pause (silent interval), an MP is bounded by a shorter pause. The present invention uses six BIs, 0 to 5, and generates them by rule. BI 0 and BI 1 can be generated from the phoneme string and word segmentation produced by text preprocessing, and BI 2 and BI 3 can be determined from the accent and the sentence structure. BI 4 and BI 5 can be identified from commas and periods. BI 2 and BI 3 are easy to distinguish from the other BIs (0, 1, 4, 5), but they resemble each other, so complex rules are needed to decide between them.

BI 2 (accent phrase boundary) corresponds to a change in the accent flow with no pause, and BI 3 (major phrase boundary) corresponds to a change in the accent flow accompanied by a short pause. Analysis of a break-labeled speech database shows that BI 2 and BI 3 are sometimes clearly separated by context, but often they are not. In some cases the same context is even split 50:50 between data labeled BI 2 and data labeled BI 3.
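As a compact summary of the six break indices described above, the following sketch encodes them as an enumeration. The class and member names are illustrative choices, not names used in the patent, and BI 0 and BI 1 are left unnamed because the text only says how they are derived.

```python
from enum import IntEnum


class BreakIndex(IntEnum):
    """Break indices 0 to 5 as described in the text (names are illustrative)."""
    BI0 = 0            # from the phoneme string and word segmentation
    BI1 = 1            # from the phoneme string and word segmentation
    ACCENT_PHRASE = 2  # accent flow changes, no pause (AP boundary)
    MAJOR_PHRASE = 3   # accent flow changes, short pause (MP boundary)
    COMMA = 4          # identified from a comma
    PERIOD = 5         # identified from a period


def can_be_variable(bi):
    """Only BI 2 and BI 3 are candidates for a variable break."""
    return BreakIndex(bi) in (BreakIndex.ACCENT_PHRASE, BreakIndex.MAJOR_PHRASE)


print([int(b) for b in BreakIndex if can_be_variable(b)])
```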

Analysis of the recorded speech and recording scripts prepared for building a corpus-based synthesizer shows that breaks between words that combine in ways affecting accent or grammar, such as a noun and a particle, a noun and a suffix, or the nouns forming a compound noun, are very likely to be fixed breaks, whereas breaks determined by semantic or morphological characteristics are likely to be variable breaks. Rules for deciding whether a break is fixed or variable may therefore include rules based on grammatical knowledge and empirical rules, and can use the part-of-speech and character information produced by the language processing module. Most of these rules predict, for BI 2 and BI 3 (the accent phrase and MP boundaries), whether the break is fixed or variable. Accurate variable break prediction would require analyzing the meaning and structure of the sentence, which is not possible with current technology. The rules for predicting variable breaks are therefore general-purpose rules, such as checking the types or order of adjacent parts of speech, whereas the rules for predicting fixed breaks are specific, restrictive rules that analyze idiomatic expressions and grammatical patterns that must be read together, searching for particular word combinations or for the position of particular parts of speech.

In the present invention, a corpus is used to predict variable breaks in order to compensate for errors in the rule-generated BI 2 and BI 3. First, CART is used to build a decision tree that decides, for each rule-generated BI 2 or BI 3, whether the boundary is an accent phrase (AP) boundary or an MP boundary. The output of the decision tree is expressed as probability values, which are used in the preselection step (of synthesis unit selection) as the rate at which the candidate group (entry) of synthesis units is composed. This decision tree is referred to below as the AP/MP decision tree.

(Equation 1)

(Equation 2)

Equation 1 gives the output form of the AP/MP decision tree: a vector consisting of the probability that the boundary is an accent phrase boundary, written here as $P_{AP}$, and the probability that it is an MP boundary, written here as $P_{MP}$. The relationship between $P_{AP}$ and $P_{MP}$ is given by Equation 2.
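One common way a CART leaf yields such a probability pair is by the relative frequency of each class among the training samples that reach the leaf. The sketch below follows that convention and assumes the two probabilities sum to 1; both points are assumptions, since Equations 1 and 2 are not reproduced in this text. The function name and the counts are hypothetical.

```python
def leaf_probabilities(n_ap, n_mp):
    """Return (p_ap, p_mp) for a tree leaf holding n_ap AP and n_mp MP samples.

    Assumption: class probabilities are leaf relative frequencies and
    p_ap + p_mp == 1. The patent's own Equations 1 and 2 are not available.
    """
    total = n_ap + n_mp
    if n_ap < 0 or n_mp < 0 or total == 0:
        raise ValueError("leaf must contain at least one sample")
    p_ap = n_ap / total
    return p_ap, 1.0 - p_ap


# Hypothetical leaf: 30 boundaries labeled AP and 10 labeled MP.
p_ap, p_mp = leaf_probabilities(30, 10)
print(p_ap, p_mp)
```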

FIG. 3 illustrates the variable break modeling process. It consists of generating the AP/MP decision tree and computing the two thresholds, one for accent phrase boundaries and one for MP boundaries, that are used in the variable break determination process of FIG. 5.

First, the decision tree is generated by CART training on BI information obtained by manually inspecting the recorded speech and the corresponding sentences.

Table 1 lists the features used in CART training. The features refer to the k-th word and to the last syllable of that word. The seventh feature concerns morphemes: words with similar grammatical functions are grouped into classes. For example, structural nouns and auxiliary verbs, which attach strongly to the preceding word, are grouped together, as are words that form idioms and frequently used verbs and particles.

FIG. 4 shows the distribution of predicted probabilities for actual AP boundaries and actual MP boundaries, where an actual AP or MP boundary is a break that the voice actor produced as an AP or MP boundary in the recorded speech. To obtain these distributions, BIs are first predicted by rule for the text corresponding to each recording (the text read by the voice actor). Wherever the predicted BI is 2 or 3, the AP/MP decision tree is used to collect the accent phrase probability $P_{AP}$ and the MP probability $P_{MP}$, and the probability values are sorted according to the boundary type the voice actor actually produced. FIG. 4a thus shows the distribution of $P_{AP}$ over text the voice actor read as an accent phrase boundary, and FIG. 4b shows the distribution of $P_{MP}$ over text the voice actor read as an MP boundary. For boundaries actually spoken as accent phrase boundaries, $P_{AP}$ is mostly close to 1. For boundaries spoken as MP boundaries, the most common interval for $P_{MP}$ is 0.8 to 0.9, and even that interval accounts for less than 30% of cases. This indicates that the accent phrase reading pattern is highly consistent and regular, whereas the MP reading pattern is more irregular, so a boundary read as an MP boundary also has a high probability of being read as an accent phrase boundary.
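A distribution like the one in FIG. 4 can be produced by binning the collected probabilities into equal-width intervals. The following is a minimal sketch under that assumption; the function name, the bin count of 10, and the sample values are all hypothetical and do not come from the patent's data.

```python
def probability_histogram(values, n_bins=10):
    """Share of values falling in each of n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        if not 0.0 <= v <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        # A value of exactly 1.0 is placed in the last bin.
        idx = min(int(v * n_bins), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]


# Hypothetical P_AP values collected at boundaries actually read as AP.
hist = probability_histogram([0.95, 0.97, 1.0, 0.85])
print(hist[8], hist[9])
```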

The thresholds used to decide variable breaks, one for accent phrase boundaries and one for MP boundaries, are given as follows.

(Equation 3)

(Equation 4)

(Equation 5)

(Equation 6)

In these equations, one mean is the average of the accent phrase probability $P_{AP}$ over actual accent phrase boundaries for which the output of the AP/MP decision tree satisfies $P_{AP} > P_{MP}$, and the other mean is the average of the MP probability $P_{MP}$ over actual MP boundaries for which the output satisfies $P_{AP} < P_{MP}$. The two counts are the number of actual accent phrase boundaries and the number of actual MP boundaries.
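The quantities described above (the two conditional means and the two boundary counts) can be computed as in the sketch below. How Equations 3 to 6 combine them into thresholds is not recoverable from this text, so the sketch stops at the statistics themselves. The function name, the tuple layout `(actual_label, p_ap, p_mp)`, and the sample values are hypothetical.

```python
def conditional_means(samples):
    """Compute the statistics described for the thresholds.

    samples: iterable of (actual_label, p_ap, p_mp), actual_label in {"AP", "MP"}.
    Returns a dict with the two conditional means and the two boundary counts.
    """
    samples = list(samples)
    ap_vals = [p_ap for label, p_ap, p_mp in samples
               if label == "AP" and p_ap > p_mp]
    mp_vals = [p_mp for label, p_ap, p_mp in samples
               if label == "MP" and p_ap < p_mp]
    return {
        "mean_ap": sum(ap_vals) / len(ap_vals) if ap_vals else None,
        "mean_mp": sum(mp_vals) / len(mp_vals) if mp_vals else None,
        "n_ap": sum(1 for label, _, _ in samples if label == "AP"),
        "n_mp": sum(1 for label, _, _ in samples if label == "MP"),
    }


# Hypothetical decision tree outputs at labeled boundaries.
samples = [
    ("AP", 0.75, 0.25),
    ("AP", 1.0, 0.0),
    ("AP", 0.25, 0.75),  # actual AP but tree favors MP: excluded from mean_ap
    ("MP", 0.25, 0.75),
    ("MP", 0.5, 0.5),    # tie: excluded from mean_mp
]
stats = conditional_means(samples)
print(stats)
```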

Of the BIs generated by the synthesizer, 0, 1, 4, and 5 are always fixed breaks; only for BI 2 and BI 3 is it decided whether the break is variable or fixed. Variable breaks are determined using the AP/MP decision tree and the thresholds, as shown in FIG. 5. When the generated BI is 2 or 3, the prediction probability is obtained from the AP/MP decision tree; the break is treated as fixed if the probability is above the threshold and as variable if it is below.

(Equation 7)

A break is variable when the accent phrase probability $P_{AP}$ and the MP probability $P_{MP}$ are both below their respective thresholds. Otherwise (that is, for a fixed break), $P_{AP}$ and $P_{MP}$ are adjusted to 1 or 0. The adjusted vector re-expresses the output of the AP/MP decision tree with the result of the variable break decision applied.
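The decision just described, together with the use of the adjusted probabilities as a preselection ratio, might be sketched as follows. Several details are assumptions rather than statements from the patent: that the larger probability is the one set to 1 for a fixed break, that the ratio is applied by simple rounding, and the threshold values 0.9 and 0.8. All function names are hypothetical.

```python
def adjust_probabilities(p_ap, p_mp, th_ap, th_mp):
    """Return (is_variable, p_ap_adj, p_mp_adj) for a boundary with BI 2 or 3."""
    if p_ap < th_ap and p_mp < th_mp:
        # Both probabilities are below their thresholds: variable break,
        # so the original probabilities are kept.
        return True, p_ap, p_mp
    # Fixed break: snap to 1 or 0. Assumption: the larger probability wins.
    if p_ap >= p_mp:
        return False, 1.0, 0.0
    return False, 0.0, 1.0


def split_candidate_slots(n_slots, p_ap_adj):
    """Split n_slots preselection slots between BI 2 and BI 3 candidates.

    Assumption: the adjusted AP probability is used directly as the share
    of BI 2 (accent phrase) candidates, rounded to a whole number of slots.
    """
    n_bi2 = round(n_slots * p_ap_adj)
    return n_bi2, n_slots - n_bi2


# Hypothetical thresholds and tree outputs.
variable, a, m = adjust_probabilities(0.75, 0.25, th_ap=0.9, th_mp=0.8)
print(variable, split_candidate_slots(20, a))   # mixed candidate group
fixed = adjust_probabilities(0.95, 0.05, th_ap=0.9, th_mp=0.8)
print(fixed, split_candidate_slots(20, fixed[1]))  # all slots go to BI 2
```

With a fixed break the candidate group contains units of one break type only, which matches the conventional single-target behavior; with a variable break both types are represented.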

FIG. 6 shows the prosody generation process used in the present invention. Based on the information from the linguistic processing unit, the prosody processing unit first generates breaks, then generates variable breaks using the variable break model, and then uses the variable break information to generate word-level fundamental frequency contours and phoneme durations. That is, it generates multiple target prosodies.

제7a도는 기존의 단일 목표운율을 생성하는 방법을 도시하고 제7b도는 종래의 단일 목표운율 생성의 구체예를 도시한다. 본 발명에서는 다중 목표운율을 생성하기 위해 기존의 문장 또는 IP 단위의 목표운율 구조를 단어 단위의 목표운율 구조로 표현하였는데, 한 단어의 목표운율은 그 단어의 왼쪽 휴지기(BI)와 오른쪽 휴지기(BI)에 따라 달라질 수 있다. 도 7a는 종래의 합성기에서 사용하는 목표운율을 나타낸 것으로, 휴지기 예측 모듈에서 생성된 BI가 변하지 않는 고정 휴지기만으로 이루어진 경우는 하나의 단어에 하나의 목표운율만이 생성되므로 한 문장의 목표운율도 하나로 표현된다. (즉, 문장의 목표운율은 각 단어의 목표운율이 연결된 형태인 TP0-TP1-TP2-TP3-TP4로 표현할 수 있다.) FIG. 7A shows a method for generating a conventional single target rhyme, and FIG. 7B shows an embodiment of conventional single target rhyme generation. In the present invention, the target sentence structure of the existing sentence or IP unit is expressed by the target rhyme structure of the word unit in order to generate the multiple target rhymes, and the target rhyme of one word is the left pause (BI) and the right pause (BI) of the word. ) May vary. FIG. 7A illustrates a target rhyme used in a conventional synthesizer. When the BI generated by the pause prediction module is composed of only fixed pauses that do not change, only one target rhyme is generated in one word. Is expressed. (In other words, the target rhyme of a sentence can be expressed as TP0-TP1-TP2-TP3-TP4, in which the target rhyme of each word is connected.)

FIG. 7B is a concrete example of FIG. 7A. For the sentence "오늘 날씨가 매우 춥다고 합니다." ("They say the weather is very cold today."), it shows the fixed pauses and the target prosody of each word, and concatenates the word-level target prosodies to produce the fundamental frequency contour and the phoneme durations of the whole sentence. In the figure, the pause information can take the values sentence start, sentence end, and break index 1, 2, 3, or 4, and the fundamental frequency contour is drawn per word by connecting the fundamental frequency changes of the syllables in that word. With fixed pauses, each word has only one target prosody, so the sentence also has only one target prosody, formed by concatenating those of its words.

FIG. 7C shows the method of generating multiple target prosodies according to the present invention, and FIG. 7D shows a concrete example of it. As in FIG. 7C, when variable pauses are used, the BI on the left or right of a word can take either of two values: the generated BI (GBI) or the BI it may change to (EBI, expanded BI). A word whose left and right BIs are both variable pauses can therefore have up to four target prosodies. For example, if the BI generated by the pause prediction module is 2 and it is predicted to be a variable pause, the EBI is 3; if the generated BI is 3, the EBI is 2. With up to four target prosodies per word, a long sentence can carry tens to hundreds of target prosodies simultaneously, so synthesized speech can be generated from a variety of target prosodies. (In FIG. 7C, the sentence can be expressed by four target prosodies: TP00-TP10-TP20-TP30-TP40, TP00-TP10-TP21-TP30-TP40, TP00-TP11-TP22-TP30-TP40, and TP00-TP11-TP23-TP31-TP40.)

FIG. 7D is a concrete example of FIG. 7C. It shows the pauses and target prosodies of each word and concatenates them to give the fundamental frequency contour and phoneme durations of the sentence. The example mixes fixed and variable pauses. A word such as "날씨가" or "춥다고", for which only the right or only the left pause is variable, has two target prosodies; a word such as "매우", for which both the left and right pauses are variable, has four. Concatenating the word-level target prosodies so that adjoining pauses agree yields four target prosodies for the sentence. These differ slightly in pause information, fundamental frequency contour, and phoneme duration. Writing '//' for a phrase break, the first target prosody breaks after "매우", so its fundamental frequency contour and phoneme durations are set to fit "오늘 날씨가 매우 // 춥다고 한다."; the second reads all the words together without a break, so its contour and durations differ in structure from the first. The third breaks in two places, "오늘 날씨가 // 매우 // 춥다고 한다.", and the fourth has the form "오늘 날씨가 // 매우 춥다고 한다." Because the break positions change, that is, because the pause information changes, target prosodies with different fundamental frequency contours and phoneme durations can be generated.
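The enumeration described for FIG. 7C and FIG. 7D can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the boundary encoding, and the BI values assigned to the example sentence (including which pauses are variable) are assumptions, chosen so that the counts match the text (two target prosodies for "날씨가" and "춥다고", four for "매우", and four for the sentence).

```python
from itertools import product

def boundary_options(gbi, variable):
    """BI values one boundary may take: the generated BI (GBI) and, for a
    variable pause with BI 2 or 3, the expanded BI (EBI), which swaps 2 and 3."""
    if variable and gbi in (2, 3):
        return [gbi, 5 - gbi]  # 2 -> [2, 3], 3 -> [3, 2]
    return [gbi]

def word_targets(boundaries, k):
    """All (left BI, right BI) pairs for word k: one target prosody per pair."""
    left = boundary_options(*boundaries[k])
    right = boundary_options(*boundaries[k + 1])
    return list(product(left, right))

def sentence_targets(boundaries):
    """All sentence-level BI sequences. Word k uses entries k and k+1 of a
    sequence, so adjoining words always agree on their shared boundary."""
    return list(product(*(boundary_options(g, v) for g, v in boundaries)))

# Hypothetical boundaries for "오늘 날씨가 매우 춥다고 합니다": (GBI, is_variable).
# "S"/"E" mark sentence start and end; the numeric BI values are assumed.
words = ["오늘", "날씨가", "매우", "춥다고", "합니다"]
boundaries = [("S", False), (1, False), (2, True), (3, True), (1, False), ("E", False)]

counts = [len(word_targets(boundaries, k)) for k in range(len(words))]
sentences = sentence_targets(boundaries)
print(counts)          # number of target prosodies per word
print(len(sentences))  # number of sentence-level target prosodies
```

Enumerating over boundaries rather than over words keeps the shared pause between two neighboring words consistent by construction, which corresponds to the requirement that adjoining pauses agree.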

FIG. 8 shows the synthesis unit selection process according to the present invention. In a unit-selection speech synthesis system, the selection itself is carried out by a dynamic programming (Viterbi) algorithm, but the candidate selection and cost calculation that precede it have a greater effect on the quality of the synthesized speech. Typically, candidate synthesis units are chosen using target context information and an inexpensive target cost, and the concatenation cost (join cost) is then computed over those candidates; the more candidates there are, the better the synthesized speech can be. To produce natural-sounding speech, the speech corpus must contain varied prosodic information, and the candidate set must include enough plausible units to reflect that variety. However, as the number of candidates grows, the running time of the Viterbi search increases sharply and real-time synthesis becomes difficult, so simply increasing the number of candidates is not efficient. The present invention increases the number of candidates by expanding the BI for variable pauses: the candidate set includes not only synthesis units with the BI generated by the pause prediction module (GBI) but also units with the expandable BI (EBI).
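The dynamic programming search mentioned above can be sketched as follows. This is a generic Viterbi search over candidate units, not code from the patent; `viterbi_select`, the cost-table layout, and the toy costs are assumptions for illustration. The work per position is proportional to the product of the candidate counts at adjacent positions, which is why the number of candidates cannot be raised without limit.

```python
def viterbi_select(target_costs, join_cost):
    """Find the candidate sequence with minimum total cost.

    target_costs[i][j]: target cost of candidate j at position i.
    join_cost(i, a, b): cost of joining candidate a at position i-1
                        to candidate b at position i.
    Returns (total_cost, path), where path[i] is the chosen candidate index.
    """
    best = list(target_costs[0])  # best cost of any path ending at each candidate
    back = []                     # back-pointers, one list per position >= 1
    for i in range(1, len(target_costs)):
        new_best, pointers = [], []
        for b, tc in enumerate(target_costs[i]):
            costs = [best[a] + join_cost(i, a, b) for a in range(len(best))]
            a_min = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[a_min] + tc)
            pointers.append(a_min)
        best = new_best
        back.append(pointers)
    # Trace back from the cheapest final candidate.
    last = min(range(len(best)), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path

# Toy example: three positions with two candidates each (made-up costs).
toy_target_costs = [[1.0, 2.5], [2.0, 0.5], [1.0, 1.0]]
toy_join = lambda i, a, b: 0.0 if a == b else 1.0
total, path = viterbi_select(toy_target_costs, toy_join)
print(total, path)
```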

In the synthesis unit selection process of FIG. 8, the target cost is computed for both the synthesis units with fixed pauses and the expanded synthesis units with variable pauses, preselection is performed, the concatenation cost is computed, and finally a search finds the optimal synthesis units that minimize the total cost. For candidate units expanded because of a variable pause, the preselection ratio is given by Equation 7. For example, if the generated BI is 2 and the ratio value is 0.6, preselection is performed so that 60% of the entries for synthesis unit selection are candidates with BI 2 and 40% are candidates with BI 3.
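The 60%/40% split in the example can be sketched as below. This is only an illustration under stated assumptions: the `preselect` function, the tuple layout of a candidate, and ranking each group by ascending target cost are hypothetical, and the ratio is passed in directly because Equation 7 itself is not reproduced here.

```python
def preselect(candidates, gbi, ebi, ratio, n_entries):
    """Pick n_entries candidates: a `ratio` share with the generated BI (GBI)
    and the remainder with the expanded BI (EBI), cheapest target cost first.

    candidates: list of (unit_id, bi, target_cost) tuples.
    """
    n_gbi = round(ratio * n_entries)
    n_ebi = n_entries - n_gbi
    by_cost = sorted(candidates, key=lambda c: c[2])
    chosen_gbi = [c for c in by_cost if c[1] == gbi][:n_gbi]
    chosen_ebi = [c for c in by_cost if c[1] == ebi][:n_ebi]
    return chosen_gbi + chosen_ebi

# Toy candidate pool: 10 units with BI 2 and 10 with BI 3 (made-up costs).
pool = [(f"u{i}", 2, 0.1 * i) for i in range(10)]
pool += [(f"v{i}", 3, 0.1 * i + 0.05) for i in range(10)]

entries = preselect(pool, gbi=2, ebi=3, ratio=0.6, n_entries=10)
n_bi2 = sum(1 for _, bi, _ in entries if bi == 2)
n_bi3 = sum(1 for _, bi, _ in entries if bi == 3)
print(n_bi2, n_bi3)  # counts of BI 2 and BI 3 entries
```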

(Equation 8)

(Equation 9)

Equation 8 gives the target cost function, whose terms are the prosodic cost and the context cost. Because the present invention uses multiple target prosodies, the cost is computed for each target prosody as in Equation 9, and the minimum of these costs is taken as the target cost used in synthesis unit selection.
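Taking the minimum over several target prosodies can be sketched as follows. The exact form of Equations 8 and 9 is not reproduced here, so this sketch assumes an unweighted sum of a prosodic cost and a context cost; the feature names (`f0`, `dur`, `left`), the scaling, and the sample values are hypothetical.

```python
def target_cost(unit, target):
    """Assumed target cost: prosodic cost plus context cost (unweighted)."""
    # Prosodic cost: F0 mismatch (Hz, scaled by 100) plus duration mismatch (s).
    prosodic = abs(unit["f0"] - target["f0"]) / 100.0 + abs(unit["dur"] - target["dur"])
    # Context cost: 0 if the left phoneme context matches, 1 otherwise.
    context = 0.0 if unit["left"] == target["left"] else 1.0
    return prosodic + context

def multi_target_cost(unit, targets):
    """Evaluate the unit against every target prosody and keep the minimum."""
    return min(target_cost(unit, t) for t in targets)

# One candidate unit and two alternative target prosodies (made-up values).
unit = {"f0": 120.0, "dur": 0.08, "left": "a"}
targets = [
    {"f0": 150.0, "dur": 0.10, "left": "a"},  # e.g. target from the GBI
    {"f0": 125.0, "dur": 0.08, "left": "a"},  # e.g. target from the EBI
]
print(multi_target_cost(unit, targets))
```

In this toy case the unit is cheap under the second target prosody even though it fits the first poorly, so it stays competitive in selection, which is the effect the multiple target prosodies are meant to produce.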

The present invention takes the single sentence-level or IP-level target prosody used in conventional synthesizers and expresses it as word-level target prosodies using the pause information of the words. By generating prosody according to whether each word's pauses are fixed or variable, it can represent several prosodies for one sentence or IP at the same time, which increases the likelihood that the optimal prosodic pattern is chosen during synthesis unit selection.

The description above assumes corpus-based synthesis. As another embodiment of the present invention, the prosody generation method using multiple target prosodies can be applied not only to corpus-based synthesis but also to rule-based synthesis, where it lets the synthesis process compensate for errors that arise while generating the target prosody. This is possible because text analysis, language processing, and target prosody generation are the same as in the corpus-based method; only the way the synthesized speech is produced, directly from the target prosody, differs.

Rule-based synthesis analyzes the text, generates pause information, fundamental frequency contours, and phoneme durations, and uses them to generate a waveform by signal processing. It differs from the corpus-based method in how the waveform is produced. The corpus-based method retrieves segmented speech waveforms from the speech corpus through synthesis unit selection and concatenates them, whereas the rule-based method uses a vocoder, which represents the speech waveform as an excitation source (excitation, harmonics, vocal cord component) and a filter (formants, vocal tract component). If variable pauses and multiple target prosodies are also used in rule-based synthesis, several versions of the synthesized speech can be produced, as in FIG. 7D. This not only diversifies output that would otherwise have a single prosodic pattern, but also makes it possible to compensate for errors in prosody generation by using all the prosodic patterns available for the final synthesized speech of each sentence.

Furthermore, corpus-based synthesis can also benefit from multiple target prosodies when, after the speech waveforms are concatenated, the fundamental frequency and phoneme durations are modified using the target prosody, as in rule-based synthesis. This is implemented by using, among the several target prosodies, the one whose pauses match the pause information of the synthesized speech. Because synthesis unit selection has already chosen the optimal target prosody and the speech waveforms closest to it, degradation in sound quality is minimized even when the fundamental frequency or phoneme durations are modified.

To improve the naturalness of synthesized speech, the present invention introduces variable pauses into the pause information, which is one component of the prosodic information, and generates target prosodies in several patterns. Synthesis unit selection can then exploit the varied prosodic information carried by each segment of the speech database, which improves the quality of the synthesized speech.

The present invention additionally predicts, for each pause predicted by the pause prediction module, whether that pause is variable, and the prosody generation module uses this to generate a variety of target prosodies. This solves a problem of conventional methods, in which an incorrect pause prediction prevents the optimal data from being used during synthesis even though the optimal synthesis unit exists in the database.

To predict whether a pause used in speech synthesis is variable, the present invention uses a statistical method to supplement insufficient rules and outputs the prediction as a probability. This probability is used in the preselection stage of synthesis unit selection, making a search for the optimal synthesis units possible.

By using multiple target prosodies, in which several target prosodies are generated per word and combined at the sentence level, the present invention reduces prosody generation errors during synthesis unit selection and thereby solves the problem of degraded synthesized-speech quality caused by such errors.

Expressing multiple target prosodies at the sentence level, as with a conventional single target prosody, would require a large amount of information. If instead the target prosody is split and processed per word and the word-level prosodies are concatenated to form sentence-level target prosodies, several sentence-level target prosodies can be represented at once, which reduces prosody generation errors.

The prosody generation method using multiple target prosodies is applicable to rule-based synthesis as well as corpus-based synthesis, and has the effect of letting the synthesis process compensate for errors that can arise while generating the target prosody.

Constructing the candidate set during preselection so that it contains synthesis units with several kinds of pause information, in order to select synthesis units using multiple target prosodies, not only compensates for prosody generation errors in synthesis unit selection but also raises the probability of finding more nearly optimal synthesis units, so that synthesized speech with natural prosody can be generated.

Preferred embodiments of the present invention have been described above. Simple modifications and variations of the present invention can readily be made by those of ordinary skill in the art, and all such modifications and variations fall within the scope of the present invention.

Claims

A corpus-based high-quality speech synthesis method using multiple target prosodies, comprising:

a linguistic processing step comprising preprocessing that converts numbers, symbols, and special characters in an input sentence into text, analyzing the parts of speech of the preprocessed text, and converting the analyzed sentence into a phonetic transcription;

a prosody processing step of generating the pauses, durations, and intonation (fundamental frequency contour) of the synthesis unit candidates produced through the linguistic processing step;

a step of preselecting, from the synthesis unit candidates, the final synthesis unit candidates to be used in synthesis unit selection; and

a speech signal processing step of selecting, from the preselected final candidates, the synthesis units to be used for synthesizing speech for the input text, and concatenating the selected synthesis units to generate synthesized speech;

wherein the pauses in the prosody processing step are fixed pauses or variable pauses.

The method of claim 1, further comprising, in the prosody processing step, a pause prediction step of predicting whether each pause is a fixed pause or a variable pause and, when a variable pause is predicted, generating multiple prosodies.

The method of claim 2, wherein the pause prediction step predicts whether the pause is a fixed pause or a variable pause for BI 2 and BI 3.

The method of claim 2, wherein the pause prediction step comprises:

predicting the BI by rule for the text corresponding to each utterance;

when the BI is 2 or 3, generating AP/MP decision trees by the CART modeling method using the features of Table 1 below;

using the AP/MP decision trees to collect the AP and MP prediction probability values, and classifying the probability values according to the boundary information actually uttered by a voice actor;

determining the thresholds expressed by Equation 4 and Equation 6 below; and

determining the pause to be a fixed pause when the prediction probability obtained from the AP/MP decision tree is higher than the threshold, and a variable pause when it is lower than the threshold.

Table 1. Features used in CART modeling

Here, the word symbol denotes the k-th word, and the syllable symbol denotes the last syllable of that word.

(Equation 4)

(Equation 6)

Here, the symbols denote the AP threshold, the MP threshold, and the numbers of actual AP boundaries and actual MP boundaries, respectively.

The method of claim 1, wherein the prosody processing step comprises:

generating pauses by a conventional method using the language processing information; and

when a variable pause is predicted through the pause prediction step of claim 4, using the generated variable pause to generate word-level fundamental frequency contours and phoneme durations.

The method of claim 1, wherein the synthesis unit selection in the speech signal processing step selects from the large set of expanded candidate synthesis units generated through the multiple prosody generation step.

The method of claim 1, wherein, for the large set of expanded candidate synthesis units of claim 6, the preselection step uses Equation 7 below as the preselection ratio to generate the combined group of final synthesis unit candidates:

(Equation 7)

The method of claim 1, wherein the speech signal processing step computes a target cost for each candidate synthesis unit using the target cost function expressed by Equation 8 below, and selects the synthesis units with the minimum cost to generate the synthesized speech:

(Equation 8)

Here, the terms of the target cost are the prosodic cost and the context cost.

A rule-based high-quality speech synthesis method using multiple target prosodies, comprising:

a prosody processing step of generating the pauses, durations, and intonation (fundamental frequency contour) of the synthesis unit candidates produced through the linguistic processing step; and

a speech signal processing step of generating synthesized speech by directly producing a speech waveform using the linguistic processing information and the prosody processing information;

wherein, when a variable pause is predicted for a pause in the prosody processing step through the pause prediction step of claim 4, multiple prosodies are generated.