KR20230148048A

KR20230148048A - Method and system for synthesizing emotional speech based on emotion prediction

Info

Publication number: KR20230148048A
Application number: KR1020220047188A
Authority: KR
Inventors: 윤현욱; 황민제; 권오성; 송은우; 이호연; 류이치 야마모토
Original assignee: 네이버 주식회사; 웍스모바일재팬 가부시키가이샤
Priority date: 2022-04-15
Filing date: 2022-04-15
Publication date: 2023-10-24
Also published as: KR102626618B1

Abstract

The present disclosure relates to a method and system for synthesizing emotional speech, performed by at least one processor of a computing device. The emotional speech synthesis method includes generating, by an emotion estimation unit, an emotional feature embedding from input text; extracting, by a text analysis unit, linguistic features from the input text; and predicting, by a speech synthesis unit, a length of speech from the emotional feature embedding and the linguistic features and synthesizing emotional speech based on the predicted length.

Description

Emotion estimation-based emotional speech synthesis method and system {METHOD AND SYSTEM FOR SYNTHESIZING EMOTIONAL SPEECH BASED ON EMOTION PREDICTION}

The present disclosure relates to a method and system for synthesizing emotional speech based on emotion estimation and, more specifically, to a method and system for predicting the type and intensity of emotion from text and synthesizing emotional speech using the prediction.

As artificial neural network-based text-to-speech (TTS) systems have recently reached the quality of natural speech, interest has grown in emotional TTS systems that express a variety of human emotions. A common approach to emotional TTS is to use reference audio as a conditioning input. In this approach, an emotional representation is extracted from the reference audio and used as auxiliary input data for the TTS system.

Using the emotional information of natural speech directly as a conditioning input can convey emotional expression effectively through synthesized speech. However, this approach requires the auxiliary input data to be selected continually during emotion inference, which raises the cost of determining and extracting appropriate emotional characteristics for each sentence.

The present disclosure provides an emotion prediction-based emotional speech synthesis method, a system, and a computer program stored in a recording medium to solve the above problem.

The present disclosure may be implemented in various ways, including as a method, an apparatus (system), or a computer program stored in a readable storage medium.

According to one embodiment of the present disclosure, a method of generating emotional speech, performed by at least one processor of a computing device, includes: generating, by an emotion estimation unit, an emotional feature embedding from input text; extracting, by a text analysis unit, linguistic features from the input text; and predicting, by a speech synthesis unit, a length of speech from the emotional feature embedding and the linguistic features and synthesizing emotional speech based on the predicted length.

A computer program stored in a computer-readable recording medium is provided for executing, on a computer, the emotion estimation-based emotional speech generation method according to an embodiment of the present disclosure.

An information processing system according to an embodiment of the present disclosure includes a memory and at least one processor connected to the memory and configured to execute at least one computer-readable program contained in the memory. The at least one processor includes an emotion estimation unit that generates an emotional feature embedding from input text, a text analysis unit that extracts linguistic features from the input text, and a speech synthesis unit that predicts a length of speech from the emotional feature embedding and the linguistic features and synthesizes emotional speech based on the predicted length.

According to some embodiments of the present disclosure, the contextual emotion of sentences in the input text is analyzed with a language model and used for speech synthesis, so that the attributes of the emotion can be predicted accurately and reflected in the synthesized speech.

According to some embodiments of the present disclosure, not only the emotion type but also the emotion intensity is predicted from the input text, so that the coarse or fine-grained structure of the emotional characteristics can be reflected effectively in the synthesized speech.

According to some embodiments of the present disclosure, emotional attributes are predicted from the input text to be synthesized and used for speech synthesis, so that an emotional text-to-speech system can be built without auxiliary input data for reflecting emotional characteristics in the synthesis.

According to some embodiments of the present disclosure, emotions are predicted by analyzing a plurality of sentences with a language model and reflected in speech synthesis, so that speech synthesized for paragraph-level text content can be generated close to natural language.

The effects of the present disclosure are not limited to those mentioned above, and other effects not mentioned will be clearly understood from the claims by a person of ordinary skill in the art to which the present disclosure pertains (referred to as a "person skilled in the art").

Embodiments of the present disclosure will be described with reference to the accompanying drawings described below, in which like reference numerals indicate like elements, but are not limited thereto.
Figure 1 is a diagram illustrating an example in which an emotion estimation-based emotional speech generation system is used according to an embodiment of the present disclosure.
Figure 2 is a block diagram showing the internal configuration of an information processing system according to an embodiment of the present disclosure.
Figure 3 is a diagram showing the internal configuration of a processor of an information processing system according to an embodiment of the present disclosure.
Figure 4 is a block diagram showing the internal configuration of an emotion estimation unit according to an embodiment of the present disclosure.
Figure 5 is a block diagram showing the internal configuration of a speech synthesis unit according to an embodiment of the present disclosure.
Figure 6 is a diagram showing the overall configuration of an emotion estimation-based emotional speech generation system according to an embodiment of the present disclosure.
Figure 7 is a t-distributed stochastic neighbor embedding (t-SNE) plot of emotional feature vectors according to an embodiment of the present disclosure.
Figure 8 is a flowchart illustrating an emotion estimation-based emotional speech generation method according to an embodiment of the present disclosure.
Figure 9 is a flowchart illustrating a method of generating an emotional feature embedding in an emotion estimation unit according to an embodiment of the present disclosure.
Figure 10 is a flowchart illustrating a method of synthesizing emotional speech in a speech synthesis unit according to an embodiment of the present disclosure.

Hereinafter, specific details for implementing the present disclosure will be described with reference to the accompanying drawings. In the following description, however, detailed descriptions of well-known functions or configurations are omitted where they could unnecessarily obscure the gist of the present disclosure.

In the accompanying drawings, identical or corresponding components are given the same reference numerals. In the description of the following embodiments, repeated descriptions of identical or corresponding components may be omitted. However, omitting the description of a component does not mean that the component is excluded from any embodiment.

Advantages and features of the disclosed embodiments, and methods of achieving them, will become clear from the embodiments described below in conjunction with the accompanying drawings. However, the present disclosure is not limited to the embodiments disclosed below and may be implemented in various different forms. The embodiments are provided only so that the present disclosure is complete and fully conveys the scope of the invention to a person skilled in the art.

The terms used in this specification are briefly described first, and the disclosed embodiments are then described in detail. The terms used in this specification were chosen, as far as possible, from general terms in wide current use, taking into account their function in the present disclosure; however, they may vary with the intent of practitioners in the field, precedent, or the emergence of new technology. In certain cases the applicant has chosen a term arbitrarily, and its meaning is then set out in detail in the relevant part of the description. Accordingly, the terms used in the present disclosure should be defined based on their meaning and on the overall content of the present disclosure, not simply on their names.

In this specification, singular expressions include the plural unless the context clearly specifies the singular, and plural expressions include the singular unless the context clearly specifies the plural. Throughout the specification, when a part is said to include a component, this means that it may further include other components, not that other components are excluded, unless specifically stated otherwise.

The term 'module' or 'unit' used in the specification refers to a software or hardware component, and a 'module' or 'unit' performs certain roles. However, 'module' or 'unit' is not limited to software or hardware. A 'module' or 'unit' may be configured to reside on an addressable storage medium or to execute on one or more processors. Thus, as an example, a 'module' or 'unit' may include at least one of components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, or variables. The functionality provided within components and 'modules' or 'units' may be combined into a smaller number of components and 'modules' or 'units', or further separated into additional components and 'modules' or 'units'.

In one embodiment of the present disclosure, a 'module' or 'unit' may be implemented with a processor and a memory. 'Processor' should be interpreted broadly to include general-purpose processors, central processing units (CPUs), microprocessors, digital signal processors (DSPs), controllers, microcontrollers, state machines, and the like. In some contexts, 'processor' may refer to an application-specific integrated circuit (ASIC), a programmable logic device (PLD), a field-programmable gate array (FPGA), and the like. 'Processor' may also refer to a combination of processing devices, for example a combination of a DSP and a microprocessor, a combination of a plurality of microprocessors, a combination of one or more microprocessors with a DSP core, or any other such combination of configurations. 'Memory' should be interpreted broadly to include any electronic component capable of storing electronic information. 'Memory' may refer to various types of processor-readable media, such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, magnetic or optical data storage, and registers. A memory is said to be in electronic communication with a processor if the processor can read information from and/or write information to the memory. A memory integrated into a processor is in electronic communication with the processor.

In the present disclosure, a 'system' may include at least one of a server device and a cloud device, but is not limited thereto. For example, a system may consist of one or more server devices. As another example, a system may consist of one or more cloud devices. As yet another example, a system may consist of a server device and a cloud device operating together.

In the present disclosure, 'emotional speech' may refer to speech that is uttered or played back with emotion reflected or expressed, as opposed to read-aloud speech that is uttered or played back with emotion removed or restrained. The emotional expression reflected in speech is conveyed by the tone of voice known as prosody, and the main properties of prosody may include relative pitch, length of speech, and intensity. The prosodic properties reflected in synthesized speech are expressed as intonation, rhythm, stress, and the like, and can affect how emotion is perceived when the speech is heard.

In the present disclosure, a 'language model' may refer to a model that assigns probabilities to word sequences or sentences, or that finds an appropriate word sequence. For example, a language model may analyze text data to determine the probability of particular words that are, or should be, contained in the text, and may interpret the text data through an algorithm that establishes rules for the natural-language context of the text. Language models may include, but are not limited to, models that use statistical methods and models that use artificial neural networks, such as GPT-3 (Generative Pre-trained Transformer 3) and BERT (Bidirectional Encoder Representations from Transformers).
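
As a rough illustration of a statistical language model that assigns a probability to a word sequence, the following sketch trains an add-one smoothed bigram model on a toy corpus. The corpus, the function names, and the choice of a bigram model are assumptions made for illustration only; the disclosure itself names GPT-3 and BERT as examples and does not prescribe this model.

```python
from collections import Counter


def train_bigram(corpus):
    """Count context unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens[:-1])               # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams


def sequence_probability(sentence, unigrams, bigrams, vocab_size):
    """Probability of a sentence under an add-one smoothed bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        prob *= (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
    return prob


# Toy corpus (hypothetical data, English stand-in for the example greeting).
corpus = ["hello nice to meet you", "hello nice day", "nice to meet you"]
unigrams, bigrams = train_bigram(corpus)
vocab_size = len({w for s in corpus for w in s.split()} | {"</s>"})

p_seen = sequence_probability("hello nice to meet you", unigrams, bigrams, vocab_size)
p_shuffled = sequence_probability("you meet to nice hello", unigrams, bigrams, vocab_size)
```

In this toy setup the sentence in its observed word order receives a higher probability than the same words shuffled, which is the sense in which such a model "finds an appropriate word sequence".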

Figure 1 is a diagram illustrating an example in which an emotion estimation-based emotional speech synthesis system 100 is used according to an embodiment of the present disclosure. As shown, the emotional speech synthesis system 100 performs emotion estimation 130, text analysis 120, and speech synthesis 140 on the input text 110 and outputs synthesized emotional speech 150 in which the emotional expression is reflected.

In the emotion estimation 130 operation, the emotional speech synthesis system 100 analyzes the linguistic features or context information of the input text 110 (for example, "Hello, nice to meet you") and extracts an emotional expression or emotional information from it. According to one embodiment, a pre-trained language model is applied to the input text 110 to extract a feature vector representing the linguistic features of the text 110. An emotion type prediction unit and an emotion intensity prediction unit then estimate the emotion type and the emotion intensity, respectively, from the feature vector. According to one embodiment, the emotion type prediction unit and the emotion intensity prediction unit may each consist of a fully connected (FC) neural network layer. In addition, the emotion type and the emotion intensity may be converted by a vector synthesis unit into an emotional feature embedding that represents the emotional expression.
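
The data flow of the emotion estimation operation can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the feature vector stands in for the output of a pre-trained language model, the weights are random rather than trained, the label set `EMOTIONS` is hypothetical, and the way type and intensity are combined (an intensity-scaled, probability-weighted sum over a per-emotion embedding table) is only one plausible choice, since the passage above does not specify how the vector synthesis unit works.

```python
import math
import random

EMOTIONS = ["neutral", "happy", "sad", "angry"]  # hypothetical label set


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def estimate_emotion(feature, params):
    """Predict emotion type and intensity, then combine them into one embedding."""
    type_probs = softmax(linear(feature, params["type_w"], params["type_b"]))
    intensity = sigmoid(linear(feature, params["int_w"], params["int_b"])[0])
    dim = len(params["table"][0])
    # Assumed combination rule: intensity-scaled expectation over the table rows.
    embedding = [
        intensity * sum(p * row[d] for p, row in zip(type_probs, params["table"]))
        for d in range(dim)
    ]
    return type_probs, intensity, embedding


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
FEATURE_DIM, EMBED_DIM = 8, 4  # illustrative sizes only
params = {
    "type_w": random_matrix(rng, len(EMOTIONS), FEATURE_DIM),
    "type_b": [0.0] * len(EMOTIONS),
    "int_w": random_matrix(rng, 1, FEATURE_DIM),
    "int_b": [0.0],
    "table": random_matrix(rng, len(EMOTIONS), EMBED_DIM),
}
# Stand-in for the feature vector a language model would produce for the text.
feature = [rng.uniform(-1.0, 1.0) for _ in range(FEATURE_DIM)]
type_probs, intensity, embedding = estimate_emotion(feature, params)
```

The two prediction heads share the same input feature vector, which mirrors the description of separate type and intensity prediction units operating on one language-model output.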

Meanwhile, in the text analysis 120 operation, the emotional speech synthesis system 100 receives the input text 110 and extracts linguistic features from it. Here, text analysis 120 may include converting numbers and the like in the text 110 into readable expressions, through steps such as text normalization, preprocessing, and tokenization, as well as dividing the phrases, words, and sentences in the text 110 into prosodic units and converting each word into phonetic symbols.
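
The text analysis steps named above (normalization, splitting into prosodic units, conversion to phonetic symbols) can be sketched with a toy English front end. Everything here is an assumption for illustration: the digit-by-digit number expansion, the punctuation-based phrase splitting, and the tiny `TOY_LEXICON` with a letter-by-letter fallback are stand-ins, and a real system for Korean or Japanese text would use a proper normalizer and grapheme-to-phoneme model.

```python
import re

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]

# Hypothetical pronunciation lexicon; unknown words fall back to their letters.
TOY_LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "nice": ["N", "AY", "S"],
}


def normalize(text):
    """Lowercase, spell out each digit, and keep only letters and , . ? !"""
    text = text.lower()
    text = re.sub(r"\d", lambda m: " " + DIGIT_WORDS[int(m.group())] + " ", text)
    text = re.sub(r"[^a-z,.?! ]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_prosodic_units(text):
    """Split at punctuation as a crude stand-in for prosodic phrase boundaries."""
    return [unit.strip() for unit in re.split(r"[,.?!]", text) if unit.strip()]


def to_phonemes(word):
    """Look the word up in the toy lexicon, else spell it letter by letter."""
    return TOY_LEXICON.get(word, list(word))


def analyze(text):
    """Return, per prosodic unit, the phoneme sequence of each word."""
    units = split_prosodic_units(normalize(text))
    return [[to_phonemes(word) for word in unit.split()] for unit in units]


features = analyze("Hello, room 42!")
```

For the sample input, the normalizer turns "42" into the words "four two", and the comma separates the greeting from the rest as two units.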

The emotional speech synthesis system 100 predicts the length of the speech from the emotional feature embedding and the linguistic features extracted in the emotion estimation 130 and text analysis 120 operations, respectively, and performs speech synthesis 140 based on the predicted length. According to one embodiment, a length prediction unit predicts the length of the speech from the emotional feature embedding and the linguistic features, and an upsampling unit upsamples the linguistic embedding extracted from the linguistic features to match that length. A decoder then generates acoustic features from the upsampled linguistic embedding, and a vocoder synthesizes emotional speech from the acoustic features. The speech synthesis 140 operation may use one of various speech synthesis models, including statistical models (e.g., an HMM-based model) and deep learning-based models (e.g., an end-to-end model).
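
Length prediction followed by upsampling can be sketched as below. This is an illustrative sketch only: the linear `predict_durations` function, its weights, and the per-phoneme frame counts are hypothetical stand-ins for a trained length prediction unit, and the decoder and vocoder stages that would follow are not shown.

```python
def predict_durations(linguistic, emotion_embedding, weights, bias):
    """Hypothetical linear length predictor: frames per unit, at least one."""
    durations = []
    for vec in linguistic:
        x = vec + emotion_embedding  # list concatenation of the two inputs
        score = sum(w * v for w, v in zip(weights, x)) + bias
        durations.append(max(1, round(score)))
    return durations


def upsample(embeddings, durations):
    """Repeat each linguistic embedding for its predicted number of frames."""
    frames = []
    for vec, n in zip(embeddings, durations):
        frames.extend([vec] * n)
    return frames


# Three phoneme-level linguistic embeddings and one emotion embedding (toy values).
linguistic = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emotion_embedding = [0.5]
weights, bias = [2.0, 3.0, 2.0], 0.0

durations = predict_durations(linguistic, emotion_embedding, weights, bias)
frames = upsample(linguistic, durations)
```

Because the emotion embedding is part of the predictor's input, changing it changes the predicted durations, which is how the emotional state can influence speaking rate in a design of this kind. The frame sequence produced by `upsample` has one entry per output frame and would be the decoder's input.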

As described above, the emotional speech synthesis system 100 estimates emotion from the input text 110 (130) to generate an emotional feature embedding, analyzes the input text 110 (120) to extract linguistic features, and synthesizes emotional speech from the linguistic features and the emotional feature embedding (140). In this way, it can output emotional speech 150 in which the emotional expression is reflected (for example, "Hello. Nice to meet you." uttered or played back in a happy voice) from the input text 110 alone, without auxiliary or additional input data. In addition, by analyzing the contextual emotion of sentences in the input text 110 with a language model and using it for speech synthesis, the system 100 can accurately predict the attributes of the emotion and reflect them in the synthesized emotional speech 150. Furthermore, by predicting not only the emotion type but also the emotion intensity from the input text 110, the system 100 can effectively reflect the coarse or fine-grained structure of the emotional characteristics in the synthesized emotional speech 150.

Figure 2 is a block diagram showing the internal configuration of an information processing system 200 according to an embodiment of the present disclosure. The information processing system 200 may correspond to the emotional speech synthesis system 100 shown in Figure 1 and may include a memory 210, a processor 220, a communication module 230, and an input/output interface 240. The information processing system 200 may be configured to exchange information and/or data with an external system over a network using the communication module 230.

The memory 210 may include any non-transitory computer-readable recording medium. According to one embodiment, the memory 210 may include random access memory (RAM), read-only memory (ROM), and a permanent mass storage device such as a disk drive, a solid state drive (SSD), or flash memory. As another example, a permanent mass storage device such as ROM, an SSD, flash memory, or a disk drive may be included in the information processing system 200 as a separate persistent storage device distinct from the memory. The memory 210 may also store an operating system and at least one program code (for example, code installed and run on the information processing system 200 to perform emotion estimation, speech synthesis, and the like).

These software components may be loaded from a computer-readable recording medium separate from the memory 210. Such a separate computer-readable recording medium may include a recording medium that can be connected directly to the information processing system 200, for example a floppy drive, disk, tape, DVD/CD-ROM drive, or memory card. As another example, the software components may be loaded into the memory 210 through the communication module 230 rather than from a computer-readable recording medium. For example, at least one program may be loaded into the memory 210 based on a computer program (for example, a program for performing emotion estimation, speech synthesis, and the like) that is installed from files provided through the communication module 230 by developers or by a file distribution system that distributes application installation files.

The processor 220 may be configured to process the instructions of a computer program by performing basic arithmetic, logic, and input/output operations. Instructions may be provided to a user terminal (not shown) or another external system by the memory 210 or the communication module 230. For example, the processor 220 may extract linguistic features from the input text, generate an emotional feature embedding from the input text, predict the length of the speech from the emotional feature embedding and the linguistic features, and synthesize emotional speech based on that length.

The communication module 230 may provide a configuration or function that allows a user terminal (not shown) and the information processing system 200 to communicate with each other over a network, and that allows the information processing system 200 to communicate with an external system (for example, a separate cloud system). For example, control signals, instructions, data, and the like provided under the control of the processor 220 of the information processing system 200 may pass through the communication module 230 and the network and be delivered to the user terminal and/or the external system through the communication module of that terminal and/or system. For example, the user terminal may receive the generated emotional speech.

The input/output interface 240 of the information processing system 200 may be a means of interfacing with an input or output device (not shown) that is connected to, or may be included in, the information processing system 200. Although Figure 2 shows the input/output interface 240 as an element separate from the processor 220, this is not limiting, and the input/output interface 240 may be configured as part of the processor 220. The information processing system 200 may include more components than those shown in Figure 2. However, most conventional components need not be shown explicitly.

The processor 220 of the information processing system 200 may be configured to manage, process, and/or store information and/or data received from a plurality of user terminals and/or a plurality of external systems. According to one embodiment, the processor 220 may receive input text from a plurality of user terminals. The processor 220 may then extract linguistic features from the input text, generate an emotional feature embedding from the input text, predict the speech length from the emotional feature embedding and the linguistic features, and synthesize emotional speech based on the predicted speech length.

FIG. 3 is a diagram showing the internal configuration 300 of the processor 220 of the information processing system according to an embodiment of the present disclosure. According to one embodiment, the processor 220 may include a text analysis unit 310, an emotion estimation unit 320, and a speech synthesis unit 330. The internal configuration of the processor 220 shown in FIG. 3 is only an example and may be implemented differently. For example, some components of the processor 220 may be omitted or other components may be added, and at least some of the operations or processes performed by the processor 220 may instead be performed by a processor of a user terminal communicatively connected to the information processing system.

According to one embodiment, the processor 220 may receive input text, and the input text may be provided to at least one of the text analysis unit 310 and the emotion estimation unit 320. For example, the input text may be received from a user terminal, an external system, or another application (for example, a dialogue generation application that implements a chatbot), or may be generated by a dialogue generation unit (not shown) of the processor 220.

The text analysis unit 310 may generate linguistic features from the input text and transmit them to the speech synthesis unit 330. According to one embodiment, the linguistic features serve as the input vector of the speech synthesis unit 330 and may be a 354-dimensional feature vector composed of 330-dimensional categorical contexts and 24-dimensional numerical contexts extracted from the input text, but are not limited thereto.
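As a minimal sketch of how such a 354-dimensional input vector could be assembled, the snippet below one-hot encodes a few categorical contexts and appends the numerical contexts. The field names and the way the 330 categorical dimensions are split across fields are hypothetical; only the totals (330 + 24 = 354) come from the description.

```python
from typing import Dict, List

# Hypothetical categorical context fields and their cardinalities.
# Only the total (330) comes from the description; the split is illustrative.
CATEGORICAL_FIELDS: Dict[str, int] = {
    "previous_phoneme": 60,
    "current_phoneme": 60,
    "next_phoneme": 60,
    "part_of_speech": 50,
    "other_contexts": 100,
}
NUM_NUMERICAL = 24  # numerical contexts, e.g. positions and counts


def one_hot(index: int, size: int) -> List[float]:
    """Return a one-hot vector of the given size."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_linguistic_feature(categorical: Dict[str, int],
                             numerical: List[float]) -> List[float]:
    """Concatenate one-hot categorical contexts with numerical contexts."""
    if len(numerical) != NUM_NUMERICAL:
        raise ValueError("expected 24 numerical contexts")
    feature: List[float] = []
    for name, size in CATEGORICAL_FIELDS.items():
        feature.extend(one_hot(categorical[name], size))
    feature.extend(numerical)
    return feature
```

In this sketch each unit of text (for example a phoneme) would yield one such vector, so a sentence becomes a sequence of 354-dimensional vectors.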

The emotion estimation unit 320 may generate an emotional feature embedding from the input text. According to one embodiment, the emotion estimation unit 320 may include a language model that extracts a feature vector from the input text, an emotion type prediction unit that predicts an emotion type from the feature vector, an emotion strength prediction unit that predicts an emotion strength from the feature vector, and a vector synthesis unit that generates the emotional feature embedding from the emotion type and the emotion strength. Here, the emotion type may be expressed as a one-hot encoded value and the emotion strength as a scalar value, but they are not limited thereto.

According to one embodiment, the processor 220 may further include an emotion learning unit 340 for training the emotion estimation unit 320 based on emotion type data and emotion strength data extracted from pre-recorded speech data. The emotion type data may be received from the speaker of the pre-recorded speech data, and the emotion strength data may be generated by a ranking support vector machine (Ranking SVM) trained to output relative ranking information for emotional features extracted from neutral emotional language. Specifically, emotional features including mel-frequency cepstral coefficients (MFCCs) or the fundamental frequency may be extracted from the neutral emotional language, and a ranking function over the emotional features may be learned by the ranking support vector machine. As a result, emotion strength data that quantifies the relative ranking of emotions may be generated. The emotional features used to train the ranking support vector machine may, for example, have 384 dimensions and may be extracted from speech data or signals.
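The idea of learning a ranking function and turning its scores into strength values can be sketched as follows. This is not the ranking SVM solver itself: it substitutes plain subgradient descent on a pairwise hinge loss, uses 2-dimensional toy features instead of 384-dimensional ones, and the pairing of "higher" and "lower" samples as well as the min-max normalization are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def train_ranker(pairs: List[Tuple[Vector, Vector]], dim: int,
                 epochs: int = 100, lr: float = 0.1,
                 reg: float = 0.01) -> List[float]:
    """Learn w so that dot(w, higher) exceeds dot(w, lower) by a margin.

    Each pair is (higher_ranked_features, lower_ranked_features).
    """
    w = [0.0] * dim
    for _ in range(epochs):
        for higher, lower in pairs:
            diff = [h - l for h, l in zip(higher, lower)]
            # L2 regularization step (weight decay).
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge-loss subgradient step when the margin is violated.
            if dot(w, diff) < 1.0:
                w = [wi + lr * d for wi, d in zip(w, diff)]
    return w


def strength_scores(w: Vector, samples: List[Vector]) -> List[float]:
    """Map ranking scores to [0, 1] by min-max normalization (assumed)."""
    raw = [dot(w, x) for x in samples]
    low, high = min(raw), max(raw)
    if high == low:
        return [0.0 for _ in raw]
    return [(r - low) / (high - low) for r in raw]
```

With pairs built so that more expressive samples are ranked above less expressive ones, the normalized score of each sample can serve as a scalar strength label for training the strength predictor.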

The speech synthesis unit 330 may predict the speech length from the emotional feature embedding and the linguistic features, and synthesize emotional speech based on the predicted speech length. According to one embodiment, the speech synthesis unit 330 may extract a linguistic embedding from the linguistic features, upsample the linguistic embedding to match the speech length, generate acoustic features from the upsampled linguistic embedding, and synthesize emotional speech from the acoustic features. According to one embodiment, the speech synthesis unit 330 may be implemented using a Tacotron2 speech model capable of aligning the acoustic features with the phoneme sequence based on the speech length.

According to one embodiment, the emotion learning unit 340 may use training criteria to train the emotion estimation unit 320 and the speech synthesis unit 330. The loss functions of the emotion estimation unit 320 model and the speech synthesis unit 330 model used as the training criteria may be expressed as in [Equation 1] below. Here, the balancing hyperparameter may be set to 0.01.

: Loss function of the speech synthesis unit model

: Loss function of the emotion estimation unit model

: Predicted value of the acoustic features

: Acoustic features

: Predicted value of the phoneme length

: Phoneme length

: L2 loss function

: Cross-entropy loss function

: Hyperparameter for balancing the two loss functions
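The building blocks named in the variable list can be sketched in code. The L2 and cross-entropy functions and the value 0.01 come from the description; how the terms are combined is an assumption, since the equation itself is not reproduced in the text. In particular, summing the two L2 terms for the speech synthesis loss and applying the balancing weight to the strength regression term of the emotion loss are hypothetical choices made only for illustration.

```python
import math
from typing import Sequence

LAMBDA = 0.01  # value of the balancing hyperparameter given in the description


def l2_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target) or not pred:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def cross_entropy(probs: Sequence[float], target_index: int) -> float:
    """Cross-entropy of class probabilities against a one-hot target."""
    return -math.log(max(probs[target_index], 1e-12))


def tts_loss(acoustic_pred: Sequence[float], acoustic: Sequence[float],
             length_pred: Sequence[float], length: Sequence[float]) -> float:
    # Assumption: acoustic and phoneme-length L2 terms are summed unweighted.
    return l2_loss(acoustic_pred, acoustic) + l2_loss(length_pred, length)


def emotion_loss(type_probs: Sequence[float], type_index: int,
                 strength_pred: float, strength: float,
                 lam: float = LAMBDA) -> float:
    # Assumption: lam weights the strength term; the source does not make
    # clear which of the two terms the hyperparameter scales.
    return (cross_entropy(type_probs, type_index)
            + lam * l2_loss([strength_pred], [strength]))
```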

FIG. 4 is a block diagram showing the internal configuration 400 of the emotion estimation unit 320 according to an embodiment of the present disclosure. According to one embodiment, the emotion estimation unit 320 may include a language model 410, an emotion type prediction unit 420, an emotion strength prediction unit 430, and a vector synthesis unit 440. The internal configuration 400 of the emotion estimation unit 320 shown in FIG. 4 is only an example and may be implemented differently. For example, some components of the emotion estimation unit 320 may be omitted or other components may be added, and at least some of the processes performed by the emotion estimation unit 320 may be performed by a processor of a user terminal.

The language model 410 may extract a feature vector from the input text. According to one embodiment, the language model 410 may be an artificial-intelligence-based model trained to extract features for predicting emotional attributes. For example, the language model 410 may be a GPT-3-based language model trained on a Korean text corpus and configured to effectively extract high-level contextual features between adjacent characters. In this case, the language model 410 may consist of 12 transformer decoder layers with 768 hidden units and 16 attention heads, and may be trained on a text corpus of 560 billion tokens.
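To give a feel for the scale of this configuration, the following sketch computes the per-head dimension and a rough parameter count for the transformer blocks. The layer count, hidden size, and head count are from the description; the feed-forward width of four times the hidden size is an assumption (a common transformer default), and biases, layer normalization, and embedding tables are ignored.

```python
NUM_LAYERS = 12     # transformer decoder layers (from the description)
HIDDEN_UNITS = 768  # hidden units (from the description)
NUM_HEADS = 16      # attention heads (from the description)


def head_dimension(hidden: int, heads: int) -> int:
    """Dimension handled by each attention head."""
    if hidden % heads != 0:
        raise ValueError("hidden size must be divisible by the head count")
    return hidden // heads


def approx_block_parameters(num_layers: int, hidden: int,
                            ffn_multiplier: int = 4) -> int:
    """Rough weight count of the transformer blocks only.

    ffn_multiplier=4 is an assumed feed-forward expansion factor.
    """
    attention = 4 * hidden * hidden  # query, key, value, output projections
    feed_forward = 2 * hidden * (ffn_multiplier * hidden)
    return num_layers * (attention + feed_forward)
```

Under these assumptions each head works on 48 dimensions and the blocks hold roughly 85 million weights.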

The emotion type prediction unit 420 and the emotion strength prediction unit 430 may each be composed of fully connected (FC) layers and may predict the emotion type and the emotion strength, respectively, from the feature vector. According to one embodiment, the emotion type prediction unit 420 and the emotion strength prediction unit 430 may each include one hidden layer of 256 hidden units and a ReLU activation layer, and the FC layers may include a four-dimensional softmax output layer and a one-dimensional linear output layer, respectively.
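A forward pass through two such prediction heads might look like the sketch below. It uses untrained random weights and pure Python for readability. The 256 hidden units, ReLU, four-way softmax, and one-dimensional linear output follow the description; the 768-dimensional input (taken to match the language model's hidden size) and the initialization scheme are assumptions.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]

FEATURE_DIM = 768  # assumption: feature vector size equals the LM hidden size
HIDDEN_DIM = 256
NUM_EMOTIONS = 4


def init_linear(rng: random.Random, in_dim: int,
                out_dim: int) -> Tuple[Matrix, List[float]]:
    """Random weights and zero biases (illustrative initialization)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def linear(x: List[float], weights: Matrix, bias: List[float]) -> List[float]:
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(v: List[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def softmax(v: List[float]) -> List[float]:
    m = max(v)  # subtract the maximum for numerical stability
    exps = [math.exp(a - m) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


class PredictionHead:
    """One hidden FC layer with ReLU, followed by an output layer."""

    def __init__(self, rng: random.Random, out_dim: int) -> None:
        self.hidden = init_linear(rng, FEATURE_DIM, HIDDEN_DIM)
        self.output = init_linear(rng, HIDDEN_DIM, out_dim)

    def __call__(self, feature: List[float]) -> List[float]:
        h = relu(linear(feature, *self.hidden))
        return linear(h, *self.output)


def predict_emotion(feature: List[float], type_head: PredictionHead,
                    strength_head: PredictionHead) -> Tuple[List[float], float]:
    """Return (emotion type probabilities, scalar emotion strength)."""
    type_probs = softmax(type_head(feature))
    strength = strength_head(feature)[0]
    return type_probs, strength
```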

The vector synthesis unit 440 may generate the emotional feature embedding from the emotion type and the emotion strength. According to one embodiment, the vector synthesis unit 440 may convert the emotion type into a corresponding emotion type embedding through a pre-trained look-up table (LUT), and mix the emotion type embedding with the emotion strength to generate the emotional feature embedding. A specific method of mixing the emotion type embedding and the emotion strength may be expressed as in [Equation 2] below.

Here, the variables in [Equation 2] are as follows.

: Emotional feature embedding

: Softplus activation function

: Pre-trained projection matrix

: Emotion type embedding

: Scalar variable

: Emotion strength

According to Equation 2, the vector synthesis unit 440 may use a softplus activation function to generate the emotional feature embedding from the product of the emotion type embedding and the pre-trained projection matrix and the product of the emotion strength and the scalar variable. Here, a look-up table may be used to convert the emotion type into one of four 32-dimensional feature vectors, and the learned projection matrix may have a dimension of 32.
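One possible reading of this mixing step is sketched below. The look-up table of four 32-dimensional vectors, the projection matrix, the softplus activation, and the scalar variable are from the description. The exact way the two products are combined is not recoverable from the text, so the form used here, softplus of the projected type embedding scaled by the scalar variable times the strength, is an assumption, as are the 32 × 32 shape of the projection matrix and the random stand-ins for trained parameters.

```python
import math
import random
from typing import Dict, List

EMOTIONS = ("neutral", "happy", "sad", "angry")
EMBED_DIM = 32


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


class VectorSynthesizer:
    """Mixes an emotion type embedding with a scalar emotion strength."""

    def __init__(self, seed: int = 0, alpha: float = 1.0) -> None:
        rng = random.Random(seed)
        # Random stand-ins for the pre-trained look-up table and projection.
        self.lut: Dict[str, List[float]] = {
            name: [rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
            for name in EMOTIONS
        }
        self.projection = [[rng.uniform(-1.0, 1.0) for _ in range(EMBED_DIM)]
                           for _ in range(EMBED_DIM)]
        self.alpha = alpha  # scalar variable (learned in a real system)

    def __call__(self, emotion: str, strength: float) -> List[float]:
        type_embedding = self.lut[emotion]
        projected = [sum(w * e for w, e in zip(row, type_embedding))
                     for row in self.projection]
        scale = self.alpha * strength
        # Assumed combination: softplus(projection) scaled by alpha * strength.
        return [scale * softplus(p) for p in projected]
```

Under this assumed form, a strength of zero maps every emotion type to the same vector, and larger strengths move the embedding further along a type-specific direction.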

FIG. 5 is a block diagram showing the internal configuration 500 of the speech synthesis unit 330 according to an embodiment of the present disclosure. According to one embodiment, the speech synthesis unit 330 may include a length prediction unit 510, an encoder 520, an upsampling unit 530, a decoder 540, and a vocoder 550. The internal configuration 500 of the speech synthesis unit 330 shown in FIG. 5 is only an example and may be implemented differently. For example, some components of the speech synthesis unit 330 may be omitted or other components may be added, and at least some of the processes performed by the speech synthesis unit 330 may be performed by a processor of a user terminal.

The length prediction unit 510 may predict emotion-dependent phoneme-level lengths from a vector that combines the emotional feature embedding and the linguistic features. According to one embodiment, the length prediction unit 510 may consist of three fully connected layers and a unidirectional long short-term memory (LSTM) layer with 256 units. The three fully connected layers may have 1024, 1024, and 512 units, respectively.

The encoder 520 may extract a linguistic embedding from the linguistic features. According to one embodiment, the encoder 520 may consist of three convolution layers with 10 × 1 kernels and 512 channels, a bidirectional LSTM with 512 memory blocks, and fully connected layers with 512 units.

The upsampling unit 530 may upsample the linguistic embedding to match the speech length. According to one embodiment, the upsampling unit 530 may upsample the linguistic embedding extracted from the linguistic features through a repetition process that matches its temporal resolution to the resolution of the acoustic features to be output, in accordance with the speech length.
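A repetition-based upsampling step can be sketched as follows, assuming the predicted lengths are expressed as integer frame counts per phoneme and that upsampling simply copies each phoneme-level embedding that many times. The function name is hypothetical.

```python
from typing import List


def upsample_by_length(embeddings: List[List[float]],
                       lengths: List[int]) -> List[List[float]]:
    """Repeat each phoneme-level embedding for its predicted number of frames."""
    if len(embeddings) != len(lengths):
        raise ValueError("one length is required per embedding")
    frames: List[List[float]] = []
    for embedding, length in zip(embeddings, lengths):
        if length < 0:
            raise ValueError("lengths must be non-negative")
        # Copy so that later edits to one frame do not affect the others.
        frames.extend(list(embedding) for _ in range(length))
    return frames
```

For example, three phoneme embeddings with lengths 2, 0, and 3 produce five frame-level embeddings, which matches the frame rate at which the decoder emits acoustic features.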

The decoder 540 may combine the upsampled linguistic embedding with the emotional feature embedding and generate acoustic features autoregressively. According to one embodiment, the decoder 540 may consist of a PreNet, a PostNet, and a main LSTM block. The PreNet may consist of two fully connected layers with 256 units, the PostNet of five convolution layers with 5 × 1 kernels and 512 channels, and the main LSTM block of two unidirectional LSTM layers with 1024 memory blocks.

According to one embodiment, the acoustic features generated by the decoder 540 are the output vector of the text-to-speech (TTS) speech model and may be a 79-dimensional feature vector extracted with ITFTE (Improved Time-Frequency Trajectory Excitation), consisting of 40-dimensional line spectral frequencies, F0, energy, a binary voicing flag, a 32-dimensional SEW (Slowly Evolving Waveform), and a 4-dimensional REW (Rapidly Evolving Waveform). The frame length and shift size used by the TTS speech model may be set to 20 ms and 5 ms, respectively.
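The dimension bookkeeping and frame sizes can be checked with a short sketch. The component sizes and the 20 ms / 5 ms settings are from the description; treating F0, energy, and the voicing flag as one dimension each follows from the 79-dimensional total. The 24 kHz sampling rate is the one reported for the recorded corpus elsewhere in the description, and applying it here, as well as framing without padding, is an assumption.

```python
ACOUSTIC_LAYOUT = {
    "line_spectral_frequencies": 40,
    "f0": 1,
    "energy": 1,
    "voicing_flag": 1,
    "sew": 32,
    "rew": 4,
}
SAMPLE_RATE_HZ = 24000  # assumption: same rate as the recorded corpus
FRAME_LENGTH_MS = 20
FRAME_SHIFT_MS = 5


def ms_to_samples(milliseconds: int, sample_rate_hz: int) -> int:
    """Convert a duration in milliseconds to a sample count."""
    return sample_rate_hz * milliseconds // 1000


def num_frames(num_samples: int, frame_length: int, frame_shift: int) -> int:
    """Number of full frames that fit without padding (assumed framing)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_shift
```

Under these assumptions a frame spans 480 samples, consecutive frames are 120 samples apart, and the decoder produces about 200 acoustic feature vectors per second of speech.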

The vocoder 550 may synthesize emotional speech from the acoustic features. According to one embodiment, the vocoder 550 may be a multi-band harmonic-plus-noise Parallel WaveGAN (MB-HN-PWG) vocoder. In this case, the vocoder 550 may generate a high-quality speech waveform by capturing the harmonic and noise components together using two separate WaveNets, one for harmonics and one for noise. For example, the harmonic WaveNet of the vocoder 550 may consist of 20 dilated residual blocks with two exponentially increasing dilation cycles, and the noise WaveNet may consist of 10 residual blocks with one exponentially increasing dilation cycle. Here, the number of residual channels and skip channels may be set to 64, and the convolution filter size may be set to 5.
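The block and cycle counts translate into dilation schedules and receptive fields as sketched below. The block counts, cycle counts, and filter size of 5 are from the description; doubling the dilation at each block within a cycle (base 2), one dilated convolution per residual block, and a stride of 1 are assumptions.

```python
from typing import List


def dilation_schedule(num_blocks: int, num_cycles: int,
                      base: int = 2) -> List[int]:
    """Dilations that grow as base**i within each cycle, then restart."""
    if num_blocks % num_cycles != 0:
        raise ValueError("blocks must divide evenly into cycles")
    per_cycle = num_blocks // num_cycles
    return [base ** i for _ in range(num_cycles) for i in range(per_cycle)]


def receptive_field(dilations: List[int], kernel_size: int) -> int:
    """Receptive field, in samples, of stacked stride-1 dilated convolutions."""
    return 1 + (kernel_size - 1) * sum(dilations)


harmonic = dilation_schedule(num_blocks=20, num_cycles=2)
noise = dilation_schedule(num_blocks=10, num_cycles=1)
```

Under these assumptions both networks reach a maximum dilation of 512, and the harmonic network sees about twice as much context as the noise network.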

FIG. 6 is a diagram showing the overall configuration of an emotion estimation-based emotional speech generation system 600 according to an embodiment of the present disclosure. According to one embodiment, the emotional speech generation system 600 may include a text analysis unit 310, an emotion estimation unit 320, and a speech synthesis unit 330. In describing FIG. 6, components that overlap with FIGS. 4 and 5 are described only briefly, with reference to the embodiment shown in FIG. 6.

The text analysis unit 310 may generate linguistic features from the input text and pass them to the speech synthesis unit. In one embodiment, the text analysis unit 310 may perform tasks such as text normalization, preprocessing, and tokenization that convert numbers and the like in the input text into readable expressions, as well as tasks that split idioms, words, and sentences in the input text into prosodic units and convert each word into phonetic symbols.

The emotion estimation unit 320 may include a language model 410, an emotion type prediction unit 420, an emotion strength prediction unit 430, and a vector synthesis unit 440. The language model 410 may extract a feature vector from the input text, and the emotion type prediction unit 420 and the emotion strength prediction unit 430 may predict the emotion type and the emotion strength from the feature vector. The vector synthesis unit 440 may generate the emotional feature embedding from the emotion type and the emotion strength and pass it to the speech synthesis unit.

The speech synthesis unit 330 may include a length prediction unit 510, an encoder 520, an upsampling unit 530, a decoder 540, and a vocoder 550. The length prediction unit 510 may receive the emotional feature embedding and the linguistic features and predict the speech length, and the encoder 520 may extract a linguistic embedding from the linguistic features. The upsampling unit 530 may receive the linguistic embedding and the speech length and upsample the linguistic embedding to match the speech length. The decoder 540 may generate acoustic features from the upsampled linguistic embedding, and the vocoder 550 may synthesize emotional speech from the acoustic features and output it.

The inventors of the present application used a TTS corpus recorded by a professional Korean female voice actor as a test set for training and validating the emotion prediction and speech synthesis of the emotional speech generation system 600. The TTS corpus was labeled with four emotions (neutral, happy, sad, and angry), and the recorded speech signals were sampled at 24 kHz with 16-bit quantization.

In addition, a MOS listening test was conducted to verify the performance of the emotional speech generation system 600. In the test, 18 Korean listeners rated on a five-point scale (1 - Bad, 2 - Poor, 3 - Fair, 4 - Good, 5 - Excellent) how well the emotion of the synthesized speech matched the text. Fifteen sentences per emotion type were selected from the test set, and speech signals were synthesized using several different language models.

The results of the test described above are shown in [Table 1] below. All systems used in the test shared the same encoder, acoustic model, and vocoder, and were tested in order from the baseline System 1 through Systems 2, 3, and 4. When the emotion type or the emotion strength was partially predicted, the average MOS dropped slightly, and the drop was larger when predicting the emotion type than when predicting the emotion strength. When both the emotion type and the emotion strength were predicted, overall performance was slightly lower than that of the other systems, but the difference was at most 0.05 and not critical. This demonstrates the advantage of the emotion estimation-based emotional speech generation system 600: emotional speech can be synthesized without manually labeling emotions.

[Table 1]
System  Language model  Emotion type  Emotion strength  Neutral    Happy      Sad        Angry      Average
1       -               Manual        Manual            4.05±0.22  4.13±0.22  3.58±0.30  3.38±0.29  3.79±0.17
2       GPT-3           Predicted     Manual            4.07±0.20  3.79±0.28  3.60±0.33  3.31±0.29  3.74±0.16
3       GPT-3           Manual        Predicted         4.14±0.21  4.02±0.26  3.58±0.29  3.32±0.31  3.77±0.18
4       GPT-3           Predicted     Predicted         4.07±0.20  4.09±0.24  3.52±0.36  3.30±0.31  3.74±0.18
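The "at most 0.05" statement can be checked directly against the reported averages. The sketch below only uses the Average column of Table 1 as printed; it does not recompute means from the per-emotion scores, and the helper name is hypothetical.

```python
from typing import Dict

# Reported average MOS per system, copied from the Average column of Table 1.
AVERAGE_MOS: Dict[int, float] = {1: 3.79, 2: 3.74, 3: 3.77, 4: 3.74}


def drop_versus_baseline(averages: Dict[int, float],
                         baseline: int = 1) -> Dict[int, float]:
    """Difference between the baseline average and each other system."""
    return {
        system: round(averages[baseline] - value, 2)
        for system, value in averages.items()
        if system != baseline
    }


drops = drop_versus_baseline(AVERAGE_MOS)
largest_drop = max(drops.values())
```

The drop is larger for System 2 (predicted type) than for System 3 (predicted strength), and no system falls more than 0.05 below the manually labeled baseline.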

In addition, the inventors of the present application conducted preference tests using various audiobook scripts to demonstrate the effectiveness of the emotional speech generation system 600 in paragraph-level speech synthesis. In the tests, 10 paragraphs of the speech corpus, each consisting of 6 to 7 sentences, were synthesized, and 18 Korean listeners were asked to choose the one they preferred of two synthesized speech samples. Three systems were tested: System 1, in which the emotion type and emotion strength were selected manually; System 2, which takes a single utterance as input and therefore cannot consider the emotional context between consecutive sentences; and System 3, which takes multiple utterances as input and can use consecutive sentences to predict emotion.

The results of the tests are shown in [Table 2] below. In summary, the emotion prediction method of the present disclosure showed no significant difference from the method in which emotions were labeled manually, which means that the emotion prediction method of the present disclosure remains valid for paragraph-level text scripts. In addition, the listeners preferred the speech synthesized by System 3, which used multiple sentences, over that of System 2, which used a single sentence, which shows that the emotion prediction method of the present disclosure is effective for generating paragraph-level content where the context of the input sentences must be considered.

[Table 2] Preference (%)
        System 1  System 2  System 3  No preference  p-value
Test 1  49.4      -         35.0      15.6           0.08
Test 2  -         30.5      47.8      21.7           0.02

In addition, to verify the effectiveness of the emotional speech generation system 600 with a large language model, an experiment was conducted in which the emotion estimation unit was trained with GPT-3 models of different sizes. Specifically, emotion classification accuracy (ECA) and the root mean square error of the emotion strength (S-RMSE: strength root mean square error) were measured to evaluate the prediction of emotion type and strength. In addition, to measure memory and inference efficiency, the number of parameters and the real-time factor (RTF) of the TTS model, including the text acoustic model and the language model, were measured. The results are shown in [Table 3] below.

Analyzing the results, the S-RMSE shows that the accuracy of emotion strength prediction was not sensitive to the size of the language model. The ECA shows that the prediction accuracy of the emotion type tended to increase with the size of GPT-3 (the medium model being a slight exception), but the small GPT-3 also achieved meaningful results in terms of model size and generation speed.

[Table 3]
Model              ECA (%)  S-RMSE (dB)  Model size  Generation speed (RTF)
Small GPT-3        76.54    -86.27       175.97M     0.11
Medium GPT-3       75.42    -87.21       389.17M     0.20
Large GPT-3        78.77    -85.13       799.56M     0.30
Extra-large GPT-3  79.89    -86.02       1.34B       0.50
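The three kinds of metrics in Table 3 can be sketched as follows. The definitions used here are the conventional ones and are assumptions with respect to this document: accuracy as the percentage of matching labels, RMSE on the linear strength scale (the table reports S-RMSE in dB, and that conversion is not specified, so it is not reproduced), and RTF as synthesis time divided by the duration of the generated audio.

```python
import math
from typing import Sequence


def emotion_classification_accuracy(predicted: Sequence[str],
                                    reference: Sequence[str]) -> float:
    """Percentage of utterances whose predicted emotion type is correct."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * correct / len(reference)


def strength_rmse(predicted: Sequence[float],
                  reference: Sequence[float]) -> float:
    """Root mean square error of the predicted emotion strength."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(reference))


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Synthesis time per second of generated audio (below 1 is faster)."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds
```

Read this way, an RTF of 0.11 means that one second of speech takes about 0.11 seconds to generate, and all four model sizes in the table run faster than real time.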

FIG. 7 is a t-distributed stochastic neighbor embedding (t-SNE) plot 700 of emotional feature vectors according to an embodiment of the present disclosure. FIG. 7 shows, for example, the t-SNE of the emotional feature vectors (embeddings) output by the emotion estimation 130 operation shown in FIG. 1 or by the emotion estimation unit 320 shown in FIGS. 3, 4, and 6.

As shown, the weaker the emotion strength extracted by the emotion estimation unit, the more each emotional feature embedding may converge (710) toward a single vector, regardless of the separately extracted emotion type. Conversely, the stronger the emotion strength extracted by the emotion estimation unit, the more each emotional feature embedding may diverge (720) in a specific direction corresponding to its emotion class. In other words, the emotion type indicates a cluster corresponding to one or more classified emotional features in the emotional feature embedding space, while the emotion strength indicates in detail the strength of the emotional feature embeddings within that cluster. Therefore, by using the emotion type and emotion strength extracted from the input text by the emotional speech synthesis system (or emotion estimation unit) according to various embodiments of the present disclosure, both the coarse and the fine structure of the emotional characteristics can be effectively reflected in the synthesized speech.

FIG. 8 is a flowchart showing an emotion estimation-based emotional speech synthesis method 800 according to an embodiment of the present disclosure. According to one embodiment, the method 800 may begin with the emotion estimation unit of the processor generating an emotional feature embedding from the input text (S810). In one embodiment, generating the emotional feature embedding may include extracting a feature vector from the input text, predicting an emotion type and an emotion strength from the feature vector, and generating the emotional feature embedding from the emotion type and the emotion strength. Meanwhile, the text analysis unit of the processor may extract linguistic features from the input text (S820). Although FIG. 8 shows steps S810 and S820 as being executed sequentially, the method is not limited thereto, and steps S810 and S820 may be executed in parallel.

Then, the speech synthesis unit of the processor may predict the length of the speech from the emotional feature embedding and the language features, and generate emotional speech based on the predicted length (S830). According to one embodiment, the processor may extract a language embedding from the language features using an encoder, upsample the language embedding to match the length of the speech using an upsampling unit, and generate acoustic features from the upsampled language embedding using a decoder. The processor may then synthesize the emotional speech from the acoustic features using a vocoder.

The flowchart shown in FIG. 8 and the above description are merely an example, and may be implemented differently in some embodiments. For example, in some embodiments, the order of the steps may be changed, some steps may be repeated, some steps may be omitted, or some steps may be added.

FIG. 9 is a flowchart illustrating a method (S810) of generating an emotional feature embedding in the emotion estimation unit according to an embodiment of the present disclosure. According to one embodiment, the method (S810) may begin with a language model extracting a feature vector from the input text (S910). In one embodiment, the language model may include a GPT-3 model configured to extract high-level contextual features between adjacent sentences of the input text.

Next, the emotion type prediction unit may predict the emotion type from the feature vector (S920). According to one embodiment, the emotion type may be expressed as a one-hot encoded value and may represent a cluster corresponding to one or more classified emotional features in the emotional feature embedding space.
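As an illustration of the one-hot representation, the sketch below turns a list of per-class scores into a one-hot vector. The label set and the score values are assumptions made for this example; the disclosure does not enumerate the emotion types at this point.

```python
EMOTION_TYPES = ["neutral", "happy", "sad", "angry"]  # hypothetical label set


def one_hot_emotion(class_scores):
    """Return a one-hot vector marking the highest-scoring emotion type."""
    if len(class_scores) != len(EMOTION_TYPES):
        raise ValueError("expected one score per emotion type")
    best = max(range(len(class_scores)), key=class_scores.__getitem__)
    return [1 if i == best else 0 for i in range(len(class_scores))]


scores = [0.1, 0.7, 0.15, 0.05]  # made-up classifier outputs
emotion_one_hot = one_hot_emotion(scores)
predicted_label = EMOTION_TYPES[emotion_one_hot.index(1)]
```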

Meanwhile, the emotion intensity prediction unit may predict the emotion intensity from the feature vector (S930). According to one embodiment, the emotion intensity may be expressed as a scalar value and may represent, at a finer level, the strength of the emotional feature embeddings within a cluster in the emotional feature embedding space.

Then, the vector synthesis unit may generate the emotional feature embedding from the emotion type and the emotion intensity (S940). According to one embodiment, the vector synthesis unit may convert the emotion type into a corresponding emotion type embedding through a pre-trained look-up table, and may generate the emotional feature embedding by mixing the emotion type embedding with the emotion intensity. Specifically, the vector synthesis unit may use a softplus activation function to generate the emotional feature embedding from the product of the emotion type embedding and a pre-trained projection matrix and the product of the emotion intensity and a scalar variable.
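The mixing step can be sketched as follows. The text names the two products and the softplus function but does not state exactly how they are combined, so this example assumes one plausible reading: softplus is applied to the projected emotion type embedding, and the result is scaled by the intensity term. The table entries, matrix values, and scalar are made up for illustration.

```python
import math

# Hypothetical learned parameters (values are made up for illustration).
LOOKUP_TABLE = {            # emotion type -> emotion type embedding
    "happy": [1.0, -1.0],
    "sad": [-1.0, 1.0],
}
PROJECTION = [              # 3x2 projection matrix W
    [0.5, 0.0],
    [0.0, 0.5],
    [1.0, 1.0],
]
SCALE = 2.0                 # scalar variable multiplied with the intensity


def softplus(x: float) -> float:
    """Numerically stable softplus: log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def emotion_feature_embedding(emotion_type: str, intensity: float):
    """Assumed mixing rule: softplus(W @ type_embedding) * (SCALE * intensity)."""
    type_embedding = LOOKUP_TABLE[emotion_type]
    projected = [sum(w * e for w, e in zip(row, type_embedding)) for row in PROJECTION]
    gain = SCALE * intensity
    return [softplus(p) * gain for p in projected]


weak = emotion_feature_embedding("happy", 0.0)
strong = emotion_feature_embedding("happy", 1.0)
```

Under this assumed rule, an intensity of zero maps every emotion type to the same vector, while larger intensities stretch the embedding along a type-specific direction, in line with the convergence and divergence behavior described for the embedding space.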

The flowchart shown in FIG. 9 and the above description are merely an example, and may be implemented differently in some embodiments. For example, in some embodiments, the order of the steps may be changed, some steps may be repeated, some steps may be omitted, or some steps may be added.

FIG. 10 is a flowchart illustrating a method (S830) of synthesizing emotional speech in the speech synthesis unit according to an embodiment of the present disclosure. According to one embodiment, the method (S830) may begin with an encoder extracting a language embedding from the language features (S1010).

Next, the upsampling unit may upsample the language embedding to match the length of the speech (S1020). According to one embodiment, the upsampling unit may upsample the language embedding through an iterative process that, according to the length of the speech, matches the temporal resolution of the language embedding extracted from the language features to the resolution of the acoustic features to be output. Here, the length of the speech may be an emotion-dependent phoneme-level duration predicted by the length prediction unit from a vector combining the emotional feature embedding and the language features.
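One common way to match phoneme-level embeddings to frame-level acoustic features is to repeat each embedding for its predicted number of frames. The sketch below assumes that this repetition is what the iterative process amounts to; the function name, the embedding values, and the durations are hypothetical.

```python
def upsample_by_duration(embeddings, durations):
    """Repeat each phoneme-level embedding for its predicted number of frames."""
    if len(embeddings) != len(durations):
        raise ValueError("one duration is required per phoneme embedding")
    frames = []
    for vector, n_frames in zip(embeddings, durations):
        if n_frames < 0:
            raise ValueError("durations must be non-negative")
        # Copy the vector so that frames do not share one list object.
        frames.extend(list(vector) for _ in range(n_frames))
    return frames


phoneme_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]  # made-up values
predicted_durations = [2, 1, 3]                             # frames per phoneme
frame_embeddings = upsample_by_duration(phoneme_embeddings, predicted_durations)
```

The output has as many rows as the sum of the durations, so a longer predicted duration for an emotional utterance directly produces more frames for the decoder.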

In addition, the decoder may generate acoustic features from the upsampled language embedding (S1030). According to one embodiment, the decoder may autoregressively generate the acoustic features by combining the upsampled language embedding with the emotional feature embedding.
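Autoregressive generation means that each acoustic frame is computed from the current conditioning input and the previously generated frame. The loop below is a minimal sketch of that control flow. It assumes that the two embeddings are combined by concatenation, and `toy_step` is a hypothetical arithmetic stand-in for a trained decoder step, not a real model.

```python
def decode_autoregressively(frame_embeddings, emotion_embedding, step_fn, feature_dim):
    """Generate one acoustic frame per input frame, feeding back the previous output."""
    previous = [0.0] * feature_dim          # all-zero initial frame
    outputs = []
    for frame in frame_embeddings:
        conditioning = list(frame) + list(emotion_embedding)  # concatenation
        previous = step_fn(conditioning, previous)
        outputs.append(previous)
    return outputs


def toy_step(conditioning, previous):
    """Hypothetical stand-in for a trained decoder step (not a real model)."""
    mean_condition = sum(conditioning) / len(conditioning)
    return [0.5 * p + mean_condition for p in previous]


acoustic = decode_autoregressively(
    frame_embeddings=[[1.0], [1.0], [1.0]],
    emotion_embedding=[3.0],
    step_fn=toy_step,
    feature_dim=2,
)
```

Although the three input frames are identical, the three output frames differ because each step depends on the frame generated before it.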

Finally, the vocoder may synthesize the emotional speech from the acoustic features (S1040). According to one embodiment, the vocoder may synthesize the emotional speech by jointly capturing the harmonic and noise components from the acoustic features.

The flowchart shown in FIG. 10 and the above description are merely an example, and may be implemented differently in some embodiments. For example, in some embodiments, the order of the steps may be changed, some steps may be repeated, some steps may be omitted, or some steps may be added.

The above-described method may be provided as a computer program stored in a computer-readable recording medium for execution on a computer. The medium may continuously store a computer-executable program, or may temporarily store it for execution or download. In addition, the medium may be any of various recording or storage means in the form of a single piece of hardware or several pieces of hardware combined; it is not limited to a medium directly connected to a particular computer system and may be distributed over a network. Examples of the medium include magnetic media such as hard disks, floppy disks, and magnetic tapes, optical recording media such as CD-ROMs and DVDs, magneto-optical media such as floptical disks, and ROM, RAM, flash memory, and the like, configured to store program instructions. Other examples of the medium include recording or storage media managed by app stores that distribute applications, or by sites, servers, and the like that supply or distribute various other software.

The methods, operations, or techniques of the present disclosure may be implemented by various means. For example, these techniques may be implemented in hardware, firmware, software, or a combination thereof. Those skilled in the art will understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or as software depends on the particular application and the design requirements imposed on the overall system. Those skilled in the art may implement the described functionality in varying ways for each particular application, but such implementations should not be interpreted as causing a departure from the scope of the present disclosure.

In a hardware implementation, the processing units used to perform the techniques may be implemented within one or more ASICs, DSPs, digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), processors, controllers, microcontrollers, microprocessors, electronic devices, other electronic units designed to perform the functions described in the present disclosure, a computer, or a combination thereof.

Accordingly, the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure may be implemented or performed with a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, for example, a DSP and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a DSP core, or any other such configuration.

In a firmware and/or software implementation, the techniques may be implemented as instructions stored on a computer-readable medium such as random access memory (RAM), read-only memory (ROM), non-volatile random access memory (NVRAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable PROM (EEPROM), flash memory, a compact disc (CD), or a magnetic or optical data storage device. The instructions may be executable by one or more processors and may cause the processor(s) to perform certain aspects of the functionality described in the present disclosure.

When implemented in software, the techniques may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a computer. By way of non-limiting example, such computer-readable media may include RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium.

For example, if the software is transmitted from a website, server, or other remote source using coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, digital subscriber line, or wireless technologies such as infrared, radio, and microwave are included within the definition of medium. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, whereas discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.

A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. Alternatively, the storage medium may be integrated into the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. Alternatively, the processor and the storage medium may reside as discrete components in the user terminal.

Although the embodiments described above have been described as utilizing aspects of the presently disclosed subject matter in one or more standalone computer systems, the present disclosure is not limited thereto and may be implemented in conjunction with any computing environment, such as a network or distributed computing environment. Furthermore, aspects of the subject matter of the present disclosure may be implemented in a plurality of processing chips or devices, and storage may similarly be effected across a plurality of devices. Such devices may include PCs, network servers, and portable devices.

Although the present disclosure has been described herein in connection with some embodiments, various modifications and changes may be made without departing from the scope of the present disclosure as understood by those skilled in the art to which the disclosure pertains. Such modifications and changes should also be considered to fall within the scope of the claims appended hereto.

100: Emotional speech generation system based on emotion estimation
110: Input text
120: Text analysis
130: Emotion estimation
140: Speech synthesis
150: Emotional speech

Claims

An emotional speech synthesis method performed by at least one processor of a computing device, the method comprising:
generating, by an emotion estimation unit, an emotional feature embedding from input text;
extracting, by a text analysis unit, language features from the input text; and
predicting, by a speech synthesis unit, a length of speech from the emotional feature embedding and the language features, and synthesizing emotional speech based on the predicted length of speech.

The emotional speech synthesis method of claim 1,
wherein generating, by the emotion estimation unit, the emotional feature embedding from the input text comprises:
extracting, by a language model, a feature vector from the input text;
predicting, by an emotion type prediction unit, an emotion type from the feature vector; and
predicting, by an emotion intensity prediction unit, an emotion intensity from the feature vector.

The emotional speech synthesis method of claim 2,
wherein generating, by the emotion estimation unit, the emotional feature embedding from the input text further comprises:
generating, by a vector synthesis unit, the emotional feature embedding from the predicted emotion type and the predicted emotion intensity.

The emotional speech synthesis method of claim 2,
wherein the emotion type is expressed as a one-hot encoded value, and
the emotion intensity is expressed as a scalar value.

The emotional speech synthesis method of claim 2,
wherein the emotion type represents a cluster corresponding to one or more classified emotional features in an emotional feature embedding space, and
the emotion intensity represents, at a finer level, the strength of the emotional feature embeddings included in the cluster in the emotional feature embedding space.

The emotional speech synthesis method of claim 2,
wherein generating, by the emotion estimation unit, the emotional feature embedding from the input text further comprises:
converting the predicted emotion type into a corresponding emotion type embedding through a pre-trained look-up table; and
generating the emotional feature embedding by mixing the emotion type embedding with the predicted emotion intensity.

The emotional speech synthesis method of claim 6,
wherein generating the emotional feature embedding by mixing the emotion type embedding with the predicted emotion intensity comprises:
generating, by a softplus activation function, the emotional feature embedding from the product of the emotion type embedding and a pre-trained projection matrix and the product of the predicted emotion intensity and a scalar variable.

The emotional speech synthesis method of claim 1,
further comprising training the emotion estimation unit based on emotion type data and emotion intensity data extracted from pre-recorded speech data.

The emotional speech synthesis method of claim 8,
wherein training the emotion estimation unit based on the emotion type data and the emotion intensity data extracted from the pre-recorded speech data comprises:
receiving the emotion type data provided by a speaker of the pre-recorded speech data; and
generating the emotion intensity data by a ranking support vector machine trained to output relative ranking information of emotional features extracted from neutral emotional language.

The emotional speech synthesis method of claim 1,
wherein predicting, by the speech synthesis unit, the length of speech from the emotional feature embedding and the language features and synthesizing the emotional speech based on the predicted length of speech comprises:
predicting, by a length prediction unit, an emotion-dependent phoneme-level length from a vector combining the emotional feature embedding and the language features.

The emotional speech synthesis method of claim 1,
wherein predicting, by the speech synthesis unit, the length of speech from the emotional feature embedding and the language features and synthesizing the emotional speech based on the predicted length of speech comprises:
extracting, by an encoder, a language embedding from the language features;
upsampling, by an upsampling unit, the language embedding to match the length of speech;
generating, by a decoder, acoustic features from the upsampled language embedding; and
synthesizing, by a vocoder, the emotional speech from the acoustic features.

The emotional speech synthesis method of claim 1,
wherein predicting, by the speech synthesis unit, the length of speech from the emotional feature embedding and the language features and synthesizing the emotional speech based on the predicted length of speech comprises:
upsampling a language embedding extracted from the language features through an iterative process that, according to the length of speech, matches the temporal resolution of the language embedding to the resolution of the acoustic features to be output.

The emotional speech synthesis method of claim 12,
wherein predicting, by the speech synthesis unit, the length of speech from the emotional feature embedding and the language features and synthesizing the emotional speech based on the predicted length of speech further comprises:
autoregressively generating acoustic features by combining the upsampled language embedding with the emotional feature embedding.

The emotional speech synthesis method of claim 13,
wherein predicting, by the speech synthesis unit, the length of speech from the emotional feature embedding and the language features and synthesizing the emotional speech based on the predicted length of speech further comprises:
synthesizing the emotional speech by jointly capturing harmonic and noise components from the acoustic features.

A computer program stored in a computer-readable recording medium for executing, on a computer, the method according to any one of claims 1 to 14.

An information processing system comprising:
a communication module;
a memory; and
at least one processor connected to the memory and configured to execute at least one computer-readable program included in the memory,
wherein the at least one processor comprises:
an emotion estimation unit that generates an emotional feature embedding from input text;
a text analysis unit that extracts language features from the input text; and
a speech synthesis unit that predicts a length of speech from the emotional feature embedding and the language features and synthesizes emotional speech based on the predicted length of speech.

The information processing system of claim 16,
wherein the emotion estimation unit comprises:
a language model that extracts a feature vector from the input text;
an emotion type prediction unit that predicts an emotion type from the feature vector; and
an emotion intensity prediction unit that predicts an emotion intensity from the feature vector.

The information processing system of claim 17,
wherein the emotion estimation unit further comprises:
a vector synthesis unit that generates the emotional feature embedding from the predicted emotion type and the predicted emotion intensity.

The information processing system of claim 16,
wherein the speech synthesis unit comprises:
a length prediction unit that predicts an emotion-dependent phoneme-level length from a vector combining the emotional feature embedding and the language features.

The information processing system of claim 16,
wherein the speech synthesis unit comprises:
an encoder that extracts a language embedding from the language features;
an upsampling unit that upsamples the language embedding to match the length of speech;
a decoder that generates acoustic features from the upsampled language embedding; and
a vocoder that synthesizes the emotional speech from the acoustic features.