KR20150029846A

KR20150029846A - Method of mapping text data onto audio data for synchronization of audio contents and text contents and system thereof

Info

Publication number: KR20150029846A
Application number: KR20130108730A
Authority: KR
Inventors: 김무중; 김준수
Original assignee: 주식회사 청담러닝
Priority date: 2013-09-10
Filing date: 2013-09-10
Publication date: 2015-03-19
Also published as: KR102140438B1

Abstract

Embodiments of the present invention provide a method of mapping text data to audio data in consideration of speech section information of the audio data and the text data. The method comprises: extracting speech time information of audio data; dividing text data into sentence units; extracting pause time information of the audio data from the audio data based on the divided text data; extracting phonemes included in the divided text data and calculating a speech section ratio of the text data; and mapping the text data to the audio data based on the extracted speech time information of the audio data, the pause time information, and the speech section ratio of the text data.

Description

Method and system for mapping text data to audio data for a synchronization service of audio content and text content {METHOD OF MAPPING TEXT DATA ONTO AUDIO DATA FOR SYNCHRONIZATION OF AUDIO CONTENTS AND TEXT CONTENTS AND SYSTEM THEREOF}

The present invention relates to a system and method for mapping text data to audio data and, more particularly, to a technique for mapping text data to audio data based on time information of the audio data and ratio information of the text data in order to synchronize audio content and text content.

Techniques for mapping text data to audio data are used in the field of language learning to synchronize audio content and text content and thereby increase learning efficiency. They map text data divided into sentence units to audio data that contains blank sections. Because such techniques consider only the blank sections of the audio data and the sentence-level division of the text data, the audio data and the text data can become misaligned within a speech section.

For example, Korean Laid-Open Patent Publication No. 2012-0129015, "Method for generating language-learning content and terminal therefor," maps text data to audio data by matching text data that has been divided into sentence units in advance one-to-one with audio data containing blank sections obtained by analyzing its waveform. Because Publication No. 2012-0129015 does not consider the speech sections of the audio data and the text data during mapping, the audio data and the text data can become misaligned within a speech section.

Accordingly, a mapping technique that takes the speech section information of the audio data and the text data into account is required.

Embodiments of the present invention provide a method, apparatus, and system for mapping text data to audio data in consideration of speech section information of the audio data and the text data.

In addition, embodiments of the present invention provide a method, apparatus, and system that, in considering the speech section information of the audio data and the text data, use the speech time of the sentences included in the audio data and the text data, of the words included in the sentences, or of the phonemes included in the words.

In addition, embodiments of the present invention provide a method, apparatus, and system for mapping text data to audio data adaptively to changes in the playback speed of the audio data.

A method of mapping text data to audio data according to an embodiment of the present invention includes: extracting speech time information of audio data; dividing text data into sentence units; extracting pause time information of the audio data from the audio data based on the divided text data; extracting phonemes included in the divided text data and calculating a speech section ratio of the text data; and mapping the text data to the audio data based on the extracted speech time information of the audio data, the pause time information, and the speech section ratio of the text data.

The extracting of the speech time information of the audio data may include: calculating at least one of a zero-crossing rate (ZCR) or an energy of the audio data in units of preset frames; classifying each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame based on the calculation result; and obtaining the speech time information of the audio data per sentence according to the classification result.

The extracting of the pause time information of the audio data may include: setting a pause section of the audio data based on pause section information of the divided text data; and obtaining start time information and end time information of the set pause section of the audio data.

The setting of the pause section of the audio data may include: detecting candidate pause sections of the audio data; comparing the detected candidate pause sections with a preset reference pause section; and selecting the pause section of the audio data from the candidate pause sections based on the comparison result.

The calculating of the speech section ratio of the text data may include: applying a preset fixed ratio to the extracted phonemes; and generating the speech section ratio of the text data per sentence based on the phonemes to which the preset fixed ratio has been applied.

The mapping of the text data to the audio data may include: calculating speech time information of the text data corresponding to the speech time information of the audio data by applying the speech section ratio of the text data to the speech time information of the audio data; and generating pause time information of the text data corresponding to the pause time information of the audio data.

The dividing of the text data into sentence units may divide the text data based on symbol information inserted in advance so that the sentences included in the text data are distinguished from one another.

The method of mapping text data to audio data may further include applying at least one of pre-emphasis, DC component removal, or noise removal to the audio data.

The method of mapping text data to audio data may further include generating audio/text data as a result of mapping the text data to the audio data.

A system for mapping text data to audio data according to an embodiment of the present invention includes: a speech time information extraction unit that extracts speech time information of audio data; a sentence division unit that divides text data into sentence units; a pause time information extraction unit that extracts pause time information of the audio data from the audio data based on the divided text data; a speech section ratio calculation unit that extracts phonemes included in the divided text data and calculates a speech section ratio of the text data; and a mapping unit that maps the text data to the audio data based on the extracted speech time information of the audio data, the pause time information, and the speech section ratio of the text data.

The speech time information extraction unit may include: a calculation unit that calculates at least one of a zero-crossing rate (ZCR) or an energy of the audio data in units of preset frames; a classification unit that classifies each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame based on the calculation result; and an acquisition unit that obtains the speech time information of the audio data per sentence according to the classification result.

The pause time information extraction unit may include: a setting unit that sets a pause section of the audio data based on pause section information of the divided text data; and an acquisition unit that obtains start time information and end time information of the set pause section of the audio data.

The speech section ratio calculation unit may include: an application unit that applies a preset fixed ratio to the extracted phonemes; and a generation unit that generates the speech section ratio of the text data per sentence based on the phonemes to which the preset fixed ratio has been applied.

The mapping unit may include: a calculation unit that calculates speech time information of the text data corresponding to the speech time information of the audio data by applying the speech section ratio of the text data to the speech time information of the audio data; and a generation unit that generates pause time information of the text data corresponding to the pause time information of the audio data.

The system for mapping text data to audio data may further include a pre-processing unit that applies at least one of pre-emphasis, DC component removal, or noise removal to the audio data.

The system for mapping text data to audio data may further include an audio/text data generation unit that generates audio/text data as a result of mapping the text data to the audio data.

Embodiments of the present invention can provide a method, apparatus, and system for mapping text data to audio data in consideration of speech section information of the audio data and the text data.

In addition, embodiments of the present invention can provide a method, apparatus, and system that, in considering the speech section information of the audio data and the text data, use the speech time of the sentences included in the audio data and the text data, of the words included in the sentences, or of the phonemes included in the words.

In addition, embodiments of the present invention can provide a method, apparatus, and system for mapping text data to audio data adaptively to changes in the playback speed of the audio data.

FIG. 1 is a flowchart illustrating a method of mapping text data to audio data according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating in detail the step of extracting speech time information of the audio data shown in FIG. 1.
FIG. 3 is a flowchart illustrating in detail the step of extracting pause time information of the audio data shown in FIG. 1.
FIG. 4 is a flowchart illustrating in detail the step of setting a pause section of the audio data shown in FIG. 3.
FIG. 5 is a flowchart illustrating in detail the step of calculating a speech section ratio of the text data shown in FIG. 1.
FIG. 6 is a flowchart illustrating in detail the step of mapping the text data to the audio data shown in FIG. 1.
FIG. 7 is a diagram illustrating a process of mapping text data to audio data according to an embodiment of the present invention.
FIG. 8 is a block diagram illustrating a system for mapping text data to audio data according to an embodiment of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. However, the present invention is not limited by these embodiments. Like reference numerals in the drawings denote like elements.

FIG. 1 is a flowchart illustrating a method of mapping text data to audio data according to an embodiment of the present invention.

Referring to FIG. 1, a system for mapping text data to audio data according to an embodiment of the present invention performs pre-processing that applies at least one of pre-emphasis, DC component removal, or noise removal to the audio data (110).
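As an illustration of the pre-processing in step 110, the following Python sketch shows DC component removal and a first-order pre-emphasis filter. The source does not specify how either operation is implemented; the function names, the mean-subtraction approach, and the coefficient `alpha=0.97` (a value commonly used in speech processing) are assumptions made for this example.

```python
def remove_dc(samples):
    """Subtract the mean so the signal is centred on zero (assumed DC-removal method)."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]


def pre_emphasis(samples, alpha=0.97):
    """First-order high-pass filter: y[n] = x[n] - alpha * x[n-1].

    alpha is an assumed default; the source does not give a coefficient.
    """
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor and is passed through
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out


# Example: centre a short signal, then emphasise its high-frequency content.
cleaned = pre_emphasis(remove_dc([0.2, 0.4, 0.6, 0.4, 0.2]))
```

Noise removal is omitted here because the source gives no detail about the technique used.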

The system for mapping text data to audio data also extracts speech time information of the audio data (120).

The system also divides the text data into sentence units (130).
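The sentence division of step 130 can be sketched in Python as follows. The source states only that a symbol is inserted into the text in advance so that sentences can be distinguished; the `|` marker and the function name `split_sentences` are hypothetical choices for illustration.

```python
def split_sentences(text, marker="|"):
    """Split text on a pre-inserted marker and drop empty pieces.

    The marker character is hypothetical; the source does not name the symbol.
    """
    return [part.strip() for part in text.split(marker) if part.strip()]


sentences = split_sentences(
    "Have you answered yes to any of the questions above?|"
    "Video game addictions are becoming recognized as a real problem."
)
# N sentences imply N - 1 pauses between them.
expected_pauses = max(len(sentences) - 1, 0)
```

The number of sentences found this way gives the number of sentence-boundary pauses that the later steps look for in the audio data.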

The system also extracts pause time information of the audio data from the audio data based on the divided text data (140). Here, the system may extract the pause time information based on the number of pieces of text data obtained by the sentence-level division.

The system also extracts the phonemes included in the divided text data and calculates a speech section ratio of the text data (150). Here, the system may extract the words included in the sentence-divided text data and the phonemes included in those words in order to calculate the speech section ratio.

The system also maps the text data to the audio data based on the speech time information of the audio data, the pause time information of the audio data, and the speech section ratio of the text data (160).

The system also generates audio/text data as a result of mapping the text data to the audio data (170).

Each of these steps is described in detail below.

FIG. 2 is a flowchart illustrating in detail the step of extracting speech time information of the audio data shown in FIG. 1.

Referring to FIG. 2, a system for mapping text data to audio data according to an embodiment of the present invention may calculate at least one of a zero-crossing rate (ZCR) or an energy of the audio data in units of preset frames (210). Here, the ZCR may be calculated by measuring how much the signal of each frame in the audio data changes about the value '0' (that is, how often it crosses zero) and then comparing the measured ZCRs of the frames relative to one another. Likewise, the energy may be calculated by extracting the energy of each frame in the audio data and then comparing the extracted energies of the frames relative to one another.
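The per-frame measurements of step 210 can be sketched in Python as follows. The source does not give formulas, so this example uses common textbook definitions as assumptions: the ZCR as the fraction of adjacent sample pairs whose signs differ, and the energy as the mean of squared samples. The function names and the non-overlapping framing are also assumptions.

```python
def split_frames(samples, frame_len):
    """Split samples into consecutive non-overlapping frames; a short tail is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean of the squared samples in the frame."""
    if not frame:
        return 0.0
    return sum(s * s for s in frame) / len(frame)


# Example: measure every frame of a short signal.
frames = split_frames([1, -1, 1, -1, 2, 2, 2, 2], frame_len=4)
features = [(zero_crossing_rate(f), short_time_energy(f)) for f in frames]
```

In this example the first frame alternates in sign on every sample and the second never crosses zero, so the two frames sit at opposite ends of the ZCR range.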

Based on the calculation result described above, the system may classify each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame (220). For example, for audio data '_strick_', the system may calculate and compare the ZCR and energy of the speech frames corresponding to each of the phonemes /s/, /t/, /r/, /i/, /c/, and /k/, and classify each speech frame as a voiced frame, an unvoiced frame, or a silence frame. More specifically, if comparing the calculated ZCR values shows that the frames corresponding to /s/ and /t/ have a higher ZCR than the frame corresponding to /i/, the frames corresponding to /s/ and /t/ may be classified as unvoiced frames. Likewise, if comparing the calculated energy values shows that the energy decreases in the order of the frames corresponding to /i/, /r/, /s/, /t/, and /k/, the high-energy frames corresponding to /i/ and /r/ may be classified as voiced frames, the next-highest-energy frames corresponding to /s/, /t/, and /k/ as unvoiced frames, and the frames corresponding to the start and end of '_strick_' as silence frames.
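The classification of step 220 can be sketched with simple thresholds. The source describes a relative comparison of ZCR and energy across frames and gives no numeric limits, so the three threshold values and the function name `classify_frame` below are assumptions for illustration only.

```python
def classify_frame(zcr, energy,
                   silence_energy=0.001,   # assumed: below this the frame is silence
                   voiced_energy=0.05,     # assumed: at or above this the frame may be voiced
                   unvoiced_zcr=0.3):      # assumed: at or above this the frame is noise-like
    """Label a frame 'voiced', 'unvoiced', or 'silence' from its ZCR and energy."""
    if energy < silence_energy:
        return "silence"
    if energy >= voiced_energy and zcr < unvoiced_zcr:
        return "voiced"
    return "unvoiced"


# Example: a vowel-like frame, a fricative-like frame, and a near-empty frame.
labels = [
    classify_frame(zcr=0.05, energy=0.2),
    classify_frame(zcr=0.6, energy=0.01),
    classify_frame(zcr=0.1, energy=0.0001),
]
```

This follows the pattern in the '_strick_' example: high energy with low ZCR suggests a voiced frame, lower energy with high ZCR suggests an unvoiced frame, and very low energy suggests silence. A relative scheme that ranks frames against one another, as the source describes, would replace the fixed thresholds.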

The system may also obtain the speech time information of the audio data based on the result of classifying each frame included in the audio data (230). Here, the speech time information may be obtained per sentence of the audio data, and the system may obtain it based on the number of frames. For example, when frames classified as silence frames persist for at least a preset number of frames, the system may recognize that section as a pause section and divide the audio data into sentence units at the recognized pause sections, thereby obtaining speech time information per sentence. More specifically, if the audio data contains a first sentence, 'Have you answered yes to any of the questions above?', and a second sentence, 'Video game addictions are becoming recognized as a real problem.', the system determines whether the frames between the first and second sentences are classified as silence frames and persist for at least the preset number. If so, the system recognizes that section as a pause section separating sentences, divides the audio data into audio data of the first sentence and audio data of the second sentence, and obtains the speech time information of each.
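Step 230 can be sketched as a single pass over the frame labels. The sketch below assumes that labels are the strings "voiced", "unvoiced", and "silence", that all frames have the same duration, and that a sentence ends once the silence run reaches `min_silence_frames`; the function name and these representations are assumptions, not details from the source.

```python
def speech_segments(labels, frame_ms, min_silence_frames):
    """Return (start_ms, end_ms) for each stretch of speech bounded by a long silence run."""
    segments = []
    seg_start = None    # index of the first speech frame in the current segment
    last_speech = None  # index of the most recent speech frame
    silent_run = 0
    for i, label in enumerate(labels):
        if label == "silence":
            silent_run += 1
            # A long enough run of silence closes the current sentence.
            if seg_start is not None and silent_run >= min_silence_frames:
                segments.append((seg_start * frame_ms, (last_speech + 1) * frame_ms))
                seg_start = None
        else:
            if seg_start is None:
                seg_start = i
            last_speech = i
            silent_run = 0
    if seg_start is not None:  # close a segment that runs to the end of the audio
        segments.append((seg_start * frame_ms, (last_speech + 1) * frame_ms))
    return segments


labels = ["silence", "voiced", "voiced", "silence", "unvoiced",
          "silence", "silence", "silence", "voiced", "silence"]
segments = speech_segments(labels, frame_ms=20, min_silence_frames=3)
```

Short silence runs, such as the single silence frame inside the first segment, do not split a sentence; only a run of at least `min_silence_frames` does.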

FIG. 3 is a flowchart illustrating in detail the step of extracting pause time information of the audio data shown in FIG. 1.

Referring to FIG. 3, a system for mapping text data to audio data according to an embodiment of the present invention may set a pause section of the audio data based on the pause section information of the divided text data (310). Here, the pause section of the audio data may be set, based on the pause section information of the text data, when the frames in the audio data persist as silence frames for at least a preset number of frames. This is described in detail with reference to FIG. 4.

The system may also obtain start time information and end time information of the set pause section of the audio data (320). Here, the start time information and end time information of the pause section may be obtained relative to the playback time of the entire audio data file. For example, the start time and the end time of a pause section may be obtained by setting the playback start time of the entire audio data file as the reference '0:00' and measuring the corresponding times against it.
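Step 320 amounts to converting frame indices into times measured from the start of the file. The following Python sketch assumes fixed-length frames and works in integer milliseconds to avoid rounding issues; the function names and the `m:ss` display format are assumptions based on the '0:00' reference mentioned above.

```python
def pause_times(first_frame, last_frame, frame_ms):
    """Return (start_ms, end_ms) of a pause spanning frames first_frame..last_frame inclusive.

    Times are measured from the start of the file, which is taken as 0 ms.
    """
    start_ms = first_frame * frame_ms
    end_ms = (last_frame + 1) * frame_ms
    return start_ms, end_ms


def format_timestamp(ms):
    """Format a millisecond offset as m:ss relative to the '0:00' file start."""
    minutes, seconds = divmod(ms // 1000, 60)
    return f"{minutes}:{seconds:02d}"


# Example: a pause covering frames 3250..3299 with 20 ms frames.
start_ms, end_ms = pause_times(3250, 3299, frame_ms=20)
stamp = (format_timestamp(start_ms), format_timestamp(end_ms))
```

With 20 ms frames, this pause starts 65 seconds into the file and lasts one second.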

FIG. 4 is a flowchart illustrating in detail the step of setting a pause section of the audio data shown in FIG. 3.

Referring to FIG. 4, a system for mapping text data to audio data according to an embodiment of the present invention may classify each frame included in the audio data and detect, as candidate pause sections, the sections in which silence frames persist for at least a preset number of frames (410). Because a pause at a comma can be longer than a pause between sentences, the system may detect more candidate pause sections than there are pause sections in the text data, by a preset number.

The system may also compare the detected candidate pause sections with a preset reference pause section, based on the pause section information of the divided text data (420). For example, suppose that the text data is divided into 11 sentences, and therefore has 10 pause sections, based on the symbol information inserted in advance to distinguish the sentences, and that 15 candidate pause sections are detected in the audio data. In that case, the system compares the candidate pause sections of the audio data with the preset reference pause section in order to distinguish runs of silence frames between words from pauses between sentences.

The system may also select the pause sections of the audio data from the candidate pause sections based on the comparison result (430). For example, if the reference pause section is defined as six or more consecutive silence frames, the system may select, from the 15 candidate pause sections, those in which silence frames persist for six or more frames as the pause sections between sentences, so that the pause sections of the audio data match the 10 pause sections of the text data.
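Steps 410 to 430 can be sketched as a two-stage filter over runs of silence frames. The sketch assumes frame labels are strings with "silence" marking silence frames; the function names and the use of a simple length threshold for both stages are assumptions. The six-frame reference comes from the example above.

```python
def silent_runs(labels):
    """Return (start_index, length) for every maximal run of 'silence' labels."""
    runs = []
    start = None
    for i, label in enumerate(labels):
        if label == "silence":
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:  # a run that reaches the end of the audio
        runs.append((start, len(labels) - start))
    return runs


def candidate_pauses(labels, min_candidate_frames):
    """Step 410: keep runs long enough to be considered candidate pauses."""
    return [run for run in silent_runs(labels) if run[1] >= min_candidate_frames]


def select_pauses(candidates, reference_frames=6):
    """Steps 420-430: keep candidates at least as long as the reference pause."""
    return [run for run in candidates if run[1] >= reference_frames]


labels = (["voiced"] * 3 + ["silence"] * 3 + ["voiced"] * 2 +
          ["silence"] * 7 + ["voiced"] + ["silence"] + ["voiced"])
candidates = candidate_pauses(labels, min_candidate_frames=2)
selected = select_pauses(candidates, reference_frames=6)
```

If the number of selected pauses still differs from the number expected from the text data, one possible variation (not described in the source) is to keep the N longest candidates instead of applying a fixed threshold.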

FIG. 5 is a flowchart illustrating in detail the step of calculating a speech section ratio of the text data shown in FIG. 1.

Referring to FIG. 5, a system for mapping text data to audio data according to an embodiment of the present invention may extract the phonemes included in each word of the sentence-divided text data and apply a preset fixed ratio to the extracted phonemes (510). For example, a preset fixed ratio of 0.7 may be applied when the extracted phoneme is one of the liquid sounds /m/, /l/, or /n/, and a preset fixed ratio of 0.5 may be applied when the extracted phoneme is any other consonant.

The system may also generate the speech section ratio of the text data per sentence based on the phonemes to which the preset fixed ratio has been applied (520). For example, the speech section ratio of the text data for each sentence may be generated by calculating, per sentence, the total of the ratios of the phonemes to which the preset fixed ratio has been applied.
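Steps 510 and 520 can be sketched as a weighted phoneme count. The ratios 0.7 for /m/, /l/, /n/ and 0.5 for other consonants come from the example above. Everything else is an assumption for illustration: the source gives no ratio for vowels (1.0 is used here), phonemes are represented as plain strings already extracted from the text, and the per-sentence totals are normalised so that they sum to 1 across sentences.

```python
LIQUIDS = {"m", "l", "n"}            # ratio 0.7 per the source's example
VOWELS = {"a", "e", "i", "o", "u"}   # assumed vowel set; vowel ratio is not given in the source


def phoneme_ratio(phoneme):
    """Return the fixed ratio for one phoneme."""
    if phoneme in VOWELS:
        return 1.0   # assumption: vowels carry full weight
    if phoneme in LIQUIDS:
        return 0.7
    return 0.5       # any other consonant


def sentence_weight(phonemes):
    """Step 520: total of the fixed ratios of a sentence's phonemes."""
    return sum(phoneme_ratio(p) for p in phonemes)


def speech_ratios(sentences):
    """Normalise the per-sentence totals so they sum to 1 (assumed interpretation)."""
    weights = [sentence_weight(s) for s in sentences]
    total = sum(weights)
    if total == 0:
        return [0.0 for _ in weights]
    return [w / total for w in weights]


ratios = speech_ratios([["m", "a", "n"], ["k", "a", "t"]])
```

Here the first sentence weighs 2.4 and the second 2.0, so the first receives the larger share of the speech time.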

FIG. 6 is a flowchart illustrating in detail the step of mapping the text data to the audio data shown in FIG. 1.

Referring to FIG. 6, a system for mapping text data to audio data according to an embodiment of the present invention may calculate the speech time information of the text data corresponding to the speech time information of the audio data by applying the speech section ratio of the text data to the speech time information of the audio data (610).

The system may also generate pause time information of the text data corresponding to the pause time information of the audio data (620).

Through the process of calculating the speech time information of the text data corresponding to the speech time information of the audio data (610) and the process of generating the pause time information of the text data corresponding to the pause time information of the audio data (620), the system may generate the text data to be mapped to the audio data and thereby map the text data to the audio data (not shown).

FIG. 7 is a diagram showing a process of mapping text data to audio data according to an embodiment of the present invention.

Referring to FIG. 7, the system for mapping text data to audio data according to an embodiment of the present invention may apply the speech section ratio 710 of the text data to the speech time information of the audio data 730 to calculate the speech time information of the text data 720 to be mapped, thereby generating the speech sections of the text data 720 to be mapped. For example, the system may apply the speech section ratio of the first sentence 711 included in the text data to the speech time information of the first sentence 731 included in the audio data 730 to calculate the time information of the first sentence 721 included in the text data 720 to be mapped, thereby generating the speech section of the first sentence 721 (740).

In addition, the system for mapping text data to audio data may set the pause sections of the text data 720 to be mapped by generating pause time information of the text data 720 to be mapped that corresponds to the pause time information of the audio data 730. For example, the system may generate time information for the pause section 723 between the first sentence 721 and the second sentence 722 included in the text data 720 to be mapped, corresponding to the time information for the pause section 733 between the first sentence 731 and the second sentence 732 included in the audio data 730, thereby setting the pause section 723 between the first sentence 721 and the second sentence 722.

After generating all of the sentences and pause sections included in the text data 720 to be mapped, the system for mapping text data to audio data may map the text data 720 to be mapped to the audio data (750).
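
One possible reading of the FIG. 7 process is sketched below. It assumes, for illustration only, that the audio analysis yields an overall start time, an overall end time, and the duration of each pause between sentences, and that each sentence's speech section ratio is a fraction of the total speech time (the audio duration minus the pauses). The function name `build_text_timeline` and its parameters are hypothetical and do not appear in the source.

```python
def build_text_timeline(audio_start, audio_end, pause_durations, ratios):
    """Return [(kind, start, end), ...] for the text data to be mapped.

    Sentence durations come from the text-side ratios; pause durations are
    copied from the audio side, mirroring steps 610 and 620.
    """
    if len(pause_durations) != len(ratios) - 1:
        raise ValueError("expected one pause between each pair of sentences")

    # Total time actually spent speaking (assumed: everything that is not a pause).
    speech_total = (audio_end - audio_start) - sum(pause_durations)

    timeline = []
    cursor = audio_start
    for i, ratio in enumerate(ratios):
        end = cursor + ratio * speech_total
        timeline.append(("sentence", cursor, end))
        cursor = end
        if i < len(pause_durations):
            end = cursor + pause_durations[i]
            timeline.append(("pause", cursor, end))
            cursor = end
    return timeline
```

For example, with a 10-second recording, one 1-second pause, and two sentences with ratios of 0.5 each, this sketch assigns 4.5 seconds to each sentence and places the pause between them.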

FIG. 8 is a block diagram showing a system for mapping text data to audio data according to an embodiment of the present invention.

Referring to FIG. 8, the system for mapping text data to audio data according to an embodiment of the present invention includes a preprocessing application unit 810, a speech time information extraction unit 820, a sentence division unit 830, a pause time information extraction unit 840, a speech section ratio calculation unit 850, a mapping unit 860, and an audio/text data generation unit 870.

The preprocessing application unit 810 may apply at least one of pre-emphasis, DC component removal, or noise removal to the audio data.
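
Two of the preprocessing operations named above can be sketched as follows. The source does not give filter details, so the first-order pre-emphasis filter, the coefficient 0.97 (a commonly used value), mean subtraction for DC removal, and the function names are all assumptions; noise removal is omitted from this sketch.

```python
def remove_dc(samples):
    """Subtract the mean so that the signal is centered on zero."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    return [x - mean for x in samples]


def pre_emphasis(samples, alpha=0.97):
    """First-order high-pass: y[n] = x[n] - alpha * x[n-1] (alpha is assumed)."""
    if not samples:
        return []
    out = [samples[0]]  # the first sample has no predecessor
    for prev, cur in zip(samples, samples[1:]):
        out.append(cur - alpha * prev)
    return out
```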

The speech time information extraction unit 820 extracts the speech time information of the audio data.

In addition, the speech time information extraction unit 820 may include a calculation unit 821 that calculates at least one of the ZCR or the energy of the audio data in units of a preset frame, a classification unit 822 that classifies each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame based on the calculation result, and an acquisition unit 823 that acquires the speech time information of the audio data on a sentence-by-sentence basis according to the classification result.
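
The frame-level work of the calculation unit 821 and the classification unit 822 could look roughly like the sketch below. The decision rule (low energy means silence, high zero-crossing ratio means unvoiced, otherwise voiced) is a common heuristic and is assumed here; the thresholds `energy_floor` and `zcr_threshold`, the non-overlapping framing, and all function names are hypothetical.

```python
def split_frames(samples, frame_len):
    """Cut the signal into non-overlapping frames; a trailing partial frame is dropped."""
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_ratio(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame, energy_floor=1e-4, zcr_threshold=0.3):
    """Label a frame as 'silence', 'unvoiced', or 'voiced' (thresholds assumed)."""
    if frame_energy(frame) < energy_floor:
        return "silence"
    if zero_crossing_ratio(frame) > zcr_threshold:
        return "unvoiced"
    return "voiced"
```

In practice the thresholds would need tuning for the recording conditions; the values here only make the sketch runnable.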

The sentence division unit 830 divides the text data into sentence units.

Here, the sentence division unit 830 may divide the text data into sentence units based on symbol information inserted in advance so that the sentences included in the text data can be distinguished from one another.
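
A sketch of this division, assuming the pre-inserted symbol is a single delimiter character. The source does not say which symbol is used, so the `"|"` marker and the function name are hypothetical.

```python
def split_into_sentences(text, marker="|"):
    """Split text at each pre-inserted marker and drop empty pieces."""
    return [part.strip() for part in text.split(marker) if part.strip()]
```

Relying on an explicit marker rather than on punctuation avoids mis-splitting at abbreviations or decimal points, which may be one reason for inserting the symbols in advance.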

The pause time information extraction unit 840 extracts the pause time information of the audio data from the audio data based on the divided text data.

In addition, the pause time information extraction unit 840 may include a setting unit 841 that sets the pause sections of the audio data based on pause section information of the divided text data, and an acquisition unit 842 that acquires start time information and end time information of the set pause sections of the audio data.
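
The work of the setting unit 841 and the acquisition unit 842 can be illustrated with the sketch below, which also follows the candidate detection, comparison with a preset reference pause section, and selection recited in the claims. It assumes that frames have already been labeled (for example as "silence"), that a candidate is kept when it is at least as long as the reference, and that every frame has the same duration; the function names and parameters are hypothetical.

```python
def candidate_pauses(frame_labels, frame_seconds):
    """Return runs of consecutive 'silence' frames as (start_time, end_time) pairs."""
    pauses = []
    run_start = None
    for i, label in enumerate(frame_labels):
        if label == "silence":
            if run_start is None:
                run_start = i
        elif run_start is not None:
            pauses.append((run_start * frame_seconds, i * frame_seconds))
            run_start = None
    if run_start is not None:  # silence that runs to the end of the audio
        pauses.append((run_start * frame_seconds, len(frame_labels) * frame_seconds))
    return pauses


def select_pauses(candidates, reference_seconds):
    """Keep candidates at least as long as the reference pause (assumed rule)."""
    return [(s, e) for s, e in candidates if e - s >= reference_seconds]
```

Each selected pair directly gives the start time information and end time information of a pause section.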

The speech section ratio calculation unit 850 extracts the phonemes included in the divided text data and calculates the speech section ratio of the text data.

In addition, the speech section ratio calculation unit 850 may include an application unit 851 that applies a preset fixed ratio to each extracted phoneme, and a generation unit 852 that generates the speech section ratio of the text data on a sentence-by-sentence basis based on the phonemes to which the preset fixed ratios have been applied.

The mapping unit 860 maps the text data to the audio data based on the extracted speech time information of the audio data, the pause time information, and the speech section ratio of the text data.

In addition, the mapping unit 860 may include a calculation unit 861 that applies the speech section ratio of the text data to the speech time information of the audio data to calculate speech time information of the text data corresponding to the speech time information of the audio data, and a generation unit 862 that generates, from the speech time information of the text data, pause time information of the text data corresponding to the pause time information of the audio data.

The audio/text data generation unit 870 may generate audio/text data as a result of mapping the text data to the audio data.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing device may run an operating system (OS) and one or more software applications executed on the operating system. The processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, the processing device is sometimes described as being used singly, but those of ordinary skill in the art will appreciate that the processing device may include a plurality of processing elements and/or multiple types of processing elements. For example, the processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

The software may include a computer program, code, instructions, or a combination of one or more of these, and may configure the processing device to operate as desired or may instruct the processing device independently or collectively. The software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. The software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules in order to perform the operations of the embodiments, and vice versa.

Although the embodiments have been described with reference to a limited set of embodiments and drawings, those of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from that of the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from that of the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents to the claims also fall within the scope of the claims that follow.

Claims

A method of mapping text data to audio data, the method comprising:
extracting speech time information of audio data;
dividing text data into sentence units;
extracting pause time information of the audio data from the audio data based on the divided text data;
extracting phonemes included in the divided text data and calculating a speech section ratio of the text data; and
mapping the text data to the audio data based on the extracted speech time information of the audio data, the pause time information, and the speech section ratio of the text data.

The method of claim 1,
wherein the extracting of the speech time information of the audio data comprises:
calculating at least one of a ZCR (Zero Crossing Ratio) or energy of the audio data in units of a preset frame;
classifying, based on a result of the calculating, each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame; and
acquiring the speech time information of the audio data in the sentence units according to a result of the classifying.

The method of claim 1,
wherein the extracting of the pause time information of the audio data comprises:
setting a pause section of the audio data based on pause section information of the divided text data; and
acquiring start time information and end time information of the set pause section of the audio data.

The method of claim 3,
wherein the setting of the pause section of the audio data comprises:
detecting candidate pause sections of the audio data;
comparing the detected candidate pause sections with a preset reference pause section; and
selecting the pause section of the audio data from among the candidate pause sections based on a result of the comparing.

The method of claim 1,
wherein the calculating of the speech section ratio of the text data comprises:
applying a preset fixed ratio to the extracted phonemes; and
generating the speech section ratio of the text data in the sentence units based on the phonemes to which the preset fixed ratio has been applied.

The method of claim 1,
wherein the mapping of the text data to the audio data comprises:
calculating speech time information of the text data corresponding to the speech time information of the audio data by applying the speech section ratio of the text data to the speech time information of the audio data; and
generating pause time information of the text data corresponding to the pause time information of the audio data.

The method of claim 1,
wherein the dividing of the text data into the sentence units comprises
dividing the text data into the sentence units based on symbol information inserted in advance so that the sentences included in the text data are distinguished from one another.

The method of claim 1, further comprising:
applying at least one of pre-emphasis, DC component removal, or noise removal to the audio data.

The method of claim 1, further comprising:
generating audio/text data as a result of mapping the text data to the audio data.

A computer-readable recording medium having recorded thereon a program for performing the method of any one of claims 1 to 9.

A system for mapping text data to audio data, the system comprising:
a speech time information extraction unit that extracts speech time information of audio data;
a sentence division unit that divides text data into sentence units;
a pause time information extraction unit that extracts pause time information of the audio data from the audio data based on the divided text data;
a speech section ratio calculation unit that extracts phonemes included in the divided text data and calculates a speech section ratio of the text data; and
a mapping unit that maps the text data to the audio data based on the extracted speech time information of the audio data, the pause time information, and the speech section ratio of the text data.

The system of claim 11,
wherein the speech time information extraction unit comprises:
a calculation unit that calculates at least one of a ZCR (Zero Crossing Ratio) or energy of the audio data in units of a preset frame;
a classification unit that classifies, based on a result of the calculation, each frame included in the audio data as at least one of a voiced frame, an unvoiced frame, or a silence frame; and
an acquisition unit that acquires the speech time information of the audio data in the sentence units according to a result of the classification.

The system of claim 11,
wherein the pause time information extraction unit comprises:
a setting unit that sets a pause section of the audio data based on pause section information of the divided text data; and
an acquisition unit that acquires start time information and end time information of the set pause section of the audio data.

The system of claim 11,
wherein the speech section ratio calculation unit comprises:
an application unit that applies a preset fixed ratio to the extracted phonemes; and
a generation unit that generates the speech section ratio of the text data in the sentence units based on the phonemes to which the preset fixed ratio has been applied.

The system of claim 11,
wherein the mapping unit comprises:
a calculation unit that calculates speech time information of the text data corresponding to the speech time information of the audio data by applying the speech section ratio of the text data to the speech time information of the audio data; and
a generation unit that generates pause time information of the text data corresponding to the pause time information of the audio data.

The system of claim 11, further comprising:
a preprocessing application unit that applies at least one of pre-emphasis, DC component removal, or noise removal to the audio data.

The system of claim 11, further comprising:
an audio/text data generation unit that generates audio/text data as a result of mapping the text data to the audio data.