KR102413514B1

KR102413514B1 - Voice data set building method based on subject domain

Info

Publication number: KR102413514B1
Application number: KR1020200131177A
Authority: KR
Inventors: 정유철; 봉대현; 서덕진
Original assignee: 금오공과대학교 산학협력단
Priority date: 2020-10-12
Filing date: 2020-10-12
Publication date: 2022-06-24
Also published as: KR20220048248A

Abstract

The present invention relates to a method of constructing a voice data set based on a subject domain.
According to an embodiment of the present invention, a method of constructing a voice data set based on a subject domain is disclosed. The method crawls a website to collect video files and subtitle files, tags event expressions extracted from the subtitle files with event tags, and builds, for each subject domain and on the basis of the event tags, a voice data set that includes voice data files for the event expressions.

Description

Voice data set building method based on subject domain

The present invention relates to a method of constructing a voice data set based on a subject domain. More specifically, it relates to a method that crawls a website to collect video files and subtitle files, tags event expressions extracted from the subtitle files with event tags, and builds, for each subject domain and on the basis of the event tags, a voice data set that includes voice data files for the event expressions.

Speech recognition is a technology that analyzes a speech signal produced by a person and automatically converts it into a character string. In industry and academia it is called Automatic Speech Recognition (ASR) or Speech To Text (STT).

Obstacles to the practical spread of speech recognition technology include differences in recognition rate across speakers and speaking styles, degradation of the recognition rate due to background noise, and recognition errors caused by limits on the vocabulary to be recognized.

To address these problems, the mainstream of research is a speech recognition methodology that builds voice data covering varied cases, learns from diverse speakers and environments, and expands the vocabulary search space without limit. In general, when training data is insufficient the recognition rate of a speech recognizer drops, so large-scale voice data is built for each situation and used for training in order to raise the recognition rate.

The following techniques have been proposed for constructing voice data.

The first is manual construction of voice data.

In manual construction, data is built by hand; for example, 1,000 people are recorded in order to build Korean voice data for each regional dialect as well as the Seoul dialect. When voice data is collected manually, however, high-performance digital recording equipment must be used so that the recordings contain no voice distortion or noise. Building high-quality training data for speech recognition therefore requires a great deal of time and effort for recording and quality control, which severely constrains large-scale construction. For this reason, manual collection of voice data is in most cases planned as a long-term project.

The second is automatic construction of voice data.

Lakomkin et al. (E. Lakomkin, S. Magg, C. Weber, and S. Wermter, “KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos,” arXiv:1903.00216, 2019.) automatically built a large amount of high-quality voice data by using YouTube's vast audio and subtitles together, and reported a better recognition rate than with voice data obtained from YouTube alone or from WSJ alone.

Kaewprateep et al. (Jirayu Kaewprateep and Santitham Prom-on, “Evaluation of small-scale deep learning architectures in Thai speech recognition,” 2018 International ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI-NCON)) evaluated speech recognition performance with a small amount of data, but did not achieve high performance.

Choi et al. (Y. Choi and B. Lee, “Pansori: ASR Corpus Generation from Open Online Video Contents,” IEEE Seoul Sect. Student Pap. Contest, pp. 117-121, 2018.) semi-automatically built a speech database from voice data available on the web, such as TED or TEDx, for high-performance ASR (Automatic Speech Recognition). In this case Korean voice data and Korean subtitle data were generated automatically, and volunteers were additionally mobilized to verify the data in order to overcome recording inaccuracy, ambiguous pronunciation, and poor recording quality caused by non-ideal audio conditions. Although the Korean Pansori voice data and subtitle data were generated automatically, high performance was not achieved.

The third is construction by voice data augmentation.

Newly collecting voice data, that is, training data, whether automatically or manually, demands considerable time, manpower, and funding. To address this limitation, a "data augmentation technique" has been proposed that increases the amount of data by transforming training data already held. A representative technique is SpecAugment (Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", arXiv:1904.08779, 2019).

However, the prior art described above is still limited in its ability to build voice command data quickly for each subject domain.

Republic of Korea Registered Patent No. 10-0314659 (2001.10.31)
Republic of Korea Registered Patent No. 10-0251003 (2000.01.10)
Republic of Korea Laid-Open Patent Publication No. 10-2001-0069452 (2001.07.25)
Republic of Korea Registered Patent No. 10-0822170 (2008.04.07)

E. Lakomkin, S. Magg, C. Weber, and S. Wermter, “KT-Speech-Crawler: Automatic Dataset Construction for Speech Recognition from YouTube Videos,” arXiv:1903.00216, 2019.
Jirayu Kaewprateep and Santitham Prom-on, “Evaluation of small-scale deep learning architectures in Thai speech recognition,” 2018 International ECTI Northern Section Conference on Electrical, Electronics, Computer and Telecommunications Engineering (ECTI-NCON).
Y. Choi and B. Lee, “Pansori: ASR Corpus Generation from Open Online Video Contents,” IEEE Seoul Sect. Student Pap. Contest, pp. 117-121, 2018.
Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", arXiv:1904.08779, 2019.

The present invention has been devised in view of the above problems. Its object is to provide a method of constructing a voice data set based on a subject domain, which crawls a website to collect video files and subtitle files, tags event expressions extracted from the subtitle files with event tags, and builds, for each subject domain and on the basis of the event tags, a voice data set that includes voice data files for the event expressions.

According to one aspect of the present invention in view of the above object, there is disclosed a method of constructing a voice data set based on a subject domain, in which a voice data set building system capable of accessing websites extracts voice data for event expressions and builds a voice data set, the method comprising:
1) obtaining a subtitle file based on a video file collected by crawling a website;
2) extracting an event expression from the subtitle file using a morpheme analyzer, and tagging the event expression with an event tag using a pre-trained named entity recognizer, wherein the event expression is a feature extracted on the basis of preset morpheme-level feature expressions and the event tag is a tag for distinguishing subject domains;
3) obtaining an audio file from a website using the event expression as a search condition, and excerpting the section containing the event expression from the audio file to obtain a voice data file, wherein the audio file is one for which a subtitle file can be obtained and the voice data file is excerpted on the basis of that subtitle file; and
4) when the text obtained by STT (Speech to Text) conversion of the voice data file is determined to correspond to an utterance of the event expression on which the voice data file was based, building a voice data set for each subject domain using the voice data file and the event tag.

According to another aspect of the present invention, there is disclosed a voice data building system comprising a memory storing one or more instructions and a processor executing the one or more instructions stored in the memory, wherein the processor: obtains a subtitle file based on a video file collected by crawling a website; extracts an event expression from the subtitle file using a morpheme analyzer and tags the event expression with an event tag using a pre-trained named entity recognizer, wherein the event expression is a feature extracted on the basis of preset morpheme-level feature expressions and the event tag is a tag for distinguishing subject domains; obtains an audio file from a website using the event expression as a search condition and excerpts the section containing the event expression from the audio file to obtain a voice data file, wherein the audio file is one for which a subtitle file can be obtained and the voice data file is excerpted on the basis of that subtitle file; and, when the text obtained by STT (Speech to Text) conversion of the voice data file is determined to correspond to an utterance of the event expression on which the voice data file was based, builds a voice data set for each subject domain using the voice data file and the event tag.

According to yet another aspect of the present invention, there is disclosed a computer program stored in a computer-readable medium which, in combination with hardware including a memory storing one or more instructions and a processor executing the one or more instructions stored in the memory, executes a method of constructing a voice data set based on a subject domain, the method comprising:
1) obtaining a subtitle file based on a video file collected by crawling a website;
2) extracting an event expression from the subtitle file using a morpheme analyzer, and tagging the event expression with an event tag using a pre-trained named entity recognizer, wherein the event expression is a feature extracted on the basis of preset morpheme-level feature expressions and the event tag is a tag for distinguishing subject domains;
3) obtaining an audio file from a website using the event expression as a search condition, and excerpting the section containing the event expression from the audio file to obtain a voice data file, wherein the audio file is one for which a subtitle file can be obtained and the voice data file is excerpted on the basis of that subtitle file; and
4) when the text obtained by STT (Speech to Text) conversion of the voice data file is determined to correspond to an utterance of the event expression on which the voice data file was based, building a voice data set for each subject domain using the voice data file and the event tag.

The present invention thus provides the advantage that a voice data set including voice data files for event expressions can be built conveniently for each subject domain on the basis of event tags.

FIG. 1 is an overall configuration diagram of a system in which the voice data set construction method according to an embodiment of the present invention is executed;
FIG. 2 is a block diagram of a voice data set building system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the voice data set building system according to an embodiment of the present invention from a hardware perspective;
FIG. 4 is a flowchart of the voice data set construction method according to an embodiment of the present invention;
FIG. 5 is a flowchart of the training process of the named entity recognizer according to an embodiment of the present invention;
FIG. 6 is a flowchart of the event expression extraction and event tagging process according to an embodiment of the present invention;
FIGS. 7 and 8 are flowcharts of the training process of the STT module according to an embodiment of the present invention; and
FIGS. 9 to 14 are diagrams illustrating examples of the voice data set construction method according to an embodiment of the present invention.

The present invention may be embodied in various other forms without departing from its technical spirit or essential characteristics. Accordingly, the embodiments of the present invention are merely illustrative in all respects and should not be construed as limiting.

Terms such as "first" and "second" are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component.

When a component is described as being "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, or a further component may be present between them.

As used in this application, a singular expression includes the plural unless the context clearly indicates otherwise. In this application, terms such as "include", "comprise", and "have" indicate that the components described in the specification, or combinations of them, are present; they do not exclude in advance the presence or addition of other components or features.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is an overall configuration diagram of a system in which the voice data set construction method according to an embodiment of the present invention is executed, FIG. 2 is a block diagram of the voice data set building system according to an embodiment of the present invention, and FIG. 3 is a schematic diagram of that system from a hardware perspective.

The voice data set construction method according to an embodiment of the present invention is executed in a voice data set building system 1000 connected to a web server 2000 through a network 10.

The voice data set building system 1000 crawls a website operated on the web server 2000 through the network 10 to obtain video files, audio files, and subtitle files, builds on that basis voice data sets of event expressions for each subject domain, and provides them to users through a user terminal 3000.

The user terminal 3000 may be understood as a PC, smartphone, tablet, or similar device that is equipped with a communication module and web browsing software so that it can access websites, and that can display text and images and accept input.

The web server 2000 is a server that provides video files, audio files, and subtitle files through a website, for example a server that provides subtitles together with videos on the web, such as YouTube, TEDx, or news sites. The video files, audio files, and subtitle files may be provided through an API (Application Programming Interface) of the web server 2000 or collected using a web crawler. Audio files and subtitle files may also be obtained by extracting them from video files. Content such as video files, audio files, and subtitle files can be regarded as the online resources from which the voice data set of this embodiment is built.

As shown in FIG. 2, from a functional point of view the voice data set building system 1000 of this embodiment includes: a web crawler module 102 that obtains audio files and/or subtitle files based on video files collected by crawling a website; an event expression extraction module 104 that extracts event expressions from the subtitle file using a morpheme analyzer 114 and tags the event expressions with event tags using a pre-trained named entity recognizer 116; a voice data file obtaining module 106 that obtains an audio file from a website using an event expression as a search condition and excerpts the section containing the event expression from the audio file to obtain a voice data file; and a voice data set building module 108 that, when the text obtained by STT (Speech to Text) conversion of a voice data file is determined to correspond to an utterance of the event expression on which the voice data file was based, builds a voice data set for each subject domain using the voice data file and the event tag. Although not shown, the system also includes an operation module that manages overall system operation and provides communication between the voice data set building system 1000 and the user terminal 3000 through the network 10.
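The acceptance rule of the voice data set building module 108 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the `transcribe` callable stands in for the STT module, the whitespace-insensitive substring check with a `difflib` similarity fallback is only one possible way to decide that a transcript "corresponds" to an event expression, and the 0.8 threshold and the tags `ECON` and `HEALTH` are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def matches_expression(stt_text, expression, threshold=0.8):
    """Return True if the STT text plausibly contains the event expression.

    Whitespace is ignored; the threshold is an illustrative assumption.
    """
    a = "".join(stt_text.split())
    b = "".join(expression.split())
    if not a or not b:
        return False
    if b in a:
        return True
    return SequenceMatcher(None, a, b).ratio() >= threshold


def build_datasets(clips, transcribe):
    """Group verified clips by event tag.

    clips: iterable of (audio_path, expression, event_tag) tuples.
    transcribe: callable mapping audio_path -> text (stand-in for STT).
    """
    datasets = defaultdict(list)
    for audio_path, expression, event_tag in clips:
        if matches_expression(transcribe(audio_path), expression):
            datasets[event_tag].append((audio_path, expression))
    return dict(datasets)
```

A clip whose transcript does not contain its event expression is simply dropped, so only verified clips reach the per-domain sets.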

In addition, the voice data set building system 1000 includes: a web crawling file storage 120 that stores the video, audio, and subtitle files collected by crawling websites; a noise DB 130 that stores noise data obtained from the video files collected by crawling websites; an audio DB 140 that stores voice data collected from the video files collected by crawling websites; a voice data file storage 150 that stores the voice data files obtained by excerpting sections containing event expressions; a voice data set storage 160 that stores the voice data sets built for each subject domain using the voice data files and event tags; and, although not shown, a user DB that stores user information and stores and manages each user's log records.

Referring to FIG. 3, from a hardware point of view the voice data set building system 1000 of this embodiment is a computing device that includes a memory 2 storing one or more instructions and a processor 4 executing the one or more instructions stored in the memory 2, and on which a computer program stored in a medium runs in order to execute the method of constructing a voice data set based on a subject domain. The voice data set building system 1000 of this embodiment may further include a data input/output interface 6, a communication interface 8, a data display means 3, and a data storage means 5.

FIG. 4 is a flowchart of the voice data set construction method according to an embodiment of the present invention, FIG. 5 is a flowchart of the training process of the named entity recognizer, FIG. 6 is a flowchart of the event expression extraction and event tagging process, and FIGS. 7 and 8 are flowcharts of the training process of the STT module.

In step 1), the voice data set building system 1000 obtains a subtitle file based on a video file collected by crawling a website.

Known techniques exist for crawling video files together with subtitle files; for example, Pansori (ASR Corpus Generation from Open Online Video Contents, https://arxiv.org/abs/1812.09798) may be used.

For example, after video files and subtitle files have been collected, the subtitle alignment can be used to extract the audio and the subtitle text of the corresponding portion of the video.
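As a rough illustration of using subtitle alignment, the sketch below parses subtitle cues and returns the time spans whose text contains a given expression. The patent does not specify a subtitle format; an SRT-style layout (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm` timing line, then text) is assumed here, and the function names are hypothetical. Cutting the audio at the returned spans is left to whatever audio tool is used.

```python
import re

# Timing line of an SRT-style cue (assumed format, not stated in the source).
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)


def parse_cues(srt_text):
    """Parse SRT-style text into a list of (start_sec, end_sec, text) cues."""
    cues = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            m = CUE_RE.search(line)
            if m:
                h1, m1, s1, ms1, h2, m2, s2, ms2 = map(int, m.groups())
                start = h1 * 3600 + m1 * 60 + s1 + ms1 / 1000
                end = h2 * 3600 + m2 * 60 + s2 + ms2 / 1000
                # Everything after the timing line is the cue text.
                text = " ".join(part.strip() for part in lines[i + 1:])
                cues.append((start, end, text))
                break
    return cues


def find_segments(cues, expression):
    """Return (start, end) spans of the cues whose text contains the expression."""
    return [(start, end) for start, end, text in cues if expression in text]
```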

In step 2), the voice data set building system 1000 extracts an event expression from the subtitle file using the morpheme analyzer 114, and tags the event expression with an event tag using the pre-trained named entity recognizer 116.

The event expression is a feature extracted on the basis of preset morpheme-level feature expressions, and the event tag is a tag for distinguishing subject domains. Subject domains may be divided into topics such as economy/investment, health, and daily life.

For example, in this embodiment the preset morpheme-level feature expressions include noun features, verb features, and noun phrase features, each of which may be configured as follows.

| Feature | Description | Example |
| --- | --- | --- |
| Noun feature | Features analyzed by morphological analysis as NNG (common noun) + XSN (noun-derivational suffix), NR (numeral) + NNB (dependent noun), and the like | 100억원 (10 billion KRW): 100 (NR) + 억원 (NNB) |
| Verb feature | Features analyzed by morphological analysis as VV (verb), NNG (common noun) + XSV (verb-derivational suffix), and the like | 상승하다 (to rise): 상승/NNG + 하/XSV + 다/EF |
| Noun phrase feature | Features extracted as NNG + NNG by morphological analysis, that is, bi-grams of noun features that appear adjacently within the same sentence | 주식 (NNG) + 상승 (NNG) (stock + rise); 비타민 (NNG) + 영양제 (NNG) (vitamin + nutritional supplement) |
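The feature patterns listed above can be sketched as a scan over part-of-speech pairs. This is an illustrative reading, not the patented implementation: the input is assumed to be a list of (morpheme, tag) pairs for one sentence, in the shape a morpheme analyzer's tagging output would take, and only the tag combinations named in the table are covered (the "and the like" cases are not guessed at). The function and constant names are hypothetical.

```python
# Tag pairs taken from the feature table; other combinations are not covered.
NOUN_PATTERNS = {("NNG", "XSN"), ("NR", "NNB")}
VERB_PATTERNS = {("NNG", "XSV")}


def extract_features(tagged):
    """Extract noun, verb, and noun phrase features from one tagged sentence.

    tagged: list of (morpheme, POS tag) pairs in sentence order.
    """
    features = {"noun": [], "verb": [], "noun_phrase": []}
    # Single-morpheme verbs (VV).
    for morpheme, tag in tagged:
        if tag == "VV":
            features["verb"].append(morpheme)
    # Adjacent pairs: derived nouns, derived verbs, and NNG + NNG bi-grams.
    for (m1, t1), (m2, t2) in zip(tagged, tagged[1:]):
        pair = (t1, t2)
        if pair in NOUN_PATTERNS:
            features["noun"].append(m1 + m2)
        elif pair in VERB_PATTERNS:
            features["verb"].append(m1 + m2)
        elif pair == ("NNG", "NNG"):
            features["noun_phrase"].append(m1 + " " + m2)
    return features
```

In the second example of the table, 상승 is both the tail of the noun phrase 주식 상승 and the stem of the derived verb, so it contributes to two features.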

In more detail, step 2) may be configured as follows.

In step 21), the subtitles contained in the subtitle file are split into sentences.

For example, text obtained from online resources can be split into sentences using the Mecab class of konlpy (a Python package for Korean information processing that bundles several morpheme analyzers), and the sentence splitting can be performed using the subtitle file alone.

Sentence separation may be performed, for example, as follows.

Original subtitle text:
이번 통합은 이해진 네이버 글로벌투자책임자 (GIO)가 지난 7월 4일 한국을 찾은 손정의 소프트뱅크 회장을 만난 지 4개월 만에 이뤄졌다.
두 회사는 다음 달 중 본계약을 체결하고 내년 10월까지 통합 작업을 마무리하는 것을 목표로 잡았다.
(The integration came four months after Lee Hae-jin, Naver's Global Investment Officer (GIO), met SoftBank Chairman Masayoshi Son, who visited Korea on July 4. The two companies aim to sign the main contract next month and to complete the integration work by October of next year.)

After sentence separation:
[이번 통합은 이해진 네이버 글로벌투자책임자 (GIO)가 지난 7월 4일 한국을 찾은 손정의 소프트뱅크 회장을 만난 지 4개월 만에 이뤄졌다.]
[두 회사는 다음 달 중 본계약을 체결하고 내년 10월까지 통합 작업을 마무리하는 것을 목표로 잡았다.]
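As a rough illustration of sentence separation, the sketch below splits text after sentence-final punctuation. It is a simplified stand-in written with Python's standard `re` module, not the konlpy-based procedure named in the description, and the function name `split_sentences` is hypothetical.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Split after '.', '!' or '?' when followed by whitespace or a newline."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


subtitle = (
    "이번 통합은 이해진 네이버 글로벌투자책임자 (GIO)가 지난 7월 4일 한국을 찾은 "
    "손정의 소프트뱅크 회장을 만난 지 4개월 만에 이뤄졌다.\n"
    "두 회사는 다음 달 중 본계약을 체결하고 내년 10월까지 통합 작업을 마무리하는 것을 목표로 잡았다."
)
for sentence in split_sentences(subtitle):
    print("[" + sentence + "]")
```

A punctuation rule like this mis-splits abbreviations and fails on subtitles that carry no punctuation at all, which is one reason an analyzer-based splitter may be preferred.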

In step 22), event expressions are extracted from each of the separated sentences using the morphological analyzer 114.

Known technologies may be used as the morphological analyzer 114. For example, Hannanum, Kkma, Mecab, and Okt in the Python library konlpy are available, and Mecab is preferably used.

In step 23), the event expression is input to the pre-trained named entity recognizer 116, which tags it with an event tag.

Various known technologies may be used as the named entity recognizer 116. For example, a bidirectional LSTM-CRF based model may be used.

A disambiguation dictionary may also be used when extracting event expressions and tagging events. FIG. 10 shows an example of a disambiguation dictionary. For example, the ambiguity among event expressions such as "비타민A" (vitamin A), "영양제" (nutritional supplement), "종합비타민" (multivitamin), and "보충제" (supplement) can be resolved so that the event expression "비타민" (vitamin) is extracted and given the tag "EL".
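One way to picture such a disambiguation dictionary is a lookup table from surface forms to a canonical event expression and its tag. The sketch below is illustrative only: the dictionary contents mirror the example in the text, while the names `DISAMBIGUATION` and `normalize` and the whitespace handling are assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical dictionary: surface form -> (canonical event expression, event tag).
DISAMBIGUATION: Dict[str, Tuple[str, str]] = {
    "비타민A": ("비타민", "EL"),
    "영양제": ("비타민", "EL"),
    "종합비타민": ("비타민", "EL"),
    "보충제": ("비타민", "EL"),
}


def normalize(expression: str) -> Optional[Tuple[str, str]]:
    """Look up an expression, ignoring spaces; return None if it is unknown."""
    return DISAMBIGUATION.get(expression.replace(" ", ""))


print(normalize("종합 비타민"))
print(normalize("주가"))
```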

Preferably, the named entity recognizer 116 may be trained as follows.

First, a subtitle file is obtained based on video files collected by crawling a website.

Next, the subtitles contained in the subtitle file are separated into sentences.

Next, event expressions are extracted from each of the separated sentences using the morphological analyzer 114. For example, the event expression is a feature extracted on the basis of preset morpheme-level feature patterns (e.g., noun features, verb features, and noun-phrase features). For example, an extracted feature may consist of one word or of two to three combined words.

Next, event tags are assigned to the extracted event expressions to generate training data. For example, the event tags may be assigned on the basis of user input (expert input) through the user terminal 3000.

For example, event expressions related to stocks/investment are given the event tag 'HK', event expressions related to health are given the event tag 'HH', and event expressions related to daily life are given the event tag 'EL'. Event tags are assigned separately in this way to build the training data. Each event tag identifies and defines one subject domain.

For example, about 1,000 items of training data are built for each subject domain, and the event expression (command) extraction model is trained on them.

Table 3 below shows examples of event expressions for each event tag.

Event tag | Event expressions
HH | 다이어트 (diet), 바이러스 (virus), 부작용 (side effect)
EL | 캘린더 (calendar), 이메일 (email), 카메라 (camera)
HK | 유상증자 (paid-in capital increase), 주가 (stock price), 매출 (sales)

FIG. 9 illustrates event expressions to which event tags have been assigned. FIG. 11 illustrates the event expressions (commands) finally extracted in the stock/investment domain.

The named entity recognizer 116 is trained using the training data.

In step 3), the voice data set construction system 1000 obtains an audio file from a website using the event expression as a search condition, and excerpts the section containing the event expression from the audio file to obtain a voice data file.

The audio file is one for which a subtitle file can be obtained, and the voice data file is excerpted on the basis of that subtitle file.

For example, when an event expression is entered as a query into the YouTube Search API, video files and subtitle files related to the query can be obtained. The obtained subtitle file is checked, and the portion of the audio file in which the event command occurs is cut out and saved.

For example, this function can be implemented with the Keyword_spotting_data_generator module of Honk (Honk, https://github.com/castorini/honk). Honk checks, sentence by sentence, whether the event expression entered as the query is present. If it is, Honk performs forced alignment between the audio of that portion and the words in the subtitle, cuts the audio accordingly, and extracts only the audio in which the event command is uttered. In this process, multiple audio files may be extracted for a single query.

The voice data file can be excerpted on the basis of the subtitle file. For example, by using the subtitles, timestamps at which no utterance exists in the audio of the whole video can be skipped, which reduces the search space.
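The subtitle-driven narrowing of the search space can be sketched as a filter over subtitle cues. This is a minimal illustration that assumes the subtitle file has already been parsed into timed cues. The `Cue` type, the function name, and the sample cues are hypothetical, and the forced alignment performed by Honk is not reproduced here.

```python
from typing import List, NamedTuple


class Cue(NamedTuple):
    start: float  # seconds from the beginning of the audio
    end: float
    text: str


def cues_containing(cues: List[Cue], expression: str) -> List[Cue]:
    """Keep only cues whose text contains the event expression.

    Time ranges that have no cue at all are never visited, so stretches
    without any utterance are skipped implicitly.
    """
    return [cue for cue in cues if expression in cue.text]


# Hypothetical parsed subtitle cues.
cues = [
    Cue(0.0, 2.5, "오늘 주가가 올랐습니다"),
    Cue(7.0, 9.0, "유상증자 소식입니다"),
    Cue(12.0, 14.0, "주가 전망은"),
]
for cue in cues_containing(cues, "주가"):
    print(cue.start, cue.end, cue.text)
```

The returned time ranges are the candidate sections from which the audio would then be cut.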

In step 4), when the voice data set construction system 1000 determines that the text obtained by speech-to-text (STT) conversion of the voice data file corresponds to an utterance of the event expression on which obtaining the voice data file was based, it builds a voice data set for each subject domain using that voice data file and its event tag.

Preferably, the voice data file is obtained by excerpting a section containing the event expression from a single audio file, and the STT conversion, the determination, and the construction of the voice data set may proceed as the timestamp moves. Moving the timestamp refers to advancing to the next time region after audio has been extracted according to the subtitle alignment. FIG. 12 shows the subtitles of an audio file collected from YouTube and illustrates the timestamp moving from audio segment 14 to segment 15 to segment 16.

In step 4), whether the text obtained by the STT conversion corresponds to an utterance of the event expression on which obtaining the voice data file was based may be determined using the edit distance algorithm (Levenshtein distance). For example, when the Levenshtein distance between the text obtained by the STT conversion and the event expression is smaller than a given threshold, the voice data file is determined to contain an utterance of that event expression. The Levenshtein distance algorithm is publicly known, for example through "https://en.wikipedia.org/wiki/Levenshtein_distance".
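For reference, the standard recurrence for the Levenshtein distance between an event expression a and an STT result b can be written as follows, where d(i, j) is the distance between the first i characters of a and the first j characters of b. The symbols d and θ (the threshold) are notation introduced here for illustration and do not appear in the original description.

```latex
\[
d(i, 0) = i, \qquad d(0, j) = j,
\]
\[
d(i, j) = \min\bigl\{\, d(i-1, j) + 1,\;\; d(i, j-1) + 1,\;\; d(i-1, j-1) + [a_i \neq b_j] \,\bigr\},
\]
\[
\text{accept the voice data file if } d(|a|, |b|) < \theta .
\]
```

Here [a_i ≠ b_j] equals 1 when the two characters differ and 0 otherwise, so the three terms correspond to deletion, insertion, and substitution.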

The STT conversion in step 4) may be performed as follows.

For example, the STT conversion in step 4) may be performed using an STT Open API. Examples of STT APIs include Naver's Clova, Kakao's Speech API, and Google's Cloud Speech-to-Text. Recognition performance differs somewhat between providers, and the open API that performs best for the characteristics of the data can be selected.

As another example, the STT conversion in step 4) may be performed using the STT module 118.

Preferably, the STT module 118 may include a speech recognition model trained with pseudo-labeling, a semi-supervised learning technique. Pseudo-labeling is publicly known, for example through "Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks (http://deeplearning.net/wp-content/uploads/2013/03/pseudo_label_final.pdf)".

The training process may be carried out as follows.

First, the STT module 118 trains a neural-network-based speech recognition model using synthesized voice data generated by mixing noise data with voice data whose spoken content is labeled.

The labeled voice data may be taken from publicly available speech training data sets such as Ai-Hub, Zeroth, and Clova call.

The speech recognition model is trained on the waveform and spectral characteristics of the voice data as identified from a spectrogram. For example, the waveform and spectral features of the voice data may be used as the input values of the training data.

Preferably, the noise data is obtained as follows: a website is crawled to obtain audio data and subtitle data for video data, the audio data is sliced into sections that have subtitles and sections that do not, and the sections without subtitles are taken as noise data. A speech recognition model trained only on noise-free voice data is vulnerable to noise, so training with various kinds of noise mixed in can make the model robust even to unseen data.
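The slicing into subtitled and unsubtitled sections can be sketched as interval arithmetic on the subtitle timings. This is a minimal sketch under the assumption that subtitle cues are available as (start, end) pairs in seconds. The function name and the merging of overlapping cues are illustrative choices, not details taken from the description.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def split_by_subtitles(
    duration: float, cues: List[Interval]
) -> Tuple[List[Interval], List[Interval]]:
    """Return (speech, noise): sections covered by subtitles and the gaps between them."""
    speech: List[Interval] = []
    for start, end in sorted(cues):
        start, end = max(0.0, start), min(duration, end)
        if start >= end:
            continue  # cue lies outside the audio or is empty
        if speech and start <= speech[-1][1]:
            # Overlapping or touching cues are merged into one speech section.
            speech[-1] = (speech[-1][0], max(speech[-1][1], end))
        else:
            speech.append((start, end))

    noise: List[Interval] = []
    cursor = 0.0
    for start, end in speech:
        if start > cursor:
            noise.append((cursor, start))
        cursor = end
    if cursor < duration:
        noise.append((cursor, duration))
    return speech, noise


speech, noise = split_by_subtitles(10.0, [(1.0, 3.0), (2.5, 4.0), (6.0, 8.0)])
print(speech)
print(noise)
```

The speech intervals would feed the pseudo-labeling stage, and the noise intervals would be cut out and kept for mixing.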

For example, the audio data and subtitle data for video data obtained by crawling a website can be obtained by entering an event expression as a query into the YouTube Search API, extracting the links of the resulting videos, and downloading the audio and subtitles from those videos.

Next, the STT module 118 inputs voice data collected by crawling a website into the speech recognition model to obtain voice data whose spoken content is pseudo-labeled.

Preferably, the collected voice data is obtained as follows: a website is crawled to obtain audio data and subtitle data for video data, the audio data is sliced into sections that have subtitles and sections that do not, and the sections with subtitles are taken as the collected voice data.

Next, the STT module 118 fine-tunes the speech recognition model using the pseudo-labeled voice data.
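The three stages above (train on labeled data, pseudo-label collected data, fine-tune on the pseudo-labels) can be illustrated with a deliberately tiny stand-in model. In the sketch below a one-dimensional nearest-centroid classifier takes the place of the neural speech recognizer. All names, the data, and the update rule with learning rate `lr` are assumptions made purely for illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple

Model = Dict[str, float]  # label -> centroid (toy stand-in for a speech model)


def train(labeled: List[Tuple[float, str]]) -> Model:
    """Stage 1: fit one centroid per label from labeled data."""
    labels = {label for _, label in labeled}
    return {y: mean([x for x, label in labeled if label == y]) for y in labels}


def predict(model: Model, x: float) -> str:
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: abs(model[y] - x))


def fine_tune(model: Model, pseudo: List[Tuple[float, str]], lr: float = 0.5) -> Model:
    """Stage 3: move each centroid part of the way toward its pseudo-labeled data."""
    updated = dict(model)
    for y in model:
        xs = [x for x, label in pseudo if label == y]
        if xs:
            updated[y] += lr * (mean(xs) - updated[y])
    return updated


labeled = [(0.0, "a"), (1.0, "a"), (9.0, "b"), (10.0, "b")]
unlabeled = [2.0, 3.0, 7.0]

model = train(labeled)
pseudo = [(x, predict(model, x)) for x in unlabeled]  # Stage 2: pseudo-labeling
model = fine_tune(model, pseudo)
print(pseudo)
print(model)
```

The point of the toy is only the data flow: the model's own predictions on unlabeled data become the training targets for the fine-tuning stage.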

An implementation example of training the speech recognition model with pseudo-labeling, a semi-supervised learning technique, is described with reference to FIGS. 7 and 8. FIG. 7 is a flowchart of acquiring voice and noise data from YouTube. FIG. 8 is a flowchart of the pretrained model and fine-tuning.

P1) The event expression is sent as a query to the YouTube Search API to obtain the addresses of videos associated with the event expression.

P2) The audio file and subtitles are downloaded from the video address.

P3) The audio file is loaded as an object.

P4) When the subtitles of the audio are checked, a segment is used as data only if its subtitle has five or more characters. An n-gram language model may be used for this purpose.

P5) The location of the audio file and the audio meta information are added to a relational database (RDB).

P6) The audio is saved as an actual file.

S1) Publicly available speech training data such as Ai-Hub, Zeroth, and Clova call is prepared.

S2) The YouTube database built through the process of FIG. 7 is provided.

S3), S4) A value is drawn from a continuous uniform distribution over a specified interval, and if it is greater than a given probability p, the speech and noise are mixed.

S5) Speech features are extracted for model training.

S6) Training is performed to create a pretrained model.

S7) The audio collected from YouTube is fed to the pretrained model created in S6, and the model's predictions are used as pseudo labels.

S8) A new voice database with the subtitles added is built.

S9) The previously trained model is made more robust by training it on the pseudo-labeled data.
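Steps S3) and S4) can be sketched as a random gate in front of an additive mix. The sketch below assumes the uniform value is drawn from the interval [0, 1], that samples are plain lists of floats, and that noise is added with a fixed gain. The function name, the `noise_gain` parameter, and the looping of short noise clips are illustrative assumptions.

```python
import random
from typing import List, Optional


def maybe_mix(
    speech: List[float],
    noise: List[float],
    p: float,
    noise_gain: float = 0.3,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Mix noise into speech only when a uniform draw exceeds p."""
    rng = rng or random.Random()
    u = rng.uniform(0.0, 1.0)  # assumed interval for the uniform draw
    if u <= p or not noise:
        return list(speech)  # leave the speech unchanged
    # Loop the noise clip so it covers the whole speech signal.
    return [s + noise_gain * noise[i % len(noise)] for i, s in enumerate(speech)]


rng = random.Random(0)
mixed = maybe_mix([0.1, 0.2, 0.3], [0.05, -0.05], p=0.5, rng=rng)
print(mixed)
```

With this convention a larger p means fewer training examples receive noise. Real audio would normally be mixed at a target signal-to-noise ratio rather than a fixed gain.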

Preferably, the speech recognition performance of the fine-tuned STT module 118 may also be checked.

For example, the event expression to be collected is entered as a query into the YouTube Search API, the audio and subtitles are downloaded from the resulting video links, only the audio in which an utterance exists is kept, and the output of the fine-tuned STT module 118 is checked to determine whether the event expression was uttered.

Because the recognition result for audio of poor sound quality rarely matches the queried event expression exactly, the edit distance algorithm (Levenshtein distance), which is commonly used to evaluate speech recognition performance, is used as the refinement check so that a large amount of voice data can still be secured. The difference between the two sequences is computed to obtain the Character Error Rate (CER), and if the CER is smaller than a given threshold, the audio is judged to contain an utterance of the command and is collected. For example, when collecting the voice command "점유율" (market share), a recognition result of "점율" for audio obtained from YouTube gives a CER of 33%, and a recognition result of "검유" gives a CER of 66%. If the threshold is 40%, audio recognized as "점율" is accepted as voice data containing the command even though the result is not exactly "점유율", whereas audio recognized as "검유" is not used as voice data.
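The CER check in this example can be reproduced with a short dynamic-programming implementation of the Levenshtein distance. The sketch below is a minimal illustration: the function names are hypothetical, characters are counted as Unicode code points (one per precomposed Hangul syllable), and the 40% threshold is the one from the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def contains_command(reference: str, hypothesis: str, threshold: float = 0.4) -> bool:
    """Accept the audio when the CER is below the threshold."""
    return cer(reference, hypothesis) < threshold


for result in ("점율", "검유"):
    print(result, round(cer("점유율", result), 2), contains_command("점유율", result))
```

"점율" is one deletion away from "점유율" (CER 1/3), while "검유" needs one substitution and one deletion (CER 2/3), which matches the figures in the text.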

As a modification, the STT conversion in step 4) may be performed using the STT Open API and the STT module 118 together.

In this case, the STT conversion of step 4) is first performed using the STT Open API, and it is determined whether the text obtained by that conversion corresponds to an utterance of the event expression on which obtaining the voice data file was based.

If the text obtained by STT conversion with the STT Open API is determined to correspond to an utterance of that event expression, a voice data set for each subject domain is built using the voice data file and its event tag. In this case, STT conversion with the STT module 118 is omitted.

If the text is not determined to correspond to an utterance of that event expression, the STT conversion of step 4) is additionally performed using the STT module 118, it is determined whether the resulting text corresponds to an utterance of the event expression, and a voice data set for each subject domain is built from the voice data file and its event tag according to the result.
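The fallback order in this modification can be sketched as a loop over recognizers that stops at the first matching transcript. The sketch below is illustrative only: the recognizers are stub functions standing in for an STT Open API client and the STT module 118, and every name in it is hypothetical.

```python
from typing import Callable, Iterable, List, Optional

Recognizer = Callable[[bytes], str]


def verify_with_fallback(
    audio: bytes,
    expression: str,
    recognizers: Iterable[Recognizer],
    matches: Callable[[str, str], bool],
) -> Optional[str]:
    """Try each recognizer in order and return the first matching transcript, else None."""
    for recognize in recognizers:
        text = recognize(audio)
        if matches(expression, text):
            return text  # later recognizers are skipped
    return None


calls: List[str] = []


def open_api_stub(audio: bytes) -> str:
    calls.append("open_api")
    return "검유"  # pretend the open API misrecognized the audio


def stt_module_stub(audio: bytes) -> str:
    calls.append("stt_module")
    return "점유율"


result = verify_with_fallback(
    b"\x00\x01", "점유율", [open_api_stub, stt_module_stub], lambda e, t: e == t
)
print(result, calls)
```

Exact string equality is used as the matcher only to keep the sketch short. An edit-distance threshold, as described for step 4), would be passed in the same position.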

In an embodiment that was actually implemented, STT Open APIs (Naver Clova Speech Recognition, Google Cloud Speech-to-Text) were used to convert voice data such as that shown in FIG. 13 into text such as that shown in FIG. 14, and data was secured in this way.

FIG. 14 classifies the texts by subject domain, shows the number of items collected and the number retained after refinement, and shows the proportion finally obtained after collection and refinement according to the characteristics of each subject domain. For daily life, 21,853 voice data items were collected in about 2 days and 8 hours. For health, 20,172 items were collected in about 2 days and 8 hours. The average refinement rate was 21.60% for daily-life voice data and 23.13% for health voice data.

As a result, by converting the voice data of each subject domain into text through the STT Open API, high-quality training data amounting to about 22% of the collected volume could be obtained on average. If the STT module is used in addition, further data can be secured.

Embodiments of the present invention include a program for performing various computer-implemented operations and a computer-readable recording medium on which the program is recorded. The computer-readable recording medium may include program instructions, data files, data structures, and the like, alone or in combination. The medium may be specially designed and configured for the present invention, or may be known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical recording media such as CD-ROMs, DVDs, and USB drives; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler but also high-level language code that can be executed by a computer using an interpreter or the like.

1000: voice data set construction system
2000: web server
3000: user terminal

Claims

A method in which a voice data set construction system capable of accessing websites constructs a voice data set by extracting voice data for event expressions, the method comprising:
1) obtaining a subtitle file based on a video file collected by crawling a website;
2) extracting an event expression from the subtitle file using a morphological analyzer, and tagging the event expression with an event tag using a pre-trained named entity recognizer, wherein the event expression is a feature extracted on the basis of preset morpheme-level feature patterns and the event tag is a tag for distinguishing subject domains;
3) obtaining an audio file from a website using the event expression as a search condition, and excerpting a section containing the event expression from the audio file to obtain a voice data file, wherein the audio file is one for which a subtitle file can be obtained and the voice data file is excerpted on the basis of the subtitle file; and
4) constructing a voice data set for each subject domain using the voice data file and the event tag when the text obtained by speech-to-text (STT) conversion of the voice data file is determined to correspond to an utterance of the event expression on which obtaining the voice data file was based,
wherein the STT conversion in step 4) is performed using an STT module, and
the STT module comprises a speech recognition model trained through:
training a neural-network-based speech recognition model using synthesized voice data generated by mixing noise data with voice data whose spoken content is labeled;
obtaining voice data whose spoken content is pseudo-labeled by inputting voice data collected by crawling a website into the speech recognition model; and
fine-tuning the speech recognition model using the pseudo-labeled voice data.

Deleted

A method in which a voice data set construction system capable of accessing websites constructs a voice data set by extracting voice data for event expressions, the method comprising:
1) obtaining a subtitle file based on a video file collected by crawling a website;
2) extracting an event expression from the subtitle file using a morphological analyzer, and tagging the event expression with an event tag using a pre-trained named entity recognizer, wherein the event expression is a feature extracted on the basis of preset morpheme-level feature patterns and the event tag is a tag for distinguishing subject domains;
3) obtaining an audio file from a website using the event expression as a search condition, and excerpting a section containing the event expression from the audio file to obtain a voice data file, wherein the audio file is one for which a subtitle file can be obtained and the voice data file is excerpted on the basis of the subtitle file; and
4) constructing a voice data set for each subject domain using the voice data file and the event tag when the text obtained by speech-to-text (STT) conversion of the voice data file is determined to correspond to an utterance of the event expression on which obtaining the voice data file was based,
wherein, in step 4),
the determination of whether the text obtained by the STT conversion corresponds to an utterance of the event expression on which obtaining the voice data file was based is performed using the edit distance algorithm (Levenshtein distance).

Deleted