KR20220025578A - Method for learning a speech recognition model in an objective task domain with sparse speech data based on teacher-student learning

Info

Publication number: KR20220025578A
Application number: KR1020200106507A
Authority: KR
Inventors: 강병옥; 박전규; 전형배
Original assignee: 한국전자통신연구원
Priority date: 2020-08-24
Filing date: 2020-08-24
Publication date: 2022-03-03

Abstract

Provided is a method for training a speech recognition model in a target task domain with sparse speech data, based on teacher-student learning. The method comprises: preparing large-scale speech data and raw speech data corresponding to the target task domain (hereinafter, target task raw speech data); generating speech data augmented to correspond to the target task domain (hereinafter, target task augmented speech data) by learning from the target task raw speech data and the large-scale speech data; and performing transfer learning on a second speech recognition model that takes the target task augmented speech data as input, based on a first speech recognition model trained in advance on the large-scale speech data. The invention thereby makes it possible to generate a speech recognition model optimized for a target task domain with sparse speech data.

Description

METHOD FOR LEARNING A SPEECH RECOGNITION MODEL IN AN OBJECTIVE TASK DOMAIN WITH SPARSE SPEECH DATA BASED ON TEACHER-STUDENT LEARNING

The present invention relates to a method for training a speech recognition model in a target task domain with sparse speech data, based on teacher-student learning.

Speech recognition technology has advanced steadily and is now applied in many fields. An acoustic model used in a speech recognition service can be expected to perform best when it is trained on large-scale data that matches the acoustic characteristics of the speakers who use the service, as well as its noise and channel environment.

Depending on the service, however, such data can be hard to obtain. In non-native speaker speech recognition, for example, the speakers are few compared with native speakers, so it is difficult to collect a large amount of speech data matching the service. In a call center speech recognition service, personal information security concerns limit how much speech data can be collected. In such cases it is difficult to train and generate an acoustic model.

Korean Laid-open Patent Publication No. 10-2019-0101333 (2019.08.30)

The problem to be solved by the present invention is to provide a method for training a speech recognition model in a target task domain with sparse speech data, based on teacher-student learning, that can improve speech recognition performance in a sparse-data domain where large-scale speech data is difficult to collect, through data augmentation and transfer learning using a small amount of data.

However, the problems to be solved by the present invention are not limited to the one described above, and other problems may exist.

To solve the above problem, a method according to one aspect of the present invention for training a speech recognition model in a target task domain with sparse speech data based on teacher-student learning comprises: preparing large-scale speech data and raw speech data corresponding to the target task domain (hereinafter, target task raw speech data); generating speech data augmented to correspond to the target task domain (hereinafter, target task augmented speech data) by learning from the target task raw speech data and the large-scale speech data; and performing transfer learning on a second speech recognition model that takes the target task augmented speech data as input, based on a first speech recognition model trained in advance on the large-scale speech data. Here, the target task raw speech data is sparse speech data, smaller in amount than the large-scale speech data.

In some embodiments of the present invention, generating the target task augmented speech data may comprise: performing learning based on latent variables that carry attribute information of the large-scale speech data and the target task raw speech data; and, based on the result of that learning, generating the target task augmented speech data by substituting and recombining the attribute information of the large-scale speech data and the target task raw speech data.

In some embodiments of the present invention, performing learning based on the latent variables corresponding to the attribute information of the large-scale speech data and the target task raw speech data may comprise: setting the large-scale speech data and the target task raw speech data as input data to a single encoder and inferring, as its output, latent variables that carry the attribute information of the input data; and setting the latent variables as input data to a decoder and regenerating the large-scale speech data and the target task raw speech data that were set as the encoder input.

In some embodiments of the present invention, the attribute information may include first attribute information, which preserves the content attribute within an utterance, consists of a phoneme sequence, and varies over time, and second attribute information, which is channel and noise environment information and speaker characteristic information corresponding to the entire time span of the utterance in each item of speech data.

In some embodiments of the present invention, generating the target task augmented speech data based on the learning result may comprise: setting the large-scale speech data and the target task raw speech data as the respective inputs to respective encoders and inferring, as their outputs, first and second latent variables that carry the first and second attribute information of each input; substituting the second latent variable of the large-scale speech data, which carries its second attribute information, with the second latent variable of the target task raw speech data; and setting the first latent variable of the large-scale speech data and the substituted second latent variable as input data to a single decoder to generate the target task augmented speech data.

In some embodiments of the present invention, performing transfer learning on the second speech recognition model may comprise: performing training on the first and second speech recognition models using, as input data, training data having the same labels; applying, as a result of the training, distribution values for some layers of the first speech recognition model to the second speech recognition model; and training the second speech recognition model based on the applied distribution values.

In some embodiments of the present invention, performing transfer learning on the second speech recognition model may further comprise performing domain-adversarial multi-task learning, in which the distribution values of an intermediate layer in the decoder of each of the first and second speech recognition models are set as inputs and domain classification corresponding to the speech data is performed.

Some embodiments of the present invention may further comprise performing post-training on the decoder of the transfer-learned second speech recognition model, using refined target task speech data prepared in advance as input data.

A computer program according to another aspect of the present invention for solving the above problem is combined with a computer, which is hardware, to execute the method for training a speech recognition model in a target task domain with sparse speech data based on teacher-student learning, and is stored in a computer-readable recording medium.

Other specific details of the present invention are included in the detailed description and the drawings.

According to the present invention described above, for a target task domain whose recognition target is sparse speech data that is difficult to collect in large amounts, speech data is augmented to have acoustic characteristics similar to those of the target task domain by using a large refined speech corpus from another domain, for which data collection is relatively easy. Teacher-student transfer learning is then performed with the augmented speech data and the large-scale speech data as inputs, so that a speech recognition model optimized for the target task domain with sparse speech data can be generated.

The effects of the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those skilled in the art from the following description.

FIG. 1 is a flowchart of a speech recognition model training method according to an embodiment of the present invention.
FIGS. 2A and 2B are diagrams for explaining the generation of target task augmented speech data.
FIG. 3 is a diagram for explaining transfer learning based on teacher-student learning in an embodiment of the present invention.
FIG. 4 is a diagram for explaining a speech recognition model training apparatus according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent from the embodiments described below in detail together with the accompanying drawings. The present invention is not limited to the embodiments disclosed below, however, and may be implemented in various different forms. The embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention pertains are fully informed of its scope. The present invention is defined only by the scope of the claims.

The terminology used herein is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular also includes the plural unless the phrase specifically states otherwise. As used herein, "comprises" and/or "comprising" does not exclude the presence or addition of one or more components other than those stated. Like reference numerals refer to like components throughout the specification, and "and/or" includes each of the stated components and every combination of one or more of them. Although "first", "second", and the like are used to describe various components, these components are not limited by these terms. The terms are used only to distinguish one component from another. Accordingly, a first component mentioned below may be a second component within the technical spirit of the present invention.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined.

Hereinafter, a method performed by a system (100) for training a speech recognition model in a target task domain with sparse speech data based on teacher-student learning according to an embodiment of the present invention (hereinafter, the speech recognition model training method) is described in detail with reference to the accompanying drawings.

FIG. 1 is a flowchart of a speech recognition model training method according to an embodiment of the present invention.

The steps shown in FIG. 1 may be understood as being performed by a server (hereinafter, the server) constituting the system (100) for training a speech recognition model in a target task domain with sparse speech data based on teacher-student learning, but are not limited thereto.

First, the server prepares large-scale speech data and prepares raw speech data corresponding to the target task domain (hereinafter, target task raw speech data) (S110).

Here, the large-scale speech data is speech data from another domain, different from the target task domain. The target task raw speech data is sparse speech data, smaller in amount than the large-scale speech data.

Next, the server learns from the target task raw speech data and the large-scale speech data and generates speech data augmented to correspond to the target task domain (hereinafter, target task augmented speech data) (S120).

FIGS. 2A and 2B are diagrams for explaining the generation of target task augmented speech data.

The server performs learning based on latent variables that carry attribute information of the large-scale speech data and the target task raw speech data. Based on the result of that learning, it generates the target task augmented speech data by substituting and recombining the attribute information of the large-scale speech data and the target task raw speech data.

Specifically, referring to FIG. 2A, the server sets the large-scale speech data and the target task raw speech data as input data to a single encoder and infers, as its output, latent variables that carry the attribute information of the input data (S121).

In one embodiment, a variational autoencoder may be used as the encoder. In this embodiment, target task raw speech data corresponding to the target task, such as a call center speech DB, and large-scale speech data, such as a broadcast speech DB, are set as inputs to the variational autoencoder, which can then infer latent variables in which the attribute information is separated.

In one embodiment, the attribute information may include first attribute information and second attribute information. The first attribute information is an attribute at the level of short time segments: it preserves the content attribute within an utterance, consists of a phoneme sequence, and varies over time. The second attribute information is an attribute at the level of the entire time span of the utterance in each item of speech data: it covers channel and noise environment information and speaker characteristic information. The speaker characteristic information may be idiosyncratic vocal information such as accent, tone, rhythm, and speaking rate.

After inferring the latent variables (Z₁, Z₂) in which the attribute information is separated using the variational autoencoder, the server sets the latent variables as input data to the decoder and performs a learning process that regenerates the large-scale speech data and the target task raw speech data that were set as the encoder input (S122).

By setting the latent variables, separated into first and second attribute information, as decoder input and learning to regenerate the original input speech (the large-scale speech data and the target task raw speech data), the server can learn attribute-separated latent variables from a large amount of speech data in an unsupervised manner.
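The regeneration objective of S121 and S122 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patent's implementation: it assumes a variational autoencoder with diagonal Gaussian posteriors for Z₁ and Z₂, a squared-error reconstruction term, and a standard normal prior, none of which the text specifies. All function names are hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL divergence from N(mu, diag(exp(log_var))) to N(0, I)."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def squared_error(x, x_hat):
    """Reconstruction term: squared error between input and regenerated values."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def reconstruction_objective(x, x_hat, z1_stats, z2_stats):
    """Hypothetical loss for the regeneration step (assumed form).

    z1_stats and z2_stats are (mu, log_var) pairs for Z1 and Z2; the loss is
    the reconstruction error plus one KL term per latent variable.
    """
    return (squared_error(x, x_hat)
            + kl_to_standard_normal(*z1_stats)
            + kl_to_standard_normal(*z2_stats))
```

Because no labels enter the objective, the same loss applies to both the large-scale corpus and the sparse target task data, which is what makes the learning unsupervised.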

Next, referring to FIG. 2B, the server sets the large-scale speech data and the target task raw speech data as the respective inputs to respective encoders and infers, as their outputs, first and second latent variables that carry the first and second attribute information of each input (S123).

That is, the large-scale speech data, a refined speech corpus that can be secured in large quantity, is input to a variational autoencoder, and the target task raw speech data is input to a variational autoencoder, to infer for each the first and second latent variables (Z₁, Z₂) carrying the first and second attribute information.

The server then substitutes the second latent variable of the large-scale speech data, which carries its second attribute information, with the second latent variable of the target task raw speech data (S124).

The server then sets the first latent variable of the large-scale speech data and the substituted second latent variable as input data to a single decoder and generates the target task augmented speech data (S125).

In other words, the server keeps the first latent variable of the speech corpus that can be secured in large quantity, which preserves the content attribute consisting of the phoneme sequence and so serves as transcription information. It replaces the second latent variable with the second latent variable, carrying the second attribute information, of the sparse speech data that matches the target task. By feeding the first and second latent variables obtained in this way to the decoder, the server can produce target task augmented speech data in large quantity.

As a result, the large-scale speech data and the target task augmented speech data share the same text transcriptions.
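The substitution and recombination of S123 to S125 can be illustrated with a toy example. This is a sketch under stated assumptions: the `Latents` container, `swap_z2`, `augment`, and especially `toy_decoder` (which simply shifts each segment by Z₂) are hypothetical stand-ins for the trained encoder and decoder, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Latents:
    z1: List[List[float]]  # one content vector per short segment (first attribute)
    z2: List[float]        # one vector per utterance: channel, noise, speaker (second attribute)


def swap_z2(source: Latents, target: Latents) -> Latents:
    """Keep the source content latents; take the utterance-level latent from the target."""
    return Latents(z1=[list(seg) for seg in source.z1], z2=list(target.z2))


def toy_decoder(latents: Latents) -> List[List[float]]:
    """Stand-in decoder (assumption): shifts every segment vector by z2."""
    return [[c + e for c, e in zip(seg, latents.z2)] for seg in latents.z1]


def augment(source: Latents, transcript: str,
            target: Latents) -> Tuple[List[List[float]], str]:
    """Return (augmented segments, transcript); the transcript is reused unchanged."""
    return toy_decoder(swap_z2(source, target)), transcript
```

For example, `augment(broadcast_latents, "hello", call_center_latents)` would return segments decoded from the broadcast content latents and the call center utterance-level latent, paired with the original transcript `"hello"`. Only Z₂ changes, so every utterance in the large corpus yields one labeled target-domain utterance per target Z₂.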

Referring again to FIG. 1, based on the teacher-student learning technique, the server sets the first speech recognition model, trained in advance on the large-scale speech data, as the teacher, and sets the second speech recognition model, which takes as input the target task augmented speech data generated in the previous step, as the student. It then performs transfer learning based on teacher-student learning (S130).

This step performs teacher-student transfer learning for a target task whose recognition target is a sparse speech data domain in which large-scale speech data is difficult to collect.

FIG. 3 is a diagram for explaining the transfer learning process based on teacher-student learning in an embodiment of the present invention.

In knowledge transfer based on teacher-student learning, training is performed on the first and second speech recognition models using, as input data, training data that carries the same label (ground-truth) information.

As a result of the training, distribution values for some layers of the first speech recognition model are applied to the second speech recognition model, and the second speech recognition model is trained based on the applied distribution values. In one embodiment, the present invention may apply a training method that combines knowledge distillation, which trains the student by passing the posterior distribution at the output of the teacher network to the output of the student network, with attention transfer, which trains the student by passing distributions from intermediate layers.
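A combined knowledge distillation and attention transfer loss could look like the sketch below. The patent does not give loss formulas, so the choices here are assumptions: a temperature-softened KL divergence for distillation, per-frame activation energy as the attention map, and a weight `beta` for combining the two. All names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [v / temperature for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened posteriors to the student's."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def attention_map(features):
    """Per-frame activation energy, L2-normalised over the utterance."""
    energy = [sum(a * a for a in frame) for frame in features]
    norm = math.sqrt(sum(e * e for e in energy)) or 1.0
    return [e / norm for e in energy]


def attention_transfer_loss(teacher_features, student_features):
    """Euclidean distance between teacher and student attention maps."""
    t = attention_map(teacher_features)
    s = attention_map(student_features)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, s)))


def student_loss(teacher_logits, student_logits,
                 teacher_features, student_features, beta=1.0):
    """Assumed combination: output-level term plus weighted intermediate-layer term."""
    return (distillation_loss(teacher_logits, student_logits)
            + beta * attention_transfer_loss(teacher_features, student_features))
```

Because the attention map sums over channels, the teacher and student layers only need the same number of frames, not the same width. In the setting described above, the teacher would see an utterance from the large-scale data and the student its augmented counterpart with the same transcript.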

In one embodiment, the server takes as input the large-scale speech data and the target task augmented speech data, which have the same labels. It sets the distribution values (deep features) obtained from an intermediate layer in the decoder of each of the first and second speech recognition models, which form the teacher-student network, again as the respective inputs of the first and second speech recognition models, and may perform domain-adversarial multi-task learning that carries out domain classification corresponding to the speech data (S140).

This amounts to multi-task learning with an added loss function that is trained in a direction insensitive to domain shift in the speech data. A generative adversarial network (GAN) may be used to perform this multi-task learning.
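The added loss can be sketched as a pair of opposing objectives. This is an illustrative sketch only: the text mentions a GAN but gives no formulas, so the binary cross-entropy domain classifier, the subtraction in `recognizer_objective`, and the `weight` parameter are assumptions in the style of gradient-reversal domain-adversarial training.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def domain_loss(domain_logit, is_target_domain):
    """Binary cross-entropy of a domain classifier for one utterance.

    Uses -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x).
    """
    return softplus(-domain_logit) if is_target_domain else softplus(domain_logit)


def classifier_objective(domain_loss_value):
    """The domain classifier tries to tell the two domains apart."""
    return domain_loss_value


def recognizer_objective(recognition_loss, domain_loss_value, weight=0.1):
    """The recognizer minimises its own loss while making domains hard to tell apart."""
    return recognition_loss - weight * domain_loss_value
```

Minimising `recognizer_objective` pushes the intermediate features toward a representation from which the domain cannot be predicted, which is one way to read "insensitive to domain shift".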

In one embodiment, when the speech recognition model is an end-to-end model comprising an encoder and a decoder, the server may, with the encoder frozen, perform post-training on the decoder of the transfer-learned second speech recognition model, using a small amount of refined target task speech data prepared in advance as input data (S150).

Through this additional post-training with a small amount of separately prepared, refined target task speech data, a speech recognition model optimized for the target task domain can finally be obtained.
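Post-training with the encoder frozen (S150) comes down to updating only the decoder parameters. The sketch below assumes plain gradient descent and a flat parameter dictionary whose names start with `encoder.` or `decoder.`; both the naming scheme and the function are hypothetical, since the text does not describe the optimizer or the model layout.

```python
def post_train_step(params, grads, learning_rate=0.01, frozen_prefix="encoder."):
    """One gradient-descent step that leaves frozen parameters untouched.

    params and grads map parameter names to floats (assumed layout).
    Returns a new dictionary; the input dictionaries are not modified.
    """
    updated = {}
    for name, value in params.items():
        if name.startswith(frozen_prefix):
            updated[name] = value  # encoder stays fixed
        else:
            updated[name] = value - learning_rate * grads[name]
    return updated
```

Freezing the encoder keeps the representation learned during transfer learning intact, so the small refined data set only has to adapt the decoder.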

In the above description, steps S110 to S150 may be further divided into additional steps or combined into fewer steps, depending on the implementation of the present invention. Some steps may be omitted as necessary, and the order of the steps may be changed. In addition, even where omitted, the description of the speech recognition model training method of FIGS. 1 to 3 also applies to FIG. 4, described below.

Hereinafter, a system (100) for training a speech recognition model in a target task domain with sparse speech data based on teacher-student learning according to an embodiment of the present invention (hereinafter, the speech recognition model training system) is described.

FIG. 4 is a diagram for explaining the speech recognition model training system (100) according to an embodiment of the present invention.

Referring to FIG. 4, the speech recognition model training system (100) according to an embodiment of the present invention includes a communication module (110), a memory (120), and a processor (130).

The communication module (110) receives the prepared large-scale speech data and the raw speech data corresponding to the target task domain.

The memory (120) stores a program for performing transfer learning based on teacher-student learning using the data received from the communication module (110).

By executing the program stored in the memory (120), the processor (130) learns from the target task raw speech data and the large-scale speech data, generates speech data augmented to correspond to the target task domain, and performs transfer learning on the second speech recognition model, which takes the target task augmented speech data as input, based on the first speech recognition model trained in advance on the large-scale speech data.

The speech recognition model training system (100) described with reference to FIG. 4 may be provided as a component of the server described above.

The speech recognition model training method according to an embodiment of the present invention described above may be implemented as a program (or application) and stored in a medium so as to be executed in combination with a computer, which is hardware.

For the computer to read the program and execute the methods implemented as the program, the program may include code written in a computer language such as C, C++, JAVA, Ruby, or machine language that the processor (CPU) of the computer can read through a device interface of the computer. Such code may include functional code related to functions that define the functionality needed to execute the methods, and may include control code related to the execution procedure needed for the processor of the computer to execute those functions in a predetermined order. The code may further include memory-reference code indicating at which location (address) in the internal or external memory of the computer the additional information or media needed for the processor to execute the functions should be referenced. In addition, when the processor of the computer needs to communicate with another remote computer or server to execute the functions, the code may further include communication-related code specifying how to communicate with the remote computer or server using the communication module of the computer, and what information or media to transmit and receive during communication.

The storage medium refers not to a medium that stores data for a short moment, such as a register, a cache, or a memory, but to a medium that stores data semi-permanently and can be read by a device. Specific examples of the storage medium include, but are not limited to, ROM, RAM, CD-ROM, magnetic tape, floppy disk, and optical data storage devices. That is, the program may be stored in various recording media on various servers that the computer can access, or in various recording media on the user's computer. The medium may also be distributed over computer systems connected through a network, so that computer-readable code is stored in a distributed manner.

The foregoing description of the present invention is provided for illustration, and those of ordinary skill in the art to which the present invention pertains will understand that it can be readily modified into other specific forms without changing the technical spirit or essential features of the present invention. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. For example, each component described as a single unit may be implemented in a distributed manner, and likewise components described as distributed may be implemented in a combined form.

The scope of the present invention is defined by the claims set out below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: speech recognition model learning system
110: communication module
120: memory
130: processor

Claims

A method, performed by a computer, for learning a speech recognition model in a target task domain with sparse speech data based on teacher-student learning, the method comprising:
preparing a large amount of speech data and raw speech data corresponding to the target task domain (hereinafter, target task raw speech data);
generating speech data augmented to correspond to the target task domain (hereinafter, target task augmented speech data) by learning based on the target task raw speech data and the large amount of speech data; and
performing transfer learning on a second speech recognition model that takes the target task augmented speech data as input, based on a first speech recognition model trained in advance with the large amount of speech data as input,
wherein the target task raw speech data is sparse speech data that is relatively small in amount compared with the large amount of speech data.