KR102344218B1 - Speech recognition system and learning method thereof

Info

Publication number: KR102344218B1
Application number: KR1020200108156A
Authority: KR
Inventors: 전재진; 김의성
Original assignee: 주식회사 카카오엔터프라이즈
Priority date: 2020-08-26
Filing date: 2020-08-26
Publication date: 2021-12-28

Abstract

The present invention relates to a transducer-based end-to-end speech recognition system and a learning method thereof. A speech recognition method in an embodiment comprises the steps of: obtaining a first feature vector for searching for a candidate word corresponding to an input speech signal; obtaining, based on a word string already searched for in response to the input speech signal, a second feature vector for searching for a candidate word corresponding to a next word of the word string; obtaining a first classification probability according to a transducer model based on the first feature vector and the second feature vector; obtaining a second classification probability according to a connectionist temporal classification (CTC) model based on the first feature vector; obtaining a third classification probability according to a language model based on the second feature vector; estimating a word corresponding to the speech signal based on the first classification probability, the second classification probability, and the third classification probability; and updating the word string based on the estimated word. In the embodiment, recognition performance can be improved without significantly increasing the amount of computation for speech recognition, by adding separate classifiers to the transducer-based speech recognition system.

Description

Speech recognition system and learning method of speech recognition system {SPEECH RECOGNITION SYSTEM AND LEARNING METHOD THEREOF}

The present disclosure relates to a speech recognition system and a learning method thereof. More specifically, it relates to a transducer-based end-to-end speech recognition system and a learning method thereof.

Speech recognition technology converts a speech signal generated by an utterance into text data for processing, and is also referred to as speech-to-text (STT). As speech recognition makes voice available as a new input method for devices, it is being applied in a variety of technical fields, such as voice-based device control and information retrieval. Recently, with advances in deep-learning-based machine learning, there has been active research on end-to-end speech recognition, which uses an acoustic model composed of a deep neural network to recognize text such as words or sentences directly from speech data, without an intermediate step of analyzing pronunciation from the speech data.

An embodiment may improve speech recognition performance by combining a CTC model and a language model with a transducer-based end-to-end speech recognition system.

An embodiment may apply multi-task learning to the speech recognition system and use the models trained within the system in other types of speech recognizers, thereby providing services that offer extended speech recognition technology.

According to one aspect, a speech recognition method includes: obtaining a first feature vector for searching for a candidate word corresponding to an input speech signal; obtaining, based on a word string already searched for in response to the input speech signal, a second feature vector for searching for a candidate word corresponding to a next word of the word string; obtaining a first classification probability according to a transducer model based on the first feature vector and the second feature vector; obtaining a second classification probability according to a connectionist temporal classification (CTC) model based on the first feature vector; obtaining a third classification probability according to a language model based on the second feature vector; estimating a word corresponding to the speech signal based on the first classification probability, the second classification probability, and the third classification probability; and updating the word string based on the estimated word.

The estimating of the word corresponding to the speech signal may include estimating the word corresponding to the speech signal through beam search, based on the first classification probability, the second classification probability, and the third classification probability.
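
As an illustration of how three classification probabilities could drive one beam search step, the following is a minimal sketch only. The text above does not state how the probabilities are combined, so the log-linear weighting, the weights `w_ctc` and `w_lm`, and the names `fuse`, `beam_step`, and `classify` are all assumptions made for this example. Handling of the transducer blank symbol is omitted, and all probabilities are assumed to be greater than zero.

```python
import math


def fuse(p_transducer, p_ctc, p_lm, w_ctc=0.3, w_lm=0.3):
    """Assumed log-linear combination of three distributions over the same candidates."""
    return {
        word: math.log(p_transducer[word])
        + w_ctc * math.log(p_ctc[word])
        + w_lm * math.log(p_lm[word])
        for word in p_transducer
    }


def beam_step(beams, classify, beam_width=2):
    """Extend every hypothesis by every candidate word and keep the best few.

    `beams` is a list of (word_tuple, score) pairs. `classify` is a hypothetical
    callable that returns the first, second, and third classification
    probabilities for a given word string.
    """
    expanded = []
    for words, score in beams:
        fused = fuse(*classify(words))
        for word, word_score in fused.items():
            expanded.append((words + (word,), score + word_score))
    expanded.sort(key=lambda hypothesis: hypothesis[1], reverse=True)
    return expanded[:beam_width]


def toy_classify(words):
    # Fixed toy distributions: transducer, CTC, language model.
    return ({"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}, {"a": 0.1, "b": 0.9})


best = beam_step([((), 0.0)], toy_classify, beam_width=2)
```

In this toy case the transducer probability alone favors "a", while the fused score ranks "b" first because the language model strongly prefers it.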

The obtaining of the second feature vector may include obtaining the second feature vector based on a blank character when no word string has been searched yet.

The obtaining of the first feature vector may include obtaining the first feature vector by applying the input speech signal to a first neural network that searches for a candidate word corresponding to a speech signal.

The obtaining of the second feature vector may include obtaining the second feature vector by applying the word string already searched for in response to the input speech signal to a second neural network that searches for a candidate word corresponding to a next word of a word string.

The obtaining of the first classification probability may include: obtaining a third feature vector by applying the first feature vector and the second feature vector to a third neural network of the transducer model, which searches for a candidate word corresponding to a speech signal and a word string; and obtaining the first classification probability according to a classification function of the transducer model based on the third feature vector.

According to one aspect, a learning method of a speech recognition system includes: obtaining a first feature vector for searching for a candidate word corresponding to a first speech signal included in first training data by applying the first speech signal to a first neural network; obtaining a second feature vector for searching for a candidate word corresponding to a next word of a first word string by applying the first word string, which corresponds to the first speech signal included in the first training data, to a second neural network; obtaining a first classification probability by applying the first feature vector and the second feature vector to a third neural network; obtaining a first loss with respect to the first classification probability based on a first label corresponding to the first speech signal included in the first training data; obtaining a second loss according to a CTC model based on a second speech signal included in second training data and a second label corresponding to the second speech signal; obtaining a third loss according to a language model based on a second word string included in third training data; and training a transducer model, the CTC model, and the language model based on the first loss, the second loss, and the third loss.

The first neural network may be shared by the transducer model and the CTC model, and the second neural network may be shared by the transducer model and the language model.

The training of the transducer model, the CTC model, and the language model may include: generating a first gradient for each layer of the transducer model based on the first loss; generating a second gradient for each layer of the CTC model based on the second loss; generating a third gradient for each layer of the language model based on the third loss; accumulating the first gradient and the second gradient for each layer of the first neural network; accumulating the first gradient and the third gradient for each layer of the second neural network; and optimizing the parameters of each layer based on the per-layer gradients of the transducer model, the CTC model, and the language model.
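
The accumulation step can be pictured with a small sketch. It assumes that "accumulating" means an unweighted sum of the gradients that reach a shared network, and that the optimizer is plain gradient descent; neither is specified in the text above. The module names, the gradient values, and the functions `accumulate_gradients` and `sgd_step` are hypothetical.

```python
def accumulate_gradients(*per_loss_gradients):
    """Sum gradients module by module across the losses (assumed unweighted)."""
    total = {}
    for gradients in per_loss_gradients:
        for module, grad in gradients.items():
            if module in total:
                total[module] = [a + b for a, b in zip(total[module], grad)]
            else:
                total[module] = list(grad)
    return total


def sgd_step(params, gradients, lr=0.1):
    """Plain gradient descent; modules without a gradient are left unchanged."""
    updated = {}
    for module, values in params.items():
        grad = gradients.get(module, [0.0] * len(values))
        updated[module] = [p - lr * g for p, g in zip(values, grad)]
    return updated


# Hypothetical gradients, one dict per loss, keyed by the module they reach.
# The first loss reaches both shared networks and the third neural network.
transducer_grads = {"first_nn": [1.0], "second_nn": [2.0], "third_nn": [4.0]}
# The second loss reaches the shared first neural network and the CTC classifier.
ctc_grads = {"first_nn": [2.0], "ctc_classifier": [8.0]}
# The third loss reaches the shared second neural network and the LM classifier.
lm_grads = {"second_nn": [1.0], "lm_classifier": [16.0]}

total = accumulate_gradients(transducer_grads, ctc_grads, lm_grads)
params = {module: [1.0] for module in total}
new_params = sgd_step(params, total, lr=0.5)
```

Here the first neural network receives the sum of the first and second gradients, and the second neural network receives the sum of the first and third gradients, while each classifier is updated only from its own loss.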

The obtaining of the second loss may include: obtaining a fourth feature vector by applying the second speech signal to the first neural network, which searches for a candidate word corresponding to a speech signal; obtaining a second classification probability according to the CTC model based on the fourth feature vector; and obtaining the second loss with respect to the second classification probability based on the second label.

The obtaining of the third loss may include: obtaining a fifth feature vector by applying the second word string to the second neural network, which searches for a candidate word corresponding to a next word of a word string; obtaining a third classification probability according to the language model based on the fifth feature vector; and obtaining the third loss with respect to the third classification probability based on the second word string.
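
One way to read the third loss is as a next-word prediction loss over a text-only word string. The sketch below assumes an average negative log-likelihood (cross-entropy), which the text above does not name explicitly. The function `language_model_loss`, the callable `next_word_probs` (standing in for the second neural network followed by the language model classifier), and the use of id 0 for the blank symbol are all hypothetical.

```python
import math


def language_model_loss(word_ids, next_word_probs, blank_id=0):
    """Average negative log-likelihood of each word given the words before it."""
    if not word_ids:
        raise ValueError("word_ids must contain at least one word")
    total = 0.0
    for position, target in enumerate(word_ids):
        # The context always begins with the blank symbol, so the first word
        # is predicted from the blank symbol alone.
        context = [blank_id] + list(word_ids[:position])
        probs = next_word_probs(context)
        total -= math.log(probs[target])
    return total / len(word_ids)


def uniform_probs(context):
    # Toy stand-in: every one of four candidate words is equally likely.
    return [0.25, 0.25, 0.25, 0.25]


loss = language_model_loss([1, 2, 3], uniform_probs)
```

With a uniform distribution over four candidates, the loss equals the natural logarithm of 4 regardless of the word string.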

The first label and the second label may include a single character and a character string composed of a combination of a plurality of characters.

The number of labels included in the second training data may be fewer than the number of labels included in the first training data.

The first classification probability may include a probability that the first speech signal is classified as any one of at least one label included in the first training data.

The second classification probability may include a probability that the second speech signal is classified as any one of at least one label included in the second training data.

According to one aspect, a speech recognition apparatus may include at least one processor configured to: obtain a first feature vector for searching for a candidate word corresponding to an input speech signal; obtain, based on a word string already searched for in response to the input speech signal, a second feature vector for searching for a candidate word corresponding to a next word of the word string; obtain a first classification probability according to a transducer model based on the first feature vector and the second feature vector; obtain a second classification probability according to a connectionist temporal classification (CTC) model based on the first feature vector; obtain a third classification probability according to a language model based on the second feature vector; estimate a word corresponding to the speech signal based on the first classification probability, the second classification probability, and the third classification probability; and update the word string based on the estimated word.

In estimating the word corresponding to the speech signal, the processor may estimate the word through beam search, based on the first classification probability, the second classification probability, and the third classification probability.

In obtaining the second feature vector, the processor may obtain the second feature vector based on a blank character when no word string has been searched yet.

In obtaining the first feature vector, the processor may obtain the first feature vector by applying the input speech signal to a first neural network that searches for a candidate word corresponding to a speech signal.

In obtaining the second feature vector, the processor may obtain the second feature vector by applying the word string already searched for in response to the input speech signal to a second neural network that searches for a candidate word corresponding to a next word of a word string.

In obtaining the first classification probability, the processor may obtain a third feature vector by applying the first feature vector and the second feature vector to a third neural network of the transducer model, which searches for a candidate word corresponding to a speech signal and a word string, and may obtain the first classification probability according to a classification function of the transducer model based on the third feature vector.

According to one aspect, a speech recognition system includes: a first neural network that outputs a first feature vector for searching for a candidate word corresponding to an input speech signal; a second neural network that obtains a second feature vector for searching for a candidate word corresponding to a next word of an input word string; a third neural network that obtains, based on the first feature vector and the second feature vector, a third feature vector for searching for a candidate word corresponding to the speech signal and the word string; a classifier according to a CTC model that outputs a second classification probability based on the first feature vector; a classifier according to a language model that outputs a third classification probability based on the second feature vector; and a classifier according to a transducer model that outputs a first classification probability based on the third feature vector.

The first neural network may be shared by the CTC model and the transducer model, and the second neural network may be shared by the language model and the transducer model.

An embodiment may improve recognition performance without significantly increasing the amount of computation for speech recognition, by adding separate classifiers to a transducer-based speech recognition system.

An embodiment includes an internal language model, enabling efficient speech recognition without the additional computation incurred by using an external language model during decoding.

The structure of the speech recognition system according to an embodiment can be applied regardless of the type of neural network structure used in the system, and can therefore be applied generally to speech recognizers.

FIG. 1 is a diagram illustrating the structure of a transducer-based end-to-end speech recognition system according to an embodiment.
FIG. 2 is a diagram for explaining the structure of a speech recognition system that performs decoding according to an embodiment.
FIG. 3 is a flowchart of a speech recognition method of a speech recognition system according to an embodiment.
FIG. 4 is a flowchart of a learning method of a speech recognition system according to an embodiment.
FIG. 5 is a flowchart of a specific operation for obtaining a loss according to the transducer model.
FIG. 6 is a flowchart of a specific operation for obtaining a loss according to the CTC model.
FIG. 7 is a flowchart of a specific operation for obtaining a loss according to the language model.
FIG. 8A is a diagram for explaining a method of training the transducer model based on a first loss according to an embodiment.
FIG. 8B is a diagram for explaining a method of training the CTC model based on a second loss according to an embodiment.
FIG. 8C is a diagram for explaining a method of training the language model based on a third loss according to an embodiment.
FIG. 9A is a diagram illustrating an example of the operation of a system that provides speech recognition technology including wake-word recognition.
FIG. 9B is a diagram illustrating an example of the structure of a system that provides speech recognition technology including wake-word recognition.

Hereinafter, embodiments are described in detail with reference to the accompanying drawings. However, various changes may be made to the embodiments, so the scope of the patent application is not restricted or limited by these embodiments. All modifications, equivalents, and substitutes for the embodiments should be understood as falling within the scope of the rights.

The terms used in the embodiments are used for the purpose of description only and should not be construed as limiting. A singular expression includes the plural unless the context clearly indicates otherwise. In this specification, terms such as "comprise" or "have" are intended to indicate that a feature, number, step, operation, component, part, or combination thereof described in the specification is present, and should be understood not to preclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the embodiments belong. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art, and should not be interpreted in an idealized or excessively formal sense unless explicitly so defined in the present application.

In addition, in the description with reference to the accompanying drawings, the same components are given the same reference numerals regardless of the drawing, and redundant descriptions of them are omitted. In describing an embodiment, a detailed description of related known technology is omitted when it is judged that it could unnecessarily obscure the gist of the embodiment.

In addition, terms such as first, second, A, B, (a), and (b) may be used in describing the components of an embodiment. These terms serve only to distinguish one component from another, and do not limit the nature, sequence, or order of the components. When a component is described as being "connected", "coupled", or "joined" to another component, it may be directly connected or joined to that other component, but it should be understood that yet another component may be "connected", "coupled", or "joined" between them.

A component that has a function in common with a component included in one embodiment is described using the same name in other embodiments. Unless stated otherwise, a description given for one embodiment also applies to other embodiments, and detailed descriptions are omitted where they would overlap.

FIG. 1 is a diagram illustrating the structure of a transducer-based end-to-end speech recognition system according to an embodiment.

End-to-end speech recognition uses a neural network to recognize text data from a speech signal. A neural network for end-to-end speech recognition may be a neural network trained to receive a speech signal and output the corresponding text label. Such a neural network may have various structures including at least one layer, for example a recurrent neural network (RNN), a bidirectional RNN (BRNN), a long short-term memory (LSTM), a bidirectional LSTM (BLSTM), a gated recurrent unit (GRU), or a bidirectional GRU (BGRU).

A speech recognition system according to an embodiment may receive a speech signal and output text data corresponding to the speech signal. The system may process the input speech signal frame by frame, estimate the text data corresponding to each frame, and concatenate the text data predicted for each frame to output the final text data corresponding to the input speech signal. The text data estimated for each frame of the input speech signal may include a word containing at least one character, and the text data finally output for the input speech signal may include a sequence of words, or word string, containing at least one word. In an embodiment, a word may mean a character string composed of at least one character rather than a word in the dictionary sense.
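
The frame-by-frame flow described above can be sketched as a simple loop. This is an illustrative simplification, not the patented decoder: it assumes at most one word is estimated per frame, that a frame may yield a blank that adds nothing to the word string, and that words (character strings) are joined without a separator. The function `recognize`, the callable `estimate_word`, and the `"<blank>"` token are hypothetical names.

```python
def recognize(frames, estimate_word, blank="<blank>"):
    """Estimate one word per frame and concatenate the non-blank results.

    `estimate_word(frame, word_string)` is a hypothetical callable that returns
    the word estimated for the frame given the words found so far.
    """
    word_string = []
    for frame in frames:
        word = estimate_word(frame, tuple(word_string))
        if word != blank:
            # Update the word string with the newly estimated word.
            word_string.append(word)
    return "".join(word_string)


def toy_estimate(frame, word_string):
    # Toy lookup from a frame index to a word; frame 1 produces no output.
    return {0: "he", 1: "<blank>", 2: "llo"}[frame]


text = recognize([0, 1, 2], toy_estimate)
```

Because a "word" here is any character string of at least one character, the pieces "he" and "llo" join into the final text.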

Referring to FIG. 1, a speech recognition system according to an embodiment may include: a transducer model that receives a speech signal 101 and a word string 102 already searched for in response to the speech signal 101, and outputs a first classification probability 103 for a candidate word corresponding to the speech signal or to the next word of the word string 102; a connectionist temporal classification (CTC) model that receives the speech signal 101 and outputs a second classification probability 104 for a candidate word corresponding to the speech signal; and a language model that receives the word string 102 and outputs a third classification probability 105 for a candidate word corresponding to the next word of the word string.

A word corresponding to a frame of the speech signal according to an embodiment may be estimated based on the first classification probability 103 output from the transducer model, the second classification probability 104 output from the CTC model, and the third classification probability 105 output from the language model. A classification probability according to an embodiment may include probabilities for at least one candidate word determined based on training data, and may be obtained by inputting a feature vector output from a neural network trained on the training data into a classifier trained on the training data.

Referring to FIG. 1, the transducer model according to an embodiment may include a transcription network 110, a prediction network 120, a joint network 130, and a classifier (softmax) 140. The CTC model according to an embodiment may include the transcription network 110, which it shares with the transducer model, and a classifier (CTC classifier) 150. The language model according to an embodiment may include the prediction network 120, which it shares with the transducer model, and a classifier (LM classifier) 160.
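
To make the sharing in FIG. 1 concrete, here is a toy sketch in which two shared networks feed three classifiers. Every detail beyond the wiring is an assumption for illustration: each network is a single random linear layer with `tanh`, the prediction network sees only the previous word rather than the whole word string, word id 0 stands for the blank used when no word has been found yet, and the class name `ToyRecognizer` and the dimensions are hypothetical.

```python
import math
import random


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def linear(weights, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class ToyRecognizer:
    def __init__(self, feat_dim=4, vocab=3, hidden=5, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.transcription = random_matrix(rng, hidden, feat_dim)  # network 110
        self.prediction = random_matrix(rng, hidden, vocab)        # network 120
        self.joint = random_matrix(rng, hidden, 2 * hidden)        # network 130
        self.softmax_head = random_matrix(rng, vocab, hidden)      # classifier 140
        self.ctc_head = random_matrix(rng, vocab, hidden)          # classifier 150
        self.lm_head = random_matrix(rng, vocab, hidden)           # classifier 160

    def forward(self, frame, prev_word_id=0):
        one_hot = [1.0 if i == prev_word_id else 0.0 for i in range(self.vocab)]
        # First and second feature vectors come from the two shared networks.
        f1 = [math.tanh(v) for v in linear(self.transcription, frame)]
        f2 = [math.tanh(v) for v in linear(self.prediction, one_hot)]
        # The joint network combines both into the third feature vector.
        f3 = [math.tanh(v) for v in linear(self.joint, f1 + f2)]
        p1 = softmax(linear(self.softmax_head, f3))  # transducer
        p2 = softmax(linear(self.ctc_head, f1))      # CTC, speech only
        p3 = softmax(linear(self.lm_head, f2))       # language model, text only
        return p1, p2, p3


model = ToyRecognizer()
p1, p2, p3 = model.forward([0.1, 0.2, 0.3, 0.4], prev_word_id=0)
```

The point of the sketch is that the CTC and language model outputs reuse feature vectors the transducer already computes, so each adds only one classifier on top of the shared networks.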

일실시예에 따른 트랜스듀서 모델 및 CTC 모델이 공유하는 전사 네트워크(110)는 음성 신호를 추상화하는 오디오 인코더에 해당할 수 있으며, 음성 신호(101)를 입력 받아 음성 신호(101)에 대응하는 후보 단어를 탐색하기 위한 특징 벡터를 출력하는 신경망을 포함할 수 있다. 일실시예에 따른 전사 네트워크(110)는 RNN, LSTM 등 다양한 신경망 구조로 구성될 수 있으며, 이하에서 제1 신경망으로 지칭될 수 있다. The transcription network 110 shared by the transducer model and the CTC model according to an embodiment may correspond to an audio encoder that abstracts a voice signal, and a candidate corresponding to the voice signal 101 by receiving the voice signal 101 . It may include a neural network that outputs a feature vector for searching for a word. The transcription network 110 according to an embodiment may be configured with various neural network structures such as RNN and LSTM, and may be referred to as a first neural network hereinafter.

일실시예에 따른 트랜스듀서 모델 및 언어 모델이 공유하는 예측 네트워크(120)는 단어열을 추상화하는 레이블 인코더에 해당할 수 있으며, 단어열(102)을 입력 받아 단어열의 다음 단어에 대응하는 후보 단어를 탐색하기 위한 특징 벡터를 출력하는 신경망을 포함할 수 있다. 일실시예에 따른 예측 네트워크(120)는 RNN, LSTM 등 다양한 신경망 구조로 구성될 수 있으며, 이하에서 제2 신경망으로 지칭될 수 있다.The prediction network 120 shared by the transducer model and the language model according to an embodiment may correspond to a label encoder that abstracts a word string, and receives the word string 102 as input and a candidate word corresponding to the next word in the word string. It may include a neural network that outputs a feature vector for searching for . The prediction network 120 according to an embodiment may be configured with various neural network structures such as RNN and LSTM, and may be referred to as a second neural network hereinafter.

The joint network 130 included in the transducer model according to an embodiment may include a neural network that combines the output of the transcription network 110 and the output of the prediction network 120. According to an embodiment, the joint network 130 may include a neural network that, based on the first feature vector output from the transcription network 110 and the second feature vector output from the prediction network 120, outputs a feature vector for searching for a candidate word corresponding to the speech signal 101 and the word string 102. The joint network according to an embodiment may be configured with various neural network structures, and may be referred to as the third neural network.

The transducer model according to an embodiment may receive the speech signal 101 and the word string 102 estimated for the frames preceding the currently input frame of the speech signal, and may output a first classification probability 103 corresponding to the currently input frame (T) of the speech signal. The first classification probability according to an embodiment may correspond to a probability distribution over candidate words corresponding to the input frame of the speech signal. The joint network of the transducer model according to an embodiment may output, based on the first feature vector and the second feature vector, a third feature vector for searching for a candidate word corresponding to the frame of the speech signal and the estimated word string. The classifier of the transducer model according to an embodiment may output, based on the third feature vector, the first classification probability corresponding to the frame of the speech signal.

The CTC model according to an embodiment may receive the speech signal 101 frame by frame and output a second classification probability 104 corresponding to the frame (T) of the speech signal 101. The second classification probability 104 corresponding to a frame of the speech signal according to an embodiment may correspond to a probability distribution over candidate words corresponding to that frame. The transcription network 110, which is the audio encoder of the CTC model according to an embodiment, may receive the speech signal frame by frame and output a first feature vector for searching for a candidate word corresponding to the frame of the speech signal. The classifier of the CTC model according to an embodiment may output, based on the first feature vector, the second classification probability corresponding to the frame of the speech signal.

The language model according to an embodiment may receive a word string and output a third classification probability corresponding to the next word in the word string. The third classification probability corresponding to the next word in the word string according to an embodiment may correspond to a probability distribution over candidate words corresponding to the next word in the word string. The prediction network, which is the label encoder of the language model according to an embodiment, may receive the word string and output a second feature vector for searching for a candidate word corresponding to the next word in the word string. The classifier of the language model according to an embodiment may output, based on the second feature vector, the third classification probability corresponding to the next word in the word string.
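The sharing of the two encoders among the three models described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the class name `SharedEncoderRecognizer`, the random weights, the single tanh layer standing in for each network, and the simplification that the prediction network sees only the last label rather than the whole word string. It is not the patented implementation; it only shows that one pass through networks 110 and 120 can feed all three classifiers.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def linear(weights, vec):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class SharedEncoderRecognizer:
    """Hypothetical toy model: two shared encoders, three classifier heads."""

    def __init__(self, audio_dim, vocab_size, hidden=4, seed=0):
        rng = random.Random(seed)
        self.vocab_size = vocab_size
        self.w_transcription = random_matrix(hidden, audio_dim, rng)  # network 110
        self.w_prediction = random_matrix(hidden, vocab_size, rng)    # network 120
        self.w_joint = random_matrix(hidden, 2 * hidden, rng)         # network 130
        self.w_softmax = random_matrix(vocab_size, hidden, rng)       # classifier 140
        self.w_ctc = random_matrix(vocab_size, hidden, rng)           # classifier 150
        self.w_lm = random_matrix(vocab_size, hidden, rng)            # classifier 160

    def step(self, frame, last_label):
        # First feature vector: computed once, used by transducer and CTC heads.
        f1 = [math.tanh(x) for x in linear(self.w_transcription, frame)]
        # Second feature vector: computed once, used by transducer and LM heads.
        one_hot = [1.0 if i == last_label else 0.0 for i in range(self.vocab_size)]
        f2 = [math.tanh(x) for x in linear(self.w_prediction, one_hot)]
        # Third feature vector: joint network over the concatenated features.
        f3 = [math.tanh(x) for x in linear(self.w_joint, f1 + f2)]
        p1 = softmax(linear(self.w_softmax, f3))  # first classification probability
        p2 = softmax(linear(self.w_ctc, f1))      # second classification probability
        p3 = softmax(linear(self.w_lm, f2))       # third classification probability
        return p1, p2, p3


model = SharedEncoderRecognizer(audio_dim=3, vocab_size=5)
p1, p2, p3 = model.step(frame=[0.2, -0.1, 0.7], last_label=0)
```

Because `f1` and `f2` are each computed once per step, the CTC and language model heads add only one extra linear layer and softmax each, which is the efficiency argument the description makes for sharing the encoders.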

The speech recognition system according to an embodiment may estimate the word string corresponding to the input speech signal through a decoding process based on the first classification probability, the second classification probability, and the third classification probability. According to an embodiment, the decoding process may correspond to a process of estimating the word string corresponding to the input speech signal by combining the candidate words corresponding to each frame, based on the probabilities of the candidate words corresponding to each frame of the speech signal. According to an embodiment, the decoding algorithm may include a beam search algorithm, a greedy algorithm, and the like. The decoding process according to an embodiment is described in detail below with reference to FIG. 2.

Because the speech recognition system according to an embodiment includes a language model internally, it can use the classification probability of the internal language model in the decoding process of estimating the word string corresponding to the speech signal, without needing an external language model. Because the language model according to an embodiment shares the label encoder with the transducer model, the computation for a separate language model encoding step can be omitted during decoding, which enables an efficient decoding operation.

FIG. 2 is a diagram for explaining the structure of a speech recognition system that performs decoding according to an embodiment.

Referring to FIG. 2, the speech recognition system according to an embodiment may include a decoder 170 that performs a decoding process based on the first classification probability 103, the second classification probability 104, and the third classification probability 105. The decoder 170 according to an embodiment may perform a decoding process that searches for the word string corresponding to the speech signal up to the current frame, by estimating the word corresponding to the current frame of the input speech signal 101 based on the first, second, and third classification probabilities, and by updating, based on the estimated word, the word string retrieved for the previous frames of the speech signal.

For example, the decoding process may correspond to a process of searching, among the word strings (Y) formed by combining the candidate words corresponding to each frame of the speech signal (X), for the word string that maximizes Equation 1 below.

In Equation 1, P_Trans.(Y|X) may denote the probability of the word string (Y) corresponding to the speech signal (X) according to the transducer model, P_CTC(Y|X) the probability of the word string (Y) corresponding to the speech signal (X) according to the CTC model, and P_LM(Y) the probability of the word string (Y) according to the language model. According to an embodiment, P_Trans.(Y|X) may be determined based on the first classification probability of the candidate word corresponding to each frame of the speech signal (X), P_CTC(Y|X) may be determined based on the second classification probability of the candidate word corresponding to each frame of the speech signal (X), and P_LM(Y) may be determined based on the third classification probability of the candidate word corresponding to the next word in the word string. In Equation 1, β₁, β₂, and β₃ are constants that may be set differently depending on which model is given more weight in estimating the word string.
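The body of Equation 1 is not reproduced in this text. One form that would be consistent with the symbol definitions above is a weighted log-linear combination of the three probabilities. The use of logarithms and the symbol \hat{Y} for the maximizing word string are assumptions made here for illustration and are not confirmed by the source:

```latex
\hat{Y} = \operatorname*{arg\,max}_{Y} \Big\{ \beta_1 \log P_{\mathrm{Trans.}}(Y \mid X) + \beta_2 \log P_{\mathrm{CTC}}(Y \mid X) + \beta_3 \log P_{\mathrm{LM}}(Y) \Big\}
```

Under this assumed form, setting β₂ = β₃ = 0 would reduce decoding to the transducer model alone.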

According to an embodiment, the decoding method may be a multi-pass decoding method that obtains each of the probabilities P_Trans.(Y|X), P_CTC(Y|X), and P_LM(Y) separately, or a one-pass decoding method that uses the second classification probability according to the CTC model and the third classification probability according to the language model in order to obtain P_Trans.(Y|X). The multi-pass decoding method according to an embodiment may correspond, for example, to a method of obtaining P_Trans.(Y|X), P_CTC(Y|X), and P_LM(Y) for each candidate word string (Y), and estimating, as the word string corresponding to the input speech signal, the candidate word string for which the result of the operation on P_Trans.(Y|X), P_CTC(Y|X), and P_LM(Y) is highest.
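The multi-pass method can be sketched as a rescoring step over complete candidate word strings. The sketch below assumes that the three probabilities are combined as a β-weighted sum of log-probabilities; the function name `multi_pass_select`, the dictionary keys, and the example numbers are all hypothetical.

```python
import math


def multi_pass_select(candidates, betas):
    """Return the candidate with the highest weighted log-probability sum."""
    b1, b2, b3 = betas

    def score(candidate):
        # Assumed combination rule: beta-weighted sum of log-probabilities.
        return (b1 * math.log(candidate["p_trans"])
                + b2 * math.log(candidate["p_ctc"])
                + b3 * math.log(candidate["p_lm"]))

    return max(candidates, key=score)


# Made-up probabilities for two complete candidate word strings.
candidates = [
    {"text": "recognize speech", "p_trans": 0.40, "p_ctc": 0.30, "p_lm": 0.20},
    {"text": "wreck a nice beach", "p_trans": 0.45, "p_ctc": 0.25, "p_lm": 0.01},
]

best = multi_pass_select(candidates, betas=(1.0, 0.5, 0.5))
transducer_only = multi_pass_select(candidates, betas=(1.0, 0.0, 0.0))
```

With these numbers the transducer alone prefers the second candidate, while the combined score prefers the first because the language model assigns the second a very low probability.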

The one-pass decoding method according to an embodiment may correspond, for example, to a method of reflecting the second classification probability according to the CTC model and the third classification probability according to the language model in the first classification probability according to the transducer model to obtain the first classification probability for each frame of the speech signal, obtaining, based on the per-frame first classification probabilities, P_Trans.(Y|X) for each candidate word string (Y) formed by combining the per-frame candidate words, and estimating the word string with the highest P_Trans.(Y|X) as the word string corresponding to the input speech signal.
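A single step of the one-pass method can be sketched as fusing the three per-step distributions into one adjusted distribution before a label is chosen. The fusion rule below (a weighted sum of log-probabilities followed by renormalization) is an assumption, as are the function name `fuse_step`, the toy vocabulary, and the numbers. A real system would also need to treat the blank label specially, since a language model has no blank; that detail is omitted here.

```python
import math


def fuse_step(p_trans, p_ctc, p_lm, betas):
    """Fold CTC and LM probabilities into the transducer distribution."""
    b1, b2, b3 = betas
    scores = [
        b1 * math.log(a) + b2 * math.log(b) + b3 * math.log(c)
        for a, b, c in zip(p_trans, p_ctc, p_lm)
    ]
    # Renormalize so the fused scores form a probability distribution.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


vocab = ["<blank>", "a", "b"]
p_trans = [0.2, 0.5, 0.3]  # first classification probability (made up)
p_ctc = [0.1, 0.3, 0.6]    # second classification probability (made up)
p_lm = [0.2, 0.2, 0.6]     # third classification probability (made up)

fused = fuse_step(p_trans, p_ctc, p_lm, betas=(1.0, 0.5, 0.5))
best_label = vocab[max(range(len(fused)), key=fused.__getitem__)]
```

Here the transducer distribution alone would pick "a", but the fused distribution picks "b". A beam search would keep several such labels per step instead of only the best one.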

According to an embodiment, the word string retrieved for the current frame of the speech signal may be input to the prediction network 120. The speech recognition system according to an embodiment may finally output text data 106 corresponding to the input speech signal 101. According to an embodiment, when the speech signal 101 is first input, that is, when speech recognition is performed on the first frame of the input speech signal, information indicating that there is no previously retrieved word string, such as a blank character, may be input to the prediction network 120.

FIG. 3 is a flowchart of a speech recognition method performed by the speech recognition system according to an embodiment.

Referring to FIG. 3, the speech recognition method according to an embodiment includes: obtaining a first feature vector for searching for a candidate word corresponding to an input speech signal (310); obtaining, based on a word string already retrieved for the input speech signal, a second feature vector for searching for a candidate word corresponding to the next word in the word string (320); obtaining a first classification probability according to the transducer model based on the first feature vector and the second feature vector (330); obtaining a second classification probability according to the CTC model based on the first feature vector (340); obtaining a third classification probability according to the language model based on the second feature vector (350); and estimating a word corresponding to the speech signal based on the first, second, and third classification probabilities and updating the word string (360).

Step 310 according to an embodiment may include obtaining the first feature vector by applying the input speech signal to the first neural network. In other words, step 310 according to an embodiment may include obtaining the first feature vector by inputting the speech signal, frame by frame, to the transcription network shared by the transducer model and the CTC model.

Step 320 according to an embodiment may include obtaining the second feature vector by applying the word string already retrieved for the input speech signal to the second neural network. In other words, step 320 according to an embodiment may include obtaining the second feature vector by inputting, to the prediction network shared by the transducer model and the language model, the word string retrieved for the speech signal up to the frame preceding the frame currently input to the transcription network.

Step 330 according to an embodiment may include obtaining a third feature vector by applying the first feature vector obtained in step 310 and the second feature vector obtained in step 320 to the third neural network, and obtaining the first classification probability according to the classification function of the transducer model based on the third feature vector. In other words, step 330 according to an embodiment may include obtaining the third feature vector by inputting the first feature vector and the second feature vector to the joint network of the transducer model, and obtaining the first classification probability by inputting the third feature vector to the classification function of the transducer model.

Step 360 according to an embodiment may include estimating a word corresponding to the speech signal based on the first, second, and third classification probabilities, and updating the word string based on the estimated word. Step 360 according to an embodiment may correspond to performing decoding using an algorithm such as beam search.

FIG. 4 is a flowchart of a learning method of the speech recognition system according to an embodiment.

Referring to FIG. 4, the learning method of the speech recognition system according to an embodiment includes: obtaining a first loss according to the transducer model (410); obtaining a second loss according to the CTC model (420); obtaining a third loss according to the language model (430); and training the speech recognition system based on the first loss, the second loss, and the third loss (440). Training the speech recognition system according to an embodiment may mean training the artificial neural network in the speech recognition system. According to an embodiment, the artificial neural network may be trained by various learning algorithms, which may include, for example, generating gradients for the layers of the artificial neural network according to a loss function and optimizing the parameters of the layers according to the gradients.

The detailed operation of step 410 according to an embodiment is described with reference to FIG. 5. Referring to FIG. 5, obtaining the first loss according to the transducer model may include: obtaining a first feature vector by applying a speech signal included in the training data to the first neural network shared with the CTC model (510); obtaining a second feature vector by applying the word string corresponding to the speech signal included in the training data to the second neural network shared with the language model (520); obtaining a first classification probability by applying the first feature vector and the second feature vector to the third neural network (530); and obtaining a first loss for the first classification probability based on the label corresponding to the speech signal included in the training data (540). The training data used to train the transducer model according to an embodiment may correspond to a speech-text corpus including speech signals and the corresponding word strings. Hereinafter, the training data used to train the transducer model according to an embodiment may be referred to as the first training data.

Referring again to FIG. 4, step 420 according to an embodiment may include obtaining the second loss according to the CTC model based on a speech signal included in the training data and the corresponding label. The detailed operation of step 420 according to an embodiment is described with reference to FIG. 6. Referring to FIG. 6, obtaining the second loss according to the CTC model may include: obtaining a fourth feature vector by applying the speech signal included in the training data to the first neural network (610); obtaining a second classification probability according to the CTC model based on the fourth feature vector (620); and obtaining a second loss for the second classification probability based on a second label (630). The training data used to train the CTC model according to an embodiment may correspond to a speech-text corpus including speech signals and the corresponding word strings. Hereinafter, the training data used to train the CTC model according to an embodiment may be referred to as the second training data.

The first training data and the second training data according to an embodiment may correspond to a speech and text corpus including speech signals and the corresponding text labels. A text label according to an embodiment may correspond to a class into which a speech signal is classified. A text label according to an embodiment may include a single character or a word piece including at least one character. A text label according to an embodiment may also include a blank. The blank label according to an embodiment may correspond to a label used to align input data and output data, for example when the alignment between the input data and the output data is unknown or when the input data and the output data differ in length. For example, when a speech signal is recognized frame by frame and converted into text, a plurality of frames may correspond to a single character. In this case, the blank label may be used to separate frames that correspond to different characters.

For example, when the consecutive first to third frames of a speech signal correspond to “a”, the blank label may be used to distinguish the case where the speech signal including the first to third frames corresponds to “aa” from the case where it corresponds to “a”. That is, when the character string corresponding to the speech signal is “a”, the first to third frames may each be trained to output the “a” label, and when the character string corresponding to the speech signal is “aa”, the first frame may be trained to output the “a” label, the second frame the blank label, and the third frame the “a” label.
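The “a” versus “aa” example above follows the usual CTC collapsing rule: merge consecutive repeated labels, then drop blanks. A minimal sketch, in which the function name `collapse` and the use of `"_"` as the blank symbol are illustrative choices:

```python
from itertools import groupby

BLANK = "_"  # illustrative symbol for the blank label


def collapse(frame_labels):
    """Merge consecutive repeated labels, then remove blank labels."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


three_a_frames = collapse(["a", "a", "a"])        # three frames, one character
a_blank_a_frames = collapse(["a", BLANK, "a"])    # blank separates two characters
```

Without the blank between them, the two “a” frames in the second case would be merged into one character, which is why the blank label is needed to express repeated characters.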

According to an embodiment, the types of text labels included in the first training data may differ from the types of text labels included in the second training data. For example, the number of labels included in the second training data may be smaller than the number of labels included in the first training data. This is described in detail below.

Referring again to FIG. 4, step 430 according to an embodiment may include obtaining the third loss according to the language model based on a word string included in the training data. The detailed operation of step 430 according to an embodiment is described with reference to FIG. 7. Referring to FIG. 7, obtaining the third loss according to the language model may include: obtaining a fifth feature vector by applying the word string included in the training data to the second neural network (710); obtaining a third classification probability according to the language model based on the fifth feature vector (720); and obtaining a third loss for the third classification probability based on the word string included in the training data (730). The training data used to train the language model according to an embodiment may correspond to a text corpus including word strings. Hereinafter, the training data used to train the language model according to an embodiment may be referred to as the third training data.

The third training data, which includes only text data, may be much larger than the first training data or the second training data, which include speech signals and the corresponding text labels. A language model, which can be trained by self-supervised or unsupervised learning without labeling of the training data, can be trained using the large third training data, so the speech recognition system according to an embodiment can include a sophisticated language model.

According to an embodiment, the first loss, the second loss, and the third loss may be used to train the parameters of the layers constituting the neural networks included in the speech recognition system. The loss used to train the speech recognition system according to an embodiment may be expressed as Equation 2 below.

Here, L_total denotes the total loss for training the speech recognition system, L_Trans. denotes the first loss, L_CTC denotes the second loss, and L_LM denotes the third loss. In Equation 2, α₁, α₂, and α₃ are constants that may be set differently depending on which model is given more weight in training the speech recognition system.
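The body of Equation 2 is not reproduced in this text. Given the symbol definitions above, a weighted sum of the three losses would be a consistent form; this reconstruction is an assumption and not confirmed by the source:

```latex
L_{\mathrm{total}} = \alpha_1 L_{\mathrm{Trans.}} + \alpha_2 L_{\mathrm{CTC}} + \alpha_3 L_{\mathrm{LM}}
```

Under this assumed form, a larger α₃, for instance, would make the gradient of the language model loss contribute more to the updates of the shared prediction network.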

The detailed operation of training the transducer model based on the first loss according to an embodiment is described with reference to FIG. 8A. Referring to FIG. 8A, the first training data may include a first speech signal (X) and a first word string (Y) corresponding to the first speech signal (X); the first speech signal (X) may be input to the transcription network of the transducer model, and the first word string (Y) may be input to the prediction network. According to an embodiment, the first speech signal may be input to the transcription network frame by frame, and the first word string input to the prediction network may correspond to the word string corresponding to the speech signal up to the frame preceding the frame currently input to the transcription network. The first label according to an embodiment may correspond to the text (or word) label corresponding to the frame currently input to the transcription network.

According to an embodiment, the first loss may correspond to a loss function determined based on the first label corresponding to the first speech signal and on the first classification probability obtained by applying, to the third neural network of the transducer model, the first feature vector output for the first speech signal and the second feature vector output for the first word string. The first loss according to an embodiment may be used to train the parameters of the transcription network so that it outputs, for the first speech signal, a first feature vector that is classified as the first label; to train the parameters of the prediction network so that it outputs, for the first word string, a second feature vector that is classified as the first label; and to train the parameters of the layers constituting the third neural network of the transducer model so that it outputs, based on the first feature vector and the second feature vector, a classification probability that is classified as the first label.

The detailed operation of training the CTC model based on the second loss according to an embodiment is described with reference to FIG. 8B. Referring to FIG. 8B, the second training data may include a second speech signal (X') and a second label (Z) corresponding to the second speech signal (X'). According to an embodiment, the second speech signal (X') may be input to the transcription network, and may be input frame by frame. The second label (Z) according to an embodiment may correspond to the text (or word) label corresponding to the frame currently input to the transcription network.

According to an embodiment, the second loss may correspond to a loss function determined based on the second classification probability obtained by inputting the second speech signal included in the second training data to the CTC model, and on the second label corresponding to the second speech signal included in the second training data. The second label according to an embodiment may correspond to text data labeled so as to correspond to the second speech signal. The second loss according to an embodiment may be used to train the transcription network and the classifier of the CTC model so that, for the second speech signal, they output a fourth feature vector that is classified as the second label. More specifically, the parameters of each of the layers constituting the first neural network and the CTC classifier may be trained so that a classification probability classified as the second label is output based on the fourth feature vector.
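The description does not spell out how the second loss is computed from per-frame probabilities. As an illustration only, the sketch below uses the standard CTC forward algorithm, which sums the probability of every frame-level labeling that collapses to the target label sequence, and takes the negative log of that sum as the loss. The function names, the choice of index 0 for the blank label, and the toy probabilities are assumptions.

```python
import math

BLANK = 0  # assumed index of the blank label


def ctc_probability(frame_probs, labels):
    """Total probability of all frame labelings that collapse to `labels`."""
    # Interleave blanks around the labels: [a, b] -> [_, a, _, b, _].
    ext = [BLANK]
    for label in labels:
        ext += [label, BLANK]

    # alpha[s]: probability of having reached position s of `ext` so far.
    alpha = [0.0] * len(ext)
    alpha[0] = frame_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = frame_probs[0][ext[1]]

    for probs in frame_probs[1:]:
        prev = alpha
        alpha = [0.0] * len(ext)
        for s, sym in enumerate(ext):
            total = prev[s]                      # stay on the same symbol
            if s >= 1:
                total += prev[s - 1]             # advance by one position
            if s >= 2 and sym != BLANK and sym != ext[s - 2]:
                total += prev[s - 2]             # skip the blank in between
            alpha[s] = total * probs[sym]

    # A valid labeling ends on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if len(ext) > 1 else 0.0)


def ctc_loss(frame_probs, labels):
    """Negative log-likelihood of the label sequence."""
    return -math.log(ctc_probability(frame_probs, labels))


# Three frames over the vocabulary {0: blank, 1: "a"}, uniform probabilities.
frame_probs = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
p_a = ctc_probability(frame_probs, [1])      # target "a"
p_aa = ctc_probability(frame_probs, [1, 1])  # target "aa"
```

With uniform probabilities, six of the eight possible three-frame labelings collapse to "a" and only one ("a", blank, "a") collapses to "aa", so `p_a` is 0.75 and `p_aa` is 0.125.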

For the detailed operation of training the language model based on the third loss according to an embodiment, refer to FIG. 8C. Referring to FIG. 8C, the third training data may include a second word string (Y'). According to an embodiment, the second word string (Y') may be input to the prediction network.

According to an embodiment, the third loss may be a loss function determined from the second word string and the third classification probability, which is obtained by inputting the second word string included in the third training data to the language model. According to an embodiment, the second word string included in the third training data need not undergo a labeling process to obtain a corresponding label. Instead, based on the word string included in the third training data, the word that follows the second word string may be treated as the label corresponding to the second word string. The third loss according to an embodiment may be used to train the prediction network and the classifier of the language model to output, in response to the second word string, a fifth feature vector that is classified as the next word of the second word string. More specifically, the parameters of each of the layers constituting the second neural network and the LM classifier may be trained to output, based on the fifth feature vector, the classification probability of being classified as the next word of the second word string.
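The following minimal sketch illustrates how an unlabeled word string can supply its own labels. It assumes a cross-entropy loss, which the text does not specify, and `predict` is a hypothetical stand-in for the prediction network plus LM classifier. For simplicity it also starts from a one-word context rather than from an empty one.

```python
import math

def next_word_examples(word_string):
    """Turn an unlabeled word string into (context, next word) pairs."""
    return [(word_string[:i], word_string[i]) for i in range(1, len(word_string))]

def lm_loss(word_string, predict):
    """Average cross-entropy of `predict` over the next-word pairs.

    predict(context) is a hypothetical callable that returns a dict mapping
    each candidate word to its probability (the third classification
    probability in the text).
    """
    pairs = next_word_examples(word_string)
    if not pairs:
        return 0.0  # a single word provides no next-word target
    total = -sum(math.log(predict(ctx)[nxt]) for ctx, nxt in pairs)
    return total / len(pairs)
```

Because the targets come from the text itself, any plain text corpus can serve as the third training data without manual labeling.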

According to an embodiment, training using the first training data, training using the second training data, and training using the third training data may be performed sequentially. For example, in the transducer model according to an embodiment, the parameters of the layers constituting the neural networks of the transducer model may be trained as training using the second training data and training using the third training data proceed in sequence.

Referring again to FIG. 4, the step 440 of training the speech recognition system according to an embodiment may include generating a first gradient for each layer of the transducer model based on the first loss, generating a second gradient for each layer of the CTC model based on the second loss, and generating a third gradient for each layer of the language model based on the third loss. According to an embodiment, the per-layer gradient of the first neural network, which is shared by the transducer model and the CTC model, may be generated by accumulating the first gradient and the second gradient. Likewise, the per-layer gradient of the second neural network, which is shared by the transducer model and the language model, may be generated by accumulating the first gradient and the third gradient. According to an embodiment, the step 440 of training the speech recognition system may further include optimizing the per-layer parameters based on the per-layer gradients of the transducer model, the CTC model, and the language model.
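The accumulation of gradients on the shared networks can be sketched as follows. This is an illustration only: each layer is reduced to a single scalar parameter, the gradient values are made up, the layer names are hypothetical, and plain gradient descent stands in for whatever optimizer an implementation would use.

```python
def accumulate(*grad_dicts):
    """Sum per-layer gradients; a layer absent from a dict contributes nothing."""
    total = {}
    for grads in grad_dicts:
        for layer, g in grads.items():
            total[layer] = total.get(layer, 0.0) + g
    return total

def sgd_step(params, grads, lr=0.1):
    """Update each layer's parameter with its accumulated gradient."""
    return {layer: p - lr * grads.get(layer, 0.0) for layer, p in params.items()}

# One scalar per layer stands in for a full parameter tensor (assumption).
params = {"first_network": 1.0, "second_network": 1.0, "third_network": 1.0,
          "ctc_classifier": 1.0, "lm_classifier": 1.0}

# Hypothetical gradients from the first, second, and third losses.
g1 = {"first_network": 0.2, "second_network": 0.1, "third_network": 0.3}  # transducer
g2 = {"first_network": 0.4, "ctc_classifier": 0.5}                        # CTC
g3 = {"second_network": 0.6, "lm_classifier": 0.7}                        # language model

grads = accumulate(g1, g2, g3)
new_params = sgd_step(params, grads)
```

Here the first network receives the sum of the first and second gradients, and the second network receives the sum of the first and third gradients, matching the sharing arrangement described above.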

The models included in the trained speech recognition system according to an embodiment may be applied to speech recognizers. Each model may be applied to a different type of speech recognizer and used to provide services related to speech recognition technology. For example, a wake word recognizer (wakeup word detector), which detects a specific preset word, and a speech recognizer, which outputs a speech signal as the corresponding text data, are different types of speech recognizers, and a different model may be applied to each.

As described above, the training data for training the CTC model and the training data for training the transducer model may include text labels corresponding to speech signals. For example, in the case of English, the text labels may be individual characters such as a and b, or they may be word pieces, each of which is a combination of several characters.

Depending on the characteristics of the recognizers to which the CTC model and the transducer model according to an embodiment are applied, the size of the label set may be determined differently for the training data of the CTC model and the training data of the transducer model. For example, when the CTC model is applied to a wake word recognizer, word pieces that are likely to appear in everyday speech are unlikely to be used as a wake word, so it is advantageous to keep the number of text labels in the CTC model's training data small. In contrast, when the transducer model is applied to a general speech recognizer, it is advantageous to make the number of text labels in its training data large.
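As a rough illustration of the two label inventories, the sketch below segments text with a small character-level label set and with a larger set that also contains word pieces. The vocabularies and the greedy longest-match segmentation are assumptions chosen for brevity and do not come from the specification.

```python
def tokenize(text, vocab):
    """Greedy longest-match segmentation of `text` into labels from `vocab`."""
    labels, i = [], 0
    max_len = max(len(v) for v in vocab)
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                labels.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no label covers {text[i]!r}")
    return labels

# Small label set, e.g. for a CTC-based wake word recognizer (hypothetical).
char_vocab = set("abcdefghijklmnopqrstuvwxyz")
# Larger label set with word pieces, e.g. for a transducer-based recognizer.
piece_vocab = char_vocab | {"he", "hello", "ing", "play", "the"}
```

With these sets, `tokenize("playing", char_vocab)` yields seven single-character labels, while `tokenize("playing", piece_vocab)` yields the two word pieces `play` and `ing`.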

For example, in an artificial intelligence speaker that provides both wake word recognition and general speech recognition, the CTC model and the transducer model may be used for the different types of speech recognition.

FIG. 9A illustrates an example of the operating process of a system that provides speech recognition including wake word recognition. Referring to FIG. 9A, the operating process may include receiving a speech signal containing a wake word and performing wake word recognition 910; when the wake word is recognized, performing speech recognition 920 and named entity recognition 930 on the speech signal excluding the wake word; and outputting the recognition result.

FIG. 9B illustrates an example of the structure of a system that provides speech recognition including wake word recognition. Referring to FIG. 9B, wake word recognition may be performed by the CTC model 940, speech recognition of the speech signal excluding the wake word may be performed by the transducer model 950, and named entity recognition of the speech signal may be performed by the CTC model 960.
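The flow of FIGS. 9A and 9B can be sketched as a small pipeline. The three callables are hypothetical stand-ins for the recognizers named in the figures, and representing the speech signal as a Python list is a simplification for illustration.

```python
def recognize(audio, detect_wake_word, transcribe, tag_entities):
    """Run wake word recognition, then speech and named entity recognition.

    detect_wake_word: stand-in for the CTC-based wake word recognizer (940);
                      returns the index just past the wake word, or None.
    transcribe:       stand-in for the transducer-based recognizer (950).
    tag_entities:     stand-in for the named entity recognition stage (960).
    Returns None when no wake word is recognized.
    """
    end = detect_wake_word(audio)
    if end is None:
        return None
    command = audio[end:]          # speech signal excluding the wake word
    text = transcribe(command)
    return {"text": text, "entities": tag_entities(command, text)}
```

Passing the recognizers in as arguments keeps the sketch independent of any particular model implementation.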

According to an embodiment, because misrecognition occurs frequently when the wake word is a word commonly used in everyday life, the wake word is set to a word that occurs rarely in everyday life. Wake word recognition therefore has little relevance to a language model, which estimates the word that follows a given word string based on the probabilities of the language. Accordingly, a speech recognizer based on a transducer model, which includes a prediction network, is unsuitable as a wake word recognizer. A speech recognizer based on a CTC model, by contrast, estimates the text corresponding to the speech signal without a language model taking part in the classification, and is therefore suitable as a wake word recognizer. According to an embodiment, a speech recognizer based on a CTC model may be used as a wake word recognizer according to Equation 3 below.

In other words, when the classification probability, according to the CTC model, that the speech signal (X) corresponds to the wake word (Wakeup_word) is greater than or equal to a threshold (θ), the speech signal (X) is determined to correspond to the wake word. In this way, a speech recognizer based on the CTC model can be used as a wake word recognizer.
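Equation 3 itself is not reproduced in this text. Based solely on the verbal description above, the decision rule presumably takes a form such as the following, where the symbol $P_{\mathrm{CTC}}$ is notation assumed here and not confirmed by the source:

```latex
\[
  P_{\mathrm{CTC}}(\text{Wakeup\_word} \mid X) \;\ge\; \theta
  \quad\Longrightarrow\quad
  X \text{ is determined to correspond to the wake word}
\]
```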

In addition, in the speech recognition process, a named entity (NE) may be used as information for grasping context such as place and time. Because a named entity is not recognized according to the probabilities of the language, it has little relevance to the information provided by a language model. Therefore, for named entity recognition, the named entity portion may be extracted from the speech recognition result using NER, and the speech recognition result according to the CTC model may then be used for the speech signal corresponding to that portion.
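One way to picture this substitution is the sketch below. It assumes, purely for illustration, that the transducer and CTC hypotheses are already aligned word by word and that NER returns index spans. The specification does not state how the alignment is obtained.

```python
def merge_entity_spans(transducer_words, ctc_words, entity_spans):
    """Replace named entity spans in the transducer output with CTC output.

    entity_spans holds (start, end) word-index pairs, end exclusive, as a
    hypothetical NER stage might return them. Both hypotheses are assumed to
    have the same length and word alignment.
    """
    merged = list(transducer_words)
    for start, end in entity_spans:
        merged[start:end] = ctc_words[start:end]
    return merged
```

The words outside the entity spans keep the transducer result, which benefits from the prediction network, while the entity spans take the CTC result, which does not depend on language probabilities.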

The method according to the embodiments may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiments, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like. The hardware devices described above may be configured to operate as one or more software modules to perform the operations of the embodiments, and vice versa.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may command the processing device independently or collectively. Software and/or data may be embodied, permanently or temporarily, in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave, in order to be interpreted by the processing device or to provide instructions or data to the processing device. The software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

Although the embodiments have been described above with reference to a limited set of drawings, a person of ordinary skill in the art may apply various technical modifications and variations based on the above. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or the described components of the system, structure, apparatus, circuit, and the like are combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the following claims.

Claims

1. A speech recognition method comprising:
obtaining a first feature vector for searching for a candidate word corresponding to an input speech signal;
obtaining, based on a word string already retrieved in response to the input speech signal, a second feature vector for searching for a candidate word corresponding to a next word of the word string;
obtaining a first classification probability according to a transducer model based on the first feature vector and the second feature vector;
obtaining a second classification probability according to a connectionist temporal classification (CTC) model based on the first feature vector;
obtaining a third classification probability according to a language model based on the second feature vector;
estimating a word corresponding to the speech signal based on the first classification probability, the second classification probability, and the third classification probability; and
updating the word string based on the estimated word.

2. The method of claim 1, wherein estimating the word corresponding to the speech signal comprises:
estimating the word corresponding to the speech signal through a beam search based on the first classification probability, the second classification probability, and the third classification probability.

3. The method of claim 1, wherein obtaining the second feature vector comprises:
obtaining the second feature vector based on a blank character when there is no already retrieved word string.

4. The method of claim 1, wherein obtaining the first feature vector comprises:
obtaining the first feature vector by applying the input speech signal to a first neural network that searches for a candidate word corresponding to a speech signal.

5. The method of claim 1, wherein obtaining the second feature vector comprises:
obtaining the second feature vector by applying the word string already retrieved in response to the input speech signal to a second neural network that searches for a candidate word corresponding to a next word of a word string.

6. The method of claim 1, wherein obtaining the first classification probability comprises:
obtaining a third feature vector by applying the first feature vector and the second feature vector to a third neural network of the transducer model, which searches for a candidate word corresponding to a speech signal and a word string; and
obtaining the first classification probability according to a classification function of the transducer model based on the third feature vector.

7. A method of training a speech recognition system, the method comprising:
obtaining a first feature vector for searching for a candidate word corresponding to a first speech signal by applying the first speech signal, included in first training data, to a first neural network;
obtaining a second feature vector for searching for a candidate word corresponding to a next word of a first word string by applying the first word string, which corresponds to the first speech signal included in the first training data, to a second neural network;
obtaining a first classification probability by applying the first feature vector and the second feature vector to a third neural network of a transducer model;
obtaining a first loss with respect to the first classification probability based on a first label corresponding to the first speech signal included in the first training data;
obtaining a second loss according to a CTC model based on a second speech signal included in second training data and a second label corresponding to the second speech signal;
obtaining a third loss according to a language model based on a second word string included in third training data; and
training the transducer model, the CTC model, and the language model based on the first loss, the second loss, and the third loss.

8. The method of claim 7, wherein
the first neural network is shared by the transducer model and the CTC model, and
the second neural network is shared by the transducer model and the language model.

9. The method of claim 7, wherein training the transducer model, the CTC model, and the language model comprises:
generating a first gradient for each layer of the transducer model based on the first loss;
generating a second gradient for each layer of the CTC model based on the second loss;
generating a third gradient for each layer of the language model based on the third loss;
accumulating the first gradient and the second gradient for each layer of the first neural network;
accumulating the first gradient and the third gradient for each layer of the second neural network; and
optimizing the parameters of each layer based on the per-layer gradients of the transducer model, the CTC model, and the language model.

10. The method of claim 7, wherein obtaining the second loss comprises:
obtaining a fourth feature vector by applying the second speech signal to the first neural network, which searches for a candidate word corresponding to a speech signal;
obtaining a second classification probability according to the CTC model based on the fourth feature vector; and
obtaining the second loss with respect to the second classification probability based on the second label.

11. The method of claim 7, wherein obtaining the third loss comprises:
obtaining a fifth feature vector by applying the second word string to the second neural network, which searches for a candidate word corresponding to a next word of a word string;
obtaining a third classification probability according to the language model based on the fifth feature vector; and
obtaining the third loss with respect to the third classification probability based on the second word string.

12. The method of claim 7, wherein the first label and the second label
include a single character and a character string composed of a combination of a plurality of characters.

13. The method of claim 7, wherein
the number of labels included in the second training data is smaller than the number of labels included in the first training data.

14. The method of claim 7, wherein the first classification probability
includes a probability that the first speech signal is classified as any one of at least one label included in the first training data.

15. The method of claim 10, wherein the second classification probability
includes a probability that the second speech signal is classified as any one of at least one label included in the second training data.

16. A computer program stored on a medium to execute, in combination with hardware, the method of any one of claims 1 to 15.

17. A speech recognition apparatus comprising at least one processor configured to:
obtain a first feature vector for searching for a candidate word corresponding to an input speech signal;
obtain, based on a word string already retrieved in response to the input speech signal, a second feature vector for searching for a candidate word corresponding to a next word of the word string;
obtain a first classification probability according to a transducer model based on the first feature vector and the second feature vector;
obtain a second classification probability according to a connectionist temporal classification (CTC) model based on the first feature vector;
obtain a third classification probability according to a language model based on the second feature vector;
estimate a word corresponding to the speech signal based on the first classification probability, the second classification probability, and the third classification probability; and
update the word string based on the estimated word.

18. The apparatus of claim 17, wherein the processor,
in estimating the word corresponding to the speech signal,
estimates the word corresponding to the speech signal through a beam search based on the first classification probability, the second classification probability, and the third classification probability.

19. The apparatus of claim 17, wherein the processor,
in obtaining the second feature vector,
obtains the second feature vector based on a blank character when there is no already retrieved word string.

20. The apparatus of claim 17, wherein the processor,
in obtaining the first feature vector,
obtains the first feature vector by applying the input speech signal to a first neural network that searches for a candidate word corresponding to a speech signal.

21. The apparatus of claim 17, wherein the processor,
in obtaining the second feature vector,
obtains the second feature vector by applying the word string already retrieved in response to the input speech signal to a second neural network that searches for a candidate word corresponding to a next word of a word string.

22. The apparatus of claim 17, wherein the processor,
in obtaining the first classification probability,
obtains a third feature vector by applying the first feature vector and the second feature vector to a third neural network of the transducer model, which searches for a candidate word corresponding to a speech signal and a word string, and
obtains the first classification probability according to a classification function of the transducer model based on the third feature vector.

23. A speech recognition system comprising:
a first neural network that outputs a first feature vector for searching for a candidate word corresponding to an input speech signal;
a second neural network that obtains a second feature vector for searching for a candidate word corresponding to a next word of an input word string;
a third neural network that obtains, based on the first feature vector and the second feature vector, a third feature vector for searching for a candidate word corresponding to the speech signal and the word string;
a classifier according to a CTC model that outputs a second classification probability based on the first feature vector;
a classifier according to a language model that outputs a third classification probability based on the second feature vector; and
a classifier according to a transducer model that outputs a first classification probability based on the third feature vector.

24. The system of claim 23, wherein
the first neural network is shared by the CTC model and the transducer model, and
the second neural network is shared by the language model and the transducer model.