KR102410850B1

KR102410850B1 - Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder

Info

Publication number: KR102410850B1
Application number: KR1020200103273A
Authority: KR
Inventors: 김형순; 박순찬
Original assignee: 부산대학교 산학협력단
Priority date: 2020-08-18
Filing date: 2020-08-18
Publication date: 2022-06-20
Also published as: KR20220022286A

Abstract

A method and apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder are presented. According to an embodiment, the method may include: generating reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB); performing multiple linear prediction on the generated reverberant speech and outputting multiple linear prediction results; estimating weights for a weighted sum of the multiple linear prediction results; estimating final dereverberated speech from the multiple linear prediction results and the estimated weights; and extracting a reverberant environment embedding vector from the weight estimation model that estimates the weights.

Description

Method and apparatus for extracting reverberant environment embedding using dereverberation autoencoder

The following embodiments relate to a method and apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder and, more particularly, to such a method and apparatus for improving recognition performance on speech uttered in a reverberant environment.

When a sound signal, including speech, propagates, it travels not only along the direct path but is also absorbed and reflected by walls, ceilings, floors, and other objects, which produces reverberation. The resulting reverberant components reduce speech intelligibility, and automatic speech recognition (ASR) systems likewise show degraded recognition performance on heavily reverberant speech. Methods for improving recognition of reverberant speech fall broadly into two groups: methods that remove the reverberation effect directly from the speech, and model adaptation methods that train the ASR model to recognize reverberant speech well as it is.

Dereverberation improves intelligibility or speech recognition performance by removing the reverberation effect from the input speech. Linear prediction-based dereverberation is a representative approach: it removes the reverberation effect through a linear combination of consecutive values along the time axis. It rests on the fact that reverberant speech can be modeled as the convolution of speech with a room impulse response (RIR). The method is known to perform well when applied to each frequency bin of a short-time Fourier transform (STFT) domain spectrogram, particularly on multi-channel speech collected with a microphone array. Its drawback is that the coefficients of the linear prediction filter must be re-estimated for every target utterance by an iterative algorithm.

A dereverberation autoencoder is a kind of deep neural network (DNN) trained to output dereverberated speech when reverberant speech is input (Non-Patent Document 1). Training requires pairs of reverberant and reverberation-free speech, but collecting such pairs from real environments at scale is difficult. Reverberant speech is therefore often generated artificially by convolving speech recorded in a quiet environment with an RIR, and paired with the original speech as training data. Unlike the linear prediction-based approach, the model is trained in advance on data from a variety of reverberant environments, so no additional parameters need to be estimated for the target speech at dereverberation time. However, a DNN regression model trained with an L1 or L2 loss function captures the overall trend well but fails to preserve fine detail. This distorts the spectrogram, and in some cases speech recognition performance actually gets worse even though the reverberation effect has been removed.

The most widely known model adaptation method for reverberant speech recognition is multi-condition training (MCT) through data augmentation. Training the acoustic model on speech recorded in a variety of environments improves recognition across more environments, but in practice it is difficult to collect enough such recordings. Instead, as in generating training data for a dereverberation autoencoder, reverberant speech can be synthesized from clean speech and RIRs, and the resulting multi-environment training data can be used to improve recognition performance in reverberant environments.

Another model adaptation method uses the R-vector, an auxiliary feature that carries reverberation information (Non-Patent Document 2). It derives from speaker adaptation, which improves speech recognition performance by using the x-vector, a representation of speaker information extracted from an utterance, as an auxiliary feature. The x-vector is extracted from a hidden layer of a DNN trained to classify which speaker uttered the input speech. Similarly, the R-vector is extracted from a model trained to classify which RIR produced a reverberant utterance generated by convolution, and so captures information about the reverberant environment. Using the R-vector as an auxiliary feature of an existing acoustic model can improve reverberant speech recognition performance.

(Non-Patent Document 1) X. Feng, Y. Zhang and J. Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, 2014, pp. 1759-1763, doi: 10.1109/ICASSP.2014.6853900.
(Non-Patent Document 2) Y. Khokhlov, A. Zatvornitskiy, I. Medennikov et al., "R-vectors: New technique for adaptation to room acoustics," INTERSPEECH, 2019.

The embodiments describe a method and apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder. More specifically, they provide a technique that extracts a reverberant environment embedding using a dereverberation autoencoder and improves speech recognition performance in reverberant environments through a model adaptation method that uses the embedding as an auxiliary feature.

A method for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment may include: generating reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB); performing multiple linear prediction on the generated reverberant speech and outputting multiple linear prediction results; estimating weights for a weighted sum of the multiple linear prediction results; estimating final dereverberated speech from the multiple linear prediction results and the estimated weights; and extracting a reverberant environment embedding vector from the weight estimation model that estimates the weights.

The method may further include training the entire model using a loss function between the estimated final dereverberated speech and the original speech before reverberation was added.

The outputting of the multiple linear prediction results may include converting the reverberant speech into a mel spectrogram, inputting it to a linear prediction model, applying zero padding, and passing it through a convolutional neural network (CNN) composed of linear prediction coefficients to output the multiple linear prediction results.

The estimating of the weights may include converting the reverberant speech into a log mel spectrogram, inputting it to the weight estimation model, applying zero padding, passing it through a convolutional neural network to obtain an output, and applying a softmax function to the values belonging to each time-frequency index to convert them into weights that lie between 0 and 1 and sum to 1.

The estimating of the final dereverberated speech may include performing element-wise multiplication of the multiple linear prediction results and the estimated weights to obtain, for each time-frequency index, the linear prediction results multiplied by their weights, and then summing, at each time-frequency index, the C values corresponding to the C sets of linear prediction coefficients to obtain the estimate of the final dereverberated speech.

The extracting of the reverberant environment embedding vector may include converting a target speech signal into a log mel spectrogram, inputting it to the weight estimation model to obtain an output, averaging the frame values of the obtained weights for each channel-frequency pair to obtain a matrix, and rearranging the elements of the matrix to extract the reverberant environment embedding vector.

The training of the entire model using the loss function may include converting the obtained final dereverberated speech and the original speech from the speech database (DB) into log mel spectrograms, and then repeating neural network training by the backpropagation algorithm until the loss function between the two log mel spectrograms converges.

An apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder according to another embodiment may include: a reverberant speech generator that generates reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB); a linear prediction model that performs multiple linear prediction on the generated reverberant speech and outputs multiple linear prediction results; a weight estimation model that estimates weights for a weighted sum of the multiple linear prediction results; a dereverberated speech estimator that estimates final dereverberated speech from the multiple linear prediction results and the estimated weights; and an embedding extractor that extracts a reverberant environment embedding vector from the weight estimation model that estimates the weights.

The apparatus may further include a dereverberation autoencoder training unit that trains the entire model using a loss function between the estimated final dereverberated speech and the original speech before reverberation was added.

The embedding extractor may convert a target speech signal into a log mel spectrogram, input it to the weight estimation model to obtain an output, average the frame values of the obtained weights for each channel-frequency pair to obtain a matrix, and rearrange the elements of the matrix to extract the reverberant environment embedding vector.

According to the embodiments, a method and apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder can be provided that extract a reverberant environment embedding using the dereverberation autoencoder and improve speech recognition performance in reverberant environments through a model adaptation method that uses the embedding as an auxiliary feature.

According to the embodiments, a method and apparatus can be provided that use a dereverberation autoencoder in place of a room impulse response (RIR) classifier, so that the characteristics of each RIR and of the relationships among RIRs are reflected naturally in the embedding.

In addition, according to the embodiments, the output of the dereverberation autoencoder is not applied directly to the input features of the acoustic model. Instead, an embedding vector is extracted and used as an auxiliary input, which compensates for the autoencoder's limitation of introducing distortion because it cannot restore fine detail.

FIG. 1 is a diagram illustrating an electronic device according to embodiments.
FIG. 2 is a block diagram illustrating an apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment.
FIG. 3 is a flowchart illustrating a method for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment.

Hereinafter, embodiments are described with reference to the accompanying drawings. The described embodiments may, however, be modified into various other forms, and the scope of the present invention is not limited to the embodiments described below. The embodiments are provided to explain the present invention more completely to those of ordinary skill in the art. The shapes and sizes of elements in the drawings may be exaggerated for clarity.

The following embodiments address a method that extracts a reverberant environment embedding using a dereverberation autoencoder and improves speech recognition performance in reverberant environments through a model adaptation method that uses the embedding as an auxiliary feature. In the conventional R-vector approach, the reverberant environment embedding comes from classifying which of tens of thousands of room impulse responses (RIRs) produced the reverberant speech, which takes no account of how the RIRs relate to one another. Because of the nature of a DNN classifier trained with cross entropy, pairs of RIRs that are very similar and pairs that differ greatly are all treated the same way. The proposed method instead replaces the RIR classifier with a dereverberation autoencoder, so that the characteristics of each RIR and of the relationships among RIRs are reflected naturally. In addition, the output of the dereverberation autoencoder is not applied directly to the input features of the acoustic model. An embedding vector is extracted and used as an auxiliary input instead, which compensates for the autoencoder's limitation of introducing distortion because it cannot restore fine detail.

The structure of the dereverberation autoencoder can be built by adopting and extending the linear prediction-based dereverberation method. Instead of estimating linear prediction parameters for each target utterance, a plurality of sets of linear prediction parameters are placed inside the DNN model, and the final result is obtained as their weighted sum. This keeps linear prediction, which reflects the characteristics of reverberant speech well, while allowing dereverberation to be performed with a pre-trained model and without estimating parameters every time. The weights for the weighted sum are estimated by a separate network and take a different value at each time and frequency index of the spectrogram. This gives more flexibility than the conventional linear prediction method, in which the parameters are fixed within an utterance. Averaging the weights over time yields a fixed-length vector that does not depend on the length of the utterance, and this vector is used as the reverberant environment embedding vector.

FIG. 1 is a diagram illustrating an electronic device according to embodiments.

Referring to FIG. 1, an electronic device 100 according to embodiments may include at least one of an input module 110, an output module 120, a memory 130, or a processor 140.

The input module 110 may receive, from outside the electronic device 100, commands or data to be used by components of the electronic device 100. The input module 110 may include at least one of an input device configured to let a user enter commands or data directly into the electronic device 100, or a communication device configured to receive commands or data by communicating with an external electronic device over a wired or wireless connection. For example, the communication device may include at least one of a wired communication device or a wireless communication device, and the wireless communication device may include at least one of a short-range communication device or a long-range communication device.

The output module 120 may provide information to the outside of the electronic device 100. The output module 120 may include at least one of an audio output device configured to output information audibly, a display device configured to output information visually, or a communication device configured to transmit information by communicating with an external electronic device over a wired or wireless connection. For example, the communication device may include at least one of a wired communication device or a wireless communication device, and the wireless communication device may include at least one of a short-range communication device or a long-range communication device.

The memory 130 may store data used by components of the electronic device 100. The data may include input data or output data for a program or for instructions related to it. For example, the memory 130 may include at least one of a volatile memory or a non-volatile memory.

The processor 140 may execute a program in the memory 130 to control components of the electronic device 100 and to process data or perform computations. The processor 140 may include a reverberant speech generator, a linear prediction model, a weight estimation model, a dereverberated speech estimator, and an embedding extractor, and, depending on the embodiment, may further include a dereverberation autoencoder training unit. With these, the processor 140 may extract a reverberant environment embedding using a dereverberation autoencoder.

FIG. 2 is a block diagram illustrating an apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment.

Referring to FIG. 2, an apparatus 200 for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment may include a reverberant speech generator 210, a linear prediction model 220, a weight estimation model 230, a dereverberated speech estimator 240, and an embedding extractor 260. Depending on the embodiment, the apparatus 200 may further include a dereverberation autoencoder training unit 250. The apparatus 200 may be included in the processor 140 of FIG. 1.

The reverberant speech generator 210 may generate reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB).
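The generation step can be sketched as follows, based on the convolution model described in the background: a clean utterance is convolved with an RIR drawn from the RIR database. This is a minimal NumPy sketch and not the patent's implementation. The function name `make_reverberant`, the random choice of RIR, the truncation to the original length, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def make_reverberant(speech, rirs, rng):
    """Convolve a clean waveform with one randomly chosen RIR.

    speech: 1-D array of clean samples.
    rirs:   list of 1-D arrays standing in for the RIR database.
    rng:    numpy random Generator used to pick the RIR.
    Returns (reverberant waveform, index of the RIR used).
    """
    idx = int(rng.integers(len(rirs)))
    wet = np.convolve(speech, rirs[idx], mode="full")
    # Assumption: keep the original length so the pair stays time-aligned.
    return wet[: len(speech)], idx

# Toy stand-ins for the speech DB and the RIR DB (hypothetical data).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
rir_db = [np.array([1.0, 0.0, 0.5]), np.array([1.0, 0.3, 0.2, 0.1])]
reverberant, used = make_reverberant(clean, rir_db, rng)
```

The returned pair of `clean` and `reverberant` waveforms corresponds to one training example for the autoencoder.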

The linear prediction model 220 may perform multiple linear prediction on the generated reverberant speech and output multiple linear prediction results. More specifically, the reverberant speech may be converted into a mel spectrogram and input to the linear prediction model 220, where zero padding is applied and the result is passed through a convolutional neural network (CNN) composed of linear prediction coefficients to output the multiple linear prediction results.
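One way to read this step is as a convolution along the time axis whose kernels are the linear prediction coefficients, with one kernel per coefficient set. The sketch below illustrates that reading in NumPy. The array layout, the number of taps `K`, the causal zero padding, and the use of a separate filter per mel bin are assumptions, since the text here does not fix them.

```python
import numpy as np

def multi_linear_prediction(mel, coeffs):
    """Apply C sets of linear prediction coefficients along the time axis.

    mel:    (T, F) mel spectrogram of the reverberant speech.
    coeffs: (C, K, F) array; coeffs[c, k, f] weights the frame k steps
            in the past for coefficient set c and mel bin f (assumed layout).
    Returns a (C, T, F) array, one prediction result per coefficient set.
    """
    T, F = mel.shape
    C, K, _ = coeffs.shape
    # Zero padding at the start so the output keeps T frames (assumed causal).
    padded = np.concatenate([np.zeros((K - 1, F)), mel], axis=0)
    out = np.zeros((C, T, F))
    for k in range(K):
        start = K - 1 - k  # row of `padded` that holds mel[t - k] for t = 0
        out += coeffs[:, k, None, :] * padded[None, start:start + T, :]
    return out

rng = np.random.default_rng(0)
mel = rng.random((50, 40))                          # T=50 frames, F=40 bins
coeffs = rng.standard_normal((8, 5, 40)) * 0.1      # C=8 sets, K=5 taps
predictions = multi_linear_prediction(mel, coeffs)  # shape (8, 50, 40)
```

In a trained model the coefficients would be learned by backpropagation rather than drawn at random as they are here.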

The weight estimation model 230 may estimate weights for a weighted sum of the multiple linear prediction results. More specifically, the reverberant speech may be converted into a log mel spectrogram and input to the weight estimation model 230, where zero padding is applied and the result is passed through a convolutional neural network to obtain an output. A softmax function is then applied to the values belonging to each time-frequency index to convert them into weights that lie between 0 and 1 and sum to 1.
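The final normalization can be illustrated with a short NumPy sketch. The convolutional network itself is omitted, and random values stand in for its output. The `(C, T, F)` layout and the name `softmax_over_sets` are assumptions.

```python
import numpy as np

def softmax_over_sets(logits):
    """Softmax across axis 0 (the C coefficient sets) at every (t, f)."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 50, 40))  # placeholder for the CNN output, (C, T, F)
weights = softmax_over_sets(logits)
```

At every time-frequency index the C weights are non-negative and sum to 1, so the later weighted sum is a convex combination of the C prediction results.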

The dereverberated speech estimator 240 may estimate the final dereverberated speech from the multiple linear prediction results and the estimated weights. More specifically, the dereverberated speech estimator 240 may perform element-wise multiplication of the multiple linear prediction results and the estimated weights to obtain, for each time-frequency index, the linear prediction results multiplied by their weights, and then sum, at each time-frequency index, the C values corresponding to the C sets of linear prediction coefficients to obtain the estimate of the final dereverberated speech.
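As a sketch of this combination step, assuming both inputs are stored as `(C, T, F)` arrays (a layout chosen here for illustration), the element-wise product followed by a sum over the C sets can be written as:

```python
import numpy as np

def combine_predictions(predictions, weights):
    """Element-wise multiply, then sum over the C coefficient sets.

    predictions, weights: arrays of shape (C, T, F).
    Returns the (T, F) dereverberated estimate.
    """
    return (predictions * weights).sum(axis=0)

rng = np.random.default_rng(0)
predictions = rng.random((8, 50, 40))
weights = np.full((8, 50, 40), 1.0 / 8)   # uniform weights, for illustration only
estimate = combine_predictions(predictions, weights)
```

With uniform weights the estimate reduces to the plain average of the C results. Weights that vary per time-frequency index let the model favor different coefficient sets in different regions of the spectrogram.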

The dereverberation autoencoder training unit 250 may train the entire model using a loss function between the estimated final dereverberated speech and the original speech before reverberation was added. More specifically, the dereverberation autoencoder training unit 250 may convert the obtained final dereverberated speech and the original speech from the speech database (DB) into log mel spectrograms, and then repeat neural network training by the backpropagation algorithm until the loss function between the two log mel spectrograms converges.
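The text does not say which loss function is used between the two log mel spectrograms. The sketch below assumes mean squared error purely for illustration and computes only the loss value. The backpropagation loop would need an automatic differentiation framework and is left out. The name `log_mel_loss` and the `eps` floor are hypothetical.

```python
import numpy as np

def log_mel_loss(estimated_mel, clean_mel, eps=1e-8):
    """Mean squared error between two log mel spectrograms (assumed loss).

    estimated_mel, clean_mel: (T, F) non-negative mel spectrograms.
    eps floors the values before the log (an assumed safeguard).
    """
    log_est = np.log(np.maximum(estimated_mel, eps))
    log_ref = np.log(np.maximum(clean_mel, eps))
    return float(np.mean((log_est - log_ref) ** 2))

rng = np.random.default_rng(0)
clean_mel = rng.random((50, 40)) + 0.1
estimated_mel = clean_mel * 1.2          # a deliberately imperfect estimate
loss = log_mel_loss(estimated_mel, clean_mel)
```

Training would repeat parameter updates of both the linear prediction coefficients and the weight estimation network until this value stops decreasing.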

The embedding extractor 260 may extract a reverberant environment embedding vector from the weight estimation model 230 that estimates the weights. More specifically, the embedding extractor 260 converts a target speech signal into a log mel spectrogram and inputs it to the weight estimation model 230 to obtain an output, averages the frame values of the obtained weights for each channel-frequency pair to obtain a matrix, and rearranges the elements of the matrix to extract the reverberant environment embedding vector.

The method for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment is described in more detail below, using the apparatus 200 for extracting a reverberant environment embedding according to an embodiment as an example.

FIG. 3 is a flowchart illustrating a method for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment.

Referring to FIG. 3, the method for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment may include: generating reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB) (S110); performing multiple linear prediction on the generated reverberant speech and outputting multiple linear prediction results (S120); estimating weights for a weighted sum of the multiple linear prediction results (S130); estimating final dereverberated speech from the multiple linear prediction results and the estimated weights (S140); and extracting a reverberant environment embedding vector from the weight estimation model 230 that estimates the weights (S160).

The method may further include training the entire model through a loss function between the estimated final dereverberated speech and the original speech before reverberation was added (S150).

Each step of the method for extracting a reverberant environment embedding using a dereverberation autoencoder according to an embodiment is described below by way of example.

In step S110, the reverberant speech generator 210 may generate reverberant speech for training from the speech database (DB) and the room impulse response (RIR) database (DB).

That is, this step generates reverberant speech from the speech DB and the RIR DB. The speech DB consists of speech recorded in a quiet environment free of noise and reverberation. The RIR DB consists of RIRs recorded in real reverberant environments or RIRs generated in a virtual space by the image method. Reverberant speech is generated by convolving an RIR with speech: each RIR is convolved with each of M utterances randomly selected from the speech DB, producing M reverberant utterances. When reverberant speech is generated for the entire RIR DB, which consists of N RIRs, the total number of reverberant utterances is N × M.
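The data-generation step described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of the embodiment: the function name `make_reverberant_set`, the `seed` parameter, the toy signals, and the choice to sample the M utterances without replacement are all hypothetical.

```python
import numpy as np

def make_reverberant_set(speech_db, rir_db, m, seed=0):
    """For each RIR, convolve it with m randomly chosen clean utterances."""
    rng = np.random.default_rng(seed)
    reverberant = []
    for rir in rir_db:
        # Assumption: the m utterances for one RIR are distinct.
        picks = rng.choice(len(speech_db), size=m, replace=False)
        for idx in picks:
            # Full linear convolution: length is len(speech) + len(rir) - 1.
            reverberant.append(np.convolve(speech_db[idx], rir))
    return reverberant

# Toy data standing in for real recordings (hypothetical values).
rng = np.random.default_rng(1)
speech_db = [rng.standard_normal(160) for _ in range(5)]
rir_db = [np.exp(-0.1 * np.arange(20)) * rng.standard_normal(20) for _ in range(3)]

# N = 3 RIRs and M = 2 utterances per RIR give N * M = 6 reverberant utterances.
reverberant_set = make_reverberant_set(speech_db, rir_db, m=2)
```

With real data, `speech_db` and `rir_db` would hold waveforms at the same sampling rate; the sketch does not resample or normalize levels.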

In step S120, the linear prediction model 220 may perform multiple linear prediction on the generated reverberant speech and output multiple linear prediction results.

That is, the linear prediction model 220 is a model that outputs multiple linear prediction results for the input reverberant speech. The input reverberant speech signal is first converted, through a short-time Fourier transform (STFT) and mel filterbank filtering, into a K x N mel spectrogram consisting of K frequency bins and N frames. The linear prediction model 220 consists of C sets of linear prediction coefficients, each of length L. With these, C different linear predictions are performed for each of the K frequency bins, and a C x K x N mel spectrogram output is finally obtained. This process can be expressed by the following equation.

[Equation 1]

$$y_{c,k,n} = \sum_{l=0}^{L-1} g_{c,l}\, x_{k,n-l}$$

Here, $x$ is the mel spectrogram of the input reverberant speech, $g_{c}$ is the $c$-th set of linear prediction coefficients, and $y_{c,k,n}$ is the $c$-th linear prediction result at frequency bin $k$ and frame $n$. This process can be implemented as an artificial neural network structure that applies appropriate zero padding to the input spectrogram and passes it through a convolution layer whose kernels are the linear prediction coefficients.
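The multiple linear prediction step can be sketched in NumPy as a convolution along the frame axis. This is only an illustration under assumptions that the text does not confirm: the C coefficient sets are shared across all K frequency bins, the prediction is causal (current and past frames only), and the zero padding is L - 1 frames on the left. The function name `multi_linear_prediction` and the toy values are hypothetical.

```python
import numpy as np

def multi_linear_prediction(x, g):
    """x: (K, N) mel spectrogram; g: (C, L) coefficient sets -> (C, K, N)."""
    K, N = x.shape
    C, L = g.shape
    # Zero-pad L - 1 frames on the left so every output frame sees L inputs.
    x_pad = np.pad(x, ((0, 0), (L - 1, 0)))
    y = np.zeros((C, K, N))
    for l in range(L):
        # Slice of the padded input that equals x delayed by l frames.
        delayed = x_pad[:, L - 1 - l : L - 1 - l + N]
        # Tap l of every coefficient set scales the delayed input.
        y += g[:, l][:, None, None] * delayed[None, :, :]
    return y

# Toy example: K = 2 bins, N = 4 frames, C = 2 sets of length L = 3.
x = np.arange(8.0).reshape(2, 4)
g = np.array([[1.0, 0.0, 0.0],   # set 0: identity (tap at delay 0)
              [0.0, 1.0, 0.0]])  # set 1: pure one-frame delay
y = multi_linear_prediction(x, g)
```

In a deep learning framework the same operation would typically be a single convolution layer with C output channels and a 1 x L kernel, so that the coefficients are learned jointly with the rest of the model.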

In step S130, the weight estimation model 230 may estimate the weights for a weighted sum of the multiple linear prediction results.

That is, the weight estimation model 230 is a model that estimates the weights for the weighted sum of the multiple linear prediction results. Its input is the logarithm of the K x N mel spectrogram obtained in the same way as for the linear prediction model 220. Appropriate zero padding is applied to this input, and it is passed through convolutional neural networks to obtain a C x K x N output. A softmax function is then applied to the C values belonging to each time-frequency index, converting them into weights that lie between 0 and 1 and sum to 1.
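The softmax normalization at the output of the weight estimation model can be sketched as below. The convolutional network itself is not specified in the text, so random values stand in for its C x K x N output; the function name `softmax_over_channels` and the array sizes are hypothetical.

```python
import numpy as np

def softmax_over_channels(logits):
    """logits: (C, K, N) network output -> weights that sum to 1 over C."""
    # Subtract the per-index maximum for numerical stability.
    shifted = logits - logits.max(axis=0, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)

# Stand-in for the CNN output: C = 4, K = 3, N = 5 (hypothetical sizes).
logits = np.random.default_rng(0).standard_normal((4, 3, 5))
weights = softmax_over_channels(logits)
```

Because the softmax runs over the C axis only, each time-frequency index (k, n) gets its own set of C weights.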

In step S140, the dereverberated speech estimator 240 may estimate the final dereverberated speech from the multiple linear prediction results and the estimated weights.

That is, this step obtains the final dereverberated speech from the linear prediction results and the estimated weights. The outputs of the linear prediction model 220 in step S120 and of the weight estimation model 230 in step S130 both have dimensions C x K x N. Multiplying them element-wise gives a result with the same C x K x N dimensions. Since this is the product of linear prediction result and weight at each time-frequency index, summing the C values at each time-frequency index finally yields the estimate of the dereverberated speech, expressed as a K x N spectrogram. This can be written as the following equation.

[Equation 2]

$$\hat{s}_{k,n} = \sum_{c=1}^{C} w_{c,k,n}\, y_{c,k,n}$$

Here, $\hat{s}_{k,n}$ is the estimate of the dereverberated speech, $w_{c,k,n}$ is the weight obtained from the weight estimation model 230, $y_{c,k,n}$ is the linear prediction result obtained from the linear prediction model 220, and $c$, $k$, and $n$ denote the linear prediction, frequency, and frame indices, respectively.
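The weighted sum of step S140 reduces to one element-wise product and one sum over the prediction axis. A minimal NumPy sketch, with the function name `combine_predictions` and the toy arrays chosen for illustration only:

```python
import numpy as np

def combine_predictions(w, y):
    """w, y: (C, K, N) weights and predictions -> (K, N) estimate."""
    # Element-wise product, then sum over the C predictions.
    return (w * y).sum(axis=0)

# Toy check: with uniform weights 1/C the result is the mean prediction.
rng = np.random.default_rng(0)
y = rng.standard_normal((3, 4, 6))
w = np.full((3, 4, 6), 1.0 / 3)
s_hat = combine_predictions(w, y)
```

In the described model the weights differ per time-frequency index, so the combination can favor a different coefficient set in each region of the spectrogram.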

In step S150, the dereverberation autoencoder training unit 250 may train the entire model through a loss function between the estimated final dereverberated speech and the original speech before reverberation was added.

That is, this step trains the model by minimizing the difference between the estimated dereverberated speech and the original speech before reverberation was added. The original speech from the speech DB that the reverberant speech generator 210 used to synthesize the reverberant speech is converted into a K x N log mel spectrogram. The logarithm of the output of the dereverberated speech estimator 240 is also taken, and the L1 loss between the two log mel spectrograms is computed. Mini-batches are formed from the training data, and neural network training by backpropagation is repeated until the loss obtained in this way converges.
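The training objective can be sketched as an L1 distance between two log mel spectrograms. The text does not say how the loss is reduced or how the logarithm is kept finite, so averaging over all bins and the small constant `eps` are assumptions, as is the function name `log_mel_l1_loss`.

```python
import numpy as np

def log_mel_l1_loss(estimate_mel, clean_mel, eps=1e-8):
    """L1 loss between log mel spectrograms of shape (K, N)."""
    # eps guards against log(0); its value is an assumption.
    log_est = np.log(estimate_mel + eps)
    log_clean = np.log(clean_mel + eps)
    # Assumption: the loss is averaged over all K * N entries.
    return np.mean(np.abs(log_est - log_clean))

# Toy check: scaling the spectrogram by e shifts its log by about 1.
clean = np.full((3, 4), 2.0)
loss_same = log_mel_l1_loss(clean, clean)
loss_scaled = log_mel_l1_loss(clean * np.e, clean)
```

In actual training this value would be computed per mini-batch inside an automatic differentiation framework so that gradients reach both the linear prediction coefficients and the weight estimation network.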

In step S160, the embedding extractor 260 may extract a reverberant environment embedding vector from the weight estimation model 230 that estimates the weights.

That is, this step obtains the reverberant environment embedding vector from the output of the trained weight estimation model 230. The target speech signal is converted into a log mel spectrogram, which is fed to the weight estimation model 230 to obtain an output. From the resulting C x K x N weights, the N frame values are averaged for each channel-frequency pair to obtain a C x K matrix. Rearranging the elements of this matrix gives a vector of dimension C × K, which is the reverberant environment embedding vector. The process of extracting the embedding vector from the weights can be expressed by the following equation.

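[Equation 3]

$$e_{i} = \frac{1}{N} \sum_{n=1}^{N} w_{c,k,n}$$

Here, $w_{c,k,n}$ is the weight obtained from the weight estimation model 230, and $i$ is the index of the reverberant environment embedding vector $e$, with each $i = 1, \dots, C \cdot K$ corresponding to one channel-frequency pair $(c, k)$.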
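Extracting the embedding amounts to a mean over the frame axis followed by flattening. In the NumPy sketch below, the row-major flattening order (index i = c * K + k with zero-based c and k) is an assumption, since the text only says the matrix elements are rearranged; the function name `extract_embedding` and the toy weights are hypothetical.

```python
import numpy as np

def extract_embedding(w):
    """w: (C, K, N) weights -> embedding vector of length C * K."""
    # Average over the N frames, then flatten the (C, K) matrix.
    return w.mean(axis=2).reshape(-1)

# Two utterances of different length (N = 10 and N = 25 frames)
# yield embeddings of the same fixed length C * K = 12.
e_short = extract_embedding(np.full((4, 3, 10), 0.25))
e_long = extract_embedding(np.full((4, 3, 25), 0.25))
```

The time average is what makes the embedding length independent of utterance length, which lets it be appended to acoustic model inputs as a fixed-size auxiliary feature.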

As described above, according to the embodiments, using a dereverberation autoencoder instead of a room impulse response (RIR) classifier allows the characteristics of each RIR, and of the relationships among RIRs, to be reflected naturally.

In addition, according to the embodiments, rather than applying the dereverberation autoencoder directly to the input features of the acoustic model, an embedding vector is extracted from it and used as an auxiliary input. This compensates for the limitation that the autoencoder cannot restore fine detail and therefore introduces distortion.

The apparatus described above may be implemented as hardware components, software components, and/or a combination of hardware and software components. For example, the apparatus and components described in the embodiments may be implemented using one or more general-purpose or special-purpose computers, such as a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable array (FPA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. A processing device may run an operating system (OS) and one or more software applications that run on the operating system. A processing device may also access, store, manipulate, process, and generate data in response to execution of the software. For ease of understanding, a single processing device is sometimes described, but a person of ordinary skill in the art will recognize that a processing device may include a plurality of processing elements and/or a plurality of types of processing elements. For example, a processing device may include a plurality of processors, or one processor and one controller. Other processing configurations, such as parallel processors, are also possible.

Software may include a computer program, code, instructions, or a combination of one or more of these, and may configure a processing device to operate as desired or may instruct the processing device independently or collectively. Software and/or data may be embodied in any type of machine, component, physical device, virtual equipment, or computer storage medium or device in order to be interpreted by a processing device or to provide instructions or data to a processing device. Software may be distributed over networked computer systems and stored or executed in a distributed manner. Software and data may be stored on one or more computer-readable recording media.

The method according to an embodiment may be implemented in the form of program instructions executable by various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, and the like, alone or in combination. The program instructions recorded on the medium may be specially designed and configured for the embodiment, or may be known and available to those skilled in computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tape; optical media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices specially configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include not only machine code such as that produced by a compiler, but also high-level language code that can be executed by a computer using an interpreter or the like.

Although the embodiments have been described above with reference to limited embodiments and drawings, a person of ordinary skill in the art can make various modifications and variations from the above description. For example, appropriate results may be achieved even if the described techniques are performed in an order different from the described method, and/or components of the described systems, structures, devices, circuits, and the like are coupled or combined in a form different from the described method, or are replaced or substituted by other components or equivalents.

Therefore, other implementations, other embodiments, and equivalents of the claims also fall within the scope of the claims below.

Claims

A method for extracting a reverberant environment embedding using a dereverberation autoencoder, the method comprising:
generating reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB);
performing, through a linear prediction model, multiple linear prediction on the generated reverberant speech and outputting multiple linear prediction results;
estimating, through a weight estimation model, weights for a weighted sum of the multiple linear prediction results;
estimating final dereverberated speech from the multiple linear prediction results and the estimated weights; and
extracting a reverberant environment embedding vector from the weight estimation model that estimates the weights,
the method further comprising training the entire model through a loss function between the estimated final dereverberated speech and the original speech before reverberation was added,
wherein, in the estimating of the weights, the weights for the weighted sum are estimated through the separate weight estimation model and take different values at each time-frequency index of a spectrogram, and
wherein the extracting of the reverberant environment embedding vector comprises converting a target speech signal into a log mel spectrogram, inputting it to the weight estimation model to obtain an output, averaging the frame values of the obtained weights for each channel-frequency pair to obtain a matrix, and rearranging the elements of the matrix to extract the reverberant environment embedding vector, whereby the reverberant environment embedding vector, of a fixed length independent of utterance length, is obtained through the time average of the weights, and the reverberant environment embedding vector is used as an auxiliary input to an acoustic model.

(Deleted)

The method of claim 1,
wherein the outputting of the multiple linear prediction results comprises converting the reverberant speech into a mel spectrogram, inputting it to the linear prediction model, applying zero padding, and passing it through a convolutional neural network composed of linear prediction coefficients to output the multiple linear prediction results.

The method of claim 1,
wherein the estimating of the weights comprises converting the reverberant speech into a log mel spectrogram, inputting it to the weight estimation model, applying zero padding, passing it through a convolutional neural network to obtain an output, and applying a softmax function to the values belonging to each time-frequency index to convert them into weights that lie between 0 and 1 and sum to 1.

The method of claim 1,
wherein the estimating of the final dereverberated speech comprises multiplying the multiple linear prediction results by the estimated weights element-wise to obtain, for each time-frequency index, the product of linear prediction result and weight, and then summing the C values at each time-frequency index, one per set of linear prediction coefficients, to obtain the estimate of the final dereverberated speech.

(Deleted)

The method of claim 1,
wherein the training of the entire model through the loss function comprises converting the obtained final dereverberated speech and the original speech from the speech database (DB) into log mel spectrograms, and repeating neural network training by backpropagation until the loss function between the two log mel spectrograms converges.

An apparatus for extracting a reverberant environment embedding using a dereverberation autoencoder, the apparatus comprising:
a reverberant speech generator that generates reverberant speech for training from a speech database (DB) and a room impulse response (RIR) database (DB);
a linear prediction model that performs multiple linear prediction on the generated reverberant speech and outputs multiple linear prediction results;
a weight estimation model that estimates weights for a weighted sum of the multiple linear prediction results;
a dereverberated speech estimator that estimates final dereverberated speech from the multiple linear prediction results and the estimated weights; and
an embedding extractor that extracts a reverberant environment embedding vector from the weight estimation model that estimates the weights,
the apparatus further comprising a dereverberation autoencoder training unit that trains the entire model through a loss function between the estimated final dereverberated speech and the original speech before reverberation was added,
wherein the weights for the weighted sum are estimated through the separate weight estimation model and take different values at each time-frequency index of a spectrogram, and
wherein the embedding extractor converts a target speech signal into a log mel spectrogram, inputs it to the weight estimation model to obtain an output, averages the frame values of the obtained weights for each channel-frequency pair to obtain a matrix, and rearranges the elements of the matrix to extract the reverberant environment embedding vector, whereby the reverberant environment embedding vector, of a fixed length independent of utterance length, is obtained through the time average of the weights, and the reverberant environment embedding vector is used as an auxiliary input to an acoustic model.

(Deleted)