KR102018286B1

KR102018286B1 - Method and Apparatus for Removing Speech Components in Sound Source

Info

Publication number: KR102018286B1
Application number: KR1020180132491A
Authority: KR
Inventors: 나태영; 임정연
Original assignee: 에스케이 텔레콤주식회사
Priority date: 2018-10-31
Filing date: 2018-10-31
Publication date: 2019-10-21

Abstract

An embodiment of the present invention uses deep learning to remove vocal components from a sound source. The invention relates to a method and apparatus for removing speech components from a sound source that improve learning accuracy by applying different learning models according to characteristic information of the sound source signal, and that produce a higher-quality speech-removed sound source by taking as input a sound source from which the vocal components have already been removed in a first pass.

Description

Method and Apparatus for Removing Speech Components in Sound Source

The present embodiment relates to a method and an apparatus for removing speech components from a sound source, and more specifically to a method and an apparatus that do so using deep learning.

The content described in this section merely provides background information on the present embodiment and does not constitute prior art.

In general, a stereo sound source (AR: All Recorded) output through the speakers of a karaoke device, a song accompaniment machine, an MP3 player, or another audio device includes a vocal component (Vocal recorded), which is the singer's voice, and a non-vocal component (MR: Music Recorded), which is accompaniment music played on one or more instruments.

Recently, MR signals have come into wide use among consumers as accompaniment music, and generating an MR signal from a stereo sound source requires effectively removing the vocal component from that source. Sound source separation is a long-standing research topic in audio signal processing, but because the sound of a particular instrument in a sound source spans a range of frequencies, removing it completely without any loss to the sound source is technically very difficult. In particular, the frequency content of vocals varies far more between individuals (by gender, race, and so on) than that of other instruments, and it overlaps with ordinary instruments across most of its range, so cleanly separating only the vocal component is very hard.

Accordingly, a new technique is needed that removes the vocal component in a stereo sound source more effectively in order to generate an MR signal from that source.

The purpose of the present embodiment is to use deep learning to remove vocal components from a sound source, improving learning accuracy by applying different learning models according to characteristic information of the sound source signal, and producing a higher-quality speech-removed sound source by taking as input a sound source from which the vocal component has already been removed in a first pass.

The present embodiment provides a speech component removing apparatus comprising: a preprocessor that receives a target sound source signal and preprocesses it to calculate a preprocessed sound source from which the speech component included in the target sound source signal has been removed in a first pass; a learner that stores learning data pre-trained for speech component removal for each sound source signal classified according to at least one piece of characteristic information, and that calculates interpolation parameter information corresponding to the preprocessed sound source based on the characteristic information of the target sound source signal and the learning data for each sound source signal; and a calculator that calculates, based on the preprocessed sound source and the interpolation parameter information, a speech-removed sound source from which the speech component in the target sound source signal has been removed in a second pass.

According to another aspect of the present embodiment, there is provided a speech component removing method comprising: receiving a target sound source signal and preprocessing it to calculate a preprocessed sound source from which the speech component included in the target sound source signal has been removed in a first pass; calculating interpolation parameter information corresponding to the preprocessed sound source based on the characteristic information of the target sound source signal and on a learning model that stores learning data pre-trained for speech component removal for each sound source signal classified according to at least one piece of characteristic information; and calculating, based on the preprocessed sound source and the interpolation parameter information, a speech-removed sound source from which the speech component in the target sound source signal has been removed in a second pass.

According to yet another aspect of the present embodiment, there is provided a computer program stored in a computer-readable recording medium that, in combination with hardware, executes: receiving a target sound source signal and preprocessing it to calculate a preprocessed sound source from which the speech component included in the target sound source signal has been removed in a first pass; calculating interpolation parameter information corresponding to the preprocessed sound source based on the characteristic information of the target sound source signal and on a learning model that stores learning data pre-trained for speech component removal for each sound source signal classified according to at least one piece of characteristic information; and calculating, based on the preprocessed sound source and the interpolation parameter information, a speech-removed sound source from which the speech component in the target sound source signal has been removed in a second pass.

According to the present embodiment, deep learning is used to remove vocal components from a sound source. Applying different learning models according to the characteristic information of the sound source signal improves learning accuracy, and taking as input a sound source from which the vocal component has already been removed in a first pass makes it possible to produce a higher-quality speech-removed sound source.

FIG. 1 is a block diagram schematically showing a speech component removing apparatus according to the present embodiment.
FIG. 2 is an exemplary diagram for explaining the operation of the preprocessor according to the present embodiment.
FIG. 3 is an exemplary diagram for explaining the operation of the learner according to the present embodiment.
FIGS. 4 to 6 are exemplary diagrams for explaining a learning method for speech component removal according to the present embodiment.
FIG. 7 is a flowchart for explaining a speech component removing method according to the present embodiment.

Hereinafter, the present embodiment is described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram schematically showing a speech component removing apparatus according to the present embodiment.

The speech component removing apparatus 100 according to the present embodiment is an apparatus that removes a speech component (e.g., a vocal component) from a sound source using a deep-learning-based sound source separation technique and, on that basis, generates a speech-removed sound source (e.g., an MR signal). More specifically, the apparatus uses deep learning to remove the vocal component, improves learning accuracy by applying different learning models according to the characteristic information of the sound source signal, and is implemented so that it can produce a higher-quality speech-removed sound source by taking as input a sound source from which the vocal component has already been removed in a first pass.

The speech component removing apparatus 100 may preferably be applied in business model (BM) areas, such as karaoke, in which a business entity responsible for producing and distributing sound sources can generate additional revenue. In this case, it has the effect of providing users with a high-quality speech-removed sound source.

The speech component removing apparatus 100 according to the present embodiment includes a preprocessor 110, a sampling unit 120, a learner 130, and a calculator 140. The components included in the speech component removing apparatus 100 are not necessarily limited to these.

The preprocessor 110 is a device that receives the target sound source signal from which the speech component is to be removed and performs preprocessing on it.

Because the sound of a particular instrument in a sound source signal spans a range of frequencies, removing it completely without any loss to the sound source signal is technically very difficult. In particular, the frequency content of vocals varies far more between individuals (by gender, race, and so on) than that of other instruments, and it overlaps with ordinary instruments across most of its range, so cleanly separating only the vocal component is very hard.

For this reason, the preprocessor 110 according to the present embodiment receives the target sound source signal and preprocesses it to calculate a preprocessed sound source from which the speech component included in the target sound source signal has been removed in a first pass. This preprocessed sound source is then provided as an input to the learner 130 and the calculator 140. Compared with feeding in the target sound source signal as is, this narrows the range that must be learned and allows a higher-quality speech-removed sound source to be produced.

Referring to FIG. 2, in the present embodiment the preprocessor 110 may preferably be implemented using a deep learning technique in order to maximize the speech component removal effect. That is, the preprocessor 110 may include a convolutional neural network and use it to perform a learning procedure on the target sound source signal.

The preprocessor 110 according to the present embodiment may store and provide learning data pre-trained for speech component removal for each sound source signal classified according to at least one piece of characteristic information. For example, the preprocessor 110 may build and provide a learning DB for each group of sound source signals that share the same or similar characteristic information, grouped by genre, gender, race, and the frequency-waveform similarity and frequency-pitch similarity obtained through spectrum analysis.

The preprocessor 110 may use the stored learning data for each sound source signal to perform the learning procedure on the target sound source signal, and thereby calculate a preprocessed sound source from which the speech component included in the target sound source signal has been removed in a first pass.

The specific way in which the preprocessor 110 performs the learning procedure using the learning data for each sound source signal is the same as the way in which the learner 130 does so, and the details are given below in the description of the operation of the learner 130.

In the present embodiment, before performing the learning procedure on the target sound source signal, the preprocessor 110 may downscale the target sound source signal and provide the downscaled signal as the input to its convolutional neural network. For example, the preprocessor 110 may receive an original 44.2 kHz sound source and downscale it to 22.1 kHz. This reduces the number of samples to be learned and, as a result, improves speech removal performance.
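The downscaling step can be illustrated with a minimal sketch. The description does not say how the rate is halved, so the pair-averaging used here is an assumption chosen for simplicity, and `downscale_by_two` is a hypothetical name rather than anything defined in the patent.

```python
import numpy as np

def downscale_by_two(signal: np.ndarray) -> np.ndarray:
    """Halve the sampling rate by averaging adjacent sample pairs.

    Assumption: pair averaging stands in for whatever low-pass and
    decimation scheme an actual implementation would use.
    """
    usable = len(signal) - (len(signal) % 2)  # drop a trailing odd sample
    pairs = signal[:usable].reshape(-1, 2)
    return pairs.mean(axis=1)

source_rate_hz = 44_200  # rate quoted in the description
one_second = np.sin(2 * np.pi * 440 * np.arange(source_rate_hz) / source_rate_hz)
downscaled = downscale_by_two(one_second)
downscaled_rate_hz = source_rate_hz // 2  # 22,100 samples per second
```

Halving the rate halves the number of samples the network has to process per second of audio, which is the reduction the paragraph above refers to.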

Although FIG. 2 illustrates the preprocessor 110 performing learning and inference in the time domain, the embodiment is not limited to this. For example, the preprocessor 110 may perform learning and inference after a transform into the frequency domain or a similar transform.

In another embodiment, the preprocessor 110 may use an algorithm based on frequency component removal to calculate the preprocessed sound source from which the speech component included in the target sound source signal has been removed in a first pass.

The preprocessor 110 outputs the calculated preprocessed sound source to the sampling unit 120 and the learner 130.

The sampling unit 120 receives the preprocessed sound source from the preprocessor 110, upscales it, and outputs it to the calculator 140. The preprocessed sound source produced by the speech component removal in the preprocessor 110 has limitations: frequency components related to the speech component may remain and be heard as reverberation, or music-related components may be removed along with the speech, and in either case the overall quality of the sound source can drop.

The preprocessed sound source therefore needs correction, and to this end the sampling unit 120 upscales the preprocessed sound source that was generated through the earlier downscaling process.

That is, the sampling unit 120 upscales the preprocessed sound source, which was downscaled to 22.1 kHz and had speech removed in the first pass, back to a 44.2 kHz sound source using a Bi-Linear technique. To this end, the sampling unit 120 may preferably include a digital filter as a component. Because the Bi-Linear technique draws on neighboring sample values, it also provides a noise reduction effect.
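As a rough illustration of the upscaling step, the sketch below doubles the sample count by linear interpolation between neighboring samples. Treating the "Bi-Linear technique" as plain linear interpolation on a one-dimensional signal is an assumption, and `upscale_linear` is a hypothetical helper name.

```python
import numpy as np

def upscale_linear(signal: np.ndarray, factor: int = 2) -> np.ndarray:
    """Raise the sampling rate by linearly interpolating between neighbors.

    Assumption: 1-D linear interpolation stands in for the Bi-Linear
    technique named in the description.
    """
    old_positions = np.arange(len(signal))
    new_positions = np.arange(len(signal) * factor) / factor
    # np.interp holds the last sample for positions past the final input index.
    return np.interp(new_positions, old_positions, signal)

low_rate = np.array([0.0, 2.0, 4.0])
high_rate = upscale_linear(low_rate)
```

Each inserted sample is the average of its two neighbors, which is why this kind of interpolation also smooths out sample-level noise.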

The learner 130 is a device that performs a learning procedure (e.g., inference) based on the preprocessed sound source provided by the preprocessor 110 and calculates interpolation parameter information corresponding to the preprocessed sound source according to the learning result.

The configuration and operation of the learner 130 according to the present embodiment are described in more detail below with reference to FIGS. 3 to 6. The same or a similar configuration and operation may preferably be applied to the preprocessor 110 when it is implemented using a deep learning technique.

First, referring to FIG. 3, the learner 130 may include a convolutional neural network (e.g., a learning model) and use it to perform a learning procedure on the preprocessed sound source. For example, the convolutional neural network may consist of one or more convolutional layers, with each convolutional layer calculating the interpolation parameter information through at least one filter.
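To make the idea of a convolutional layer with one or more filters concrete, here is a minimal NumPy sketch of a single one-dimensional layer. The filter values, the ReLU activation, and the name `conv1d_layer` are illustrative assumptions; the description does not specify the network's layers, kernel sizes, or activations.

```python
import numpy as np

def conv1d_layer(signal: np.ndarray, filters: np.ndarray, biases: np.ndarray) -> np.ndarray:
    """Apply each filter to the signal and return one feature map per filter."""
    feature_maps = []
    for kernel, bias in zip(filters, biases):
        # np.convolve flips its second argument, so flip the kernel first
        # to obtain the cross-correlation used by CNN layers.
        response = np.convolve(signal, kernel[::-1], mode="same") + bias
        feature_maps.append(np.maximum(response, 0.0))  # ReLU (assumed)
    return np.stack(feature_maps)

signal = np.array([1.0, 2.0, 3.0, 4.0])
filters = np.array([[0.0, 1.0, 0.0],   # identity filter
                    [0.5, 0.0, 0.5]])  # average of the two neighbors
features = conv1d_layer(signal, filters, biases=np.zeros(2))
```

In a trained network the filter weights would be learned from data rather than written by hand, and several such layers would be stacked.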

As shown in FIG. 4, the learner 130 according to the present embodiment may store and provide learning data pre-trained for speech component removal for each sound source signal classified according to at least one piece of characteristic information. For example, the learner 130 may build and provide a learning DB for each group of sound source signals that share the same or similar characteristic information, grouped by genre, gender, race, and the frequency-waveform similarity and frequency-pitch similarity obtained through spectrum analysis.

More specifically, for each group of sound source signals having the same or similar characteristic information, the learner 130 learns and stores the interpolation parameter information derived in the course of removing the speech component.

The learning operation of the learner 130 according to the present embodiment is described below.

When the preprocessed sound source was generated through a downscaling process, the learner 130 according to the present embodiment upscales it again as part of the learning procedure. That is, the learner 130 upscales the preprocessed sound source, which was downscaled to 22.1 kHz and had speech removed in the first pass, back to a 44.2 kHz sound source using the Bi-Linear technique. To this end, the learner 130 may be designed so that a specific convolutional layer inside its convolutional neural network performs the upscaling function.

In performing the learning procedure, the learner 130 applies different learning models according to the characteristic information of the sound source signal so that learning accuracy is improved.

That is, from among the sound source signals pre-classified in the learner 130, the learner 130 selects the sound source signal that corresponds to the characteristic information of the target sound source signal, and uses the learning data stored for the selected sound source signal to calculate the interpolation parameter information corresponding to the preprocessed sound source.
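The selection step can be sketched as a lookup keyed by characteristic information. Everything concrete here is hypothetical: the two-field `Characteristics` key, the registry entries, the model names, and the simple match-count scoring are assumptions for illustration, since the description only says that the class corresponding to the target's characteristics is selected.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Characteristics:
    genre: str
    gender: str

# Hypothetical registry: one set of pre-trained learning data per class.
MODEL_REGISTRY = {
    Characteristics("ballad", "female"): "model_ballad_female",
    Characteristics("rock", "male"): "model_rock_male",
}
DEFAULT_MODEL = "model_generic"

def match_score(candidate: Characteristics, target: Characteristics) -> int:
    """Count how many characteristic fields agree."""
    return int(candidate.genre == target.genre) + int(candidate.gender == target.gender)

def select_model(target: Characteristics) -> str:
    """Pick the class whose characteristics best match the target signal."""
    best_key = max(MODEL_REGISTRY, key=lambda key: match_score(key, target))
    if match_score(best_key, target) == 0:
        return DEFAULT_MODEL  # nothing similar enough; fall back
    return MODEL_REGISTRY[best_key]

chosen = select_model(Characteristics("rock", "male"))
```

A fuller version would extend the key with the other characteristics the description lists, such as frequency-waveform and pitch similarity, and replace the match count with a proper similarity measure.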

Referring to FIG. 5, the learner 130 may collect characteristic information about the target sound source signal and, based on it, apply an appropriate learning model. For example, when the preprocessor 110 is implemented using a deep learning technique, the learner 130 may receive the characteristic information about the target sound source signal through the preprocessor 110. The preprocessor 110 receives metadata about the target sound source signal together with the signal itself, and may calculate and provide, as the characteristic information, at least one of the genre, race, gender, and frequency information analyzed from it.

In another embodiment, the learner 130 may be implemented to collect the characteristic information about the target sound source signal directly.

In addition, as shown in FIG. 6, the learner 130 according to the present embodiment divides the received preprocessed sound source into segments of a specific time period, so that a variety of samples for learning can be secured. The learner 130 may also be implemented to calculate the interpolation parameter information by applying different learning data to each segment, based on the characteristic information corresponding to that segment.
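The segmentation described above might look like the following sketch. The segment duration, the non-overlapping layout, and the name `split_into_segments` are assumptions; the description only states that the preprocessed sound source is divided at a specific time period.

```python
import numpy as np

def split_into_segments(signal: np.ndarray, sample_rate_hz: int, segment_seconds: float) -> list:
    """Cut a signal into consecutive, non-overlapping segments.

    The last segment may be shorter than the others.
    """
    segment_length = int(sample_rate_hz * segment_seconds)
    return [signal[start:start + segment_length]
            for start in range(0, len(signal), segment_length)]

signal = np.arange(10, dtype=float)
segments = split_into_segments(signal, sample_rate_hz=4, segment_seconds=1.0)
```

Each segment can then be paired with its own characteristic information, so that a different set of learning data is applied per segment.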

The learner 130 according to the present embodiment may calculate, as the interpolation parameter information, the difference signal between the original signal for the speech signal of the target sound source signal and the preprocessed sound source. To this end, the convolutional neural network may be designed so that, when the preprocessed sound source is input, the residual signal, that is, the difference between the original signal for the speech signal of the target sound source signal and the preprocessed sound source, is produced as the learning result.

The calculator 140 calculates the final speech-removed sound source, from which the speech component in the target sound source signal has been removed, based on the preprocessed sound source delivered from the preprocessor 110 via the sampling unit 120 and the interpolation parameter information delivered from the learner 130.

The calculator 140 according to the present embodiment applies the interpolation parameter information to the preprocessed sound source so that the final speech-removed sound source can be generated. For example, the calculator 140 may calculate the sum of the interpolation parameter information and the preprocessed sound source as the final speech-removed sound source.
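The relationship between the residual signal and the final output can be written down in a few lines. This sketch assumes the residual is defined against a reference signal available at training time; the names `residual_target` and `reconstruct` and the sample values are hypothetical.

```python
import numpy as np

def residual_target(reference: np.ndarray, preprocessed: np.ndarray) -> np.ndarray:
    """Training target: what the first pass got wrong, sample by sample."""
    return reference - preprocessed

def reconstruct(preprocessed: np.ndarray, residual: np.ndarray) -> np.ndarray:
    """Final output: first-pass result plus the predicted correction."""
    return preprocessed + residual

reference = np.array([0.2, -0.1, 0.4])     # assumed reference signal
preprocessed = np.array([0.1, -0.3, 0.5])  # first-pass output
residual = residual_target(reference, preprocessed)
final = reconstruct(preprocessed, residual)
```

With a perfectly predicted residual the sum recovers the reference exactly; in practice the network's prediction is approximate, so the sum is a corrected version of the first-pass output.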

FIG. 7 is a flowchart for explaining a speech component removing method according to the present embodiment.

The speech component removing apparatus 100 receives the target sound source signal and preprocesses it to calculate a preprocessed sound source from which the speech component included in the target sound source signal has been removed (S702). In step S702, the speech component removing apparatus 100 may use a deep learning technique to perform a learning procedure on the target sound source signal and thereby calculate the preprocessed sound source from which the speech component has been removed in a first pass.

Before performing the learning procedure on the target sound source signal, the speech component removing apparatus 100 may downscale the target sound source signal and perform the learning procedure on the downscaled signal.

From among the sound source signals pre-classified in its learning model, the speech component removing apparatus 100 selects the sound source signal that corresponds to the characteristic information of the target sound source signal (S704). In the present embodiment, the speech component removing apparatus 100 stores and provides, as the learning model, learning data pre-trained for speech component removal for each sound source signal classified according to at least one piece of characteristic information.

When the preprocessed sound source was generated through a downscaling process, the speech component removing apparatus 100 upscales it again as part of the learning procedure.

The speech component removing apparatus 100 uses the learning data stored for the sound source signal selected in step S704 to calculate the interpolation parameter information corresponding to the preprocessed sound source (S706). In step S706, the speech component removing apparatus 100 may calculate, as the interpolation parameter information, the difference signal between the original signal for the speech signal of the target sound source signal and the preprocessed sound source.

When the preprocessed sound source was generated through a downscaling process, the speech component removing apparatus 100 upscales it again as part of the learning procedure.

The speech component removing apparatus 100 finally calculates the speech-removed sound source based on the preprocessed sound source of step S702 and the interpolation parameter information of step S706 (S708). In step S708, the speech component removing apparatus 100 may calculate the sum of the interpolation parameter information and the preprocessed sound source as the final speech-removed sound source.

When the preprocessed sound source was generated through a downscaling process in step S702, the speech component removing apparatus 100 upscales it again.

Here, steps S702 to S708 correspond to the operations of the respective components of the speech component removal apparatus 100 described above, and thus further detailed description is omitted.

Although FIG. 7 describes the processes as being executed sequentially, the present embodiment is not necessarily limited thereto. In other words, the processes described in FIG. 7 may be executed in a different order, or one or more of the processes may be executed in parallel, so FIG. 7 is not limited to a time-series order.

As described above, the speech component removal method of FIG. 7 may be implemented as a program and recorded on a recording medium (CD-ROM, RAM, ROM, memory card, hard disk, magneto-optical disk, storage device, etc.) that can be read using computer software.

The above description merely illustrates the technical idea of the present embodiment, and those of ordinary skill in the art to which the present embodiment belongs may make various modifications and changes without departing from its essential characteristics. Accordingly, the present embodiments are intended to explain, not to limit, the technical idea of the present embodiment, and the scope of that technical idea is not limited by these embodiments. The scope of protection of the present embodiment should be interpreted according to the following claims, and all technical ideas within a scope equivalent thereto should be construed as falling within the scope of rights of the present embodiment.

100: speech component removal apparatus 110: preprocessing unit
120: sampling unit 130: learning unit
140: calculation unit

Claims

A speech component removal apparatus comprising:
a preprocessing unit configured to receive a target sound source signal and to preprocess the target sound source signal to calculate a preprocessed sound source from which a speech component included in the target sound source signal has been primarily removed;
a learning unit configured to store learning data pre-trained in relation to removal of the speech component for each sound source signal classified according to at least one piece of characteristic information, and to calculate interpolation parameter information corresponding to the preprocessed sound source based on characteristic information of the target sound source signal and the learning data for each sound source signal; and
a calculation unit configured to calculate, based on the preprocessed sound source and the interpolation parameter information, a speech-removed sound source from which the speech component in the target sound source signal has been secondarily removed.

The apparatus of claim 1,
wherein the preprocessing unit
stores the learning data for each sound source signal and calculates the preprocessed sound source by using the learning data.

The apparatus of claim 1,
wherein the preprocessing unit downscales the target sound source signal and calculates the preprocessed sound source based on the downscaled target sound source signal, and
the learning unit includes sampling means for receiving the preprocessed sound source and upscaling the received preprocessed sound source.

The apparatus of claim 3,
further comprising a sampling unit configured to receive the preprocessed sound source from the preprocessing unit, upscale the received preprocessed sound source, and provide the upscaled preprocessed sound source to the calculation unit.

The apparatus of claim 1,
wherein the learning unit
learns and stores, for each sound source signal having identical or similar characteristic information, interpolation parameter information derived in the process of removing the speech component.

The apparatus of claim 1,
wherein the learning unit
selects, from among the sound source signals pre-classified in the learning unit, a sound source signal corresponding to the characteristic information of the target sound source signal, and calculates the interpolation parameter information by using the learning data stored for the selected sound source signal.

The apparatus of claim 6,
wherein the characteristic information
is at least one of genre, race, gender, and frequency information analyzed based on metadata collected for the target sound source signal.

The apparatus of claim 1,
wherein the learning unit calculates, as the interpolation parameter information, a difference signal between the original signal for the speech signal of the target sound source signal and the preprocessed sound source, and
the calculation unit calculates the speech-removed sound source by applying the difference signal to the preprocessed sound source.

The apparatus of claim 1,
wherein the learning unit
divides the preprocessed sound source at a specific time period, and calculates the interpolation parameter information by applying different learning data to each divided sound source based on characteristic information corresponding to each of the divided sound sources.

A speech component removal method comprising:
receiving a target sound source signal and preprocessing the target sound source signal to calculate a preprocessed sound source from which a speech component included in the target sound source signal has been primarily removed;
calculating interpolation parameter information corresponding to the preprocessed sound source based on characteristic information of the target sound source signal and a learning model that stores learning data pre-trained in relation to removal of the speech component for each sound source signal classified according to at least one piece of characteristic information; and
calculating, based on the preprocessed sound source and the interpolation parameter information, a speech-removed sound source from which the speech component in the target sound source signal has been secondarily removed.

A computer program stored in a computer-readable recording medium and combined with hardware to execute:
receiving a target sound source signal and preprocessing the target sound source signal to calculate a preprocessed sound source from which a speech component included in the target sound source signal has been primarily removed;
calculating interpolation parameter information corresponding to the preprocessed sound source based on characteristic information of the target sound source signal and a learning model that stores learning data pre-trained in relation to removal of the speech component for each sound source signal classified according to at least one piece of characteristic information; and
calculating, based on the preprocessed sound source and the interpolation parameter information, a speech-removed sound source from which the speech component in the target sound source signal has been secondarily removed.